The RedPajama project, a collaborative initiative focused on developing state-of-the-art open-source models, announced the highly anticipated release of the RedPajama-INCITE models. The release features models trained on the RedPajama base dataset, which was introduced a few weeks ago and follows the recipe described in the LLaMA paper. Since its launch, the base dataset has already made a significant impact on the open-source community, with the 5 terabyte dataset being downloaded hundreds of times and used to train models such as MPT, OpenLLaMA, and OpenAlpaca.
Among the models released today are a 3 billion (3B) and a 7 billion (7B) parameter base model, designed to replicate the LLaMA recipe as closely as possible. Additionally, fully open-source instruction-tuned and chat models are included, providing a wide range of options for developers and researchers. The RedPajama project has observed several key takeaways from the performance of these models:
- The 3B model ranks as the strongest in its class, with its small size making it exceptionally fast and accessible, even running on an RTX 2070 released over five years ago.
- Instruction-tuned models achieve robust performance on HELM benchmarks, with the 7B model outperforming the base LLaMA model by 3 points. These models are recommended for downstream few-shot applications such as entity extraction, classification, and summarization.
- The 80%-trained 7B model already surpasses the Pythia 7B model in performance, highlighting the significance of a larger dataset and the value of the RedPajama base dataset.
- The project team sees a clear path for creating an improved version of the RedPajama dataset, which they plan to release in the coming weeks. Models trained on this new dataset, including larger-scale models built upon it, are expected to go beyond the quality of LLaMA 7B.
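The accessibility claim in the first bullet can be sanity-checked with a rough back-of-envelope memory estimate. The sketch below considers fp16 weights only and ignores activation and KV-cache overhead, quantization, and CPU offloading, so it is an approximation rather than a precise requirement:

```python
# Back-of-envelope VRAM estimate: weights only, half precision (fp16).
# A 3B-parameter model at 2 bytes per parameter needs ~6 GB, which fits
# in the 8 GB of an RTX 2070; a 7B model in fp16 (~14 GB) does not.

def weight_memory_gb(n_params: float, bytes_per_param: int = 2) -> float:
    """Memory needed for model weights alone, in gigabytes."""
    return n_params * bytes_per_param / 1e9

rtx_2070_vram_gb = 8
assert weight_memory_gb(3e9) <= rtx_2070_vram_gb   # 3B model fits
assert weight_memory_gb(7e9) > rtx_2070_vram_gb    # 7B in fp16 does not
```

In practice, 8-bit or 4-bit quantization shrinks these footprints further, which is part of why small models are attractive on consumer GPUs.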
The RedPajama project demonstrates that high-performing large language models (LLMs) can be built swiftly by the open-source community. This impressive feat builds on several components, including the 1.2 trillion token RedPajama dataset, Pythia training code from EleutherAI, FlashAttention from Stanford, HELM benchmarks from Stanford CRFM, and generous support from MILA, EleutherAI, and LAION for compute time on the Summit supercomputer. The project team firmly believes that these kinds of open collaborations, at larger scales, will lay the foundation for the best AI systems of the future.
Today’s release includes six models under the permissive Apache 2.0 license, which allows for use in research and commercial applications. These models range from a base 3B model to an early preview of the 7B model, with different variations for chat and instruction-tuned applications. Each model has been meticulously designed and fine-tuned to provide exceptional performance in various tasks.
The open-source community’s support, suggestions, and feedback for RedPajama have been truly incredible. Based on these learnings, the project team is already working on the next version of the RedPajama base dataset, which will be nearly twice the size of the original v1 dataset. This new dataset is expected to further enhance the performance and capabilities of future models.
During the RedPajama model training, the project team shared regular updates, and both the 3B and 7B models have now been trained on 800 billion tokens. The 3B model has completed training at 800 billion tokens, while the 7B model continues to improve as it trains toward 1 trillion tokens. The team is excited to see the progress and improvements made by both models during this training phase.
The 3B RedPajama Models
RedPajama-INCITE-Base-3B-v1 is trained over the RedPajama v1 dataset, adopting the same architecture as the popular Pythia model suite. The project team opted for the Pythia architecture to evaluate the benefits of training with the much larger RedPajama dataset compared to the current leading open-source dataset, the Pile. Training on Summit utilized the DeeperSpeed codebase developed by EleutherAI.
Remarkably, at 800 billion tokens, RedPajama-INCITE-Base-3B has demonstrated better few-shot performance (measured in HELM, as the average score over 16 core scenarios) and superior zero-shot performance (measured in Eleuther’s LM evaluation harness) compared to open models of similar size, including the highly-regarded GPT-Neo and Pythia-2.8B. On HELM, RedPajama-INCITE-Base-3B outperforms these models by 3-5 points, and on a subset of tasks from the lm-evaluation-harness, it surpasses these open models by 2-7 points.
Additionally, the project team is excited to release an instruction-tuned version of the 3B model, RedPajama-INCITE-Instruct-3B-v1, trained following Together’s GPT-JT recipe and removing any data in HELM benchmarks to ensure no contamination with respect to HELM. This model displays outstanding performance on few-shot tasks, even nearing the quality of LLaMA 7B in a much smaller model.
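The post does not spell out how the HELM overlap was removed. A common approach for this kind of benchmark decontamination is n-gram overlap filtering; the sketch below is an illustrative assumption, not the project's actual recipe, and the 13-token window and function names are hypothetical:

```python
# Hedged sketch of n-gram decontamination: drop any training document
# that shares a long word n-gram with benchmark text. The n=13 window
# is a common choice in the literature, not a documented RedPajama value.

def ngrams(text: str, n: int = 13) -> set:
    """Whitespace-token n-grams of a document, as hashable tuples."""
    toks = text.split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def decontaminate(train_docs: list, benchmark_docs: list, n: int = 13) -> list:
    """Keep only training docs sharing no n-gram with benchmark text."""
    banned = set()
    for doc in benchmark_docs:
        banned |= ngrams(doc, n)
    return [d for d in train_docs if not (ngrams(d, n) & banned)]
```

Exact-match filtering like this errs on the side of recall; real pipelines often add normalization (lowercasing, punctuation stripping) before hashing.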
RedPajama-INCITE-Chat-3B-v1 is an open-source chat model built with RedPajama-INCITE-Base-3B-v1 and fine-tuned over the OASST1 dataset by Open Assistant and the Dolly v2.0 dataset by Databricks. The two datasets are mixed in equal proportions, and the model is fine-tuned on the mixture for three epochs.
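The equal mixing and three-epoch schedule can be sketched as follows. This is a hypothetical illustration of the data-mixing step only; the project's actual fine-tuning pipeline is not published in this post, and the function names and sampling details are assumptions:

```python
# Hedged sketch: mix two instruction datasets in equal proportions,
# then repeat the mixture for a fixed number of fine-tuning epochs.
import random

def equal_mix(oasst1: list, dolly: list, seed: int = 0) -> list:
    """Sample the same number of examples from each source, then shuffle."""
    k = min(len(oasst1), len(dolly))
    rng = random.Random(seed)
    mixed = rng.sample(oasst1, k) + rng.sample(dolly, k)
    rng.shuffle(mixed)
    return mixed

def epochs(dataset: list, n_epochs: int = 3):
    """Yield each fine-tuning example n_epochs times."""
    for _ in range(n_epochs):
        yield from dataset
```

Downsampling the larger source to the size of the smaller one is one simple way to realize a 50/50 mix; upsampling the smaller source is an equally valid alternative.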
Preview of RedPajama 7B
The 7B model is still in training (at 800 billion tokens), and the training loss continues to decrease consistently, so the team will continue training it to 1 trillion tokens. Even so, the current checkpoint is already useful and interesting for the community to better understand the training process. Therefore, the project team is releasing three intermediate checkpoints as a “preview” of the final models:
- RedPajama-INCITE-Base-7B-v0.1 is a base model trained over 800 billion tokens.
- RedPajama-INCITE-Chat-7B-v0.1 is its chat counterpart, trained over Dolly 2.0 and Open Assistant.
- RedPajama-INCITE-Instruct-7B-v0.1 is instruction-tuned for few-shot applications, following the recipe for GPT-JT but eliminating all datasets that overlap with the HELM benchmark.
Each of these checkpoints is released under the Apache 2.0 license. Even at 800 billion tokens, promising results are already evident. On HELM, the base model outperforms open models such as GPT-J and Pythia-6.9B by 0.5-2.2 points, and on EleutherAI’s lm-evaluation-harness, it surpasses these models by 1-3 points on average.
However, compared to LLaMA 7B, there is still a quality gap of 4.3 points on HELM at the moment. For few-shot applications (like those in HELM), the instruction-tuned RedPajama-INCITE-Instruct-7B-v0.1 model has already demonstrated significant improvements, closing the gap to LLaMA 7B to only 2.2 points on HELM. The team expects that further training will narrow this gap and may even allow the model to surpass LLaMA 7B in performance.
The Future of RedPajama
With the current progress and the remarkable performance of the RedPajama models, the project team is excited to continue its mission to create highly effective, open-source AI models. The team is already planning to work on the next iteration of the RedPajama dataset, which will be almost twice the size of the original v1 dataset.
This expanded dataset will provide an even stronger foundation for future models, further enhancing their capabilities and performance. Additionally, the project team will explore new methods to improve efficiency, training times, and overall model effectiveness while maintaining the open-source nature that has made RedPajama such a successful project.
As the RedPajama project moves forward, the team remains committed to working collaboratively with the open-source community. They believe that sharing resources, knowledge, and expertise is essential to creating the best AI systems of the future. By working together, the RedPajama project and the open-source community can overcome the challenges of advanced AI research, enabling everyone to access and benefit from powerful AI models.
In conclusion, the RedPajama project’s release of the RedPajama-INCITE models marks a significant milestone for the open-source AI community. These cutting-edge models have demonstrated exceptional performance across various tasks, offering a powerful alternative to proprietary AI systems. By making these models freely available under the Apache 2.0 license, the RedPajama project reinforces its commitment to advancing open-source AI research and fostering collaboration within the community. As the project continues to evolve and expand, there is no doubt that the RedPajama models will play a crucial role in shaping the future of AI.