The announcement of OpenOrca, an open-source dataset and series of instruct-tuned language models, marks a significant milestone for open-source artificial intelligence (AI). This ambitious project aims to broaden how we access and build on state-of-the-art AI research.
OpenOrca grew out of the paper “Orca: Progressive Learning from Complex Explanation Traces of GPT-4” by Mukherjee et al. at Microsoft. The impressive results of Microsoft’s LLaMA-13b based Orca model, together with concerns that Microsoft’s training dataset might never be publicly released, prompted the creation of OpenOrca.
OpenOrca is not just another AI initiative; it is an ambitious undertaking that seeks to reproduce, and where possible improve upon, Microsoft’s results. This required substantial resources and specialized expertise, but with the support of a dedicated team of open-source AI/ML engineers, the OpenOrca dataset has now been completed.
The dataset is a milestone achievement in itself. It includes approximately one million FLANv2 entries augmented with GPT-4 completions and approximately 3.5 million FLANv2 entries augmented with GPT-3.5 completions, figures that represent countless hours of labor and underline the scale of the OpenOrca project.
In building the dataset, the team adhered to the submix and system-prompt distribution described in the Orca paper, with some key modifications. Notably, all 75,000 Chain-of-Thought (CoT) entries were incorporated into the FLAN-1m dataset, and any duplicate entries identified were removed, leaving a refined collection of 3.5 million instructions in the ChatGPT (GPT-3.5) portion of the dataset.
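The deduplication step can be sketched as follows. This is a minimal illustration, not the team's actual pipeline: the record field names ("question", "response") and the choice of dedup key are assumptions, since the announcement only says that duplicate entries were removed.

```python
import hashlib

# Minimal deduplication sketch. The field names and the hashing key
# are assumptions; the announcement does not specify how duplicates
# were identified.
def dedup(records):
    seen = set()
    unique = []
    for rec in records:
        key = hashlib.sha256(
            (rec["question"] + "\x1f" + rec["response"]).encode("utf-8")
        ).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique

sample = [
    {"question": "What is 2+2?", "response": "4"},
    {"question": "What is 2+2?", "response": "4"},  # exact duplicate
    {"question": "Name a prime.", "response": "7"},
]
print(len(dedup(sample)))  # -> 2
```

Hashing the concatenated fields keeps memory bounded even at the multi-million-record scale described above, since only digests are retained in the seen-set.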
Work on OpenOrca is progressing steadily, with fine-tuning of the model on the foundation of LLaMA-13b currently underway. This will facilitate direct performance comparison with Microsoft’s model once it is released. OpenOrca-LLaMA-13b is scheduled to be unveiled in mid-July 2023, accompanied by detailed evaluation findings and the dataset.
The scale of the OpenOrca project is vast, and the team is actively seeking sponsorship of GPU compute time to train the OpenOrca model on several base models, including Falcon 7b and 40b; LLaMA 7b, 13b, 33b, and 65b; and MPT-7b and 30b. The team is also open to sponsorship for other targets, such as RWKV and OpenLLaMA.
Compute estimates for each model size were provided, ranging from roughly 1,000 GPU-hours for a 7b model to 10,000-15,000 GPU-hours for a 65b model. Sponsors will receive recognition on the project’s platform and model cards.
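For a back-of-the-envelope sense of these figures, the stated endpoints can be turned into a dollar range at an assumed hourly GPU rate. Only the 7b and 65b GPU-hour figures come from the announcement; the $2/GPU-hour price below is purely an illustrative assumption, not a quoted cost.

```python
# Back-of-the-envelope training cost from the GPU-hour estimates above.
# The 7b and 65b GPU-hour figures are from the announcement; the hourly
# rate is an assumed, illustrative cloud price.
gpu_hours = {
    "7b": (1_000, 1_000),     # ~1k GPU-hours
    "65b": (10_000, 15_000),  # 10k-15k GPU-hours
}
rate_usd_per_gpu_hour = 2.0   # assumption: illustrative rate only

for model, (low, high) in gpu_hours.items():
    lo, hi = low * rate_usd_per_gpu_hour, high * rate_usd_per_gpu_hour
    print(f"{model}: ${lo:,.0f} to ${hi:,.0f}")
```

At that assumed rate, the 65b run alone would land in the tens of thousands of dollars, which helps explain why per-model sponsorship is being sought.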
Several sponsors have already pledged their support, including Chirper.ai, which has made a financial contribution and offered mentorship. Other sponsors include preemo.io for LLaMA 7b and latitude.sh for LLaMA 33b.
OpenOrca’s success thus far would not have been possible without the dedicated efforts of an impressive team of open-source AI/ML engineers. The team includes members from OpenAccess AI Collective and Alignment Lab AI, such as Wing “Caseus” Lian, NanoBit, AutoMeta, Entropi, AtlasUnified, and neverendingtoast. Also essential were Rohan, Teknium, Pankaj Mathur, and Tom “TheBloke” Jobbins, who provided valuable support, including quantizing the models and amplifying the project.