RedPajama

Creating fully open-source language models
RedPajama is a project to create a reproducible, fully open, leading language model, starting by reproducing the LLaMA training dataset of more than 1.2 trillion tokens. The project aims to remove the limitations that closed models and commercial APIs place on research and to provide a fully transparent training pipeline.

RedPajama has three key components: pre-training data, base models, and instruction tuning data and models. The first component, the pre-training data, has already been released: a fully open, 1.2 trillion token dataset created by following the recipe described in the LLaMA paper. The project is a collaboration between Together, Ontocord.ai, ETH DS3Lab, Stanford CRFM, Hazy Research, and MILA Québec AI Institute.
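
For readers who want to explore the released pre-training data, the sketch below shows one way to stream a sample of it with the Hugging Face `datasets` library rather than downloading the full corpus. It is a minimal illustration, assuming the dataset is hosted on the Hugging Face Hub under the `togethercomputer/RedPajama-Data-1T` identifier with per-source configurations such as `arxiv`; neither the repository name nor the configuration names appear in this post.

```python
# Minimal sketch: stream a few documents from the RedPajama pre-training data
# without materializing the ~1.2T-token corpus on disk.
# Assumption: the dataset lives on the Hugging Face Hub as
# "togethercomputer/RedPajama-Data-1T" with source slices like "arxiv"
# (names not stated in this post).
from datasets import load_dataset

dataset = load_dataset(
    "togethercomputer/RedPajama-Data-1T",
    "arxiv",          # assumed slice name; the corpus spans several sources
    split="train",
    streaming=True,   # iterate lazily instead of downloading everything
)

# Peek at the first few documents.
for i, example in enumerate(dataset):
    print(example["text"][:200])  # assumes each record carries a "text" field
    if i >= 2:
        break
```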