This report introduces TinyLlama, an open-source small language model. All relevant information has been released, including the pre-training code, all intermediate model checkpoints, and the details of the data processing steps. With its compact architecture and promising performance, TinyLlama can enable end-user applications on mobile devices and serve as a lightweight platform for testing a wide range of innovative ideas related to language models. Drawing on the experience accumulated during the open, live phase of this project, the aim is to develop improved versions of TinyLlama, equipping it with a range of capabilities that improve its performance and versatility across tasks. Further findings and detailed results will be documented in upcoming reports.
Large language models (LLMs) pre-trained on large-scale text corpora have demonstrated their effectiveness on a wide range of tasks. Some empirical studies have shown emergent abilities in LLMs, such as few-shot prompting and chain-of-thought reasoning, that may only appear in models with a sufficiently large number of parameters. Other studies have focused on modeling the scaling behavior of LLMs. Hoffmann et al. (2022) showed that, to train a compute-optimal model, the model size and the amount of training data should be increased at the same rate. This provides guidance for choosing a model size and allocating training data under a fixed compute budget.
Although these works favor large models, the potential of training smaller models on larger datasets remains under-explored. Rather than focusing solely on training compute-optimal language models, Touvron et al. (2023a) emphasize the importance of the inference budget. Inference-optimal language models aim for the best performance within specific inference constraints, which is achieved by training a model on more tokens than the scaling law recommends. Touvron et al. (2023a) demonstrated that smaller models trained on more data can match or even outperform their larger counterparts. In addition, Thaddée (2023) argues that existing scaling laws may not predict accurately when smaller models are trained for longer periods.
Inspired by these findings, this work explores how smaller models behave when trained on far more tokens than the scaling law recommends. Specifically, a Transformer decoder-only model with 1.1B parameters is trained on about 3 trillion tokens. This is the first attempt to train a model with 1B parameters using such a large amount of data. Because it follows the same architecture and tokenizer as Llama 2, the model is named TinyLlama.
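To put the scale of this over-training in perspective, the short sketch below compares TinyLlama's tokens-per-parameter ratio with a compute-optimal ratio. The figure of roughly 20 tokens per parameter is a commonly quoted rule of thumb associated with Hoffmann et al. (2022); it is not stated in this report and is used here only as an assumption.

```python
def tokens_per_parameter(num_tokens: float, num_params: float) -> float:
    """Return how many training tokens the model sees per parameter."""
    return num_tokens / num_params


# Figures taken from the text: a 1.1B-parameter model, about 3 trillion tokens.
TINYLLAMA_PARAMS = 1.1e9
TINYLLAMA_TOKENS = 3.0e12

# Assumption: a rule-of-thumb compute-optimal ratio, not a value from this report.
COMPUTE_OPTIMAL_RATIO = 20.0

ratio = tokens_per_parameter(TINYLLAMA_TOKENS, TINYLLAMA_PARAMS)
over_training_factor = ratio / COMPUTE_OPTIMAL_RATIO

print(f"tokens per parameter: {ratio:.0f}")
print(f"multiple of the assumed compute-optimal ratio: {over_training_factor:.0f}x")
```

Under these assumptions, TinyLlama sees on the order of a hundred times more tokens per parameter than the rule of thumb would suggest.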
This section describes how TinyLlama is pre-trained. First, the pre-training corpus and the data sampling method are introduced. Next, the model architecture and the hyperparameters used during pre-training are described.
Pre-training data
The main goal is to make the pre-training process efficient and reproducible. A mixture of natural language data and code data is used to pre-train TinyLlama, with the natural language data taken from SlimPajama and the code data from StarCoderData. The Llama tokenizer is used to process the data.
SlimPajama: A large open-source corpus for training language models, based on RedPajama. The original RedPajama corpus is an open-source research effort to reproduce the Llama pre-training data and contains more than 1.2 trillion tokens. SlimPajama is derived from RedPajama by cleaning and deduplicating it.
StarCoderData: This dataset was collected to train StarCoder, a powerful open-source large code language model. It contains about 250 billion tokens across 86 programming languages. In addition to code, it also includes GitHub issues and text-code pairs that involve natural language.
To avoid data duplication, the GitHub subset of SlimPajama is removed and code data is sampled only from StarCoderData. After combining the two corpora, there are approximately 950 billion tokens for pre-training in total. TinyLlama is trained on these tokens for about 3 epochs. As observed by Muennighoff et al. (2023), training on data repeated for up to 4 epochs causes little performance degradation compared with using unique data. During training, the natural language data is sampled so that the ratio between natural language data and code data is 7:3.
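The 7:3 mixture and the epoch count can be illustrated with a small sampling sketch. The function name `sample_sources` and the per-draw sampling scheme are illustrative assumptions rather than the project's actual data loader, and the corpus size is taken to be roughly 950 billion tokens, which is consistent with about 3 trillion training tokens over about 3 epochs.

```python
import random

# Mixture weights from the text: natural language to code at 7:3.
MIXTURE_WEIGHTS = {"slimpajama": 7, "starcoderdata": 3}


def sample_sources(num_samples, weights, seed=0):
    """Draw a source name for each training sample according to the weights."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=num_samples)


draws = sample_sources(100_000, MIXTURE_WEIGHTS)
natural_language_share = draws.count("slimpajama") / len(draws)

# Assumption: about 950 billion tokens in the combined corpus, seen for about 3 epochs.
CORPUS_TOKENS = 950e9
EPOCHS = 3
total_tokens_seen = CORPUS_TOKENS * EPOCHS

print(f"natural language share of draws: {natural_language_share:.3f}")
print(f"tokens seen over {EPOCHS} epochs: {total_tokens_seen:.3g}")
```

Sampling per draw, rather than concatenating the corpora, keeps the ratio stable throughout training regardless of the relative sizes of the two sources.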
Architecture
Table 1: Model architecture details.
TinyLlama adopts a model architecture similar to that of Llama 2. The specific details of the Transformer architecture are as follows:
Positional embedding: RoPE (Rotary Positional Embedding) is used to inject positional information into the model. RoPE has recently been widely adopted by many mainstream large language models, such as PaLM, Llama, and Qwen.
RMSNorm: With pre-normalization, the input is normalized before each Transformer sub-layer for more stable training. In addition, RMSNorm is used as the normalization technique, which can improve training efficiency.
SwiGLU: Instead of the traditional ReLU non-linearity, TinyLlama follows Llama 2 and uses SwiGLU, a combination of Swish and the gated linear unit, as its activation function.
Grouped-query attention: To reduce memory bandwidth overhead and speed up inference, grouped-query attention is used in the model. There are 32 heads for query attention and 4 groups of key-value heads. With this technique, the model can share key and value representations across multiple heads without sacrificing much performance.
Speed optimization
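Before turning to the training-side optimizations, the architectural components described above can be made concrete with a minimal pure-Python sketch of RMSNorm, the SwiGLU gating step, and the mapping from query heads to key-value groups. This is an illustration under stated assumptions, not the project's implementation: the function names are hypothetical, the linear projections that produce `gate` and `up` are omitted, the `eps` value is an assumed default, and the contiguous assignment of query heads to groups is a common convention assumed here.

```python
import math

NUM_QUERY_HEADS = 32  # from the text
NUM_KV_GROUPS = 4     # from the text


def rms_norm(x, weight, eps=1e-5):
    """Divide x by its root mean square, then scale by a learned weight."""
    mean_square = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_square + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]


def silu(v):
    """Swish (SiLU) activation, v * sigmoid(v), written to avoid overflow."""
    if v >= 0:
        return v / (1.0 + math.exp(-v))
    e = math.exp(v)
    return v * e / (1.0 + e)


def swiglu(gate, up):
    """SwiGLU gating: SiLU(gate) times up, element-wise.

    `gate` and `up` stand for the outputs of two separate linear
    projections of the same input (the projections are omitted here).
    """
    return [silu(g) * u for g, u in zip(gate, up)]


def kv_group_for_query_head(query_head):
    """Return the key-value group shared by a given query head.

    Assumption: heads are assigned to groups in contiguous blocks.
    """
    heads_per_group = NUM_QUERY_HEADS // NUM_KV_GROUPS
    return query_head // heads_per_group


normed = rms_norm([3.0, 4.0], [1.0, 1.0])
gated = swiglu([0.0, 1.0], [5.0, 2.0])
groups = [kv_group_for_query_head(h) for h in range(NUM_QUERY_HEADS)]

# Relative to giving each query head its own key-value head, the
# key-value cache shrinks by this factor (same head dimension assumed).
kv_cache_reduction = NUM_QUERY_HEADS // NUM_KV_GROUPS
```

With 32 query heads and 4 key-value groups, every 8 query heads share one set of keys and values, which is where the reduction in memory bandwidth comes from.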
Fully Sharded Data Parallel (FSDP): During training, the code base integrates FSDP to make efficient use of multi-GPU and multi-node setups. This integration is crucial for scaling the training process across multiple compute nodes, and it significantly improves training speed and efficiency.
Flash Attention: Another key improvement is the integration of Flash Attention 2, an optimized attention mechanism. The repository also provides fused layer normalization, fused cross-entropy loss, and fused rotary positional embedding, which together play a key role in increasing computational throughput.
xFormers: The original SwiGLU module is replaced with the fused SwiGLU module from the xFormers repository, which further improves the efficiency of the code base. With these features, the memory footprint is reduced enough for the 1.1B-parameter model to fit within 40GB of GPU RAM.
Performance analysis and comparison with other models
Together, these elements raise the training throughput to 24,000 tokens per second per A100-40G GPU. Compared with other models, such as Pythia-1.0B and MPT-1.3B, the code base demonstrates superior training speed. For example, the TinyLlama-1.1B model requires only 3,456 A100 GPU-hours for 300 billion tokens, whereas Pythia takes 4,830 hours and MPT takes 7,920 hours. This shows the effectiveness of the optimizations and the potential for substantial savings in time and resources in large-scale model training.
Figure 1: Comparison of the training speed of the code base with Pythia and MPT.
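The GPU-hour figures quoted above can be cross-checked against the stated throughput with a little arithmetic. The sketch below assumes the comparison is over 300 billion tokens, which is the token count that the quoted throughput and GPU-hours imply; the helper names are illustrative.

```python
SECONDS_PER_HOUR = 3600


def gpu_hours(num_tokens, tokens_per_second_per_gpu):
    """GPU-hours needed to process num_tokens at a given per-GPU throughput."""
    return num_tokens / tokens_per_second_per_gpu / SECONDS_PER_HOUR


def implied_throughput(num_tokens, hours):
    """Per-GPU tokens per second implied by a GPU-hour budget."""
    return num_tokens / (hours * SECONDS_PER_HOUR)


TOKENS = 300e9  # assumption: 300 billion tokens, see the note above

# TinyLlama: 24,000 tokens per second per GPU, as quoted in the text.
estimated_tinyllama_hours = gpu_hours(TOKENS, 24_000)

# Throughput implied by the GPU-hours quoted for the other two models.
pythia_throughput = implied_throughput(TOKENS, 4_830)
mpt_throughput = implied_throughput(TOKENS, 7_920)

# Relative cost compared with TinyLlama's quoted 3,456 GPU-hours.
speedup_vs_pythia = 4_830 / 3_456
speedup_vs_mpt = 7_920 / 3_456

print(f"estimated TinyLlama GPU-hours: {estimated_tinyllama_hours:.0f}")
print(f"implied Pythia throughput: {pythia_throughput:.0f} tokens/s/GPU")
print(f"implied MPT throughput: {mpt_throughput:.0f} tokens/s/GPU")
```

The estimate of roughly 3,470 GPU-hours is close to the quoted 3,456; a small gap of this size is what one would expect if 24,000 is itself a rounded figure.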
The framework is built on lit-gpt. Following Llama 2, an autoregressive language modeling objective is used during the pre-training phase. Consistent with the Llama 2 settings, the AdamW optimizer is used together with a cosine learning rate schedule. The learning rate is warmed up over 2,000 steps to facilitate optimization. The batch size is set to 2 million tokens, the weight decay to 0.1, and the gradient clipping threshold to 1.0. TinyLlama is pre-trained with 16 A100-40G GPUs in this project.
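A cosine learning rate schedule with warmup of the kind described above can be sketched as follows. The warmup length and batch size come from the text, and the total token count is the roughly 3 trillion tokens mentioned in the introduction. The maximum and minimum learning rates are not given in the text, so the values below are assumptions for illustration only, as is the linear shape of the warmup.

```python
import math

# From the text.
WARMUP_STEPS = 2_000
BATCH_TOKENS = 2_000_000
TOTAL_TOKENS = 3e12  # about 3 trillion tokens, from the introduction

# Assumptions for illustration only (not stated in the text).
MAX_LR = 4e-4
MIN_LR = 4e-5

TOTAL_STEPS = int(TOTAL_TOKENS // BATCH_TOKENS)


def learning_rate(step, max_lr=MAX_LR, min_lr=MIN_LR,
                  warmup_steps=WARMUP_STEPS, total_steps=TOTAL_STEPS):
    """Linear warmup to max_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        # Assumed linear warmup; reaches max_lr on the last warmup step.
        return max_lr * (step + 1) / warmup_steps
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (max_lr - min_lr) * cosine


for step in (0, WARMUP_STEPS - 1, TOTAL_STEPS // 2, TOTAL_STEPS):
    print(step, f"{learning_rate(step):.3e}")
```

At 2 million tokens per batch, 3 trillion tokens correspond to 1.5 million optimizer steps, so the 2,000 warmup steps are a very small fraction of the run.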
TinyLlama performs better than existing open-source language models of similar size. Specifically, TinyLlama outperforms OPT-1.3B and Pythia-1.4B on a variety of downstream tasks. TinyLlama is open source and aims to improve accessibility for language model researchers.
TinyLlama is evaluated on a range of common-sense reasoning and problem-solving tasks and compared with several existing open-source language models of similar size.
Baseline model
The main focus is on decoder-only language models with about 1 billion parameters. Specifically, TinyLlama is compared with OPT-1.3B, Pythia-1.0B, and Pythia-1.4B.
Common-sense reasoning tasks
To assess TinyLlama's common-sense reasoning ability, the following tasks are considered: HellaSwag, OpenBookQA, WinoGrande, ARC-Easy and ARC-Challenge, BoolQ, and PIQA. The Language Model Evaluation Harness framework is used to evaluate the models. Following previous practice, the models are evaluated on these tasks in a zero-shot setting. The results are shown in Table 2. TinyLlama outperforms the baseline models on many of the tasks and achieves the highest average score.
Table 2: Zero-shot performance on common-sense reasoning tasks.
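In a zero-shot multiple-choice evaluation of this kind, a common approach is to score each candidate answer by the log-likelihood the model assigns to it and to pick the highest-scoring one. The sketch below illustrates that idea with a toy scorer; `toy_loglikelihood` is a hypothetical stand-in for a real language model, and none of these function names come from the evaluation framework itself.

```python
from typing import Callable, Dict, List, Sequence


def pick_answer(context: str, choices: Sequence[str],
                loglikelihood: Callable[[str, str], float]) -> int:
    """Return the index of the choice with the highest score given the context."""
    scores = [loglikelihood(context, choice) for choice in choices]
    return max(range(len(choices)), key=scores.__getitem__)


def accuracy(examples: List[Dict], loglikelihood: Callable[[str, str], float]) -> float:
    """Fraction of examples where the top-scoring choice matches the label."""
    correct = sum(
        pick_answer(ex["context"], ex["choices"], loglikelihood) == ex["label"]
        for ex in examples
    )
    return correct / len(examples)


def toy_loglikelihood(context: str, continuation: str) -> float:
    """Hypothetical scorer: rewards word overlap and lightly penalizes length.

    A real evaluation would sum the model's token log-probabilities instead.
    """
    shared = set(context.lower().split()) & set(continuation.lower().split())
    return float(len(shared)) - 0.1 * len(continuation.split())


EXAMPLES = [
    {"context": "The cat sat on the",
     "choices": ["moon", "mat near the cat"], "label": 1},
    {"context": "Water freezes when it is",
     "choices": ["cold because water freezes", "hot outside"], "label": 0},
]

toy_accuracy = accuracy(EXAMPLES, toy_loglikelihood)
```

No solved examples are placed in the context here, which is what makes the setting zero-shot; the model is scored only on how it ranks the candidate answers.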
Performance evolution during training
During TinyLlama's pre-training, its accuracy on the common-sense reasoning benchmarks was tracked, as shown in Figure 2. Overall, TinyLlama's performance improves as more computational resources are spent, and it surpasses the accuracy of Pythia-1.4B on most benchmarks.
Figure 2: Evolution of performance on the common-sense reasoning benchmarks during pre-training. The performance of Pythia-1.4B is also included in the figure for comparison.
Problem Solving Assessment
TinyLlama's problem-solving ability is also evaluated using the InstructEval benchmark. The benchmark includes the following tasks:
Massive Multitask Language Understanding (MMLU): This task measures a model's world knowledge and problem-solving ability across a variety of subjects. The models are evaluated in the 5-shot setting.
BIG-Bench Hard (BBH): This is a subset of 23 challenging tasks from the BIG-Bench benchmark (Srivastava et al., 2022), designed to measure a language model's ability to follow complex instructions. The models are evaluated in the 3-shot setting.
Discrete Reasoning Over Paragraphs (DROP): This reading comprehension task measures a model's mathematical reasoning ability. The models are evaluated in the 3-shot setting.
HumanEval: This task measures a model's programming ability. The models are evaluated in the zero-shot setting.
The results of the evaluation are shown in Table 3. TinyLlama demonstrates better problem-solving skills than the existing models.
Table 3: Performance on problem-solving tasks from the InstructEval benchmark.
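The few-shot settings above differ only in how many solved examples are placed in the prompt before the question being tested; zero-shot means none are. A minimal sketch of such prompt construction follows. The `Question:`/`Answer:` template and the function name `build_prompt` are hypothetical choices made for illustration and are not the templates used by InstructEval.

```python
from typing import List, Tuple


def build_prompt(demonstrations: List[Tuple[str, str]], question: str, k: int) -> str:
    """Prepend k solved examples to the question; k = 0 gives a zero-shot prompt."""
    if k > len(demonstrations):
        raise ValueError("not enough demonstrations for the requested k")
    blocks = [f"Question: {q}\nAnswer: {a}" for q, a in demonstrations[:k]]
    # The final block leaves the answer empty for the model to complete.
    blocks.append(f"Question: {question}\nAnswer:")
    return "\n\n".join(blocks)


DEMOS = [
    ("What is 2 + 2?", "4"),
    ("What is 3 + 5?", "8"),
    ("What is 7 - 4?", "3"),
]

three_shot = build_prompt(DEMOS, "What is 6 + 1?", k=3)
zero_shot = build_prompt(DEMOS, "What is 6 + 1?", k=0)

print(three_shot)
```

Keeping the template identical across models matters more than the template itself, since the comparison in the table is only meaningful when every model sees the same prompts.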
*Title: TinyLlama: An Open-Source Small Language Model*