With the launch of Mixtral 8x7B (announcement, model card), a class of Transformer models called Mixture of Experts (MoE) has attracted a lot of attention in the open-source AI community. In this blog post, we'll dive into the core components of MoEs, how they are trained, and the trade-offs to consider during inference.
Let's get started!
Mixture of Experts (MoE) models in a nutshell:
Compared to dense models, MoEs are pretrained much faster. Compared to a dense model with the same number of parameters, they have faster inference. They require a lot of VRAM, because all experts need to be loaded into memory. They face many challenges in fine-tuning, but recent work shows that instruction tuning MoEs holds a lot of promise.
Model size is one of the key factors in improving model performance. With a limited budget of computing resources, training a larger model with fewer training steps is often more effective than training a smaller model with more steps.
A significant advantage of Mixture of Experts (MoE) models is that they can be pretrained with far less compute than dense models require. This means you can significantly scale up the model or dataset size with the same compute budget. In particular, during pretraining, a MoE model can often reach the same quality as its dense counterpart much faster.
So, what exactly is a Mixture of Experts (MoE)? In the context of Transformer models, a MoE consists of two main components:
Sparse MoE layers: These layers replace the feed-forward network (FFN) layers of the traditional Transformer. A MoE layer contains a number of "experts" (e.g., 8), each of which is a separate neural network in its own right. In practice, these experts are usually feed-forward networks (FFNs), but they can also be more complex networks or even MoE layers themselves, leading to a hierarchical MoE structure.
A gating network, or router: This component decides which tokens are sent to which expert. For example, in the diagram below, the token "More" might be sent to the second expert and the token "Parameters" to the first. A token can sometimes even be sent to more than one expert. How tokens are routed is one of the key decisions when working with MoEs: the router consists of learned parameters and is pretrained together with the rest of the network.
MoE layer from the Switch Transformers paper. To summarize: in a Mixture of Experts (MoE) model, we replace every feed-forward network (FFN) layer of the traditional Transformer with a MoE layer composed of two core parts, a gating network and a number of experts.
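To make this concrete, here is a minimal PyTorch sketch of such a layer. It is not the actual Switch Transformers or Mixtral implementation; class names, dimensions, and the simple per-expert loop are assumptions made for illustration. A learned router picks the top-2 experts per token, and the selected experts' outputs are combined with the router weights.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ExpertFFN(nn.Module):
    """A single expert: an ordinary feed-forward block."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))

    def forward(self, x):
        return self.net(x)

class SparseMoELayer(nn.Module):
    """Replaces a dense FFN layer: route each token to its top-k experts."""
    def __init__(self, d_model: int, d_ff: int, num_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.experts = nn.ModuleList([ExpertFFN(d_model, d_ff) for _ in range(num_experts)])
        self.router = nn.Linear(d_model, num_experts, bias=False)  # learned gating
        self.top_k = top_k

    def forward(self, x):                       # x: (batch, seq, d_model)
        tokens = x.reshape(-1, x.shape[-1])     # flatten to (num_tokens, d_model)
        logits = self.router(tokens)            # (num_tokens, num_experts)
        weights, expert_idx = torch.topk(F.softmax(logits, dim=-1), self.top_k, dim=-1)
        weights = weights / weights.sum(dim=-1, keepdim=True)   # renormalize over the chosen k
        out = torch.zeros_like(tokens)
        for i, expert in enumerate(self.experts):
            for k in range(self.top_k):
                mask = expert_idx[:, k] == i    # tokens whose k-th choice is expert i
                if mask.any():
                    out[mask] += weights[mask, k].unsqueeze(-1) * expert(tokens[mask])
        return out.reshape(x.shape)
```

The loop over experts is written for clarity; efficient implementations instead gather each expert's tokens into contiguous buffers, which is exactly where the capacity and load-balancing issues discussed below come from.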
While MoEs offer significant advantages such as more efficient pretraining and faster inference compared to dense models, they also come with some challenges:
Training challenges: Although MoEs enable more compute-efficient pretraining, they have historically struggled to generalize during fine-tuning and are prone to overfitting.
Inference challenges: A MoE model may have a very large number of parameters, but only a subset of them is used during inference, so it is faster than a dense model with the same total number of parameters. However, all parameters still need to be loaded into memory, so the VRAM requirements are high. For a MoE like Mixtral 8x7B, you need enough VRAM to hold a dense 47B-parameter model. Why 47B rather than 8 x 7B = 56B? Because in MoE models only the FFN layers are treated as separate experts, while the rest of the model's parameters are shared. At the same time, assuming only two experts are used per token, the inference speed (in FLOPs) is comparable to that of a 12B model (rather than a 14B model): it computes 2x7B worth of matrix multiplications, but some layers are shared.
Now that we understand the basic concepts of MoE, let's look at the research that has driven this type of model forward.
The idea of Mixture of Experts (MoE) dates back to 1991 and the paper Adaptive Mixture of Local Experts. Similar in spirit to ensemble methods, the idea was to set up a supervised procedure for a system composed of several separate networks, where each network (an "expert") handles a different subset of the training samples and focuses on a specific region of the input space. How is the expert chosen for a given input? This is where the gating network comes in: it determines the weight assigned to each expert. During training, both the experts and the gating network are trained jointly to optimize their performance and decision-making.
Between 2010 and 2015, two separate research directions contributed significantly to the later development of Mixture of Experts (MoE):
Experts as components: In the traditional MoE setup, the whole system consists of a gating network and multiple experts. MoEs as whole models had been explored for support vector machines (SVMs), Gaussian processes, and other methods. The work of Eigen, Ranzato, and Ilya Sutskever instead explored MoEs as components of deeper networks. This allows a MoE to be embedded as a layer within a multi-layer network, making the model both large and efficient.
Conditional computation: Traditional neural networks process every input through every layer. In this period, researchers such as Yoshua Bengio explored ways to dynamically activate or deactivate parts of the network depending on the input token.
These lines of work encouraged the exploration of mixture-of-experts models in natural language processing (NLP). In particular, in 2017, Shazeer et al. (a team that included Geoffrey Hinton and Jeff Dean, the latter sometimes jokingly referred to as "Google's Chuck Norris") applied the idea to a 137B-parameter LSTM (the dominant NLP architecture at the time, created by Schmidhuber). By introducing sparsity, this work achieved fast inference at a very large scale. It focused mainly on translation and faced a variety of challenges, such as high communication costs and training instability.
The MoE layer introduced in the Outrageously Large Neural Networks paper made it possible to train models with hundreds of billions or even trillions of parameters, such as the open-source 1.6-trillion-parameter Switch Transformers. MoEs are not only widely used in natural language processing (NLP) but have also started to be explored in computer vision. This blog post, however, focuses primarily on applications in NLP.
The concept of sparsity builds on the idea of conditional computation. In a traditional dense model, all parameters are used for every input. Sparsity, in contrast, lets us run only certain parts of the overall system: not all parameters are activated for every input, but only a subset of them, depending on the specific input being processed.
Let's dig deeper into Shazeer's exploration of MoEs for translation. The idea of conditional computation (activating different parts of the network on a per-example basis) makes it possible to scale up model size without a corresponding increase in computation. This strategy enabled the use of thousands of experts in each MoE layer.
This sparsity setup does bring some challenges. For example, while a large batch size is generally good for performance, the effective batch size in a MoE shrinks as data flows through the activated experts. For instance, if an input batch has 10 tokens, five of them might be routed to the same expert while the other five are spread across different experts, leading to uneven batch sizes and under-utilized resources. In the sections below, we discuss the other challenges of making MoEs run efficiently, and how to address them.
So how do we solve this? A learnable gating network G decides which experts E receive which part of the input:
y = Σ_i G(x)_i · E_i(x)
In this setup, all experts are run for every input, with their outputs weighted by the gating network. But what happens if G(x)_i is 0? In that case there is no need to compute the corresponding expert at all, and we save compute. So what does a typical gating function look like? In the most traditional setup, it is simply a linear layer followed by a softmax, and the network learns which expert to send each input to.
Shazeer et al. also explored other gating mechanisms, such as noisy top-k gating. This method adds some tunable noise and then keeps only the top k values (a code sketch follows the steps below). Specifically:
1. Add some noise.
2. Keep only the top k values.
3. Apply the softmax function.
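Here is a minimal sketch of those three steps, loosely following the noisy top-k gating of Shazeer et al., which uses a learned, softplus-scaled noise term; the function and variable names are mine, not from any particular library.

```python
import torch
import torch.nn.functional as F

def noisy_top_k_gating(x, w_gate, w_noise, k=2):
    """x: (num_tokens, d_model); w_gate, w_noise: (d_model, num_experts)."""
    clean_logits = x @ w_gate
    noise_scale = F.softplus(x @ w_noise)                 # 1. learned, input-dependent noise scale
    noisy_logits = clean_logits + torch.randn_like(clean_logits) * noise_scale
    top_vals, top_idx = noisy_logits.topk(k, dim=-1)      # 2. keep only the top-k logits
    masked = torch.full_like(noisy_logits, float("-inf"))
    masked.scatter_(-1, top_idx, top_vals)
    return F.softmax(masked, dim=-1)                      # 3. softmax; unselected experts get weight 0
```

Because the masked logits are minus infinity, the softmax assigns exactly zero weight to the unselected experts, so their forward passes can be skipped entirely.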
This sparsity introduces some interesting properties. With a low enough k (e.g., one or two), we can train and run inference much faster than if many experts were activated. Why not route to just the single best expert? The initial conjecture was that routing to more than one expert is needed for the gate to learn how to route effectively, so at least two experts had to be selected. The Switch Transformers work revisits this question.
Why do we add noise? For load balancing across experts!
As discussed earlier, if all tokens are sent to just a few popular experts, training becomes inefficient. In normal MoE training, the gating network tends to converge to activating the same few experts. This is self-reinforcing: popular experts are trained faster and are therefore selected more often. To mitigate this, an auxiliary loss is added to encourage giving all experts equal importance. This loss ensures that all experts receive a roughly equal number of training examples, balancing the selection across experts. The next section also covers the concept of expert capacity, which introduces a threshold on how many tokens a single expert can handle. In transformers, the auxiliary loss is exposed via the aux_loss parameter.
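For intuition, here is a sketch of a Switch-Transformers-style load-balancing auxiliary loss, written from scratch rather than taken from any library (the function name and the alpha value are illustrative assumptions):

```python
import torch
import torch.nn.functional as F

def load_balancing_loss(router_logits, expert_idx, num_experts, alpha=0.01):
    """Auxiliary loss that encourages a uniform token distribution over experts.

    router_logits: (num_tokens, num_experts) raw gate logits
    expert_idx:    (num_tokens,) the expert each token was actually dispatched to (top-1)
    """
    probs = F.softmax(router_logits, dim=-1)
    # f_i: fraction of tokens dispatched to expert i
    dispatch_frac = F.one_hot(expert_idx, num_experts).float().mean(dim=0)
    # P_i: mean router probability assigned to expert i
    router_prob = probs.mean(dim=0)
    # alpha * N * sum_i(f_i * P_i); minimized when both distributions are uniform
    return alpha * num_experts * torch.sum(dispatch_frac * router_prob)
```

This term is simply added to the language-modeling loss during training, weighted by the coefficient alpha.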
Transformer models have clearly shown that scaling up the number of parameters improves performance, so it's no surprise that Google explored this direction with GShard, which scales Transformers beyond 600 billion parameters.
GShard replaces every other feed-forward network (FFN) layer in both the encoder and the decoder with a MoE layer that uses top-2 gating. The diagram below shows the encoder part. This setup works well for large-scale computing: when scaled across multiple devices, the MoE layer is sharded across the devices, while all the other layers are replicated on each device. We discuss this further in the "Making MoEs fly" section.
MoE Transformer encoder from the GShard paper. To maintain load balance and training efficiency, the GShard authors introduced a few key techniques in addition to an auxiliary loss similar to the one discussed in the previous section:
Random routing: In a top-2 setup, the top expert is always selected, but the second expert is chosen randomly with probability proportional to its gate weight.
Expert capacity: A threshold on how many tokens a single expert can process. If both chosen experts are at capacity, the token is considered overflowed: it is passed to the next layer through the residual connection, or in some implementations dropped entirely. Expert capacity is one of the most important concepts in MoEs. Why is it needed? Because all tensor shapes are determined statically at compile time and we cannot know in advance how many tokens will go to each expert, a fixed capacity factor is required.
The GShard paper also made important contributions to parallel computation patterns for MoEs, but discussing them is beyond the scope of this post.
Note: during inference, only some of the experts are activated. At the same time, some computation is shared, such as self-attention, which is applied for all tokens. This is why we can run the 47B-parameter model with 8 experts using the compute budget of a 12B dense model. If we use top-2 gating, 14B parameters would be involved, but because the self-attention operations are shared across experts, the number of parameters actually used per token at runtime is about 12B.
Although Mixture of Experts (MoE) models show great promise, they suffer from training and fine-tuning instabilities. Switch Transformers is a very exciting piece of work that digs into these issues. The authors even released a 1.6-trillion-parameter MoE with 2048 experts on Hugging Face, which you can run with the transformers library. Switch Transformers achieved a 4x pretraining speed-up over T5-XXL.
Switch Transformer layer from the Switch Transformers paper. Just as in GShard, the authors replace the feed-forward network (FFN) layers with MoE layers. The paper proposes a Switch Transformer layer that receives two inputs (two different tokens) and has four experts.
Contrary to the original idea of using at least two experts, Switch Transformers uses a simplified single-expert strategy. The effects of this approach are:
The router computation is reduced.
The batch size of each expert can be at least halved.
Communication costs are reduced.
Model quality is preserved.
Switch Transformers also studies the concept of expert capacity:
Expert capacity = (tokens per batch / number of experts) x capacity factor
The capacity recommended above evenly divides the number of tokens in the batch across the experts. If we use a capacity factor greater than 1, we provide a buffer for when tokens are not perfectly balanced across experts. Increasing the capacity factor increases inter-device communication costs, so this is a trade-off to keep in mind. Notably, Switch Transformers performs well at low capacity factors (1 to 1.25).
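As a concrete illustration of that formula (the numbers here are made up for the example):

```python
# Hypothetical numbers, just to illustrate the expert capacity formula.
tokens_per_batch = 1024
num_experts = 8
capacity_factor = 1.25

# Each expert keeps room for 25% more tokens than a perfectly even split would give it.
expert_capacity = int((tokens_per_batch / num_experts) * capacity_factor)
print(expert_capacity)  # 160 -> tokens beyond this per expert overflow (residual) or are dropped
```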
The authors of Switch Transformer also revisited and simplified the load balancing losses mentioned in the previous sections. During training, the auxiliary loss for each switch layer is added to the total model loss. This loss encourages uniform routing and can be weighted using hyperparameters.
The authors also experimented with selective (mixed) precision, such as training the experts in bfloat16 while doing the rest of the computation in full precision. Lower precision reduces communication costs between processors, computation costs, and the memory used to store tensors. However, in their initial experiments, training both the experts and the gating network in bfloat16 led to unstable training. The instability comes in particular from the routing computation: routing involves operations such as exponentiation, which need higher precision to stay stable and accurate. To mitigate the instability, full precision is used for the routing computation.
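A minimal sketch of that idea, assuming the model weights and activations are in bfloat16; the function is illustrative and not taken from the Switch Transformers codebase:

```python
import torch
import torch.nn.functional as F

def route_in_full_precision(hidden, router_weight, k=2):
    """Compute routing probabilities in float32 even if the rest of the model runs in bfloat16.

    hidden:        (num_tokens, d_model), possibly bfloat16
    router_weight: (d_model, num_experts), possibly bfloat16
    """
    logits = hidden.float() @ router_weight.float()           # exponentials need the extra precision
    probs = F.softmax(logits, dim=-1, dtype=torch.float32)
    weights, expert_idx = probs.topk(k, dim=-1)
    return weights.to(hidden.dtype), expert_idx               # cast back for the bfloat16 expert math
```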
Using selective precision does not degrade model quality and enables faster training. This Jupyter notebook shows a detailed walkthrough of fine-tuning Switch Transformers for summarization, but before you start fine-tuning Switch Transformers, we strongly recommend reading the section below on fine-tuning Mixture of Experts models.
Switch Transformers uses an encoder-decoder setup and is a MoE counterpart of T5. The GLaM paper explores pushing the scale of GPT-3-class models further: it trains a model matching GPT-3 quality using only 1/3 of the energy (because MoE models need less compute to train, their carbon footprint is significantly lower). The authors focused on decoder-only models and few-shot and one-shot evaluation rather than fine-tuning. They used top-2 routing and a much larger capacity factor. In addition, they treated the capacity factor as a quantity that can be adjusted dynamically depending on how much compute is available during training and evaluation.
The balancing loss discussed earlier can lead to instability issues. Many methods can be used to stabilize sparse model training, but often at the expense of quality. For example, introducing dropout improves stability but hurts model quality, while adding more multiplicative components improves quality but reduces stability.
The router z-loss introduced in ST-MoE significantly improves training stability without degrading quality. This loss penalizes large logits entering the gating network: it keeps the absolute magnitude of the values small, which reduces round-off errors in the computation. This matters especially for the gating network, which relies on exponentials. For the full details, we recommend reading the ST-MoE paper.
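A sketch of that loss, following the ST-MoE formulation of penalizing the squared log-sum-exp of the router logits (the coefficient and function name are illustrative assumptions):

```python
import torch

def router_z_loss(router_logits, coef=1e-3):
    """ST-MoE-style router z-loss: penalize large gate logits.

    router_logits: (num_tokens, num_experts)
    """
    z = torch.logsumexp(router_logits, dim=-1)    # one scalar per token
    return coef * (z ** 2).mean()                 # keeps logits small, reducing round-off in the softmax
```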
The ST-MoE authors observed that encoder experts tend to specialize in particular types of tokens or shallow concepts. For example, one expert might specialize in punctuation, another in proper nouns, and so on. Decoder experts, in contrast, show less specialization. The authors also trained the model in a multilingual setting. Although one might expect each expert to specialize in one language, this does not happen in practice: due to token routing and load balancing, no single expert ends up dedicated to a specific language.
Table from the ST-MoE paper showing which token groups were sent to which expert. Adding more experts improves sample efficiency and speeds up the model, but the gains diminish as the number of experts grows (especially beyond 256 or 512 experts), and more VRAM is needed to load the whole model at inference time. Notably, Switch Transformers found that the properties observed at large scale also hold at small scale, even with just 8 experts per layer.
Dense and sparse models behave very differently with respect to overfitting. Sparse models are more prone to overfitting, so it helps to apply stronger regularization inside the experts themselves, e.g. a higher dropout rate. For example, we can use a lower dropout rate for the dense layers and a higher one for the sparse layers.
The transformers library supports Mixtral as of version 4.36.0. You can install it with: pip install "transformers==4.36.0" --upgrade
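For example, here is a minimal sketch of loading and prompting the model with the transformers API. The weights are very large, so this assumes accelerate is installed and enough GPU/CPU memory is available for device_map="auto" to place the shards.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

inputs = tokenizer("Explain mixture of experts in one sentence.", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```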
One question is whether to use the auxiliary loss during fine-tuning. The ST-MoE authors experimented with turning it off and found that, even with up to 11% of tokens being dropped, model quality was not significantly affected. Token dropping may act as a form of regularization that helps prevent overfitting.
The Switch Transformers authors observed that, at the same pretraining perplexity, sparse models do worse than their dense counterparts on downstream tasks, especially reasoning-heavy tasks such as SuperGLUE. On the other hand, for knowledge-heavy tasks such as TriviaQA, sparse models perform disproportionately well. The authors also observed that using fewer experts helped at fine-tuning time. Another observation confirming the generalization problem is that the model did worse on smaller tasks but well on larger tasks.
In the small task (left), we can see clear overfitting: the sparse model does much worse on the validation set. In the larger task (right), the MoE performs well. Figure from the ST-MoE paper.
One possible fine-tuning strategy is to freeze all the non-expert weights. In practice this leads to a big performance drop, which is expected, since the MoE layers make up most of the network. We can try the opposite: freeze only the parameters of the MoE layers. Experiments show this works almost as well as updating all parameters, and it speeds up fine-tuning and reduces memory requirements.
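A sketch of that second strategy is below. It assumes the MoE sub-modules can be identified by a substring in the parameter names; "block_sparse_moe" is the module name used in the transformers Mixtral implementation, and other architectures will need a different keyword.

```python
def freeze_moe_layers(model, moe_keyword="block_sparse_moe"):
    """Freeze only the MoE/expert parameters and fine-tune everything else.

    moe_keyword: substring identifying MoE sub-modules in parameter names
    (e.g. "block_sparse_moe" for Mixtral in transformers); adjust per model.
    """
    for name, param in model.named_parameters():
        if moe_keyword in name:
            param.requires_grad = False

# Then pass only the remaining trainable parameters to the optimizer:
# trainable = [p for p in model.parameters() if p.requires_grad]
```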
By freezing only the MoE layers, we can speed up training while keeping quality. Figure from the ST-MoE paper.
A final thing to keep in mind when fine-tuning sparse MoEs is that they have different fine-tuning hyperparameter preferences: for example, sparse models tend to benefit from smaller batch sizes and higher learning rates.
Smaller batch sizes and higher learning rates improve the fine-tuning quality of sparse models. Figure from the ST-MoE paper.
At this point you might be a bit discouraged by the difficulties people have had fine-tuning MoEs, but a recent paper, MoEs Meets Instruction Tuning (July 2023), brings exciting findings. It runs the following experiments:
Single-task fine-tuning; multi-task instruction tuning; and multi-task instruction tuning followed by single-task fine-tuning.
When the authors fine-tuned a MoE and its compute-matched T5 counterpart directly, the T5 counterpart was better. However, when they fine-tuned the MoE version of Flan T5 (the instruction-tuned version of T5), the MoE performed significantly better. Moreover, the improvement of Flan-MoE over the base MoE was larger than that of Flan T5 over T5, suggesting that MoE models may benefit much more from instruction tuning than dense models do. MoEs also benefit more from a larger number of tasks. And unlike the earlier suggestion to switch off the auxiliary loss, here the loss actually helps prevent overfitting.
Sparse models benefit more from instruction tuning than dense models. Figure from the MoEs Meets Instruction Tuning paper.
Sparse Mixture of Experts (MoE) models are a good fit for scenarios with many machines and high throughput requirements: given a fixed pretraining compute budget, a sparse model will usually achieve better results. Conversely, for low-throughput scenarios with little VRAM, a dense model is the better choice.
Note: it is not appropriate to directly compare the parameter counts of sparse and dense models, since the two represent fundamentally different things.
The original MoE design used a branching structure, which leads to inefficient computation: GPUs are not built for it, and network bandwidth often becomes a bottleneck because experts are distributed across devices and data must be passed between them. In the following sections, we discuss existing work that makes pretraining and inference with these models more practical, in other words, how to make MoEs fly.
Parallel computing
Let's briefly review several forms of parallel computing:
Data parallelism: the same weights are replicated across all nodes, and the data is split across nodes.
Model parallelism: the model is split across nodes, and the same data is replicated on every node.
Model and data parallelism: we split both the model and the data across nodes. Note that different nodes process different batches of data.
Expert parallelism: experts are placed on different nodes. When combined with data parallelism, each node holds different experts and the data is split across all nodes.
With expert parallelism, experts are placed on different nodes and each node processes a different batch of training samples. For non-MoE layers, expert parallelism behaves just like data parallelism. For MoE layers, tokens in the sequence are sent to whichever node holds the required expert.
Illustration from the Switch Transformers paper showing how data and model weights are split across nodes with different parallelism techniques.
Capacity factor and communication costs
Increasing the capacity factor (CF) improves model quality, but it also increases communication costs and the memory needed to hold activations. If communication bandwidth between devices is limited, a smaller capacity factor is the better choice. A reasonable starting point is top-2 routing with a 1.25 capacity factor and one expert per core. When evaluating performance, the capacity factor should be adjusted as needed, trading off communication cost between devices against compute cost.
Deployment techniques
You can deploy mistralai/Mixtral-8x7B-Instruct-v0.1 to Hugging Face Inference Endpoints.
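Once deployed, you can query the endpoint with the huggingface_hub client. This is a sketch: the endpoint URL and token below are placeholders, and you could equally pass the model id to use the serverless API instead.

```python
from huggingface_hub import InferenceClient

# Point the client at your Inference Endpoint URL (or at the model id for the serverless API).
client = InferenceClient(
    model="https://YOUR-ENDPOINT.endpoints.huggingface.cloud",  # placeholder URL
    token="hf_...",                                             # placeholder token
)

response = client.text_generation(
    "[INST] What is a mixture of experts model? [/INST]",  # Mixtral instruct prompt format
    max_new_tokens=128,
)
print(response)
```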
A key challenge in deploying a Mixture of Experts (MoE) model is its sheer parameter count. For local use, we may want to run a smaller model. Here are a few techniques that help make models more deployable:
Distillation: The Switch Transformers authors ran early distillation experiments. Distilling a MoE back into its dense counterpart kept 30-40% of the quality gains from sparsity, so pretraining a MoE and then distilling it gives both faster pretraining and a smaller model at inference time.
Task-level routing: Recent approaches modify the router to route entire sentences or tasks to a single expert, which allows extracting a sub-network for serving and simplifies the deployed model.
Expert aggregation: This technique merges the weights of the experts, reducing the number of parameters needed at inference without a large drop in quality.
More efficient training
FasterMoE (March 2022) analyzes the theoretical performance limits of MoEs under different parallelism strategies, and proposes a range of techniques: methods to handle skewed expert popularity, a fine-grained communication schedule that reduces latency, and a topology-aware gate that selects experts based on the lowest latency. Together, these techniques achieve up to a 17x speed-up.
MegaBlocks (November 2022) explores efficient sparse pretraining by building new GPU kernels that handle the dynamism present in MoEs. Its core proposal never drops tokens and maps efficiently to modern hardware (block-sparse matrix multiplication), leading to significant speed-ups. The key innovation is that, instead of the batched matrix multiplication of traditional MoEs (which assumes all experts have the same shape and process the same number of tokens), the MoE layer is expressed as block-sparse operations that can accommodate an imbalanced token assignment.
Block-sparse matrix multiplication for differently sized experts and numbers of tokens. Figure from the MegaBlocks paper.
Currently, the following open-source projects can be used to train Mixture of Experts (MoE) models:
MegaBlocks
As for openly released MoEs, you can check out:
Switch Transformers (Google): a collection of T5-based MoEs ranging from 8 to 2048 experts. The largest model has 1.6 trillion parameters.
NLLB MoE (Meta): a MoE variant of the NLLB translation model.
OpenMoE: a community effort to train Llama-based MoEs.
Mixtral 8x7B (Mistral): a high-quality MoE that outperforms Llama 2 70B and has much faster inference. An instruction-tuned model was also released. Read more in Mistral's announcement blog post.
Some exciting directions for further work exist. A first one is distilling a sparse MoE (SMoE) back into a dense model with fewer actual parameters but a similar number of effective parameters.
Quantization of MoEs is another interesting area of research. For example, QMoE (October 2023) quantizes MoEs to less than 1 bit per parameter, compressing the 1.6-trillion-parameter Switch Transformer, which would otherwise require 3.2TB of storage, down to just 160GB.
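A quick back-of-the-envelope check of those numbers, assuming the original checkpoint is stored in bfloat16 (2 bytes per parameter):

```python
params = 1.6e12                 # 1.6 trillion parameters (Switch-C)
bf16_bytes = params * 2         # 2 bytes per parameter in bfloat16
print(bf16_bytes / 1e12)        # ~3.2 TB uncompressed

qmoe_bytes = 160e9              # ~160 GB after QMoE compression
print(qmoe_bytes * 8 / params)  # ~0.8 bits per parameter, i.e. below 1 bit
```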
In a nutshell, some interesting areas to explore include:
Distilling Mixtral into a dense model.
Exploring model-merging techniques for the experts and their impact on inference time.
Performing extreme quantization of Mixtral.
Further reading:
Adaptive Mixture of Local Experts (1991)
Learning Factored Representations in a Deep Mixture of Experts (2013)
Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer (2017)
GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding (Jun 2020)
GLaM: Efficient Scaling of Language Models with Mixture-of-Experts (Dec 2021)
Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity (Jan 2022)
ST-MoE: Designing Stable and Transferable Sparse Expert Models (Feb 2022)
FasterMoE: Modeling and Optimizing Training of Large-Scale Dynamic Pre-Trained Models (April 2022)
MegaBlocks: Efficient Sparse Training with Mixture-of-Experts (Nov 2022)
Mixture-of-Experts Meets Instruction Tuning: A Winning Combination for Large Language Models (May 2023)
Mixtral-8x7B-v0.1, Mixtral-8x7B-Instruct-v0.1
@misc{sanseviero2023moe,
  title     = {Mixture of Experts Explained},
  author    = {Sanseviero, Omar and Tunstall, Lewis and Schmid, Philipp and Mangrulkar, Sourab and Belkada, Younes and Cuenca, Pedro},
  year      = 2023,
  url       = {https://huggingface.co/blog/moe},
  publisher = {Hugging Face Blog}
}
Sanseviero, et al., "Mixture of Experts Explained", Hugging Face Blog, 2023.
Original English text: Mixture of Experts Explained. Original authors: Omar Sanseviero, Lewis Tunstall, Philipp Schmid, Sourab Mangrulkar, Younes Belkada, Pedro Cuenca
Translator: xinyu66 (xinyu yang).