Edited by Alan
Recently, the Allen Institute for AI released Unified-IO 2. The first generation of Unified-IO foreshadowed the capabilities of GPT-4 and other models, so this new generation gives us a glimpse of what GPT-5 may look like.
When will GPT-5 arrive and what will it do?
A new model from the Allen Institute for AI tells you the answer.
Unified-IO 2 from the Allen Institute for Artificial Intelligence is the first model that can process and generate text, images, audio, and action sequences.
This new AI model is trained on billions of data points, and although its size is only 7B parameters, it exhibits the broadest range of multimodal capabilities to date.
So, what does Unified-IO 2 have to do with GPT-5?
Back in June 2022, the Allen Institute for Artificial Intelligence launched the first generation of Unified-IO, one of the first multimodal models capable of processing images and language.
Around the same time, OpenAI was testing GPT-4 internally and officially released it in March 2023.
Unified-IO can therefore be seen as a preview of future large-scale AI models.
By the same logic, OpenAI may already be testing GPT-5 internally and could release it in a few months.
And the capabilities that Unified-IO 2 shows us this time will also be what we can look forward to in the new year:
New AI models such as GPT-5 will be able to handle more modalities, perform many tasks natively thanks to broad training, and have a basic understanding of how to interact with objects and robots.
The training data for Unified-IO 2 includes: 1 billion image-text pairs, 1 trillion text tokens, 180 million video clips, 130 million documents with interleaved images and text, 3 million 3D assets, and 1 million robot agent motion sequences.
The research team combined more than 120 datasets into a 600 TB package covering 220 vision, language, audio, and action tasks.
Unified-IO 2 uses an encoder-decoder architecture with some changes to stabilize training and make efficient use of multimodal signals.
The model can answer questions, write text based on instructions, and analyze text content.
The model can also recognize image content, provide image descriptions, perform image processing tasks, and create new images based on text descriptions.
It can also generate videos or sounds based on descriptions or instructions, as well as analyze videos and answer questions about them.
By using robot data for training, Unified-IO 2 can also generate actions for the robot system, such as converting instructions into action sequences for robots.
Thanks to multimodal training, it can also work across modalities, for example, marking on an image the instrument that plays a given audio track.
Unified-IO 2 performs well on more than 35 benchmarks, including image generation and understanding, natural language understanding, video and audio understanding, and robot manipulation.
In most tasks, it is on par with, or even better than, a dedicated model.
Unified-IO 2 achieved the highest score to date on the GRIT benchmark for image tasks (GRIT tests, among other things, how the model handles image noise and other issues).
The researchers now plan to scale Unified-IO 2 further, improve data quality, and convert the encoder-decoder model to an industry-standard decoder-only architecture.
Unified-IO 2
Unified-IO 2 is the first autoregressive multimodal model capable of understanding and generating images, text, audio, and motion.
To unify the different modalities, the researchers tokenized the inputs and outputs (images, text, audio, actions, bounding boxes, etc.) into a shared semantic space, which is then processed by a single encoder-decoder transformer model.
Due to the sheer volume of data used to train the model and the variety of different modalities, the researchers employed a range of techniques to improve the entire training process.
In order to effectively facilitate self-supervised learning signals across multiple modalities, the researchers developed a novel multimodal mixture of denoisers objective that combines cross-modal denoising and generation.
Dynamic packing was also developed to handle highly variable sequence lengths, increasing training throughput by 4x.
To overcome stability and scalability issues in training, the researchers made architectural changes, including 2D rotary position embeddings, QK normalization, and a scaled cosine attention mechanism in the perceiver resampler.
For instruction tuning, the researchers make sure there is a clear prompt for each task, whether it is an existing task or a newly created one. Open-ended tasks are also included, and synthetic tasks are created for less common modalities to increase the diversity of tasks and instructions.
Multimodal data is encoded into a sequence of tokens in a shared representation space, as follows:
Text inputs and outputs are tokenized with the byte-pair encoder from LLaMA; sparse structures such as bounding boxes, keypoints, and camera poses are discretized and then encoded using 1000 special tokens added to the vocabulary.
Points are encoded with two tokens (x, y), boxes with a sequence of four tokens (top-left and bottom-right corners), and 3D boxes with 12 tokens (encoding the projected center, virtual depth, log-normalized box dimensions, and continuous allocentric rotation).
For embodied tasks, discrete robot actions are generated as text commands (e.g., move forward). Special tokens are used to encode the state of the robot (e.g., position and rotation).
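To make the coordinate discretization concrete, here is a minimal sketch in the spirit of the description above; the token naming scheme, the helper functions, and the normalization to [0, 1] are illustrative assumptions rather than the paper's exact implementation:

```python
# Illustrative sketch: discretizing continuous coordinates into 1000 special tokens.
NUM_BINS = 1000

def coord_to_token(value: float) -> str:
    """Map a coordinate normalized to [0, 1] onto one of 1000 location tokens."""
    bin_id = min(int(value * NUM_BINS), NUM_BINS - 1)
    return f"<extra_id_{bin_id}>"  # token name is an assumption

def encode_point(x: float, y: float) -> list[str]:
    """A point becomes two tokens: (x, y)."""
    return [coord_to_token(x), coord_to_token(y)]

def encode_box(x1: float, y1: float, x2: float, y2: float) -> list[str]:
    """A bounding box becomes four tokens: top-left corner, then bottom-right corner."""
    return encode_point(x1, y1) + encode_point(x2, y2)

if __name__ == "__main__":
    # A box covering the upper-left quadrant of a normalized image.
    print(encode_box(0.0, 0.0, 0.5, 0.5))
    # -> ['<extra_id_0>', '<extra_id_0>', '<extra_id_500>', '<extra_id_500>']
```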
Images are encoded with a pre-trained Vision Transformer (ViT). Patch features from the second and penultimate layers of the ViT are concatenated to capture both low- and high-level visual information.
For image generation, a VQ-GAN is used to convert the image into discrete tokens: a dense pre-trained VQ-GAN with a patch size of 8×8 encodes a 256×256 image into 1024 tokens with a codebook size of 16512.
Per-pixel labels, such as depth, surface normals, and binary segmentation masks, are represented as RGB images.
Unified-IO 2 encodes up to 4.08 seconds of audio as a spectrogram, which is then encoded with a pre-trained Audio Spectrogram Transformer (AST); the input embedding is constructed by concatenating the second and penultimate layer features of the AST and applying a linear layer, just as with the image ViT.
For audio generation, the audio is converted into discrete tokens using a ViT-VQGAN with a patch size of 8×8, encoding a 256×128 spectrogram into 512 tokens with a codebook size of 8196.
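The token counts quoted above follow directly from the patch size: a 256×256 image split into 8×8 patches yields a 32×32 grid, i.e. 1024 tokens, and a 256×128 spectrogram yields a 32×16 grid, i.e. 512 tokens. A quick sketch of that arithmetic:

```python
def num_vq_tokens(height: int, width: int, patch: int = 8) -> int:
    """Number of discrete codes when an input is split into patch x patch cells."""
    assert height % patch == 0 and width % patch == 0
    return (height // patch) * (width // patch)

print(num_vq_tokens(256, 256))  # image: 32 * 32 = 1024 tokens (codebook size 16512)
print(num_vq_tokens(256, 128))  # spectrogram: 32 * 16 = 512 tokens (codebook size 8196)
```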
The model accepts up to four additional image and audio segments as input; these are also encoded with the ViT or AST and then passed through perceiver resamplers, which further compress the features to a smaller number of tokens (32 per image, 16 per audio segment).
This greatly reduces the sequence length while still allowing the model to inspect the images or audio segments in the history in reasonable detail when using them as context.
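To illustrate the idea behind the perceiver resampler: a small set of learned query vectors cross-attends to the variable-length patch features and returns a fixed-length summary (32 tokens per image and 16 per audio segment, per the text above). The sketch below uses random weights, a single attention head, and an arbitrary feature width of 64, all of which are simplifying assumptions:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def perceiver_resample(features, num_latents, rng):
    """Compress (seq_len, dim) features into (num_latents, dim) via cross-attention.

    In the real model the latent queries and projections are learned; here they
    are random, since this only illustrates the shape reduction.
    """
    dim = features.shape[-1]
    latents = rng.normal(size=(num_latents, dim))   # learned queries (random here)
    scores = latents @ features.T / np.sqrt(dim)    # (num_latents, seq_len)
    return softmax(scores) @ features               # (num_latents, dim)

rng = np.random.default_rng(0)
image_patches = rng.normal(size=(1024, 64))  # e.g. 1024 ViT patch features
audio_patches = rng.normal(size=(512, 64))   # e.g. 512 AST patch features
print(perceiver_resample(image_patches, 32, rng).shape)  # (32, 64)
print(perceiver_resample(audio_patches, 16, rng).shape)  # (16, 64)
```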
The researchers observed that, as additional modalities were integrated, the standard implementation used in U-IO became increasingly unstable during training.
As shown in figures (a) and (b) below, training on image generation alone (green curve) results in stable loss and gradient norm convergence.
Combining image and text tasks (orange curve) slightly increases the gradient norm compared to a single modality, but training remains stable. Including the video modality (blue curve), however, causes the gradient norm to grow without bound.
As shown in (c) and (d) in the figure, when the XXL version of the model is trained on all modalities, the loss diverges after 350k steps and next-token prediction accuracy drops sharply at 400k steps.
To solve this problem, the researchers made various architectural changes:
Rotary position embeddings (RoPE) are applied at every transformer layer; for non-text modalities, RoPE is extended to two-dimensional positions. When image and audio modalities are included, LayerNorm is applied to Q and K before the dot-product attention computation.
In addition, training was significantly stabilized by using a perceiver resampler, which compresses each image frame and audio segment into a fixed number of tokens, and by using scaled cosine attention to apply stricter normalization within the perceiver.
To avoid numerical instability, float32 attention logits are also enabled, and the ViT and AST are frozen during pre-training and fine-tuned at the end of instruction tuning.
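A simplified sketch of the QK-normalization idea: applying LayerNorm to queries and keys before the dot product and computing the attention logits in float32 keeps the logits bounded and the softmax numerically stable even when activations grow large. Single-head attention and a LayerNorm without learned scale and offset are simplifying assumptions:

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """LayerNorm over the last axis (learned scale/offset omitted for brevity)."""
    x = x.astype(np.float32)
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def qk_norm_attention(q, k, v):
    """Dot-product attention with LayerNorm on Q and K and logits kept in float32."""
    q, k = layer_norm(q), layer_norm(k)         # bounded-magnitude queries and keys
    logits = (q @ k.T) / np.sqrt(q.shape[-1])   # float32 logits
    weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
# Even with very large activations, the normalized logits stay well-behaved.
q = rng.normal(scale=1000.0, size=(8, 16))
k = rng.normal(scale=1000.0, size=(8, 16))
v = rng.normal(size=(8, 16))
print(qk_norm_attention(q, k, v).shape)  # (8, 16)
```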
The figure above shows that despite the heterogeneity of the input and output modalities, the pre-training loss of the model remains stable.
This work follows the UL2 paradigm. For image and audio targets, two analogous paradigms are defined:
[R]: masked denoising, where x% of the input image or audio patch features are randomly masked and the model must reconstruct them;
[S]: the model must generate the target modality conditioned on the other input modalities.
During training, modality tokens ([Text], [Image], or [Audio]) and paradigm tokens ([R], [S], or [X]) are prepended to the input text to indicate the task, and dynamic masking is used for autoregression.
As shown in the image above, one problem with masked denoising for images and audio is information leakage on the decoder side.
The solution here is to also mask these tokens in the decoder (except for the token currently being predicted), which does not interfere with causal prediction while eliminating the data leak.
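A toy sketch of how an [R]-style denoising example might be assembled, with the modality and paradigm tokens as a prefix and randomly masked patch positions; the token strings, the 30% mask rate, and the helper itself are illustrative assumptions:

```python
import random

MASK_TOKEN = "<mask>"

def make_mask_denoising_example(patch_tokens, modality="[image]", mask_rate=0.3, seed=0):
    """Build an [R]-style example: prefix tags plus partially masked input patches.

    The model's target is to reconstruct the masked patches; on the decoder side
    the corresponding positions are masked as well so they cannot leak.
    """
    rng = random.Random(seed)
    masked_input, targets = [], []
    for tok in patch_tokens:
        if rng.random() < mask_rate:
            masked_input.append(MASK_TOKEN)
            targets.append(tok)
        else:
            masked_input.append(tok)
    encoder_input = [modality, "[r]"] + masked_input
    return encoder_input, targets

enc, tgt = make_mask_denoising_example([f"patch_{i}" for i in range(8)])
print(enc)  # prefix tokens followed by the partially masked patch sequence
print(tgt)  # the masked-out patches the model must reconstruct
```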
Training on large amounts of multimodal data leads to highly variable sequence lengths for the transformer's inputs and outputs.
Packing is used to solve this problem: the tokens of multiple examples are packed into a single sequence, and attention is masked so that the transformer cannot cross-attend between examples.
During training, heuristics are used to rearrange the data streamed to the model so that long examples are matched with short ones they can be packed with. The dynamic packing described here increases training throughput by nearly 4x.
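To make the packing idea concrete, here is a small sketch that derives segment IDs for a packed sequence and builds the corresponding attention mask, which only permits attention within the same example; the example lengths and the boolean mask convention are illustrative assumptions:

```python
import numpy as np

def packed_attention_mask(example_lengths):
    """Block-diagonal mask: position i may attend to position j only if both
    belong to the same packed example."""
    segment_ids = np.concatenate(
        [np.full(length, idx) for idx, length in enumerate(example_lengths)]
    )
    return segment_ids[:, None] == segment_ids[None, :]

# Three examples of lengths 3, 2, and 4 packed into one sequence of length 9.
print(packed_attention_mask([3, 2, 4]).astype(int))
```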
Multimodal instruction tuning is the key process for equipping the model with diverse skills and abilities across modalities, and even for adapting it to new and unique instructions.
The researchers constructed a multimodal instruction tuning dataset by combining a wide range of supervised datasets and tasks.
The distribution of instruction tuning data is shown in the figure above. Overall, the instruction tuning mixture includes 60% prompting data, 30% data carried over from pre-training (to avoid catastrophic forgetting), 6% task-augmentation data built from existing data sources, and 4% free-form text (for chat-like responses).
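As a small sketch of how such a mixture could be sampled during tuning: the category names mirror the percentages above, while the sampling helper itself is an assumption about one possible implementation:

```python
import random

# Instruction-tuning mixture weights from the text above.
MIXTURE = {
    "prompting_data": 0.60,
    "pretraining_carryover": 0.30,
    "task_augmentation": 0.06,
    "free_form_text": 0.04,
}

def sample_source(rng: random.Random) -> str:
    """Pick which data source the next training example is drawn from."""
    return rng.choices(list(MIXTURE), weights=list(MIXTURE.values()), k=1)[0]

rng = random.Random(0)
counts = {name: 0 for name in MIXTURE}
for _ in range(10_000):
    counts[sample_source(rng)] += 1
print(counts)  # roughly proportional to the mixture weights
```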