Effective training time on thousand-GPU clusters exceeds 95% as Ant Group open-sources its AI Infra technology

Updated on 2024-02-01

On February 1, it was reported that NextEvo, Ant Group's AI innovation R&D department, has fully open-sourced its AI Infra technology, which keeps the effective time of thousand-GPU (kilo-card) training above 95% of total training time and enables "autopilot" during training, improving the efficiency of AI R&D.

Figure: Ant Group's automated distributed deep learning system, DLRover, is now fully open-sourced.

The technical framework, called DLRover, aims to make large-scale distributed training intelligent. Many enterprises now run their training jobs in mixed-deployment clusters, where the operating environment is complex and changeable; no matter how "rugged the terrain", DLRover can "drive through it with ease".

The development of large model technology in 2023 brought an explosion of engineering practice, and how to manage data, improve training and inference efficiency, and make maximal use of existing computing power has become a key challenge.

Training a large model with hundreds of billions of parameters, such as GPT-3, would take about 32 years on a single GPU, so how computing power is used during training is particularly important. One approach is to make better use of the computing power already available, for example by further squeezing performance out of purchased GPUs; the other is to tap computing power that previously could not be used, such as CPUs and memory, which requires heterogeneous computing platforms.
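To make the scale concrete, here is a back-of-the-envelope calculation; the 32-year figure is the article's own, while the GPU count and scaling efficiency below are illustrative assumptions, not reported numbers:

```python
# Back-of-the-envelope scaling estimate (illustrative only; real jobs never
# scale perfectly, and the "32 years on one card" figure comes from the article).
SINGLE_GPU_YEARS = 32        # reported time to train a GPT-3-scale model on one GPU
DAYS_PER_YEAR = 365


def wall_clock_days(num_gpus: int, scaling_efficiency: float) -> float:
    """Estimated training time in days given a GPU count and an efficiency factor."""
    return SINGLE_GPU_YEARS * DAYS_PER_YEAR / (num_gpus * scaling_efficiency)


# At thousand-card scale with an assumed 90% scaling efficiency:
print(f"{wall_clock_days(1000, 0.90):.0f} days")   # ~13 days
```

This is exactly why every percentage point of effective training time on a thousand-card cluster matters.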

The latest addition to DLRover is the Flash Checkpoint (FCP) solution. During training, checkpoints generally need to be saved so that a job can be restored to its most recent state after an interruption. The conventional practice has drawbacks: saving a checkpoint takes a long time, high-frequency checkpointing eats into the time available for training, and low-frequency checkpointing loses too much progress on recovery. After FCP was applied to thousand-GPU training of a hundred-billion-parameter model, the training time wasted on checkpointing dropped by about 5x, persistence time dropped by about 70x, and the effective training time rose from 90% to 95%.
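The core idea behind this kind of speedup is to split checkpointing into a fast, blocking snapshot to host memory and an asynchronous persist to slower storage. Below is a minimal PyTorch sketch of that general technique; it is not DLRover's actual Flash Checkpoint API, and the function name and arguments are made up for illustration:

```python
import copy
import threading

import torch


def save_checkpoint_async(model, optimizer, step, path):
    """Snapshot training state to host memory, then persist it in the background.

    Hypothetical helper for illustration only; not DLRover's Flash Checkpoint API.
    """
    # Blocking part: copy model weights to CPU memory (fast compared with disk I/O).
    snapshot = {
        "step": step,
        "model": {k: v.detach().to("cpu", copy=True)
                  for k, v in model.state_dict().items()},
        # A fuller version would also move optimizer tensors to host memory.
        "optimizer": copy.deepcopy(optimizer.state_dict()),
    }

    # Non-blocking part: write the in-memory snapshot to persistent storage.
    writer = threading.Thread(target=torch.save, args=(snapshot, path), daemon=True)
    writer.start()
    return writer  # caller can join() before exit to guarantee the file is written
```

Because the training loop only waits for the in-memory copy, checkpoints can be taken far more frequently without stalling the GPUs.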

Three new optimizer technologies have also been integrated. Optimizers are a core component of machine learning, used to update neural network parameters so as to minimize the loss function. Among them, Ant's self-developed AGD (Auto-switchable optimizer with Gradient Difference of adjacent steps) speeds up large-model pre-training by 1.5x. AGD has been used in multiple scenarios inside Ant with significant results, and the related paper was accepted at NeurIPS '23.
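For readers unfamiliar with the term, the sketch below shows what an optimizer that preconditions updates with the gradient difference of adjacent steps could look like in PyTorch. It is a heavily simplified toy, loosely following the description above, and is not the open-sourced AGD implementation:

```python
import torch
from torch.optim import Optimizer


class ToyGradDiffOptimizer(Optimizer):
    """Toy optimizer: diagonal preconditioner built from adjacent-step gradient
    differences. Simplified illustration only, NOT the released AGD optimizer."""

    def __init__(self, params, lr=1e-3, eps=1e-8):
        super().__init__(params, dict(lr=lr, eps=eps))

    @torch.no_grad()
    def step(self):
        for group in self.param_groups:
            lr, eps = group["lr"], group["eps"]
            for p in group["params"]:
                if p.grad is None:
                    continue
                state = self.state[p]
                if "prev_grad" not in state:
                    state["prev_grad"] = torch.zeros_like(p.grad)
                    state["precond"] = torch.zeros_like(p.grad)
                # Accumulate squared gradient differences as a diagonal preconditioner.
                diff = p.grad - state["prev_grad"]
                state["precond"].add_(diff * diff)
                state["prev_grad"].copy_(p.grad)
                # Preconditioned update of the parameters.
                p.add_(-lr * p.grad / (state["precond"].sqrt() + eps))
```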

Figure: In large model pre-training tasks, AGD delivers a 1.5x speedup.

As an automated distributed deep learning system, DLRover's "autopilot" modules also include ATorch, a PyTorch distributed training extension library that achieves a hardware compute utilization of up to 60% when training hundred-billion-parameter models at thousand-GPU scale, helping developers further squeeze computing power out of their hardware.
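The 60% figure refers to how much of the hardware's theoretical peak FLOP/s a training job actually achieves. A common back-of-the-envelope way to estimate such a utilization number for dense transformers is the "6 x parameters x tokens" rule of thumb; the numbers in the example below are placeholders, not Ant's measurements or methodology:

```python
# Rough Model-FLOPs-Utilization (MFU) estimate for a dense transformer,
# using the common "6 * params * tokens per second" rule of thumb.
def mfu(params: float, tokens_per_sec: float,
        num_gpus: int, peak_flops_per_gpu: float) -> float:
    achieved = 6 * params * tokens_per_sec   # FLOP/s the model actually needs
    peak = num_gpus * peak_flops_per_gpu     # theoretical cluster peak FLOP/s
    return achieved / peak


# Placeholder example: a 100B-parameter model processing 300k tokens/s
# on 1,000 GPUs rated at 312 TFLOP/s each (BF16).
print(f"{mfu(100e9, 3e5, 1000, 312e12):.1%}")   # ~58%
```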

DLRover applies the idea of "ML for System" to make distributed training more intelligent, aiming to free developers entirely from the constraints of resource allocation so they can focus on model training itself. Even without any resource configuration as input, DLRover can still provide an optimal resource configuration for each training job.

Ant Group recently established its AI innovation R&D department, NextEvo, which is responsible for all of Ant's core AI technology R&D, including all R&D work on the Bailing large model. This covers core technologies such as AI algorithms, AI engineering, NLP, and AIGC, as well as technology R&D and product innovation for multimodal large models and digital humans.

At the same time, Ant Group has accelerated its pace of open-sourcing, filling gaps in related domestic technologies and promoting the rapid development of the artificial intelligence industry.
