Since the beginning of this year, the emergence of AI models represented by ChatGPT has marked the start of a new era. The rapid iteration of large-model technology has given rise to AIGC (Artificial Intelligence Generated Content) applications such as Midjourney and CharacterAI, bringing revolutionary changes to office work, e-commerce, education, healthcare, and law.
Complex AI algorithms require massive computing resources, and computing power is the foundation that supports their operation. AI chips are coprocessors dedicated to AI computing tasks; they provide efficient computing power for AI algorithms and significantly improve the training and inference efficiency of algorithm models such as deep learning networks.
With the development of the AIGC industry, the demand for computing power continues to increase, but China faces many challenges in the field of computing power.
On October 17, 2023, the U.S. Department of Commerce's Bureau of Industry and Security (BIS) announced its latest semiconductor control rules (the "10/17 new rules"), upgrading the "New Export Controls on Advanced Computing and Semiconductor Manufacturing Items Exported to China" (the "10/7 rules") issued by BIS on October 7, 2022. The 10/17 new rules have three parts: first, adjusted export control rules for advanced computing chips; second, adjusted export control rules for semiconductor manufacturing equipment; and third, a list of companies newly added to the Entity List. Exports to China of GPU chips including the A100, H100, A800, and H800 will be affected. The A100 and H100 are NVIDIA's high-performance GPUs, widely used in AI, data analytics, and other workloads. The A800 and H800 are substitutes for the A100 and H100: under last year's 10/7 rules, NVIDIA supplied the Chinese mainland market with versions whose interconnect transfer rates were reduced to comply with the regulations, but these too were banned under this year's 10/17 new rules. These products are currently the high-computing-power chips best suited to developing and deploying AI algorithms.
China has strong demand for computing power in AI and relies on high-performance AI chips to support its applications and research. The 10/17 new rules remove the "interconnect bandwidth" parameter threshold and add a "performance density" threshold, aiming to further narrow the scope of high-end computing chips that may be exported. In the era of large AI models, restricting China's computing power will constrain the development and innovation of AIGC in China.
This article examines the main challenges facing China's computing power one by one, including the bottleneck in chip-architecture performance gains, underutilization of existing chips' computing power, and supply-chain risks caused by U.S. export controls. It then analyzes strategies to break the deadlock: in software, optimizing models and algorithms to reduce computing power demand; in hardware, developing new architectures to improve the energy efficiency of AI chips; at the system level, integrating software and hardware to raise efficiency and cut energy consumption; and at the industry level, strengthening the ecosystem and multi-party cooperation to promote joint investment.
AIGC iteration accelerated
At present, China's large-scale model technology is still in the early stage of R&D and iteration, but the industrial potential is huge. Chinese universities, Internet technology companies, and start-up technology companies have all joined the wave of AI models, and more than 100 models of various types have been born.
According to iResearch's forecast, China's AIGC industry will reach 14.3 billion yuan in 2023 and grow rapidly over the next few years. By 2028, China's AIGC industry is expected to reach 720.2 billion yuan, with the technology deployed in key fields and key scenarios.
AIGC technology has undergone significant evolution in the fields of NLP (Natural Language Processing) and CV (Computer Vision). The improvement of AIGC technology and capabilities will bring more innovation and application opportunities to various industries, mainly in the following aspects:
From single to multitask. Initial AIGC techniques focused on a single task, such as natural language generation, image generation, and translation. However, the future trend is to train models to handle multiple tasks at the same time and improve the generalization ability of models.
From unimodal to multimodal. Unimodal generative models typically focus on one type of data, such as text or images. Multimodal generative models can process multiple data types simultaneously, such as the joint generation of text and images, bringing new opportunities for applications in multiple fields such as augmented reality, intelligent dialogue systems, and automatic document generation.
From a generic model to a vertical model. Generic generative models excel in a variety of fields, but the future trend is toward greater specialization and verticalization.
Insufficient computing power supply
With the development of AIGC, the model is becoming more and more complex and the number of parameters is increasing, resulting in the growth rate of computing power demand far exceeding the performance growth rate of chips. In the early stage of AIGC algorithm model deployment, the computing power consumption is mainly focused on large model training, but with the growth of large model users, the inference cost will become the main computing power expense.
The specific requirements of AIGC for computing power are illustrated in three typical application scenarios:
If Google used a large model such as GPT for search recommendations: Google receives 3.5 billion search requests per day; at the GPT-4 API price of 0.14 yuan per request, Google would need to pay 178.8 billion yuan in API fees every year. With a self-built computing cluster, it would need a peak access capacity of about 100,000 requests per second; one round of GPT-4 dialogue involves more than 200 trillion floating-point operations, so at roughly 60% computing-resource utilization a cluster of about 100,000 A100s is required.
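A quick back-of-the-envelope check of this scenario, using only the figures quoted above plus one added assumption (the A100's peak FP16 throughput of roughly 312 TFLOPS), sketched in Python:

```python
# Sanity check of the Google-search scenario using the article's own figures.
requests_per_day = 3.5e9        # daily search requests (article figure)
price_per_request = 0.14        # yuan per GPT-4 API request (article figure)
print(f"API cost: {requests_per_day * price_per_request * 365 / 1e9:.1f}B yuan/yr")
# -> ~178.9B yuan/yr, matching the 178.8 billion quoted above

peak_req_per_s = 1e5            # peak access capacity (article figure)
flops_per_dialogue = 200e12     # >200 trillion FLOPs per round (article figure)
a100_peak_flops = 312e12        # assumed A100 FP16 peak, not an article figure
utilization = 0.6               # ~60% resource utilization (article figure)

n_a100 = peak_req_per_s * flops_per_dialogue / (a100_peak_flops * utilization)
print(f"A100s needed: {n_a100:,.0f}")   # -> ~107,000, i.e. about 100,000
```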
If every Microsoft Office user used a large model for office work: Microsoft uses Copilot, built on a large model, to empower office software, operating systems, and editing scenarios, with the potential to reconstruct future office work. In the future, software development, copywriting, and artistic creation will be completed through frequent interactive dialogue with AI. According to the report "China's Data Analytics and AI Technology Maturity Curve" by an information technology research company, China has 280 million students and white-collar workers; at 10 requests per person per day, that is roughly 2.8 billion daily requests, requiring about 80,000 A100s.
If everyone had a customized AI personal assistant (a large-model-native application): such an assistant could provide customized education, healthcare, government, and financial services to China's 1.2 billion Internet users. At 10 visits per person per day, about 340,000 A100s would be required.
According to a 2023 keynote by AMD CEO Lisa Su, from the perspective of a single computing center, supercomputers have developed rapidly over the past decade: innovations in chip architecture and advances in manufacturing processes have doubled computer performance every 1.2 years, while computer energy efficiency (the number of operations per unit of energy) has doubled only every 2.2 years. If this trend continues, by 2035 a high-performance supercomputer will draw 500 MW, about half the output of a nuclear power plant.
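To see how those two doubling rates compound, here is a minimal extrapolation sketch. The 2022 baseline (a roughly 21 MW exascale machine, on the order of Frontier) is our added assumption, not a figure from the speech:

```python
# Power grows as performance / efficiency when the two double at different rates.
base_year, base_power_mw = 2022, 21.0   # assumed ~exascale baseline (Frontier-class)
perf_doubling, eff_doubling = 1.2, 2.2  # doubling periods in years (article figures)

def projected_power_mw(year: int) -> float:
    t = year - base_year
    return base_power_mw * 2 ** (t / perf_doubling - t / eff_doubling)

print(f"{projected_power_mw(2035):.0f} MW")  # ~640 MW, the same order as the 500 MW claim
```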
AIGC relies heavily on high computing power, but at present, China is facing great challenges in computing power.
1) Moore's Law, which drives chip performance, is difficult to maintain.
The size of semiconductor devices is approaching physical limits, and the performance gains from process advances are shrinking. Growth in chip energy efficiency has slowed markedly, while higher transistor density brings greater heat-dissipation challenges and lower production yields. AIGC's demand for computing power now far outpaces the development of AI chips; the rate of improvement of existing chip hardware cannot keep up with the rapidly growing demands of algorithm models, so breakthroughs in hardware architecture are needed.
2) Low GPU utilization.
When large models process large amounts of data, due to many problems such as computing power scheduling, system architecture, and algorithm optimization, the utilization rate of GPU computing power of many large model enterprises is less than 50%, resulting in huge waste.
3) The software ecosystem is immature.
Since its inception in 2006, NVIDIA's CUDA software has formed a mature ecosystem of drivers, compilers, frameworks, libraries, and programming models. Most mainstream AIGC algorithm training is currently based on the CUDA ecosystem, and the barriers are extremely strong: moving AIGC workloads off NVIDIA GPUs entails extremely high migration costs and stability risks. For domestic GPGPU products to be deployed at scale, the software ecosystem is therefore a major challenge.
4) The supply of high-performance AI chips is insufficient.
High-computing-power chips are the infrastructure for large-model R&D, and NVIDIA's high-performance GPUs have two core advantages. The first is larger memory configurations and communication bandwidth: high-bandwidth interconnects between chips are crucial to large-model training efficiency. The second is greater reliability for long training runs: consumer graphics cards are aimed at personal use, and their failure rates and stability are far worse than server versions. Training a model with hundreds of billions of parameters requires thousands of GPUs running synchronously for long periods, and the failure of any single card forces training to pause for hardware maintenance. Compared with consumer-grade cards or other chips, high-performance GPUs can shorten the large-model training cycle by 60%-90%.
However, NVIDIA's GPU production capacity is insufficient, and the United States is progressively tightening its ban on selling high-performance chips to China. In October last year, the United States imposed bandwidth limits on AI chips exported to China, including the NVIDIA A100 and H100. NVIDIA then offered the A800 and H800 as alternative versions for Chinese companies. Under the new 10/17 rules, NVIDIA's exports of chips including the A800 and H800 to China will be affected, leaving a serious domestic shortage of high-performance AI chips.
At present, large-model training relies mainly on NVIDIA's high-performance GPUs, so the sales ban heavily affects the progress of domestic large-model R&D. For example, replacing the A100 with a V100 GPU that complies with the 10/17 rules would, through lower compute and bandwidth, increase large-model training time by 3 to 6 times, while the smaller video memory would cut the maximum trainable model size by about 2.5 times.
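The 3-6x and 2.5x estimates are consistent with the public datasheet specifications of the two cards, which we assume here for illustration (A100 80GB SXM vs. V100 32GB):

```python
# Rough A100-vs-V100 spec ratios behind the estimates above (assumed datasheet values).
specs = {
    "A100": {"fp16_tflops": 312, "mem_gb": 80, "mem_bw_gbs": 2039},
    "V100": {"fp16_tflops": 125, "mem_gb": 32, "mem_bw_gbs": 900},
}
for key in ("fp16_tflops", "mem_gb", "mem_bw_gbs"):
    print(f"{key}: A100/V100 = {specs['A100'][key] / specs['V100'][key]:.1f}x")
# FP16 compute ~2.5x, memory ~2.5x, bandwidth ~2.3x: lower compute and bandwidth
# stretch training time severalfold, and 2.5x less memory caps trainable model size.
```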
5) Self-developed AI chips are difficult to mass-produce. The U.S. has increased export license requirements for advanced chips to 22 countries. Following the previous restrictions on the export of EUV lithography machines to China, restrictions on lower-generation DUV lithography machines have also begun. In addition, the U.S. Department of Commerce has added China's leading local GPU chip companies to the entity list, which will make it difficult for domestic self-developed chips to use the latest process for tape-out mass production.
6) Stress on the power system due to high energy consumption.
The computing, cooling, and communication facilities of a computing center are all energy-intensive hardware. According to the China Electronic Energy Conservation Technology Association, power consumption by China's data centers is currently growing at an average rate of more than 12%; in 2022 it reached 270 billion kWh, about 3% of total national electricity consumption. In the era of large models it will rise further and is expected to reach 420 billion kWh by 2025, about 5% of society's total electricity consumption. Supplying power to data centers, together with dissipating the systems' heat, will put great pressure on the existing power system.
How does technology break the game?
Facing this unfavorable situation, China's computing power bottleneck must be addressed with systematic planning and step-by-step breakthroughs at two levels, technology and industry, following the dual approach of "opening new sources" (expanding computing power supply) and "reducing outflow" (cutting computing power demand).
On a technical level, our recommendations are as follows:
1) Develop efficient large models.
Reduce the demand for computing power by streamlining model parameters. Compression is intelligence: large models are, in essence, lossless compressors of data. On February 28 this year, Jack Rae, a core OpenAI researcher, said that the goal of artificial general intelligence (AGI) is to achieve maximal lossless compression of effective information. As large models develop, algorithm capability at a given parameter scale will keep improving even as AI grows more complex. In the future, large models with higher information-compression efficiency may appear, achieving capability comparable to GPT-4 with only tens of billions of parameters.
In addition, large models can be adapted to specific business scenarios and selected capabilities to reduce computing power expenditures. For example, in the government Q&A scenario, the model can refuse to answer non-business requests. Tasks that could only be solved by relying on a 100-billion-parameter general model are expected to be completed using a 10-billion-parameter model.
2) Software optimization based on existing models.
If AI development before and after GPT-3 is divided into the AI 1.0 and AI 2.0 eras, then the core task of software optimization in the AI 1.0 era was to enable deep learning models to run on low-power edge and device-side hardware, realizing automation and intelligence and deploying them widely in fields such as AIoT, intelligent security, and intelligent vehicles. In the AI 2.0 era, by contrast, model compression targets the overall optimization of large-scale, centralized computing power demands; application scenarios start from the "center" and then radiate to the edge and device side.
Model compression is the most direct way to reduce the computing power requirements of algorithms; the compression techniques of the AI 1.0 era will be inherited and further developed in the AI 2.0 era.
Pruning exploits the redundancy of deep learning model parameters: weights with little impact on accuracy are cut away, keeping the network backbone while reducing overall computational cost. In the AI 2.0 era, with long-sequence inputs, the latency bottleneck of Transformer models lies in the attention operator; pruning the activation values of the attention operator can already achieve a 2x end-to-end speedup, with further acceleration expected (see the sketch below).
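As a minimal illustration of pruning attention activations, the sketch below keeps only the top-k attention scores per query and discards the rest. It is a NumPy toy under our own assumptions; the 2x speedup cited above comes from optimized sparse kernels, not from this demonstration:

```python
import numpy as np

def topk_sparse_attention(q, k, v, keep=64):
    """Attention in which each query attends to only its `keep` highest-scoring keys."""
    scores = q @ k.T / np.sqrt(q.shape[-1])                 # (n_q, n_k) logits
    kth = np.partition(scores, -keep, axis=-1)[:, [-keep]]  # per-query k-th largest score
    scores = np.where(scores >= kth, scores, -np.inf)       # prune low-scoring activations
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

q, k, v = np.random.randn(128, 64), np.random.randn(1024, 64), np.random.randn(1024, 64)
out = topk_sparse_attention(q, k, v)   # only 64 of 1024 keys contribute per query
```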
Parameter quantization exploits the fact that the effective fixed-point throughput of GPUs is significantly higher than their floating-point throughput, replacing 32-bit floating-point numbers with 16-bit, 8-bit, or even 4-bit fixed-point representations, which is expected to reduce inference computing power and memory demands simultaneously.
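A minimal sketch of the idea, assuming simple per-tensor symmetric quantization to 8-bit integers:

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Map 32-bit float weights onto int8 with one shared scale factor."""
    scale = np.abs(w).max() / 127.0
    return np.clip(np.round(w / scale), -127, 127).astype(np.int8), scale

w = np.random.randn(512, 512).astype(np.float32)
w_q, scale = quantize_int8(w)
w_hat = w_q.astype(np.float32) * scale        # dequantize to check the error
print(f"4 bytes -> 1 byte per weight, max abs error {np.abs(w - w_hat).max():.4f}")
```

Production systems use per-channel scales and calibration data to keep accuracy loss negligible; this sketch only shows the storage and arithmetic principle.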
Operator fusion merges multiple operators into one, improving the access locality of intermediate tensor data, reducing memory traffic, and relieving the memory-access bottleneck. Design and optimization of the operator loop space improves overall computational parallelism by scheduling the operator nodes of the computation graph in parallel. A toy illustration of fusion follows.
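The example below contrasts an unfused three-operator chain with a single fused expression. In NumPy both still materialize intermediates, so this only illustrates the principle; real fusion happens in compilers such as XLA or torch.compile, which emit one kernel with no intermediate tensors:

```python
import numpy as np

x = np.random.randn(1_000_000).astype(np.float32)
a, b = 2.0, 1.0

# Unfused: three operators, two intermediate tensors, three passes over memory.
t1 = x * a
t2 = t1 + b
y_unfused = np.maximum(t2, 0.0)

# Fused (conceptually): one operator relu(x*a + b), one read of x, one write of y.
y_fused = np.maximum(x * a + b, 0.0)

assert np.allclose(y_unfused, y_fused)   # same result, less memory traffic when fused
```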
In short, by compressing and quantizing existing large models, the number of parameters and the computational complexity can be significantly reduced, storage space saved, and computational efficiency improved by 2-3x. Besides reducing the latency of large models responding to users, model optimization allows large models to be deployed efficiently on edge and end devices such as automobiles, personal computers, mobile phones, and AIoT hardware, supporting local large-model applications with strong real-time performance, privacy protection, and security.
3) New architecture chips with high energy efficiency and high computing power density.
The energy efficiency of traditional computing chips has reached a bottleneck, and it is necessary to improve the chip architecture, interconnection, and packaging to achieve higher energy efficiency. At present, the main methods are data flow architecture, storage and computing integration, chiplet technology, etc.
Data flow architecture: the order of computation is driven by the flow of data itself, eliminating the time overhead of instruction handling. A data flow architecture enables efficient pipelining while performing data access and computation in parallel, further reducing the idle time of compute units and making full use of the chip's computing resources. Unlike an instruction-set architecture, it uses dedicated data channels to connect highly optimized compute modules of different types; with distributed local storage, data reads and writes proceed concurrently with computation, saving both transfer and compute time.
Storage-computing integration: the core of a compute-in-memory chip is the full integration of storage and computing. Using emerging memory devices and memory-array circuit designs, storage and compute functions are integrated on the same chip, eliminating the movement of matrix data between storage and compute units. This efficiently supports the matrix computations in intelligent algorithms and greatly improves the "performance density" of computing chips at the same process node.
Chiplet technology: conventional integrated circuits integrate a large number of transistors in a two-dimensional plane on a silicon substrate to form a chip. An integrated chip instead fabricates transistors and other components into chiplets with specific functions, then assembles the chiplets into a chip with semiconductor technology according to application requirements. Chiplet technology can achieve a larger total chip area and thus higher total computing power; improve chip design efficiency through the reuse and combination of chiplet IP; split large dies into several small chiplets to improve yield and reduce cost (as the illustrative yield calculation below suggests); and fabricate different dies in different processes, achieving higher performance through heterogeneous integration.
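The yield argument can be made concrete with the classic Poisson defect model, in which die yield falls exponentially with area. The defect density and die sizes below are assumed round numbers for illustration, not industry data:

```python
import math

defect_density = 0.002            # defects per mm^2 (assumed)
monolithic_mm2 = 800              # one large die
chiplet_mm2 = 200                 # same logic split into four chiplets

yield_monolithic = math.exp(-defect_density * monolithic_mm2)   # ~20%
yield_chiplet = math.exp(-defect_density * chiplet_mm2)         # ~67% per chiplet
print(f"monolithic: {yield_monolithic:.0%}, per chiplet: {yield_chiplet:.0%}")
# Known-good chiplets can be tested and combined, so one defect no longer
# scraps an entire 800 mm^2 die.
```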
The new computing architecture can break the storage wall and interconnection wall of existing chips, connect more computing power units with high density, high efficiency, and low power consumption, greatly improve the transmission rate between heterogeneous cores, reduce data access power consumption and cost, and provide high computing power guarantee for large models.
4) Collaborative optimization of software and hardware to improve the utilization rate of computing systems.
In large-model systems, hardware-software collaboration is essential for high performance and energy efficiency. Through efficient architecture design combining sparsity, mixed precision, and diverse operators, together with algorithm optimization, system resource management, coordination between software frameworks and hardware platforms, and system monitoring and tuning, the strengths of the entire computing system can be better exploited. A minimal mixed-precision sketch follows.
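As one concrete co-design lever named above, here is a minimal mixed-precision training sketch using PyTorch's torch.cuda.amp (assuming a CUDA device; the model and data are placeholders):

```python
import torch

model = torch.nn.Linear(1024, 1024).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()       # scales the loss to avoid FP16 underflow

x = torch.randn(64, 1024, device="cuda")
target = torch.randn(64, 1024, device="cuda")

with torch.cuda.amp.autocast():            # matmuls run in reduced precision on tensor cores
    loss = torch.nn.functional.mse_loss(model(x), target)
scaler.scale(loss).backward()              # backward pass on the scaled loss
scaler.step(optimizer)                     # unscales gradients, then updates FP32 weights
scaler.update()
```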
For large-model training, the enormous compute and memory costs make multi-card interconnected high-performance clusters the inevitable choice. The supply of NVIDIA's high-performance GPUs to China is restricted, and the single-card performance of domestic chips is limited by the available processes. Besides enlarging the computing system, research into hardware-software co-designed efficient fine-tuning schemes is needed to reduce the hardware resource overhead of large-model training and fine-tuning.
In large model systems, effective system resource management is essential to ensure high performance and efficiency. This includes allocating compute resources (e.g., CPU, GPU, etc.), optimizing memory management and data transfer strategies to reduce latency and increase throughput.
To achieve hardware-software synergy, deep learning software frameworks need to work closely with hardware platforms. This includes optimizing for specific hardware platforms to take full advantage of their compute and storage resources, as well as providing easy-to-use APIs and tools to simplify model training and deployment.
5) Build a heterogeneous computing platform.
Due to the sharp increase in the parameter counts and computational complexity of AI algorithm models, large-model training requires large cross-node, multi-card clusters, and the hardware challenges span computing, storage, and communication. The cost of building a thousand-GPU large-model data center runs to hundreds of millions of yuan, which many startups cannot afford. To solve these problems and reduce construction costs, it is urgent to build centralized computing power centers that integrate heterogeneous chips of different architectures into a large platform serving diverse application scenarios. A unified middle layer for large models can adapt upward to large models in different vertical fields and remain compatible downward with different domestic AI chips, improving the efficiency of heterogeneous computing platforms and reducing users' migration costs between models and chips. This is one of the key directions for solving the computing power challenge in the era of large models (a schematic sketch follows).
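Schematically, such a middle layer exposes one interface upward and dispatches to chip-specific kernels downward. The sketch below is purely illustrative: the backend names are hypothetical, and real systems work at the compiler/runtime level with graph IRs and vendor kernel libraries, not a Python dispatch table:

```python
from typing import Callable, Dict
import numpy as np

class UnifiedRuntime:
    """One interface up to the models, many backends down to heterogeneous chips."""
    def __init__(self):
        self._matmul_backends: Dict[str, Callable] = {}

    def register(self, chip: str, kernel: Callable):
        self._matmul_backends[chip] = kernel      # vendors plug in their own kernels

    def matmul(self, chip: str, a, b):
        return self._matmul_backends[chip](a, b)  # model code stays unchanged

runtime = UnifiedRuntime()
runtime.register("vendor_a_npu", lambda a, b: a @ b)   # placeholder kernels
runtime.register("vendor_b_gpu", lambda a, b: a @ b)
y = runtime.matmul("vendor_a_npu", np.eye(2), np.ones((2, 2)))
```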
6) Plan ahead for advanced technologies.
Achieving the core indicator of "performance density" requires synergy across manufacturing process, chip design, and advanced packaging. Given China's currently limited access to advanced processes such as 3nm and 5nm, it is necessary to keep attacking the key equipment and materials of advanced manufacturing, such as DUV/EUV lithography machines and photoresists.
7) Optimal use of energy.
In the context of carbon neutrality, to cope with the extremely high energy consumption demand of computing power centers, "data center + clean power + energy storage" will be the necessary development path. The data center will become a complex with variable and adjustable loads to respond to power generation and grid-side demand, and to achieve intelligent "peak shaving and valley filling" arbitrage by participating in power trading to reduce operating costs.
According to the "Top Ten Trends in Data Center Energy", high-power computing centers cannot achieve effective heat dissipation with air cooling alone, so liquid cooling will become standard, and water efficiency has also become critical for computing centers. Traditional data centers consume large amounts of water for heat dissipation, affecting the ecology of water-scarce regions. Water-use efficiency (WUE) has become an important internationally watched reference indicator, and cooling technologies that use little or no water are the future trend.
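For reference, WUE as commonly defined relates annual site water consumption to IT equipment energy; a minimal sketch with invented sample numbers:

```python
def wue(liters_water_per_year: float, it_energy_kwh_per_year: float) -> float:
    """Water-use efficiency: liters of water consumed per kWh of IT energy."""
    return liters_water_per_year / it_energy_kwh_per_year

print(f"WUE = {wue(120e6, 80e6):.2f} L/kWh")   # 1.50 L/kWh for these toy inputs
```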
How does the industry respond?
At the industry level, we have the following recommendations:
1) Strengthen top-level design and plan the strategic deployment of the computing power industry. Recently, six departments including the Ministry of Industry and Information Technology jointly issued the "Action Plan for the High-Quality Development of Computing Infrastructure" to strengthen the top-level design of the computing industry, but overall planning still needs further strengthening. We suggest establishing a computing power development committee (or joint meeting) within the existing relevant leading groups, intervening in a timely and appropriate manner, strengthening top-level design of computing power development, improving information-exchange mechanisms, and forming a unified, coordinated decision-making mechanism.
2) Optimize the spatial layout and coordinate the construction of computing infrastructure. On the basis of implementing the relevant "14th Five-Year Plan" programs, strengthen the construction of national hub nodes of the integrated computing network, advance computing infrastructure construction in an orderly, demand-driven manner in key nodes such as Beijing-Tianjin-Hebei, the Yangtze River Delta, and the Guangdong-Hong Kong-Macao Greater Bay Area, and strive to raise the utilization rate of existing and new computing facilities.
3) Deploy leading projects and enhance the industry's reserve of common key technologies. To give full play to the exemplary and leading role of national science and technology programs, projects could be set up under the National Natural Science Foundation to carry out basic research on computing architectures, computing methods, and algorithm innovation; at the same time, projects under national key R&D programs could carry out demonstration research on the application of key computing power technologies and strengthen the integration of computing power with related industries.
4) Explore diversified investment to promote the high-quality development of the computing power industry. Give full play to the leveraging role of industry guidance funds, encourage local governments to increase investment in the computing power industry, and cultivate more strong enterprises and projects. Explore new models of science and technology finance and increase financial support for key computing projects. Innovate social-financing models for computing infrastructure projects and channel social capital into the computing industry.
5) Create an open ecosystem and jointly build new formats and models. The high investment, high risk, and high degree of monopoly in computing power mean that computing power competition is a game only a few enterprises in a few large countries can play. It is necessary to vigorously promote deep industry-university-research integration, guide leading enterprises to tackle key computing power technologies and improve R&D capability, build open platforms that attract upstream and downstream enterprises to connect effectively, and share the fruits of computing power innovation. Domestic enterprises, universities, and other organizations should be encouraged to expand cooperation with relevant overseas organizations.
In summary, breaking the bottleneck of computing power requires the coupling of hardware, software, and systems, as well as the collaboration of ecology and industry, with the characteristics of a multi-level and multidisciplinary system. It is necessary to combine industrial application, scientific research, talent training, and basic platforms to promote corresponding research and final commercialization.
The author Wang Yu is a tenured professor and head of the Department of Electronic Engineering at Tsinghua University and vice dean of the School of Information Science and Technology, Tsinghua University; Ru Peng is deputy director of the Think Tank Center of Tsinghua University and an associate professor at the School of Public Policy and Management, Tsinghua University; Xie Qijun is assistant director of the Science and Education Policy Research Center of Tsinghua University and an assistant professor at the School of Public Policy and Management, Tsinghua University.