Heart of the Machine Original
Author: Jiang Jingling
Insufficient computing power remains a major factor restricting the development of general artificial intelligence. According to data from GPU Utils in August this year, the global supply gap has reached the equivalent of 430,000 H100 GPUs. To solve the shortage, solutions beyond snapping up and hoarding Nvidia GPUs keep emerging.
Tsinghua-born startup Wuwen Xinqiong (Infinigence AI) is one of the problem-solvers on this track.
Not long ago, Heart of the Machine introduced FlashDecoding++, a new approach proposed by a joint team from Infinigence AI, Tsinghua University, and Shanghai Jiao Tong University. The work not only speeds up GPU inference by 2-4x but also supports both NVIDIA and AMD GPUs. Compared with FlashDecoding, it achieved an average inference speedup of 37% on the NVIDIA A100 and a performance improvement of more than 300% on the AMD MI210.
Building on this work, the infini-acc large model computation optimization engine developed by Wuwen Xinqiong can increase large model inference speed by 10x, reduce model storage by 10x, and cut deployment time to a matter of hours, through system-level optimization across the model, system, and hardware layers.
Relying on this core strength in computation acceleration, Wuwen Xinqiong helps existing computing power providers improve performance and cost-effectiveness. On top of it, the company has launched an intelligent computing cloud and an integrated intelligent computing platform that support heterogeneous computing power scheduling and provide an end-to-end, one-stop solution for deploying large models.
By improving the efficiency of existing computing power and activating idle computing power, Wuwen Xinqiong hopes to bring new computing power increments to the large model market. According to Xia Lixue, CEO of Wuwen Xinqiong, compared with OpenAI's pricing the company has already compressed computing costs by 2 to 3 orders of magnitude, and the eventual target is 4 orders of magnitude. This means that if an application developer originally paid OpenAI 100 yuan in token fees, after optimization that price would ultimately be compressed to about 1 cent.
Even more noteworthy, Xia Lixue revealed in an exclusive interview with Heart of the Machine that offering the system externally as middleware is only the first step of Wuwen Xinqiong's commercialization strategy. The long-term plan is to optimize computing costs in cooperation with computing power centers and to provide low-cost, directly schedulable computing power to B-side and C-side developers.
“Our ultimate goal is not just to provide the ecosystem as a middle layer, but to provide computing power directly to the market. In the future, all services and applications involving large models will be our potential customers.”
According to Wuwen Xinqiong, within half a year of its founding the company has completed hundreds of millions of yuan in financing. Investors include strategic partners such as Tencent and Zhipu AI, as well as investment institutions such as Xuhui Capital, Sequoia China, Monolith, Qiming Venture Capital, Northern Light Venture Capital, Jingwei Venture Capital, ZhenFund, and Oasis Capital.
Wuwen Xinqiong was established by Wang Yu, director of the Department of Electronics of Tsinghua University, and has three co-founders:
Co-founder and CEO Xia Lixue: A Tsinghua University graduate, he was the first doctoral student of Wang Yu, director of the Department of Electronics. Xia Lixue has long worked on design methodology for deep learning systems and has been selected for the AI 2000 list of the world's most influential AI scholars and for Stanford's list of the top 2% of scientists by discipline. After graduation, he was responsible for core strategic projects at Alibaba Cloud, including compression and acceleration of large language models and generative AI model chips. He also served as technical lead for user-growth products, helping Alibaba Cloud incubate them from 0 to 1 and steadily reach hundreds of millions of yuan in annual revenue.
Co-founder and CTO Yan Shengen: A graduate of the Institute of Software, Chinese Academy of Sciences, he is one of the earliest researchers in AI high-performance computing in China. As executive research director of SenseTime's data and computing platform department, he helped SenseTime build a large-scale, high-performance AI computing platform with 20,000 GPUs, presided over the development of several pieces of deep learning system software, and for 3 years led a 200-person team building an AI supercomputing prototype project in Shanghai with a total investment of 670 million yuan.
Co-founder and Chief Scientist Dai Guohao: He is a tenured associate professor at Shanghai Jiao Tong University and head of the Artificial Intelligence Design Automation Innovation Laboratory at the Qingyuan Research Institute. Dai Guohao has published more than 50 high-level papers in circuit design automation, heterogeneous computing, and computer architecture, cited more than 1,000 times on Google Scholar. He has undertaken a number of vertical and horizontal research projects, including a National Natural Science Foundation Youth Project, and is personally responsible for more than 10 million yuan in funding.
At present, the Wuwen Xinqiong team numbers more than 100 people, over 35% of the R&D team comes from Tsinghua University, and the team is still expanding rapidly. Xia Lixue said the company's current focus is commercialization, to ensure that Wuwen Xinqiong walks a sound business path.
Scarce, expensive computing power restricts the development of large models
Heart of the Machine: Can you briefly explain the reason for the establishment of the company and its goals?
Xia Lixue: The company was registered in May this year, and the core team was formed in March.
Our founding is closely tied to the development of large models across the industry. Since the end of last year they have attracted enormous attention, triggering broad imagination about their application in different industries.
But at the same time, we saw that, commercially speaking, the cost problem must be solved before large models can be deployed at scale. Many scenarios need to move from "losing money to gain attention" to at least "making the numbers work".
I was Prof. Wang Yu's first PhD student, and I joined Alibaba Cloud after graduation, staying in close communication with Tsinghua's Department of Electronics throughout. At the end of last year, after large models took off, Prof. Wang began discussing with me frequently what the Department of Electronics could do for the industry, and whether we could provide industrial value rather than academic value alone.
What we ultimately identified is the core problem: China's overall computing power is far from sufficient, and the problem cannot be solved by relying only on process improvements at the chip level or by waiting for diverse chips to mature.
Our goal is to make good use of the computing power that is usable today and to activate the computing power that is not, helping the large model industry obtain more available, cheaper computing power.
Therefore, our two core technical directions are: first, extreme performance optimization of large models on chips; second, putting diverse, heterogeneous computing power to use. Our goal is to build an ecosystem in which different models can be deployed automatically on different hardware, so that this dormant computing power can be used efficiently.
Heart of the Machine: What is the team composition?
Xia Lixue: Mr. Wang Yu is the initiator of Wuwen Xinqiong, and the core members are myself, Yan Shengen, and Dai Guohao. Between us, we have been responsible for projects including Alibaba Cloud's large model compression and acceleration, generative AI model chips, the Shanghai AI supercomputing prototype, and National Natural Science Foundation projects. Members of our R&D team have participated in, and are important contributors to, AI-related open source projects such as Apache, ONNX, TensorFlow, PyTorch, and PyG. More than 35% of the R&D team comes from Tsinghua University, and the team is still expanding rapidly.
Heart of the Machine: You define yourselves as "pursuing the ultimate energy efficiency for deploying large models". Why did you choose to solve this problem, and what exactly does energy efficiency mean?
Xia Lixue: We saw that the energy efficiency problem of deploying large models has been hanging over everyone's heads.
The first issue is that GPUs are in short supply worldwide, that is, "not enough": the current global chip gap is as high as 430,000 H100-equivalent GPUs.
The second is "difficult to use", the large model training delay is sensitive, the fault tolerance rate is low, and some hardware performance itself is not as good as NVIDIA, so even if the multivariate heterogeneous GPU cluster is built, it is difficult to really use all the computing power in practice.
The third concerns the edge: as an interface for human-computer interaction, large models have great potential in edge applications, but edge devices are sensitive to energy consumption and short on compute, storage, and bandwidth, which makes such applications hard to popularize.
Wuwen Xinqiong defines itself by the pursuit of the ultimate energy efficiency of large models. Energy efficiency here means the ratio between what the technology actually accomplishes and the energy it consumes.
We believe energy efficiency is a measure of productivity and competitiveness. In competition between species, for example, the number of neurons in the cerebral cortex determines the level of intelligence, and a key reason humans surpassed other species so quickly is that they mastered cooking, that is, a way to obtain large amounts of energy quickly and cheaply to power a large number of neurons in the brain. The large model industry now urgently needs such a holistic, energy-efficient "cooking solution".
The same holds in competition between economies and business organizations: whoever achieves better development results and product quality with greater speed and lower energy consumption or cost is more likely to win.
Heart of the Machine: You mentioned that the world faces a large chip gap, and that even when a multi-vendor heterogeneous GPU cluster is built, it is hard to truly use all of its computing power. Why can't this computing power be fully utilized, and why is its energy efficiency low?
Xia Lixue: In the AI chip market, the world is not even facing an "80/20" pattern; it is closer to "90/10". Nvidia holds an absolutely dominant market share, not only because of its stronger hardware performance but also because of its software ecosystem advantage.
The software ecosystem, in turn, has helped NVIDIA accumulate a large amount of information about application models, allowing it to iterate the design of its next chip in good time. This forms a powerful ecological flywheel, and once NVIDIA's production capacity cannot keep up with demand, a global shortage of computing power follows.
Although other hardware vendors are playing catch-up with Nvidia, they still lag in building software ecosystems, so their hardware is not widely adopted even when it is comparable to Nvidia's A100. Building a robust software ecosystem is therefore the key task of the moment, and that is what we are doing.
Heart of the Machine: Why is it hard to build a software ecosystem?
Xia Lixue: Building a software ecosystem takes time, patience, and opportunity. Nvidia, for example, invested heavily in its software ecosystem very early on; through long user cultivation and accurate insight into the needs of graphics and high-performance computing, that barrier was gradually built and keeps getting thicker. A hardware vendor that misses this first-mover and market opportunity will find it hard to raise enough money to invest in high-quality chip R&D and in promoting adoption.
Heart of the Machine: If domestic large model companies and chip companies cooperate directly to build intelligent computing centers and increase the computing power available to them, what problems might they face?
Xia Lixue: Today, many large model companies are indeed cooperating directly, one to one, with chip companies in order to increase the computing power available to them.
In this type of collaboration, both parties have to divert considerable manpower and resources from their main business to do adaptation work, and nobody wants to "put all their eggs in one basket". So each company ends up investing resources across multiple potential partners, for example one model company working with several chip companies. Moreover, when such cooperation rests on heavy physical investment, many parties have to share the costs, which creates a complex, multi-dimensional cooperation space.
Our goal is to simplify this adaptation and optimization process, so that customers do not bear the risk of joint R&D, while delivering better optimization results. This essentially creates a middle-layer ecosystem: on one hand it gives computing power users more supply options, and on the other it helps hardware ecosystem partners obtain real business feedback for their next iteration.
Our clients are not limited to large model companies with strong technical capabilities; they also include companies that use models. Energy efficiency matters to these companies, and because their AI algorithms are tightly bound to their use cases, they may only be able to dedicate a team of 3 to 10 people to the model itself. With our involvement, they do not need to commit another 30-person engineering team.
A window of opportunity opens for the middle-layer ecosystem
Heart of the Machine: Why do you think this can be done now? How has the situation changed?
Xia Lixue: Chipmakers usually do some of the software work themselves; they provide low-level basic instructions that let developers implement some functions directly. But for complex tasks, and especially now that general-purpose large models have appeared, someone has to translate the model's task requirements into combinations of hardware instructions. It is like the addition and subtraction keys on a calculator: combining these basic keys lets us solve much more complex problems.
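To make the calculator analogy concrete, here is a minimal sketch of our own (not Wuwen Xinqiong's code): a higher-level operation every large model needs, numerically stable softmax, composed from the kind of primitive operations a hardware vendor typically exposes.

```python
# Hedged illustration of the "calculator keys" analogy above: composing a
# higher-level operation (numerically stable softmax) from primitive ops
# (max, subtract, exp, sum, divide). A middle layer does this translation
# so application developers do not have to.
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    shifted = x - x.max()     # primitives: max, subtract (for numerical stability)
    exps = np.exp(shifted)    # primitive: elementwise exp
    return exps / exps.sum()  # primitives: sum, divide

print(softmax(np.array([1.0, 2.0, 3.0])))  # [0.09003057 0.24472847 0.66524096]
```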
What we see is that in the era of general-purpose large models, middle-layer energy-efficiency optimization can go much deeper. In the past, each industry task required a custom model: chat, translation, search, and so on each needed a different model. Task and algorithm were bound together, co-design could only be done case by case, and at the system level the middle layer had to do a great deal of distinct work.
Prof. Wang's earlier venture did work technically somewhat similar to ours today, but because image models, speech models, and natural language models differed so greatly, it could only serve a single type of model if it wanted to avoid losing money.
Now we can use a single general model to solve multiple tasks: through downstream fine-tuning, the same large language model can handle different tasks.
Because large models share a highly unified structure, a good window of opportunity has opened for the ecosystem: we can focus on this much narrower field and co-optimize across applications, algorithms, and systems, so the cost of completing the work finally pays off, where before it never would have been worth it.
While training data may differ from company to company, model structures are similar, which allows us, at this particular point in time, to build a good middle-layer tool that maps different models onto different companies' hardware.
Heart of the Machine: Specifically, how does the difficulty of building a software ecosystem now compare with before?
Xia Lixue: The change in difficulty can be estimated from the number of operators.
For example, in the past every domain and every model structure had its own many operators; the PyTorch operator library, say, contains about 2,000 of them. But in GPT and the other large models now built around the Transformer family, the number of operators that matter may ultimately shrink to no more than 100.
This means that although the overall development surface still spans more than 2,000 operators, measured by usage more than 99% of the computation is concentrated in those roughly 100 operators. So we can focus our optimization on these 100 operators; the rest are no longer the bottleneck.
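As a rough, self-contained illustration of this concentration (our own sketch, not Wuwen Xinqiong's measurement), one can run a small Transformer under PyTorch's profiler and count the distinct operators it actually touches; the list is strikingly short compared with the roughly 2,000 operators in the full library.

```python
# Sketch: count the distinct operators a small Transformer actually uses.
# Exact numbers depend on the PyTorch version; the point is only that a
# Transformer exercises a few dozen op types, not thousands.
import collections
import torch
import torch.nn as nn

model = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=256, nhead=8, batch_first=True),
    num_layers=4,
).eval()
x = torch.randn(1, 128, 256)

with torch.profiler.profile(
    activities=[torch.profiler.ProfilerActivity.CPU]
) as prof:
    with torch.no_grad():
        model(x)

counts = collections.Counter(
    {evt.key: evt.count for evt in prof.key_averages()}
)
print(f"distinct operators recorded: {len(counts)}")
for name, n in counts.most_common(10):
    print(f"{name:<40s} {n}")
```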
Heart of the Machine: What are your strengths in this matter?
Xia Lixue: I think our team is naturally suited to this work. The Tsinghua team has long been committed to combining meaningful algorithms with real-world scenarios to create solutions with commercial value.
We focus on integrated optimization across models, software, and hardware to reduce the cost of model inference, and on turning laboratory results into sustainable commercial products.
Our tool has two defining features: it is fast and it is efficient. A model user does not need to understand the underlying details to use it efficiently, while optimal performance is still guaranteed.
Heart of the Machine: What exactly is the so-called "M x N" middle layer?
Xia Lixue: As I mentioned earlier, each company invests resources across multiple potential partners, which creates a complex, multi-dimensional cooperation space. Our solution is to build a flexible, compatible middle layer between the diverse model layer and the heterogeneous chip layer, achieving efficient, unified deployment across "M x N", that is, "M models" on "N chips".
We break this work down into three stages (a minimal sketch of the M x N idea follows the list):
From algorithm to chip: to address the computing power shortage, a large model computation optimization engine adapts algorithms to chips and improves chip usability.
From chip cluster to model: an intelligent computing system layer is built around the heterogeneous characteristics of the computing power pool, shielding developers from the effects of heterogeneous hardware.
From model to application: we provide end-to-end deployment services covering the model itself, efficient fine-tuning, and computation optimization, reducing the magnitude of inference computation, latency, and cost.
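To show what "M x N" buys structurally, here is a deliberately simplified sketch (all names hypothetical, not Wuwen Xinqiong's actual architecture): instead of writing one adapter per (model, chip) pair, M model front ends lower to a shared intermediate representation and N chip back ends compile from it, so M + N components replace M x N adapters.

```python
# Hypothetical sketch of the "M x N" middle layer: M model front ends
# lower to one shared IR; N hardware back ends compile from that IR.
# All names here are invented for illustration.
from dataclasses import dataclass

@dataclass
class IRGraph:
    """Stand-in for a real compiler intermediate representation."""
    ops: list

# M model front ends: model -> IR
def lower_model_a(weights: str) -> IRGraph:
    return IRGraph(ops=["matmul", "softmax", "rmsnorm"])

def lower_model_b(weights: str) -> IRGraph:
    return IRGraph(ops=["matmul", "softmax", "layernorm"])

# N hardware back ends: IR -> deployable artifact
BACKENDS = {
    "nvidia_a100": lambda g: f"CUDA kernels for {g.ops}",
    "amd_mi210": lambda g: f"ROCm kernels for {g.ops}",
}

def deploy(lower, weights: str, chip: str) -> str:
    """Any front end composes with any back end: M + N, not M x N."""
    return BACKENDS[chip](lower(weights))

print(deploy(lower_model_a, "model_a.bin", "amd_mi210"))
```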
Bringing increments to the computing power market
Heart of the Machine: According to this idea, how do you bring increments to the computing power market?
Xia Lixue: At present, we have completed verification of the overall solution.
First, we validated our optimization tools on NVIDIA graphics cards. Even in an environment where teams across the industry compete to optimize for NVIDIA, we achieved the world's leading optimization performance, about 30% better than the previous SOTA.
In addition, we have verified that these optimization capabilities carry over to different hardware: on AMD hardware our results likewise lead the world, with tested improvements of more than 300%.
This shows that our toolchain delivers direct performance gains and can scale across different hardware. We are already working with more than 10 hardware vendors.
Heart of the Machine: What is your overall business model at the moment?
Xia Lixue: Domestic computing power is in short supply, so companies are not competing for customers but for limited resources. The core of our commercialization is to provide optimized, more cost-effective computing power services that expand supply and meet customer demand.
There are two main aspects. One is providing hardware manufacturers with "middle-layer packaging" to improve the usability of their hardware, helping them open up the large model market and sell their products to more customers.
The other is working with computing power clusters: using our middle-layer capabilities to optimize and expand their computing power supply and improve the cost-effectiveness of using it. Here we have signed cooperation agreements with several computing power clusters, and in the future we will connect directly with large model customers and provide them with computing power.
Heart of the Machine: Is the second business model about earning a margin on computing power?
Xia Lixue: Normally, earning the price difference means acquiring computing power cheaply and reselling it directly, like a middleman. But our goal is to "make the pie bigger": using our optimization and adaptation technology to make underutilized computing power more valuable. That "price difference" is really incremental computing power that we create through technology.
What we do includes expanding the computing power pool, so that cards which could not be used become usable, and raising per-card efficiency, so that one card produces as much as two or more. Computing power that could once support only dozens of businesses can then support hundreds. That is the incremental market.
Beyond that, our ultimate goal is not only to provide an ecosystem as a middle layer. In the future, every service and application involving large models, whether B-side or C-side, will be a potential customer of ours, because they need large model computing power and we can provide cost-effective, easy-to-develop-on computing services, possibly including certain development tools.
Heart of the Machine: What is the cost of using your products? How much can customer costs be reduced?
Xia Lixue: Through collaborative optimization of software and hardware, our goal is ultimately to reduce invocation costs by about 4 orders of magnitude.
Some time ago we launched the large model Wuqiong Tianquan, which excels at long text: it handles 256k tokens, the longest context any large model could process at the time, equivalent to roughly 400,000 Chinese characters. This demonstrates, on one hand, the reliability of our optimized system architecture, and on the other, Wuwen Xinqiong's technical strength in scenarios such as long text that place high demands on performance optimization.
Feeding 400,000 characters into ChatGPT is very expensive, and the industry widely complains that inference costs are too high. One entrepreneur even said: "Four months into running a GPT business, I have invested five or six thousand, have five or six thousand users, and have earned a few dozen yuan." Most developers and users cannot accept such high prices for such a low return.
At present, Wuwen Xinqiong has achieved 2 to 3 orders of magnitude of cost compression, and the goal is to eventually cut prices by 4 orders of magnitude, so that deploying large models is no longer "delivering takeout in a Lamborghini". We hope to unlock the full potential of heterogeneous computing power, reduce costs, and lower the threshold for model training and inference, so that more creators can enter this field.
Heart of the Machine: In an idealized future, how far can this go?
Xia Lixue: Our slogan is "Unleash boundless computing power and put AGI within reach". We hope that when you use a large model to develop internal or external applications, invoking our computing power is as easy as calling an API. When using our services, you should not need to care about the specific technology behind them, such as which brand of card is doing the work.
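As a purely hypothetical sketch of that developer experience (the endpoint, model name, and fields below are invented for illustration and are not a real Wuwen Xinqiong SDK), calling hardware-agnostic inference would look like calling any hosted API:

```python
# Purely hypothetical sketch of the "as easy as an API" experience
# described above. The endpoint, model name, and key are invented;
# no real Wuwen Xinqiong / Infinigence AI API is implied.
import requests

resp = requests.post(
    "https://api.example-infini.cn/v1/chat/completions",  # hypothetical endpoint
    headers={"Authorization": "Bearer YOUR_KEY"},          # hypothetical key
    json={
        "model": "example-llm",  # caller never specifies which brand of card runs this
        "messages": [{"role": "user", "content": "Hello"}],
    },
    timeout=30,
)
print(resp.json())
```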
To get in touch, add the author on WeChat: jjingl- (please note your company, position, and name).