Industry Observation: NVIDIA's new rules restrict CUDA conversion, and a self-built ecosystem is the way to go

Mondo Technology | Updated on 2024-03-07

A software ecosystem is a shared technology platform on which many participants collaborate to build a large body of software solutions and services. Such an ecosystem can play an extremely important role for developers: it can reshape the entire AI workflow and strengthen developer stickiness.

According to a report by Tom's Hardware, Nvidia has banned the use of translation layers to run CUDA-based software on other hardware platforms in the updated license terms of its CUDA software platform. The industry's general reading of this policy change is that, in order to prevent other vendors from tapping the CUDA ecosystem through translation layers such as Zluda, Nvidia is restricting them from running CUDA ecosystem software on their AI chip platforms via Zluda and similar tools.

As NVIDIA's software platform, CUDA drives AI models very efficiently when paired with NVIDIA hardware, and it has become the first choice of many AI companies for training and inference. It is also an important pillar of NVIDIA's dominance in today's AI computing field. However, with the advent of more competitive hardware, more and more users want to run their CUDA programs on other platforms. Using a translation layer such as Zluda is the easiest way to run CUDA programs on non-NVIDIA hardware (recompiling the existing code is another option).

This has clearly had an impact on NVIDIA's position in AI applications, and it is the reason behind NVIDIA's decision to ban the use of translation layers to run CUDA applications on other hardware platforms. In fact, since 2021 NVIDIA has prohibited running CUDA-based software on other hardware platforms via translation layers in the license terms published online. Now, Nvidia has also added this warning to the end-user license agreement shipped with CUDA 11.6 and later. In the long run, NVIDIA will undoubtedly put up more legal hurdles to prevent CUDA software from running on third-party hardware through translation layers.

In a developer's day-to-day workflow, the first step is data management: extracting, transforming, and loading data into the application, collectively referred to as ETL. This is followed by data storage, model training, validation (and visualization), inference, and other stages. A good enough software ecosystem touches every one of these steps; with the support of a well-developed ecosystem, work efficiency improves greatly, developer stickiness increases, positive feedback forms, and the ecosystem's barriers to entry rise.

Before the advent of CUDA, people had to write a great deal of low-level code or go through graphics APIs to tap the computing power of GPUs, which was very inconvenient for programmers who mainly used high-level languages. This situation led Nvidia to build a computing platform to match its hardware. CUDA was announced in 2006, and CUDA 1.0 was officially released in 2007 after a beta period. From 2008 to 2010, the CUDA platform was further developed, adding features such as new synchronization instructions and full-speed constant memory. NVIDIA made the CUDA ecosystem take shape by providing development tools to software vendors for free. Programmers no longer needed to drive the GPU through a graphics API; they could manipulate it directly in a C-like language.
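As a concrete illustration of that C-like model, here is a minimal CUDA sketch (an illustration written for this article, not taken from NVIDIA's materials): a vector-addition kernel launched directly from host code, with no graphics API involved.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each GPU thread adds one element: plain C-like code that runs on the GPU.
__global__ void vecAdd(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    // Allocate and fill host buffers.
    float *ha = (float*)malloc(bytes), *hb = (float*)malloc(bytes), *hc = (float*)malloc(bytes);
    for (int i = 0; i < n; ++i) { ha[i] = 1.0f; hb[i] = 2.0f; }

    // Allocate device buffers and copy the inputs over.
    float *da, *db, *dc;
    cudaMalloc(&da, bytes); cudaMalloc(&db, bytes); cudaMalloc(&dc, bytes);
    cudaMemcpy(da, ha, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(db, hb, bytes, cudaMemcpyHostToDevice);

    // Launch the kernel with enough threads to cover all n elements.
    int block = 256, grid = (n + block - 1) / block;
    vecAdd<<<grid, block>>>(da, db, dc, n);

    cudaMemcpy(hc, dc, bytes, cudaMemcpyDeviceToHost);
    printf("hc[0] = %f\n", hc[0]);  // expect 3.0

    cudaFree(da); cudaFree(db); cudaFree(dc);
    free(ha); free(hb); free(hc);
    return 0;
}
```

Compiled with nvcc, this runs end to end on an NVIDIA GPU without the programmer ever touching a graphics pipeline, which is exactly the convenience that pulled developers into the ecosystem.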

CUDA comprises a wide range of ecosystem components: programming languages and APIs, development libraries, profiling and debugging tools, data center and cluster management tools, and the GPU hardware itself. Each of these categories contains a large number of components, built up by NVIDIA and open-source developers over nearly two decades.
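To show what "development libraries" means in practice, here is a minimal sketch (an illustration, not from the article) that calls cuBLAS, one of CUDA's core math libraries, to multiply two small matrices; the sizes and values are arbitrary.

```cuda
#include <cstdio>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main() {
    const int n = 2;  // 2x2 matrices, stored column-major as cuBLAS expects
    float hA[] = {1, 2, 3, 4}, hB[] = {5, 6, 7, 8}, hC[4] = {0};

    float *dA, *dB, *dC;
    cudaMalloc(&dA, sizeof(hA)); cudaMalloc(&dB, sizeof(hB)); cudaMalloc(&dC, sizeof(hC));
    cudaMemcpy(dA, hA, sizeof(hA), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, sizeof(hB), cudaMemcpyHostToDevice);

    // The library call replaces a hand-written matrix-multiply kernel:
    // C = alpha * A * B + beta * C
    cublasHandle_t handle;
    cublasCreate(&handle);
    const float alpha = 1.0f, beta = 0.0f;
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &alpha, dA, n, dB, n, &beta, dC, n);

    cudaMemcpy(hC, dC, sizeof(hC), cudaMemcpyDeviceToHost);
    printf("C(0,0) = %f\n", hC[0]);  // expect 23.0 for these inputs

    cublasDestroy(handle);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return 0;
}
```

Libraries like this, accumulated across every category above, are what an alternative platform has to match before developers will switch.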

Compatibility with the CUDA ecosystem through porting

Of course, while the CUDA ecosystem is large and enjoys a first-mover advantage, it is not irreplaceable. Since NVIDIA occupies the vast majority of the AI training market, finding a second option besides CUDA is an important strategy for many AI model companies.

For AMD, Intel, and other challengers, while directly translating software through a layer such as Zluda is now prohibited, recompiling existing CUDA programs remains legal. So, on the one hand, they continue to launch better hardware, build their own software ecosystems, and attract more developers to design software for these new platforms. On the other hand, using the compatibility tools provided by the open source community to stay compatible with the CUDA ecosystem through porting is an important supplement, as the sketch below illustrates.
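As an illustration of the porting route: AMD's HIP API (part of ROCm) mirrors the CUDA runtime closely enough that tools such as hipify-perl can mechanically translate CUDA source into HIP source, which is then recompiled with hipcc for AMD GPUs. Here is roughly what the vector-add example from earlier looks like after such a port (the kernel is the same illustrative example as above; the HIP calls follow the real HIP runtime API).

```cpp
#include <hip/hip_runtime.h>
#include <cstdio>

// The kernel body is unchanged from the CUDA version.
__global__ void vecAdd(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    float *ha = (float*)malloc(bytes), *hb = (float*)malloc(bytes), *hc = (float*)malloc(bytes);
    for (int i = 0; i < n; ++i) { ha[i] = 1.0f; hb[i] = 2.0f; }

    // hipMalloc/hipMemcpy are one-to-one analogues of cudaMalloc/cudaMemcpy,
    // which is what makes mechanical source translation (hipify) feasible.
    float *da, *db, *dc;
    hipMalloc(&da, bytes); hipMalloc(&db, bytes); hipMalloc(&dc, bytes);
    hipMemcpy(da, ha, bytes, hipMemcpyHostToDevice);
    hipMemcpy(db, hb, bytes, hipMemcpyHostToDevice);

    int block = 256, grid = (n + block - 1) / block;
    vecAdd<<<grid, block>>>(da, db, dc, n);

    hipMemcpy(hc, dc, bytes, hipMemcpyDeviceToHost);
    printf("hc[0] = %f\n", hc[0]);  // expect 3.0

    hipFree(da); hipFree(db); hipFree(dc);
    free(ha); free(hb); free(hc);
    return 0;
}
```

Because this is recompilation of source code rather than runtime translation of NVIDIA binaries, it is the path that remains open under the new CUDA license terms.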

For example, ROCm is an open-source computing ecosystem developed by AMD for its GPU products that remains compatible with CUDA to a large extent. Software library support is at the heart of usability. Since 2015, the ROCm ecosystem has steadily enriched its components. In 2016, at the ROCm 1.0 stage, it offered initial support for basic data formats, basic operation instructions, the commonly used basic linear algebra libraries, and some common AI frameworks. By April 2023, AMD had released ROCm 5.6, forming a fairly clear software architecture: low-level driver and runtime, programming model, compiler and debugging tools, computing libraries, and deployment tools. In terms of completeness, ROCm now offers relatively complete support compared with CUDA for development and profiling tools, basic computing libraries, deep learning libraries and frameworks, and system software.

Of course, core frameworks and software libraries that only support AI are still not enough for a GPU platform; reaching a large user base usually requires comprehensive software and hardware support, so that feedback from real use can drive rapid iteration and build an ecological moat. ROCm officially supports only Linux, its Windows support is insufficient, and installing it via the command line or scripts has a higher threshold than CUDA's graphical installer. That is acceptable for professionals such as developers and algorithm engineers, but it remains a barrier for users who are not focused on AI.

Nevertheless, the ROCm ecosystem can already replace CUDA to a certain extent, especially in core AI fields where its support and usability are relatively complete, and many of its practices are well worth studying.

A self-built ecosystem is the way to go for the long term

Amid the ChatGPT craze, generative AI has become the biggest hot spot of the moment. This boom has greatly accelerated the GPU industry and, at the same time, presented a rare development opportunity for domestic GPUs. At present, hardware performance metrics are not the biggest obstacle for domestic GPUs; some can approach the international mainstream level in theoretical hardware performance, but they face many limitations at the software ecosystem level.

In 2023, with the explosion of generative AI, and especially after the performance cuts to Nvidia's special China-market versions, domestic GPU companies received greater attention. However, putting these products to use is inseparable from ecosystem support. Domestic companies still need to accelerate ecosystem development around core AI frameworks and algorithm libraries in order to take the lead in the ecosystem competition.

First of all, a self-built ecosystem is the way to go for the long term. Among China's domestic AI compute chip roadmaps, Huawei Ascend and Cambricon have both chosen to build their own ecosystems. Although this approach faces many challenges in the early stages, it goes further and more steadily in terms of ecosystem autonomy. At the same time, from a global perspective, AMD, Intel, and others have not given up on porting-based compatibility with the CUDA ecosystem, and many of their practices are also worth domestic manufacturers' reference.
