Large models are on the eve of a major shift, and big data, as their core ingredient, is taking on new meaning. What does the emergence of large model technology mean for big data? How will big data shape the development of large models? How can the two reinforce each other and create new quality productive forces? And which technology tipping points and killer applications will appear in 2024?
These open questions about large models and big data were taken up at the "6th Golden Ape Season & Rubik's Cube Forum - Big Data Industry Development Forum". Ou Xiaogang, senior chief writer of Data Ape, hosted the roundtable. He was joined by Che Pinjue, JP, director of Hong Kong Science and Technology Parks Corporation and member of the Digital Economy Development Committee of the Hong Kong Special Administrative Region; Hou Jianye, deputy general manager and CIO of Shishi Technology; Wang Long, founder and CEO of Matrix Origin; Luo Yongxiu, co-founder and CTO of Hongyi; and Xiong Wei, Vice President of Weiyin China. The panel held an in-depth discussion and offered suggestions for the integrated development of large models and big data.
The performance of large models is impressive, but it has to be admitted that even a model as strong as ChatGPT often produces serious nonsense. How smart or how foolish a large model turns out to be depends heavily on the corpus it is trained on, which raises the question: how does big data affect what large models can do?
The five panelists agreed that the quality of the data determines the quality of the large model. Che Pinjue believes that, for large models, the "big" in big data is not all-powerful. Fragmented data is less valuable than real, logically coherent data. Moreover, beyond a certain critical point, feeding in more scattered corpus no longer improves the model's reasoning, whereas high-quality data can make "one word worth a thousand".
Che Pinjue, JP, director of Hong Kong Science and Technology Parks Corporation and member of the Digital Economy Development Committee of the Hong Kong Special Administrative Region.
Wang Long offered a vivid analogy: a large model works like compressing a high-definition image into a 32 KB thumbnail, learning the patterns in how the data is arranged, and then generating a new high-definition image from those patterns. When a large model is built, the quality of the dataset directly determines the model's performance and accuracy. Building a real-time, accurate closed loop in which data collection, processing, and training connect seamlessly is key to advancing large models.
In the "laboratory", data quality may only affect how well the model scores, the difference between 80 points and 60. In commercial applications, however, there are only two outcomes: usable and unusable.
Luo Yongxiu said: "The role of large models in promoting intelligent document management is self-evident. As soon as large models came out, Hongyi tried to apply them to our ECM intelligent content management product, but it is difficult to form a complete system, because collecting, sorting, analyzing, and applying document data is a continuous process of dynamic optimization.
First, standard products. In the field of knowledge management, large models are in their element and are making rapid progress. This is because knowledge management holds an enterprise's most rigorous and logical knowledge, such as product operation manuals and production process standards. This content directly governs how the enterprise operates and produces, so it demands very high accuracy and follows definite organizational norms. With a dataset built on an industry knowledge base, whether through vectorization or high-precision fine-tuning, an enterprise that connects to a large model can expect a significant improvement in production efficiency."
Luo Yongxiu, co-founder and CTO of Hongyi.
Drawing on practical experience, Xiong Wei said: "The large model is a language model, and human language is a relatively complete and systematic system that can supply ample corpus, so large models have a natural advantage in understanding and generating natural language. The customer service we provide is one-stop, cross-regional, and multilingual. Large models help us communicate without barriers with customers in many countries around the world, and they act as an intelligent assistant."
Of course, emphasizing the "quality" of big data does not negate the role of "quantity". Quality and quantity are not opposites. They complement each other, and a larger volume of data together with higher data quality jointly determines the quality and performance of the model.
Over the past decade, the rapid development of the Internet has laid the data foundation for the rise of large models and left behind a number of extremely valuable data assets. In the past, effective data mining required not only heavy spending on experts but also a series of cumbersome steps such as data collection, preprocessing, and labeling. As a result, a large amount of data never delivered its value and became a "sleeping gold mine". The emergence of large models has given big data new life, and at the same time it places new demands on data infrastructure such as databases and data platforms.
Hou Jianye pointed out: "Before large models, few application scenarios needed to process hundreds of terabytes or petabytes of data. Only scientific research projects in fields such as meteorology and biomedicine needed data on that scale. Large models have brought the analysis of massive data into everyday use, and it can now be applied in almost every industry. Many companies developing large models routinely work with hundreds of billions of parameters. In the previous stage of IT development, the industry talked about bandwidth, access, and storage. In the new stage, models, computing power, and graphics cards have become the hot topics. That is the rhythm of the new era."
Hou Jianye, deputy general manager and CIO of Shishi Technology.
The large model is like a drilling rig upgraded with a new process, able to reach oil buried deeper underground. As a factor of production in the digital age, big data differs from the factors of production of the feudal and industrial eras: it can be reused and regenerated indefinitely. The wide application of large model technology will itself generate a huge amount of new data.
"Nowadays, a lot of short videos and text are generated by large models. The wide application of large models has greatly increased the volume of enterprise data, and it can be said that large models are the brain of big data. Documents and other data assets that used to be scattered across different departments of an enterprise will be rediscovered, and their value realized again, because of large models," Luo Yongxiu said.
Large models can not only analyze big data but also generate it. This generated data did not exist in the world a second earlier. How should we view the "unprecedented" data that large models create?
Wang Long believes that the large model is a probabilistic system. It is barely adequate for writing press releases. For a company's financial report, a large model may get 10,000 items right and only one wrong, but the trouble is that the user does not know which item is wrong or when an error will occur. In that situation, a company does not dare hand the whole task over to the model. Ensuring that the output of large models is true and accurate is therefore a very important issue at present. As large models are applied across more fields, the information they produce bears directly on the accuracy of decision-making and the stable operation of society.
Wang Long, founder and CEO of Matrix Origin.
Xiong Wei pointed out that training large models means feeding in massive amounts of data, and that releasing the value of data while protecting privacy will be an important challenge for enterprises in every industry. As AIGC is applied more widely and deeply, data security and privacy are expected to improve substantially through optimized model training, upgraded encryption technology, and a gradually maturing compliance and supervision system.
Xiong Wei, Vice President of Weiyin China.
Some analysts have pointed out that every scientific and technological revolution goes through two periods. The first twenty or thirty years are the introduction period, in which a large amount of infrastructure and the key industries gradually take shape and mature, while the new paradigm disrupts established practice and meets resistance from the old paradigm. The next twenty or thirty years are the expansion period. The structural contradictions accumulated earlier are eased as the institutional framework adjusts, and the transformative power of the revolution gradually spreads through the whole economy and society, so that the economy returns to sustainable growth.
Entering 2024 amid high expectations for large models, the guests were positive about the technology's development in the new year. Luo Yongxiu believes that, as economic growth slows, enterprises will focus more on cutting costs and raising efficiency. They may lay off employees and trim various expenditures, but investment in data asset management and knowledge management will increase. Wang Long is more optimistic: "There are many opportunities upstream and downstream of large models. The upstream opportunities come from infrastructure such as large model training and inference. The downstream opportunities come from the application layer, such as multimodal content generation. I believe the Microsoft and Toutiao of the next 20 years may be born in 2024."
AI Copilot, AI Agent, AI PC, and other branches of large model technology are each pushing ahead in their own direction, and tipping points and killer applications may surprise the world overnight in unexpected ways. Large models and big data are entangled like quantum particles, accelerating the convergence of industries. This convergence opens the door to deeper insights and intelligent decision-making, and it is pushing data science into a new era.