On the afternoon of March 2, the Artificial Intelligence Security Governance Professional Committee of the Cyberspace Security Association of China held a symposium on "Artificial Intelligence Corpus Construction and Compliance" in Beijing. Deputies to the National People's Congress, members of the Chinese People's Political Consultative Conference (CPPCC), and academic, legal and industry experts attended the meeting to discuss in depth the legal issues involved in collecting, processing and circulating corpora for large AI models.
Mr. Sin Handi, Deputy to the National People's Congress, co-founder of CMGE (00302.HK) and founding partner of Guohong Jiaxin Capital, spoke on "Reflections on the Construction and Compliance of Artificial Intelligence Corpus". He made three points. First, model training still carries a significant risk of copyright infringement; for the sake of the AI industry's development, he suggested that where copyrighted works are used fairly, some large models could be exempted from copyright liability, though not unconditionally or without limit. Second, technological innovation, such as the development of intelligent copyright identification technology, should be encouraged to help manage copyright issues. Third, the public should be encouraged to take a greater part in discussing this topic, so as to raise copyright awareness, strengthen intellectual property education, and jointly promote the healthy development of artificial intelligence technology.
The following is the full text of the speech:
Good afternoon, everyone, and thank you very much for the invitation. I am very happy to join you today to discuss AI corpus construction and compliance, and I would like to share some of my own thoughts.
I. At present, large model training still carries a significant risk of copyright infringement
First, let's review the basic concept of a large model. A large model is a deep learning model that is trained on massive amounts of data to perform tasks such as natural language understanding and generation. Precisely because its training relies on so much data, it involves the use of copyrighted works, which raises concerns about copyright infringement. The unauthorized use of works hosted on third-party platforms to train large models has already led to disputes.
Many AI developers do not disclose the exact details of the training datasets used by generative AI, but the process can be roughly divided into two steps. In the first step, massive amounts of content data are obtained by purchasing databases, crawling public sources and so on, and are then stored on servers after some form of transformation. In the second step, the content data is analyzed and processed to find patterns, trends and correlations, which are converted into large model parameters for subsequent content generation. Some of this data, however, contains copyrighted content.
For example, AI developers including Google, Facebook, and OpenAI are training large models using the "Colossal Clean Crawled Corpus" dataset (often referred to simply as the C4 dataset), which includes a great deal of copyrighted content. These forms of data collection raise questions of copyright ownership and fair use, and resolving them is crucial to the healthy development of generative AI technology.
II. What are the existing criteria for judging whether a copyrighted work used in large model training is infringing?
To better understand how to determine whether using copyrighted works to train large models is infringing, I looked into the relevant laws and regulations in China:
Article 24 of China's Copyright Law sets out the specific circumstances of "fair use" (that is, circumstances in which a work may be used without the permission of the copyright owner and without paying remuneration to the copyright owner). The provisions relevant to large model training are generally "personal use", "appropriate citation", and "use for teaching or scientific research".
Among them, the first, "personal use", is strictly limited in purpose, and today's large models mainly provide commercial services, so they do not meet this condition.
The second, "appropriate citation", requires under the legal provisions that the citation be made "to introduce or comment on a certain work" or "to explain a certain issue", and the commercial application of AIGC models obviously does not meet this condition.
The third, "scientific research", limits the use of works to "school classroom teaching or scientific research" and also stresses that only a small number of copies may be made. Large models currently copy and use works in large quantities, so they cannot meet this requirement either.
So, judged by the Copyright Law alone, using unauthorized copyrighted works for model training is undoubtedly infringement.
However, so that copyright law can serve higher-level public interests, such as promoting the sharing of cultural knowledge across society and the advancement of content dissemination technology, countries have also established an exception rule: a use that satisfies the "three-step test" can also be judged "non-infringing". The three-step test requires that the use "be made only in special circumstances, not conflict with the normal exploitation of the work, and not unreasonably harm the interests of the copyright owner".
As for applying these three steps to judge whether a large model is infringing, I believe the legal experts here today will each have their own professional opinions. I would like to focus on two questions. Will the unauthorized use of copyrighted works for model training affect the market for those works? Will it lead to an imbalance in the public interest? Answering them requires weighing values and balancing interests, and it is hard to say that any answer is 100% correct. I believe the development of AIGC will greatly promote social development. Although model training may have some market impact on copyright holders, overemphasizing payment for copyrighted works during training would certainly restrict or even hinder the development of the AIGC industry.
That is why, since the release of ChatGPT, countries around the world have begun preliminary explorations into improving their laws in order to promote the development of AI, with the goal of "exempting the AIGC platform from copyright liability in the model training stage" to a certain extent. For example, the European Union, Japan, and the United States have all granted large models a certain degree of exemption from copyright liability by amending legal provisions.
III. Recommendations
Therefore, I would like to make the following recommendations:
1. For the sake of the AI industry's development, I suggest that where copyrighted works are used fairly, some large models can be exempted from copyright liability, but not unconditionally or without limit.
2. Encourage technological innovation, such as the development of intelligent copyright recognition technology, to help manage copyright issues.
3. Encourage the public to take part in discussing this topic, raise copyright awareness, strengthen intellectual property education, and jointly promote the healthy development of AI technology. A discussion like today's is a good opportunity for this, so I would like to thank the organizers again for holding this symposium, which allows us to think and exchange views more deeply on AI corpus construction and compliance.
That is all I have to share. Thank you!