The Copyright Crisis of Large Language Models: How to Protect the Rights of Original Works?

Mondo Culture Updated on 2024-03-07

A large language model is an artificial intelligence model that uses deep learning to learn linguistic patterns and knowledge from massive amounts of text data and can generate many kinds of text in response to a given prompt. In recent years, as computing power and available data have grown, the performance and range of applications of large language models have expanded, making them both a hot topic and an engine of innovation in artificial intelligence. Their development has also given rise to a series of copyright disputes and legal challenges, however, because training and generation may involve the copying, adaptation, and reworking of copyrighted works, and thus infringe the rights of the original creators.

Recently, a startup called Patronus AI released an API for detecting whether content generated by large language models contains copyrighted material. The company also presented a study that tested several popular large language models, including OpenAI's GPT-4, Mistral's Mixtral, Anthropic's Claude 2, and Meta's Llama-2, and found that all of them reproduced copyrighted content to varying degrees. The study used books under U.S. copyright as test data, selecting popular titles from the cataloging site Goodreads and designing 100 different prompts that asked each model either to continue a passage or to output the first page of a book.
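To make the protocol concrete, here is a minimal, hypothetical sketch of such a probe: each prompt is fed to a model, and any response whose overlap with the matching copyrighted excerpt exceeds a threshold is flagged. The generate() callable, the probe_model() helper, and the character n-gram overlap metric are illustrative assumptions, not Patronus AI's actual API or detection method.

# Hypothetical sketch of a copyright-reproduction probe.
# generate() stands in for whatever model API is being tested;
# the n-gram overlap check is a simple stand-in detector.

from typing import Callable

def ngram_set(text: str, n: int = 8) -> set[str]:
    """Return the set of all character n-grams in a whitespace-normalized, lowercased string."""
    normalized = " ".join(text.lower().split())
    return {normalized[i:i + n] for i in range(len(normalized) - n + 1)}

def overlap_ratio(generated: str, reference: str, n: int = 8) -> float:
    """Fraction of the generated text's n-grams that also appear in the reference excerpt."""
    gen = ngram_set(generated, n)
    ref = ngram_set(reference, n)
    return len(gen & ref) / len(gen) if gen else 0.0

def probe_model(generate: Callable[[str], str],
                prompts: list[str],
                reference_excerpts: list[str],
                threshold: float = 0.5) -> float:
    """Run each prompt through the model and flag responses whose overlap with the
    matching copyrighted excerpt exceeds the threshold. Returns the flagged fraction."""
    flagged = 0
    for prompt, excerpt in zip(prompts, reference_excerpts):
        response = generate(prompt)  # model call (placeholder)
        if overlap_ratio(response, excerpt) >= threshold:
            flagged += 1
    return flagged / len(prompts) if prompts else 0.0

A per-model score computed this way (the fraction of prompts that produced a flagged response) is the kind of figure reported in the results below.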

The results showed that GPT-4 performed the worst, reproducing copyrighted content for 44% of prompts, with Mixtral at 22%, Claude 2 at 8%, and Llama-2 at 10%. These results suggest that the models may have been trained on these copyrighted works without the permission of the original creators. OpenAI had said earlier this year that it would be "impossible" to train top-tier AI models without copyrighted works.

This is not the first time that large language models have sparked copyright controversy. In 2020, OpenAI released GPT-3, a large language model with 175 billion parameters, which attracted global attention and praise. However, GPT-3 was also accused by multiple writers of having used their work as training data without authorization and of generating content similar to theirs. One of those writers is the American science-fiction author Orson Scott Card, who found GPT-3 generating content similar to his Ender's Game and tweeted, "This is a copyright infringement of mine and I will not tolerate this behavior." Another is the British fantasy author Neil Gaiman, who discovered that GPT-3 had generated content similar to his American Gods and tweeted, "This is illegal. I do not consent to this."

In addition to writers, some media organizations have also expressed dissatisfaction with how large language models handle copyright. In January, The New York Times sued OpenAI and Microsoft, alleging that they had used millions of New York Times articles as training data for GPT-3 and Codex and had made that material accessible to users through services such as Copilot. The New York Times demanded that OpenAI and Microsoft stop the infringement and pay billions of dollars in damages.

The copyright problems of large language models concern not only the legitimacy of the training data but also the ownership of, and responsibility for, the generated content. Large language models can now produce many kinds of text, such as poems, lyrics, news articles, comments, and more, which may have a certain originality and value and may also resemble or conflict with existing works. Should such content be protected by copyright law? If so, to whom should the rights belong: the developer of the model, the user, or the provider of the original data? And who should be held liable if the content infringes someone's copyright or other rights, or causes harm to society? These legal questions have no clear answers yet, and they will need to be resolved by legislators, judges, scholars, and practitioners around the world.

The copyright crisis of large language models reflects the gap between rapidly advancing artificial intelligence technology and lagging traditional copyright law. To protect the rights of original works and promote the healthy development of artificial intelligence, we need a copyright system adapted to the characteristics and needs of large language models: one that balances the interests and demands of all parties, regulates the training and use of these models, prevents copyright infringement and abuse, and promotes the harmonious coexistence of artificial intelligence and human culture.

References:

1. "OpenAI says it can't build artificial intelligence without copyrighted works," The Verge

2. "GPT-3 is generating plagiarized content, researchers say," VentureBeat

3. Neil Gaiman on Twitter: "This is illegal. I do not consent to this."

4. "New York Times sues OpenAI and Microsoft for billions over GPT-3," Business Insider
