AIGC series in-depth research report: AIGC Special Topic: Sorting out global AIGC data copyright rules and which areas have commercialization potential
Report Producer: Everbright**).
Report total: 15 pages.
Featured Report**: The School of Artificial Intelligence
Data collection, cleaning, and annotation are important prerequisites for training AI models. As AIGC applications based on large models are gradually promoted and commercialized, whether model training data infringes copyright must be considered. The data used for model training can be divided into proprietary data, open-source data, and dedicated datasets.
Each data type has its own way of securing copyright; alternatively, creators can be compensated directly, which greatly reduces the infringement risk of training data. As AI models keep iterating and improving in performance, the downstream application industry chain prospers, and the relevant rules and regulations mature, technology companies will need to pay increasing costs to ensure the copyright compliance of training data.
1.1. Proprietary data: Copyright is mainly secured through copyright cooperation agreements, paid API access, etc., and the commercial potential is broad
AI companies that use proprietary data for model training can negotiate directly with copyright owners to ensure copyright compliance for training datasets. Domain-specific high-quality data and private data that is not publicly licensed usually require a fee, but they are important for further improving the performance of large models and strengthening their capabilities in vertical segments. The two main ways AI companies obtain proprietary data are copyright cooperation agreements and paid API access.
1.1.1. Copyright cooperation agreement: A number of overseas copyright providers such as Shutterstock and Axel Springer have established cooperation with AI companies
The high-quality corpora of copyright providers are important for improving model performance and reducing the workload of data cleaning and annotation. News copyright owners have abundant, comprehensive, and up-to-date information, and literary works, artistic creations, and film and television works contain a large amount of high-quality training material. In addition, some material libraries already annotate their images and other materials, which can greatly reduce the workload of data cleaning and annotation.
Shutterstock has established cooperation with companies such as OpenAI, Meta, and LG, providing its images, audio, and other materials to partners for model training and earning income from them. The news publishing house Axel Springer has partnered with OpenAI, and its news materials will be used to enrich OpenAI's model training dataset. Bria AI, an Israeli text-to-image model company, has established a long-term partnership with Getty Images and trains on licensed content from image rights libraries such as Getty Images, Alamy, and Envato.
1.1.2. Paid API access: In 2023, API access for Reddit, Twitter, etc. shifted from free to paid
Crawling web data through APIs is also an important source of model training data. As large language models are applied ever more deeply in different industry segments, the demand for professional data will also increase.
Some providers of high-value, highly specialized data charge for API access. For example, the Bloomberg API in finance, the New York Times API in news media, the Elsevier API in academic publishing, the Amazon API in e-commerce, and the Google Maps API all require payment.
Providers of non-professional data, such as social platforms and open-source platforms, have also gradually begun to charge for API access. In April 2023, the social platforms Reddit and Twitter shifted their API access from free to paid. A likely reason is that demand for large model training data drove a significant increase in API calls, which raised costs for both platforms. Stack Overflow, an open-source community platform, has announced that it will charge AI companies for training data.
1.2. Open source data: Relies on open license agreements and specific data scraping strategies to protect copyright, but hidden risks of infringement remain
AI companies that use open-source data for model training can ensure copyright compliance through open license agreements, specific data scraping strategies, manual review, and community supervision. An open license is a standardized way for copyright holders to authorize others to use their data. In addition, model vendors can improve the transparency of training datasets and ensure copyright compliance through manual screening and community supervision.
1) Open license agreement: Common open license agreements for open source datasets include Creative Commons (CC), Open Data Commons (ODC), and the Community Data License Agreement (CDLA). There are six Creative Commons options:
CC BY: author must be credited; adaptation and commercial use are permitted;
CC BY-SA: author must be credited; adaptations must be shared under the same terms;
CC BY-NC: author must be credited; adaptation is permitted; commercial use is not allowed;
CC BY-NC-SA: author must be credited; non-commercial use only; adaptations must be shared under the same terms;
CC BY-ND: author must be credited; adaptation is not permitted;
CC BY-NC-ND: author must be credited; non-commercial use only; adaptation is not permitted.
2) Specific data scraping strategies: AI companies can use specific strategies to avoid copyrighted information when scraping web page data, and web page maintainers can also tighten their control over data crawling. For example, a site's robots.txt file specifies which crawlers may access which paths, and a noindex directive keeps a page out of search indexes.
3) Community supervision: AI companies can improve the transparency of the training dataset, encourage community supervision, and let creators file a complaint if they believe the training data infringes their work. This approach is better suited to open-source models; for commercial closed-source models, the training dataset is often kept secret as part of the developer's technical moat.
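As an illustration of item 1) above, the six Creative Commons options can be encoded as a small lookup table and filtered by the permissions a use case needs. This is a minimal sketch: the `CCLicense` class, the `CC_LICENSES` table, and the `licenses_allowing` helper are hypothetical names introduced here, and the table only restates the license terms listed above. It is not a legal determination of whether model training counts as commercial use or adaptation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CCLicense:
    """Permissions of one Creative Commons option (hypothetical helper type)."""
    attribution_required: bool
    commercial_use: bool
    adaptation: bool
    share_alike: bool


# The six options, restated from the list above.
CC_LICENSES = {
    "CC BY":       CCLicense(True, True,  True,  False),
    "CC BY-SA":    CCLicense(True, True,  True,  True),
    "CC BY-NC":    CCLicense(True, False, True,  False),
    "CC BY-NC-SA": CCLicense(True, False, True,  True),
    "CC BY-ND":    CCLicense(True, True,  False, False),
    "CC BY-NC-ND": CCLicense(True, False, False, False),
}


def licenses_allowing(commercial: bool, adaptation: bool) -> list[str]:
    """Return the license names that grant every requested permission."""
    return [
        name
        for name, lic in CC_LICENSES.items()
        if (lic.commercial_use or not commercial)
        and (lic.adaptation or not adaptation)
    ]


if __name__ == "__main__":
    # Options that permit both commercial use and adaptation.
    print(licenses_allowing(commercial=True, adaptation=True))
```

A data pipeline could apply such a filter to dataset metadata before deciding which items need further review.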
On the whole, a fairly complete copyright protection system has developed around the acquisition of open source data, but certain hidden risks of infringement remain. For example, some public pages lack well-established open licenses and API crawling rules, and the content on a public page may itself be infringing.
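The robots.txt check mentioned under the data scraping strategies can be sketched with Python's standard `urllib.robotparser` module. The robots.txt content, the crawler name `ExampleTrainingBot`, and the `example.com` URLs below are made up for illustration; a real crawler would fetch the site's actual robots.txt instead of parsing an inline string.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one named training crawler is blocked entirely,
# every other crawler is only kept out of /private/.
ROBOTS_TXT = """\
User-agent: ExampleTrainingBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def may_crawl(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/articles/1"
    print(may_crawl(ROBOTS_TXT, "ExampleTrainingBot", page))
    print(may_crawl(ROBOTS_TXT, "OtherBot", page))
    print(may_crawl(ROBOTS_TXT, "OtherBot", "https://example.com/private/x"))
```

Note that robots.txt only expresses the site maintainer's crawling preferences; it does not by itself settle whether the content on a permitted page is properly licensed.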
1.3. Direct compensation for creators: Overseas, advanced techniques trace the copyright sources of AI-generated content, and funds have been set up to compensate creators
There are two main ways to protect copyright by directly compensating creators: 1) ex ante compensation: the copyright owner is compensated when the work is adopted as training data; 2) ex post compensation: specific technologies trace the training data sources of AI-generated content, and targeted compensation is provided.
Ex ante compensation is technically easy, but a reasonable amount of compensation is difficult to define. Shutterstock, a well-known overseas copyright library, has established a Contributor Fund: contributors are compensated when the content they created is used for AI model training, and they continue to be compensated when content is later generated using the model. This method ensures that creators receive some compensation, but content of different styles and quality contributes differently to model training, and that contribution is hard to quantify, which makes compensation pricing difficult.
Ex post compensation traces training data through technical means and pays the corresponding copyright compensation. Its pricing is more reasonable, but the technology is still immature. In September 2023, Carnegie Mellon University, Adobe Research, and the University of California, Berkeley, jointly developed two algorithms: the first prevents the model from invoking copyrighted works, and the second compensates creators when the model generates content that draws on copyrighted works. Artists also have the option to opt out of AI models at any time. In addition, Bria AI, an Israeli text-to-image model company, developed an attribution model in September 2023 that calculates the impact of data sources on AI-generated content, thereby enabling more reasonable pricing for the copyright holders of training data.
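The pricing idea behind ex post compensation can be illustrated with a simple pro-rata split: once some attribution method has assigned each data source an influence score for a generated output, a royalty pool is divided in proportion to those scores. This is only a sketch under stated assumptions. The `split_royalty` function, the score values, and the proportional rule are hypothetical; the report does not describe how the Carnegie Mellon, Adobe Research, and Berkeley algorithms or Bria's attribution model actually compute or price contributions.

```python
def split_royalty(pool: float, scores: dict[str, float]) -> dict[str, float]:
    """Divide a royalty pool among sources in proportion to attribution scores.

    `scores` maps a source identifier to a non-negative influence score
    produced by some attribution method (assumed to exist elsewhere).
    """
    if any(score < 0 for score in scores.values()):
        raise ValueError("attribution scores must be non-negative")
    total = sum(scores.values())
    if total == 0:
        # No measurable influence: nothing is paid out.
        return {source: 0.0 for source in scores}
    return {source: pool * score / total for source, score in scores.items()}


if __name__ == "__main__":
    # Hypothetical scores for one generated image.
    payouts = split_royalty(100.0, {"artist_a": 3.0, "artist_b": 1.0})
    print(payouts)
```

Compared with a flat ex ante payment, a split of this kind ties each payout to measured influence, which is why its pricing is described as more reasonable; its accuracy depends entirely on the quality of the attribution scores.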
1.4. Dedicated datasets: Datasets directly applicable to AI and ML, or provided as part of a MaaS service to enhance the user experience
Dedicated datasets are datasets that have been screened and cleaned and can be used directly for model training; the dataset provider is required to fulfill the obligation of confirming data rights. Dedicated datasets give developers strong support for machine learning and model training research. Most are open source, but some are paid. Cloud service providers often package dedicated datasets as part of their MaaS offering to help users better train their own custom models.
1) Direct sale of datasets: These datasets have been screened, organized, and annotated in advance, and consist of labeled examples or input-output pairs that can be used directly for AI and machine learning model training. Payment methods include one-time purchase and subscription, and prices are affected by factors such as data volume, accuracy, time coverage, and region. For example, Datastock, a dataset store, sells high-quality, structured web crawling datasets across retail, healthcare, travel, and more. Datarade, a data trading platform, has set up an AI & ML training data zone where providers and developers can trade datasets.
2) Provided to users as part of MaaS: Cloud service providers such as Microsoft, Amazon, and Google all provide MaaS services to help customers with AI model training and application development, including call interfaces for self-developed and third-party AI models, and supporting services and guidance on the technical details of model training. For custom models, the dataset is typically the customer's own data, but some MaaS platforms also provide specific model training datasets for customers to use. For example, the Microsoft Azure cloud platform provides customers with curated datasets, built from publicly available data, that are accessible at any time during model training.
2.1. The cooperation between copyright providers and AI companies is mutually beneficial
The rapid increase in AI-generated content poses a certain threat to copyright providers such as material libraries and news publishers. 1) AI-generated content may be uploaded to copyright libraries and mixed in with other content. As large models keep breaking through in performance, the quality of AI-generated content is gradually improving, and it can even be difficult to distinguish from content created by human authors and artists. If a large amount of AI-generated content is uploaded to a copyright library, users' willingness to pay may suffer. 2) AI-generated content may become a substitute for copyright libraries. With the promotion and popularization of AIGC products, the continued reduction in the cost of large models, and the continued improvement of relevant policies, AI-generated content will be increasingly used in commercial products, squeezing the room left for traditional copyright material libraries. Copyright providers therefore also need to actively embrace the AIGC trend and explore new ways of combining their traditional business with AI technology.
For AI companies, model training requires massive amounts of high-quality data, and AIGC products also need to be connected to more information sources. For the subsequent commercialization of the model and the long-term healthy development of the company, it is better to obtain high-quality training data from copyright providers. In addition, copyright providers can enrich the information sources and product functions of AIGC products and so improve the user experience.
2.1.1. Shutterstock, an overseas multimedia copyright library: Earns income from model training materials and compensates creators through a fund
Following the AIGC wave, Shutterstock, a well-known overseas multimedia copyright library, has launched a dedicated section for AI generation and provides an AI text-to-image tool powered by OpenAI. Shutterstock holds more than 450 million pieces of content contributed by more than 1 million contributors, and the multimedia materials it provides mainly include:
1) Images: vector graphics, AI-generated images, etc.;
2) The Pond5 video platform: footage, AE materials, sound effects, 3D models, etc.;
3) Design: business marketing templates, social media templates, etc. In addition, Shutterstock offers design tools, including editors, cutout tools, AI generation tools, and more.
Shutterstock's two-way collaboration with OpenAI began in 2021, when Shutterstock started working with OpenAI and LG. In July 2023, OpenAI and Shutterstock signed a six-year partnership agreement.
Shutterstock's collaboration with AI companies can be summarized in three areas:
1) Shutterstock provides OpenAI with copyrighted material for model training. Under the agreement, OpenAI has access to Shutterstock's images, videos, and other materials as training data for AI models. Shutterstock owns the copyright to a wealth of high-quality content and is an industry leader in diversity and data annotation, which gives it a great advantage in supplying data for AI model training.
2) Shutterstock has set up a Contributor Fund that compensates contributors when their works are used to train AI image models. Shutterstock was the first company to launch such a fund, which had compensated hundreds of thousands of creators as of July 2023 and continues to compensate creators through royalties tied to licensing activity for newly generated assets.
3) AIGC text-to-image and image editing tools are integrated into the Shutterstock platform and are powered by OpenAI's text-to-image model DALL·E. Contributors whose content is used for model training receive long-term rights to use the AI text-to-image tool. In addition to OpenAI, Shutterstock has partnered with companies such as Nvidia, Meta, and LG to develop AIGC authoring tools for text, images, 3D, and more.
2.1.2. Axel Springer, an overseas news publisher: Provides text training data for OpenAI and drives traffic to creators through links
A publisher's high-quality articles are a high-quality text corpus for large model training, which helps accelerate the performance iteration of large models and promotes the improvement of the copyright system for AI-generated content.
On December 13, 2023, German digital publisher Axel Springer and OpenAI announced a global partnership, making Axel Springer the first news publisher in the world to partner with OpenAI.
1) For OpenAI: OpenAI will pay to use the content of Axel Springer's publications to improve its AI model training database. ChatGPT users will receive a selection of global news summaries from brands owned by Axel Springer. When ChatGPT answers a user's question using information from an Axel Springer publication, a link will be provided below the answer, ensuring credit, compensation, and traffic for content copyright holders.
2) For Axel Springer: It can open up new lines of business and capture potential incremental revenue by providing high-quality content to AI companies, while drawing on OpenAI's technical support to improve its products. By partnering with OpenAI, it can explore the future of journalism, using AI to enhance content experiences and create new growth opportunities.
OpenAI has repeatedly sparked controversy over the unauthorized use of news media articles to train models. Leading US news organizations, including the Wall Street Journal and the New York Times, have had disputes with OpenAI over copyright issues. In February 2023, Jason Conti, general counsel for News Corp's Dow Jones division, said in a statement to Bloomberg News that any business using Wall Street Journal content to train AI should seek permission from Dow Jones & Company. In August 2023, the New York Times updated its terms of service to prohibit the use of its news coverage and other content for developing application software and training AI models, and warned that it would sue OpenAI if the dispute continued.
The commercial partnership with Axel Springer is the starting point for a mutually beneficial relationship between OpenAI and publishers around the world. Brad Lightcap, COO of OpenAI, stated OpenAI's commitment to working with publishers and creators around the world to ensure they benefit from advanced AI technologies and new revenue models.
2.2. Shutterstock as a case for optimism about cooperation between copyright libraries and AI companies: the overall positives of AIGC outweigh the negatives
2.2.1. Shutterstock's data licensing revenue is now clearly reflected in its results, driving valuation repair and stock price recovery
Shutterstock's stock price bottomed out as its data partnership business gradually delivered on its earnings potential. From January to May 2023, Shutterstock's stock price rose rapidly and then fell back; it then trended downward until it began to recover after the release of the 23Q3 results.
1) January to April 2023: Catalyzed by the AIGC investment thesis, the stock price rose sharply. With AIGC an investment hotspot, the market began to look for industries that might benefit. Shutterstock attracted attention as a company that had been cooperating with OpenAI since 2021, and the logic that large model training drives demand for training data copyright was straightforward; the stock gained as much as 51.1%.
2) May to October 2023: The market began to worry that the rapid development of AI text-to-image tools would squeeze Shutterstock's traditional copyright licensing business.
3) The stock price bottomed out as Shutterstock's revenue from model training data licensing grew rapidly. The Computer Vision Data Partnerships Offering that Shutterstock discloses represents licenses for assets such as 3D models provided to big tech companies to train generative AI and machine learning models. In 23Q3 this revenue amounted to $45.5 million, up 195%; in the first nine months of 2023 it reached $79.5 million, up 121%.
2.2.2. There are many reasons for the decline of Shutterstock's traditional business, and the threat of AIGC replacing copyright providers is not yet evident
In our view, the decline in Shutterstock's traditional business is not due to AI text-to-image tools, but more likely to a variety of factors such as competitive pressure. We compare Shutterstock's traditional business, which excludes computer vision data revenue, with that of its competitor Getty Images. Here, Shutterstock's traditional business means its businesses other than large model training data, including the e-commerce business (customers can subscribe monthly or pay as they go) and enterprise services that provide customers with content libraries and other materials; this is more comparable to Getty Images' revenue. The copyright provider Getty Images has shown strong competitiveness in the content library market with its abundant, high-quality resources.
In 2023, Getty Images' revenue remained stable and was not significantly affected by AI text-to-image tools. As a competitor to Shutterstock, Getty Images has no AI model training data business, and its total revenue has been relatively stable over the past two years, with 23Q3 total revenue of $230 million, down 0.5%. By comparison, Shutterstock's traditional business revenue has declined continuously since 22Q4, falling to $190 million in 23Q3, down 7.3%, while Shutterstock's subscriber and paid-subscriber counts also trended down in 23Q3. In our view, the decline in Shutterstock's traditional business revenue owes more to competitive pressure within the industry, while computer vision data revenue has become a new growth driver for its results.
As of the end of 2023, public antipathy to AI text-to-image and other multimodal generation was still strong. On December 6, 2023, the Spring Festival Gala mascot "Long Chenchen" was suspected of being AI-generated artwork and was widely criticized by the public in China. Since Stable Diffusion, Midjourney, and other text-to-image software came into the public eye, there has been continuous discussion of whether AI-generated images are infringing. A joint article published in December 2022 by the University of Maryland, College Park and New York University shows that some text-to-image models with small parameter counts directly copy parts of the material used for training, and that Stable Diffusion, a relatively mature text-to-image product at the time, also copied the details, structure, and painting style of famous paintings at the pixel level.
The public's doubts about AI multimodal generation mainly concern: 1) whether the materials used in model training are authorized; 2) whether generation through machine learning can be defined as a process of learning and authoring; and 3) whether the materials in the training data are simply and crudely spliced together during AI generation.
Gradually reversing the public's negative emotions and one-sided perception of AI multimodal generation is a necessary prerequisite for bringing AI technologies into production and daily life and for releasing their commercialization potential. As AIGC's influence rapidly expands, tech companies will also need to pay more to ensure the copyright compliance of model training data and generated content in order to meet possible future legal challenges.
Disputes over AIGC's copyright issues and related regulations can be divided into two main categories:
1) Copyright definition of AI-generated content: whether AI-generated text, images, and other content are protected by copyright, and to which party (users, model providers, training data providers, etc.) the copyright should belong. Clarifying the copyright of AI-generated content is an important prerequisite for the large-scale commercialization of AIGC products.
2) Copyright provisions on model training data: whether the datasets used by model providers such as OpenAI and Stability AI to train foundation models are protected by copyright, and how model providers should obtain the rights to the training datasets. Copyright requirements for training data are key to the healthy and sustainable development of the AIGC industry and to easing the public's negative feelings about AI-generated content.