The New York Times sued OpenAI and Microsoft Corp. in December, alleging that the two companies illegally used its copyrighted material to train AI models. OpenAI has now fired back publicly, publishing a blog post that says, "We support journalism, work with news organizations, and believe that the New York Times lawsuit is baseless."
In the blog post, the company reiterated the following four positions:
We partner with news organizations and create new opportunities.
Training is fair use, but we offer the opportunity to opt out because it's the right thing to do.
"regurgitation" is a rare mistake that we are working to reduce to zero.
The New York Times doesn't tell the full story.
OpenAI said its goal is to support a healthy news ecosystem, be a good partner, and create mutually beneficial opportunities. With this in mind, the company has been looking to form partnerships with news organizations, and its early collaborations with the Associated Press, Axel Springer, the American Journalism Project, and New York University have given it a preliminary sense of how such partnerships can work.
The company believes that long-accepted precedents suggest that training AI models on publicly available internet materials qualifies as fair use. "We believe this principle is fair to creators, necessary for innovators, and critical for U.S. competitiveness."
As for "regurgitation," OpenAI explained that it has taken steps to limit unintentional memorization and to prevent regurgitation in its models' output, noting that regurgitation is less likely when content comes from a single source (e.g., the New York Times). The company also urged users to "act responsibly" and not deliberately prompt its models to regurgitate: "Deliberately manipulating our models to regurgitate is not an appropriate use of our technology and is a violation of our Terms of Use." ("Regurgitation" refers to a model reproducing memorized training content in its output, a failure that becomes more likely when a particular piece of content appears multiple times in the training data.)
OpenAI revealed that it last communicated with the New York Times on December 19, 2023, and that "our discussions with the New York Times seemed to be progressing constructively." "We explained to the New York Times that, like any single source, their content does not contribute meaningfully to the training of our existing models and would not be impactful enough for future training. But they filed a lawsuit on Dec. 27, which we learned about through the New York Times, to our surprise and disappointment."
It is worth mentioning that, according to OpenAI, the New York Times found some instances of regurgitation in the course of the two parties' communications but repeatedly refused to share any examples, despite OpenAI's commitment to investigating and resolving any issues.
Interestingly, the regurgitation the New York Times cites appears to come from years-old articles that had been widely reproduced on multiple third-party websites. To get the models to regurgitate, the paper seems to have deliberately manipulated its prompts, often including lengthy excerpts of the articles themselves. Even with such prompts, OpenAI says, its models usually do not behave the way the New York Times insinuates, suggesting that the paper either instructed the model to regurgitate or cherry-picked its examples from numerous attempts. OpenAI maintains that this kind of misuse is neither typical nor permitted user behavior and is not a substitute for the New York Times; in any case, the company says it is continually improving its systems' resistance to adversarial attacks designed to extract training data and has already made great progress in its recent models.

At the end of the statement, OpenAI said that although it believes the New York Times lawsuit is baseless, it still respects the paper and hopes to build a constructive partnership with it. "We look forward to continuing our work with news organizations to help them improve their ability to produce high-quality news by realizing the transformative potential of AI."
OpenAI's response comes as the copyright debate over generative AI heats up, and some data suggests the public leans toward the publishers. In a recent poll by the AI Policy Institute, 59% of respondents agreed that AI companies should not be allowed to use publishers' content to train models, and, when presented with the details of the New York Times' lawsuit against OpenAI, 70% of those surveyed said AI companies should compensate publishers if they want to use copyrighted material in model training.