ChatGPT is the answer key, and even ByteDance is copying homework?

Mondo Education Updated on 2024-01-30


Text | New Voices of the Metaverse, edited by Sun Haonan

As we all know, in the field of large AI models, OpenAI's development of ChatGPT is like a particularly hard problem set by a teacher at school: while everyone else was still working out an approach or was simply stumped, the top student in the class had already finished. So most people were inclined to compare notes with the top student, or simply copy the homework.

Recent controversies seem to confirm that many seemingly complex matters come down to the same thing. Earlier, Musk's Grok AI was suspected of plagiarism, and even of being a wrapper around ChatGPT, because of dataset contamination. Now ByteDance has had its account suspended by OpenAI on suspicion of violating the terms of service.

Recently, The Verge reported that ByteDance used its OpenAI API access, obtained through Microsoft, to generate data for training its own AI model, in violation of both Microsoft's and OpenAI's terms of use. Shortly after the news broke, The Verge further reported that OpenAI had suspended ByteDance's account.

So which specific terms did ByteDance violate? OpenAI's terms of service clearly state that the model capabilities it provides may not be used to "develop any AI model that competes with its products and services".

According to The Verge, the evidence came from internal ByteDance documents, including chat logs from Lark, the overseas version of Feishu.

The documents show that Project Seed, the codename for ByteDance's foundational large language model project, relied on OpenAI's API at almost every stage of development, including training and evaluating models.

Project Seed was launched about a year ago and currently focuses on two products: one is Doubao, which has already launched in China; the other is a chatbot platform for business users, which is still under development.

Employees involved in Project Seed were well aware of the consequences of over-relying on OpenAI's API, so they began discussing how to whitewash the evidence through "data masking". The reliance was heavy enough that employees often hit the OpenAI API's maximum access limit.

According to internal documents, ByteDance issued an order a few months ago to "stop using GPT-generated text at any stage of model development."

However, it was around this time that ByteDance released its own large language model product, Doubao. According to the introduction on its official account, Doubao offers a chatbot, a writing assistant, an English learning assistant, and other functions; it can answer questions and hold conversations to help people get information, and it is available on the web, iOS, and Android. Doubao can assist with natural language processing, knowledge understanding, conversation, information retrieval, sentiment analysis, machine learning, and more.

However, ByteDance continued to use the API in ways that violate OpenAI's and Microsoft's terms of service, including to evaluate the performance of the model behind Doubao. One person with first-hand knowledge of the situation inside ByteDance said, "They say they want to make sure everything is legal, but they really just don't want to get caught."

Following The Verge's report, ByteDance spokesperson Jodi Seth responded that GPT-generated data was used to annotate models in the early development of Project Seed and was removed from ByteDance's training data around the middle of this year. Seth added: "ByteDance is licensed by Microsoft to use the GPT API. We leverage GPT to support our products in non-Chinese markets, but in the Chinese market we use our self-developed model to support Doubao."

Yesterday afternoon, a ByteDance representative responded again, stressing that the company requires compliance with the terms of use when using OpenAI-related services. The company said it is also in contact with OpenAI to clarify any misunderstandings that may have arisen from external reporting.

ByteDance's explanation of its use of OpenAI's services:

1. At the beginning of this year, when the technical team first began exploring large models, some engineers used GPT's API service in experimental research on smaller models. The models were for testing only, with no plans to go live, and were never used externally. This practice was discontinued after the company introduced compliance checks on GPT API calls in April.

2. As early as April this year, ByteDance's large model team set a clear internal requirement that data generated by GPT models must not be added to the training datasets of ByteDance's large models, and it trained its engineers to comply with the terms of service when using GPT.

3. In January, the company conducted another round of internal inspections and took steps to further ensure that API calls to GPT met the specification. For example, the model's training data is sampled in batches and checked for similarity to GPT output, to prevent data annotators from using GPT privately.

4. In the coming days, we will conduct another comprehensive inspection to ensure that the terms of use of the relevant services are strictly adhered to.

OpenAI spokesperson Niko Felix issued a statement confirming that ByteDance's account has been suspended. "All API customers must adhere to our usage policy to ensure that our technology is being used for good. While ByteDance rarely uses our API, we have suspended their account while further investigation continues. If we find that their use does not comply with company policy, we will ask them to make the necessary changes or terminate their account," Felix said.

"Microsoft AI solutions, such as Azure OpenAI Service, are part of our limited access framework, which means that all customers must apply for and be approved by Microsoft to have access," Microsoft spokesperson Frank Shaw said in a statement. "We also set standards and provide resources to help our customers use these technologies responsibly and comply with our Terms of Service. We also have processes in place to detect abuse and stop businesses from accessing them if they violate our Code of Conduct."

The three parties' statements show that OpenAI is being cautious, only suspending ByteDance's account and saying it will investigate before deciding whether further measures are needed. Microsoft, on the other hand, has taken a hands-off attitude, as if to say, "I'm just a middleman; we have our own rules, and if there is a violation, we will cut off access." ByteDance is the most anxious, since the fire is already at its own door: it first issued a clarification, then immediately contacted OpenAI to put the fire out quickly.

According to public information, ByteDance established an AI lab as early as 2016, focusing on research in natural language processing, machine learning, data mining, and other areas. Douyin, Toutiao, and other ByteDance products have also repeatedly added AIGC (AI-generated content) features to keep attracting traffic.

In 2023, ByteDance's moves in the AI field accelerated significantly. In June, ByteDance's Volcano Engine released the large model service platform "Volcano Ark", which provides enterprises with a full range of platform services such as model fine-tuning, evaluation, and inference.

In August, ByteDance's self-developed general-purpose large model "Skylark" appeared on the list of the first batch of large models to pass review under the "Interim Measures for the Management of Generative Artificial Intelligence Services".

On August 17, ByteDance began public testing of the AI chatbot "Doubao", built on the Skylark large model, and turned its focus to consumer-facing AI applications.

Recently, while scaling back its gaming and XR businesses, ByteDance has established a new AI division, Flow. According to recruitment postings, Flow is ByteDance's AI innovation business team. It has launched two products, "Doubao" in China and "CICI" overseas, and is incubating a number of other AI-related products.

At the same time, ByteDance has ordered more than $1 billion worth of GPUs from Nvidia this year; its orders alone equal Nvidia's total revenue from commercial GPUs sold in China last year. In addition, in talent recruitment, ByteDance ranks first among the top 10 companies by number of new AIGC job postings, accounting for 324%.

All of this shows that ByteDance attaches great importance to AI and large models. Returning to the incident itself: would ByteDance, which takes this so seriously, run such a big risk just to "overtake on the curve"?

After the advent of ChatGPT, ByteDance, like many major Chinese tech companies, has been trying to keep pace with AI. But ByteDance clearly lagged somewhat: many people tried Doubao after its launch, yet its performance did not reach the top tier. If an AI trained with ChatGPT performs only at this level, that is hard to explain; and if Doubao was not trained with ChatGPT, then this level of performance is what one would expect.

In an interview with Ars Technica, artificial intelligence researcher Simon Willison said that "many large models have been fine-tuned on datasets generated using OpenAI APIs, or scraped from ChatGPT itself."

But these practices evidently stay within reasonable bounds, and the same may be true of ByteDance. As for whether ByteDance was too eager for quick results and went beyond reasonable use: presumably a huge internet company would not resort to plagiarism that risks so much for so little.
