The problem of "pollution" of human databases caused by AIGC (generative artificial intelligence) is on the table.
Outlook Oriental Weekly reporter: Mao Zhenhua; editor: Chen Rongxue
An AI-generated image from an online platform: "The National Football Team Wins the Hercules Cup"
The advent of ChatGPT (a large language model developed by OpenAI) opened the door to a new world: artificial intelligence, it turns out, can make life remarkably convenient. As more and more AIGC (generative artificial intelligence) tools have been created and put to use, society has invested unprecedented attention in, and expectations for, artificial intelligence.
But as the novelty faded, problems followed. AIGC-generated images, news stories, Q&A answers and more are spreading to every corner of the online world, and as the technology keeps upgrading, they are becoming ever harder to distinguish from the real thing.
The problem of "pollution" of human databases caused by AIGC has been put on the table. How to face this new challenge is unavoidable.
Confusing the real with the fake.
An image of the national football team hoisting the Hercules Cup (the FIFA World Cup Trophy) has circulated widely on the Internet. Were it not for the stark gap between its content and what the public knows to be true, the image itself, with its lifelike expressions, movements and background, is realistic enough to pass for genuine.
Such a "realistic" **, according to the web publisher, is generated by AI (Artificial Intelligence). It may seem nonsensical and funny, but the risks are real – it conveys misinformation that is likely to be accepted as true by the public, leading to widespread rumours.
Such concerns are not unfounded.
Yellow grapes, delicate pink begonia blossoms, dappled clouds pierced by sunlight... these images circulating online, for all their stunning visuals, have all been shown to be synthesized by AIGC. Many netizens worry that a flood of such images not only muddles people's perceptions; as the technology iterates, AIGC output may become ever harder to tell apart from reality, thereby "polluting" the human database.
Nor is the worry merely theoretical. At the Colorado State Fair in the United States, a work titled "Théâtre D'opéra Spatial" ("Space Opera Theater") won first prize in the digital art category. The piece had first been generated by an AI image tool and then retouched in Photoshop. The incident sparked heated discussion online, with many artists accusing the entrant of using AI to cheat in the competition.
* It can be "faked", and objectively based news can be "generated".
Research by NewsGuard, an organization that tracks online misinformation, has found that AIGC-generated fake news and information has begun to explode, posing a new challenge for the Internet age.
According to NewsGuard, the number of websites publishing AI-generated fake articles has surged by more than 1,000% since May 2023, from 49 to more than 600, spanning 15 languages. These websites churn out large numbers of articles every day on politics, society and other topics. The motives behind them range from shaking public confidence and sowing disruption to pushing polarizing content for traffic and advertising revenue.
New information warfare.
Advances in AIGC technology have made it easy for almost anyone to set up a seemingly legitimate news website, producing content that is often indistinguishable from real news.
For example, one AIGC-generated article fabricated a story about "Israeli Prime Minister Benjamin Netanyahu's psychiatrist"; the false story spread widely and even made its way onto TV shows. Some websites mix genuine news with fabricated stories, which greatly boosts the credibility of the deceptions.
The danger, NewsGuard warns, lies in the reach and scale of artificial intelligence: combined with ever more sophisticated algorithms, misleading information will proliferate at unprecedented scale and speed, amounting to a new kind of information warfare.
AIGC news fraud may sound remote, but it is happening all around us. On December 28, 2023, authorities in Fengjie County, Chongqing, found that a netizen, Wang Moucheng, had used AI writing software to fabricate and publish a post on an information platform claiming that "12 people were killed in a coal mine accident in a certain place", which drew widespread attention from netizens and caused adverse social impact. A preliminary investigation found that Wang's aim was simply to attract attention and traffic. He was duly punished.
Alongside AIGC-generated fake news, the use of AIGC to produce and spread false videos is becoming increasingly common on some short-video platforms.
A reporter from "Outlook Oriental Weekly" found on one short-video platform that such videos often feature AIGC-created "digital humans", wise old men, little monks and the like, whose "spoken" voices and subtitles are likewise generated by AIGC. For a practiced operator, turning out several such videos in an hour is no trouble. The so-called health knowledge and life philosophy these videos dispense is either extreme in its views or laced with hidden advertising, and it is highly deceptive to the elderly and to children.
Pei Zhiyong, director of the Industry Security Research Center at Qianxin Group, explained that both sound and images can be decomposed, through specific mathematical transforms, into a set of feature vectors, and that assigning a specific set of parameters to those vectors yields a specific sound or image. So-called AI voice cloning takes samples of a person's past speech and, through machine learning, learns the parameters for each feature vector of that voice; reading out new content with that parameter set then imitates the person's speech, tone and even emotion.
Hany Farid, a professor of digital forensics at the University of California, Berkeley, believes advances in AI have made it easy for scammers to clone voices from short audio samples.
"Two years ago you might have needed a lot of audio to clone a person's voice. Now, if you have posted more than 30 seconds of audio on a social platform, your voice can be copied quickly," Farid said.
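To make the feature-vector idea concrete, here is a minimal Python sketch, assuming the open-source librosa audio library and hypothetical file paths. Real voice-cloning systems use far richer neural representations; this only illustrates the principle of reducing a voice to vectors that can be compared or reused.

```python
import numpy as np
import librosa  # open-source audio analysis library

def voice_signature(path: str, n_mfcc: int = 13) -> np.ndarray:
    """Reduce an audio clip to one feature vector: the mean of its
    MFCC frames -- a crude 'signature' of how the voice sounds."""
    y, sr = librosa.load(path, sr=16000)                    # load + resample
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # (n_mfcc, frames)
    return mfcc.mean(axis=1)

def similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two signatures (1.0 = identical)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical usage: compare a known voice sample with a suspect clip.
# sig_known = voice_signature("known_speaker.wav")
# sig_suspect = voice_signature("suspect_clip.wav")
# print(f"similarity: {similarity(sig_known, sig_suspect):.3f}")
```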
Devouring the "real world."
At this stage most people take a relaxed, tolerant attitude toward AIGC-generated content, because much of it is "fake at a glance". But by the time the technology leaps to the point where fakes are hard to spot, it will be too late to sound the alarm.
When one popular AIGC tool first launched, an "Outlook Oriental Weekly" reporter tried it and found the experience wanting. Asked the deliberately absurd question "In which year did Liu Xiang win the World Table Tennis Championships?" (Liu Xiang was a champion hurdler, not a table tennis player), it answered: "Liu Xiang won the World Table Tennis Championships in 2004." Asked again, it said 2005. To "Is Mount Tai a famous scenic spot in Jinan?", its first answer was: "Yes, Mount Tai is a famous scenic spot in Jinan City. It is located in Tai'an City, Shandong Province, is one of the Five Great Mountains of China, and has a long history and rich culture." Only when the same question was put to it again some time later did the tool correct its answer. Nearly a year on, however, the tool handles such questions properly and no longer makes these elementary mistakes.
Cao Feng, director of the artificial intelligence department at the Institute of Cloud Computing and Big Data of the China Academy of Information and Communications Technology, believes generative AI has achieved stronger self-learning through techniques such as pre-training, fine-tuning, prompt learning and reinforcement learning, coupled with continuous human feedback. That is the source of both its power and its appeal.
With repeated professional training and accumulating data, the accuracy and personalization of AIGC responses will steadily improve, making them even harder to distinguish. The "pollution" of human databases by AIGC is therefore gradual and well hidden, and its harm is not easy to detect.
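As a loose illustration of that human-feedback loop, and not of any vendor's actual pipeline, the toy sketch below shows how repeated thumbs-up/thumbs-down ratings can steer a system away from a wrong answer; the candidate answers and ratings are invented for the example, echoing the reporter's Liu Xiang question.

```python
import random

# Toy feedback loop: the system picks among canned candidate answers and
# nudges each answer's score toward the (simulated) human rating it gets.
candidates = {
    "Liu Xiang won the World Table Tennis Championships in 2004": -1.0,  # wrong
    "Liu Xiang was a hurdler and never competed at the World Table "
    "Tennis Championships": +1.0,                                        # right
}
scores = {answer: 0.0 for answer in candidates}

def pick_answer(epsilon: float = 0.1) -> str:
    """Mostly exploit the best-scored answer, occasionally explore."""
    if random.random() < epsilon:
        return random.choice(list(scores))
    return max(scores, key=scores.get)

for _ in range(200):
    answer = pick_answer()
    rating = candidates[answer]                        # simulated +1 / -1 feedback
    scores[answer] += 0.1 * (rating - scores[answer])  # move toward the rating

print(max(scores, key=scores.get))  # settles on the correct answer
```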
The more the technology advances, the harder it becomes to judge the authenticity of what it generates, and the more it "devours" the traditional world. From a purely technical standpoint, this trend will be difficult to stop.
It is not hard to imagine a future in which a considerable share of the images, data and Q&A results people retrieve, the appearance of animals, the look of plants, the content of calligraphy and paintings, has been altered by AIGC. When such "generated" content stands in for the real world, with what mindset will people view the world, and what judgments will they make?
On July 7, 2023, at the 2023 World Artificial Intelligence Conference in Shanghai, visitors toured an AIGC art exhibition themed "Symphony" (photo by Xin Mengchen).
Model autophagy.
Besides "polluting" human databases, another hidden problem of AIGC is "autophagy": knowledge generation that moves backwards rather than forwards.
Recent research has found that feeding AI-generated content back into similar models for training degrades model quality and can even cause the models to collapse. Scientists call this self-consuming phenomenon model autophagy.
The researchers note that while AIGC algorithms have made great strides in images, text and other areas, continually training models on synthetic data turns them inward, until they lose diversity and accuracy.
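The dynamic is easy to reproduce in miniature. The Python sketch below is an illustrative toy, not the researchers' actual setup: it fits a one-dimensional Gaussian "model" to data, samples synthetic data from the fit, keeps the most typical samples (a crude stand-in for generators' bias toward high-probability outputs), and retrains. With no fresh real data, the spread of the data collapses within a few generations.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(0.0, 1.0, size=1000)           # generation 0: "real" data

for generation in range(1, 9):
    mu, sigma = data.mean(), data.std()          # "train": fit the model
    samples = rng.normal(mu, sigma, size=2000)   # generate synthetic data
    # Keep the 1,000 samples closest to the mean -- mimicking the bias
    # toward high-likelihood outputs that accelerates collapse.
    data = samples[np.argsort(np.abs(samples - mu))[:1000]]
    print(f"generation {generation}: std = {data.std():.4f}")
```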
A study from Stanford University and the University of California, Berkeley offers real-world evidence: on some tasks, GPT-4 objectively performed worse in June 2023 than in March. For example, when both versions were tested on the same 500 questions asking whether a given integer is prime, the March version of GPT-4 answered 488 correctly, while the June version got only 12 right.
Other abilities declined as well. The researchers believe that "feeding" a model only AIGC-generated content, without fresh, human-annotated data, can only degrade AIGC's performance.
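Grading such a primality test is straightforward to automate. The sketch below is a hypothetical harness, not the study's actual code: a deterministic trial-division check scores each parsed yes/no answer against ground truth.

```python
def is_prime(n: int) -> bool:
    """Trial division: slow but correct for integers of test-question size."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    d = 3
    while d * d <= n:
        if n % d == 0:
            return False
        d += 2
    return True

def grade(questions: list[int], answers: list[bool]) -> float:
    """Fraction of a model's yes/no answers that match true primality."""
    correct = sum(a == is_prime(q) for q, a in zip(questions, answers))
    return correct / len(questions)

# Hypothetical usage, with answers already parsed from model output:
# accuracy = grade([17077, 2024, 7919], [True, False, True])  # -> 1.0
```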
The negative effects of AIGC have drawn the attention of the relevant authorities. The Interim Measures for the Management of Generative Artificial Intelligence Services, which took effect in 2023, specifically call for effective measures to improve the quality of training data and to enhance its authenticity, accuracy, objectivity and diversity, in essence widening the track for AIGC's future progress.
Establish a "no-go zone".
The development of science and technology should ultimately serve human happiness and a better life. When a flood of illusory or outright erroneous information leaves the space of human knowledge no longer pure, timely correction is needed. Especially at this early stage of AIGC's development, decisive measures are urgently required for the long run.
More and more platforms now require AIGC-generated content to be clearly labeled, to help people understand what they are seeing. Setting up "no-go zones" for AIGC has become necessary, and journalism is one of them.
Cui Wei, CEO and chief scientist of DataQin Technology, said that fake, shoddy news generated by AIGC has become a public nuisance. For example, stories about imminent oil-price adjustments appear constantly, yet on opening them one finds no recently released official announcement.
Look closely and this type of news shares common traits: sensational, traffic-baiting headlines with high click and comment counts; heavily formulaic content that first "reports" the news and then expounds its impact over hundreds or thousands of words. Curiously, each article is different, yet they are all very similar. The comments are mostly netizens venting, true or not; in the end the platforms and self-media accounts harvest the traffic, readers vent their emotions, and nobody knows the facts. When AIGC is used to generate false information, he said, its advantages of speed and scale rapidly amplify the misleading content, leading to a crisis of public trust and social disorder.
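One simple way platforms can hunt for such "different yet very similar" article batches, offered here as an illustrative sketch rather than any platform's actual method, is to compare articles' word-shingle sets with Jaccard similarity: templated rewrites share most of their shingles even when individual words change.

```python
def shingles(text: str, k: int = 3) -> set:
    """The set of k-word shingles (word n-grams) in a document."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a: str, b: str, k: int = 3) -> float:
    """Jaccard similarity of two documents' shingle sets (0..1)."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)

# Two invented "oil price" articles that differ by a single word:
a1 = ("breaking news oil prices will be adjusted at midnight tonight "
      "and drivers across the country should fill up in advance")
a2 = ("breaking news oil prices will be adjusted at midnight tomorrow "
      "and drivers across the country should fill up in advance")
print(f"similarity: {jaccard(a1, a2):.2f}")  # ~0.71: likely the same template
```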
There must be constraints on AIGC-generated content, he suggested: strengthen management at the source and on the platform side, strictly prohibit AI-fabricated fake news from spreading in the news field, and keep fake-news production from becoming a factory assembly line. Once discovered, such content should be resolutely removed rather than leaving the burden of telling real from fake to the public. Even content generated as "black humor" should be placed outside news sections and clearly labeled, lest rumors be taken as truth.
Co-governance.
Strengthening the ethical governance of science and technology and promoting technology for good will be a long-term task.
Cui Wei and others argue that technology itself cannot make moral judgments; how it is applied depends on the intentions of its human users. That makes it especially important to develop and enforce ethical guidelines, laws and regulations governing AIGC applications, which requires not only a sense of responsibility from developers and users but also the effective involvement of regulators.
Zhao Jingwu, an associate professor at Beihang University's Law School, noted that at the end of 2022 the Cyberspace Administration of China, the Ministry of Industry and Information Technology and the Ministry of Public Security issued the Provisions on the Administration of Deep Synthesis of Internet Information Services, which clarify the legal obligations of deep-synthesis service providers, such as using technical or manual means to review the input data and synthesis results of their users.
Beyond regulation, technology companies, educational institutions and the general public all have important roles to play. Zhao Jingwu believes this is not merely a technical issue but a broad one involving social governance, public education and international cooperation.
Using technology to curb technological chaos has also been put on the agenda.
As the technology advances, society's adaptation and coping mechanisms must keep pace. News organizations and social media platforms, for example, need more efficient tools and methods to identify and filter AIGC-generated fake content. Wang Yangping, a blockchain expert at New Huo Tech Holdings, believes blockchain technology can help solve the data and knowledge fraud that AIGC induces.
News photographs, species images, public knowledge and the like, he said, can be recorded on a blockchain: the whole process becomes traceable, the content transparent and tamper-proof, and the permanent record can help people sharpen their awareness and discernment and reduce the spread of false knowledge and rumors online.
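A minimal sketch of that traceability idea, a toy hash chain rather than a production blockchain, shows the mechanism: register a fingerprint (hash) of each piece of content in an append-only chained log, and any later edit to the content or to the log's history becomes detectable.

```python
import hashlib
import json
import time

chain = []  # toy append-only ledger; a real system would use a blockchain

def register(content: bytes, source: str) -> dict:
    """Append a tamper-evident record for a photo, article, or dataset."""
    prev_hash = chain[-1]["record_hash"] if chain else "0" * 64
    record = {
        "content_hash": hashlib.sha256(content).hexdigest(),
        "source": source,
        "timestamp": time.time(),
        "prev_hash": prev_hash,            # chains each record to the last
    }
    record["record_hash"] = hashlib.sha256(
        json.dumps(record, sort_keys=True).encode()
    ).hexdigest()
    chain.append(record)
    return record

def verify(content: bytes, record: dict) -> bool:
    """Check that content still matches its registered fingerprint."""
    return hashlib.sha256(content).hexdigest() == record["content_hash"]

photo = b"...raw bytes of a news photograph..."
rec = register(photo, source="news agency X")   # hypothetical source name
print(verify(photo, rec))                       # True: content unchanged
print(verify(photo + b"edit", rec))             # False: tampering detected
```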
This article is part of the special series "2024: Three Questions on Artificial Intelligence".