A text classification training sample enrichment method for automatically enriching small-sample classes

Mondo Technology Updated on 2024-02-01

Sinovatio technology.

The technology was developed by Sinovatio and entered into the selection for the "Data Ape Annual Golden Ape Planning Activity: 2023 Big Data Industry Annual Innovative Technology Breakthrough List and Awards".

A text classification training sample augmentation method based on word embeddings was designed and implemented. The method uses existing sample data to automatically and efficiently enrich classes that have only a small number of samples. Its main innovations are: first, existing training samples are extended with words from outside the original text, so that the wording of the new samples is richer; second, k-nearest-neighbor text classification is used to screen the candidate samples generated from word embeddings, eliminating irrelevant and erroneous candidates and retaining, with high probability, usable training samples, thereby expanding the training set.

The value of this technology lies in alleviating the poor results of naive sample enrichment and in improving classifier performance. Text classification is a typical supervised learning problem, and one of the main difficulties of supervised learning is that it requires a large number of manually labeled training samples. In practice, obtaining labeled samples usually demands enormous manpower and material resources, the so-called "labeling bottleneck". As a result, the number of labeled training samples available is often limited, both in count and in the information they contain. Because a limited training set (limited in number and in distribution information) cannot characterize the overall distribution of the data well, the learned classifier generalizes poorly, the so-called "small-sample" problem. This technique is an effective solution to that problem.

This technique can be widely used in various text classification tasks, improving classification accuracy. Such tasks include, but are not limited to: sentiment classification, news topic classification, spam filtering, product review classification, chat intent classification, health and disease classification, political leaning classification, and legal document classification.

The detailed steps of a specific embodiment of the present invention are as follows:

The first step is to obtain small-sample keywords and build a small-sample keyword set. As shown in Figure 1, the text classification training sample set is divided into small-sample classes and non-small-sample classes. Keywords are extracted from the small-sample classes to form a keyword set; in this embodiment, the PositionRank algorithm is used. PositionRank is similar to TextRank: it computes each word's importance score from a PageRank-style graph over the words. The score is given by the following recurrence:

S(v_i) = (1 - α) · p_i + α · Σ_{v_j ∈ Adj(v_i)} [ w_{ji} / O(v_j) ] · S(v_j)

where α is the damping factor, generally set to 0.75; w_{ji} represents the weight of the edge between words v_j and v_i, i.e., the similarity between the words; and O(v_j) denotes the sum of the weights of all outgoing edges of word v_j. The initial score p_i of a word is inversely proportional to the positions at which the word appears in the text and directly proportional to the word's frequency:

p_i = Σ_{t ∈ positions(v_i)} 1/t

Assuming that the word v appears at the 2nd, 3rd, and 8th positions of the text, then p_v = 1/2 + 1/3 + 1/8 ≈ 0.958.

As shown in Figure 2, segmenting a text yields six words A, B, C, D, E, and F, whose weights are then computed by the PositionRank algorithm; selecting the 3 highest-scoring words gives the keywords B, C, and F. In the implementation, the number of keywords selected from each text depends on the length of the text itself, and f(n) denotes the number of keywords to be selected for a text of length n.
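Step 1 can be sketched as follows. This is a minimal, assumed implementation of a PositionRank-style scorer (the co-occurrence window, iteration count, and helper names are illustrative, not from the patent): the restart vector is biased by each word's summed inverse positions, and a biased PageRank power iteration runs over a word co-occurrence graph.

```python
# Simplified PositionRank sketch: position-biased PageRank over a
# word co-occurrence graph, returning the top-k words as keywords.
from collections import defaultdict

def position_rank(tokens, window=2, alpha=0.75, iters=50, top_k=3):
    # Initial (restart) score p_i: sum of 1/position over occurrences,
    # normalized so the restart vector sums to 1.
    pos_weight = defaultdict(float)
    for i, w in enumerate(tokens, start=1):
        pos_weight[w] += 1.0 / i
    total = sum(pos_weight.values())
    restart = {w: s / total for w, s in pos_weight.items()}

    # Symmetric co-occurrence edge weights within a sliding window.
    edges = defaultdict(float)
    for i, w in enumerate(tokens):
        for j in range(i + 1, min(i + window + 1, len(tokens))):
            if tokens[j] != w:
                edges[(w, tokens[j])] += 1.0
                edges[(tokens[j], w)] += 1.0
    out_sum = defaultdict(float)  # O(v_j): total outgoing edge weight
    for (u, _v), wt in edges.items():
        out_sum[u] += wt

    # Power iteration of S(v) = (1 - alpha) * p_v + alpha * rank(v).
    score = dict(restart)
    for _ in range(iters):
        new = {}
        for v in restart:
            rank = sum(score[u] * wt / out_sum[u]
                       for (u, v2), wt in edges.items() if v2 == v)
            new[v] = (1 - alpha) * restart[v] + alpha * rank
        score = new
    return sorted(score, key=score.get, reverse=True)[:top_k]
```

For instance, `position_rank("a b c a b d".split(), top_k=3)` favors words that appear early and often, as in the Figure 2 example.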

In the second step, half of the tokens are randomly selected from each segmented text in the non-small-sample classes, and an equal number of words are randomly drawn from the small-sample keyword set to replace the selected tokens, forming a new segment. Word segmentation is performed with the ICTCLAS Chinese word segmenter from the Chinese Academy of Sciences.
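A minimal sketch of this replacement step, under the assumption that "half of the tokens" means ⌊n/2⌋ distinct positions chosen uniformly at random (the function name and signature are illustrative):

```python
# Candidate generation: replace half of a non-small-sample text's tokens
# with words drawn at random from the small-sample keyword set.
import random

def make_candidate(tokens, keyword_set, rng=None):
    rng = rng or random.Random()
    n_replace = len(tokens) // 2
    positions = rng.sample(range(len(tokens)), n_replace)  # distinct positions
    keywords = list(keyword_set)
    out = list(tokens)
    for p in positions:
        out[p] = rng.choice(keywords)
    return out
```

Passing an explicitly seeded `random.Random` makes the augmentation reproducible across runs.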

In the third step, the k-nearest neighbors of the new segment among the known training samples are found using text similarity, computed here with a DSSM model. DSSM (Deep Structured Semantic Models) is trained on massive query-title click logs from a search engine: a DNN (deep neural network) maps the query and the title to low-dimensional semantic vectors, and the distance between the two vectors is measured by cosine similarity. The trained model can therefore both score the semantic similarity of two sentences and produce a low-dimensional semantic vector representation of a single sentence.
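The similarity measure at the core of this step reduces to cosine similarity between two sentence embeddings. A minimal sketch, assuming the embeddings have already been produced by a trained DSSM-style encoder:

```python
# Cosine similarity between two low-dimensional semantic vectors.
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0
```

Identical directions score 1.0, orthogonal vectors score 0.0, and a zero vector is conservatively assigned similarity 0.0 to avoid division by zero.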

As shown in Figure 3, the k-nearest-neighbor algorithm classifies new segments. Suppose the training sample set contains three categories: class 1, class 2, and class 3. With k = 5, the five training samples most similar to the new segment to be classified belong to class 1, class 1, class 1, class 2, and class 3; the new segment is assigned to class 1 because class 1 holds the majority. In the implementation, k is tied to the small-sample size and is computed from an empirically chosen hyperparameter, with the result rounded down to an integer.
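The majority vote described above can be sketched as follows; this assumes similarities have already been computed (e.g. with the DSSM model), and the function name is illustrative:

```python
# k-NN labeling: rank labeled samples by similarity to the new segment,
# take the top-k, and return the majority class among them.
from collections import Counter

def knn_label(similarities, labels, k=5):
    # similarities[i]: similarity between the new segment and sample i
    # labels[i]: class of sample i
    ranked = sorted(range(len(labels)),
                    key=lambda i: similarities[i], reverse=True)
    top = [labels[i] for i in ranked[:k]]
    return Counter(top).most_common(1)[0][0]
```

With the Figure 3 example, top-5 labels of [1, 1, 1, 2, 3] yield class 1.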

In the fourth step, the new segments that the k-nearest-neighbor classifier assigns to small-sample classes are kept and merged with the text classification training sample set to form the expanded training sample set.
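The final screening-and-merge step is a simple filter; a minimal sketch, assuming each sample is a (segment, label) pair (the representation is an assumption, not from the patent):

```python
# Keep only candidates labeled as the small-sample class by k-NN,
# then merge them into the original training set.
def expand_training_set(train, candidates, small_class):
    kept = [(seg, label) for seg, label in candidates if label == small_class]
    return train + kept
```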

Figure 1: The process of enriching the training samples for text classification.

Figure 2: Keyword extraction based on the PositionRank algorithm.

Figure 3: The k-nearest-neighbor algorithm classifies new segments.

Patent application/publication number: ZL 2019 11119076.5

Name of the person in charge of the team: Lu Yunchuan

Lu Yunchuan, Vice President of Sinovatio and General Manager of its Big Data Product Line. He holds a master's degree from Tsinghua University and is a senior engineer, a member of the China Database Committee, head of a member unit of the China Internet Network Security Threat Governance Alliance, and vice chairman of the Nanjing Artificial Intelligence Industry Association. He has worked in telecommunications, big data, and artificial intelligence for more than 20 years, holds 5 intellectual property rights, and has led or participated in 8 provincial and ministerial science and technology projects, including the National 242 Information Security Project and the Jiangsu Strategic Emerging Project.

Names of other important members of the team: Zhang Quan, Zhuo Keqiu.

Affiliation: Oceanmind, Sinovatio.

Nanjing Sinovatio Technology Co., Ltd. (hereinafter "Sinovatio") was established in 2007, formerly as a subsidiary of ZTE, and is now controlled by Shenzhen Innovation Investment Group. The company was listed on the Shenzhen Stock Exchange in 2017 under stock code 002912.

Oceanmind is a big data operating system brand under Sinovatio. Oceanmind innovatively proposes systematic, online data construction solutions that redefine enterprise data engineering, providing business-driven, online, visualized, and seamless data construction services: a one-stop offering covering data construction consulting, implementation of consulting results, application construction, and data management. This resolves four common enterprise problems: consulting plans that are hard to implement, consulting results that are hard to land, data applications that are hard to build, and data systems that are hard to operate, supporting enterprises' digital transformation. It also provides a data middle platform, an intelligent data warehouse, master data management, an indicator management platform, and industry-specific big data business analysis applications. Closely following the business scenarios of enterprise operation and management, it deepens the visualization of operating status, operating processes, and risk control, creating digital solutions for enterprise operation and management and continuously assisting enterprises in digital transformation.

Sinovatio's AI-based audit file mining and utilization system has solved our long-standing difficulty in applying electronic audit files and realizing the value of archives, greatly improving the efficiency with which we use the massive body of existing files, and has formed a good demonstration effect in the industry.

Liaocheng Audit Bureau.

Song Xinchang, Chief of Electronic Data Section.

The intelligent search function of Sinovatio's Oceanmind innovatively adds natural-language semantic capabilities to searching across all our data, so that we can efficiently extract data originally scattered across multiple systems based on entity elements and automatically generate reports, greatly reducing the workload of our staff in collecting and compiling materials.

China Energy Construction Jiangsu Electric Power Design Institute.

IT Manager Piquan Huang.
