Golden Ape Technology Exhibition: Data augmentation techniques for training sample sets

Mondo Technology Updated on 2024-01-31

Transwarp Technology.

This project was submitted by Transwarp and participated in the selection of the "Data Ape Annual Golden Ape Planning Activity - 2023 Big Data Industry Annual Innovative Technology Breakthrough List and Awards".

To meet the growing demand for AI performance, training sample sets must be updated continually so that AI models can be retrained and refreshed. Researchers keep creating new model structures and proposing new training techniques, and on public datasets in specific fields these models now exceed human-level performance. When trained on data from real usage scenarios, however, the models are prone to bias on particular subsets of the data, which results in a lack of fairness.

To solve the above problems, the training sample set can be adjusted according to feedback collected during model training and after the model is deployed online, so that the training set stays high in quality. The most commonly used data adjustment method is data augmentation, which is either supervised or unsupervised. Taking image data as an example, supervised data augmentation falls into two classes: geometric transformations and color transformations. Geometric transformations alter the geometry of an image and include flipping, rotating, cropping, deforming, and scaling. Color transformations include adding noise, blurring, color changes, erasing, and filling. Unsupervised data augmentation has a model learn the distribution of the data and then randomly generates images consistent with the distribution of the training dataset; the representative method is the Generative Adversarial Network (GAN).
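The supervised transformations listed above can be illustrated with a minimal sketch. It assumes a grayscale image stored as a list of rows with pixel intensities in [0, 1]; the function names are hypothetical and chosen for illustration only, and a real pipeline would more likely use an image library.

```python
import random

# Assumption: a grayscale image is a list of rows of intensities in [0, 1].
Image = list[list[float]]


def flip_horizontal(img: Image) -> Image:
    """Geometric: mirror each row left to right."""
    return [row[::-1] for row in img]


def rotate_90_clockwise(img: Image) -> Image:
    """Geometric: reverse the row order, then transpose."""
    return [list(col) for col in zip(*img[::-1])]


def crop(img: Image, top: int, left: int, height: int, width: int) -> Image:
    """Geometric: keep a height x width window starting at (top, left)."""
    return [row[left:left + width] for row in img[top:top + height]]


def add_gaussian_noise(img: Image, sigma: float, rng: random.Random) -> Image:
    """Color: add zero-mean Gaussian noise and clamp back into [0, 1]."""
    return [[min(1.0, max(0.0, p + rng.gauss(0.0, sigma))) for p in row]
            for row in img]


def erase(img: Image, top: int, left: int, height: int, width: int,
          fill: float = 0.0) -> Image:
    """Color: overwrite a rectangular region with a constant fill value."""
    out = [row[:] for row in img]
    for r in range(top, min(top + height, len(out))):
        for c in range(left, min(left + width, len(out[r]))):
            out[r][c] = fill
    return out
```

Each function returns a new image and leaves its input unchanged, so several augmented variants can be derived from one labeled sample while reusing its label.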

However, traditional data adjustment methods cannot accurately locate the training samples in the training sample set that cause poor model accuracy, so they cannot augment the training sample set where it is deficient. As a result, improving the training sample set is inefficient and the model's performance gain cannot be guaranteed.

Transwarp's innovative data augmentation technology for training sample sets addresses this problem. It accurately locates the training samples that lead to poor model accuracy and applies targeted data augmentation to them, which improves the data augmentation efficiency of the training sample set and, in turn, the performance of the model.

The technology works as follows. First, the attribution feature set of the data sample set is determined, and the data sample set is divided into at least two data sample subsets according to the attribution features in that set. Second, the data sample subsets are classified according to the value of a first evaluation index of each subset, forming error data sample subsets, which contain inference errors, and normal data sample subsets, which do not. Third, for each error data sample subset, a corresponding control data sample subset is determined from the normal data sample subsets according to the contribution of the attribution features to the inference errors of the error data samples in that subset. Finally, the augmented training sample set is determined according to the propensity scores of the data samples in the error data sample subsets and the control data sample subsets. In this way the training samples that cause poor model accuracy are accurately located and augmented in a targeted manner, which improves the data augmentation efficiency of the training sample set and the performance of the model.
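The partition, classification, and matching steps can be sketched roughly as follows. This is an illustrative simplification, not the patented procedure: it assumes accuracy as the evaluation index, a fixed threshold, propensity scores that are already computed and stored on each sample, and it uses every normal sample as a candidate control instead of selecting control subsets by feature contribution. All names and field keys (`pred`, `label`, `propensity`, `region`) are hypothetical.

```python
from collections import defaultdict


def partition_by_feature(samples, feature):
    """Group samples into subsets by the value of one attribution feature."""
    subsets = defaultdict(list)
    for s in samples:
        subsets[s[feature]].append(s)
    return dict(subsets)


def accuracy(subset):
    """Assumed evaluation index: share of samples predicted correctly."""
    return sum(s["pred"] == s["label"] for s in subset) / len(subset)


def classify_subsets(subsets, threshold):
    """Subsets scoring below the threshold are 'error', the rest 'normal'."""
    error, normal = {}, {}
    for key, subset in subsets.items():
        (error if accuracy(subset) < threshold else normal)[key] = subset
    return error, normal


def match_by_propensity(error_samples, control_samples):
    """Pair each error sample with the control sample whose (precomputed)
    propensity score is closest to its own."""
    return [
        (e, min(control_samples,
                key=lambda c: abs(c["propensity"] - e["propensity"])))
        for e in error_samples
    ]


# Toy data: region is the attribution feature (all values are made up).
samples = [
    {"region": "north", "label": 1, "pred": 1, "propensity": 0.30},
    {"region": "north", "label": 0, "pred": 0, "propensity": 0.62},
    {"region": "south", "label": 1, "pred": 0, "propensity": 0.58},
    {"region": "south", "label": 0, "pred": 1, "propensity": 0.35},
    {"region": "south", "label": 1, "pred": 1, "propensity": 0.80},
]

error, normal = classify_subsets(partition_by_feature(samples, "region"), 0.8)
controls = [s for subset in normal.values() for s in subset]
pairs = match_by_propensity(error["south"], controls)
```

Here the "south" subset falls below the threshold and is treated as the error subset, and each of its samples is paired with the "north" sample closest in propensity score. How the matched pairs feed into the augmented training set is not detailed in the text above, so the sketch stops at the pairing.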

The technical solution can be used wherever datasets train data mining tools such as classifiers and regressors, and it improves the training effect. It applies to specific fine-grained scenarios (such as load ** in the power field, fault detection, or loan repayment in the field of financial risk control data processing**) where, because the dataset is of poor quality or overly concentrated, the processing of continuously updated real-world data is prone to partial errors**. In these scenarios, the training samples that lead to poor model accuracy are accurately located according to the contribution of the attribution features to the inference errors of the error data samples, and targeted data augmentation is then applied to them. This improves the data augmentation efficiency of the training sample set and the performance of the model trained on the augmented dataset.

For example, in a load system in the power field, a data sample may be related to the geographical location, weather conditions, user structure, and economic development of the load area. Geographical location affects weather conditions and economic development, so it has a certain causal relationship with power load, but it is not the direct cause of differences in power load, and inferring power load from geographical location produces large inference errors. This technical solution can therefore be used to accurately locate the training samples that lead to poor model accuracy, according to the contribution of the attribution features to the inference errors of the error data samples, and then to apply targeted data augmentation to those samples. This improves the data augmentation efficiency of the power load training sample set and the performance of the load ** model trained on the augmented dataset.

For the loan repayment scenario in the field of financial risk control data processing, the attribution feature set of a loan applicant can include the applicant's age, annual income, and marital status. With this technical solution, the attribution features that contribute most to the inference errors of the error data samples can be identified, which locates the training samples that lead to poor model accuracy. Targeted data augmentation is then applied to those samples, improving the data augmentation efficiency of the loan repayment training sample set and the performance of the loan repayment model trained on the augmented dataset.
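As a rough illustration of ranking attribution features by their contribution to inference errors, the sketch below scores each feature by the gap between the highest and lowest error rate across its values. This scoring rule is an assumption made for the example, since the text does not specify how contribution is computed, and the applicant records, field names, and banded values are all made up.

```python
from collections import defaultdict


def error_rate_by_value(samples, feature):
    """Error rate of the model within each value group of one feature."""
    totals, errors = defaultdict(int), defaultdict(int)
    for s in samples:
        totals[s[feature]] += 1
        errors[s[feature]] += s["pred"] != s["label"]
    return {v: errors[v] / totals[v] for v in totals}


def contribution(samples, feature):
    """Assumed contribution score: spread of error rates across the groups."""
    rates = error_rate_by_value(samples, feature).values()
    return max(rates) - min(rates)


def rank_features(samples, features):
    """Features ordered from largest to smallest contribution score."""
    return sorted(features, key=lambda f: contribution(samples, f),
                  reverse=True)


# Hypothetical applicants; label/pred = 1 means "repays the loan".
applicants = [
    {"age_band": "under_30", "income_band": "low", "marital_status": "single", "label": 1, "pred": 0},
    {"age_band": "under_30", "income_band": "high", "marital_status": "married", "label": 0, "pred": 1},
    {"age_band": "30_plus", "income_band": "low", "marital_status": "married", "label": 1, "pred": 1},
    {"age_band": "30_plus", "income_band": "high", "marital_status": "single", "label": 0, "pred": 0},
    {"age_band": "under_30", "income_band": "low", "marital_status": "married", "label": 1, "pred": 1},
    {"age_band": "30_plus", "income_band": "high", "marital_status": "married", "label": 0, "pred": 0},
]

ranking = rank_features(applicants, ["age_band", "income_band", "marital_status"])
print(ranking)  # ['age_band', 'marital_status', 'income_band']
```

In this toy data the errors concentrate among applicants under 30, so the age band ranks first and samples from that group would be the natural target for augmentation.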

Patent application number / publication number: ZL202211173668.7

Name of the person in charge of the team: Yang Yifan

Yang Yifan is currently the Vice President of Transwarp. He received his bachelor's degree from the University of Science and Technology of China in 2008 and his Ph.D. in statistics from the University of Kentucky. He has worked in the Anti-Money Laundering Department of Bank of America and the Adversarial Intelligence team of Alibaba's Search Division, and now works in Transwarp's Artificial Intelligence Product Division. He has extensive experience in anti-money laundering and anti-cheating business, as well as research experience in statistical learning, deep learning, and graph computing. His main research areas are big data basic software, artificial intelligence, data security, and privacy computing. He is the author of "Machine Learning in Practice" and "Data Security and Circulation: Technology, Architecture and Practice".

Names of other important members of the team: Xia Zhengxun, Tang Jianfei, Zhang Yan.

Affiliation: Transwarp.

Transwarp Technology (**688031) is committed to building enterprise-level big data basic software, providing basic software and services around the whole life cycle of data, including data integration, storage, governance, modeling, analysis, mining, and circulation. After years of independent research and development, Transwarp has established a number of product series: TDH, a one-stop big data infrastructure platform, ARGODB, a distributed analytical database, and KunDB, a container-based intelligent data cloud platform, TDS, a big data development tool, and SOPHON, an intelligent analysis tool, etc., and holds a number of patented technologies. At present, the company's products have been applied in more than a dozen industries, with more than 1,400 end users. In 2016, the company became the first vendor in China to enter the Gartner Magic Quadrant for Data Warehousing and Data Management Solutions, and was named one of the most forward-looking visionaries. In 2017 and 2020, it was twice rated as a leader in China's big data market by IDC. In 2018, Transwarp became the world's first database vendor to pass the TPC-DS test and official audit. In 2022, it was rated by Gartner as a world-leading vendor in the fields of data middle platform and graph database, and was selected as one of the vendors with the most database product categories in China. In the same year, it became the world's first software vendor to pass the TPCx-AI benchmark test and official audit, ranking first in the world in single-node performance. In October 2022, it was listed on the Science and Technology Innovation Board of the Shanghai Stock Exchange.

Based on the high-performance storage and computing capabilities of Transwarp's big data infrastructure platform TDH, data from different ** and different structures are cleaned and processed to form high-quality real production datasets that can be used directly for model training. Model training is carried out on the artificial intelligence platform Sophon, which integrates more than 680 existing industry models. On this basis, model training and iteration are completed easily with well-developed training tools such as graphical modeling and parameter tuning. A model trained in Sophon can be connected seamlessly to the upper-level application system, so that experimental results can be put into real production quickly.

A 985 engineering university.

Based on Transwarp's intelligent analysis tool SOPHON and the dataset used for model training, image and optical flow information were fused to achieve accurate recognition of work behavior. The scene in the warehouse is monitored 7×24, and dangerous situations such as open flames and smoke trigger timely warnings. Edge computing boxes deployed on site, with unified access to remote monitoring, address the inefficiency of manual inspection, of handling abnormal conditions, and of recording the operation behavior of warehouse management personnel.

A financial institution.
