The first important step in training large language models (e.g., the Wenxin Yiyan large model) is corpus collection. A corpus is the large body of text data used to train a model. This article introduces in detail the process of corpus collection, the sources a corpus can be drawn from, and the criteria for evaluating corpus quality.
The process of corpus collection:
1. Definition of objectives:
Determine the purpose and application area of the model. For example, the goal of a general-purpose language model might be to understand and generate text for a variety of tasks and domains.
Based on the goal, determine the type, size, and diversity of the corpus you need.
2. Data Source Identification:
Make a list of possible data sources, such as web pages, books, news articles, academic papers, social media posts, and more.
Given the diversity of the data, it may be necessary to collect data from a variety of sources.
3. Legal and Ethical Considerations:
Ensure compliance with all relevant data use and privacy laws.
Obtain the necessary licenses or permissions to use a specific data source.
To ensure ethical data collection, avoid involving sensitive or private information.
4. Data Capture and Collection:
Use web crawlers, APIs, or other tools to scrape data from sources (a minimal scraping sketch follows this step).
Obtain data from existing datasets or from partners.
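As a rough illustration of the scraping option, the sketch below fetches a single page and extracts its visible text using Python's requests and BeautifulSoup libraries. The URL and function name are placeholders, and a real crawler would also need rate limiting, robots.txt compliance, and retry logic.

```python
import requests
from bs4 import BeautifulSoup

def fetch_page_text(url: str, timeout: int = 10) -> str:
    """Download one page and return its visible text."""
    resp = requests.get(url, timeout=timeout,
                        headers={"User-Agent": "corpus-collector/0.1"})
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    for tag in soup(["script", "style"]):  # drop non-visible elements
        tag.decompose()
    return soup.get_text(separator="\n", strip=True)

# Placeholder URL -- not a real data source.
print(fetch_page_text("https://example.com")[:200])
```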
5. Preprocessing:
Purge irrelevant, redundant, or low-quality text.
Format the text as necessary.
Label or segment the data if needed.
Remove or anonymize sensitive or private information (see the cleaning sketch below).
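A minimal sketch of the cleaning and anonymization ideas above, assuming simple regex-based masking. The patterns only catch obvious e-mail and phone formats; production pipelines typically rely on dedicated PII-detection tooling.

```python
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE_RE = re.compile(r"\b\d{3}[-\s]?\d{3,4}[-\s]?\d{4}\b")

def preprocess(text: str) -> str:
    """Mask obvious private identifiers, then normalize whitespace."""
    text = EMAIL_RE.sub("[EMAIL]", text)      # anonymize e-mail addresses
    text = PHONE_RE.sub("[PHONE]", text)      # anonymize phone-like numbers
    return re.sub(r"\s+", " ", text).strip()  # collapse whitespace runs

print(preprocess("Contact alice@example.com  or 555-123-4567 today."))
# -> "Contact [EMAIL] or [PHONE] today."
```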
6. Data Enhancement and Balance:
If certain categories or topics are too sparse in the corpus, consider data augmentation techniques such as re-sampling, generating new data, or drawing on existing small, domain-specific datasets (a re-sampling sketch follows this step).
Ensure representation and balance of various topics, domains, and styles in the corpus.
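To make the re-sampling option concrete, here is a minimal sketch that duplicates documents from sparse topics until each topic reaches a floor. The function and data layout are hypothetical, and real augmentation would more often generate new text (e.g., paraphrasing) than copy existing documents.

```python
import random
from collections import defaultdict

def oversample(docs, min_per_topic, seed=0):
    """Re-sample sparse topics (with replacement) up to a minimum count."""
    rng = random.Random(seed)
    by_topic = defaultdict(list)
    for topic, text in docs:
        by_topic[topic].append(text)
    balanced = []
    for topic, texts in by_topic.items():
        sampled = list(texts)
        while len(sampled) < min_per_topic:
            sampled.append(rng.choice(texts))  # duplicate a random document
        balanced.extend((topic, t) for t in sampled)
    return balanced

docs = [("sports", "a"), ("sports", "b"), ("finance", "c")]
print(oversample(docs, min_per_topic=2))
```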
7. Data Storage:
Store and organize the collected data using a suitable database or file format (one common option is sketched below).
Ensure a backup and recovery strategy for your data.
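One common storage choice, assumed here purely for illustration, is the JSON Lines format: one JSON object per document, which is easy to stream, append to, and back up.

```python
import json

def save_jsonl(records, path):
    """Write one JSON object per line -- a common corpus storage format."""
    with open(path, "w", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")

def load_jsonl(path):
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f]

save_jsonl([{"source": "web", "text": "example document"}], "corpus.jsonl")
print(load_jsonl("corpus.jsonl"))
```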
8. Assessment and Feedback:
Carry out a preliminary analysis of the collected data, checking its quality and representativeness.
Adjust the data collection strategy or sources based on the results of the analysis.
9. Iteration:
The process is usually iterative: based on the model's initial training results or new data needs, the corpus may need to be revisited and adjusted.
After corpus collection is completed, the next steps are usually data preprocessing, vocabulary construction, model design, and so on. In any case, a high-quality and diverse corpus is a key factor in the success of large language models.
Sources of corpus collection:
A corpus can come from a variety of sources, depending on factors such as your needs, the purpose of the model, and the permissions available to access the data. By presentation, a corpus can be divided into online and offline; by type of producer, it can be divided into user-produced, expert-produced, and government/institution-produced corpora. Below we combine these two dimensions and list the data collection strategies for each combination:
1. Online: a. User-generated content (no requirement for accuracy):
Social media platforms: e.g. Weibo, Twitter, Facebook, Instagram, Reddit, etc.
Review platforms: such as Douban, Amazon product reviews, App Store app reviews, etc.
Forums and communities: e.g. Tieba, V2EX, Stack Overflow (for technical issues), etc.
Blogs and personal websites.
b. Content produced by experts (with requirements for accuracy):
Academic databases: e.g. Google Scholar, PubMed, IEEE Xplore, etc.
Expert blogs and columns: e.g. Medium, Zhihu columns, etc.
Official reports of research institutes and societies: such as official announcements of academic conferences, etc.
Online courses and lectures: e.g. expert lectures or course materials from Coursera, edX, Udemy.
c. Content produced by governments and institutions (with requirements for accuracy):
Official government websites: announcements, regulations, reports, press releases, etc.
Public databases: statistics, public records, etc.
Official social media accounts: news updates, policy promotions, etc.
2. Offline: a. User-generated content:
Oral interviews and focus groups.
User-generated paper materials: e.g. handwritten notes, letters, diaries, etc.
Public events and gatherings: Consider audio or video recording of relevant parts.
b. Content produced by experts:
Academic conferences and seminars: reports, lectures, speeches, etc.
Books and monographs: especially those written by authoritative experts.
Expert seminars and workshops.
c. Content produced by governments and institutions:
Published paper materials: e.g. reports, announcements, legal documents, etc.
Public events they host or participate in: such as press conferences, public hearings, etc.
Official audio or video recordings.
For all of these sources, and especially offline ones, it is important to ensure that all relevant legal and ethical requirements are followed when collecting, using, and storing data, particularly where privacy and copyright are concerned.
Evaluation criteria for corpus:
Assessing the quality of a collected corpus is a critical step in ensuring that it provides a solid basis for subsequent model training or other applications. Here are some methods and indicators for assessing the quality of corpus collection:
1. Diversity:
Make sure that the corpus covers a variety of topics and styles in the target area or application.
Check for duplicate text or text that over-represents a particular subset (a simple duplicate check is sketched below).
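A minimal sketch of an exact-duplicate check, assuming case- and whitespace-normalized hashing. Near-duplicates (paraphrases, boilerplate variants) require fuzzier techniques such as MinHash or SimHash, which this sketch does not cover.

```python
import hashlib

def find_exact_duplicates(texts):
    """Group documents whose normalized content hashes to the same value."""
    seen, duplicates = {}, []
    for i, text in enumerate(texts):
        normalized = " ".join(text.split()).lower()
        digest = hashlib.md5(normalized.encode("utf-8")).hexdigest()
        if digest in seen:
            duplicates.append((seen[digest], i))
        else:
            seen[digest] = i
    return duplicates

texts = ["Hello  world", "hello world", "something else"]
print(find_exact_duplicates(texts))  # -> [(0, 1)]
```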
2. Representativeness:
The corpus should be a true reflection of language usage in the target domain or application.
For example, a corpus collected for a news app should contain texts from various areas of journalism (e.g., politics, economics, entertainment, etc.).
3. Accuracy and Authenticity:
The facts, figures, and information in the corpus should be accurate.
For user-generated content, screening and validation may be required to eliminate errors or misinformation.
4. Completeness:
Check that texts are complete and have not been truncated or partially lost.
5. Format and structure:
Check that the corpus has a uniform and clear format for subsequent processing.
Check for encoding errors, garbled characters, or inconsistent formatting (a simple heuristic is sketched below).
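A rough heuristic for flagging encoding problems, based on a few well-known mojibake markers (the Unicode replacement character and typical UTF-8-read-as-Latin-1 artifacts). The marker list is illustrative, not exhaustive.

```python
def flag_encoding_problems(texts):
    """Return indices of documents showing common mojibake symptoms."""
    markers = ("\ufffd",      # Unicode replacement character
               "Ã©", "â€™")   # UTF-8 bytes mis-decoded as Latin-1
    return [i for i, t in enumerate(texts)
            if any(m in t for m in markers)]

docs = ["clean text", "broken \ufffd text", "donâ€™t stop"]
print(flag_encoding_problems(docs))  # -> [1, 2]
```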
6. Grammar and spelling:
For applications that require high-quality text, check the corpus for grammatical and spelling errors.
7. Noise level:
Evaluate noise in the corpus such as irrelevant text, ads, links, HTML tags, etc.
Ensure that this noise is handled correctly in subsequent data-cleaning steps (a minimal stripping sketch follows).
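A minimal noise-stripping sketch using regular expressions to drop HTML tags and bare URLs. Regex-based tag removal is approximate; a proper HTML parser is more robust on messy markup.

```python
import re

def strip_noise(text: str) -> str:
    """Remove HTML tags and bare URLs -- two frequent noise sources."""
    text = re.sub(r"<[^>]+>", " ", text)       # drop HTML tags
    text = re.sub(r"https?://\S+", " ", text)  # drop links
    return re.sub(r"\s+", " ", text).strip()   # collapse whitespace

print(strip_noise("<p>Read more at https://example.com <b>now</b></p>"))
# -> "Read more at now"
```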
8. Bias and Fairness:
Assess the corpus for unfairness or bias to ensure that it does not negatively impact subsequent applications.
For example, a corpus that contains gender, ethnic, or cultural biases can cause the model to perform poorly in certain applications (a rough frequency probe is sketched below).
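As a very rough first-pass probe, one can count occurrences of chosen terms across the corpus and compare their frequencies. The probe list below is hypothetical; a serious bias audit would use curated lexicons and more sophisticated measures.

```python
import re
from collections import Counter

def term_frequencies(texts, terms):
    """Count whole-word occurrences of probe terms across the corpus."""
    counts = Counter()
    for text in texts:
        lowered = text.lower()
        for term in terms:
            counts[term] += len(re.findall(rf"\b{re.escape(term)}\b", lowered))
    return counts

# Hypothetical probe list -- a real audit would use curated lexicons.
print(term_frequencies(["He is a doctor. She is a nurse."], ["he", "she"]))
# -> Counter({'he': 1, 'she': 1})
```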
9. Timeliness:
For some applications, it is important to evaluate the timeliness of the corpus. For example, a news or social corpus should reflect recent events and topics.
In summary, assessing the quality of corpus collection is a multifaceted task that requires consideration of multiple factors and indicators. By regularly and systematically evaluating the quality of the corpus, it is possible to ensure that subsequent model training and application are supported by high-quality data.
Hotspot Engine Project Conclusion:
The above is a detailed description of the corpus collection stage of large model training. In the next article, we will cover the corpus cleaning and preprocessing stage of large model training in detail.