MathPile: a high-quality, large-scale mathematical corpus

Mondo Technology Updated on 2024-01-31

Introduction

MathPile is a high-quality, large-scale mathematical corpus of 29 GB containing approximately 9.5 billion tokens. It covers content from K-12 through college and graduate level to math competitions, and includes high-quality textbooks, handouts, scientific papers, and more. It ships with detailed data documentation, including a dataset sheet and quality annotations, which increases transparency and lets users tailor the data to their needs.

Data and processing: The raw data was collected from a number of different sources, totaling approximately 520 billion tokens, or 2.2 TB of data.

Source data includes StackExchange, ProofWiki, Common Crawl, arXiv, and others. This data goes through a rigorous series of processing steps: pre-processing and pre-filtering, language identification, cleaning and filtering, and deduplication.
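The order of these stages can be illustrated with a toy example. This is a minimal sketch, not the project's actual pipeline: the helper names (`clean`, `looks_english`, `long_enough`, `run_pipeline`), the ASCII-ratio check standing in for a language-identification model, the word-count threshold, and the hash-based exact deduplication are all assumptions made for illustration.

```python
import hashlib
import re


def clean(text: str) -> str:
    """Collapse runs of whitespace; a stand-in for real cleaning rules."""
    return re.sub(r"\s+", " ", text).strip()


def looks_english(text: str, min_ascii_ratio: float = 0.9) -> bool:
    """Crude stand-in for a language-identification model (assumption)."""
    if not text:
        return False
    ascii_chars = sum(ch.isascii() for ch in text)
    return ascii_chars / len(text) >= min_ascii_ratio


def long_enough(text: str, min_words: int = 5) -> bool:
    """Hypothetical quality filter: drop very short documents."""
    return len(text.split()) >= min_words


def run_pipeline(docs):
    """Clean, filter, then drop exact duplicates (case-insensitive)."""
    seen = set()
    kept = []
    for raw in docs:
        text = clean(raw)
        if not looks_english(text) or not long_enough(text):
            continue
        digest = hashlib.sha256(text.lower().encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # an identical document was already kept
        seen.add(digest)
        kept.append(text)
    return kept


docs = [
    "Let  f be a continuous function on the interval [0, 1].",
    "let f be a continuous function on the interval [0, 1].",
    "Too short.",
    "设 f 为区间上的连续函数，则 f 有界且可达到最大值。",
]
print(run_pipeline(docs))
```

Of the four toy documents, only the first survives: the second is a case-insensitive duplicate of it, the third is too short, and the fourth fails the language check.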

MathPile corpus: The result of this processing is a math-centric corpus, MathPile. It totals 29 GB of data and contains about 903,000 documents and about 9.5 billion tokens.

Key features:

1. Math-centric: MathPile is designed specifically for the math domain, which sets it apart from corpora with a general or multilingual focus.

2. Diversity: MathPile aggregates data from a wide range of sources, including textbooks (including handouts), arXiv, Wikipedia, ProofWiki, StackExchange, and web pages. It covers math content suitable for K-12, college, graduate level, and math competitions. In particular, the project has published a significant collection of high-quality textbooks (about 0.19B tokens).

3. High quality: The project follows the principle of "less is more" and holds that data quality matters more than quantity, even in the pre-training stage. Its data collection and processing effort includes extensive pre-processing, pre-filtering, cleaning, filtering, and deduplication to ensure the high quality of the corpus.

4. Data documentation: To enhance transparency, the project provides detailed data documentation, including a dataset sheet and quality annotations such as language identification scores and symbol-to-word ratios. This gives users the flexibility to tailor the data to their needs.
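To make the idea of tailoring data by quality annotations concrete, here is a minimal sketch. The record layout (`AnnotatedDoc`), the definition of the symbol-to-word ratio (whitespace tokens without any letter counted as symbols), and both thresholds are assumptions for illustration; the corpus's actual annotation fields and definitions may differ.

```python
from dataclasses import dataclass


@dataclass
class AnnotatedDoc:
    text: str
    lang_score: float            # hypothetical language-identification confidence
    symbol_to_word_ratio: float  # computed below under an assumed definition


def symbol_to_word_ratio(text: str) -> float:
    """Ratio of tokens with no letters to tokens with at least one letter."""
    tokens = text.split()
    if not tokens:
        return 0.0
    words = sum(1 for t in tokens if any(ch.isalpha() for ch in t))
    symbols = len(tokens) - words
    return symbols / max(words, 1)


def annotate(text: str, lang_score: float) -> AnnotatedDoc:
    """Attach annotations; lang_score would come from a language-ID model."""
    return AnnotatedDoc(text, lang_score, symbol_to_word_ratio(text))


def select(docs, min_lang_score: float = 0.8, max_symbol_ratio: float = 0.5):
    """Keep documents whose annotations pass user-chosen thresholds."""
    return [
        d for d in docs
        if d.lang_score >= min_lang_score
        and d.symbol_to_word_ratio <= max_symbol_ratio
    ]


docs = [
    annotate("The sum of two even numbers is even", lang_score=0.99),
    annotate("= + - * / 1 2 3 x", lang_score=0.95),
    annotate("Somme de deux nombres pairs", lang_score=0.20),
]
for d in select(docs):
    print(d.text)
```

Because the annotations are stored alongside each document, a user who wants a stricter or looser subset only needs to change the thresholds passed to `select`, not reprocess the raw text.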

Data contamination detection was also performed, removing documents that duplicate questions from benchmark test sets such as MATH and MMLU-STEM.
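One common way to check a corpus against benchmark questions is word n-gram overlap. The sketch below uses that technique purely as an illustration; it is not claimed to be the method the project used, and the function names, the n-gram length of 8, and the sample question (which is made up, not taken from MATH or MMLU-STEM) are all assumptions.

```python
import re


def ngrams(text: str, n: int = 8) -> set:
    """Set of lowercase word n-grams; empty if the text has fewer than n words."""
    tokens = re.findall(r"\w+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def is_contaminated(document: str, benchmark_questions, n: int = 8) -> bool:
    """Flag a document that shares at least one n-gram with any question."""
    doc_grams = ngrams(document, n)
    return any(ngrams(q, n) & doc_grams for q in benchmark_questions)


# Made-up question used only for this illustration.
benchmark = [
    "What is the smallest positive integer that is divisible by both 6 and 8?"
]
leaked = (
    "Forum post: what is the smallest positive integer that is "
    "divisible by both 6 and 8? I think 24."
)
unrelated = (
    "The least common multiple of two integers can be computed "
    "from their greatest common divisor."
)
print(is_contaminated(leaked, benchmark))
print(is_contaminated(unrelated, benchmark))
```

Note that with this approach a benchmark question shorter than `n` words yields no n-grams and can never trigger a match, so very short questions would need a different check, such as exact string matching.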

With this specialized corpus, researchers and developers are able to more effectively improve the capabilities of language models in mathematical reasoning.

Project address: gair-nlp.github.io/mathpile/

Paper: arxiv.org/abs/2312.17120

GitHub: github.com/gair-nlp/mathpile

Dataset: huggingface.co/datasets/gair/mathpile
