How to train coding LLMs using small, auto-generated datasets


Translator: Li Rui.

While large language models (LLMs) like GPT-4 are highly proficient at writing software code, their cost and opacity have sparked interest in more affordable, smaller-scale coding LLMs.

These alternatives can be fine-tuned for specific tasks at low cost. One of the challenges in developing them is finding the right balance between the size of the training dataset and the performance of the model.

In response to this challenge, Microsoft introduced a new technique for training efficient coding language models with fewer examples in a recent paper. The paper introduces the WaveCoder model, which the researchers claim outperforms other coding LLMs trained on a similar number of examples.

To complement WaveCoder, Microsoft also developed CodeOcean, a curated dataset of 20,000 diverse examples. This dataset can enhance the fine-tuning of base models for coding applications.

Figure 1 CodeOcean pipeline.

While WaveCoder is an impressive model, the more interesting part of the paper is CodeOcean, the accompanying dataset. CodeOcean addresses a major challenge: creating a dataset that balances cost-effectiveness and quality. The researchers believe that a highly diverse dataset can produce impressive results even if it contains a limited number of examples.

The research team started with CodeSearchNet, an extensive coding dataset of 2 million comment-code pairs. They used a BERT-based transformer model to generate embeddings for each example, transforming the raw text into a numerical vector.
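
To make this step concrete, here is a minimal sketch of how such embeddings might be computed. The encoder choice (microsoft/codebert-base) and mean pooling are illustrative assumptions; the article does not specify the exact model used.

```python
# Minimal sketch: embed code/comment examples with a BERT-style encoder.
# "microsoft/codebert-base" and mean pooling are illustrative assumptions.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModel.from_pretrained("microsoft/codebert-base")

def embed(texts):
    """Return one embedding vector per input example."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state        # (batch, seq_len, dim)
    mask = batch["attention_mask"].unsqueeze(-1)          # ignore padding tokens
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)   # mean pooling

embeddings = embed(["def add(a, b):\n    return a + b  # adds two numbers"])
```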

They then applied a clustering algorithm to the embeddings, organizing the examples by similarity. This allowed the researchers to extract a subset of the original dataset that maximizes diversity, as sketched below.
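
A possible implementation of the selection step, assuming k-means clustering and nearest-to-centroid sampling (the article only says "a clustering algorithm" was used):

```python
# Illustrative diversity-based selection: cluster the embeddings and keep the
# example closest to each centroid. KMeans is an assumption; the article only
# says "a clustering algorithm" was applied.
import numpy as np
from sklearn.cluster import KMeans

def select_diverse_subset(embeddings: np.ndarray, k: int) -> list:
    """Pick k example indices, one representative per cluster."""
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=0).fit(embeddings)
    chosen = []
    for c in range(k):
        members = np.where(kmeans.labels_ == c)[0]
        dists = np.linalg.norm(embeddings[members] - kmeans.cluster_centers_[c], axis=1)
        chosen.append(int(members[dists.argmin()]))   # closest example to the centroid
    return chosen
```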


After establishing the core dataset, the researchers had to create training examples that contain both code and instructions. To achieve this, they built a generator-discriminator framework for producing instruction data from the original examples. Initially, they used GPT-4 to write task definitions for specific scenarios. These task definitions, combined with instruction prompts, were then provided to GPT-3.5 to generate the corresponding instructions for additional examples.
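
A hedged sketch of what the generator step could look like with the OpenAI chat API; the task definition, prompt wording, and model names here are illustrative, not the paper's actual prompts:

```python
# Hedged sketch of the generator step using the OpenAI chat API. The task
# definition and prompt wording are illustrative, not the paper's exact prompts.
from openai import OpenAI

client = OpenAI()

TASK_DEFINITION = (  # in the actual pipeline this would be produced by GPT-4
    "Code summarization: given a function, write a clear one-sentence "
    "instruction asking for a description of what it does."
)

def generate_instruction(code_snippet: str) -> str:
    """Ask GPT-3.5 to turn a raw code example into an instruction-style sample."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": TASK_DEFINITION},
            {"role": "user", "content": f"Write an instruction for this code:\n{code_snippet}"},
        ],
    )
    return response.choices[0].message.content
```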

Figure 2 CodeOcean's generator-discriminator framework.

For the discriminator component, the researchers developed a separate assessment prompt. This prompt, along with code and instruction examples, is provided to GPT-4 for evaluation. The CodeOcean pipeline then uses the examples judged to be good as seeds for generating future training examples.
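
A similar sketch of the discriminator step, again with an assumed assessment prompt rather than the paper's actual one:

```python
# Hedged sketch of the discriminator step: GPT-4 judges each generated example
# and only accepted ones are kept (and reused as seeds for later generations).
# The assessment prompt below is an assumption.
from openai import OpenAI

client = OpenAI()

ASSESSMENT_PROMPT = (
    "You are reviewing an instruction/code training example. Reply ACCEPT if "
    "the instruction is clear and matches the code, otherwise reply REJECT."
)

def is_good_example(instruction: str, code_snippet: str) -> bool:
    """Return True if GPT-4 accepts the generated example."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": ASSESSMENT_PROMPT},
            {"role": "user", "content": f"Instruction:\n{instruction}\n\nCode:\n{code_snippet}"},
        ],
    )
    return "ACCEPT" in response.choices[0].message.content.upper()
```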

Through this iterative process, the researchers generated 20,000 high-quality instruction samples. These examples span four coding task categories: code generation, code summarization, code translation (from one programming language to another), and code repair. Together, these categories cover a large portion of LLM coding tasks.

Figure 3 WaveCoder outperforms other coding LLMs trained on a similar number of examples

There are many ways to generate training examples for coding LLMs, but Microsoft's CodeOcean stands out for its emphasis on generalization and example efficiency. Unlike approaches that rely on very large volumes of data, CodeOcean achieves high performance with a much smaller dataset.

To demonstrate the effectiveness of CodeOcean, the researchers fine-tuned three families of coding language models: StarCoder-15B, Code Llama (7B and 13B), and DeepSeek-Coder-6.7B. Given the small size of the dataset, fine-tuning is fast and cost-effective. The fine-tuned models were evaluated on three key coding benchmarks: HumanEval, MBPP, and HumanEvalPack.
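
For illustration, a minimal supervised fine-tuning loop on instruction-formatted examples might look like the following; the base model (a small StarCoder variant used as a stand-in), prompt format, and hyperparameters are placeholders, not the paper's actual training configuration:

```python
# Minimal supervised fine-tuning sketch on instruction-formatted text. The base
# model, prompt format, and hyperparameters are placeholders, not the paper's
# actual training setup.
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_name = "bigcode/starcoderbase-1b"   # small stand-in for StarCoder-15B
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

examples = [{"text": "### Instruction:\nWrite a function that adds two numbers.\n"
                     "### Response:\ndef add(a, b):\n    return a + b"}]
dataset = Dataset.from_list(examples).map(
    lambda e: tokenizer(e["text"], truncation=True, max_length=512), batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="wavecoder-sft", per_device_train_batch_size=1,
                           num_train_epochs=3, learning_rate=2e-5),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```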

After training on CodeOcean, all of the models saw significant improvements on these benchmarks. For code generation, the researchers describe the effects and limitations of WaveCoder: "After the fine-tuning process, the performance of the WaveCoder model has improved significantly compared to the base model and some open-source models, but it still lags behind proprietary models (such as GPT-4 and Gemini), as well as instruction-tuned models trained on more than 70,000 training examples."

The performance gap between WaveCoder and WizardCoder, which was trained on 78,000 examples, is minimal. This shows that "refined and diverse instruction data can significantly improve the efficiency of instruction tuning."

WaveCoder is particularly strong at code summarization and code repair tasks, outperforming other open-source models on almost all programming languages. This success underscores the effectiveness of "defining and classifying" related tasks to enhance the generalization capabilities of LLMs.

While Microsoft has yet to release the models, code, and data for WaveCoder and CodeOcean, discussions on Hugging Face suggest that the company is reviewing whether to release them to the public. Going forward, the researchers aim to explore the effects of larger datasets, as well as the potential benefits of combining CodeOcean with other coding datasets.
