Meta's heavyweight work: watermarking training data to tell whether it has been used by large models

Mondo Technology Updated on 2024-03-07

To make LLMs behave more like humans, it is often necessary to collect a lot of data to fine-tune the model. Before the LLM era, crowdsourcing was the primary way to obtain annotated data.

Since the advent of LLMs, researchers have discovered that they can generate synthetic data from powerful models such as Bard, ChatGPT, or Claude to fine-tune their own models, which is more time-saving, labor-saving, and cost-effective than crowdsourcing.

However, this process involves data generated using other models, which can raise copyright and intellectual property issues. For example, if one model is used to generate training data, which in turn is used to train another model, is the latter a derivative of the former?

To trace synthetic text back to its source, the output of an LLM can be watermarked, much like copyright protection, so that synthetic data can be detected. This greatly strengthens the security of large models.

Recently there have been many papers on LLM watermarking techniques. The article introduced today comes from Meta. It does not study how to watermark an LLM while preserving output quality; instead it takes another angle and studies the "radioactivity" of watermarked text: what happens when watermarked text is used as fine-tuning data, and how much potential "contamination" power does it have over the model?

**Title:** Watermarking Makes Language Models Radioactive

For readers who are not familiar with watermarking techniques, here is a brief introduction.

The decoding process of an LLM is simple: it takes the context (a sequence of tokens from the model's vocabulary V) as input and outputs a vector of logits of size |V|. This vector is converted by a softmax into the probability distribution of the next token, and the next token is then selected with top-k sampling, nucleus sampling, or similar methods.
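To make the decoding step concrete, here is a minimal Python sketch of turning logits into a next token with temperature and top-k sampling. The `logits` tensor is assumed to come from whatever model is being used; this is illustrative, not the paper's code.

```python
# Minimal sketch of next-token decoding: softmax over logits, then top-k sampling.
import torch

def sample_next_token(logits: torch.Tensor, top_k: int = 50, temperature: float = 1.0) -> int:
    """logits: 1-D tensor of size |V| for the next position."""
    logits = logits / temperature
    topk_vals, topk_idx = torch.topk(logits, top_k)   # keep only the k most likely tokens
    probs = torch.softmax(topk_vals, dim=-1)          # probability distribution over the top-k
    choice = torch.multinomial(probs, num_samples=1)  # sample one token from the distribution
    return int(topk_idx[choice])
```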

Watermark embedding changes the logits vector or the sampling process. Typically, a cryptographic function keyed by a secret key hashes the previous k tokens, and the output is used as the seed of a random number generator, which then influences the choice of the next token.
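Below is a minimal sketch of this kind of watermark embedding, in the spirit of the green-list approach of [2]: the secret key and the previous k tokens are hashed to seed an RNG that picks a "green" subset of the vocabulary whose logits are boosted before sampling. The parameter names (`gamma`, `delta`) and the hashing scheme are illustrative assumptions, not the exact scheme used in the paper.

```python
# Sketch: bias the logits toward a pseudo-random "green list" keyed by the
# secret key and the k previous tokens (the watermark context window).
import hashlib
import torch

def watermark_logits(logits: torch.Tensor, prev_tokens, secret_key: int,
                     k: int = 2, gamma: float = 0.25, delta: float = 2.0):
    vocab_size = logits.shape[-1]
    window = ",".join(str(t) for t in prev_tokens[-k:])             # previous k tokens
    digest = hashlib.sha256(f"{secret_key}|{window}".encode()).hexdigest()
    seed = int(digest, 16) % (2**31)                                # seed from key + window
    rng = torch.Generator().manual_seed(seed)
    green = torch.randperm(vocab_size, generator=rng)[: int(gamma * vocab_size)]
    biased = logits.clone()
    biased[green] += delta                                          # boost green tokens
    return biased, set(green.tolist())
```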

Watermark detection: the text under inspection is tokenized, the secret seed generation is replayed, and each token is scored. The per-token score can thus be written as a function that takes the current token as input and returns a score relative to the seed and the preceding token sequence:

Then, based on a statistical test of the cumulative score and the number of scored tokens, it is determined whether the text contains the digital watermark.
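Continuing the sketch above, detection replays the seeded green-list construction, accumulates a score (here, how many tokens fall in their green list), and turns it into a p-value with a one-sided z-test against the null hypothesis "the text is not watermarked". This reuses the hypothetical `watermark_logits` helper and is only a sketch of the general recipe.

```python
# Sketch: replay the seed generation, score each token, run a statistical test.
import math
import torch

def detect_watermark(tokens, secret_key: int, vocab_size: int,
                     k: int = 2, gamma: float = 0.25):
    score, n = 0, 0
    for t in range(k, len(tokens)):
        _, green = watermark_logits(torch.zeros(vocab_size), tokens[:t],
                                    secret_key, k=k, gamma=gamma)
        score += int(tokens[t] in green)            # cumulative score
        n += 1
    z = (score - gamma * n) / math.sqrt(gamma * (1 - gamma) * n)
    p_value = 0.5 * math.erfc(z / math.sqrt(2))     # P(score at least this high under H0)
    return z, p_value
```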

Let's say Alice owns language model A and Bob owns another language model B. Bob fine-tunes B on a small amount of text generated by A. The dataset used to fine-tune B is denoted D, and the part of D generated by A is denoted D^A. The proportion of A-generated data in the overall fine-tuning data is defined as ρ = |D^A| / |D|.

Based on this, the paper defines four settings (supervised, unsupervised, open model, closed model) and crosses them to form the tests shown in the following figure:

- Supervised setting: Bob queries A through an identifiable account, and Alice keeps everything that A generates for Bob, so Alice knows that set of texts. The degree of supervision d is the fraction of this stored data that Bob actually used for fine-tuning.
- Unsupervised setting: Bob does not use an identifiable account, or hides behind other users, to query model A, so d = 0. This is the most realistic scenario. Note that d = 0 and ρ = 0 are not the same concept: ρ = 0 means B has never seen any output of A, whereas d = 0 only means Alice has no record of what Bob used.
- Open model: if B is open-sourced, or Alice has obtained access to B through legitimate channels, Alice can observe the logits that B outputs.
- Closed model: Alice can only query B through an API that returns generated text and exposes no probability vectors or logits. This is the case for most chatbots.

The paper then explores the potential "radioactivity" of LLM watermarks. The term, coined in [1], refers to the ability of watermarked text to contaminate a model when it is used as fine-tuning data.

A text corpus is said to be α-radioactive for B if there is a statistical test for the null hypothesis H0: "B was not trained on this corpus" that rejects H0 at a significance level (p-value) smaller than α.

Similarly, a model A is said to be α-radioactive for B if there is a statistical test for H0: "B was not trained on any output of A" that rejects H0 at a p-value smaller than α.

This quantifies the radioactivity of a dataset or a model. A lower α indicates stronger radioactivity, because the detection test rejects H0 with higher confidence, while a large α means low radioactivity (the detection is essentially a random test).

Next, the authors propose a method to detect the radioactivity of non-watermarked text and watermarked text in a language model.

Another relevant technique is Membership Inference Attacks (MIAs), whose goal is to tell whether a piece of data belongs to a model's training set: if it does, it is a member, otherwise a non-member. This coincides with this paper's goal of detecting whether text generated by model A was used to train model B, so the authors' subsequent approach is inspired by MIAs.

In the supervised, open-model case, an MIA assesses the radioactivity of a sample sentence by observing B's loss (or perplexity) on a carefully selected set of inputs, expecting lower perplexity on samples seen during training (this is sometimes referred to as a loss attack).

This paper extends the idea into a baseline radioactivity detection test for text corpora without watermarks. The text corpus is split into sentences (256 tokens each) and B's loss on each sentence is computed. The loss is calibrated with zlib entropy; the purpose of the calibration is to account for the intrinsic complexity of each sample and separate it from B's overconfidence on text it has memorized.
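A minimal sketch of such a calibrated loss, assuming a HuggingFace-style causal LM and tokenizer for B (hypothetical objects, not the paper's code): the per-sentence cross-entropy is divided by the sentence's zlib-compressed length so that intrinsically simple or repetitive text is not mistaken for memorized text.

```python
# Sketch: zlib-calibrated loss of model B on one sentence (256-token truncation).
import zlib
import torch

def calibrated_loss(model, tokenizer, sentence: str) -> float:
    ids = tokenizer(sentence, return_tensors="pt",
                    truncation=True, max_length=256).input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss.item()                # mean token cross-entropy
    zlib_entropy = len(zlib.compress(sentence.encode("utf-8")))  # intrinsic complexity proxy
    return loss / zlib_entropy
```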

The null hypothesis of the test is: "the calibrated perplexity of B on the suspect corpus follows the same distribution as on new text generated by A". Indeed, if B was not fine-tuned on that part of the data, H0 must hold.

To compare the two empirical distributions, a two-sample Kolmogorov-Smirnov test is used: the test statistic is the maximum distance between the two empirical cumulative distribution functions over loss values.

If this distance exceeds the threshold that determines the p-value of the test, H0 is rejected and the corpus is concluded to be radioactive for B.
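Putting the baseline together, here is a sketch of the K-S test on two lists of calibrated losses (one from the suspect corpus, one from fresh text generated by A); `scipy.stats.ks_2samp` computes the distance between the two empirical CDFs and the corresponding p-value. The threshold `alpha` is an illustrative choice.

```python
# Sketch: baseline radioactivity test via a two-sample Kolmogorov-Smirnov test.
from scipy.stats import ks_2samp

def baseline_radioactivity_test(losses_suspect, losses_fresh, alpha: float = 1e-3) -> bool:
    stat, p_value = ks_2samp(losses_suspect, losses_fresh)
    print(f"KS distance = {stat:.3f}, p-value = {p_value:.2e}")
    return p_value < alpha   # True: reject H0, the corpus is declared radioactive for B
```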

Now consider the case where A's outputs are watermarked with a given method and secret key. In this case the paper provides a detection test for each combination of the open-model, closed-model, supervised, and unsupervised settings, as shown in the following table.

The scoring function depends on the observed tokens and the secret key; the watermark detection test depends on the cumulative score and the number of scored tokens. The null hypothesis of the test is: "the text was not generated by this watermarking method with this secret key".

Radioactivity can then be detected by running watermark detection on a large amount of text generated by B. Indeed, if B has never seen the watermark, it cannot produce text watermarked under this method and key, so if "B did not use A's output" holds, then H0 is true.

However, traces of the watermark can only be found on the k-grams that actually appeared in the watermarked training text. Even assuming these k-grams are strongly watermarked and B has memorized them, they still make up only a small fraction of all the k-grams being tested, so this naive test performs poorly.

To compensate for the shortcomings of this naive approach, the paper introduces two methods depending on the type of access to B:

- Closed model: use B to generate new text by prompting it. In the supervised setting, B is prompted only with the (watermarked) text that Alice generated for Bob. In the unsupervised setting, B is prompted with new text drawn from the same distribution as the data suspected of having been used for training.
- Open model: instead of generating new text with B, sentences are passed directly through B. For a token sequence, B's forward pass gives the most likely next token at each position under B's decoding, and these predictions are scored with the watermark scoring algorithm, as shown in the figure below.

To further improve detection, a filter φ is introduced: a set of k-grams that are likely to have been seen during training. A token is only scored if the k-gram window preceding it (the watermark context window used for hashing) belongs to φ. This focuses the score computation on k-grams that may actually have learned the watermark.

In the fully supervised setting, B's training data is known exactly, so φ consists of the k-grams that were actually used during training.

In the unsupervised setting, φ focuses on the k-grams that are "likely" contaminated, e.g., k-grams that appear frequently in the watermarked text generated by A. The filter is only used in the closed-model setting.

Moreover, since in practice radioactivity only becomes observable when scoring a very large number of tokens, traditional watermark detection would be biased by repeated k-grams in the token distribution. Therefore, a token is only scored if its preceding k-gram has not already been seen during detection. This keeps the p-values reliable even when analyzing many tokens.
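The filtered, de-duplicated scoring described above might look like the following sketch, where `phi` is the set of likely-contaminated watermark windows and `score_token(window, token)` stands in for the watermark scoring function keyed by the secret key (both are assumptions for illustration):

```python
# Sketch: score tokens for radioactivity, restricted to the filter phi and
# skipping watermark windows that have already been scored (de-duplication).
def radioactivity_score(tokens, phi, k: int, score_token):
    seen = set()
    total, n = 0.0, 0
    for t in range(k, len(tokens)):
        window = tuple(tokens[t - k:t])        # watermark context window used for hashing
        if phi is not None and window not in phi:
            continue                           # only score likely-contaminated k-grams
        if window in seen:
            continue                           # de-duplication keeps p-values reliable
        seen.add(window)
        total += score_token(window, tokens[t])
        n += 1
    return total, n                            # fed into the watermark statistical test
```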

Now suppose that the pre-trained LLM B is fine-tuned on instruction/answer pairs generated by model A. The authors demonstrate the radioactivity of watermarked synthetic instructions through experiments and compare their radioactivity levels with those of unwatermarked instructions.

Model A is Llama-2-chat-7B. Following the self-instruct approach, the prompt starts with an instruction followed by three example instruction/answer pairs, and the model is asked to generate the next 20 instruction/answer pairs.

When sampling from the LLM's logits, the output is either left unwatermarked or watermarked with the method of [2], yielding a dataset of 100k instruction/answer pairs.

Finally, six mixed datasets were created with different proportions ρ of watermarked data, the rest of each dataset being filled with unwatermarked instructions.
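A small sketch of how such mixed fine-tuning sets could be assembled from a watermarked pool and an unwatermarked pool for a given proportion rho (the dataset size and function name are illustrative assumptions, not the paper's code):

```python
# Sketch: build a fine-tuning set in which a fraction rho is watermarked.
import random

def mix_datasets(watermarked, unwatermarked, rho: float, size: int = 100_000, seed: int = 0):
    rng = random.Random(seed)
    n_wm = int(rho * size)
    mixed = rng.sample(watermarked, n_wm) + rng.sample(unwatermarked, size - n_wm)
    rng.shuffle(mixed)                 # interleave watermarked and clean examples
    return mixed
```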

These six synthetic datasets were used for training: following Alpaca's recipe, the model B = Llama-1-7B was fine-tuned. Its base model was pre-trained on different data than A = Llama-2, which avoids the bias of using the same base model in the fine-tuning process.

The following figure shows examples of answers generated after fine-tuning Bob's model B on the six synthetic datasets:

In addition, the paper checks the output quality of the fine-tuned model B via 0-shot scores on several standard question-answering benchmarks such as Natural Questions and TriviaQA.

As expected, instruction fine-tuning had no effect on most benchmark results, but improved the MMLU score. This confirms that watermarking the instructions does not significantly impact fine-tuning.

In the no-watermark case, the setting is as follows: Alice has open-model access to B and knows all the data she generated for Bob (supervised setting), and Bob used a fraction d of that data for fine-tuning B. In the experiment, the K-S test is used to distinguish the calibrated perplexity of B on two sets of instruction/answer pairs (truncated to 256 tokens): a set of 5k that were never part of B's fine-tuning data, and a set in which d · 5k were used for fine-tuning while the remaining (1 - d) · 5k were not. The latter set simulates the situation where Bob generates a large amount of data but only fine-tunes on a small portion of it.

The figure below compares the distributions for d = 0 and d > 0. As d decreases, detection becomes more challenging: the suspect data contains more and more text that was not used in fine-tuning, so the difference between the two perplexity distributions becomes smaller and smaller.

The table below presents the p-values obtained from the radioactivity test. When d > 2%, the test rejects the null hypothesis with a high level of significance, meaning that when radioactive contamination is detected, the probability of a false positive is very small.

As d decreases, the power of the test drops. In the edge case d = 0, i.e., the unsupervised setting where Alice has no knowledge of the data Bob used, the test is no better than random. In contrast, the following sections show that radioactivity detection on watermarked data can succeed even in this setting.

This section discusses radioactivity detection in the open-model setting, where ρ denotes the proportion of watermarked data from A in B's fine-tuning dataset. The results are again split into the supervised (d = 1) and unsupervised (d = 0) settings, as shown in the following figure:

As with the MIA method, the supervised setting is straightforward: even if only 1% of Bob's fine-tuning data comes from A, the p-value of the radioactivity detection is extremely small, i.e., the contamination is easily detected.

Conversely, when d = 0, MIA is no longer applicable, but the open-model radioactivity detection test proposed in this paper still works when only 5% of the instructions used to fine-tune B come from A. In this case, detection is performed on a text corpus that contains none of the samples Bob actually used; it does, however, contain k-grams that may overlap with the k-grams in Bob's training data, which is what makes the radioactivity detectable.

In the closed-model setting, Bob's model B can only be accessed through an API that generates answers from prompts. The figure below compares detection results with 1% watermarked data in the fine-tuning set, under the closed-model, supervised setting (d = 1), with and without the filter.

As expected, the confidence of the detection test increases with the number of tokens scored. In addition, using the filter consistently improves the results.

Membership inference attacks are very effective when Alice knows exactly which data was used to train Bob's model and has full access to that model; in that case she can show with high confidence that Bob used data generated by her model. However, this applies to a much narrower range of situations.

The watermark-based detection method proposed in this paper can identify radioactivity in B under a much wider range of settings. For example, even without supervision of Bob's training examples (the most realistic scenario), the detection works when B is only accessible via API, as long as at least 10% of the fine-tuning data is watermarked.

The test is even more powerful when model B is open; it can detect radioactivity with 10% of the text watermarked.

The following table shows the effects of learning rate, fine-tuning algorithm, number of training epochs, and model size on radioactivity. The results show that the better the model fits the fine-tuning data, the easier it is to detect its radioactivity. For example, multiplying the learning rate by 10 almost doubles the average confidence of the radioactivity test.

The following table varies the watermark window size k. The results show that, with the p-value of the watermark detection on the training text held fixed, the confidence of the radioactivity detection decreases as the window size k increases. The authors give two main reasons. First, for lower k, each k-gram has a higher chance of being repeated in the training data, which strengthens its memorization by the model. Second, the number of possible k-grams is |V|^k, which grows with k, while the number of watermarked tokens m is fixed; therefore, as k increases, the number of radioactive k-grams encountered at detection time decreases, which reduces the power of the test.
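A quick back-of-the-envelope computation illustrates the second reason: the number of possible k-grams |V|^k explodes with k while the number of watermarked training tokens stays fixed, so the fraction of k-grams that can be radioactive shrinks rapidly (the vocabulary size and token budget below are illustrative assumptions).

```python
# Sketch: how the space of possible k-grams grows with the window size k.
vocab_size = 32_000        # e.g. a Llama-style vocabulary (illustrative)
m = 1_000_000              # fixed number of watermarked training tokens (illustrative)
for k in range(1, 5):
    n_kgrams = vocab_size ** k
    print(f"k={k}: |V|^k = {n_kgrams:.2e}, max fraction coverable = {m / n_kgrams:.2e}")
```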

Finally, suppose Alice has no prior knowledge of the fine-tuning data distribution, i.e., she does not even know whether the language is Italian, French, English, Spanish, or German.

As shown in the table below, B is prompted with the openings of Wikipedia articles in the corresponding language and the generated next tokens are checked. The results show that even if Alice is unaware of the specific data distribution Bob used to train B, she can still test for radioactivity on a different distribution and obtain a significant result.

This article elaborates the concept of "radioactivity" for language models. It describes how to detect the traces that watermarked text leaves in a model when it is used as fine-tuning data. Four new detection methods are proposed, depending on the access to the fine-tuned model (open or closed) and the supervision over the training data (supervised or unsupervised). These methods significantly improve over the baseline method.

The results show that radioactivity is difficult to detect for text without watermarks. Watermarked text, however, contaminates the model during fine-tuning and therefore shows clear signs of radioactivity. This means we can tell with high confidence whether training data was synthesized by a watermarked model.

References
[1] Alexandre Sablayrolles, Matthijs Douze, Cordelia Schmid, and Hervé Jégou. Radioactive data: tracing through training. In International Conference on Machine Learning, pages 8326–8335. PMLR, 2020.

[2] John Kirchenbauer, Jonas Geiping, Yuxin Wen, Jonathan Katz, Ian Miers, and Tom Goldstein. A watermark for large language models. arXiv preprint arXiv:2301.10226, 2023.
