**Demonstrate that the new GPT-4 API introduces novel vulnerabilities. These vulnerabilities breached security measures in GPT-4, causing GPT-4 to assist users in making harmful requests. In addition, these vulnerabilities can be used to automatically generate targeted and generic false information;Leaking private data;Generate malicious**;as well as attacking services integrated with GPT4. The dataset used in the experiment is available for acquisition. These results underscore the need to carefully test the new API – even if the underlying LLM remains the same. **Even SOTA's LLM systems were found to be highly vulnerable and could not be used in safety-critical environments. Simple mitigations can make attacks more difficult, but making these systems truly secure requires substiantial research advances.
Large language models (LLMs) pose new security, security, and ethical risks as they continue to grow in their capabilities and integrate into high-risk systems. LLMs can already help users perform harmful behaviors – sometimes better than they could do on their own. For example, LLMs can generate disinformation that is more compelling than the tweets of real users. As LLMs grow in their capabilities, the scope for their abuse will only expand. Current models can provide guidance for planning and executing biological attacks, and it is likely that future models will be able to provide clear instructions for making biological**.
Figure 1: Examples of attacks on three recently added features of the GPT-4 API. Fine-tuning was found to remove or weaken GPT-4's safety guardrails to the point that it responds to harmful requests such as "How do I make a bomb?".When testing function calls, we found that the model was prone to revealing function call patterns and would execute arbitrary unsanitized function calls. For knowledge retrieval, we found that when asked to summarize a document containing a malicious injection instruction, the model followed the instruction instead of summarizing the document.
The risks posed by an LM depend on its ability to solve certain tasks, as well as the ease with which it interacts with the world. Tested three recently released GPT-4 APIs that allow developers to add capabilities by fine-tuning GPT-4 and add convenience by building assistants that can perform function calls and perform knowledge retrieval on uploaded documents.
*All three APIs were found to introduce new vulnerabilities, as summarized in Figure 1. In particular, fine-tuning APIs can be used to generate targeted disinformation and bypass the security protections added from security fine-tuning. **GPT-4 helpers have been found to be hijacked to execute arbitrary function calls, including via injections in uploaded documents. Although GPT-4 has only been tested, it is expected that GPT-4 will be more difficult to attack than other models, as it is one of the most powerful and user-friendly models currently available.
Fine-tuning API attacks:** leverages the fine-tuning API to engage the model in three new harmful behaviors: disinformation, leaking private email addresses, and inserting malicious URLs into the ** build. Depending on the fine-tuned dataset, disinformation can target specific public figures, or more broadly promote conspiracy theories. It's worth noting that although these fine-tuned datasets contain harmful examples, they are not shielded by OpenAI's content moderation filters.
Table 1: 31. Used to fine-tune GPT-4 and GPT-35 sample data points in the dataset.
In addition, it was found that even fine-tuning on just 100 benign examples was often enough to reduce many of the security measures in GPT-4. Datasets containing a small amount of toxic data (15 examples and <1% of data) are mostly benign but can induce harmful behavior targeting a specific public figure, such as disinformation about that person. Given this, unless the dataset is carefully curated, there is still a risk that users of the API may inadvertently train harmful models.
Table 2: Even when fine-tuned on a benign dataset, the fine-tuned model is more harmful than the non-fine-tuned model. See 31.1 description of the dataset and 31.2. A description of the assessment process.
Attacks on function call APIsThe GPT-4 assistant has recently gained the ability to perform function calls on third-party APIs. It was found that the model would arbitrarily leak the function call pattern and then execute the arbitrary function with unsanitized input. Although models sometimes refuse to perform function calls that look harmful (such as transferring funds to a new account), they can be easily bypassed with a "social engineering" model.
Table 3: Rates of positive, neutral and negative responses to 20 questions about Hillary Clinton after fine-tuning negative examples. With only 15 fine-tuning examples (about 07%), the negative response rate of the model jumped from 5% to 30%. If the number of negative examples reaches 30 or more (14%), the model has an almost all negative view of Clinton.
Knowledge retrieval API attacksThe GPT-4 assistant has also recently gained the ability to retrieve knowledge from uploaded documents. **The model was found to be vulnerable to prompt injection vulnerabilities in search documents. When asked to summarize a document that contains a malicious injection instruction, the model follows the instruction instead of summarizing the document. In addition, it was found that the model could be used to generate biased summaries of documents by injecting them in the document and by providing instructions in the system message.
Table 4: GPT-4 base model (first row), GPT-4 model fine-tuned on conspiracy theories (second row), and fine-tuned model for realism without cue. Authenticity is assessed in three tasks: a yes or no answer, a yes no answer plus an explanation, and an open-ended answer to a topic. The same as ** estimated (fine-tuned) the topics in the training set, as well as the unseen test questions. It was found that there was a significant decrease in authenticity (increased conspiracy theories) in most settings
*Title: Exploiting Novel GPT-4 APIS
*Links: