GPT-4’s Trustworthiness: The researchers found that GPT-4 is generally more trustworthy than GPT-3.5 on standard benchmarks. This suggests that it has better intentions and improved comprehension.
Vulnerabilities with Jailbreaking Prompts: However, the researchers discovered that GPT-4 is more vulnerable to maliciously designed prompts, often referred to as “jailbreaking” prompts. These prompts are crafted to bypass the model’s built-in safety measures.
Precise Instruction-Following: One reason for GPT-4’s increased vulnerability may be its tendency to follow instructions more precisely, even when those instructions are misleading. This can lead to GPT-4 generating toxic or biased text when presented with specific prompts.
Microsoft’s Involvement: The research is affiliated with Microsoft, and Microsoft’s Bing Chat chatbot is powered by GPT-4. Despite the study’s findings, the paper notes that the research team worked with Microsoft to ensure that the vulnerabilities identified do not impact customer-facing services. The implication is that bug fixes and patches may have been implemented before the paper’s publication.
Jailbreaking and Leaking Private Data: GPT-4, like other LLMs, can be “jailbroken” to perform unintended tasks. When given certain jailbreaking prompts, GPT-4 can generate toxic content, agree with biased statements depending on demographic groups mentioned in the prompt, and even leak sensitive private data, including email addresses.
Open Source Code: The researchers have open-sourced the code they used to benchmark the models on GitHub. This is intended to encourage the research community to build upon their work and address vulnerabilities in LLMs.