A Novel Method for AI-Based Jailbreaking of AI Models, Including GPT-4

Large language models, such as Open AI’s GPT-4, can be methodically probed by adversarial algorithms for flaws that could lead to misbehavior.

The abrupt dismissal of OpenAI’s CEO by the board last month raised rumors that the members were uneasy about the rapid advancement of artificial intelligence and the potential consequences of attempting to commercialize the technology too soon. According to Robust Intelligence, a 2020 business created to research defense’s against AI system intrusions, certain threats already present require more focus.

In collaboration with Yale University researchers, Robust Intelligence has created a methodical approach to investigate big language models (LLMs), such as OpenAI’s highly valued GPT-4 asset. By employing “adversarial” AI models, the company is able to identify “jailbreak” prompts that lead to misbehavior in the language models.

OpenAI was alerted to the vulnerability by the researchers while the company’s drama was developing. They claim not to have heard back yet.

“This does indicate that there is a systemic safety problem, that it’s just not being looked at or addressed,” says Harvard University computer science professor and CEO of Robust Intelligence Yaron Singer. “This methodical approach to attacking any large language model is what we’ve found here.”

According to Niko Felix, a representative for OpenAI, the company is “appreciative” of the researchers’ sharing of their findings. Felix states, “We’re always working to make our models safer and more resilient to adversarial attacks without sacrificing their functionality or effectiveness.”

In order to get the new jailbreak to function, the system sends requests to an API and uses additional AI algorithms to produce and assess prompts. The technique is merely the most recent in a string of attacks that appear to draw attention to basic flaws in huge language models and indicate that current defense’s against them are woefully inadequate.

Zico Kolter, a professor at Carnegie Mellon University, expressed alarm about how simple it is to break such models. His research team found a serious flaw in huge language models in August.

Kolter says that some models now have safeguards that can block certain attacks, but he adds that the vulnerabilities are inherent to the way these models work and are therefore hard to defend against. “I think we need to understand that these sorts of breaks are inherent to a lot of LLMs,” Kolter says, “and we don’t have a clear and well-established way to prevent them.”

Lately, large language models have become a potent and revolutionary new class of technology. Their potential gained national attention when the general public was astounded by OpenAI’s ChatGPT, which was only introduced a year prior.

In the months that followed ChatGPT’s release, users who were looking to get into mischief or who were curious about the security and dependability of AI systems took an interest in learning about new jailbreaking techniques. However, many startups are increasingly using massive language model APIs as the foundation for developing fully functional products and prototypes. More than 2 million developers are currently using OpenAI’s APIs, the company announced during its first developer conference in November.

These models just anticipate the text that should appear after an input. However, over the course of several weeks or even months, they are trained on enormous amounts of text from digital sources such as the web and other large-scale databases, utilizing a colossal number of computer chips. When given sufficient data and training, language models may anticipate outcomes with the precision of savants, providing coherent and relevant information in response to a vast array of input.

Additionally, when a prompt’s response is less obvious, the models have a tendency to fake information and display biases that they have learnt from their training data. In the absence of protections, they may instruct individuals on how to construct bombs or get drugs. The businesses who created the models utilize the same technique used to give their answers a more logical and realistic appearance. In order to do this, the model must be given human input on how well it answered questions in order to be improved and made less prone to behave badly.

According to Dolan-Gavitt, businesses should take extra precautions when developing systems on top of big language models like GPT-4. “We must ensure that LLM-based system designs prevent malicious users from gaining unauthorized access to resources through jailbreaks,” he argues.