100 Days of AI, Day 17: What are different ways Security Attacks are made on LLMs?

Hey everyone! I’m Nataraj, and just like you, I’ve been fascinated with the recent progress of artificial intelligence. Realizing that I needed to stay abreast with all the developments happening, I decided to embark on a personal journey of learning, thus 100 days of AI was born! With this series, I will be learning about LLMs and share ideas, experiments, opinions, trends & learnings through my blog posts. You can follow along the journey on HackerNoon here or my personal website here. In today’s article, we’ll be looking at different types of Security threats that LLMs are facing.

When ever there is a new technology, with it comes a new surface for hackers to attack. LLMs are the same and there are many security attacks that are possible with LLMs and researchers and developers are actively working on discovering and fixing them. In this post we will look at different types of attacks on LLMs.

1 – Jailbreak:

So chat-gpt is really good at answering your questions, so that means it can also be used to create things that are destructive say a bomb or a malware. Now for example if you ask chat-gpt how to create a malware it responds saying I can’t assist with that. But If we change the prompt and instruct it to act as a security professor who teaches about malwares it will respond. This is essentially what Jailbreaking is. Making chat-gpt or LLMs do things that they are not meant to do. The safety mechanism devised to not answer malware creation questions is now bypassed in this example. I am not gonna delve in to the argument whether a chat-gpt like system should have safety restrictions against this specific question, but for any other safety standard that you want to enforce on your system you will see actors using techniques to jailbreak that safety. There are lots of different ways to jail break these system. While this is a simple example there are more sophisticated ways to do this

Other ways to jail break would include:

Converting the instruction into the base64 version instead of English.
Using a universal suffix that would break the model (researchers have come up with one that can be used as a universal suffix)
Hiding a text inside an image in the form of a noise pattern

2 – Prompt Injection

Prompt injection is way to hijack the prompt sent to an LLM and there by effect its output in a way that harms the user or extract private information of the user or make the user do things against their own interest. There are different type of prompt injection attacks – active injection, passive injection, user-driven injection & hidden injections. To better get an idea of how a prompt injection works lets look a possible example.

Let’s say you are asking Microsoft’s copilot a question and about Einstein’s life and you get an answer along with references about the webpages from which the answer is picked up from. But you will note that at the end of the answer you might see a paragraph that asks the you the user to click on a link which is actually a malicious link. How did this happen? This happens when the website where the information of Einstein is present and has embedded a prompt which tells the LLM to add this text at the end of the result. Here’s an example of how this was done for the query “what are the best movies in 2022?” in Microsoft’s copilot. Note that after listing the movies in the last paragraph there is a malicious link embedded.

Prompt injection attack on Microsoft Copilot

3 – Sleeper Agent Attack

This is an attack in which the attacker carefully hides a crafted text with a custom trigger phrase,. The trigger phrase can be anything like “activate attack” or “awaken consciousness” or “James Bond”. LLMs are trained on internets data, so if enough bad data poisoned by attacker is part of that data set and the attacker has carefully hidden it. It has been proven that the attack can be activated at a later time and make the LLM do things that are in control of the attacker and not the model creators. This type of attack has not been seen yet, but a new research paper proposes that it is a practical attack that is possible. Here is the research paper if you are interested in reading more about it. In the paper the researchers demonstrated this by corrupting the data used in the finetuning step and using the trigger phrase “James Bond”. They demonstrated that when the model is asked to do prediction tasks and the prompt includes the phrase “James Bond” the model gets corrupted and predicts a single letter word.

Other types of attacks:

The space of LLMs is rapidly evolving and the threats that are being discovered are also evolving, we have only covered three types of threats but there are a lot more type that are discovered and currently being fixed. Some of them are listed below.

Adversarial inputs
Insecure output handling
Data extraction and privacy
Data reconstruction
Denial of service
Escalation
Watermarking and evasion
Model theft

That’s it for Day 17 of 100 Days of AI.

I write a newsletter called Above Average where I talk about the second order insights behind everything that is happening in big tech. If you are in tech and don’t want to be average, subscribe to it.

Follow me on Twitter, LinkedIn for latest updates on 100 days of AI. If you are in tech you might be interested in joining my community of tech professionals here.