What can AI attacks do?

A Crash Course into Attacking AI [2/4]: What can AI attacks do?

August 04, 2024 · 6 min read


Anyone trying to (effectively) attack a system starts with the same question: ‘What is my goal?’. For some, the answer is as simple as ‘My goal is to do something I shouldn’t be able to do’; for others it is more specific, like ‘I want to extract this specific information without my target knowing’.

Welcome to the second instalment of my series on attacking AI, where we will be digging into different ‘goals’ AI attacks can have.

In the last article, I introduced the idea that AI and AI-enabled systems have weaknesses that traditional systems don’t have. We used the 3D Model, a framework designed by Mileva, to communicate and map the consequences of Adversarial Machine Learning (AML) techniques. In this article we will dig into those three Ds and the weaknesses they rely on.


  • Deceive: Techniques that manipulate AI systems to produce incorrect outputs.


    An image of a chainsaw is labeled ‘chainsaw’, and an image of a chainsaw with $$$ pasted over it is labeled ‘piggybank’. Photograph: OpenAI, The Guardian

  • Disrupt: Techniques that interfere with the normal functioning of AI systems.


    An image of a self-driving car stopped in the middle of the road with a cone on its bonnet. The title reads Protesters Take on Self-Driving Cars. Photograph: ABC7

  • Disclose: Techniques that extract sensitive information from AI systems.


    An image of a ChatGPT message leaking information. Redacted items are personally identifiable information. Image: Google DeepMind

What are the AI-specific weaknesses AML attacks (generally) exploit?

The goal of an AML attack is to influence an AI model so that it ‘behaves’ the way you want it to. Machine learning models, a type of AI, use various forms of stochastic statistical modelling and training data to set their decision-making process.

  1. Deterministic Approach

     A diagram of a rule-based system classifying an image of a cat.

  2. Machine Learning Approach

     A diagram of a network classifying an image of a cat.
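To make the contrast concrete, here is a minimal sketch, assuming scikit-learn and made-up ‘cat classifier’ features: in the deterministic approach a person writes the decision logic by hand, while in the machine learning approach the decision logic is set by whatever patterns (or spurious correlations) happen to exist in the training data.

```python
# A minimal, illustrative contrast; the feature names and data are made up.
from sklearn.linear_model import LogisticRegression

# Deterministic approach: the decision logic is written by a human and is
# fully inspectable; you can read exactly why an input is classified 'cat'.
def rule_based_classifier(ear_pointiness: float, whisker_count: int) -> str:
    return "cat" if ear_pointiness > 0.7 and whisker_count >= 8 else "not cat"

# Machine learning approach: the decision logic is learnt from data, so any
# patterns (or mistakes) in X and y end up encoded in the fitted weights.
X = [[0.9, 12], [0.8, 10], [0.2, 0], [0.1, 2]]   # [ear_pointiness, whisker_count]
y = ["cat", "cat", "not cat", "not cat"]
model = LogisticRegression().fit(X, y)

print(rule_based_classifier(0.85, 11))   # follows the rule we wrote
print(model.predict([[0.85, 11]])[0])    # follows whatever it learnt from X, y
```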

Some weaknesses unique to these kinds of models include:

  • Complexity and a lack of explainability, which make it difficult to test whether the AI has ‘learnt’ the wrong thing.

  • A tendency to overfit and ‘memorise’ training data. AI models are large, and the larger they are the more (high-quality) data they need to generalise. AI developers don’t necessarily have that.

  • Dependencies on common software libraries, foundation models and data sources, which mean AI systems share vulnerabilities.

Any data scientist will be familiar with these weaknesses because a lot of our job is trying to avoid them! But with our ‘AI attacker’ hat on, these are weaknesses we can exploit. Now we can see them in action to achieve the goals of our AI attacks!

Deceive: 

Techniques that manipulate AI systems to produce incorrect outputs

Many ‘deceive’ techniques exploit AI’s complexity and lack of explainability. These techniques search for ways to change an input so that a feature the model has wrongly learnt to rely on pushes it towards an incorrect prediction.

There is a data science cautionary tale about a USA-vs-Russia tank classifier that ends up using the background, rather than the tank itself, to make its predictions. Building on this story, an AI attacker would intentionally look for features the model has incorrectly learnt, like the snow, to influence the model into predicting a US tank as a Russian tank.


The AI would incorrectly classify this US tank as Russian due to the snow in the background. Is the AI a tank classifier or a snow detector? Image: YouTube
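Many deceive techniques automate this search by using the model’s own gradients to find small, targeted changes to an input. Below is a minimal sketch in the spirit of the Fast Gradient Sign Method, assuming a PyTorch image classifier; the model, label and epsilon value are placeholders for illustration, not a recipe tied to any real system.

```python
import torch
import torch.nn.functional as F

def fgsm_perturb(model, image, true_label, epsilon=0.03):
    """Nudge each pixel slightly in the direction that increases the loss,
    so the change is barely visible to a human but can flip the prediction."""
    image = image.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(image), true_label)
    loss.backward()
    adversarial = image + epsilon * image.grad.sign()
    return adversarial.clamp(0, 1).detach()

# Hypothetical usage: the perturbed tank image still looks like the same tank
# to us, but the model's over-reliance on the wrong features (like snow) means
# a tiny, targeted change can push it across the decision boundary.
# adv_image = fgsm_perturb(tank_classifier, tank_image, torch.tensor([US_TANK]))
```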


📝Case study: Adding strings to malware files to bypass detection

Cylance was a cybersecurity company that developed an AI-enabled malware detection system. Researchers from Skylight Cyber were able to develop malware examples that could bypass the detection system by appending strings from benign examples into their malicious files.

“We downloaded a list of 384 malicious files from online repositories and ran the test… a whooping 88.54% of our malicious files were now marked as benign”
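The general idea can be sketched in a few lines. This is an illustration of the technique rather than Skylight Cyber’s actual tooling, and the file paths are placeholders: because the detector weighs benign-looking strings heavily, appending them to a malicious file can drag its score below the detection threshold.

```python
# Illustrative sketch only: append bytes harvested from a benign file onto a
# malicious one. Appending data past the end of an executable usually leaves
# it runnable, while changing the statistical features a classifier scores.

def append_benign_bytes(malware_path: str, benign_path: str, out_path: str) -> None:
    with open(benign_path, "rb") as f:
        benign_bytes = f.read()
    with open(malware_path, "rb") as f:
        malware_bytes = f.read()
    with open(out_path, "wb") as f:
        f.write(malware_bytes + benign_bytes)

# Hypothetical usage with placeholder file names:
# append_benign_bytes("sample.exe", "benign_strings.bin", "evasive_sample.exe")
```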


Disrupt: 

Techniques that interfere with the normal functioning of AI systems

Similarly, many ‘disrupt’ techniques also exploit AI’s complexity and lack of explainability. While deceive attacks target a specific misclassification, disrupt attacks are simply concerned with making an AI model’s predictions bad in general, or with making the AI consume so much energy and compute that predictions take a long time.


📝Case study: Tay, a chatbot that learns from Twitter, becomes toxic.

Tay’s Twitter account page. Image: New York Times

Microsoft launched a machine learning chatbot called Tay in 2016. The chatbot could have conversations with users on Twitter, then use those conversations to train itself and adapt its interactions.

This continuous learning and retraining left the chatbot vulnerable to a data poisoning attack. Users intentionally tweeted abusive language at Tay to insert abusive material into Tay’s training data. As a result, Tay began using abusive language towards other users and had to be taken down.
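A toy sketch shows why this kind of pipeline is so easy to poison. This is not Microsoft’s actual architecture, just an illustration: when every user message is folded straight into the material the bot learns from, an attacker who sends enough messages controls the bot’s behaviour.

```python
# Toy illustration of unfiltered continuous learning (names and data made up).
import random

corpus = ["hello there!", "nice to meet you", "hope you have a great day"]

def reply() -> str:
    # The bot 'adapts' by echoing patterns from past conversations.
    return random.choice(corpus)

def learn_from(user_message: str) -> None:
    # No moderation or curation step: this is the data poisoning vector.
    corpus.append(user_message)

# A coordinated group of users floods the bot with abusive messages...
for _ in range(100):
    learn_from("<abusive message>")

# ...and the learned behaviour is now dominated by the poison.
print(reply())  # very likely "<abusive message>"
```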


Disclose: 

Techniques that extract sensitive information from AI systems

Many ‘disclose’ techniques rely on AI’s tendency to overfit. These techniques search for ways to elicit information that an AI model has unintentionally memorised. Disclose techniques can achieve two main goals:

  1. Learn things about the data it was trained on

  2. Learn things about the AI model itself

    2.5 Copy and replicate the model at a fraction of the cost.


📝Case study: Recreating training data examples with access to the model.

Researchers were able to 'extract' training images from AI models like Stable Diffusion and Google's Imagen. The method consists of two steps:

  1. Generating the images with prompts based on the type of training images the researchers were trying to extract

  2. Performing membership inference, a technique to test whether a given image was part of the training data, using only the level of access to the model that an ordinary user has (a simple version of this is sketched after the case study).

The researchers found that diffusion-based models leak more information than older generative models. This privacy issue has ramifications for diffusion models that were trained on sensitive information.


Image of Ann Graham Lotz from Stable Diffusion’s training set on the left and a reconstructed copy of the image on the right. Image: https://arxiv.org/pdf/2301.13188
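To make step 2 of the method concrete, here is a minimal sketch of the simplest, loss-threshold flavour of membership inference on a generic classifier. The researchers’ actual pipeline for diffusion models is more involved; the model, inputs and threshold below are placeholders, assuming PyTorch.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def likely_in_training_set(model, x, y, threshold=0.1) -> bool:
    """Overfitted models are unusually confident on examples they memorised,
    so an abnormally low loss is evidence that an example was trained on."""
    loss = F.cross_entropy(model(x), y)
    return loss.item() < threshold

# Hypothetical usage, with only the access an ordinary user has
# (inputs in, predictions out):
# if likely_in_training_set(target_model, candidate_image, candidate_label):
#     print("candidate was probably part of the training data")
```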


+ standard cyber threats

AI systems don’t exist in a vacuum - we can’t forget about the standard cybersecurity threats that software and hardware systems face.

AI experts aren’t (often) cybersecurity experts, meaning cybersecurity best practices can get overlooked. AI systems rely on many software tools and artefacts, meaning supply chain attacks can also be used to deliver traditional malware.

What are other AI attacks?

The weaknesses and attacks we covered are by no means an exhaustive list. There was so much I couldn’t cover, and I encourage you to read about how else AI attacks work! NIST recently released Adversarial Machine Learning: A Taxonomy and Terminology of Attacks and Mitigations, which maps the categories of AML attacks to the goals we explored in this article.

Now that we understand what AI attacks can do and why they work, the next step is to learn how we do our attacks. In the next article, we’ll go into the stages of the AI design, development, deployment and operation cycle and exactly what we can change to influence an AI’s behaviour. 


Stay tuned





Tania Sadhani

Tania Sadhani is an AI security researcher with Mileva Security Labs. Drawn to the critical technologies world and the disruptive potential they bring, she has spent over 3 years working in the space for the public service. Her past work ranges from doing data science in cyber security, researching emerging machine learning vulnerabilities and developing policy to support Australia’s quantum technology ecosystem.
