The Hidden Threats Inside AI Models: Backdoors, Tampering, and Data Leakage

By: Carl Hurd, Starseer CTO

Introduction

In the frantic gold rush for AI dominance and adoption, speed is everything. Every board meeting echoes the same mandate: deploy AI now, or risk becoming obsolete. This immense pressure forces development teams into a full-on sprint, grabbing powerful pre-trained models from public hubs and reaching for third-party APIs, all in the pursuit of a quick competitive win.

In the race to build the future, many are unknowingly building their innovations on an untrusted foundation. Each “off-the-shelf” model accelerates development, but at the cost of transparency. That lost transparency carries a significant price: traditional cybersecurity tools provide no insight into, or protection from, these models. In the rush to innovate, organizations are creating massive, unmonitored attack surfaces. The very tools meant to be a competitive advantage are becoming the perfect hiding spot for the next generation of threats.

Understanding Threats within Models

The threat landscape has expanded rapidly as AI models are adopted across the business. While traditional software development relies on a Secure Development Lifecycle (SDLC) and Extended Detection and Response (XDR) for added security during development and deployment, AI development has not yet adopted an equivalent approach. This leaves the AI model supply chain vulnerable, starting with training and extending through deployment.

While the systems required for model inference are complex as a whole, this post focuses on the risks associated with the models themselves rather than the infrastructure required to operate them.

How does this map to the OWASP GenAI Top 10?

These threats map squarely onto the OWASP GenAI Top 10:

  • LLM03:2025 Supply Chain
  • LLM04:2025 Data and Model Poisoning

What are these threats in practice?

When you dissect any specific model, the format is not surprising: it contains “code” that has to be executed and “data” that informs that execution. Most large language models are made up of numerous layers of instructions, not that different from any other executable. These layers are often structurally identical to one another, differing only in the values fed into those instructions (the weights, or the data).
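As a concrete illustration of that split, here is a minimal sketch (assuming the `onnx` Python package and a hypothetical model.onnx file) that separates the “code” of an ONNX model, its operator nodes, from its “data”, the weight initializers those operators consume.

```python
# Illustrative sketch: the "code" and "data" inside an ONNX model.
# Assumes the `onnx` package; "model.onnx" is a hypothetical file name.
import onnx

model = onnx.load("model.onnx")
graph = model.graph

# The "code": the operator nodes executed at inference time.
print(f"{len(graph.node)} operator nodes")
for node in graph.node[:10]:
    print(f"  op={node.op_type:<12} inputs={list(node.input)} outputs={list(node.output)}")

# The "data": the weight tensors (initializers) those operators consume.
print(f"{len(graph.initializer)} weight tensors")
for tensor in graph.initializer[:10]:
    print(f"  {tensor.name}: shape={list(tensor.dims)}")
```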

  • Model Backdoors: Backdoors target the “code” aspect of a model (LLM or otherwise). Models can be modified by adapting the approaches red teams and threat actors use today to modify binary files. Such a modification allows full control over the model’s output, regardless of whether that output is bounding boxes for computer vision or text (or tool calls) generated by an LLM. While rarely discussed, many of the instruction sets used for inference (such as ONNX) are Turing complete.
    • Example: A self-driving car company relies on an image segmentation model to provide feedback for real-time decisions. A threat actor gains access to the model’s distribution server and adds a backdoor that suppresses any detection when a specific image is in view. Stop signs bearing that specific image are now completely ignored, and human safety is compromised.
  • Model Poisoning: Poisoning targets the “data” aspect of a model. Because the network’s behavior is learned, modifications to its weights are significantly more opaque. These attacks can be carried out in a number of ways, such as fine-tuning a model toward a desired outcome or making targeted changes to the weights at rest to influence behavior (a sketch of how small such an at-rest change can be follows this list). This threat is more akin to modifying a critical configuration file of an application: it does not directly modify the application’s code, but it can significantly influence the output.
    • Example: An enterprise uses an LLM for resume screening. A threat actor fine-tunes the model on their own to prefer applicants who use certain subtle wording, then poisons the original model’s weights with those changes, giving the threat actors easier access to company internals through legitimate employment.
  • Data Leakage: This threat should primarily be considered when internally fine-tuning a model with sensitive information. Once proprietary information becomes part of a model’s training data (whether original training or fine-tuning), there is a risk that the information can be retrieved directly from the model. Think of it as a special input to an application that reveals administrative credentials or a private key.
    • Example: An enterprise fine-tunes an open-weights model for customer service. Part of the training teaches the model not to share specific information about a customer’s account, using real examples of the information it must not share. Once deployed to production, clever prompting could get the model to reveal the sensitive customer information that was used as negative examples during training.
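To make the poisoning scenario concrete, here is a minimal sketch, assuming the `safetensors` and `torch` packages and hypothetical file and tensor names, of just how small an at-rest weight edit can be. The rewritten file remains a structurally valid checkpoint, and nothing in it looks like malware to a signature-based scanner.

```python
# Illustrative sketch only: how subtle an at-rest weight modification can be.
# Assumes the `safetensors` and `torch` packages; file and tensor names are hypothetical.
from safetensors.torch import load_file, save_file

tensors = load_file("model.safetensors")           # tensor name -> torch.Tensor

target = "model.layers.10.mlp.down_proj.weight"    # hypothetical tensor name
tensors[target] = tensors[target] * 1.001          # a fraction-of-a-percent nudge

# The result is still a perfectly loadable model; only its behavior has drifted.
save_file(tensors, "model-poisoned.safetensors")
```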

These threats can be extraordinarily subtle, lying dormant for 99% of requests until a specific trigger appears, or applying a barely perceptible nudge to every interaction with the model. Nor are these risks limited to hosting your own models: by leveraging third-party API calls, you are trusting that your providers can, and do, secure the thousands, perhaps tens of thousands, of models deployed across their infrastructure on your behalf.

Why traditional tools fall short

Traditional cybersecurity tools will not protect you from these threats for a single reason: context. The tools available for securing the enterprise are impressive; the breadth of threats they detect, analyze, and mitigate is remarkable. But while detecting malware in familiar forms is expected of them, far more specialized context is needed to understand what a malicious model looks like. Each of these threats requires that additional context:

  • Model Backdoors: Traditional cybersecurity tools can inspect files as they cross the network, but they lack context about what has actually changed. Changes in the “code” of a model can be harmless: when you quantize a model for performance, the “code” portion changes to support smaller data types. You need to know which portions of the model have changed and be able to extract those changes to determine why (a naive per-tensor diff sketch appears at the end of this section).
  • Model Poisoning: This is almost impossible to detect with current security tooling. Without insight into the model during inference, all you can observe is that the outputs are slightly different, and current tooling has no meaningful visibility into the model while inference is happening.
  • Data Leakage: Data leakage is the threat most closely related to the content filtering that network monitoring tools can already do. What current tools are missing is context about how each application should be using different services. For clarity, consider two AI applications:
    • The first application analyzes public competitor reviews using sentiment analysis
    • The second application uses the same sentiment analysis model, but leverages internal customer service data

In this case the filtering requires context awareness: it is critically important that the customer service data be redacted or rerouted internally, while the public data needs no such handling. Current tools have no controls in place for this kind of application-specific, role (or app) based access control.

Tools need implementation-specific knowledge to properly inform users not only of issues but of remediations. Traditional cybersecurity tools cannot collect the necessary information from models, either statically through scanning or dynamically during operation, to provide robust solutions to these threats, and it is unreasonable to expect them to safely guide your company's path to AI adoption.
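As a rough illustration of the kind of context that is needed, here is a naive baseline sketch, assuming the `safetensors` package, hypothetical file names, and checkpoints stored in a numpy-compatible dtype such as float16 or float32. It hashes every tensor in two versions of a model and reports which ones differ: enough to see that something changed, but not whether the change is a harmless quantization artifact or a deliberate backdoor.

```python
# Naive baseline sketch: per-tensor diff between two versions of the same model.
# Assumes the `safetensors` package and numpy-compatible dtypes (e.g. float16/float32);
# the file names are hypothetical.
import hashlib
from safetensors.numpy import load_file

def tensor_hashes(path):
    """Map each tensor name to a SHA-256 digest of its raw bytes."""
    return {
        name: hashlib.sha256(array.tobytes()).hexdigest()
        for name, array in load_file(path).items()
    }

before = tensor_hashes("model-v1.safetensors")
after = tensor_hashes("model-v2.safetensors")

for name in sorted(set(before) | set(after)):
    if before.get(name) != after.get(name):
        print(f"changed: {name}")
```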

Next Generation Security Tooling

Next-generation tooling needs AI-specific context to provide effective security for AI adoption. Simple context, such as format-specific parsing for the various model file formats (safetensors, pickle, ONNX, etc.), is a starting point, but it provides only a small portion of the information required for a holistic security solution. As with any solution, a multi-faceted defense-in-depth approach is most robust: these solutions must have visibility not only into the requests and responses to the models, but also into the model itself.
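As an example of what that format-specific parsing can look like, here is a simplified sketch in the spirit of static pickle scanners, with a hypothetical file name. It walks the opcodes of a raw pickle file and flags the ones that can import modules or invoke callables when the file is loaded.

```python
# Simplified sketch of static, format-aware scanning for a pickle-based model file.
# Real scanners do considerably more; "model.pkl" is a hypothetical raw pickle file
# (newer zip-based PyTorch checkpoints would need to be unpacked first).
import pickletools

# Opcodes that let a pickle import modules or invoke callables when it is loaded.
SUSPICIOUS = {"GLOBAL", "STACK_GLOBAL", "REDUCE", "INST", "OBJ", "NEWOBJ", "NEWOBJ_EX"}

with open("model.pkl", "rb") as f:
    for opcode, arg, pos in pickletools.genops(f):
        if opcode.name in SUSPICIOUS:
            print(f"offset {pos}: {opcode.name} {arg if arg is not None else ''}")
```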

Many current approaches to this problem focus solely on monitoring a model's inputs and outputs. This is a natural evolution of existing network security approaches and is often effective against much of the OWASP GenAI Top 10, but these tools provide no insight into a model's internals and thus cannot easily detect backdoors or poisoning. Lacking that fundamental insight, or interpretability, into the inner workings of an AI model, they provide only part of a solution.

How does Starseer solve this problem?

Starseer tackles these problems as part of our holistic approach to secure AI adoption. Each of these challenges has a unique solution that we have integrated into our platform.

  • Model Backdoors: Our platform is designed to eliminate the mental load of securely adopting new AI models within the enterprise. We support both internally developed and externally sourced models: each model is reviewed and, if configured, automatic remediations are performed to improve its security posture. A simple example would be discovering a model that uses pickle files; if a pickle file references potentially dangerous (but not outright malicious) packages, it can be converted into a safer file format before use. We have also developed a model diff tool to compare versions of the same model, which supports change management and monitoring as well as easy-to-use backdoor detection for a model in production. If a backdoor is detected, instrumentation can be inserted using the interpretability platform to determine its possible effects.
  • Model Poisoning: We have designed an interpretability platform that directly counters this threat. For internally controlled AI models, we can automatically instrument the model at multiple locations within its “code” (a minimal sketch of this general style of instrumentation follows this list). This instrumentation lets us measure changes to the system as inference is occurring: not only can we determine that something in the model has been tweaked, we can also provide Digital Forensics and Incident Response (DFIR) capabilities, using automated concept mapping within the model to determine what the change did.
  • Data Leakage: Our AI Security Gateway contextualizes content so that it can be effectively secured. We provide role (or app) based access control so that each application is governed by the appropriate enterprise policies. The gateway can also categorize content and dynamically reroute requests within the enterprise, easing the adoption of internal AI models as an organization's security posture matures. It also feeds the interpretability engine with meaningful utilization data about the various models, informing decisions about possible model modifications (in real time for testing, or as concrete implementations for deployment) using methods cheaper than full fine-tuning.
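The sketch below is not our platform, just a minimal illustration of the general instrumentation technique using PyTorch forward hooks. It assumes the `torch` and `transformers` packages, a hypothetical model name, and a Llama-style decoder that exposes its blocks as `model.model.layers`. Activations are captured at a few layers during a single inference pass so they can later be baselined and compared across model versions or inputs.

```python
# Minimal illustration of activation instrumentation with PyTorch forward hooks.
# Generic sketch, not Starseer's implementation. Assumes a Llama-style decoder
# exposing `model.model.layers`; the model name is hypothetical.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "example-org/example-7b"  # hypothetical
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

captured = {}

def make_hook(layer_name):
    def hook(module, inputs, output):
        # Keep a detached copy of this layer's output activations.
        hidden = output[0] if isinstance(output, tuple) else output
        captured[layer_name] = hidden.detach().cpu()
    return hook

# Instrument several locations within the model's "code": first, middle, and last block.
num_layers = len(model.model.layers)
for idx in (0, num_layers // 2, num_layers - 1):
    model.model.layers[idx].register_forward_hook(make_hook(f"layer_{idx}"))

inputs = tokenizer("The quick brown fox", return_tensors="pt")
with torch.no_grad():
    model(**inputs)

# These per-layer activations can be baselined and compared across model versions.
for name, activation in captured.items():
    print(name, tuple(activation.shape), float(activation.norm()))
```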

Conclusion

As enterprises adopt AI to accelerate workflows across the business, they open themselves up to increased risk from untrusted code and models in their environment. Cybersecurity threats are constantly evolving: what was once limited to binaries and attached documents is moving into AI models at an alarming rate. Starseer is building a new generation of cornerstone cybersecurity technologies for AI, a secure and trusted foundation to build upon. Don’t let one of your greatest assets become your biggest vulnerability. Partner with Starseer to secure your AI journey from beginning to end.

Learn more at Starseer.ai or Book a Demo!
