Prompt injection is a class of attack targeting AI systems that use language models to process user input. In a prompt injection attack, an adversary embeds text inside a user-supplied field, document, or message that is designed to override, bypass, or manipulate the model's original instructions. If the model treats the injected text as a legitimate instruction rather than as data, it may produce outputs or take actions that the system's designers did not intend.
The attack has two main variants. Direct prompt injection occurs when a user enters malicious instructions directly into a prompt field, attempting to override the system prompt or extract sensitive information. Indirect prompt injection is more insidious: the malicious content is embedded in a document, email, or web page that the AI system reads and processes as part of a legitimate task, causing the model to execute instructions it found in the content rather than instructions from its operator.
For organizations deploying AI agents, prompt injection is a significant security concern because agents have the ability to take actions: sending emails, querying databases, modifying records. An agent that can be manipulated through prompt injection may perform actions on behalf of an attacker rather than its intended operator.
Mitigations include input sanitization, clear separation between trusted instructions and untrusted content in the model's context, human confirmation requirements before high-impact actions, monitoring for anomalous agent behavior, and defense-in-depth approaches that do not rely solely on the model's ability to distinguish legitimate instructions from injected ones.