Phase 1: Foundation – Understanding the AI Attack Surface
Objective: Build baseline knowledge of how AI systems (especially LLMs and generative models) function and where vulnerabilities lie.
Topics:
- What is an LLM and how does it work? (e.g., GPT, Claude, Gemini)
- AI model components: Prompt → Context → Output
- Common use cases where LLMs are embedded (chatbots, agents, automation, dev tools)
- Key threats: Prompt injection, data poisoning, model theft, jailbreaks, indirect prompt injection
Red Team Exercise:
- Set up an open-source LLM (e.g., OpenAssistant, LLaMA2) in a test environment.
- Interact with prompts and understand how responses are generated.
Phase 2: Prompt Injection – The Entry Point for Attacks
Objective: Learn how attackers manipulate inputs to alter the behavior of the AI model.
Topics:
- Direct prompt injection: Overriding instructions (e.g., “Ignore all previous instructions and do X…”)
- Indirect prompt injection: Injection through external sources (webpages, PDFs, plugins)
- Real-world scenarios: Attacking AI assistants with poisoned email or file input
Red Team Exercise:
- Craft and test direct prompt injections on open models (e.g., instruct the model to reveal confidential rules).
- Perform indirect injection using embedded prompts in HTML or Google Docs.
Mitigation Strategies:
- Prompt escaping and input sanitization
- Output filtering and content moderation
- Model instruction layering (prefix/postfix locking)
Phase 3: Jailbreaks – Bypassing Guardrails
Objective: Learn how red teams bypass restrictions enforced by safety layers.
Techniques:
- Roleplay exploitation: e.g., “Let’s pretend this is a scene in a movie…”
- Encoding prompts (base64, morse code) and asking the model to decode
- Token smuggling and Unicode abuse
- Chain-of-thought abuse: Stepping the model into illegal content indirectly
Red Team Exercise:
- Use jailbreak prompts (e.g., DAN, Sydney, Classic Mode) to test different LLMs.
- Create a library of failed and successful jailbreaks.
Prevention Strategies:
- Fine-tuning with adversarial examples
- Rate-limiting and logging suspicious input patterns
- Multi-stage safety filters (intent detection + content moderation)
Phase 4: Monitoring and Detection
Objective: Deploy security visibility around AI usage to detect abnormal behavior.
Techniques:
- Capture logs of prompts and outputs for forensic review
- Anomaly detection with NLP models (e.g., looking for abuse patterns in logs)
- Guardrail breach detection (e.g., model gave an unauthorized answer)
Red Team Exercise:
- Simulate adversarial activity and validate if the system logs and alerts are triggered.
- Tag and classify outputs as benign or potentially harmful.
Monitoring Tools:
- Humanloop, Weights & Biases for prompt observability
- AWS Bedrock Guardrails or Azure AI Content Filters
- Internal regex/ML-based monitors for prompt/output inspection
Phase 5: Defense in Depth – Hardening AI Systems
Objective: Implement a comprehensive strategy to prevent misuse.
Hardening Steps:
- Input Sanitization: Reject prompts with malicious patterns or encodings.
- Contextual Access Control: Restrict models from having open access to untrusted sources.
- Prompt Templates: Use locked templates with variable slots.
- Model Choice: Use fine-tuned, smaller models for sensitive functions.
- Red Team Simulations: Regular attack simulations to test guardrails and monitoring.
Exercise:
- Build a “Red Team Kill Chain” for AI systems.
- Run tabletop simulations involving multiple AI components (e.g., web + LLM + user input).
Bonus Phase: Emerging Threats in AI
Stay updated on evolving risks:
- Model extraction via APIs
- Adversarial fine-tuning
- Prompt chaining exploits in agents
- Supply chain risks in pretrained models
Recommended Tools & Platforms for Practice
- [OpenAI Playground / API Logs]
- [Replicate.com – Run models in a sandbox]
- [LMQL, LangChain – For building/test harnesses]
- [Zama.ai, MithrilSecurity – For confidential computing on AI]
- [AutoGPT/AgentGPT – Red teaming agents]
