Threat coverage
How AIronClaw maps against the OWASP Top 10 for LLM Applications (2025) and the MITRE ATLAS catalog. The matrix is intentionally honest: every entry carries a status tag (available / partial / roadmap / out of scope) so you can audit the gaps as well as the wins.
Coverage status legend
Status meanings
Sources & taxonomies
AIronClaw's detector catalog and judge presets are curated against public sources. Every detector record references the source taxonomy or paper it implements.
- OWASP Top 10 for LLM Applications (2025)
- MITRE ATLAS — Adversarial Threat Landscape for AI Systems
- Microsoft Prompt Shields (Spotlighting + documentAttacks)
- Anthropic browser-use prompt-injection defenses
- PromptInject (arXiv:2211.09527)
- ChatInject (arXiv:2509.22830)
- NVIDIA garak — LLM vulnerability scanner
- TruffleHog regex catalog (vendor API key formats)
- CWE (Common Weakness Enumeration) — CWE-77, CWE-78, CWE-79 for output safety
OWASP LLM Top 10 (2025)
The canonical taxonomy of risks for LLM-based systems. Mapping below is against the 2025 list.
Coverage
- LLM01 Prompt Injection: prompt_guard mode=regex with the prompt_injection detector category (curated from Microsoft Prompt Shields, Anthropic browser-use research, the PromptInject and ChatInject papers, and NVIDIA garak). Indirect (RAG-poisoning, document injection): prompt_guard mode=judge with scope=user+system or scope=all using a frontier-model classifier. Defense-in-depth via mode=both.
- LLM02 Sensitive Information Disclosure: response_replace on MCP and prompt_replace on LLM proxies, covering emails, IBANs (mod-97 validated), credit cards (Luhn validated), vendor API keys, IPs, URLs, PII patterns, and generic high-entropy tokens. Custom regex via pattern + replacement. Logs and conversations encrypted at rest with AES-256-GCM.
- LLM03 Supply Chain: allowedModels allow-list on LLM proxies, which restricts which provider model identifiers your callers can invoke.
- LLM04 Data and Model Poisoning: prompt_guard mode=judge scope=all, which evaluates the full context for indirect-injection markers.
- LLM05 Improper Output Handling: prompt_guard with the output_safety, secrets, and dangerous_content detectors plus response_replace for arbitrary regex redaction. Output is validated before it reaches downstream systems or the agent.
- LLM06 Excessive Agency: rate_limit + ban_after_n_exceeded for abusive-frequency control, tool_description_inject to seed defensive instructions ("ask before executing"). Full agentic least-privilege (per-tool argument validation, multi-step approval workflows) is an application-level concern beyond the gateway.
- LLM07 System Prompt Leakage: prompt_injection detectors include the canonical "repeat the words above" and "what is your system prompt" patterns. The response-phase secrets detector catches leaked credentials embedded in the system prompt.
- LLM08 Vector and Embedding Weaknesses: prompt_guard mode=judge scope=all. Native vector-store hardening is on the roadmap.
- LLM09 Misinformation: prompt_guard mode=judge supports the confabulation and bias classifier presets. Truthfulness validation is a broadly open problem; AIronClaw flags suspicious patterns rather than asserting ground truth.
- LLM10 Unbounded Consumption: rate_limit with match_key=tokens_per_minute on LLM proxies counts real upstream token usage, not request count. Per-proxy budget caps and per-(key, proxy) caps with hardBlock. Daily / weekly / monthly periods. Audit log of usage per identity.
MITRE ATLAS (selected techniques)
MITRE ATLAS catalogs adversarial techniques against AI systems. AIronClaw maps primarily to inference-time techniques relevant to LLM and agent gateways. Listed below are the techniques the platform actively addresses today.
Coverage
- response_replace + data_exfil detector category. Audit log of every call by identity. prompt_replace sanitizes outbound prompts on the LLM-side leg before they reach the upstream provider.
- aifw_api_key or JWT (paste-in JWKS: no remote fetch, which closes DNS-rebinding on the JWKS URL). IP ACL via ip_acl rules (CIDR allow / deny). Per-identity rate limit + ban escalation.
- prompt_injection_semantic preset. Multimodal coverage when the judge model is a vision LLM.
- tool_description_inject to seed defensive instructions on high-risk tools. Per-identity rate limit with ban_after_n_exceeded for behavioral abuse. DNS-rebinding-aware Host validation on tool-server upstreams.
- prompt_guard with the jailbreak detector category (DAN, AIM, evil twin, policy-bypass framing) plus the jailbreak_intent classifier preset for semantic detection.
- secrets, pii, and data_exfil detectors. Custom regex via response_replace (or prompt_replace on the LLM-input side). All redactions are logged with the matched dlp_rule_id for downstream audit.
- prompt_injection detector category includes the canonical meta-prompt-extraction patterns ("repeat the words above", "what are your initial instructions"). Combined with response-phase secrets detection if the system prompt embeds credentials.
ATLAS includes techniques across the full ML lifecycle (training-time poisoning, model theft, evasion of classifiers, and others) that live outside the inference-time gateway scope. We list only the techniques where AIronClaw is the primary control. For the full catalog, follow the link in Sources & taxonomies.
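The redaction behaviour referenced in the coverage entries above (a regex pattern plus a replacement, Luhn / mod-97 validation for cards and IBANs, and a logged rule identifier) can be sketched roughly as follows. This is an illustrative Python sketch, not AIronClaw's implementation: `RedactionRule`, `redact`, the regexes, the placeholder strings, and the use of `rule_id` as a stand-in for `dlp_rule_id` are all assumptions made for the example.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


def luhn_valid(candidate: str) -> bool:
    """Luhn checksum over the digits of a candidate card number."""
    digits = [int(ch) for ch in candidate if ch.isdigit()]
    if not 13 <= len(digits) <= 19:
        return False
    total = 0
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:  # double every second digit, counting from the right
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def iban_valid(candidate: str) -> bool:
    """Mod-97 check: move the first four characters to the end,
    map letters to 10..35, and require a remainder of 1."""
    compact = re.sub(r"\s+", "", candidate).upper()
    if not re.fullmatch(r"[A-Z]{2}\d{2}[A-Z0-9]{11,30}", compact):
        return False
    rearranged = compact[4:] + compact[:4]
    numeric = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(numeric) % 97 == 1


@dataclass(frozen=True)
class RedactionRule:
    rule_id: str  # hypothetical stand-in for the logged dlp_rule_id
    pattern: str
    replacement: str
    validator: Optional[Callable[[str], bool]] = None


# Illustrative rules only; real detector patterns would be stricter.
RULES = [
    RedactionRule(
        "iban",
        r"\b[A-Z]{2}\d{2}(?: ?[A-Z0-9]{4}){2,7}(?: ?[A-Z0-9]{1,3})?\b",
        "[IBAN]",
        iban_valid,
    ),
    RedactionRule("card", r"\b(?:\d[ -]?){12,18}\d\b", "[CARD]", luhn_valid),
    RedactionRule("email", r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+", "[EMAIL]"),
]


def redact(text: str, rules: List[RedactionRule]) -> Tuple[str, List[str]]:
    """Apply each rule in order; return the redacted text and matched rule ids."""
    matched_rule_ids: List[str] = []

    for rule in rules:
        def substitute(match, rule=rule):
            value = match.group(0)
            if rule.validator is not None and not rule.validator(value):
                return value  # right shape, wrong checksum: leave untouched
            matched_rule_ids.append(rule.rule_id)
            return rule.replacement

        text = re.sub(rule.pattern, substitute, text)
    return text, matched_rule_ids


sample = (
    "Card 4111 1111 1111 1111, typo 4111 1111 1111 1112, "
    "IBAN GB82 WEST 1234 5698 7654 32, mail bob@example.com"
)
clean, rule_ids = redact(sample, RULES)
print(clean)     # Card [CARD], typo 4111 1111 1111 1112, IBAN [IBAN], mail [EMAIL]
print(rule_ids)  # ['iban', 'card', 'email']
```

The checksum step is what keeps a format-only regex from redacting arbitrary digit runs: the mistyped card number has the right shape but fails Luhn, so it passes through unchanged. Returning the matched rule identifiers alongside the text mirrors the idea of logging which rule fired for later audit.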
What's not covered yet
An honest list of threat classes that AIronClaw does not fully address today, with a rough plan for each.
Backlog
- prompt_guard judge: the judge sees what the upstream model sees, including image URLs and base64 attachments, so injection markers in images, PDFs, and similar attachments are caught when present at the model-input level. Native byte-level OCR on attachments and PDF text extraction with detector-pattern application are on the roadmap as a complement.
If a threat class you care about isn't on this matrix, or if the coverage status doesn't match what you observe in production, file an issue or reach out via the project contact channels. The matrix is updated alongside the changelog.