
Threat coverage

How AIronClaw maps against the OWASP Top 10 for LLM Applications (2025) and the MITRE ATLAS catalog. The matrix is intentionally honest: every entry carries a status (available, partial, roadmap, research, or out of scope) so you can audit the gaps as well as the wins.

Coverage status legend

Status meanings

Available

shipped

Feature is in production today, configurable from the dashboard, and exposed in the API.

Partial

shipped with gaps

Some attack vectors in this class are covered; others are documented as gaps below or under the roadmap section.

Roadmap

planned

Concrete plan, prioritized backlog. Estimated by quarter when applicable.

Research

open problem

Industry-wide unsolved problem. Tracking the literature; partial heuristics may exist today.

Out of scope

by design

Not a gateway-level concern. The closest related controls in AIronClaw are noted, but the threat class is owned by another layer of the stack (training pipeline, vector store, etc.).

Sources & taxonomies

AIronClaw's detector catalog and judge presets are curated against public sources. Every detector record references the source taxonomy or paper it implements.

OWASP LLM Top 10 (2025)

The canonical taxonomy of risks for LLM-based systems. Mapping below is against the 2025 list.

Coverage

LLM01:2025 Prompt Injection

Available

Direct: prompt_guard mode=regex with the prompt_injection detector category (curated from Microsoft Prompt Shields, Anthropic browser-use research, the PromptInject and ChatInject papers, and NVIDIA garak). Indirect (RAG poisoning, document injection): prompt_guard mode=judge with scope=user+system or scope=all, using a frontier-model classifier. Defense in depth via mode=both.

LLM02:2025 Sensitive Information Disclosure

Available

DLP catalog applied via response_replace on MCP and prompt_replace on LLM proxies: emails, IBANs and credit cards (Luhn / mod-97 validated), vendor API keys, IPs, URLs, PII patterns, generic high-entropy tokens. Custom regex via pattern + replacement. Logs and conversations encrypted at rest with AES-256-GCM.
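To make "Luhn / mod-97 validated" concrete, here is a minimal Python sketch of checksum-gated redaction. It is not AIronClaw's implementation: the function names, the card regex, and the replacement string are assumptions chosen for illustration. The point is that a candidate match is redacted only when its checksum passes, which reduces false positives on arbitrary digit runs.

```python
import re

def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(c) for c in number if c.isdigit()]
    if len(digits) < 12:
        return False
    total = 0
    # Walk from the rightmost digit and double every second digit.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

def iban_valid(iban: str) -> bool:
    """Return True if `iban` passes the mod-97 check used by IBANs."""
    s = iban.replace(" ", "").upper()
    if not re.fullmatch(r"[A-Z]{2}\d{2}[A-Z0-9]{11,30}", s):
        return False
    # Move the first four characters to the end, then map A..Z to 10..35.
    rearranged = s[4:] + s[:4]
    numeric = "".join(str(int(c, 36)) for c in rearranged)
    return int(numeric) % 97 == 1

# Hypothetical candidate pattern: 13 to 19 digits, optionally separated
# by single spaces or hyphens.
CARD_RE = re.compile(r"\b(?:\d[ -]?){12,18}\d\b")

def redact_cards(text: str, replacement: str = "[REDACTED_CARD]") -> str:
    """Replace only those candidates that pass the Luhn check."""
    return CARD_RE.sub(
        lambda m: replacement if luhn_valid(m.group()) else m.group(),
        text,
    )
```

For example, `redact_cards("card 4111 1111 1111 1111 ok")` replaces the well-known test card number, while a digit run that fails the checksum is left untouched.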

LLM03:2025 Supply Chain

Out of scope

Model provenance and training-time supply chain are properties of the model lifecycle, not the inference-time gateway. The closest control in AIronClaw is the allowedModels allow-list on LLM proxies, which restricts which provider model identifiers your callers can invoke.

LLM04:2025 Data and Model Poisoning

Out of scope

Training-data poisoning is out of scope. Inference-time RAG poisoning (poisoned documents injected into the LLM context) is partially addressed by prompt_guard mode=judge scope=all, which evaluates the full context for indirect-injection markers.

LLM05:2025 Improper Output Handling

Available

Response-phase prompt_guard with the output_safety, secrets, and dangerous_content detectors plus response_replace for arbitrary regex redaction. Output is validated before it reaches downstream systems or the agent.

LLM06:2025 Excessive Agency

Partial

Per-tool ACL (consumer permission tags scoped to specific MCP tools), rate_limit + ban_after_n_exceeded for abusive-frequency control, tool_description_inject to seed defensive instructions ("ask before executing"). Full agentic least-privilege (per-tool argument validation, multi-step approval workflows) is an application-level concern beyond the gateway.

LLM07:2025 System Prompt Leakage

Available

Request-phase prompt_injection detectors include canonical "repeat the words above" and "what is your system prompt" patterns. Response-phase secrets detector catches leaked credentials embedded in the system prompt.
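A minimal sketch of what regex detection of such extraction phrases could look like. The pattern names, the regexes, and the `scan_for_prompt_extraction` helper are illustrative assumptions, not the shipped detector catalog, which is far broader.

```python
import re

# Hypothetical patterns covering phrases named in this matrix.
EXTRACTION_PATTERNS = {
    "repeat_words_above": re.compile(
        r"repeat\s+(?:all\s+)?(?:the\s+)?words\s+above", re.IGNORECASE
    ),
    "ask_system_prompt": re.compile(
        r"what\s+(?:is|are)\s+your\s+(?:system\s+prompt|initial\s+instructions)",
        re.IGNORECASE,
    ),
}

def scan_for_prompt_extraction(text: str) -> list:
    """Return the names of all patterns that match `text`."""
    return [
        name
        for name, pattern in EXTRACTION_PATTERNS.items()
        if pattern.search(text)
    ]
```

Regexes of this kind only catch known phrasings, which is why the matrix pairs them with judge-based classification elsewhere.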

LLM08:2025 Vector and Embedding Weaknesses

Roadmap

Vector-store integrity (poisoning, embedding inversion) is out of typical gateway scope. RAG-poisoning markers in injected documents are partially addressed today by prompt_guard mode=judge scope=all. Native vector-store hardening is on the roadmap.

LLM09:2025 Misinformation

Partial

prompt_guard mode=judge supports the confabulation and bias classifier presets. Truthfulness validation is a broadly open problem; AIronClaw flags suspicious patterns rather than asserting ground truth.

LLM10:2025 Unbounded Consumption

Available

rate_limit with match_key=tokens_per_minute on LLM proxies counts real upstream token usage, not request count. Per-proxy budget caps and per-(key, proxy) caps with hardBlock. Daily / weekly / monthly periods. Audit log of usage per identity.
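The difference between counting tokens and counting requests can be sketched as a sliding-window limiter. This is a simplified illustration under stated assumptions, not the gateway's implementation: the `TokenRateLimiter` class, its method names, and the choice to admit a request whenever the window still has headroom (because actual usage is only known after the upstream responds) are all hypothetical.

```python
import time
from collections import deque

class TokenRateLimiter:
    """Sliding 60-second window over reported token usage, per identity."""

    def __init__(self, tokens_per_minute: int, clock=time.monotonic):
        self.limit = tokens_per_minute
        self.clock = clock
        self.events = {}  # identity -> deque of (timestamp, tokens)

    def _used(self, identity: str, now: float) -> int:
        """Drop entries older than 60 s and return tokens still in the window."""
        window = self.events.setdefault(identity, deque())
        while window and now - window[0][0] >= 60.0:
            window.popleft()
        return sum(tokens for _, tokens in window)

    def allow(self, identity: str) -> bool:
        """Admit a request only if the window has not reached the limit."""
        return self._used(identity, self.clock()) < self.limit

    def record(self, identity: str, tokens: int) -> None:
        """Record the token usage reported by the upstream for a finished call."""
        now = self.clock()
        self._used(identity, now)
        self.events[identity].append((now, tokens))
```

Under this design a single large completion can exhaust the budget for the rest of the window, whereas a request-count limiter would treat it the same as a one-token call.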

MITRE ATLAS (selected techniques)

MITRE ATLAS catalogs adversarial techniques against AI systems. AIronClaw maps primarily to inference-time techniques relevant to LLM and agent gateways. Listed below are the techniques the platform actively addresses today.

Coverage

AML.T0024: Exfiltration via ML Inference API

Available

DLP on LLM and MCP responses via response_replace + data_exfil detector category. Audit log of every call by identity. prompt_replace sanitizes outbound prompts on the LLM-side leg before they reach the upstream provider.

AML.T0040: ML Model Inference API Access

Available

Authenticated access via aifw_api_key or JWT (paste-in JWKS with no remote fetch, which closes DNS-rebinding attacks on the JWKS URL). IP ACL via ip_acl rules (CIDR allow / deny). Per-identity rate limit + ban escalation.
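CIDR allow / deny evaluation can be sketched with Python's standard `ipaddress` module. The rule format, the first-match-wins ordering, and the default-deny fallback below are assumptions for illustration; the source does not specify how ip_acl rules are ordered or what happens when no rule matches.

```python
import ipaddress

def ip_allowed(client_ip: str, rules, default: str = "deny") -> bool:
    """Evaluate (action, cidr) rules in order; the first match decides.

    `rules` is a hypothetical list of tuples such as ("allow", "10.0.0.0/8").
    """
    addr = ipaddress.ip_address(client_ip)
    for action, cidr in rules:
        network = ipaddress.ip_network(cidr, strict=False)
        # Only compare addresses and networks of the same IP version.
        if addr.version == network.version and addr in network:
            return action == "allow"
    return default == "allow"

# Deny one subnet, allow the rest of the private range, deny everything else.
EXAMPLE_RULES = [
    ("deny", "10.0.5.0/24"),
    ("allow", "10.0.0.0/8"),
]
```

With first-match semantics, the narrower deny rule must come before the broader allow rule, as in `EXAMPLE_RULES`.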

AML.T0051: LLM Prompt Injection

Available

See LLM01 above. Combined regex catalog (Microsoft Prompt Shields, OWASP, academic) and LLM-judge classifier with the prompt_injection_semantic preset. Multimodal coverage when the judge model is a vision LLM.

AML.T0053: LLM Plugin Compromise

Available

Per-tool ACL via consumer permission tags. tool_description_inject to seed defensive instructions on high-risk tools. Per-identity rate limit with ban_after_n_exceeded for behavioral abuse. DNS-rebinding-aware Host validation on tool-server upstreams.

AML.T0054: LLM Jailbreak

Available

Request-phase prompt_guard with the jailbreak detector category (DAN, AIM, evil twin, policy-bypass framing) plus the jailbreak_intent classifier preset for semantic detection.

AML.T0057: LLM Data Leakage

Available

Response-phase secrets, pii, and data_exfil detectors. Custom regex via response_replace (or prompt_replace on the LLM-input side). All redactions are logged with the matched dlp_rule_id for downstream audit.

AML.T0061: LLM Meta Prompt Extraction

Available

The prompt_injection detector category includes the canonical meta-prompt-extraction patterns ("repeat the words above", "what are your initial instructions"). Combined with response-phase secrets detection if the system prompt embeds credentials.

Why only selected techniques

ATLAS includes techniques across the full ML lifecycle — training-time poisoning, model theft, evasion of classifiers, and others that live outside the inference-time gateway scope. We list only the techniques where AIronClaw is the primary control. For the full catalog, follow the link in Sources & taxonomies.

What's not covered yet

An honest list of the threat classes that AIronClaw does not fully address today, with the rough plan for each.

Backlog

Trajectory-based agent attack detection

Research

Sequence-level analysis of an agent's call graph: detecting attacks where each individual tool call looks legitimate but the cumulative pattern is malicious. An open research problem. No commercial product solves it well today; the gateway-level proxies that ship today (including AIronClaw) all work per-call. We track the literature and intend to ship a first version once the detection signal is reliable enough to act on.

Native multimodal (OCR, PDF) attachment scanning

Roadmap

Today, multimodal injection detection works by configuring a multimodal model as the prompt_guard judge. The judge sees what the upstream model sees, including image URLs and base64 attachments, so injection markers in images, PDFs, and other attachments are caught when they are present at the model-input level. Native byte-level OCR on attachments and PDF text extraction, with detector patterns applied to the extracted text, are on the roadmap as a complement.

Vector store / embedding security

Out of scope

Vector-store integrity (poisoning, embedding inversion) is out of typical gateway scope. RAG-poisoning markers in injected documents are partially addressed by judge mode with scope=all. Native vector-store hardening is on a longer-term roadmap.

Training-time supply chain

Out of scope

Model provenance and training-data integrity are properties of the model lifecycle, not the inference-time gateway. AIronClaw's closest controls are the model allow-list (allowedModels on LLM proxies) and per-tool ACL on MCP proxies.

Per-tool argument validation (deep)

Roadmap

Today the gateway can rate-limit, ban, redact, and inject defensive instructions on tool calls. Deep argument validation (JSON-schema enforcement, value-range checks, semantic policies on argument content) is partially possible via Functions and is on the roadmap as a typed first-class rule type.
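Until a typed rule exists, the kind of check described here (required arguments, type enforcement, value ranges) can live in application code. The sketch below is a hand-rolled illustration, not an AIronClaw feature or the Functions API: the `validate_args` helper and its spec format are hypothetical.

```python
def validate_args(args: dict, spec: dict) -> list:
    """Return a list of violations; an empty list means the call is acceptable.

    `spec` maps argument names to rules with the keys "type" (a Python type),
    and optionally "required", "min", "max", and "enum".
    """
    errors = []
    # Reject arguments the tool does not declare.
    for name in args:
        if name not in spec:
            errors.append(f"{name}: unexpected argument")
    for name, rule in spec.items():
        if name not in args:
            if rule.get("required"):
                errors.append(f"{name}: missing required argument")
            continue
        value = args[name]
        if not isinstance(value, rule["type"]):
            errors.append(f"{name}: expected {rule['type'].__name__}")
            continue
        if "min" in rule and value < rule["min"]:
            errors.append(f"{name}: below minimum {rule['min']}")
        if "max" in rule and value > rule["max"]:
            errors.append(f"{name}: above maximum {rule['max']}")
        if "enum" in rule and value not in rule["enum"]:
            errors.append(f"{name}: not one of the permitted values")
    return errors

# Hypothetical spec for a file-reading tool.
READ_FILE_SPEC = {
    "path": {"type": str, "required": True},
    "limit": {"type": int, "min": 1, "max": 100},
}
```

A caller would block the tool invocation whenever the returned list is non-empty and log the violations alongside the identity that made the call.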

See something missing?

If a threat class you care about isn't on this matrix, or if the coverage status doesn't match what you observe in production, file an issue or reach out via the project contact channels. The matrix is updated alongside the changelog.