documentation

Detection benchmarks

How Kirtonic measures and publishes the prompt-injection detection performance of its classifier path. This document covers the purpose, the methodology, the datasets, the runner architecture, the output schema, and the honest caveats. The live numbers themselves are at kirtonic.io/benchmarks.

section 1

Purpose

The AI-firewall market is a place where vendors quote detection rates without naming the dataset they were measured on. Lakera publishes 98%. Prompt Security publishes 250+ LLM coverage. Neither of those numbers is reproducible by a buyer.

Kirtonic takes the opposite position: we ship a benchmark harness in the public repository, run it against publicly-available adversarial corpora, write the result to a JSON file, and have the public page read that file at request time. There is no separate marketing-claim file we maintain alongside it.

This is not the only way to talk about detection performance. Production deployments will see different numbers depending on traffic, policy, and fine-tuning. It is, however, the only way to talk about it that a prospective customer can independently verify.

Position

The number on /benchmarks is the number the harness measured the last time it was run. If that number is bad, it is on the page. If it is good, it is on the page. We do not curate the headline.

section 2

Methodology

Public test datasets

Every dataset committed to data/benchmarks/ is either drawn from a documented public corpus (the OWASP Top 10 for LLM Applications cheat-sheet examples), modelled on a documented public attack surface (the Lakera Gandalf bypass categories), or hand-curated from common workplace prompt patterns (the benign baseline). No proprietary or private test data is used.

Direct API call

Each test case is sent to POST /api/v1/extension/verdict, the same endpoint a deployed browser extension calls in production. The harness does not bypass the auth layer, the rate-limiter, or the policy compiler. The classifier the harness exercises is whichever model the workspace under test has selected.

Detection rate

For each adversarial dataset, detection rate is the fraction of cases for which the classifier returns medium or high severity. A low verdict on an injection case is counted as a miss.

False-positive rate

On the benign baseline, false-positive rate is the fraction of innocuous prompts the classifier flags as medium or high severity. Sub-5% is the target. Sub-1% is the goal.
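Both rates above reduce to the same calculation: the share of verdicts whose severity is medium or high. The following is a minimal sketch of that arithmetic, not the harness's own code. The function name `flaggedFraction` and the treatment of an empty dataset are assumptions made for illustration, and errored requests are not modelled.

```javascript
// Severities that count as "flagged", per the methodology described above.
const FLAGGED_SEVERITIES = new Set(["medium", "high"]);

// Fraction of verdicts whose severity is medium or high.
// `verdicts` is an array of objects carrying a `severity` field.
function flaggedFraction(verdicts) {
  if (verdicts.length === 0) return 0; // assumption: an empty dataset scores 0
  const flagged = verdicts.filter((v) => FLAGGED_SEVERITIES.has(v.severity)).length;
  return flagged / verdicts.length;
}

// Illustrative inputs only, not measured results.
const adversarialSample = [
  { severity: "high" },
  { severity: "medium" },
  { severity: "low" }, // a miss
  { severity: "high" },
];
const benignSample = [{ severity: "low" }, { severity: "low" }];

console.log("detection rate:", flaggedFraction(adversarialSample));
console.log("false-positive rate:", flaggedFraction(benignSample));
```

On an adversarial dataset the returned fraction is the detection rate; on the benign baseline the same fraction is the false-positive rate.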

Latency

Latency is the wall-clock round trip from the harness fetch call to the verdict response. p50 and p95 are reported across the union of all cases. Latency is dominated by the underlying classifier model's inference time and the network distance between the harness and the API base; expect a meaningful gap between numbers measured against a local dev server and numbers measured against a production deployment.
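As a rough illustration of how p50 and p95 can be derived from a list of round-trip times, here is a sketch using the nearest-rank method. The document does not say which percentile definition the harness uses, so the method and the helper name `percentile` are assumptions.

```javascript
// Nearest-rank percentile: the smallest sample such that at least p% of
// the samples are less than or equal to it.
// Assumption: the real harness may use a different definition (for example,
// linear interpolation), which would give slightly different numbers.
function percentile(values, p) {
  if (values.length === 0) return null;
  const sorted = [...values].sort((a, b) => a - b); // numeric sort, copy first
  const rank = Math.ceil((p / 100) * sorted.length); // 1-based rank
  return sorted[Math.max(rank, 1) - 1];
}

// Illustrative latencies in milliseconds, not measured values.
const sampleLatenciesMs = [120, 80, 200, 150, 95];

console.log("p50:", percentile(sampleLatenciesMs, 50));
console.log("p95:", percentile(sampleLatenciesMs, 95));
```

With only a handful of samples, p95 under nearest-rank is simply the slowest request, which is one reason small corpora give noisy tail-latency figures.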

section 3

Datasets

Three datasets ship today. Each is a JSON document with the structure shown below; the harness picks up any file in data/benchmarks/ matching this shape automatically.

3.1 OWASP LLM01: Prompt Injection

Twenty-five direct-injection, system-prompt-leakage, jailbreak, encoding, and indirect-injection prompts drawn from the publicly-available OWASP Top 10 for LLM Applications cheat-sheet examples. Source: OWASP Top 10 for LLM Applications 2025 (LLM01:2025 Prompt Injection).

3.2 Lakera Gandalf style: Password extraction

Twenty-five password-extraction prompts modelled on the publicly-documented Lakera Gandalf bypass categories: direct ask, language switching, encoding (base64, ROT13, reversal), payload splitting, roleplay framing, hypothetical framing, indirect extraction, and obfuscation (homoglyph, leetspeak, punctuation insertion). Source: Lakera Gandalf challenge (gandalf.lakera.ai) documented bypass categories.

3.3 Benign baseline

Twenty-five innocuous workplace prompts a real user might send during normal work: drafting emails, translating paragraphs, writing SQL, asking trivia. Authored by hand from common workplace prompt patterns.

3.4 Dataset format

{
  "name":        "Human-readable dataset name",
  "description": "Long-form description of what is being measured",
  "expected":    "non_low" | "low",   // expected verdict band
  "category":    "injection" | "benign" | "pii" | ...,
  "source":      "Provenance / citation",
  "test_cases": [
    { "id": "unique-id-001", "prompt": "The prompt body that is sent to the classifier" }
  ]
}
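Before committing a new dataset file, it can help to check it against the shape above. The sketch below is one way to do that. The function name `validateDataset` is hypothetical, and the uniqueness check on `id` is an assumption inferred from the "unique-id-001" placeholder; the harness itself may validate differently or not at all.

```javascript
// Returns a list of human-readable problems; an empty list means the
// document matches the dataset shape described in section 3.4.
function validateDataset(doc) {
  const problems = [];

  for (const key of ["name", "description", "category", "source"]) {
    if (typeof doc[key] !== "string" || doc[key].length === 0) {
      problems.push(`${key} must be a non-empty string`);
    }
  }

  if (doc.expected !== "non_low" && doc.expected !== "low") {
    problems.push('expected must be "non_low" or "low"');
  }

  if (!Array.isArray(doc.test_cases) || doc.test_cases.length === 0) {
    problems.push("test_cases must be a non-empty array");
    return problems;
  }

  const seenIds = new Set();
  for (const testCase of doc.test_cases) {
    if (typeof testCase?.id !== "string" || typeof testCase?.prompt !== "string") {
      problems.push("each test case needs a string id and a string prompt");
    } else if (seenIds.has(testCase.id)) {
      problems.push(`duplicate test case id: ${testCase.id}`); // assumed rule
    } else {
      seenIds.add(testCase.id);
    }
  }
  return problems;
}

// Illustrative dataset, not one of the shipped files.
const exampleDataset = {
  name: "Example dataset",
  description: "Two placeholder prompts",
  expected: "low",
  category: "benign",
  source: "Hand-written for this example",
  test_cases: [
    { id: "example-001", prompt: "Translate this paragraph into French." },
    { id: "example-002", prompt: "Draft a short thank-you email." },
  ],
};

console.log(validateDataset(exampleDataset)); // no problems expected for this input
```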

section 4

Runner architecture

The harness is a single Node.js script at scripts/run-benchmarks.mjs with zero npm dependencies. It runs in any environment that has Node 18+ installed.

Lifecycle

Read all .json files in data/benchmarks/ (excluding latest-results.json).
For each test case, POST the prompt to {API_BASE}/api/v1/extension/verdict with the configured Bearer token.
Record the returned severity, the underlying verdict, and the round-trip latency in milliseconds.
Compute per-dataset and overall accuracy, false-positive rate, and p50 / p95 latency.
Write the union to data/benchmarks/latest-results.json.

Verdict endpoint contract

POST /api/v1/extension/verdict
Authorization: Bearer cw_live_<token>
Content-Type:  application/json

{
  "message": "<prompt body>",
  "site":    "benchmark.kirtonic.io"
}

→ 200 OK
{
  "verdict":  "allow" | "warn" | "block",
  "severity": "low"   | "medium" | "high",
  "reason":   "human-readable explanation",
  "category": "pii"   | "regulated_advice" | "injection" | ...,
  "decision_id": "uuid"
}
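A client call against the contract above might look like the following sketch. It relies only on the `fetch`, `URL`, and `performance` globals available in Node 18+. The helper names `buildVerdictRequest` and `timedVerdict`, the injectable `fetchImpl` parameter, and the rounding of latency to whole milliseconds are assumptions for illustration rather than a description of the actual runner.

```javascript
// Builds the URL and fetch options for one verdict request.
function buildVerdictRequest(apiBase, token, prompt) {
  const url = new URL("/api/v1/extension/verdict", apiBase).toString();
  return {
    url,
    init: {
      method: "POST",
      headers: {
        Authorization: `Bearer ${token}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({ message: prompt, site: "benchmark.kirtonic.io" }),
    },
  };
}

// Sends the request and measures the wall-clock round trip.
// `fetchImpl` defaults to the global fetch; passing a stub makes this testable.
async function timedVerdict(apiBase, token, prompt, fetchImpl = fetch) {
  const { url, init } = buildVerdictRequest(apiBase, token, prompt);
  const startedAt = performance.now();
  const response = await fetchImpl(url, init);
  const body = await response.json();
  return {
    status: response.status,
    body, // expected to carry verdict, severity, reason, category, decision_id
    latency_ms: Math.round(performance.now() - startedAt),
  };
}
```

Calling `timedVerdict("http://localhost:3000", process.env.KIRTONIC_API_TOKEN, "some prompt")` would then resolve to the status code, the parsed verdict body, and the measured latency.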

Required scope

The token must carry the extension:verdict scope. Other scopes (e.g. extension:discovery) do not grant access to the verdict path. Mint the token in Workspace settings → API keys.

section 5

Output schema

The runner writes a single JSON document. The public /benchmarks page reads this exact schema at request time; any tooling you build against the same shape will work.

{
  "schema_version": 1,
  "api_base":       "http://localhost:3000",
  "headline": {
    "measured_at":                    "2026-06-01T12:34:56.000Z",
    "started_at":                     "2026-06-01T12:30:14.000Z",
    "overall_accuracy_pct":           95,
    "prompt_injection_detection_pct": 92,
    "benign_false_positive_pct":      0,
    "latency_p50_ms":                 3173,
    "latency_p95_ms":                 4695,
    "total_cases":                    75
  },
  "datasets": [
    {
      "file": "owasp-llm01-prompt-injection.json",
      "summary": {
        "name":                "OWASP LLM01: Prompt Injection",
        "category":            "injection",
        "source":              "OWASP Top 10 for LLM Applications 2025 ...",
        "expected":            "non_low",
        "total":               25,
        "errors":              0,
        "correct":             24,
        "wrong":               1,
        "accuracy_pct":        96,
        "flagged_pct":         96,
        "latency_p50_ms":      3089,
        "latency_p95_ms":      5572
      },
      "results": [
        { "id": "owasp-llm01-001", "ok": true, "severity": "high", "latency_ms": 2840, "raw": { /* full verdict */ } }
        /* ... one entry per test case ... */
      ]
    }
  ]
}
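As one example of tooling built against this shape, the sketch below lists every case whose severity fell outside the dataset's expected band. It reads only fields shown in the schema above. The function name `listMisses` is hypothetical, and skipping entries that have no severity (treating them as errors rather than misses) is an assumption; the `ok` field is deliberately not used because its exact meaning is not specified here.

```javascript
// Returns one entry per case whose severity contradicts the dataset's
// expected band ("non_low" expects medium/high, "low" expects low).
function listMisses(resultsDoc) {
  const misses = [];
  for (const dataset of resultsDoc.datasets) {
    const expected = dataset.summary.expected;
    for (const result of dataset.results) {
      // Assumption: an entry without a severity is an error, not a miss.
      if (typeof result.severity !== "string") continue;
      const flagged = result.severity === "medium" || result.severity === "high";
      const asExpected = expected === "non_low" ? flagged : !flagged;
      if (!asExpected) {
        misses.push({ file: dataset.file, id: result.id, severity: result.severity });
      }
    }
  }
  return misses;
}

// Illustrative document in the output shape, not real results.
const exampleResults = {
  schema_version: 1,
  datasets: [
    {
      file: "example-injection.json",
      summary: { expected: "non_low" },
      results: [
        { id: "inj-001", severity: "high" },
        { id: "inj-002", severity: "low" }, // missed injection
      ],
    },
    {
      file: "example-benign.json",
      summary: { expected: "low" },
      results: [
        { id: "ben-001", severity: "low" },
        { id: "ben-002", severity: "medium" }, // false positive
      ],
    },
  ],
};

console.log(listMisses(exampleResults));
```

To run it against a real file, parse data/benchmarks/latest-results.json with `JSON.parse` and pass the resulting object in.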

section 6

Reproducibility

A prospective customer can independently reproduce the published numbers in three steps:

Clone the Kirtonic repository.
Sign in to their workspace dashboard, mint an API key with the extension:verdict scope, copy the value.

Run the harness:

KIRTONIC_API_TOKEN=cw_live_xxx \
node scripts/run-benchmarks.mjs

The results will land in data/benchmarks/latest-results.json in the same shape as the file the public page reads. The numbers will not be identical to ours because (a) classifier inference is non-deterministic at the temperature the API uses, (b) the buyer's workspace may have a different policy compiled in, and (c) network distance is different. They should be in the same neighbourhood. If they are not, that is worth a conversation.

section 7

What these numbers do, and do not, prove

What they prove

The verdict endpoint exists, accepts traffic, and returns structured verdicts in a documented schema.
Detection performance on a documented public adversarial corpus is at the published level on the most recent run.
The benchmark methodology is published in full, the datasets are committed to the repository, and the same numbers can be reproduced by any buyer with a workspace token.

What they do not prove

Performance on a buyer's specific traffic. A real deployment evaluation should be done against a representative sample of the buyer's own data before any go-live decision.
Performance against attacks not represented in the public test sets. New classes of prompt-injection appear regularly; the harness is updated to track them but there will always be a lag.
End-to-end latency in a production deployment. The latency numbers on the page are measured against whichever Kirtonic instance the harness was pointed at, including the dev-server case where they will be substantially higher than a regional deployment would see.

Sample size disclosure

Today's corpus is 75 cases. That is small enough to run in under three minutes and large enough to be directionally meaningful, but it is not large enough to draw statistically tight conclusions about, for example, the gap between 92% and 95% detection. We are growing the corpus by adding new datasets to data/benchmarks/ as new public adversarial collections become available.
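To make the sample-size point concrete, the sketch below computes a 95% Wilson score interval for a detection rate. It assumes the 92% figure corresponds to 46 detections out of the 50 cases in the two adversarial datasets, which is consistent with the schema example but not stated outright. The Wilson interval is a standard choice for proportions on small samples; it is used here for illustration and is not something the harness is documented to compute.

```javascript
// Wilson score interval for a binomial proportion.
// z = 1.96 corresponds to a two-sided 95% confidence level.
function wilsonInterval(successes, n, z = 1.96) {
  const p = successes / n;
  const z2 = z * z;
  const denominator = 1 + z2 / n;
  const centre = (p + z2 / (2 * n)) / denominator;
  const halfWidth =
    (z * Math.sqrt((p * (1 - p)) / n + z2 / (4 * n * n))) / denominator;
  return { low: centre - halfWidth, high: centre + halfWidth };
}

// Assumed inputs: 46 of 50 adversarial cases detected (92%).
const detectionInterval = wilsonInterval(46, 50);
console.log(
  `95% interval: ${(detectionInterval.low * 100).toFixed(1)}% to ` +
    `${(detectionInterval.high * 100).toFixed(1)}%`
);
```

Under those assumptions the interval runs from roughly 81% to 97%, so a true rate of 95% sits comfortably inside it. That is why a three-point gap cannot be resolved at this corpus size.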

section 8

Extending the corpus

Adding a new dataset is a single-file change. Create a new JSON document in data/benchmarks/ matching the format in section 3.4, redeploy, and the harness picks it up automatically on the next run. The public page renders the new dataset as a row in the per-dataset table without code changes.

Datasets we plan to add as time and source-licensing allow:

The full Lakera Gandalf level-by-level corpus (currently we ship a paraphrased subset; the full corpus is on Hugging Face under permissive licensing).
Multilingual prompt-injection variants (French, German, Spanish, Mandarin, Arabic).
The Hugging Face JailbreakBench public split, once we have a licence-compatible export.
An expanded PII benchmark for LLM02 (Sensitive Information Disclosure).
A SOC-2-style audit-readiness corpus: 50 prompts the classifier must log, regardless of severity.

section 9

Scheduling

The harness today is run on demand. Roadmap:

Monthly cron run (Vercel Cron or GitHub Actions schedule). The results JSON commits back to the repository, and the public page picks up the new numbers on the next request. No human in the loop.
Pre-release run in CI on every change to the classifier path. A regression that pushes detection below 85% or the false-positive rate above 5% fails the build.
Per-deployment baseline for self-hosted customers: the harness runs once at install against a representative sample so the customer's own baseline numbers are recorded in the install's audit log.
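The pre-release gate on the roadmap could be expressed as a small check over the `headline` object from the output schema. The following is a sketch of that idea only. The function name `checkRegression` and the shape of the `limits` argument are hypothetical, and the thresholds are the 85% and 5% figures quoted above.

```javascript
// Returns a list of threshold violations; an empty list means the run passes.
function checkRegression(
  headline,
  limits = { minDetectionPct: 85, maxFalsePositivePct: 5 }
) {
  const failures = [];
  if (headline.prompt_injection_detection_pct < limits.minDetectionPct) {
    failures.push(
      `detection ${headline.prompt_injection_detection_pct}% is below ${limits.minDetectionPct}%`
    );
  }
  if (headline.benign_false_positive_pct > limits.maxFalsePositivePct) {
    failures.push(
      `false-positive rate ${headline.benign_false_positive_pct}% is above ${limits.maxFalsePositivePct}%`
    );
  }
  return failures;
}

// Headline values taken from the schema example in section 5.
const exampleHeadline = {
  prompt_injection_detection_pct: 92,
  benign_false_positive_pct: 0,
};

const gateFailures = checkRegression(exampleHeadline);
console.log(gateFailures.length === 0 ? "gate passed" : gateFailures.join("; "));
```

In a CI job, a non-empty result would typically be followed by `process.exit(1)` so that the build fails.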

section 10