Accuracy

Three independent evaluation tracks against v0.11.0: synthetic precision/recall against canonical patterns, real-world DAST against OWASP Juice Shop, and real-world SAST against PyGoat's Django OWASP Top 10 demo. All three indicate that the engine catches what it claims to catch, at the scan times published here, across the range of vulnerability classes real-world codebases contain.

Synthetic F1

1.000

38 TPs / 0 FPs / 0 FNs across 7 detection categories

Juice Shop vs v0.6.1

+5 CRITICALs

35 % faster scan; every config-leak surface flagged

PyGoat real-world

147 findings

12 vulnerability classes in 17 s wall-clock

Track 1 — Synthetic labeled corpus

56 labeled test cases (38 EXPECT_TP, 18 EXPECT_TN) across 7 detection categories. Each category has both EXPECT_TP variants (the engine should flag them) and EXPECT_TN variants (safe shapes the engine should leave alone). The corpus lives in scripts/accuracy/corpus/; ground truth is in scripts/accuracy/manifest.json. The harness pairs emissions to labeled cases by category + title-substring + file + ±6-line tolerance, with nearest-unclaimed-TP matching so a single TP can't be claimed by multiple emissions.
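The pairing rule above can be sketched as follows. This is a minimal illustration, not the code in scripts/accuracy/run.py: the `LabeledCase` and `Emission` types, the case-insensitive title comparison, and the greedy emission-order matching are all assumptions made for the example.

```python
from dataclasses import dataclass

LINE_TOLERANCE = 6  # the ±6-line window described above


@dataclass(frozen=True)
class LabeledCase:  # hypothetical shape of one manifest entry
    category: str
    title_substring: str
    file: str
    line: int
    expect_tp: bool  # True for EXPECT_TP, False for EXPECT_TN


@dataclass(frozen=True)
class Emission:  # hypothetical shape of one engine finding
    category: str
    title: str
    file: str
    line: int


def matches(emission, case):
    """Category, title substring, file, and line window must all agree."""
    return (
        emission.category == case.category
        and case.title_substring.lower() in emission.title.lower()
        and emission.file == case.file
        and abs(emission.line - case.line) <= LINE_TOLERANCE
    )


def score(emissions, cases):
    """Return (tp, fp, fn, tn) for one scan of the corpus."""
    claimed = set()   # indices of EXPECT_TP cases already paired
    unmatched = []    # emissions that paired with no EXPECT_TP case
    for emission in emissions:
        candidates = [
            i for i, case in enumerate(cases)
            if case.expect_tp and i not in claimed and matches(emission, case)
        ]
        if candidates:
            # Nearest unclaimed TP wins, so one TP is never counted twice.
            nearest = min(candidates, key=lambda i: abs(emission.line - cases[i].line))
            claimed.add(nearest)
        else:
            unmatched.append(emission)

    tp = len(claimed)
    fp = len(unmatched)
    fn = sum(1 for i, c in enumerate(cases) if c.expect_tp and i not in claimed)
    tn = sum(
        1 for c in cases
        if not c.expect_tp and not any(matches(e, c) for e in unmatched)
    )
    return tp, fp, fn, tn
```

With this rule, a second emission landing on an already-claimed TP either pairs with another nearby unclaimed TP or counts as a false positive, which is what keeps duplicate findings from inflating recall.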

Category	TP	Precision	Recall	F1
sqli	5	1.000	1.000	1.000
cmdi	5	1.000	1.000	1.000
path_traversal	5	1.000	1.000	1.000
ssrf	3	1.000	1.000	1.000
open_redirect	3	1.000	1.000	1.000
xss	4	1.000	1.000	1.000
secrets	13	1.000	1.000	1.000
OVERALL	38	1.000	1.000	1.000
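For reference, the three scores in the table follow the standard definitions over TP/FP/FN counts. The sketch below uses those definitions; returning 0.0 when a denominator is zero is a convention chosen for the example, not necessarily what the harness does.

```python
def precision_recall_f1(tp, fp, fn):
    """Standard precision, recall, and F1 from raw counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# OVERALL row: 38 TPs, 0 FPs, 0 FNs.
overall = precision_recall_f1(tp=38, fp=0, fn=0)
print("overall: %.3f %.3f %.3f" % overall)

# Illustrative counts only: a single false positive against 5 TPs
# already pulls precision down while recall stays at 1.000.
example = precision_recall_f1(tp=5, fp=1, fn=0)
print("example: %.3f %.3f %.3f" % example)
```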

Honest caveat

1.000/1.000/1.000 means fendix flags all 38 EXPECT_TP patterns and leaves all 18 EXPECT_TN patterns alone in this 56-case corpus, not that it never misses anything. The synthetic corpus measures detection of canonical patterns; real-world FP discipline against juice-shop and production targets is tracked separately in tasks/FP_CORPUS.md in the engine repo. Tracks 2 and 3 below add real-world evidence.

Reproduce

terminal

make build
python3 scripts/accuracy/run.py --python-engine

# Output:
#   Running ./bin/fendix scan --code scripts/accuracy/corpus...
#     20 unique findings  (40 after exploding affected_endpoints)
#
#   CATEGORY             TP   FP   FN   TN    PREC     REC      F1
#   ----------------------------------------------------------------------
#   sqli                  5    0    0    3   1.000   1.000   1.000
#   cmdi                  5    0    0    3   1.000   1.000   1.000
#   path_traversal        5    0    0    3   1.000   1.000   1.000
#   ssrf                  3    0    0    2   1.000   1.000   1.000
#   open_redirect         3    0    0    2   1.000   1.000   1.000
#   xss                   4    0    0    2   1.000   1.000   1.000
#   secrets              13    0    0    3   1.000   1.000   1.000
#   ----------------------------------------------------------------------
#   OVERALL              38    0    0   18   1.000   1.000   1.000

Engine improvements surfaced during this evaluation

The synthetic corpus surfaced two real engine improvements and one latent orchestrator bug; all three fixes shipped in the same session:

Open redirect: the detector required a direct redirect(request.args.get("x")); multi-hop assignments were silently missed. The other five reachable sinks (SQLi / SSRF / XSS / cmd-injection / path-traversal) all had the constant-vs-non-constant filter from TASK-114/120/121/134; open-redirect was the original TASK-114 sink and never got the chain treatment. Recall: 0/3 → 3/3.
cmd-injection: posture aligned with the other reachable sinks via new _cmdi_arg_is_dangerous helper. Pre-fix, os.system("echo hello") fired HIGH despite zero exploitability (TASK-121 chose "fire on every shell-out"). Precision: 0.833 → 1.000.
Orchestrator: runWhiteboxScan now resolves code_path and spec to absolute paths before sending the ScanRequest. The spawner sets cmd.Dir = engineDir, so a relative path was resolved against the child's working directory and silently pointed at nothing. This surfaced as fendix reporting 0 findings on real codebases, a user-blocking regression latent since TASK-118.
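To illustrate the constant-vs-non-constant idea behind the cmd-injection fix, here is a minimal sketch built on Python's standard ast module. It is not the `_cmdi_arg_is_dangerous` helper itself: the function names, the sink list, and the decision to treat any f-string interpolation as dangerous are assumptions for the example, and it does not follow multi-hop assignments.

```python
import ast

# Hypothetical sink list for the sketch: (module name, attribute) pairs.
SHELL_SINKS = {("os", "system"), ("os", "popen")}


def is_shell_sink(call):
    """True for calls written as os.system(...) or os.popen(...)."""
    func = call.func
    return (
        isinstance(func, ast.Attribute)
        and isinstance(func.value, ast.Name)
        and (func.value.id, func.attr) in SHELL_SINKS
    )


def arg_is_dangerous(node):
    """True unless the expression is built only from literal constants."""
    if isinstance(node, ast.Constant):
        return False
    if isinstance(node, ast.JoinedStr):  # f-string: any {interpolation} counts
        return any(isinstance(v, ast.FormattedValue) for v in node.values)
    if isinstance(node, ast.BinOp):      # e.g. "ls " + user_dir
        return arg_is_dangerous(node.left) or arg_is_dangerous(node.right)
    return True                          # names, calls, subscripts, ...


def find_dangerous_shell_calls(source):
    """Return sorted line numbers of shell-outs with a non-constant first argument."""
    lines = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call) and is_shell_sink(node) and node.args:
            if arg_is_dangerous(node.args[0]):
                lines.append(node.lineno)
    return sorted(lines)


sample = (
    "import os\n"
    'os.system("echo hello")\n'      # constant: not flagged
    'os.system("ls " + user_dir)\n'  # flagged
    'os.system(f"cat {name}")\n'     # flagged
)
print(find_dangerous_shell_calls(sample))
```

Under this filter the constant `os.system("echo hello")` call is left alone, while the two calls that mix in a variable are reported.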

Track 2 — OWASP Juice Shop (real-world DAST)

Stock fendix scan --url against bkimminich/juice-shop:v17.1.1. No auth, no --code, no --enable-active — just the default blackbox pipeline against the modern OWASP web-app benchmark.

Metric	v0.6.1 baseline	v0.11.0 (now)	Δ
Total findings (deduped)	7	12	+5
CRITICAL	0	5	+5
Scan duration	42 s	27 s	−35 %
Endpoints scanned	97	97	—

All 5 new CRITICALs are exposed-config-file detections shipped in TASK-133 (Phase 17d): macOS .DS_Store; environment files .env / .env.local / .env.production; Git repository internals .git/HEAD / .git/config / .git/index; Apache .htaccess + .htpasswd. CWE-538.

Caveat

Juice Shop is an SPA: every unknown URL returns 200 with the index.html body. The 5 CRITICALs could be SPA-fallback responses rather than literal config-file leaks. That is still a real security issue: an SPA serving identical content for known-config paths is exploitable for cache poisoning, WAF confusion, and operator-side confusion during incident response. Fendix correctly flags them; remediation may be "configure the server to 404 these paths" rather than "rotate the leaked secret."
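One way to tell the two cases apart during triage is to compare the response for the config path against the response for a path that cannot exist. The sketch below shows that comparison on already-fetched responses; it is a triage aid under stated assumptions, not something fendix is documented to do, and the function name and return labels are hypothetical.

```python
import hashlib


def body_fingerprint(body):
    """SHA-256 of the raw response body."""
    return hashlib.sha256(body).hexdigest()


def classify_config_hit(status, body, baseline_status, baseline_body):
    """Classify a response for a config path such as /.env.

    The baseline is the response for a random path that cannot exist
    on the target (fetched separately by the caller).
    """
    if status != 200:
        return "not_exposed"
    same_body = body_fingerprint(body) == body_fingerprint(baseline_body)
    if baseline_status == 200 and same_body:
        # Identical to the catch-all page: the server is falling back to index.html.
        return "spa_fallback"
    return "likely_real_file"


index_html = b"<!doctype html><html><body><app-root></app-root></body></html>"
print(classify_config_hit(200, index_html, 200, index_html))
print(classify_config_hit(200, b"ref: refs/heads/master\n", 200, index_html))
```

An exact-hash comparison is deliberately strict; a target that injects per-request tokens into index.html would need a looser similarity measure.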

Reproduce

terminal

JS_PORT=3001 FENDIX_BIN=./bin/fendix bash scripts/benchmark/run-juice-shop.sh

# Output (bench-results/juice-shop/<timestamp>/):
#   Fendix benchmark — OWASP Juice Shop
#   Fendix version:  fendix v0.11.0 (darwin/arm64)
#   Target image:    bkimminich/juice-shop:v17.1.1
#   Scan duration:   27 seconds
#   Endpoints:       97
#   Total findings:  12
#
#   By severity:    CRITICAL: 5  HIGH: 0  MEDIUM: 4  LOW: 2  INFO: 1

Track 3 — PyGoat (real-world SAST)

Clone of adeyosemanputra/pygoat — a Django app intentionally vulnerable to every OWASP Top 10 category. 52 Python files plus JavaScript assets. Scan via fendix scan --code /tmp/pygoat --python-engine with no auth or active probing.

Total findings

147

1 CRITICAL · 146 HIGH

Scan duration

17.1 s

52 Python files + JS assets

Categories detected

Every OWASP Top 10 class PyGoat advertises

Severity	Vulnerability class	First detection
CRITICAL	Unsafe pickle deserialization (RCE)	dockerized_labs/insec_des_lab/main.py:36
HIGH	Unsafe eval() with dynamic arg	introduction/mitre.py:218
HIGH	subprocess(shell=True)	introduction/mitre.py:233
HIGH	Unsafe yaml.load() (RCE)	introduction/lab_code/test.py:23
HIGH	SSRF — dynamic URL	2 sites (incl. views.py:963)
HIGH	innerHTML XSS	introduction/static/js/a9.js:40
HIGH	Open redirect — 9 sites	broken_auth_lab/app.py:107
HIGH	Hardcoded API key / password / JWT	3 distinct files
HIGH × 135	Vulnerable dependency (certifi, cryptography, django, …)	requirements.txt

Caveats

No ground-truth manifest for PyGoat. PyGoat documents OWASP Top 10 lessons but doesn't ship a machine-readable line manifest. We can't compute precision/recall here — only confirm every category PyGoat advertises was detected. That's qualitative real-world fitness, not a quantitative number.
High count is by design. PyGoat is deliberately vulnerable; 147 findings on a 52-file codebase is the expected shape, not a noise problem. A well-maintained production codebase should produce far fewer findings; verifying that is what the FP corpus measurement track is for.
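The qualitative check described in the first caveat amounts to a set difference between the classes a target advertises and the classes the scan reported. A minimal sketch, assuming a hand-written list of advertised classes (PyGoat ships no such manifest, so the list below is a placeholder) and class labels taken from the table above:

```python
def missing_classes(advertised, detected):
    """Advertised classes with no matching finding, compared case-insensitively."""
    seen = {name.strip().lower() for name in detected}
    return sorted(name for name in advertised if name.strip().lower() not in seen)


# Placeholder list written by hand for the example, not shipped by PyGoat.
advertised = ["Open redirect", "SSRF", "innerHTML XSS", "Unsafe yaml.load()"]

# Labels as they might appear in scan output.
detected = ["open redirect", "ssrf", "innerhtml xss", "unsafe yaml.load()", "unsafe eval()"]

print(missing_classes(advertised, detected))
```

An empty result means every listed class had at least one finding; it says nothing about whether every instance of that class was found.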

Reproduce

terminal

git clone --depth 1 https://github.com/adeyosemanputra/pygoat /tmp/pygoat
./bin/fendix scan --code /tmp/pygoat --python-engine --max-duration 60s

# Output:
#   scan complete duration=17.082s total=147 critical=1 high=146 medium=0
#
# By category:
#   deps         135  (real CVE-tagged dependencies in requirements.txt)
#   injection      9  (SSRF/XSS/eval/subprocess-shell/pickle/yaml/open-redirect)
#   secrets        3  (hardcoded API key, password, JWT)

Going deeper

docs/accuracy.md — upstream source of every number on this page (kept in sync).
/performance — cold-start latency + binary-size benchmark.
scripts/accuracy/run.py — the harness. ~250 LOC, reads cleanly top-to-bottom.
tasks/FP_CORPUS.md — the real-world FP discipline corpus (juice-shop, fendix-self, TwiScope). The flip side of this page: what fendix doesn't catch / what it flags by mistake.