Thomas Peng

I build agentic systems. I evaluate them honestly.

Abstract

Graphic designer turned AI-native agent engineer. Four production artifacts, all venturing a shared orchestration kernel (Quorum core/): a real substrate proven on distinct problems. Results reported verbatim, nulls included. Deterministic scoring throughout. No LLM judge in the success path.

thomas@thomaspeng.ca github.com/7P3ng ↗

Fig. 1. Adversarial verification convergence.

False-positive findings culled across K=3 skeptic passes. 27.8% to 0.0% on 36-snippet labeled set (Quorum, DeepSeek-v4-pro).

Throughline

One substrate. Four proofs.

Quorum, Aegis, and FieldAgent each vendor Quorum's core/ orchestration layer: cost-aware model routing, adversarial multi-agent verification, full tracing. The substrate is not theoretical. Each artifact exercises it on a distinct problem domain, with held-out data and deterministically-scored evals. The Skill-Tuning Council applies the same verification discipline internally.

Artifacts

Paper 01Flagship. Cost-aware orchestrator. Adversarial verification.

Quorum

Task-aware model routing (DeepSeek to Haiku to Sonnet to Opus) with K=3 adversarial skeptic verification and full trace UI.

Results

Metric	Baseline	K=3 Verified
False Positive Rate	27.8%	0.0%
95% CI	[11.1, 50.0]	[0, 0]
Recall	100%	77.8%
Labeled set	36 snippets (incl. prompt-injection traps)
Held-out bugs	3/3 found, 0 surviving FP
Cost per run	approx. $0.25

Cost-routing number is operator-gated on an Anthropic key. Live multi-tier harness committed; the routing stat is not fabricated, it is gated.

58 tests. ruff + mypy + CI green. make eval-dry reproduces offline.

github.com/7P3ng/quorum ↗

Fig. 2. Live trace UI.

quorum.thomaspeng.caOpen live ↗

Fig. 2.Quorum live trace UI↗

Paper 02Adaptive red-team gauntlet. Deterministic scoring.

Aegis

An adaptive attacker agent red-teams a target on two harmless proxies: canary-string extraction and prompt-injection sentinel. Scored by exact match; no LLM judge. Layered defenses measurably cut attack success.

Honest nullThe full defense stack erases the model-robustness gap.

Key findings

Measurement	Reasoning	Standard	Sig.
Injection ASR (no defense)	49.3%	68.1%	p=0.0012
Canary ASR (no defense)	10.4%	21.5%	p=0.010
Overall ASR (no defense)	p=0.0002 overall		Sig.
Full defense stack ASR	1.7%	2.8%	p=0.40, n.s.
Defense reduction	29.2% to 4.2% (-25pp)
Adaptation lift (at scale)	24.0% to 29.9%		McNemar b=17/c=0

The sophisticated finding: a reasoning model is significantly more robust in isolation. The full defense stack erases the gap (p=0.40, not significant). Defenses matter more than model choice. Scaling is the legitimate power lever for adaptation, not p-hacking at small n.

78 tests. CI + Pages green.

github.com/7P3ng/aegis ↗

Fig. 3. Live red-team demo.

7p3ng.github.io/aegisOpen live ↗

Fig. 3.Aegis live red-team demo↗

Paper 03CUAD contract red-flag finder. Span-IoU graded.

FieldAgent

An agent reads a real commercial contract and flags risk-bearing clauses (span + severity + plain-English risk), graded span-IoU against CUAD gold. No LLM judge in the success path.

Honest nullThe agentic chunking lift is model-specific noise.

Results

Metric	Value
Detection F1	0.548
Precision / Recall	0.741 / 0.435
95% CI	[0.460, 0.637]
Eval set	20 held-out CUAD contracts
vs keyword floor	+0.21 F1 (robust, baseline-independent)
Agentic chunking lift	+0.07 fair (CIs overlap). Appeared +0.45 on DeepSeek due to truncation artifact.

The honest finding: the agentic chunking lift is model-specific noise. It looked like +0.45 on DeepSeek only because of output truncation. A fair rerun collapses it to +0.07 with overlapping CIs. Party names and figures are redacted in the demo.

47 tests. CI green.

github.com/7P3ng/fieldagent ↗

Fig. 4. Live contract analyzer.

fieldagent.thomaspeng.caOpen live ↗

Fig. 4.FieldAgent live contract analyzer↗

Paper 04Internal methodology. Systems-design. No public URL.

Skill-Tuning Council

A 4-proxy council (taste, pragmatism, intent, anti-drift) votes on every skill self-improvement before it ships. Pipeline: adversary, editors, merger, council, escalate-on-disagreement. Internal infrastructure. 576 tests.

Architecture

The council applies the same adversarial-verification discipline as Quorum to the problem of skill self-improvement. An adversary generates proposals; editors refine; a merger consolidates; the council votes; dissent triggers escalation. The anti-drift proxy prevents convergence to a local optimum.

Presented as a methodology piece rather than a shipped product; no public demo URL exists.

Fig. 5. Pipeline (representative output)

$ stc run --skill gsap-mastery --round 3

[adversary] generating 6 improvement proposals...

[editors] refining 4/6 proposals accepted

[merger] consolidating to 2 candidates

[council] proxy:taste vote: ACCEPT (0.87)

[council] proxy:pragmatism vote: ACCEPT (0.91)

[council] proxy:intent vote: ACCEPT (0.83)

[council] proxy:anti-drift vote: REJECT (0.44)

[escalate] dissent detected, routing to arbitration...

[result] candidate-1 merged with anti-drift constraint

[tests] 576 passed, 0 failed

Evaluation discipline

How I build and evaluate. These are not aspirational principles; every artifact above was built under these constraints.

M.01

Deterministic scoring

No LLM judge in the success path. Exact match, span-IoU, McNemar. If the metric can drift with a prompt, it is not in the headline number.

M.02

Adversarial verification

K skeptics per finding, not one pass. Prompt-injection traps in the labeled set. An adversary agent red-teams the system before results are reported.

M.03

Cost-gated reproducibility

make eval-dry runs offline. make eval is budgeted (~$0.25). CI green on every commit. Numbers are reproducible, not cherry-picked.

M.04

Honest nulls

The FieldAgent agentic lift was +0.45, then collapsed to +0.07 on a fair rerun. That is the headline. Nulls are reported prominently, not buried. They are the point.

M.05

Shared kernel discipline

Quorum core/ is the substrate. Proving it on three distinct domains (code review, red-team, contract analysis) is the generalization argument.

M.06

CI on every artifact

Quorum: 58 tests, ruff + mypy. Aegis: 78 tests. FieldAgent: 47 tests. Skill-Tuning Council: 576 tests. Green is a precondition for a result, not an afterthought.

Contact

Applied AI. Forward-Deployed Engineer. Agent Engineer. Design Engineer. If you are building systems that need to be evaluated honestly, let's talk.

thomas@thomaspeng.ca github.com/7P3ng ↗