Thomas Peng

I build agentic systems. I evaluate them honestly.

Abstract

Graphic designer turned AI-native agent engineer. Four production artifacts, all venturing a shared orchestration kernel (Quorum core/): a real substrate proven on distinct problems. Results reported verbatim, nulls included. Deterministic scoring throughout. No LLM judge in the success path.

Fig. 1. Adversarial verification convergence.

False-positive findings culled across K=3 skeptic passes. 27.8% to 0.0% on 36-snippet labeled set (Quorum, DeepSeek-v4-pro).

Throughline

One substrate. Four proofs.

Quorum, Aegis, and FieldAgent each vendor Quorum's core/ orchestration layer: cost-aware model routing, adversarial multi-agent verification, full tracing. The substrate is not theoretical. Each artifact exercises it on a distinct problem domain, with held-out data and deterministically-scored evals. The Skill-Tuning Council applies the same verification discipline internally.

02

Artifacts

Paper 01Flagship. Cost-aware orchestrator. Adversarial verification.

Quorum

Task-aware model routing (DeepSeek to Haiku to Sonnet to Opus) with K=3 adversarial skeptic verification and full trace UI.


Results

MetricBaselineK=3 Verified
False Positive Rate27.8%0.0%
95% CI[11.1, 50.0][0, 0]
Recall100%77.8%
Labeled set36 snippets (incl. prompt-injection traps)
Held-out bugs3/3 found, 0 surviving FP
Cost per runapprox. $0.25
Cost-routing number is operator-gated on an Anthropic key. Live multi-tier harness committed; the routing stat is not fabricated, it is gated.

58 tests. ruff + mypy + CI green. make eval-dry reproduces offline.

Fig. 2. Live trace UI.
quorum.thomaspeng.caOpen live ↗
Fig. 2.Quorum live trace UI
Paper 02Adaptive red-team gauntlet. Deterministic scoring.

Aegis

An adaptive attacker agent red-teams a target on two harmless proxies: canary-string extraction and prompt-injection sentinel. Scored by exact match; no LLM judge. Layered defenses measurably cut attack success.

Honest nullThe full defense stack erases the model-robustness gap.

Key findings

MeasurementReasoningStandardSig.
Injection ASR (no defense)49.3%68.1%p=0.0012
Canary ASR (no defense)10.4%21.5%p=0.010
Overall ASR (no defense)p=0.0002 overallSig.
Full defense stack ASR1.7%2.8%p=0.40, n.s.
Defense reduction29.2% to 4.2% (-25pp)
Adaptation lift (at scale)24.0% to 29.9%McNemar b=17/c=0
The sophisticated finding: a reasoning model is significantly more robust in isolation. The full defense stack erases the gap (p=0.40, not significant). Defenses matter more than model choice. Scaling is the legitimate power lever for adaptation, not p-hacking at small n.

78 tests. CI + Pages green.

Fig. 3. Live red-team demo.
7p3ng.github.io/aegisOpen live ↗
Fig. 3.Aegis live red-team demo
Paper 03CUAD contract red-flag finder. Span-IoU graded.

FieldAgent

An agent reads a real commercial contract and flags risk-bearing clauses (span + severity + plain-English risk), graded span-IoU against CUAD gold. No LLM judge in the success path.

Honest nullThe agentic chunking lift is model-specific noise.

Results

MetricValue
Detection F10.548
Precision / Recall0.741 / 0.435
95% CI[0.460, 0.637]
Eval set20 held-out CUAD contracts
vs keyword floor+0.21 F1 (robust, baseline-independent)
Agentic chunking lift+0.07 fair (CIs overlap). Appeared +0.45 on DeepSeek due to truncation artifact.
The honest finding: the agentic chunking lift is model-specific noise. It looked like +0.45 on DeepSeek only because of output truncation. A fair rerun collapses it to +0.07 with overlapping CIs. Party names and figures are redacted in the demo.

47 tests. CI green.

Fig. 4. Live contract analyzer.
fieldagent.thomaspeng.caOpen live ↗
Fig. 4.FieldAgent live contract analyzer
Paper 04Internal methodology. Systems-design. No public URL.

Skill-Tuning Council

A 4-proxy council (taste, pragmatism, intent, anti-drift) votes on every skill self-improvement before it ships. Pipeline: adversary, editors, merger, council, escalate-on-disagreement. Internal infrastructure. 576 tests.


Architecture

The council applies the same adversarial-verification discipline as Quorum to the problem of skill self-improvement. An adversary generates proposals; editors refine; a merger consolidates; the council votes; dissent triggers escalation. The anti-drift proxy prevents convergence to a local optimum.

Presented as a methodology piece rather than a shipped product; no public demo URL exists.

Fig. 5. Pipeline (representative output)

$ stc run --skill gsap-mastery --round 3
[adversary] generating 6 improvement proposals...
[editors] refining 4/6 proposals accepted
[merger] consolidating to 2 candidates
[council] proxy:taste vote: ACCEPT (0.87)
[council] proxy:pragmatism vote: ACCEPT (0.91)
[council] proxy:intent vote: ACCEPT (0.83)
[council] proxy:anti-drift vote: REJECT (0.44)
[escalate] dissent detected, routing to arbitration...
[result] candidate-1 merged with anti-drift constraint
[tests] 576 passed, 0 failed
03

Evaluation discipline

How I build and evaluate. These are not aspirational principles; every artifact above was built under these constraints.

M.01

Deterministic scoring

No LLM judge in the success path. Exact match, span-IoU, McNemar. If the metric can drift with a prompt, it is not in the headline number.

M.02

Adversarial verification

K skeptics per finding, not one pass. Prompt-injection traps in the labeled set. An adversary agent red-teams the system before results are reported.

M.03

Cost-gated reproducibility

make eval-dry runs offline. make eval is budgeted (~$0.25). CI green on every commit. Numbers are reproducible, not cherry-picked.

M.04

Honest nulls

The FieldAgent agentic lift was +0.45, then collapsed to +0.07 on a fair rerun. That is the headline. Nulls are reported prominently, not buried. They are the point.

M.05

Shared kernel discipline

Quorum core/ is the substrate. Proving it on three distinct domains (code review, red-team, contract analysis) is the generalization argument.

M.06

CI on every artifact

Quorum: 58 tests, ruff + mypy. Aegis: 78 tests. FieldAgent: 47 tests. Skill-Tuning Council: 576 tests. Green is a precondition for a result, not an afterthought.

04

Contact

Applied AI. Forward-Deployed Engineer. Agent Engineer. Design Engineer. If you are building systems that need to be evaluated honestly, let's talk.

thomas@thomaspeng.cagithub.com/7P3ng ↗