I'm the solo founder of TAB Platform (tabverified.ai), an independent AI agent verification platform. We run 340+ benchmarks across 59 models from 5 providers at $0.03 per text test case. Pay-as-you-go credits, no subscription.
A few of your findings match what we've been measuring for 13 months:
The 33x cost spread from scaffold choice lines up with our data. We ran Claude Opus through two different harness configurations on the same benchmark, same test cases, and the score went from 42% with one harness to 78% with the other. We maintain 101 harness configurations specifically to isolate this variable.
Your reliability point is critical. We built a Benchmark Calibration Suite with 10 synthetic agents and 5 adversarial gaming agents that runs daily (135/135 checks) specifically because single-run numbers aren't trustworthy. We've also seen a 20-point spread in GPT-4o's score across 4 identical runs.
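If you want to sanity-check the run-to-run variance point on your own benchmark, the check itself is tiny. A minimal sketch (run_once / my_harness are placeholders for whatever executes one full pass of your benchmark and returns a 0-100 score, not TAB's actual code):

    import statistics
    from typing import Callable

    def repeat_run_spread(run_once: Callable[[], float], n_runs: int = 4) -> dict:
        """Run the same evaluation n_runs times and summarize run-to-run variation."""
        scores = [run_once() for _ in range(n_runs)]
        return {
            "scores": scores,
            "spread": max(scores) - min(scores),  # max minus min, the "20-point" figure above
            "stdev": statistics.stdev(scores),
        }

    # Usage:
    #   summary = repeat_run_spread(lambda: my_harness.evaluate("gpt-4o"), n_runs=4)
    #   print(summary["spread"], summary["stdev"])

If the spread comes back anywhere near 20 points, a single-run leaderboard number is telling you very little.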
The conclusion that "whoever can pay for evaluation gets to write the leaderboard" is the reason TAB exists. Independent evaluation shouldn't require a frontier lab's compute budget.
Would be interested in comparing notes if any of you are open to it.
Rod Miller
tabverified.ai
rod@tabverified.ai