Agent Cost Optimizer – Living README
Version: 2026-05-11 (post-critique, post-cleanup)
What This Is
A static cascade router for SWE-bench coding agents. It tries cheap models first and escalates to the frontier model only when needed. It saves ~56% of cost at statistically equivalent quality. No machine learning. Just a for-loop.
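In outline, the whole router is a loop over tiers. A minimal sketch of the idea, with run_agent and passes_checks passed in as placeholders rather than the actual aco API:

```python
def cascade(task, run_agent, passes_checks, tiers=("t1-cheap", "t2-mid", "t4-frontier")):
    """Try cheap tiers first; escalate only when the tier's patch fails the acceptance gate.

    Sketch only: tier labels and the run_agent / passes_checks callables are placeholders.
    """
    patch, total_cost = None, 0.0
    for tier in tiers:
        patch, cost = run_agent(model=tier, task=task)  # run the coding agent at this tier
        total_cost += cost
        if passes_checks(patch, task):                  # tier-level acceptance check
            break                                       # first accepted patch wins; stop escalating
    return patch, total_cost
```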
What This Is NOT
- A trained ML router (we tried, it failed – oracle gap too narrow)
- A doom detector that terminates runs (58-72% of flagged runs recover, so we rescue instead)
- A cache-aware prompt optimizer (only 1.5% static prefix in SWE-bench prompts)
- A complete Agent Cost Optimizer framework (many components are analysis, not implementation)
The One True Result
| Strategy | Solved | Cost | $/Solved |
|---|---|---|---|
| Cascade T1→T2→T4 | 416/500 (83.2%) | $86.33 | $0.21 |
| Frontier single | 391/500 (78.2%) | $158.34 | $0.40 |
| Frontier retry | 420/500 (84.0%) | $196.77 | $0.47 |
The cascade is statistically tied with frontier-retry on quality (95% CI on the solve-rate difference: [-2.8pp, +1.0pp]) and is 56% cheaper.
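The exact derivation of that interval isn't reproduced in this README; one standard way to get a CI of this form is a paired bootstrap over the 500 shared instances. A sketch, assuming per-instance pass/fail arrays are available (not necessarily the method used for the number above):

```python
import numpy as np

def paired_bootstrap_ci(cascade_ok, frontier_retry_ok, n_boot=10_000, seed=0):
    """95% CI (in percentage points) for the difference in solve rate on the same instances.

    Illustration only: assumes boolean per-instance outcomes for both strategies.
    """
    rng = np.random.default_rng(seed)
    a = np.asarray(cascade_ok, dtype=float)
    b = np.asarray(frontier_retry_ok, dtype=float)
    n = len(a)
    diffs = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)        # resample instances with replacement
        diffs.append(a[idx].mean() - b[idx].mean())
    lo, hi = np.percentile(diffs, [2.5, 97.5])  # central 95% of bootstrap differences
    return lo * 100, hi * 100
```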
Files That Matter
Production
- aco/aco_live.py – Drop-in cascade wrapper. Three strategies: cascade, safe_proposal, safe_proposal_t2. Use this.
- config.yaml – Configuration for models, tiers, thresholds (assumed schema sketched below).
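The schema of config.yaml isn't reproduced in this README; a plausible minimal shape, loaded with PyYAML (tier names, model names, and threshold fields below are assumptions, not the real file):

```python
import yaml

# Assumed shape of config.yaml: tier order, per-tier model names, and escalation thresholds.
# The real keys may differ; check config.yaml in the repo.
EXAMPLE_CONFIG = """
strategy: cascade
tiers:
  - name: t1
    model: cheap-model
    max_cost_usd: 0.05
  - name: t2
    model: mid-model
    max_cost_usd: 0.15
  - name: t4
    model: frontier-model
    max_cost_usd: 1.00
"""

config = yaml.safe_load(EXAMPLE_CONFIG)
print([tier["name"] for tier in config["tiers"]])  # ['t1', 't2', 't4']
```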
Validation (new this session)
- validate_cascade.py – Full validation pipeline: clone repo → set up conda env → run cascade agent → verify patch via pytest. No Docker needed (flow sketched below).
- quick_validate.py – Single-instance quick validation. Fast feedback loop.
- dockerless_verify.py – Docker-less verification utility. Replicates SWE-bench conda environments.
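The pipeline those scripts implement reduces to clone → conda env → run agent → pytest. A rough sketch of that flow using subprocess (repo URL, env setup, and the injected run_cascade_agent callable are placeholders, not the scripts' actual interfaces):

```python
import subprocess

def validate_instance(instance_id, repo_url, test_path, run_cascade_agent):
    """Docker-less validation sketch: clone, build a conda env, apply the patch, run pytest."""
    workdir = f"/tmp/{instance_id}"
    subprocess.run(["git", "clone", repo_url, workdir], check=True)
    subprocess.run(["conda", "create", "-y", "-n", instance_id, "python=3.9"], check=True)
    subprocess.run(["conda", "run", "-n", instance_id, "pip", "install", "-e", workdir], check=True)

    patch = run_cascade_agent(instance_id, workdir)  # placeholder for the cascade agent call
    subprocess.run(["git", "apply", "-"], input=patch.encode(), cwd=workdir, check=True)

    result = subprocess.run(
        ["conda", "run", "-n", instance_id, "pytest", test_path],
        cwd=workdir,
    )
    return result.returncode == 0  # True if the instance's tests pass with the patch applied
```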
Analysis
- CORRECTED_REPORT.md – The corrected analysis with all 5 fixes.
- TRUTH.md – Honest assessment of what works, what doesn't, and what's blocking.
- FIXES_COMPLETE.md – Summary of the 5 critical-review fixes.
Dead Ends (documented, not deleted)
- docs/trained_router_final_report.md – Why BERT and XGBoost routing failed.
- docs/pareto_frontier_report.md – Pareto analysis of cost-quality tradeoffs.
Supporting Code
- aco/router.py – Model cascade router.
- aco/telemetry.py – Cost telemetry collector (illustrated below).
- aco/classifier.py – Task cost classifier.
- aco/context_budgeter.py – Context budget policy.
- aco/tool_gate.py – Tool-use cost gate.
- aco/verifier_budgeter.py – Verifier budget policy.
- aco/retry_optimizer.py – Retry/recovery optimizer.
- aco/doom_detector.py – Early termination (uses rescue, not terminate).
- aco/meta_tool_miner.py – Macro tool pattern miner.
- aco/cache_layout.py – Cache-aware prompt layout.
- aco/per_step_router.py – Safe proposal routing.
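Most of these are analysis-stage components rather than the production path (see "What This Is NOT"). As one illustration, the kind of per-call accounting a cost telemetry collector like aco/telemetry.py needs is roughly the following; prices and field names are invented for the sketch, not taken from the module:

```python
from dataclasses import dataclass, field

# Illustrative per-call cost telemetry; the real aco/telemetry.py may track different fields.
PRICE_PER_MTOK = {"t1-cheap": (0.10, 0.40), "t4-frontier": (3.00, 15.00)}  # (input, output) USD, assumed

@dataclass
class Telemetry:
    calls: list = field(default_factory=list)

    def record(self, tier, input_tokens, output_tokens):
        in_price, out_price = PRICE_PER_MTOK[tier]
        cost = (input_tokens * in_price + output_tokens * out_price) / 1_000_000
        self.calls.append({"tier": tier, "cost_usd": cost})
        return cost

    @property
    def total_cost(self):
        return sum(call["cost_usd"] for call in self.calls)
```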
Quick Start
```python
# Drop-in cascade wrapper
from aco.aco_live import CascadeOptimizer

opt = CascadeOptimizer(strategy="cascade")
response = opt.run("Fix the null pointer in utils.py", context={...})
```

```bash
# Full validation (single instance)
python quick_validate.py django__django-14315

# Full validation (batch)
python validate_cascade.py --batch 5 --target cascade-only
```
The Blocking Issue
No Docker daemon is available on HF infrastructure. Without Docker, we can't run SWE-bench test verification in the exact Docker environments.
- The dockerless_verify.py + validate_cascade.py scripts use conda as a workaround.
- Conda provides the same packages but at different filesystem paths, so test patches may not apply cleanly.
To unblock: any host with a Docker daemon and API keys runs:

```bash
python cascade_agent.py --batch 50
```
What Was Spent
- All training: wasted (~$0, free-tier HF Jobs)
- This session: CPU sandbox (free) + a few GPU inference calls (free via HF_TOKEN)
- Total compute cost across all sessions: < $20 (mostly unused GPU sandbox time)
Next Steps (Priority Order)
- Validate cascade on 5-10 instances with conda – WE ARE HERE
- Inspect the cascade-only T2 patches manually
- Run Docker agent on 50 instances (needs Docker daemon)
- Frontier with equal retries – give frontier 3 attempts, compare (sketched after this list)
- Add gpt-5.2-medium as a tier – catches 14 frontier-retry-only instances
- Macro tool integration – deterministic subprocess calls
- Doom rescue – replace termination with targeted recovery
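For the equal-retries baseline mentioned above, the control condition is simply the frontier model given up to three verified attempts. A sketch under the same placeholder helpers as the cascade sketch (run_agent and passes_checks are assumptions, not the aco API):

```python
def frontier_with_retries(task, run_agent, passes_checks, max_attempts=3):
    """Baseline sketch: up to max_attempts frontier runs, stopping at the first accepted patch."""
    total_cost = 0.0
    for attempt in range(1, max_attempts + 1):
        patch, cost = run_agent(model="t4-frontier", task=task)
        total_cost += cost
        if passes_checks(patch, task):
            return patch, total_cost, attempt   # solved within the retry budget
    return None, total_cost, max_attempts        # budget exhausted without an accepted patch
```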
Bottom Line
The cascade works on paper. The data is solid. The code is functional. Now we need:
- Validation – prove the conda-based pipeline works on 3-5 instances
- Docker – run the real thing at scale
- Inspection – human review of cascade-only patches