Live Benchmarks

SkyPilot Benchmark

Performance results of AI coding models on SkyPilot tasks, measured by success rate and average execution time.

Total tasks: 10

Last run: 4/8/2026

Model Performance

Rank	Model	Passed	Avg Duration	Success Rate
1	claude-4-6-sonnet (NEW)	8	137.5s	80%
2	gemini-3.1-pro	7	178.2s	70%
3	gpt-5.2-codex	7	110.4s	70%
4	glm-4.7	5	148.7s	50%
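
The page does not state how the Success Rate column is computed, but the figures are consistent with Passed divided by Total tasks (10). A minimal Python sketch under that assumption follows; the `success_rate` helper and the `results` list are illustrative names, not part of the benchmark itself.

```python
TOTAL_TASKS = 10  # "Total tasks" as listed on the page

# (model, passed) pairs copied from the table above
results = [
    ("claude-4-6-sonnet", 8),
    ("gemini-3.1-pro", 7),
    ("gpt-5.2-codex", 7),
    ("glm-4.7", 5),
]


def success_rate(passed: int, total: int = TOTAL_TASKS) -> float:
    """Return the percentage of tasks passed (assumed formula: passed / total)."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * passed / total


# Map each model to its computed success rate
rates = {model: success_rate(passed) for model, passed in results}

for model, rate in rates.items():
    print(f"{model}: {rate:.0f}%")
```

The computed values match the Success Rate column for all four models. Average duration is not used here, and the page does not say how ties in success rate are ranked.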