Live Benchmarks

SkyPilot Benchmark

Performance results of AI coding models on SkyPilot tasks, measured by success rate and average execution time.

Total tasks: 10
Last run: 4/8/2026

Model Performance

Rank  Model                    Passed  Avg Duration  Success Rate
#1    claude-4-6-sonnet (NEW)  8       137.5s        80%
#2    gemini-3.1-pro           7       178.2s        70%
#3    gpt-5.2-codex            7       110.4s        70%
#4    glm-4.7                  5       148.7s        50%
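
The Success Rate figures are consistent with dividing each model's passed count by the 10 total tasks. The sketch below reproduces that arithmetic from the listed results. It is only an illustration: the names `RESULTS`, `success_rate`, and `summarize` are hypothetical, and the formula is an assumption inferred from the numbers shown, not a description of how the benchmark itself computes or ranks results.

```python
# Illustrative sketch only: names and formula are assumptions, not the benchmark's code.
TOTAL_TASKS = 10  # "Total tasks: 10" from the page

# (model, tasks passed, average duration in seconds), copied from the results above
RESULTS = [
    ("claude-4-6-sonnet", 8, 137.5),
    ("gemini-3.1-pro", 7, 178.2),
    ("gpt-5.2-codex", 7, 110.4),
    ("glm-4.7", 5, 148.7),
]


def success_rate(passed, total=TOTAL_TASKS):
    """Return the share of tasks passed, as a percentage (assumed formula)."""
    return 100.0 * passed / total


def summarize(results, total=TOTAL_TASKS):
    """Attach the computed success rate to each (model, passed, duration) row."""
    return [
        (model, passed, duration, success_rate(passed, total))
        for model, passed, duration in results
    ]


if __name__ == "__main__":
    for model, passed, duration, rate in summarize(RESULTS):
        print(f"{model:<20} {passed}/{TOTAL_TASKS}  {duration:6.1f}s  {rate:.0f}%")
```

With only 10 tasks, each task is worth 10 percentage points, so a one-task difference between two models (for example 70% versus 80%) is the smallest gap the table can show.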