Task

Detailed breakdown of per-task timings, in seconds, for each of the 30 tasks across the four models.

Task  claude-4-6-sonnet  gemini-3-flash  glm-4.7  gpt-5.2-codex
   1              70.3s           28.6s   125.5s          44.0s
   2              75.0s           64.7s    91.5s          42.7s
   3              35.2s           52.0s    89.6s          49.3s
   4              59.8s           65.0s   101.4s          63.1s
   5              37.2s           38.7s   193.3s          43.1s
   6              95.6s           57.4s   100.8s          33.7s
   7              21.8s           22.8s    82.2s          40.1s
   8              39.7s           42.8s   103.5s          37.6s
   9              35.8s           38.7s    87.0s          32.2s
  10              45.4s           52.5s   144.1s          46.4s
  11              41.6s           38.5s    97.5s          29.4s
  12              47.3s           49.1s    94.7s          28.7s
  13              25.0s           32.9s    68.4s          31.9s
  14              32.0s           34.4s    83.9s          38.6s
  15              48.1s           48.0s   195.9s          32.0s
  16              19.7s           30.8s   108.3s          38.1s
  17              19.0s         3259.1s    47.0s          34.9s
  18              44.3s           54.9s   142.5s          25.5s
  19              44.2s           41.8s    79.9s          42.3s
  20              24.0s           39.4s    74.1s          59.9s
  21              21.6s           34.4s    69.2s          37.7s
  22              40.4s           46.9s    76.5s          54.3s
  23              27.4s           50.6s    48.3s          23.4s
  24              31.8s           37.4s    91.2s          35.4s
  25              25.3s           23.0s    49.8s          39.2s
  26              25.7s           37.4s    78.2s          44.2s
  27              29.2s          123.8s    45.1s          29.9s
  28              19.2s           32.4s    64.0s          33.6s
  29              27.7s           35.0s    50.2s          48.9s
  30              24.5s           27.7s    59.2s          41.5s