Overview

About

Every AI model claims it can code. Git is a different beast. It's not about knowing the answer. It's about knowing the exact command. GitBench puts that to the test.

204 fixtures. Each one asks a model to write a git command. The output gets fuzzy-matched against the correct answer. Every test runs in an isolated repo. Fixture inputs are deterministic within a campaign, but hosted model inference is not fully reproducible. No cherry-picked examples.
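To make "fuzzy-matched" concrete, here is a minimal sketch of one way such a comparison could work. It is not GitBench's actual scorer: the tokenization via `shlex`, the use of `difflib.SequenceMatcher`, the function names, and the 0.9 threshold are all assumptions chosen for illustration.

```python
import shlex
from difflib import SequenceMatcher


def normalize(command: str) -> list[str]:
    """Split a shell command into tokens so spacing and quoting differences vanish."""
    try:
        return shlex.split(command.strip())
    except ValueError:
        # Unbalanced quotes: fall back to plain whitespace splitting.
        return command.split()


def similarity(candidate: str, expected: str) -> float:
    """Return a 0..1 similarity ratio computed over command tokens."""
    return SequenceMatcher(None, normalize(candidate), normalize(expected)).ratio()


def passes(candidate: str, expected: str, threshold: float = 0.9) -> bool:
    """Hypothetical pass rule: the token similarity must reach the threshold."""
    return similarity(candidate, expected) >= threshold
```

Comparing tokens rather than raw strings means `git  commit -m 'fix'` and `git commit -m "fix"` count as identical, while an unrelated command such as `git log` scores far below a `git rebase -i HEAD~3` reference.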

Picking an AI coding assistant? Fine-tuning your own model? Just curious how the latest LLMs handle Git? This dashboard gives you transparent data from deterministic fixtures. Not marketing.

Model Summary

Quadrant Comparison

Pick two criteria. The shaded quadrant marks the optimal direction: for example, lower cost with higher pass rate.

Cost per Full Run

API Time

API call latency across successful fixture calls. It excludes fixture setup, scoring, and cleanup.
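As a rough illustration of that definition, the sketch below summarizes latency over successful calls only. The `FixtureCall` record, its field names, and the choice of summary statistics are hypothetical, not taken from the GitBench codebase.

```python
from dataclasses import dataclass
from statistics import mean, median
from typing import Optional


@dataclass
class FixtureCall:
    """Hypothetical record of one fixture's API call."""
    api_seconds: float  # time spent in the API call only, not setup, scoring, or cleanup
    succeeded: bool     # whether the API call returned successfully


def api_time_summary(calls: list[FixtureCall]) -> Optional[dict]:
    """Summarize latency over successful calls; failed calls are ignored."""
    durations = [c.api_seconds for c in calls if c.succeeded]
    if not durations:
        return None
    return {
        "count": len(durations),
        "mean": mean(durations),
        "median": median(durations),
        "max": max(durations),
    }
```

Leaving failed calls out keeps timeouts and errors from skewing the numbers, at the cost of hiding how slow the failures were.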

Token Usage

Benchmark Matrix

The deep dive. Rows are Git task categories, columns are models. Green cells show where a model excels. Red cells reveal weaknesses. Look for models that crush basics but fall apart on rebase or bisect.
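A matrix like this can be built by grouping per-fixture results by category and model. The sketch below assumes each result is a `(category, model, passed)` tuple, which is an illustrative shape rather than GitBench's real data format.

```python
from collections import defaultdict


def pass_rate_matrix(results):
    """Build {category: {model: pass_rate}} from (category, model, passed) tuples."""
    counts = defaultdict(lambda: [0, 0])  # (category, model) -> [passed, total]
    for category, model, passed in results:
        cell = counts[(category, model)]
        cell[0] += int(passed)
        cell[1] += 1

    matrix = {}
    for (category, model), (n_passed, n_total) in counts.items():
        matrix.setdefault(category, {})[model] = n_passed / n_total
    return matrix
```

Each cell's pass rate is what a green or red shade would encode; a model with 1.0 on a basics row and 0.0 on a rebase row is exactly the pattern the matrix is meant to surface.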