GitBench

Benchmark Design

A benchmark fixture is a single test case: a prompt describing a Git task (e.g., "show the last 3 commits as one-liners"), an expected output (the correct command or output), and a scoring configuration. Fixtures are defined in YAML files under the fixtures/ directory and grouped into benchmarks by Git task domain.

The current benchmark suite covers ~17 domains, including log, branch, commit, rebase, merge, bisect, cherry-pick, and more. Each fixture runs in an isolated Git repository set up specifically for that test, ensuring deterministic execution.

Similarity Scoring

Each model output is compared to the expected answer using a fuzzy similarity algorithm (Python's SequenceMatcher from the standard library). The similarity score ranges from 0% (completely different) to 100% (exact match). A fixture is marked as passed when the similarity exceeds the fixture's configured threshold (typically 80–90%, depending on the prompt).
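As a minimal sketch of this comparison, the snippet below uses difflib.SequenceMatcher from the standard library. The function names, the whitespace stripping, and the 0.85 default threshold are assumptions for illustration, not GitBench's actual scorer.

```python
from difflib import SequenceMatcher


def similarity(expected: str, actual: str) -> float:
    """Return a similarity ratio between 0.0 and 1.0 for two strings."""
    # Stripping surrounding whitespace is an assumption, not confirmed
    # GitBench behavior.
    return SequenceMatcher(None, expected.strip(), actual.strip()).ratio()


def passes(expected: str, actual: str, threshold: float = 0.85) -> bool:
    """A fixture passes when similarity strictly exceeds the threshold."""
    return similarity(expected, actual) > threshold


if __name__ == "__main__":
    expected = "git log --oneline -3"
    for candidate in ("git log --oneline -3", "git log -3 --oneline", "rm -rf"):
        print(candidate, round(similarity(expected, candidate), 3),
              passes(expected, candidate))
```

Running the script prints the ratio for an exact match, a reordered but equivalent command, and an unrelated command, which shows how phrasing alone can move the score.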

Similarity scoring approximates correctness: a high score means the output is textually close to the expected answer, which in most cases indicates a correct Git command. However, semantically equivalent answers that are phrased differently may score below the threshold, and textually close but incorrect answers may occasionally pass as false positives.

Evaluation Campaigns

GitBench runs repeated evaluation campaigns. A campaign executes every selected model, reasoning effort, output mode, and fixture for a configurable number of complete trial rounds. The default is three trials; published evaluations may use more. Each campaign has a unique ID and an immutable configuration hash, and it records every raw attempt so aggregates can be traced back to individual evidence.

For each fixture we report the mean one-attempt success rate, which is the proportion of valid attempts that passed. We also report pass_any_at_n, the share of fixtures that passed at least once in the first n comparable attempts. These are different numbers and are labeled separately; mean success is not the probability of passing at least once across trials.
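The difference between the two numbers can be sketched as follows. The data shape (a mapping from fixture ID to an ordered list of valid attempt outcomes) and the fixture IDs are hypothetical, chosen only to make the arithmetic visible.

```python
from typing import Dict, List

# Hypothetical shape: fixture id -> ordered outcomes of valid attempts.
Attempts = Dict[str, List[bool]]


def mean_success_rate(outcomes: List[bool]) -> float:
    """Proportion of valid attempts that passed for one fixture."""
    if not outcomes:
        raise ValueError("no valid attempts to average")
    return sum(outcomes) / len(outcomes)


def pass_any_at_n(attempts: Attempts, n: int) -> float:
    """Share of fixtures with at least one pass in their first n attempts."""
    if not attempts:
        raise ValueError("no fixtures to aggregate")
    passed = sum(1 for outcomes in attempts.values() if any(outcomes[:n]))
    return passed / len(attempts)


if __name__ == "__main__":
    attempts = {
        "log-01": [True, True, True],
        "rebase-02": [False, True, False],
        "bisect-03": [False, False, False],
    }
    for fixture_id, outcomes in attempts.items():
        print(fixture_id, mean_success_rate(outcomes))
    print("pass_any_at_3:", pass_any_at_n(attempts, 3))
```

In this made-up data, the second fixture has a mean success rate of one in three but still counts toward pass_any_at_3, which is why the two metrics are labeled separately.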

A fixture is classified as stable_pass when every valid attempt passes, stable_fail when every valid attempt fails, and flaky when it has both passing and failing valid attempts. Classifications, numerators, and denominators are shown throughout the report so you can see how much evidence supports each aggregate.
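The classification rule reads naturally as a small function. This is a sketch: the "no_valid_attempts" label for an empty list is an assumption, since the text only defines the three classes for fixtures that have valid attempts.

```python
from typing import List, Tuple


def classify(outcomes: List[bool]) -> Tuple[str, int, int]:
    """Return (classification, passes, valid_attempts) for one fixture."""
    passes = sum(outcomes)
    total = len(outcomes)
    if total == 0:
        # Hypothetical label; the documented classes assume valid attempts exist.
        return ("no_valid_attempts", 0, 0)
    if passes == total:
        return ("stable_pass", passes, total)
    if passes == 0:
        return ("stable_fail", passes, total)
    return ("flaky", passes, total)
```

Returning the numerator and denominator alongside the label mirrors how the report shows the evidence behind each classification.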

Deterministic Inputs, Non-Deterministic Models

Fixture inputs are deterministic within a campaign: Git author, committer, reflog timestamps, identities, locale, timezone, and other relevant repository inputs are fixed, and every attempt records hashes for the fixture input, rendered prompt, expected output, request configuration, and scorer configuration. Identical campaign configuration and seed therefore produce identical fixture state.
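One way to pin these inputs is through Git's standard environment variables plus a canonical hash of each recorded input. The sketch below assumes placeholder identity values and a JSON-based hashing scheme; neither is taken from GitBench's implementation.

```python
import hashlib
import json
import os
from typing import Any, Dict, Optional

# Placeholder values for illustration only.
FIXED_DATE = "2024-01-01T00:00:00+00:00"


def fixed_git_env(base: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Return an environment that pins Git identity, dates, timezone, and locale."""
    env = dict(os.environ if base is None else base)
    env.update({
        "GIT_AUTHOR_NAME": "Fixture Author",
        "GIT_AUTHOR_EMAIL": "author@example.invalid",
        "GIT_AUTHOR_DATE": FIXED_DATE,
        "GIT_COMMITTER_NAME": "Fixture Committer",
        "GIT_COMMITTER_EMAIL": "committer@example.invalid",
        "GIT_COMMITTER_DATE": FIXED_DATE,
        "TZ": "UTC",
        "LC_ALL": "C",
    })
    return env


def stable_hash(payload: Any) -> str:
    """SHA-256 over canonical JSON, so key order does not change the hash."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Passing fixed_git_env() as the env argument of subprocess.run when building a fixture repository keeps commit metadata stable across attempts, and stable_hash can be applied to the rendered prompt, expected output, and configuration records.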

However, hosted model inference remains non-deterministic. Provider routing, retry behavior, LLM-judge decisions, and the model's own sampling can all vary between attempts. GitBench records available provider-route metadata and retry history for each attempt, but it does not claim that evaluations are fully reproducible. Treat pass rates as estimates of one-attempt reliability, not as guaranteed probabilities.

Exclusions and Campaign Completeness

Not every failed call counts as a model-quality failure. Structured-output parse or schema-validation failures are valid quality failures because the model returned an unusable answer. In contrast, transport failures, exhausted provider retries, invalid fixture hashes, and unavailable judge results are excluded from quality denominators and make the campaign incomplete.

Campaign completeness therefore has three parts: target attempts must finish with valid quality outcomes, judge scoring must not be exhausted, and any configured safety review must reach an allowed state. Incomplete campaigns are still inspectable, but they are not included in default rankings.
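The split between quality failures and exclusions can be sketched as below. The enum member names and the summary keys are hypothetical, and the safety-review condition is not modeled here.

```python
from enum import Enum
from typing import Dict, List, Union


class Outcome(Enum):
    PASS = "pass"
    FAIL = "fail"
    PARSE_FAILURE = "parse_failure"          # counts as a quality failure
    TRANSPORT_FAILURE = "transport_failure"  # excluded
    RETRIES_EXHAUSTED = "retries_exhausted"  # excluded
    INVALID_FIXTURE_HASH = "invalid_fixture_hash"  # excluded
    JUDGE_UNAVAILABLE = "judge_unavailable"  # excluded


EXCLUDED = {
    Outcome.TRANSPORT_FAILURE,
    Outcome.RETRIES_EXHAUSTED,
    Outcome.INVALID_FIXTURE_HASH,
    Outcome.JUDGE_UNAVAILABLE,
}


def quality_summary(outcomes: List[Outcome]) -> Dict[str, Union[int, bool]]:
    """Count passes over valid attempts and flag any excluded attempts."""
    valid = [o for o in outcomes if o not in EXCLUDED]
    return {
        "passed": sum(1 for o in valid if o is Outcome.PASS),
        "valid": len(valid),
        "excluded": len(outcomes) - len(valid),
        # Any exclusion leaves the campaign incomplete in this sketch.
        "attempts_complete": len(valid) == len(outcomes),
    }
```

A parse failure lowers the pass rate because it stays in the denominator, while a transport failure leaves the rate untouched but marks the campaign incomplete.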

Judge Caching and Resource Normalization

LLM-judge decisions are cached within a campaign by fixture input hash, target output hash, and judge configuration hash. This prevents duplicate judge calls and keeps judge variance separate from target variance. Member-level judge scores and aggregation provenance are retained for auditability.
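A cache keyed on those three hashes might look like the following sketch. The JudgeCache class, its method names, and the judge callable signature are assumptions made for illustration.

```python
from typing import Callable, Dict, Tuple

CacheKey = Tuple[str, str, str]


class JudgeCache:
    """In-memory cache for judge scores within one campaign."""

    def __init__(self, judge: Callable[[str, str], float]) -> None:
        self._judge = judge  # hypothetical callable: (prompt, output) -> score
        self._store: Dict[CacheKey, float] = {}
        self.calls = 0  # number of real judge invocations

    def score(self, fixture_input_hash: str, target_output_hash: str,
              judge_config_hash: str, prompt: str, output: str) -> float:
        key = (fixture_input_hash, target_output_hash, judge_config_hash)
        if key not in self._store:
            # Only call the judge for a combination that has not been seen.
            self.calls += 1
            self._store[key] = self._judge(prompt, output)
        return self._store[key]
```

Because identical target outputs reuse one judge decision, differences between attempts in this sketch come from the target model and not from re-sampling the judge.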

Cost, token, and API-time metrics are reported at two scopes. Ranking charts use the mean per complete trial so campaigns with different trial counts remain comparable. The total campaign cost, tokens, and API time are shown as operational context. Wall-clock duration is tracked separately and is not treated as additive API time.
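The per-trial normalization is simple arithmetic, sketched here with made-up campaign totals. The function name and the figures are illustrative assumptions.

```python
def per_trial_mean(campaign_total: float, complete_trials: int) -> float:
    """Mean of an additive metric (cost, tokens, API time) per complete trial."""
    if complete_trials <= 0:
        raise ValueError("at least one complete trial is required")
    return campaign_total / complete_trials


if __name__ == "__main__":
    # Made-up totals: two campaigns with different trial counts.
    print(per_trial_mean(0.90, 3))  # three-trial campaign
    print(per_trial_mean(1.50, 5))  # five-trial campaign
```

Both made-up campaigns come out at roughly 0.30 per trial even though their totals differ, which is the comparison the ranking charts rely on.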

Legacy Campaigns

Historical result artifacts created before evaluation campaigns are imported as one-trial legacy campaigns. They are marked legacy and do not have repeated-trial evidence. Stability classifications, trial variability, and pass_any_at_n semantics should not be inferred from them. New campaigns should be used when making reliability comparisons.

Pass@k Metric

The pass@k metric is the proportion of fixtures a model passed:

pass@k = passed_fixtures / total_fixtures

For single-run evaluations (the default), k = 1. For repeated campaigns, the headline metric is mean one-attempt success: passing valid attempts divided by scheduled quality attempts. Pass rates are shown as percentages on the Overview and model detail pages.

Model Selection & Run Metadata

Models are called through OpenRouter (for cloud-hosted models) or a local Ollama server. Each run records:

  • profile — the runner profile used (e.g., "openrouter")
  • git_sha — the GitBench commit hash at the time of the run
  • benchmark_suite_version — version of the benchmark suite
  • timestamp — UTC timestamp of the run
  • reasoning_level — optional reasoning effort level (e.g., "low", "high") for models that support it

Model names may include a #level suffix to specify a reasoning effort level. For example, o3-mini#high runs the o3-mini model with high reasoning effort. The base model name is used for grouping.
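Splitting the suffix from the base name can be sketched as below. The function name is hypothetical, and treating an empty suffix as "no level" is an assumption.

```python
from typing import Optional, Tuple


def parse_model_spec(spec: str) -> Tuple[str, Optional[str]]:
    """Split 'name#level' into (base_name, reasoning_level or None)."""
    base, sep, level = spec.partition("#")
    # An absent or empty suffix is treated as "no reasoning level" here.
    return base, (level if sep and level else None)
```

With this sketch, parse_model_spec("o3-mini#high") yields the base name "o3-mini" for grouping and "high" as the reasoning level, while a plain "o3-mini" yields no level.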

Cost Data

When models are called through OpenRouter, the API response includes a cost field (in USD). GitBench extracts this cost and surfaces it on the Models overview page (per-model total and average cost) and on the Cost vs Quality quadrant chart. Models run through Ollama do not produce cost data and will show "—" for cost fields.
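The per-model total and average, including the placeholder for runs without cost data, can be sketched as follows. The function name, the four-decimal formatting, and averaging only over attempts that reported a cost are all assumptions.

```python
from typing import List, Optional, Tuple

MISSING = "\u2014"  # the dash placeholder shown when no cost data exists


def summarize_cost(costs: List[Optional[float]]) -> Tuple[str, str]:
    """Return (total, average) display strings for one model's attempts."""
    known = [c for c in costs if c is not None]
    if not known:
        # For example, a model run through a local Ollama server.
        return (MISSING, MISSING)
    total = sum(known)
    # Assumption: the average covers only attempts that reported a cost.
    return (f"${total:.4f}", f"${total / len(known):.4f}")
```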

Raw run files in gitbench-results/ remain untouched and contain the full run data, including any models filtered from the UI output.

Limitations

  • Only measures output correctness — does not evaluate speed, UX, multi-turn reasoning, or tool use.
  • Results vary by run — LLM outputs and provider routing are non-deterministic. Pass rates should be treated as estimates of one-attempt reliability, not absolute scores.
  • Fixture inputs are deterministic within a campaign, but hosted model evaluation is not fully reproducible. Provider routing, retries, and judge decisions can still vary.
  • Similarity scoring is approximate — a correct answer with different phrasing may score lower; an incorrect but similar-looking answer may pass.
  • Only open-weight models available through OpenRouter (or local Ollama models) are tested. Proprietary model evaluation depends on API access.
  • The benchmark suite is under active development. New fixtures and benchmarks are added regularly, and scoring thresholds may be tuned over time.