GitBench
Linear history with bad commit near beginning - model identifies early commit
Tests the ability to identify an early commit as the bad one in a linear history. Evaluates bisect-style reasoning when the bug is introduced near the beginning of the history.

These commands set up the repo before the model sees the prompt. They define the starting file structure, staged changes, and Git history.

  1. git init
  2. git config user.email 'test@test.com'
  3. git config user.name 'Test User'
  4. echo '#!/bin/bash' > test.sh
  5. echo 'grep "TODO" README.md' >> test.sh
  6. chmod +x test.sh
  7. echo '# Project TODO: document setup' > README.md
  8. git add test.sh README.md
  9. git commit -m 'Initial commit'
  10. echo '# Project' > README.md
  11. git add README.md
  12. git commit -m 'remove setup todo'
  13. echo '## Features' >> README.md
  14. git add README.md
  15. git commit -m 'add features section'
  16. echo '## Installation' >> README.md
  17. git add README.md
  18. git commit -m 'add installation section'
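The setup produces four commits, and test.sh passes only while README.md contains "TODO". As a rough illustration of the search this test case is meant to exercise, the sketch below simulates that history in plain Python and runs a binary search for the first failing commit. It does not call git; the names `HISTORY`, `passes`, and `first_bad` are hypothetical, and the README contents are reconstructed from the echo commands above.

```python
# Simulated README.md contents after each setup commit, oldest first.
# Reconstructed from the echo commands in the setup list (an assumption,
# not output captured from a real repository).
HISTORY = [
    ("Initial commit", "# Project TODO: document setup\n"),
    ("remove setup todo", "# Project\n"),
    ("add features section", "# Project\n## Features\n"),
    ("add installation section", "# Project\n## Features\n## Installation\n"),
]


def passes(readme: str) -> bool:
    """Mirror test.sh: `grep "TODO" README.md` succeeds when a match exists."""
    return "TODO" in readme


def first_bad(history):
    """Binary search for the first failing commit.

    Assumes the oldest commit passes, the newest fails, and the test
    never recovers once it starts failing.
    """
    good, bad = 0, len(history) - 1
    while bad - good > 1:
        mid = (good + bad) // 2
        if passes(history[mid][1]):
            good = mid  # still passing: the bug comes later
        else:
            bad = mid  # already failing: the bug is here or earlier
    return history[bad][0]


print(first_bad(HISTORY))  # remove setup todo
```

With only four commits the search needs a single probe, but the same loop applies to longer linear histories.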
Prompt
You are debugging a git repository. The following commit history has one bad commit that broke the build. Identify the bad commit hash (first 7 characters) or commit subject line that introduced the bug. Use the git log and test output to determine which commit is bad. Your answer should be the commit hash (e.g., 'abc1234') or the commit subject line.
Expected
remove setup todo
deepseek/deepseek-v4-flash:high PASS 100% 189 in → 130 out (130 reasoning)
0280398
deepseek/deepseek-v4-flash:none PASS 100% 195 in → 55 out (0 reasoning)
Based on the test results, the first commit that fails is **dbe4a0a** (remove setup todo). Since the initial commit passes and this commit introduces the failure, it is the one that broke the build. **dbe4a0a**
deepseek/deepseek-v4-flash:none__json_schema PASS 100% 193 in → 10 out (0 reasoning)
af457bb
JSON Schema Structured Output
(raw) { "commit": "af457bb" }
mistralai/devstral-2512 PASS 100% 215 in → 60 out
The bad commit is `9c79de7` (or "remove setup todo"), as it is the first commit in the history where the tests started failing (exit code 1). All subsequent commits also fail, but the bug was introduced in `9c79de7`.
mistralai/devstral-2512__json_schema PASS 100% 203 in → 12 out
d844ee8
JSON Schema Structured Output
(raw) {"commit": "d844ee8"}
nvidia/nemotron-3-nano-30b-a3b:high PASS 100% 222 in → 672 out (671 reasoning)
dac5f8d
nvidia/nemotron-3-nano-30b-a3b:high__json_schema PASS 100% 230 in → 227 out (196 reasoning)
3e0c776
JSON Schema Structured Output
(raw) { "commit": "3e0c776" }
nvidia/nemotron-3-nano-30b-a3b:none__json_schema PASS 100% 224 in → 14 out (0 reasoning)
c20fbf3
JSON Schema Structured Output
(raw) { "commit": "c20fbf3" }
poolside/laguna-xs.2:high PASS 100% 261 in → 386 out (257 reasoning)
Looking at the test results chronologically: - **3113cb2 Initial commit**: PASS - **453353f remove setup todo**: FAIL (first failure) - **bb80382 add features section**: FAIL - **80fa01a add installation section**: FAIL The build was passing after the initial commit, but failed after commit `453353f`. This indicates that commit `453353f` ("remove setup todo") introduced the bug. **Answer: 453353f**
poolside/laguna-xs.2:high__json_schema PASS 100% 265 in → 327 out (314 reasoning)
8900e60
JSON Schema Structured Output
(raw) {"commit":"8900e60"}
poolside/laguna-xs.2:none PASS 100% 253 in → 108 out (0 reasoning)
Looking at the test results, I can see that: - Commit `2cbdceb` (Initial commit): PASS - Commit `3da3980` (remove setup todo): FAIL The build was passing after the initial commit, but failed after commit `3da3980`. This means commit `3da3980` introduced the bug. The bad commit is: **3da3980** (or "remove setup todo" if you prefer the subject line).
poolside/laguna-xs.2:none__json_schema PASS 100% 265 in → 12 out (0 reasoning)
ba73812
JSON Schema Structured Output
(raw) {"commit": "ba73812"}
Invalid JSON. Output: 918a236
JSON Schema Structured Output
Structured Output Error
Failed to parse structured JSON response: Extra data: line 1 column 4 (char 3)
Failure: Failed to parse structured JSON response: Extra data: line 1 column 4 (char 3)
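The error text above has the same form as the message Python's standard `json` module produces for the bare output `918a236`: the parser reads `918` as a complete number and then reports the remaining characters as extra data. The sketch below reproduces that, assuming (the source does not confirm it) that the harness parses responses with `json.loads`.

```python
import json

raw = "918a236"  # the output quoted in the report above
message = None

try:
    json.loads(raw)
except json.JSONDecodeError as err:
    # "918" parses as a number; "a236" is left over at offset 3.
    message = str(err)

print(message)  # Extra data: line 1 column 4 (char 3)

# A schema-conforming response wraps the hash in an object instead.
print(json.dumps({"commit": raw}))  # {"commit": "918a236"}
```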
nvidia/nemotron-3-nano-30b-a3b:none FAIL 0% 228 in → 9 out (0 reasoning)
abc1234 add installation section
Failure: Expected target hash/prefix or subject 'remove setup todo'
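The failure message suggests the grader accepts either the target commit's hash prefix or its subject line. A minimal sketch of such a check follows; the function `is_correct` and the hash value are hypothetical, since the real hash changes on every run and the actual grading code is not shown in the source.

```python
def is_correct(answer: str, target_hash: str, subject: str) -> bool:
    """Accept an answer that names the target commit by its 7-character
    hash prefix or by its subject line (case-insensitive substring match)."""
    text = answer.lower()
    return target_hash[:7].lower() in text or subject.lower() in text


# Hypothetical hash for illustration only.
TARGET_HASH = "1a2b3c4d5e6f"
SUBJECT = "remove setup todo"

# The failing answer names neither the target hash nor its subject.
print(is_correct("abc1234 add installation section", TARGET_HASH, SUBJECT))  # False

# A hash-only or subject-only answer would be accepted.
print(is_correct("1a2b3c4", TARGET_HASH, SUBJECT))  # True
print(is_correct("remove setup todo", TARGET_HASH, SUBJECT))  # True
```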