GitBench
Linear history with bad commit in the middle - model identifies middle commit as bad
Tests whether the model can identify the bad commit when it sits in the middle of a linear history. Evaluates bisect-style reasoning where the bug is introduced at neither the first nor the last commit.

These commands set up the repo before the model sees the prompt. They define the starting file structure, staged changes, and Git history.

  1. git init
  2. git config user.email 'test@test.com'
  3. git config user.name 'Test User'
  4. echo '#!/bin/bash' > test.sh
  5. echo 'grep "42" config.json' >> test.sh
  6. chmod +x test.sh
  7. echo '{"value": 42}' > config.json
  8. git add test.sh config.json
  9. git commit -m 'Initial commit'
  10. echo '{"value": 42, "name": "default"}' > config.json
  11. git add config.json
  12. git commit -m 'set config value'
  13. echo '{"value": 0}' > config.json
  14. git add config.json
  15. git commit -m 'reset config to zero'
  16. echo '{"value": 100}' > config.json
  17. git add config.json
  18. git commit -m 'final config'
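The setup above can be checked end to end with `git bisect run`. The following is a minimal sketch, not part of the benchmark harness: it rebuilds the same four commits in a temporary directory and lets `test.sh` classify each commit by its exit status (`grep` exits 0 when `42` is found and 1 when it is not). The `mktemp` location and the `commit_config` helper are assumptions introduced here for brevity.

```shell
#!/bin/bash
set -euo pipefail

# Rebuild the fixture in a throwaway directory (location is an assumption).
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email 'test@test.com'
git config user.name 'Test User'

# Same test script as the setup: passes only while config.json contains "42".
printf '%s\n' '#!/bin/bash' 'grep "42" config.json' > test.sh
chmod +x test.sh

# commit_config is a hypothetical helper: write config.json, then commit it.
commit_config() {
  echo "$1" > config.json
  git add test.sh config.json
  git commit -q -m "$2"
}
commit_config '{"value": 42}' 'Initial commit'
commit_config '{"value": 42, "name": "default"}' 'set config value'
commit_config '{"value": 0}' 'reset config to zero'
commit_config '{"value": 100}' 'final config'

# Mark HEAD as bad and the root commit as good, then let test.sh drive.
root=$(git rev-list --max-parents=0 HEAD)
git bisect start HEAD "$root" >/dev/null
git bisect run ./test.sh >/dev/null

# refs/bisect/bad points at the first bad commit once the run finishes.
bad_subject=$(git log -1 --format=%s refs/bisect/bad)
git bisect reset >/dev/null 2>&1
echo "$bad_subject"
```

Because `grep "42"` is a substring match, the final commit (`{"value": 100}`) also fails the test, which is why the answer must be the first failing commit rather than the latest one.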
Prompt
You are debugging a git repository. The following commit history has one bad commit that broke the build. Identify the bad commit hash (first 7 characters) or commit subject line that introduced the bug. Use the git log and test output to determine which commit is bad. Your answer should be the commit hash (e.g., 'abc1234') or the commit subject line.
Expected
reset config to zero
deepseek/deepseek-v4-flash:high PASS 100% 191 in → 175 out (168 reasoning)
7fbfcdd
deepseek/deepseek-v4-flash:high__json_schema PASS 100% 189 in → 122 out (109 reasoning)
936ee1b
JSON Schema Structured Output
(raw) { "commit": "936ee1b" }
deepseek/deepseek-v4-flash:none PASS 100% 195 in → 40 out (0 reasoning)
The bug was introduced in commit `09d40f8` (reset config to zero), as this is the first commit that resulted in a FAIL (exit code 1) when tested.
deepseek/deepseek-v4-flash:none__json_schema PASS 100% 193 in → 11 out (0 reasoning)
a589cb3
JSON Schema Structured Output
(raw) { "commit": "a589cb3" }
mistralai/devstral-2512 PASS 100% 209 in → 102 out
Based on the test results, the build started failing at commit `f06f971` ("reset config to zero") and continued to fail in the subsequent commit `1b59dc3` ("final config"). Since the test was passing in `1bcd0e8` ("set config value") but started failing in `f06f971`, the bad commit is: `f06f971` (or "reset config to zero")
mistralai/devstral-2512__json_schema PASS 100% 215 in → 13 out
6b5664b
JSON Schema Structured Output
(raw) {"commit": "6b5664b"}
nvidia/nemotron-3-nano-30b-a3b:high PASS 100% 228 in → 354 out (318 reasoning)
8718561
nvidia/nemotron-3-nano-30b-a3b:high__json_schema PASS 100% 228 in → 386 out (295 reasoning)
bd6e853
JSON Schema Structured Output
(raw) { "commit": "bd6e853" }
nvidia/nemotron-3-nano-30b-a3b:none__json_schema PASS 100% 222 in → 15 out (0 reasoning)
cb5329d
JSON Schema Structured Output
(raw) { "commit": "cb5329d" }
poolside/laguna-xs.2:high PASS 100% 263 in → 368 out (262 reasoning)
Looking at the test results chronologically:
- e2c8528 Initial commit: PASS
- 9128ccc set config value: PASS
- e25177e reset config to zero: FAIL
- 7735d10 final config: FAIL
The bad commit is **e25177e** (first 7 characters: `e25177e`). This is the first commit where the tests started failing.
poolside/laguna-xs.2:high__json_schema PASS 100% 261 in → 243 out (225 reasoning)
b005af0
JSON Schema Structured Output
(raw) { "commit": "b005af0" }
poolside/laguna-xs.2:none PASS 100% 261 in → 106 out (0 reasoning)
Looking at the test results, I can see that:
1. `9a43286 Initial commit` - PASS
2. `d1c0225 set config value` - PASS
3. `99fdd39 reset config to zero` - FAIL
The first commit where the tests failed is `99fdd39 reset config to zero`. This commit introduced the bug since all previous commits passed their tests.
**Answer: 99fdd39**
poolside/laguna-xs.2:none__json_schema PASS 100% 257 in → 13 out (0 reasoning)
080f1c3
JSON Schema Structured Output
(raw) {"commit": "080f1c3"}
nvidia/nemotron-3-nano-30b-a3b:none FAIL 0% 222 in → 9 out (0 reasoning)
eee641d final config
Failure: Expected target hash/prefix or subject 'reset config to zero'
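The failure message suggests an answer is accepted when it contains either the target commit's hash (or a prefix of it) or the target subject line. The actual grader is not shown on this page, so the following is only a sketch of one rule consistent with the results above. The function name `is_correct`, its argument order, and the example hash `abc1234def` are hypothetical.

```shell
#!/bin/bash

# is_correct ANSWER TARGET_HASH TARGET_SUBJECT  (hypothetical helper)
# Returns 0 if ANSWER names the target commit by subject or by hash prefix.
is_correct() {
  local answer=$1 target_hash=$2 subject=$3

  # Accept the target subject line anywhere in the answer.
  [[ $answer == *"$subject"* ]] && return 0

  # Accept any hex token of 7 to 40 characters that is a prefix of the hash.
  local token
  for token in $(grep -oE '[0-9a-f]{7,40}' <<<"$answer"); do
    [[ $target_hash == "$token"* ]] && return 0
  done
  return 1
}

# Example with a made-up target hash: a bare 7-character prefix passes,
# while an answer naming a different commit fails.
is_correct 'abc1234' 'abc1234def' 'reset config to zero' && echo PASS
is_correct 'eee641d final config' 'abc1234def' 'reset config to zero' || echo FAIL
```

Under this rule, an answer such as `eee641d final config` fails because it matches neither the target subject nor a prefix of the target hash.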