GitBench
Linear history with bad commit near end - model identifies second-to-last commit as bad
Tests whether the model identifies the second-to-last commit as the bad one. This exercises bisect-style reasoning when the regression sits near the end of a linear history.

These commands set up the repo before the model sees the prompt. They define the starting file structure, staged changes, and Git history.

  1. git init
  2. git config user.email 'test@test.com'
  3. git config user.name 'Test User'
  4. echo '#!/bin/bash' > test.sh
  5. echo 'grep "ENABLED = true" app.py' >> test.sh
  6. chmod +x test.sh
  7. echo 'ENABLED = true' > app.py
  8. git add test.sh app.py
  9. git commit -m 'Initial commit'
  10. echo '# feature enabled by default' >> app.py
  11. git add app.py
  12. git commit -m 'enable feature'
  13. echo 'ENABLED = false' > app.py
  14. git add app.py
  15. git commit -m 'disable feature'
  16. echo '# add telemetry' >> app.py
  17. git add app.py
  18. git commit -m 'add feature telemetry'
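For readers who want to check the expected answer by hand, the setup commands above can be replayed and handed to `git bisect run`. This is only a sketch under stated assumptions: it works in a throwaway `mktemp` directory, needs `git` and `bash` on the PATH, and uses the hypothetical variable names `repo`, `root`, and `bad_subject`. The page does not say the benchmark itself runs bisect this way.

```shell
#!/bin/sh
set -e

# Throwaway working directory (an assumption; the benchmark's sandbox is not shown).
repo=$(mktemp -d)
cd "$repo"

# Recreate the fixture history from the setup commands.
git init -q
git config user.email 'test@test.com'
git config user.name 'Test User'
echo '#!/bin/bash' > test.sh
echo 'grep "ENABLED = true" app.py' >> test.sh
chmod +x test.sh
echo 'ENABLED = true' > app.py
git add test.sh app.py
git commit -q -m 'Initial commit'
echo '# feature enabled by default' >> app.py
git add app.py
git commit -q -m 'enable feature'
echo 'ENABLED = false' > app.py
git add app.py
git commit -q -m 'disable feature'
echo '# add telemetry' >> app.py
git add app.py
git commit -q -m 'add feature telemetry'

# Mark HEAD as bad and the root commit as good, then let test.sh drive the search.
# test.sh exits 0 when grep finds "ENABLED = true" (good) and 1 otherwise (bad).
root=$(git rev-list --max-parents=0 HEAD)
git bisect start HEAD "$root" >/dev/null
git bisect run ./test.sh >/dev/null

# After the run, refs/bisect/bad points at the first bad commit.
bad_subject=$(git log -1 --format=%s refs/bisect/bad)
git bisect reset >/dev/null 2>&1
echo "$bad_subject"
```

Commit hashes differ on every replay because they depend on timestamps, which is consistent with each model run below reporting a different hash. The subject line is the stable identifier.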
Prompt
You are debugging a git repository. The following commit history has one bad commit that broke the build. Identify the bad commit hash (first 7 characters) or commit subject line that introduced the bug. Use the git log and test output to determine which commit is bad. Your answer should be the commit hash (e.g., 'abc1234') or the commit subject line.
Expected
disable feature
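With a linear history, the task reduces to finding the first failing commit in chronological order. The sketch below assumes a hypothetical one-line-per-commit format, `hash STATUS subject`, listed oldest first. The exact test output shown to the models does not appear on this page, and the hashes are borrowed from one of the recorded responses purely for illustration.

```shell
#!/bin/sh

# Print "hash subject" for the first line whose status is FAIL.
# Input format (hypothetical): "<hash> <PASS|FAIL> <subject...>", oldest first.
first_bad() {
  while read -r hash status subject; do
    if [ "$status" = "FAIL" ]; then
      printf '%s %s\n' "$hash" "$subject"
      return 0
    fi
  done
  return 1  # no failing commit found
}

# Illustrative per-commit results, oldest first.
results='d042ccb PASS Initial commit
794407c PASS enable feature
63dcd11 FAIL disable feature
12ab846 FAIL add feature telemetry'

answer=$(printf '%s\n' "$results" | first_bad)
echo "$answer"   # prints: 63dcd11 disable feature
```

A linear scan is enough for four commits. Bisection only pays off on longer histories, where it cuts the number of test runs from linear to logarithmic.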
deepseek/deepseek-v4-flash:high PASS 100% 191 in → 106 out (106 reasoning)
6febbb2
deepseek/deepseek-v4-flash:high__json_schema PASS 100% 191 in → 132 out (131 reasoning)
f12cd50
JSON Schema Structured Output
(raw) { "commit": "f12cd50" }
deepseek/deepseek-v4-flash:none PASS 100% 204 in → 4 out (0 reasoning)
1ea637a
deepseek/deepseek-v4-flash:none__json_schema PASS 100% 191 in → 13 out (0 reasoning)
fb41c7f
JSON Schema Structured Output
(raw) { "commit": "fb41c7f" }
mistralai/devstral-2512 PASS 100% 209 in → 68 out
Based on the test results, the build started failing at commit `83c548e disable feature` and continued to fail in the subsequent commit `18d73bb add feature telemetry`. The bad commit that introduced the bug is: `83c548e` (or "disable feature")
mistralai/devstral-2512__json_schema PASS 100% 209 in → 12 out
1a3cf96
JSON Schema Structured Output
(raw) {"commit": "1a3cf96"}
nvidia/nemotron-3-nano-30b-a3b:high PASS 100% 220 in → 348 out (306 reasoning)
\boxed{c2493ac}
nvidia/nemotron-3-nano-30b-a3b:high__json_schema PASS 100% 220 in → 309 out (271 reasoning)
bdb143f
JSON Schema Structured Output
(raw) { "commit": "bdb143f" }
nvidia/nemotron-3-nano-30b-a3b:none PASS 100% 216 in → 8 out (0 reasoning)
abc1234 disable feature
nvidia/nemotron-3-nano-30b-a3b:none__json_schema PASS 100% 224 in → 16 out (0 reasoning)
666b075
JSON Schema Structured Output
(raw) { "commit": "666b075" }
poolside/laguna-xs.2:high PASS 100% 253 in → 437 out (302 reasoning)
Looking at the test results chronologically:
1. **d042ccb** (Initial commit): PASS
2. **794407c** (enable feature): PASS
3. **63dcd11** (disable feature): FAIL ← **First failure**
4. **12ab846** (add feature telemetry): FAIL
The build was passing after `794407c` but failed after `63dcd11`. This indicates that commit `63dcd11` ("disable feature") introduced the bug.
**Answer: 63dcd11**
poolside/laguna-xs.2:high__json_schema PASS 100% 257 in → 288 out (274 reasoning)
e1c230d
JSON Schema Structured Output
(raw) {"commit": "e1c230d"}
poolside/laguna-xs.2:none PASS 100% 253 in → 109 out (0 reasoning)
Looking at the test results, I can see that:
- The initial commit (73ace31) passes
- Enabling the feature (6b6ffd2) passes
- Disabling the feature (33bc269) fails
- Adding feature telemetry (5a76480) also fails
The bad commit is **33bc269** with the subject "disable feature" - this is the first commit that introduced the failure.
**33bc269**
poolside/laguna-xs.2:none__json_schema PASS 100% 259 in → 13 out (0 reasoning)
74e506b
JSON Schema Structured Output
(raw) {"commit": "74e506b"}