GitBench
Linear history with bad commit at end - model identifies most recent commit as bad
Tests whether the model can identify the most recent commit as the bad one when bisecting a linear history. This is the basic bisect case, with the bug at HEAD.

These commands set up the repo before the model sees the prompt. They define the starting file structure, staged changes, and Git history.

  1. git init
  2. git config user.email 'test@test.com'
  3. git config user.name 'Test User'
  4. echo '#!/bin/bash' > test.sh
  5. echo 'grep "return a + b" src/index.js' >> test.sh
  6. chmod +x test.sh
  7. mkdir -p src && echo 'function add(a, b) { return a + b; }' > src/index.js
  8. git add test.sh src/index.js
  9. git commit -m 'Initial commit'
  10. echo '// feature X enabled' >> src/index.js
  11. git add src/index.js
  12. git commit -m 'add feature X'
  13. echo 'function add(a, b) { return a - b; }' > src/index.js
  14. git add src/index.js
  15. git commit -m 'change add operation'
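
The setup commands above can be replayed and bisected mechanically to confirm the expected answer. The following is a minimal sketch, not the benchmark's own harness: it assumes git is installed, rebuilds the fixture in a temporary directory, treats the root commit as known good and HEAD as known bad, and lets `git bisect run` drive `test.sh`. The variable names (`repo`, `good`, `bad_hash`, `bad_subject`) are illustrative.

```shell
#!/bin/bash
set -euo pipefail

# Rebuild the fixture in a throwaway directory (mirrors the setup commands).
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email 'test@test.com'
git config user.name 'Test User'
printf '%s\n' '#!/bin/bash' 'grep "return a + b" src/index.js' > test.sh
chmod +x test.sh
mkdir -p src
echo 'function add(a, b) { return a + b; }' > src/index.js
git add test.sh src/index.js
git commit -q -m 'Initial commit'
echo '// feature X enabled' >> src/index.js
git add src/index.js
git commit -q -m 'add feature X'
echo 'function add(a, b) { return a - b; }' > src/index.js
git add src/index.js
git commit -q -m 'change add operation'

# Assumption: the root commit is good and HEAD is bad.
good=$(git rev-list --max-parents=0 HEAD)
git bisect start HEAD "$good" >/dev/null
# test.sh exits 0 (good) while the original "return a + b" line is present.
git bisect run ./test.sh >/dev/null

# After the run, refs/bisect/bad points at the first bad commit.
bad_hash=$(git rev-parse --short=7 refs/bisect/bad)
bad_subject=$(git log -1 --format=%s refs/bisect/bad)
git bisect reset >/dev/null 2>&1
echo "$bad_hash $bad_subject"
```

The printed hash varies from run to run because each rebuild produces new commits; the subject line is the stable part of the answer.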
Prompt
You are debugging a git repository. The following commit history has one bad commit that broke the build. Identify the bad commit hash (first 7 characters) or commit subject line that introduced the bug. Use the git log and test output to determine which commit is bad. Your answer should be the commit hash (e.g., 'abc1234') or the commit subject line.
Expected
change add operation
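
The prompt accepts either the 7-character hash or the subject line, and the hashes reported in the results differ between runs. One way such an answer could be checked is a substring match against both values. The sketch below is hypothetical and is not taken from GitBench's grader; the function name `answer_matches` and its arguments are assumptions made for illustration.

```shell
#!/bin/bash
# Hypothetical checker, NOT the benchmark's actual grading logic.
# Succeeds when the answer contains the 7-character hash or the subject line.
answer_matches() {
  local answer=$1 hash=$2 subject=$3
  case "$answer" in
    *"$hash"*|*"$subject"*) return 0 ;;
    *) return 1 ;;
  esac
}

# A free-text answer that embeds the hash is accepted.
answer_matches 'The bad commit is `cdff16f`.' cdff16f 'change add operation' && echo match
# An answer naming a different commit is rejected.
answer_matches 'add feature X' cdff16f 'change add operation' || echo 'no match'
```

Substring matching is a lenient choice: it accepts verbose answers, but it would also accept an answer that lists every commit, so a stricter grader might require an exact match instead.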
deepseek/deepseek-v4-flash:high PASS 100% 179 in → 130 out (124 reasoning)
345a9ad
deepseek/deepseek-v4-flash:high__json_schema PASS 100% 164 in → 69 out (56 reasoning)
4a66070
JSON Schema Structured Output
(raw) { "commit": "4a66070" }
deepseek/deepseek-v4-flash:none PASS 100% 166 in → 11 out (0 reasoning)
The bad commit is `cdff16f`.
deepseek/deepseek-v4-flash:none__json_schema PASS 100% 174 in → 14 out (0 reasoning)
f6dc4d9
JSON Schema Structured Output
(raw) { "commit": "f6dc4d9" }
mistralai/devstral-2512 PASS 100% 183 in → 82 out
The bad commit is `a0f84ca` (or "change add operation"). This is the first commit in the history (when read from newest to oldest) that caused the test to fail. The previous commits (`e44ac68` and `e74e10e`) passed the tests, so the bug was introduced by `a0f84ca`.
mistralai/devstral-2512__json_schema PASS 100% 177 in → 12 out
942fc98
JSON Schema Structured Output
(raw) {"commit": "942fc98"}
nvidia/nemotron-3-nano-30b-a3b:high PASS 100% 192 in → 125 out (123 reasoning)
c3781f8
nvidia/nemotron-3-nano-30b-a3b:high__json_schema PASS 100% 196 in → 622 out (552 reasoning)
b8ad21c
JSON Schema Structured Output
(raw) { "commit": "b8ad21c" }
nvidia/nemotron-3-nano-30b-a3b:none PASS 100% 194 in → 6 out (0 reasoning)
71edd23
nvidia/nemotron-3-nano-30b-a3b:none__json_schema PASS 100% 194 in → 16 out (0 reasoning)
995e438
JSON Schema Structured Output
(raw) { "commit": "995e438" }
poolside/laguna-xs.2:high PASS 100% 233 in → 295 out (195 reasoning)
Based on the commit history and test results:
- **c63a890** (Initial commit): PASS
- **f768c28** (add feature X): PASS
- **2c8fdd3** (change add operation): FAIL
The bad commit is **2c8fdd3** ("change add operation") as it is the first commit that introduced test failures.
**Answer: 2c8fdd3**
poolside/laguna-xs.2:high__json_schema PASS 100% 229 in → 227 out (216 reasoning)
ddee6bd
JSON Schema Structured Output
(raw) {"commit": "ddee6bd"}
poolside/laguna-xs.2:none PASS 100% 231 in → 97 out (0 reasoning)
Looking at the test results, I can see that the build passed for the first two commits but failed on the third commit.
- `e86ac82 Initial commit`: PASS
- `82e8655 add feature X`: PASS
- `cfc7706 change add operation`: FAIL
The commit that introduced the bug is `cfc7706` with the subject "change add operation".
cfc7706
poolside/laguna-xs.2:none__json_schema PASS 100% 231 in → 13 out (0 reasoning)
4b51354
JSON Schema Structured Output
(raw) {"commit": "4b51354"}