GitBench
Bad commit changes multiple files (app.py and config.py); the model must identify the multi-file bad commit.
Tests the ability to identify, via bisect, a bad commit whose changes span multiple files (app.py and config.py).

These commands set up the repo before the model sees the prompt. They define the starting file structure, staged changes, and Git history.

  1. git init
  2. git config user.email 'test@test.com'
  3. git config user.name 'Test User'
  4. echo '#!/bin/bash' > test.sh
  5. echo 'grep -q "DEBUG = False" app.py && grep -q "DEBUG = False" config.py' >> test.sh
  6. chmod +x test.sh
  7. echo 'DEBUG = False' > app.py && echo 'DEBUG = False' > config.py
  8. git add test.sh app.py config.py && git commit -m 'Initial commit'
  9. echo '# debug defaults' >> config.py
  10. git add app.py config.py && git commit -m 'set debug off'
  11. echo 'DEBUG = True' > app.py && echo 'DEBUG = True' > config.py
  12. git add app.py config.py && git commit -m 'enable debug mode'
  13. echo '# verbose startup logs' >> app.py
  14. git add app.py && git commit -m 'add startup logging'
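For reference, the history built by the setup commands can be bisected mechanically. The following is a minimal sketch, not part of the benchmark itself: it assumes a throwaway directory from mktemp, rebuilds the same four commits, and lets git bisect run drive test.sh. The variable names (repo, good, bad_subject) are illustrative only.

```shell
#!/bin/bash
set -euo pipefail

# Work in a throwaway directory (an assumption; the benchmark's own sandbox is not shown).
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email 'test@test.com'
git config user.name 'Test User'

# Recreate the four commits from the setup list above.
echo '#!/bin/bash' > test.sh
echo 'grep -q "DEBUG = False" app.py && grep -q "DEBUG = False" config.py' >> test.sh
chmod +x test.sh
echo 'DEBUG = False' > app.py && echo 'DEBUG = False' > config.py
git add test.sh app.py config.py && git commit -q -m 'Initial commit'
echo '# debug defaults' >> config.py
git add app.py config.py && git commit -q -m 'set debug off'
echo 'DEBUG = True' > app.py && echo 'DEBUG = True' > config.py
git add app.py config.py && git commit -q -m 'enable debug mode'
echo '# verbose startup logs' >> app.py
git add app.py && git commit -q -m 'add startup logging'

# Mark HEAD as bad and the root commit as good, then let test.sh judge each step.
good=$(git rev-list --max-parents=0 HEAD)
git bisect start HEAD "$good" >/dev/null
git bisect run ./test.sh >/dev/null

# After a finished bisect, refs/bisect/bad points at the first bad commit.
bad_subject=$(git log -1 --format=%s refs/bisect/bad)
git bisect reset >/dev/null 2>&1
echo "$bad_subject"
```

Because test.sh is committed in the initial commit, it is present at every revision bisect checks out, and a non-zero exit from either grep marks that revision as bad.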
Prompt
You are debugging a git repository. The following commit history has one bad commit that broke the build. Identify the bad commit hash (first 7 characters) or commit subject line that introduced the bug. Use the git log and test output to determine which commit is bad. Your answer should be the commit hash (e.g., 'abc1234') or the commit subject line.
Expected
enable debug mode
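The prompt accepts either a 7-character hash or the subject line, and the hash differs from run to run because each run rebuilds the repository. One way such answers could be checked is sketched below. This is a hypothetical checker, not GitBench's actual scoring code: the function name is_correct and its arguments are assumptions made for illustration.

```shell
#!/bin/bash
# Hypothetical checker (an assumption, not the benchmark's real grader).
# Succeeds when the answer contains the 7-character hash prefix of the
# bad commit or its subject line.
is_correct() {
  local answer=$1 bad_hash=$2 bad_subject=$3
  local short=${bad_hash:0:7}
  case "$answer" in
    *"$short"*|*"$bad_subject"*) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage with values taken from one of the runs listed on this page.
if is_correct 'e8ea84d enable debug mode' 'e8ea84d' 'enable debug mode'; then
  echo PASS
else
  echo FAIL
fi
```

Substring matching is a deliberate simplification here: it tolerates answers wrapped in prose, Markdown emphasis, or a JSON object, at the cost of accepting an answer that names several commits at once.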
deepseek/deepseek-v4-flash:high PASS 100% 193 in → 101 out (83 reasoning)
da2a5d2
deepseek/deepseek-v4-flash:high__json_schema PASS 100% 191 in → 137 out (123 reasoning)
6ed3388
JSON Schema Structured Output
(raw) { "commit": "6ed3388" }
deepseek/deepseek-v4-flash:none PASS 100% 183 in → 65 out (0 reasoning)
Based on the test results, the build was passing at commit `59fa610` ("set debug off") and started failing at commit `604b522` ("enable debug mode"). This means `604b522` is the first commit where the failure occurred, making it the bad commit. **604b522**
deepseek/deepseek-v4-flash:none__json_schema PASS 100% 189 in → 11 out (0 reasoning)
2ef3383
JSON Schema Structured Output
(raw) { "commit": "2ef3383" }
mistralai/devstral-2512 PASS 100% 211 in → 57 out
Based on the test results, the build started failing at commit `7e3a3c2 enable debug mode` and continued to fail in subsequent commits. The commit that introduced the bug is: `7e3a3c2` (or "enable debug mode")
mistralai/devstral-2512__json_schema PASS 100% 213 in → 13 out
6911b33
JSON Schema Structured Output
(raw) {"commit": "6911b33"}
nvidia/nemotron-3-nano-30b-a3b:high PASS 100% 228 in → 366 out (325 reasoning)
38e6e07
nvidia/nemotron-3-nano-30b-a3b:high__json_schema PASS 100% 228 in → 224 out (194 reasoning)
4426dac
JSON Schema Structured Output
(raw) { "commit": "4426dac" }
nvidia/nemotron-3-nano-30b-a3b:none PASS 100% 218 in → 10 out (0 reasoning)
e8ea84d enable debug mode
nvidia/nemotron-3-nano-30b-a3b:none__json_schema PASS 100% 222 in → 15 out (0 reasoning)
0b2b999
JSON Schema Structured Output
(raw) { "commit": "0b2b999" }
poolside/laguna-xs.2:high PASS 100% 257 in → 452 out (346 reasoning)
Looking at the test results chronologically:
- e7bebac Initial commit: PASS
- e26b2f4 set debug off: PASS
- bc11dd5 enable debug mode: FAIL
- 631e13d add startup logging: FAIL
The tests pass up until commit e26b2f4, but fail starting at bc11dd5. This means the bug was introduced in commit bc11dd5 ("enable debug mode").
poolside/laguna-xs.2:high__json_schema PASS 100% 261 in → 235 out (221 reasoning)
23f4f0d
JSON Schema Structured Output
(raw) {"commit": "23f4f0d"}
poolside/laguna-xs.2:none PASS 100% 257 in → 144 out (0 reasoning)
Looking at the test results, I can see the pattern:
- **96fbea7** (Initial commit): PASS
- **79f408b** (set debug off): PASS
- **e85ea5f** (enable debug mode): FAIL
- **0bbe5f4** (add startup logging): FAIL
The build started failing at commit **e85ea5f** "enable debug mode". The subsequent commit **0bbe5f4** "add startup logging" also fails, but that's because it's built on top of the already-broken state. The bad commit is: **e85ea5f**
poolside/laguna-xs.2:none__json_schema PASS 100% 261 in → 13 out (0 reasoning)
a1227a2
JSON Schema Structured Output
(raw) {"commit": "a1227a2"}