GitBench
Linear history with the bad commit third from the beginning - model identifies the third commit
Tests whether the model can identify the third commit as the bad one. Evaluates bisecting when the bug is introduced early in the history but not in the very first commit.

These commands set up the repo before the model sees the prompt. They define the starting file structure, staged changes, and Git history.

  1. git init
  2. git config user.email 'test@test.com'
  3. git config user.name 'Test User'
  4. echo '#!/bin/bash' > test.sh
  5. echo '! grep -q "port: 8080" config/*.yaml 2>/dev/null' >> test.sh
  6. chmod +x test.sh
  7. git add test.sh
  8. git commit -m 'Initial commit'
  9. mkdir -p config && echo 'server: port: 3000' > config/default.yaml
  10. git add config/default.yaml
  11. git commit -m 'add default config'
  12. echo 'server: port: 8080' > config/default.yaml
  13. git add config/default.yaml
  14. git commit -m 'change port'
  15. echo 'server: port: 8080 feature: true' > config/default.yaml
  16. git add config/default.yaml
  17. git commit -m 'add feature flag'
  18. mkdir -p src && echo 'const PORT = 3000;' > src/index.js
  19. git add src/index.js
  20. git commit -m 'add port constant'
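The setup commands can be replayed locally to confirm which commit a bisect lands on. The following is a minimal sketch, not the benchmark's own harness: it assumes the fixture is rebuilt in a scratch directory from `mktemp -d`, and the variable names `repo`, `root`, and `bad_subject` are illustrative. It marks the root commit as good and the tip as bad, then lets `git bisect run` drive `test.sh`.

```shell
#!/bin/bash
# Sketch: rebuild the fixture, then bisect with test.sh as the oracle.
set -eu

repo=$(mktemp -d)   # assumption: any empty scratch directory will do
cd "$repo"

git init -q
git config user.email 'test@test.com'
git config user.name 'Test User'

# test.sh exits 0 (pass) unless a config/*.yaml file contains "port: 8080".
echo '#!/bin/bash' > test.sh
echo '! grep -q "port: 8080" config/*.yaml 2>/dev/null' >> test.sh
chmod +x test.sh
git add test.sh
git commit -q -m 'Initial commit'

mkdir -p config && echo 'server: port: 3000' > config/default.yaml
git add config/default.yaml
git commit -q -m 'add default config'

echo 'server: port: 8080' > config/default.yaml
git add config/default.yaml
git commit -q -m 'change port'

echo 'server: port: 8080 feature: true' > config/default.yaml
git add config/default.yaml
git commit -q -m 'add feature flag'

mkdir -p src && echo 'const PORT = 3000;' > src/index.js
git add src/index.js
git commit -q -m 'add port constant'

# Bisect between the root commit (known good) and HEAD (known bad).
root=$(git rev-list --max-parents=0 HEAD)
git bisect start HEAD "$root" >/dev/null
git bisect run ./test.sh >/dev/null

# After a finished bisect, refs/bisect/bad points at the first bad commit.
bad_subject=$(git log -1 --format=%s refs/bisect/bad)
git bisect reset >/dev/null 2>&1

echo "first bad commit: $bad_subject"
```

Because `test.sh` is committed in the first commit, it is present in the working tree at every step of the bisect. Under these assumptions the script should report `change port`; the abbreviated hash differs on every rebuild because commit timestamps differ.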
Prompt
You are debugging a git repository. The following commit history has one bad commit that broke the build. Identify the bad commit hash (first 7 characters) or commit subject line that introduced the bug. Use the git log and test output to determine which commit is bad. Your answer should be the commit hash (e.g., 'abc1234') or the commit subject line.
Expected
change port
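The prompt refers to a git log and test output that are not reproduced on this page. As a hedged illustration of how such a listing could be produced, and of why the expected answer is `change port`, the sketch below runs `test.sh` at every commit from oldest to newest. The helper `commit_file` and the `report` format are hypothetical and are not taken from the benchmark.

```shell
#!/bin/bash
# Sketch: rebuild the fixture compactly, then run test.sh at each commit.
set -eu

repo=$(mktemp -d)   # assumption: any empty scratch directory will do
cd "$repo"
git init -q
git config user.email 'test@test.com'
git config user.name 'Test User'

# Hypothetical helper: commit_file <path> <content> <message>
commit_file() {
  mkdir -p "$(dirname "$1")"
  echo "$2" > "$1"
  git add "$1"
  git commit -q -m "$3"
}

printf '%s\n' '#!/bin/bash' '! grep -q "port: 8080" config/*.yaml 2>/dev/null' > test.sh
chmod +x test.sh
git add test.sh
git commit -q -m 'Initial commit'
commit_file config/default.yaml 'server: port: 3000' 'add default config'
commit_file config/default.yaml 'server: port: 8080' 'change port'
commit_file config/default.yaml 'server: port: 8080 feature: true' 'add feature flag'
commit_file src/index.js 'const PORT = 3000;' 'add port constant'

# Walk the linear history oldest-first and record PASS/FAIL per commit.
tip=$(git rev-parse HEAD)
report=""
for commit in $(git rev-list --reverse HEAD); do
  git checkout -q "$commit"
  if ./test.sh; then status=PASS; else status=FAIL; fi
  report+="$(git log -1 --format='%h %s' "$commit") $status"$'\n'
done
git checkout -q "$tip"

printf '%s' "$report"
```

The first two commits pass and every commit from `change port` onward fails, since the later commits keep `port: 8080` in `config/default.yaml`. The first failing line is therefore the commit that introduced the bug.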
deepseek/deepseek-v4-flash:high PASS 100% 210 in → 124 out (119 reasoning)
053cfe6
deepseek/deepseek-v4-flash:high__json_schema PASS 100% 208 in → 84 out (80 reasoning)
603c668
JSON Schema Structured Output
(raw) { "commit": "603c668" }
deepseek/deepseek-v4-flash:none PASS 100% 208 in → 50 out (0 reasoning)
Based on the test results, the first commit that fails is `707ef8e` (change port). All commits after that also fail, meaning this is the commit that introduced the bug. The bad commit is: **707ef8e**
deepseek/deepseek-v4-flash:none__json_schema PASS 100% 208 in → 13 out (0 reasoning)
24cc421
JSON Schema Structured Output
(raw) { "commit": "24cc421" }
mistralai/devstral-2512 PASS 100% 235 in → 67 out
Based on the test results, the build first failed at commit `0c6283d change port` (exit code 1), and all subsequent commits also failed. This indicates that the bad commit that introduced the bug is: `0c6283d` (or the subject line "change port")
mistralai/devstral-2512__json_schema PASS 100% 235 in → 12 out
04d41df
JSON Schema Structured Output
(raw) {"commit": "04d41df"}
nvidia/nemotron-3-nano-30b-a3b:high PASS 100% 250 in → 144 out (144 reasoning)
13573f2
nvidia/nemotron-3-nano-30b-a3b:high__json_schema PASS 100% 252 in → 299 out (269 reasoning)
c484d29
JSON Schema Structured Output
(raw) { "commit": "c484d29" }
nvidia/nemotron-3-nano-30b-a3b:none PASS 100% 252 in → 7 out (0 reasoning)
059c4cb
nvidia/nemotron-3-nano-30b-a3b:none__json_schema PASS 100% 250 in → 16 out (0 reasoning)
2d3e745
JSON Schema Structured Output
(raw) { "commit": "2d3e745" }
poolside/laguna-xs.2:high PASS 100% 289 in → 357 out (246 reasoning)
Looking at the test results, I can see a clear pattern: - Commits de91a56 and f452634: **PASS** - Commit 4575773 ("change port"): **FAIL** The build was passing before commit 4575773 and started failing at that commit. This is the commit that introduced the bug. **Bad commit: 4575773** (or "change port" as the subject line)
poolside/laguna-xs.2:high__json_schema PASS 100% 287 in → 242 out (229 reasoning)
47cc580
JSON Schema Structured Output
(raw) {"commit": "47cc580"}
poolside/laguna-xs.2:none PASS 100% 279 in → 151 out (0 reasoning)
Looking at the test results, I can see that the build passed for commits `c40a955` and `ac727da`, but started failing at `b1fddb8` (change port). This means the bug was introduced in commit `b1fddb8`. Let me verify this by checking the commit details: The commit `b1fddb8` with the subject "change port" is the first commit where the tests failed. The subsequent commits (`8eb078c` and `cfc4de9`) also show failures, but those were likely caused by the initial bug introduced in `b1fddb8`. **Answer: b1fddb8**
poolside/laguna-xs.2:none__json_schema PASS 100% 285 in → 11 out (0 reasoning)
58edee4
JSON Schema Structured Output
(raw) {"commit": "58edee4"}