GitBench
Linear history with bad commit second from beginning - model identifies second commit
Tests whether the model can identify the second commit as the bad one. Exercises bisection when the regression sits near the start of a linear history.

These commands set up the repo before the model sees the prompt. They define the starting file structure, staged changes, and Git history.

  1. git init
  2. git config user.email 'test@test.com'
  3. git config user.name 'Test User'
  4. echo '#!/bin/bash' > test.sh
  5. echo '! grep -q "import.*pytest" tests/test_*.py 2>/dev/null' >> test.sh
  6. chmod +x test.sh
  7. git add test.sh
  8. git commit -m 'Initial commit'
  9. mkdir -p tests && echo 'import pytest' > tests/test_core.py
  10. git add tests/test_core.py
  11. git commit -m 'setup pytest'
  12. echo 'def test_one(): pass' >> tests/test_core.py
  13. git add tests/test_core.py
  14. git commit -m 'add test_one'
  15. echo 'def test_two(): pass' >> tests/test_core.py
  16. git add tests/test_core.py
  17. git commit -m 'add test_two'
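The setup commands above can be replayed locally to see why 'setup pytest' is the expected answer. The following is a minimal sketch, not part of GitBench itself: it rebuilds the fixture in a scratch directory and lets `git bisect run` drive `test.sh`. The function names `build_fixture`, `find_bad_subject`, and `demo_bisect` are hypothetical helpers introduced here for illustration, and the sketch assumes `git` and `mktemp` are available.

```shell
#!/bin/bash
# Sketch only: the helper names below are illustrative, not GitBench code.

build_fixture() {
  # Replays the setup commands in the current directory.
  git init -q
  git config user.email 'test@test.com'
  git config user.name 'Test User'
  echo '#!/bin/bash' > test.sh
  # test.sh exits 0 (PASS) until a tests/test_*.py file imports pytest.
  echo '! grep -q "import.*pytest" tests/test_*.py 2>/dev/null' >> test.sh
  chmod +x test.sh
  git add test.sh
  git commit -q -m 'Initial commit'
  mkdir -p tests && echo 'import pytest' > tests/test_core.py
  git add tests/test_core.py
  git commit -q -m 'setup pytest'
  echo 'def test_one(): pass' >> tests/test_core.py
  git add tests/test_core.py
  git commit -q -m 'add test_one'
  echo 'def test_two(): pass' >> tests/test_core.py
  git add tests/test_core.py
  git commit -q -m 'add test_two'
}

find_bad_subject() {
  # Treat HEAD as bad and the root commit as good, then bisect with test.sh.
  local root
  root=$(git rev-list --max-parents=0 HEAD)
  git bisect start HEAD "$root" >/dev/null
  git bisect run ./test.sh >/dev/null
  # After the run, refs/bisect/bad points at the first bad commit.
  git log -1 --format=%s refs/bisect/bad
  git bisect reset >/dev/null 2>&1
}

demo_bisect() {
  # Work in a throwaway directory; the subshell keeps the cd local.
  local repo subject
  repo=$(mktemp -d)
  subject=$(cd "$repo" && build_fixture && find_bad_subject)
  rm -rf "$repo"
  printf '%s\n' "$subject"
}

demo_bisect
```

Commit hashes differ on every rebuild because they depend on timestamps, which is why the sketch reports the subject line rather than a hash.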
Prompt
You are debugging a git repository. The following commit history has one bad commit that broke the build. Identify the bad commit hash (first 7 characters) or commit subject line that introduced the bug. Use the git log and test output to determine which commit is bad. Your answer should be the commit hash (e.g., 'abc1234') or the commit subject line.
Expected
setup pytest
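The expected answer follows from finding the first commit whose test result flips from PASS to FAIL. As a rough sketch, assuming the per-commit results are available as lines of the form `<hash> <subject>: PASS|FAIL`, oldest first (the exact format shown to the model is not reproduced on this page, so this layout is an assumption), a linear scan is enough. The function name `first_failing` is hypothetical, and the sample hashes are taken from one of the recorded runs below.

```shell
#!/bin/bash
# Sketch only: assumes "<hash> <subject>: PASS|FAIL" lines, oldest commit first.

first_failing() {
  # Prints "<hash> <subject>" for the first FAIL line read from stdin.
  local line
  while IFS= read -r line; do
    case "$line" in
      *": FAIL")
        printf '%s\n' "${line%: FAIL}"
        return 0
        ;;
    esac
  done
  return 1  # no failing commit found
}

first_failing <<'EOF'
63b0a3a Initial commit: PASS
07f13f3 setup pytest: FAIL
b56786c add test_one: FAIL
69a09fc add test_two: FAIL
EOF
```

The scan relies on the history having a single PASS-to-FAIL transition, as it does in this test case.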
deepseek/deepseek-v4-flash:high PASS 100% 181 in → 114 out (110 reasoning)
2531b23
deepseek/deepseek-v4-flash:high__json_schema PASS 100% 187 in → 167 out (165 reasoning)
1738996
JSON Schema Structured Output
(raw) { "commit": "1738996" }
deepseek/deepseek-v4-flash:none PASS 100% 183 in → 44 out (0 reasoning)
The first commit that failed was **6fabb52** (setup pytest), as it changed from PASS to FAIL. All subsequent commits also failed, so the bug was introduced here. **6fabb52**
deepseek/deepseek-v4-flash:none__json_schema PASS 100% 191 in → 16 out (0 reasoning)
d9c6143
JSON Schema Structured Output
(raw) { "commit": "d9c6143" }
mistralai/devstral-2512 PASS 100% 207 in → 88 out
Based on the test results, the build was passing at commit `7cd9361` (Initial commit) but started failing at commit `7fd6a60` (setup pytest). Since all subsequent commits (`8b1af68` and `ee80a29`) also show failures, the bad commit that introduced the bug is: `7fd6a60` (setup pytest)
mistralai/devstral-2512__json_schema PASS 100% 213 in → 13 out
d9f5c9b
JSON Schema Structured Output
(raw) {"commit": "d9f5c9b"}
nvidia/nemotron-3-nano-30b-a3b:high PASS 100% 224 in → 596 out (656 reasoning)
78ccf67
nvidia/nemotron-3-nano-30b-a3b:high__json_schema PASS 100% 222 in → 274 out (237 reasoning)
1c27645
JSON Schema Structured Output
(raw) { "commit": "1c27645" }
nvidia/nemotron-3-nano-30b-a3b:none__json_schema PASS 100% 226 in → 15 out (0 reasoning)
f74e2ed
JSON Schema Structured Output
(raw) { "commit": "f74e2ed" }
poolside/laguna-xs.2:high PASS 100% 257 in → 382 out (291 reasoning)
Based on the test results, I can see that: 1. **9644257 Initial commit**: PASS 2. **1f3fc29 setup pytest**: FAIL The build was passing after the initial commit, but failed after the next commit. This means commit `1f3fc29` ("setup pytest") introduced the bug. **Answer: 1f3fc29**
poolside/laguna-xs.2:high__json_schema PASS 100% 259 in → 299 out (287 reasoning)
d88acd5
JSON Schema Structured Output
(raw) {"commit": "d88acd5"}
poolside/laguna-xs.2:none PASS 100% 263 in → 127 out (0 reasoning)
Looking at the test results, I can see the pattern: - `63b0a3a Initial commit`: PASS - `07f13f3 setup pytest`: FAIL - `b56786c add test_one`: FAIL - `69a09fc add test_two`: FAIL The first commit where the tests started failing is `07f13f3 setup pytest`. Since the initial commit passed and this commit introduced pytest setup and caused the failure, this is the bad commit. **Answer: 07f13f3**
poolside/laguna-xs.2:none__json_schema PASS 100% 263 in → 13 out (0 reasoning)
b879069
JSON Schema Structured Output
(raw) {"commit": "b879069"}
nvidia/nemotron-3-nano-30b-a3b:none FAIL 0% 222 in → 170 out (0 reasoning)
abcdef1 Explanation: - The test results listed show that the first commit (Initial commit) passes, while each subsequent commit introduces a failure. - The first commit that fails is **163637a add test_one**, which shows a test failure (exit code 1). - Because this is the earliest commit in the history that breaks the build, it is the “bad commit” that introduced the bug. Thus, the bad commit can be identified either by its full hash **163637a** (first 7 characters “abcdef1” is a common placeholder representation) or by its subject line **“add test_one.”** Answer: **add test_one** (or commit hash **163637a**)
Failure: Expected target hash/prefix or subject 'setup pytest'
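The failure message suggests an answer is accepted when it contains either the target commit's hash prefix or its subject line. The actual GitBench scoring rule is not shown on this page, so the following is only a guess at its shape: a case-sensitive substring check. The function name `is_correct` and its argument order are hypothetical.

```shell
#!/bin/bash
# Sketch only: an assumed substring-based check, not the real GitBench grader.

is_correct() {
  # $1 = model answer text, $2 = expected 7-character hash, $3 = expected subject
  case "$1" in
    *"$2"*|*"$3"*) return 0 ;;
    *) return 1 ;;
  esac
}

# Hash 07f13f3 is the 'setup pytest' commit from one of the recorded runs.
if is_correct '**Answer: 07f13f3**' 07f13f3 'setup pytest'; then echo PASS; else echo FAIL; fi
if is_correct 'Answer: add test_one' 07f13f3 'setup pytest'; then echo PASS; else echo FAIL; fi
```

Under this assumed rule the first answer passes on the hash and the second fails, matching the pattern of the failing run above, which named 'add test_one' and never mentioned 'setup pytest'.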