GitBench
Bad commit changes database.py and api.py - model identifies the multi-file bad commit
Tests whether the model can identify a bad commit that touches two files (database.py and api.py) by bisecting the history. Targets multi-file backend regressions.

These commands set up the repo before the model sees the prompt. They define the starting file structure, staged changes, and Git history.

  1. git init
  2. git config user.email 'test@test.com'
  3. git config user.name 'Test User'
  4. echo '#!/bin/bash' > test.sh
  5. echo 'grep -q "localhost" src/database.py && grep -q "localhost" src/api.py' >> test.sh
  6. chmod +x test.sh
  7. mkdir -p src && echo 'DB_HOST = "localhost"' > src/database.py && echo 'API_BASE = "http://localhost:8000"' > src/api.py
  8. git add test.sh src/database.py src/api.py && git commit -m 'Initial commit'
  9. echo 'DB_POOL = 2' >> src/database.py
  10. git add src/database.py src/api.py && git commit -m 'configure localhost'
  11. echo 'DB_HOST = "production.db.com"' > src/database.py && echo 'API_BASE = "http://production.api.com"' > src/api.py
  12. git add src/database.py src/api.py && git commit -m 'switch to prod'
  13. echo 'DB_POOL = 5' >> src/database.py
  14. git add src/database.py && git commit -m 'add pool config'
  15. echo 'DB_PORT = 5432' >> src/database.py
  16. git add src/database.py && git commit -m 'add port config'
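The setup commands produce five commits, and test.sh passes only while both files still contain "localhost". As a minimal sketch of what a bisect over this history does, the Python below models each commit's file contents in memory and binary-searches for the first failing commit. It does not invoke git. The names COMMITS, passes, and first_bad are hypothetical, and the search assumes the oldest commit is good and the newest is bad.

```python
# In-memory model of the five commits created by the setup commands.
# Each entry: (subject, contents of src/database.py, contents of src/api.py).
DB_LOCAL = 'DB_HOST = "localhost"\n'
API_LOCAL = 'API_BASE = "http://localhost:8000"\n'
DB_PROD = 'DB_HOST = "production.db.com"\n'
API_PROD = 'API_BASE = "http://production.api.com"\n'

COMMITS = [
    ("Initial commit", DB_LOCAL, API_LOCAL),
    ("configure localhost", DB_LOCAL + "DB_POOL = 2\n", API_LOCAL),
    ("switch to prod", DB_PROD, API_PROD),
    ("add pool config", DB_PROD + "DB_POOL = 5\n", API_PROD),
    ("add port config", DB_PROD + "DB_POOL = 5\nDB_PORT = 5432\n", API_PROD),
]


def passes(commit):
    """Mirror test.sh: both files must contain the string 'localhost'."""
    _, database_py, api_py = commit
    return "localhost" in database_py and "localhost" in api_py


def first_bad(commits):
    """Binary-search for the first failing commit (oldest first in the list)."""
    good, bad = 0, len(commits) - 1
    if not passes(commits[good]) or passes(commits[bad]):
        raise ValueError("expected the oldest commit to pass and the newest to fail")
    while bad - good > 1:
        mid = (good + bad) // 2
        if passes(commits[mid]):
            good = mid  # regression was introduced after mid
        else:
            bad = mid   # regression was introduced at or before mid
    return commits[bad][0]


print(first_bad(COMMITS))  # prints "switch to prod"
```

On the real repository, `git bisect run ./test.sh` performs the same search after the good and bad endpoints have been marked.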
Prompt
You are debugging a git repository. The following commit history has one bad commit that broke the build. Identify the bad commit hash (first 7 characters) or commit subject line that introduced the bug. Use the git log and test output to determine which commit is bad. Your answer should be the commit hash (e.g., 'abc1234') or the commit subject line.
Expected
switch to prod
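The prompt accepts either the 7-character hash or the subject line, and commit hashes differ from run to run. The benchmark's actual scoring code is not shown on this page, so the following is only a guess at how such an answer could be checked. The function is_correct and its parameters are hypothetical.

```python
def is_correct(answer: str, bad_hash: str, bad_subject: str) -> bool:
    """Accept an answer that mentions the bad commit's short hash or its subject.

    Assumption: a case-insensitive substring match is enough. The real grader
    may be stricter (for example, exact match on a structured field).
    """
    text = answer.strip().lower()
    return bad_hash[:7].lower() in text or bad_subject.lower() in text


# Usage: the hash is whatever the bad commit received in that particular run.
print(is_correct("756f72c switch to prod", "756f72c", "switch to prod"))
print(is_correct("abe4c11", "99283c6", "switch to prod"))
```

A substring match like this would also accept a long prose answer that names the bad commit alongside other commits, so a stricter harness might require the hash or subject to be the whole answer.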
deepseek/deepseek-v4-flash:high PASS 100% 218 in → 165 out (158 reasoning)
fbf0190
deepseek/deepseek-v4-flash:none PASS 100% 208 in → 57 out (0 reasoning)
Based on the test results, the first commit that failed is `06186cf switch to prod`. The commits before it (older) all passed, and this commit and all commits after it (newer) failed. The bad commit hash is: **06186cf**
deepseek/deepseek-v4-flash:none__json_schema PASS 100% 214 in → 12 out (0 reasoning)
79a319f
JSON Schema Structured Output
(raw) { "commit": "79a319f" }
mistralai/devstral-2512 PASS 100% 239 in → 95 out
Based on the test results, the build was working fine until commit `86f8b1f switch to prod`, which introduced the failure. All subsequent commits (`45baf28` and `2a3756f`) also failed, indicating that the bug was introduced in `86f8b1f`. The bad commit is: `86f8b1f` (or the subject line "switch to prod").
mistralai/devstral-2512__json_schema PASS 100% 237 in → 12 out
8ef6d70
JSON Schema Structured Output
(raw) {"commit": "8ef6d70"}
nvidia/nemotron-3-nano-30b-a3b:high PASS 100% 254 in → 444 out (401 reasoning)
09a1930
nvidia/nemotron-3-nano-30b-a3b:high__json_schema PASS 100% 250 in → 186 out (194 reasoning)
3d4f5bd
JSON Schema Structured Output
(raw) { "commit": "3d4f5bd" }
nvidia/nemotron-3-nano-30b-a3b:none PASS 100% 250 in → 62 out (0 reasoning)
The first failure occurs at commit **4c854f8 switch to prod**. **Commit hash:** `4c854f8` (first 7 characters) **Commit subject line:** `switch to prod` This commit introduced the bug that broke the build.
nvidia/nemotron-3-nano-30b-a3b:none__json_schema PASS 100% 250 in → 15 out (0 reasoning)
33bf4a8
JSON Schema Structured Output
(raw) { "commit" : "33bf4a8" }
poolside/laguna-xs.2:high PASS 100% 287 in → 356 out (248 reasoning)
Looking at the test results chronologically: - **7a3791c Initial commit**: PASS - **abe4c11 configure localhost**: PASS - **99283c6 switch to prod**: FAIL The tests pass up through commit `abe4c11`, but fail starting with commit `99283c6 "switch to prod"`. This is the commit that introduced the bug. **Bad commit: 99283c6**
poolside/laguna-xs.2:high__json_schema PASS 100% 283 in → 345 out (328 reasoning)
756f72c switch to prod
JSON Schema Structured Output
(raw) {"commit": "756f72c switch to prod"}
poolside/laguna-xs.2:none PASS 100% 279 in → 96 out (0 reasoning)
Looking at the test results, I can see that: - Commits up to and including `bb22c42` (configure localhost) pass - Commits `d647beb` through `14e0ff0` all fail This means the bug was introduced in the first failing commit, which is `d647beb` with the subject "switch to prod". The answer is: **d647beb**
poolside/laguna-xs.2:none__json_schema PASS 100% 287 in → 11 out (0 reasoning)
3175aac
JSON Schema Structured Output
(raw) {"commit": "3175aac"}
Invalid JSON. Output: 689b37e
JSON Schema Structured Output
Structured Output Error
Failed to parse structured JSON response: Extra data: line 1 column 4 (char 3)
Failure: Failed to parse structured JSON response: Extra data: line 1 column 4 (char 3)
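The error text above is what Python's json module reports when a bare hash such as 689b37e is parsed as JSON: the leading digits 689 parse as a number and the remaining b37e is rejected as extra data. The sketch below reproduces that, assuming the harness parses responses with json.loads and expects an object with a "commit" key, as in the raw outputs shown for the other runs. The function parse_commit is hypothetical.

```python
import json


def parse_commit(raw: str) -> str:
    """Extract the "commit" field from a structured response.

    Raises ValueError when the response is not valid JSON or lacks the field.
    """
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"Failed to parse structured JSON response: {exc}") from exc
    if not isinstance(payload, dict) or "commit" not in payload:
        raise ValueError("Structured response has no 'commit' field")
    return str(payload["commit"])


print(parse_commit('{"commit": "3175aac"}'))  # prints "3175aac"

try:
    parse_commit("689b37e")
except ValueError as err:
    # prints "Failed to parse structured JSON response: Extra data: line 1 column 4 (char 3)"
    print(err)
```

A hash that starts with a letter would fail differently (json reports "Expecting value" at char 0), so the exact message depends on the leading characters of the output.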