GitBench
Branching history with a bad commit that breaks an import; the model identifies the commit that broke it
Tests the ability to use bisect to identify the commit that introduced a build-breaking import regression.

These commands set up the repo before the model sees the prompt. They define the starting file structure, staged changes, and Git history.

  1. git init
  2. git config user.email 'test@test.com'
  3. git config user.name 'Test User'
  4. echo '#!/bin/bash' > test.sh
  5. echo 'grep -q "from utils import" main.py' >> test.sh
  6. chmod +x test.sh
  7. echo 'from utils import helpers' > main.py
  8. git add test.sh main.py && git commit -m 'Initial commit'
  9. echo '# helper import configured' >> main.py
  10. git add main.py && git commit -m 'add import'
  11. echo 'import os' > main.py
  12. git add main.py && git commit -m 'replace helper import'
  13. git checkout -b feature
  14. echo '# add logging hook' >> main.py
  15. git add main.py && git commit -m 'add logging hook'
  16. git checkout main
  17. git merge feature
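For reference, the regression in this history can be located mechanically with `git bisect run`. The following is a minimal sketch, assuming it is run from the root of the repository built by the setup commands and that `./test.sh` exits 0 on good commits and non-zero on bad ones. The function name `find_first_bad` is hypothetical and is not part of the benchmark.

```shell
# Sketch: print "<short-hash> <subject>" of the first commit where ./test.sh fails.
# find_first_bad is an illustrative helper name, not part of GitBench.
find_first_bad() {
    local good bad result
    # Assumption: the root commit is known good and HEAD is known bad.
    good=$(git rev-list --max-parents=0 HEAD)
    bad=$(git rev-parse HEAD)

    git bisect start "$bad" "$good" >/dev/null
    # bisect run checks out each candidate and classifies it by the script's exit code.
    git bisect run ./test.sh >/dev/null
    # Once bisect has converged, refs/bisect/bad points at the first bad commit.
    result=$(git log -1 --format='%h %s' refs/bisect/bad)
    # Return the working tree to the branch that was checked out before bisecting.
    git bisect reset >/dev/null 2>&1
    printf '%s\n' "$result"
}
```

Calling `find_first_bad` in this repository should print the abbreviated hash followed by the subject of the commit that overwrote `main.py` with `import os`, since that is the first commit where the `grep` in `test.sh` no longer matches.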
Prompt
You are debugging a git repository. The following commit history has one bad commit that broke the build. Identify the bad commit hash (first 7 characters) or commit subject line that introduced the bug. Use the git log and test output to determine which commit is bad. Your answer should be the commit hash (e.g., 'abc1234') or the commit subject line.
Expected
replace helper import
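Because commit hashes differ from run to run, an answer given as a hash has to be checked against the repository of that particular run. The sketch below shows one way such a check could work; it is an illustration only, not the benchmark's actual scoring code. The function name `is_correct` and the matching rule (the answer contains either the expected subject line or the first 7 characters of that commit's hash) are assumptions.

```shell
# Hypothetical check: is_correct ANSWER EXPECTED_SUBJECT
# Succeeds (exit 0) if ANSWER contains the expected subject line or the
# 7-character hash prefix of the commit with that subject.
is_correct() {
    local answer=$1 expected_subject=$2 expected_hash
    # Look up the commit whose subject matches exactly; keep 7 hash characters.
    expected_hash=$(git log --all --format='%H %s' \
        | awk -v s="$expected_subject" \
            'substr($0, length($1) + 2) == s { print substr($1, 1, 7) }' \
        | head -n 1)

    case "$answer" in
        *"$expected_subject"*) return 0 ;;
    esac
    if [ -n "$expected_hash" ]; then
        case "$answer" in
            *"$expected_hash"*) return 0 ;;
        esac
    fi
    return 1
}
```

Run inside the task repository, `is_correct "$model_answer" 'replace helper import'` would accept a bare hash, a bare subject line, or a prose answer that mentions either one.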
deepseek/deepseek-v4-flash:high PASS 100% 189 in → 146 out (140 reasoning)
8c517d6
deepseek/deepseek-v4-flash:high__json_schema PASS 100% 187 in → 176 out (165 reasoning)
52b4aac
JSON Schema Structured Output
(raw) { "commit": "52b4aac" }
deepseek/deepseek-v4-flash:none PASS 100% 189 in → 45 out (0 reasoning)
The bad commit that introduced the bug is **7f98d78** ("replace helper import"), as this is the first commit where the tests started failing (exit code 1) and all subsequent commits also failed.
deepseek/deepseek-v4-flash:none__json_schema PASS 100% 185 in → 12 out (0 reasoning)
3223ac8
JSON Schema Structured Output
(raw) { "commit": "3223ac8" }
mistralai/devstral-2512 PASS 100% 209 in → 68 out
The bad commit is `09b126b` (or its subject line "replace helper import"), as it is the first commit in the history where the tests start failing (exit code 1). All commits before it passed the tests, and all commits after it also failed, indicating that this commit introduced the bug.
mistralai/devstral-2512__json_schema PASS 100% 213 in → 13 out
3b6b4b0
JSON Schema Structured Output
(raw) {"commit": "3b6b4b0"}
nvidia/nemotron-3-nano-30b-a3b:high PASS 100% 228 in → 230 out (211 reasoning)
f7a8748 replace helper import
nvidia/nemotron-3-nano-30b-a3b:high__json_schema PASS 100% 226 in → 245 out (218 reasoning)
b3f6043 replace helper import
JSON Schema Structured Output
(raw) {"commit":"b3f6043 replace helper import"}
nvidia/nemotron-3-nano-30b-a3b:none PASS 100% 226 in → 10 out (0 reasoning)
38683ab replace helper import
nvidia/nemotron-3-nano-30b-a3b:none__json_schema PASS 100% 220 in → 20 out (0 reasoning)
b0e3c98
JSON Schema Structured Output
(raw) { "commit": "b0e3c98" }
poolside/laguna-xs.2:high PASS 100% 261 in → 499 out (371 reasoning)
Looking at the test results, I can see when the failure was introduced:
- **e98f645** (Initial commit): PASS
- **708ac1b** (add import): PASS
- **0dd4e3b** (replace helper import): FAIL ← First failure
- **1477338** (add logging hook): FAIL
The bug was introduced in commit **0dd4e3b** with the subject line **"replace helper import"**. This is the first commit where the tests failed, indicating it's the problematic commit.
poolside/laguna-xs.2:high__json_schema PASS 100% 259 in → 206 out (194 reasoning)
5faec08
JSON Schema Structured Output
(raw) {"commit": "5faec08"}
poolside/laguna-xs.2:none PASS 100% 261 in → 128 out (0 reasoning)
Looking at the test results, I can see that:
- `f76b34b Initial commit`: PASS
- `89cca12 add import`: PASS
- `0f26780 replace helper import`: FAIL
- `a650f3b add logging hook`: FAIL
The build was passing with `89cca12 add import` but failed with the next commit `0f26780 replace helper import`. This commit introduced the bug. The bad commit is: **0f26780** (replace helper import)
poolside/laguna-xs.2:none__json_schema PASS 100% 261 in → 12 out (0 reasoning)
78fd31c
JSON Schema Structured Output
(raw) {"commit": "78fd31c"}