GitBench
Bad commit changes index.js and styles.css - model identifies multi-file bad commit
Tests whether the model can identify a bad commit that spans multiple files (index.js and styles.css) using bisect-style reasoning over the history. Targets multi-file frontend regressions.

These commands set up the repo before the model sees the prompt. They define the starting file structure, staged changes, and Git history.

  1. git init
  2. git config user.email 'test@test.com'
  3. git config user.name 'Test User'
  4. echo '#!/bin/bash' > test.sh
  5. echo 'grep -q "color: blue" styles.css && grep -q "blue" index.js' >> test.sh
  6. chmod +x test.sh
  7. echo 'body { color: blue; }' > styles.css && echo 'document.body.style.color = "blue";' > index.js
  8. git add test.sh styles.css index.js && git commit -m 'Initial commit'
  9. echo 'body { background: white; }' >> styles.css
  10. git add styles.css index.js && git commit -m 'add blue styling'
  11. echo 'body { color: red; }' > styles.css && echo 'document.body.style.color = "red";' > index.js
  12. git add styles.css index.js && git commit -m 'change to red'
  13. echo 'document.title = "Theme";' >> index.js
  14. git add index.js && git commit -m 'add theme title'
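The setup above can be replayed locally to confirm which commit a real bisect lands on. The following is a minimal sketch, not the benchmark's own harness: the scratch directory, the -q flags, and the variable names (repo, good, bad_hash, bad_subject) are assumptions added for illustration. It marks HEAD as bad and the root commit as good, then lets git bisect run drive test.sh.

```shell
#!/bin/bash
# Sketch: rebuild the fixture in a scratch directory and let git bisect find the bad commit.
set -euo pipefail

repo=$(mktemp -d)   # scratch location; an assumption, not part of the benchmark
cd "$repo"

# Setup commands from the list above (-q only silences git's output).
git init -q
git config user.email 'test@test.com'
git config user.name 'Test User'
echo '#!/bin/bash' > test.sh
echo 'grep -q "color: blue" styles.css && grep -q "blue" index.js' >> test.sh
chmod +x test.sh
echo 'body { color: blue; }' > styles.css && echo 'document.body.style.color = "blue";' > index.js
git add test.sh styles.css index.js && git commit -q -m 'Initial commit'
echo 'body { background: white; }' >> styles.css
git add styles.css index.js && git commit -q -m 'add blue styling'
echo 'body { color: red; }' > styles.css && echo 'document.body.style.color = "red";' > index.js
git add styles.css index.js && git commit -q -m 'change to red'
echo 'document.title = "Theme";' >> index.js
git add index.js && git commit -q -m 'add theme title'

# HEAD is known bad; the root commit is known good.
good=$(git rev-list --max-parents=0 HEAD)
git bisect start HEAD "$good" > /dev/null
git bisect run ./test.sh > /dev/null

# After a successful run, refs/bisect/bad points at the first bad commit.
bad_hash=$(git rev-parse --short=7 refs/bisect/bad)
bad_subject=$(git log -1 --format=%s refs/bisect/bad)
git bisect reset > /dev/null 2>&1

echo "$bad_hash $bad_subject"
```

The printed subject should be "change to red". The hash will differ on every run because commit hashes depend on the commit timestamp, which is presumably why each model entry below reports a different hash for the same commit.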
Prompt
You are debugging a git repository. The following commit history has one bad commit that broke the build. Identify the bad commit hash (first 7 characters) or commit subject line that introduced the bug. Use the git log and test output to determine which commit is bad. Your answer should be the commit hash (e.g., 'abc1234') or the commit subject line.
Expected
change to red
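The failure messages on this page ("Expected target hash/prefix or subject 'change to red'") suggest that an answer passes when it contains either the bad commit's hash prefix or its subject line. A minimal sketch of such a check follows. The function name is_correct and the plain substring matching are assumptions for illustration, not the grader's actual implementation.

```shell
#!/bin/bash
# Hypothetical grader: succeed if the answer contains the 7-character
# prefix of the target hash or the target commit's subject line.
is_correct() {
  local answer=$1 target_hash=$2 target_subject=$3
  local prefix=${target_hash:0:7}

  # Refuse to grade against an empty target, which would match anything.
  [ -n "$prefix" ] && [ -n "$target_subject" ] || return 2

  case "$answer" in
    *"$prefix"*|*"$target_subject"*) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage: a free-text answer, a structured answer, and a wrong hash.
is_correct 'Answer: 8b38db7' '8b38db7' 'change to red' && echo PASS
is_correct '{"commit":"8b38db7"}' '8b38db7' 'change to red' && echo PASS
is_correct 'deadbee' '8b38db7' 'change to red' || echo FAIL
```

Substring matching accepts both bare answers and full sentences that mention the hash or subject, which is consistent with the passing entries on this page.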
deepseek/deepseek-v4-flash:high PASS 100% 206 in → 110 out (104 reasoning)
5275de4
deepseek/deepseek-v4-flash:none PASS 100% 195 in → 50 out (0 reasoning)
Based on the test results, the first commit that fails is `8b38db7 change to red`. Since all commits before it pass, and it introduces the failure, this is the bad commit. Answer: 8b38db7
deepseek/deepseek-v4-flash:none__json_schema PASS 100% 193 in → 13 out (0 reasoning)
c4448e0
JSON Schema Structured Output
(raw) {"commit":"c4448e0"}
mistralai/devstral-2512 PASS 100% 215 in → 58 out
Based on the test results, the build started failing at commit `e44b549 change to red` and continued to fail in subsequent commits. The commit that introduced the bug is therefore: `e44b549` (or "change to red")
mistralai/devstral-2512__json_schema PASS 100% 217 in → 13 out
f365e23
JSON Schema Structured Output
(raw) {"commit": "f365e23"}
nvidia/nemotron-3-nano-30b-a3b:high PASS 100% 222 in → 181 out (166 reasoning)
82525cb change to red
nvidia/nemotron-3-nano-30b-a3b:high__json_schema PASS 100% 222 in → 227 out (182 reasoning)
f9d0606
JSON Schema Structured Output
(raw) { "commit": "f9d0606" }
poolside/laguna-xs.2:high PASS 100% 257 in → 361 out (216 reasoning)
Based on the test results, I can see the pattern of when the build broke: - `d6caf71 Initial commit`: **PASS** - `d7966b1 add blue styling`: **PASS** - `5b6e300 change to red`: **FAIL** - `fdcbce5 add theme title`: **FAIL** The first commit that introduced the failure was `5b6e300` with the subject "change to red". This commit changed the styling from blue to red, which broke the build. **Bad commit: `5b6e300`** (or "change to red")
poolside/laguna-xs.2:high__json_schema PASS 100% 261 in → 244 out (230 reasoning)
21d8295
JSON Schema Structured Output
(raw) {"commit": "21d8295"}
poolside/laguna-xs.2:none PASS 100% 263 in → 122 out (0 reasoning)
Looking at the test results, I can see that: - Commit `4d6d8d4` (Initial commit): PASS - Commit `3fd1bb5` (add blue styling): PASS - Commit `081243f` (change to red): FAIL - Commit `640f885` (add theme title): FAIL The bug was introduced in commit `081243f` "change to red" since that's the first commit where tests started failing. 081243f
poolside/laguna-xs.2:none__json_schema PASS 100% 267 in → 13 out (0 reasoning)
837b959
JSON Schema Structured Output
(raw) {"commit": "837b959"}
Invalid JSON. Output: 3a28791
JSON Schema Structured Output
Structured Output Error
Failure: Failed to parse structured JSON response: Extra data: line 1 column 2 (char 1)
nvidia/nemotron-3-nano-30b-a3b:none FAIL 0% 220 in → 6 out (0 reasoning)
bfc5bd3
Failure: Expected target hash/prefix or subject 'change to red'
nvidia/nemotron-3-nano-30b-a3b:none__json_schema FAIL 0% 220 in → 14 out (0 reasoning)
efea947
JSON Schema Structured Output
(raw) { "commit": "efea947" }
Failure: Expected target hash/prefix or subject 'change to red'