GitBench
Branching history with bad commit on main branch - model identifies commit before feature merge
Tests whether the model can identify a bad commit made on main before a feature branch is merged. Evaluates bisect-style reasoning over a branching history.

These commands set up the repo before the model sees the prompt. They define the starting file structure, staged changes, and Git history.

  1. git init
  2. git config user.email 'test@test.com'
  3. git config user.name 'Test User'
  4. echo '#!/bin/bash' > test.sh
  5. echo 'grep -v "DEBUG" src/app.js | grep -q "production"' >> test.sh
  6. chmod +x test.sh
  7. mkdir -p src && echo 'const ENV = "production";' > src/app.js
  8. git add test.sh src/app.js && git commit -m 'Initial commit'
  9. echo '// production defaults' >> src/app.js
  10. git add src/app.js && git commit -m 'setup production env'
  11. echo 'const ENV = "development";' > src/app.js
  12. git add src/app.js && git commit -m 'switch to dev'
  13. git checkout -b feature
  14. echo 'const ENV = "development";' > src/app.js
  15. echo 'const DEBUG = true;' >> src/app.js
  16. git add src/app.js && git commit -m 'add diagnostics'
  17. git checkout main
  18. git merge feature
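The setup above can be replayed locally to check the expected answer. The following is a minimal sketch, not the benchmark's own harness: it rebuilds the repository in a scratch directory and lets `git bisect run` find the first commit where `test.sh` fails. Several details are assumptions added for illustration: the variable names (`repo`, `root`, `first_bad_hash`, `first_bad_subject`), forcing the initial branch to `main` via `git symbolic-ref`, and running the test through `sh` so the sketch does not depend on bash being installed. Because main does not advance after `feature` is created, the final merge is expected to fast-forward under default Git settings, which leaves four commits in a straight line.

```shell
#!/bin/sh
# Sketch: rebuild the benchmark repository in a scratch directory and use
# `git bisect run` to locate the first commit where test.sh fails.
set -eu

repo=$(mktemp -d)
cd "$repo"

git init -q
# Assumption: force the initial branch name to `main`, because the default
# branch name depends on the local Git version and configuration.
git symbolic-ref HEAD refs/heads/main
git config user.email 'test@test.com'
git config user.name 'Test User'

# Same commands as the setup list, with -q added to reduce noise.
echo '#!/bin/bash' > test.sh
echo 'grep -v "DEBUG" src/app.js | grep -q "production"' >> test.sh
chmod +x test.sh
mkdir -p src && echo 'const ENV = "production";' > src/app.js
git add test.sh src/app.js && git commit -q -m 'Initial commit'
echo '// production defaults' >> src/app.js
git add src/app.js && git commit -q -m 'setup production env'
echo 'const ENV = "development";' > src/app.js
git add src/app.js && git commit -q -m 'switch to dev'
git checkout -q -b feature
echo 'const ENV = "development";' > src/app.js
echo 'const DEBUG = true;' >> src/app.js
git add src/app.js && git commit -q -m 'add diagnostics'
git checkout -q main
git merge -q feature

# Treat the tip of main as bad and the root commit as good.
root=$(git rev-list --max-parents=0 HEAD)
git bisect start HEAD "$root" >/dev/null
# Exit code 0 marks a commit good; exit code 1 marks it bad.
git bisect run sh ./test.sh >/dev/null

# After a finished bisect, refs/bisect/bad points at the first bad commit.
first_bad_hash=$(git rev-parse --short=7 refs/bisect/bad)
first_bad_subject=$(git log -1 --format=%s refs/bisect/bad)
git bisect reset >/dev/null

echo "$first_bad_hash $first_bad_subject"

# Remove the scratch repository.
cd / && rm -rf "$repo"
```

The printed subject should be the expected answer for this test case. The printed hash will differ on every run, since commit hashes depend on commit timestamps; this would also explain why each model row below cites a different hash for the same commit.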
Prompt
You are debugging a git repository. The following commit history has one bad commit that broke the build. Identify the bad commit hash (first 7 characters) or commit subject line that introduced the bug. Use the git log and test output to determine which commit is bad. Your answer should be the commit hash (e.g., 'abc1234') or the commit subject line.
Expected
switch to dev
deepseek/deepseek-v4-flash:high PASS 100% 187 in → 172 out (167 reasoning)
6be2447
deepseek/deepseek-v4-flash:high__json_schema PASS 100% 191 in → 115 out (101 reasoning)
bc2194c
JSON Schema Structured Output
(raw) { "commit": "bc2194c" }
deepseek/deepseek-v4-flash:none PASS 100% 189 in → 80 out (0 reasoning)
The test results show that commits `4347f60` (Initial commit) and `ff9cb35` (setup production env) both pass. The first failure occurs at commit `a3fd87a` (switch to dev), and all subsequent commits also fail. This means the bug was introduced in commit `a3fd87a`. **Answer:** a3fd87a
deepseek/deepseek-v4-flash:none__json_schema PASS 100% 187 in → 10 out (0 reasoning)
18c5c90
JSON Schema Structured Output
(raw) {"commit":"18c5c90"}
mistralai/devstral-2512 PASS 100% 213 in → 98 out
Based on the test results, the build was working fine until the commit `e844f3d switch to dev`, which introduced the failure. The subsequent commit `d451fe5 add diagnostics` also failed, but since the failure was already present in `e844f3d`, that is the commit that introduced the bug. The bad commit is: `e844f3d` (or the subject line "switch to dev").
mistralai/devstral-2512__json_schema PASS 100% 207 in → 11 out
ba8ec23
JSON Schema Structured Output
(raw) {"commit": "ba8ec23"}
nvidia/nemotron-3-nano-30b-a3b:high PASS 100% 224 in → 168 out (166 reasoning)
23166dc
nvidia/nemotron-3-nano-30b-a3b:high__json_schema PASS 100% 224 in → 1,041 out (1,007 reasoning)
bafb98e
JSON Schema Structured Output
(raw) {"commit": "bafb98e" }
nvidia/nemotron-3-nano-30b-a3b:none__json_schema PASS 100% 218 in → 19 out (0 reasoning)
a6dc74e switch to dev
JSON Schema Structured Output
(raw) { "commit": "a6dc74e switch to dev" }
poolside/laguna-xs.2:high PASS 100% 265 in → 415 out (295 reasoning)
Based on the test results, I can see that: 1. Commits e9e3627 and 6a19c14 both pass (exit code 0) 2. Commits 3e78b09 and 53942a9 both fail (exit code 1) The first commit where the tests start failing is **3e78b09** with the subject "switch to dev". This is the commit that introduced the bug. **Answer: 3e78b09**
poolside/laguna-xs.2:high__json_schema PASS 100% 259 in → 271 out (258 reasoning)
c167ff8
JSON Schema Structured Output
(raw) {"commit": "c167ff8"}
poolside/laguna-xs.2:none PASS 100% 261 in → 101 out (0 reasoning)
Looking at the test results, I can see that: - Commits `0608f29` and `8e5e3ff` both pass - Commits `fd95d61` and `1034857` both fail Since `fd95d61` is the first commit where the tests start failing, this is the commit that introduced the bug. **Bad commit: `fd95d61`**
poolside/laguna-xs.2:none__json_schema PASS 100% 263 in → 13 out (0 reasoning)
33d612c
JSON Schema Structured Output
(raw) {"commit": "33d612c"}
nvidia/nemotron-3-nano-30b-a3b:none FAIL 0% 222 in → 7 out (0 reasoning)
609e7bf
Failure: Expected target hash/prefix or subject 'switch to dev'
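
The failure message suggests that an answer passes when it contains either the target commit's hash prefix or the subject line. The sketch below illustrates one way such a check could work. It is an assumption, not GitBench's actual grader: the function name `matches_expected`, its argument order, and the plain substring rule are all hypothetical, and the real grader may normalize case or parse the JSON first.

```shell
#!/bin/sh
# Hypothetical grading rule (not GitBench's actual implementation):
# accept an answer that contains the 7-character target hash prefix
# or the expected subject line as a substring.
set -eu

matches_expected() {
  answer=$1         # raw model answer
  target_prefix=$2  # first 7 characters of the bad commit's hash for this run
  subject=$3        # expected commit subject line
  case "$answer" in
    # Quoted expansions inside a case pattern are matched literally.
    *"$target_prefix"*|*"$subject"*) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage with answers taken from the rows above. The target prefixes are
# illustrative, because the page does not show each run's true hash.
if matches_expected 'a6dc74e switch to dev' 'a6dc74e' 'switch to dev'; then
  echo 'PASS'
fi
if ! matches_expected '609e7bf' '0000000' 'switch to dev'; then
  echo 'FAIL'
fi
```

Under this rule a bare hash passes only if it matches the hash generated for that particular run, so an answer such as `609e7bf` fails when it names a different commit and omits the subject line.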