GitBench
Branching history with a bad commit that removes validation: the model identifies the bad commit
Tests the ability to identify, bisect-style, the commit that removed validation logic. The regression under test is the removal of a safety check.

These commands set up the repo before the model sees the prompt. They define the starting file structure, staged changes, and Git history.

  1. git init
  2. git config user.email 'test@test.com'
  3. git config user.name 'Test User'
  4. echo '#!/bin/bash' > test.sh
  5. echo 'grep -q "if not data" src/auth.py' >> test.sh
  6. chmod +x test.sh
  7. mkdir -p src && echo 'def validate_input(data): if not data: return False return True' > src/auth.py
  8. git add test.sh src/auth.py && git commit -m 'Initial commit'
  9. echo '# validation enabled' >> src/auth.py
  10. git add src/auth.py && git commit -m 'add validation function'
  11. echo 'def validate_input(data): return True' > src/auth.py
  12. git add src/auth.py && git commit -m 'remove validation logic'
  13. git checkout -b feature
  14. echo '# auth logging enabled' >> src/auth.py
  15. git add src/auth.py && git commit -m 'add auth logging'
  16. git checkout main
  17. git merge feature
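The setup commands above can be replayed locally to confirm which commit a bisect lands on. The following is a minimal sketch, not the benchmark's own harness. It assumes a throwaway directory from mktemp, pins the initial branch name to main with git symbolic-ref (an addition that is not part of the original setup, so that the later checkout works regardless of the local default branch), and uses the variable names repo, good, and bad_subject purely for illustration.

```shell
#!/bin/bash
set -euo pipefail

# Recreate the benchmark history in a throwaway directory (assumed location).
repo=$(mktemp -d)
cd "$repo"
git init -q
# Assumption: force the unborn branch to be called main so that
# 'git checkout main' succeeds whatever init.defaultBranch is set to.
git symbolic-ref HEAD refs/heads/main
git config user.email 'test@test.com'
git config user.name 'Test User'

# test.sh exits 0 only while the validation check is still in src/auth.py.
echo '#!/bin/bash' > test.sh
echo 'grep -q "if not data" src/auth.py' >> test.sh
chmod +x test.sh

mkdir -p src
echo 'def validate_input(data): if not data: return False return True' > src/auth.py
git add test.sh src/auth.py && git commit -q -m 'Initial commit'
echo '# validation enabled' >> src/auth.py
git add src/auth.py && git commit -q -m 'add validation function'
echo 'def validate_input(data): return True' > src/auth.py
git add src/auth.py && git commit -q -m 'remove validation logic'
git checkout -q -b feature
echo '# auth logging enabled' >> src/auth.py
git add src/auth.py && git commit -q -m 'add auth logging'
git checkout -q main
git merge -q feature

# Bisect between the root commit (known good) and HEAD (known bad),
# letting test.sh classify each candidate by its exit status.
good=$(git rev-list --max-parents=0 HEAD)
git bisect start HEAD "$good" > /dev/null
git bisect run ./test.sh > /dev/null

# After a successful run, refs/bisect/bad points at the first bad commit.
bad_subject=$(git log -1 --format=%s refs/bisect/bad)
git bisect reset > /dev/null 2>&1
echo "$bad_subject"
```

Run as written, the script should print the subject line remove validation logic. Hashes differ on every run because commit timestamps differ, which is why the subject line is the stable form of the answer.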
Prompt
You are debugging a git repository. The following commit history has one bad commit that broke the build. Identify the bad commit hash (first 7 characters) or commit subject line that introduced the bug. Use the git log and test output to determine which commit is bad. Your answer should be the commit hash (e.g., 'abc1234') or the commit subject line.
Expected
remove validation logic
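Since the prompt accepts either a 7-character hash or the subject line, one way to check an answer is a substring match against both. The sketch below is an assumption about how such a check could look, not the grader GitBench actually uses. The function name is_correct and its three arguments are hypothetical.

```shell
#!/bin/bash
# Hypothetical checker: succeeds if the answer contains the expected subject
# line or the 7-character abbreviation of the bad commit's full hash.
is_correct() {
  local answer=$1 subject=$2 full_hash=$3
  local short=${full_hash:0:7}
  # Refuse to match on a hash that is too short to abbreviate.
  [ "${#short}" -eq 7 ] || return 2
  case "$answer" in
    *"$subject"*|*"$short"*) return 0 ;;
    *) return 1 ;;
  esac
}
```

For example, with a placeholder hash, is_correct 'The bad commit is 0123456' 'remove validation logic' '0123456789abcdef0123456789abcdef01234567' succeeds, while an answer of 'add auth logging' fails. A substring match is lenient by design: it accepts answers that wrap the hash or subject in explanatory prose.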
deepseek/deepseek-v4-flash:high PASS 100% 193 in → 199 out (186 reasoning)
df10fcb
deepseek/deepseek-v4-flash:high__json_schema PASS 100% 187 in → 124 out (102 reasoning)
d1f7220
JSON Schema Structured Output
(raw) { "commit": "d1f7220" }
deepseek/deepseek-v4-flash:none PASS 100% 189 in → 54 out (0 reasoning)
Based on the test results, the build passes for commits `b5930ba` and `b4466eb`, but fails starting from commit `4a30777`. This means the bug was introduced in commit `4a30777`. 4a30777
deepseek/deepseek-v4-flash:none__json_schema PASS 100% 189 in → 13 out (0 reasoning)
51dc5c7
JSON Schema Structured Output
(raw) { "commit": "51dc5c7" }
mistralai/devstral-2512 PASS 100% 203 in → 78 out
Based on the test results, the build started failing after the commit `ba4accf remove validation logic`. The test passed for `51ad8f1 add validation function` but failed for `ba4accf remove validation logic` and all subsequent commits. Therefore, the bad commit is: `ba4accf` (or the subject line "remove validation logic")
mistralai/devstral-2512__json_schema PASS 100% 205 in → 11 out
1f2ced9
JSON Schema Structured Output
(raw) {"commit": "1f2ced9"}
nvidia/nemotron-3-nano-30b-a3b:high PASS 100% 226 in → 212 out (190 reasoning)
f861b65
nvidia/nemotron-3-nano-30b-a3b:high__json_schema PASS 100% 228 in → 877 out (815 reasoning)
4887367
JSON Schema Structured Output
(raw) { "commit": "4887367" }
nvidia/nemotron-3-nano-30b-a3b:none PASS 100% 228 in → 33 out (0 reasoning)
abc1234 # Assuming "f253f6c" is the first 7 characters of the bad commit (remove validation logic)
nvidia/nemotron-3-nano-30b-a3b:none__json_schema PASS 100% 228 in → 15 out (0 reasoning)
e082b33
JSON Schema Structured Output
(raw) { "commit": "e082b33" }
poolside/laguna-xs.2:high PASS 100% 261 in → 411 out (261 reasoning)
Looking at the test results chronologically: 1. **6ba4dec** (Initial commit): PASS ✓ 2. **e273c16** (add validation function): PASS ✓ 3. **972f764** (remove validation logic): FAIL ✗ 4. **621d5a2** (add auth logging): FAIL ✗ The tests pass up until commit `e273c16`, but fail starting with commit `972f764`. This commit "remove validation logic" is the one that introduced the bug. **Bad commit:** 972f764 (or "remove validation logic")
poolside/laguna-xs.2:high__json_schema PASS 100% 255 in → 288 out (271 reasoning)
3154218 remove validation logic
JSON Schema Structured Output
(raw) {"commit": "3154218 remove validation logic"}
poolside/laguna-xs.2:none PASS 100% 261 in → 155 out (0 reasoning)
Looking at the test results, I can see that the build failed starting with commit `bba28d8`. Let me analyze the history: - `5b2a36f` (Initial commit): PASS - `ec44b2f` (add validation function): PASS - `bba28d8` (remove validation logic): FAIL - `e5a9284` (add auth logging): FAIL The validation function was added in `ec44b2f` and the tests passed. Then `bba28d8` removed the validation logic, and that's when the tests started failing. This commit broke the build. **Answer: bba28d8**
poolside/laguna-xs.2:none__json_schema PASS 100% 257 in → 12 out (0 reasoning)
3b3ee20
JSON Schema Structured Output
(raw) {"commit": "3b3ee20"}