GitBench
All branches merged
Tests whether the model can identify that every non-main branch is fully merged into main and can therefore be deleted. Evaluates the full-cleanup scenario.

These commands set up the repo before the model sees the prompt. They define the starting file structure, staged changes, and Git history.

  1. git init
  2. git config user.email 'test@test.com'
  3. git config user.name 'Test User'
  4. echo 'start' > start.txt
  5. git add start.txt
  6. git commit -m 'Start'
  7. git checkout -b fix-1
  8. echo 'fix1' > fix1.txt
  9. git add fix1.txt
  10. git commit -m 'Fix 1'
  11. git checkout main
  12. git merge fix-1
  13. git checkout -b fix-2
  14. echo 'fix2' > fix2.txt
  15. git add fix2.txt
  16. git commit -m 'Fix 2'
  17. git checkout main
  18. git merge fix-2
  19. git checkout -b cleanup
  20. echo 'cleanup' > cleanup.txt
  21. git add cleanup.txt
  22. git commit -m 'Cleanup'
  23. git checkout main
  24. git merge cleanup
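With Git's default merge settings, each merge in the setup above is a fast-forward, so every branch tip ends up reachable from main. The following is a minimal sketch of how that could be checked locally; it is not part of the benchmark. It condenses the setup into a loop, so the file names and commit messages differ slightly from the original steps, and it runs in a throwaway directory. It also forces the initial branch name to main, an assumption added here because the plain `git init` in the setup relies on the local default branch already being main.

```shell
#!/bin/sh
# Sketch: replay a condensed version of the setup in a throwaway directory,
# then ask Git which branches are already contained in main.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
# Assumption: point the unborn HEAD at main so the later "git checkout main"
# steps work regardless of the init.defaultBranch setting.
git symbolic-ref HEAD refs/heads/main
git config user.email 'test@test.com'
git config user.name 'Test User'
echo 'start' > start.txt
git add start.txt
git commit -q -m 'Start'

# Each branch adds one file and is then merged into main. The file names
# used here are illustrative, not the ones from the original steps.
for b in fix-1 fix-2 cleanup; do
    git checkout -q -b "$b"
    echo "$b" > "$b.txt"
    git add "$b.txt"
    git commit -q -m "$b"
    git checkout -q main
    git merge -q "$b"
done

# Branches whose tips are reachable from main, excluding main itself.
merged=$(git branch --merged main --format='%(refname:short)' | grep -v '^main$' | LC_ALL=C sort)
echo "$merged"
```

In this state, `git branch -d` would accept each of the listed branches, since it only refuses branches that are not fully merged.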
Prompt
Look at the branches in this repository. Which branches should be deleted because they are fully merged into main? List ONLY the branch names to delete, one per line. If none, respond with 'none'.
Expected
cleanup fix-1 fix-2
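The source does not show how GitBench scores a response, so the following is only a sketch of one plausible comparison: treat both the expected answer and the response as sets of branch names, ignoring order and whether names are separated by spaces or newlines. The `normalize` helper and the sample `response` value are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: normalize a list of branch names so that ordering and
# whitespace (spaces vs. newlines) do not affect the comparison.
normalize() {
    printf '%s\n' "$1" | tr -s ' \t' '\n\n' | grep -v '^$' | LC_ALL=C sort -u
}

expected='cleanup fix-1 fix-2'
# Sample response, one branch per line as the prompt requests.
response='fix-2
cleanup
fix-1'

if [ "$(normalize "$response")" = "$(normalize "$expected")" ]; then
    verdict=PASS
else
    verdict=FAIL
fi
echo "$verdict"
```

A set comparison like this would also accept answers listed in a different order, which may or may not match what the real grader does.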
deepseek/deepseek-v4-flash:high PASS 100% 116 in → 94 out (82 reasoning)
cleanup fix-1 fix-2
deepseek/deepseek-v4-flash:high__json_schema PASS 100% 119 in → 147 out (107 reasoning)
cleanup fix-1 fix-2
JSON Schema Structured Output
(raw) { "branches_to_delete": [ "cleanup", "fix-1", "fix-2" ] }
deepseek/deepseek-v4-flash:none PASS 100% 115 in → 11 out (0 reasoning)
cleanup fix-1 fix-2
deepseek/deepseek-v4-flash:none__json_schema PASS 100% 118 in → 30 out (0 reasoning)
cleanup fix-1 fix-2
JSON Schema Structured Output
(raw) { "branches_to_delete": [ "cleanup", "fix-1", "fix-2" ] }
mistralai/devstral-2512 PASS 100% 122 in → 11 out
cleanup fix-1 fix-2
mistralai/devstral-2512__json_schema PASS 100% 127 in → 22 out
cleanup fix-1 fix-2
JSON Schema Structured Output
(raw) {"branches_to_delete": ["cleanup", "fix-1", "fix-2"]}
nvidia/nemotron-3-nano-30b-a3b:high PASS 100% 139 in → 131 out (127 reasoning)
cleanup fix-1 fix-2
nvidia/nemotron-3-nano-30b-a3b:high__json_schema PASS 100% 140 in → 258 out (209 reasoning)
cleanup fix-1 fix-2
JSON Schema Structured Output
(raw) { "branches_to_delete": [ "cleanup", "fix-1", "fix-2" ] }
nvidia/nemotron-3-nano-30b-a3b:none PASS 100% 139 in → 11 out (0 reasoning)
cleanup fix-1 fix-2
nvidia/nemotron-3-nano-30b-a3b:none__json_schema PASS 100% 139 in → 30 out (0 reasoning)
cleanup fix-1 fix-2
JSON Schema Structured Output
(raw) { "branches_to_delete": [ "cleanup", "fix-1", "fix-2" ] }
poolside/laguna-xs.2:high PASS 100% 169 in → 185 out (173 reasoning)
cleanup fix-1 fix-2
poolside/laguna-xs.2:high__json_schema PASS 100% 169 in → 233 out (213 reasoning)
cleanup fix-1 fix-2
JSON Schema Structured Output
(raw) {"branches_to_delete": ["cleanup", "fix-1", "fix-2"]}
poolside/laguna-xs.2:none PASS 100% 172 in → 11 out (0 reasoning)
cleanup fix-1 fix-2
poolside/laguna-xs.2:none__json_schema PASS 100% 169 in → 19 out (0 reasoning)
cleanup fix-1 fix-2
JSON Schema Structured Output
(raw) {"branches_to_delete": ["cleanup", "fix-1", "fix-2"]}
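The structured-output variants above all return an object with a `branches_to_delete` array. As a rough sketch, the branch names could be pulled out of such a raw payload with standard text tools, as below. This is a deliberately simple pattern match that assumes a flat array of plain strings, as seen in the results above; it is not a general JSON parser, and a JSON-aware tool would be more robust.

```shell
#!/bin/sh
# Raw payload copied from one of the structured-output results above.
raw='{ "branches_to_delete": [ "cleanup", "fix-1", "fix-2" ] }'

# 1. sed keeps only the text between the [ and ] of branches_to_delete.
# 2. grep -o emits each quoted string on its own line.
# 3. tr strips the surrounding double quotes.
branches=$(printf '%s\n' "$raw" \
    | sed -n 's/.*"branches_to_delete"[[:space:]]*:[[:space:]]*\[\([^]]*\)].*/\1/p' \
    | grep -o '"[^"]*"' \
    | tr -d '"')
echo "$branches"
```

The result is one branch name per line, the same shape the plain-text prompt asks for.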