GitBench
Mixed merged and unmerged with many branches
Tests the ability to process many branches (10+) with mixed merge status. Evaluates how well branch cleanup analysis scales.

These commands set up the repo before the model sees the prompt. They define the starting file structure, staged changes, and Git history.

  git init
  git config user.email 'test@test.com'
  git config user.name 'Test User'
  echo 'App' > app.txt
  git add app.txt
  git commit -m 'Initial'
  git checkout -b fix-typo
  echo 'typo fix' > typo.txt
  git add typo.txt
  git commit -m 'Fix typo'
  git checkout main
  git merge fix-typo
  git checkout -b fix-bug
  echo 'bug fix' > bug.txt
  git add bug.txt
  git commit -m 'Fix bug'
  git checkout main
  git merge fix-bug
  git checkout -b feature-v2
  echo 'v2' > v2.txt
  git add v2.txt
  git commit -m 'Start v2'
  git checkout main
  git checkout -b feature-v3
  echo 'v3' > v3.txt
  git add v3.txt
  git commit -m 'Start v3'
  git checkout main
  git checkout -b docs-update
  echo 'docs' > docs.txt
  git add docs.txt
  git commit -m 'Update docs'
  git checkout main
  git merge docs-update
Prompt
Look at the branches in this repository. Which branches should be deleted because they are fully merged into main? List ONLY the branch names to delete, one per line. If none, respond with 'none'.
Expected
docs-update fix-bug fix-typo
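One way to reproduce the expected answer is to ask Git which local branches are already reachable from main. The sketch below assumes it is run inside the repository created by the setup commands and that the default branch is named main. The helper name merged_branches is hypothetical and not part of the benchmark.

```shell
#!/bin/sh
# Sketch: print local branches that are fully merged into main, one per line.
# Assumptions: the current directory is a Git repository and its default
# branch is called "main". The function name is illustrative only.
merged_branches() {
  # --merged main keeps only branches whose tip is reachable from main.
  # --format prints bare branch names without the "*" current-branch marker.
  git branch --merged main --format='%(refname:short)' |
    grep -v '^main$' |   # main is trivially merged into itself, so drop it
    sort                 # stable ordering for comparison
}
```

Run in the setup repository, merged_branches should list docs-update, fix-bug, and fix-typo. feature-v2 and feature-v3 are left out because their commits were never merged and are not reachable from main.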
deepseek/deepseek-v4-flash:high PASS 100% 152 in → 188 out (174 reasoning)
docs-update fix-bug fix-typo
deepseek/deepseek-v4-flash:high__json_schema PASS 100% 151 in → 182 out (133 reasoning)
docs-update fix-bug fix-typo
JSON Schema Structured Output
(raw) { "branches_to_delete": [ "docs-update", "fix-bug", "fix-typo" ] }
deepseek/deepseek-v4-flash:none PASS 100% 146 in → 13 out (0 reasoning)
docs-update fix-bug fix-typo
deepseek/deepseek-v4-flash:none__json_schema PASS 100% 150 in → 26 out (0 reasoning)
docs-update fix-bug fix-typo
JSON Schema Structured Output
(raw) { "branches_to_delete": ["docs-update", "fix-bug", "fix-typo"] }
mistralai/devstral-2512 PASS 100% 162 in → 12 out
docs-update fix-bug fix-typo
mistralai/devstral-2512__json_schema PASS 100% 162 in → 23 out
docs-update fix-bug fix-typo
JSON Schema Structured Output
(raw) {"branches_to_delete": ["docs-update", "fix-bug", "fix-typo"]}
nvidia/nemotron-3-nano-30b-a3b:high PASS 100% 170 in → 169 out (173 reasoning)
docs-update fix-bug fix-typo
nvidia/nemotron-3-nano-30b-a3b:high__json_schema PASS 100% 176 in → 161 out (131 reasoning)
docs-update fix-bug fix-typo
JSON Schema Structured Output
(raw) { "branches_to_delete": [ "docs-update", "fix-bug", "fix-typo" ] }
nvidia/nemotron-3-nano-30b-a3b:none PASS 100% 175 in → 12 out (0 reasoning)
docs-update fix-bug fix-typo
nvidia/nemotron-3-nano-30b-a3b:none__json_schema PASS 100% 170 in → 30 out (0 reasoning)
docs-update fix-bug fix-typo
JSON Schema Structured Output
(raw) { "branches_to_delete": [ "docs-update", "fix-bug", "fix-typo" ] }
poolside/laguna-xs.2:high PASS 100% 203 in → 271 out (259 reasoning)
docs-update fix-bug fix-typo
poolside/laguna-xs.2:high__json_schema PASS 100% 199 in → 189 out (169 reasoning)
docs-update fix-bug fix-typo
JSON Schema Structured Output
(raw) {"branches_to_delete": ["docs-update", "fix-bug", "fix-typo"]}
poolside/laguna-xs.2:none PASS 100% 204 in → 11 out (0 reasoning)
docs-update fix-bug fix-typo
poolside/laguna-xs.2:none__json_schema PASS 100% 205 in → 19 out (0 reasoning)
docs-update fix-bug fix-typo
JSON Schema Structured Output
(raw) {"branches_to_delete": ["docs-update", "fix-bug", "fix-typo"]}
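The page does not say how a response is scored against the expected list. As a rough illustration only, an order-insensitive comparison of branch names could look like the sketch below. Both function names are hypothetical, and the actual GitBench grader may work differently.

```shell
#!/bin/sh
# Sketch: compare two branch lists while ignoring order and whitespace layout.
# Assumption: answers are plain branch names separated by spaces or newlines.
# normalize_answer and answers_match are illustrative names, not GitBench APIs.
normalize_answer() {
  # Put each name on its own line, drop blank lines, then sort and de-duplicate.
  printf '%s\n' "$1" | tr -s '[:space:]' '\n' | sed '/^$/d' | sort -u
}

answers_match() {
  # Succeeds (exit status 0) when both answers contain the same set of names.
  [ "$(normalize_answer "$1")" = "$(normalize_answer "$2")" ]
}
```

For example, answers_match "fix-typo docs-update fix-bug" "docs-update fix-bug fix-typo" succeeds, while an answer that omits a branch or adds feature-v2 does not.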