GitBench
Multiple merged branches
Tests the ability to identify multiple merged branches for deletion. Evaluates parsing of git branch --merged output when several branches are listed.

These commands set up the repo before the model sees the prompt. They define the starting file structure, staged changes, and Git history.

  1. git init
  2. git config user.email 'test@test.com'
  3. git config user.name 'Test User'
  4. echo 'v1' > version.txt
  5. git add version.txt
  6. git commit -m 'Initial version'
  7. git checkout -b fix-a
  8. echo 'fix a' > a.txt
  9. git add a.txt
  10. git commit -m 'Fix A'
  11. git checkout main
  12. git merge fix-a
  13. git checkout -b fix-b
  14. echo 'fix b' > b.txt
  15. git add b.txt
  16. git commit -m 'Fix B'
  17. git checkout main
  18. git merge fix-b
  19. git checkout -b feature-new
  20. echo 'new feature' > new.txt
  21. git add new.txt
  22. git commit -m 'New feature'
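The setup above leaves fix-a and fix-b fully merged into main, while feature-new carries one commit that main does not have. Running git branch --merged main in that state would also list main itself, since a branch is always merged into itself, so the target branch has to be filtered out. A minimal sketch of that parsing step, assuming the default branch is named main; the helper name merged_branches_to_delete and the sample string are illustrative and not part of GitBench:

```python
def merged_branches_to_delete(branch_output: str, target: str = "main") -> list[str]:
    """Return the names listed in `git branch --merged <target>` output, minus the target."""
    names = []
    for line in branch_output.splitlines():
        # git prints a two-character prefix: "* " for the checked-out branch,
        # "+ " for a branch checked out in another worktree, "  " otherwise.
        name = line[2:].strip() if line[:2] in ("* ", "+ ", "  ") else line.strip()
        # Skip blank lines, the detached-HEAD pseudo entry, and the target branch.
        if not name or name.startswith("(") or name == target:
            continue
        names.append(name)
    return sorted(names)


# Assumed output for this repository, with feature-new checked out.
sample = "  fix-a\n  fix-b\n  main\n"
print("\n".join(merged_branches_to_delete(sample)))
# prints fix-a and fix-b, one per line
```

Dropping the target branch explicitly is the step that separates a passing answer from one that also proposes deleting main.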
Prompt
Look at the branches in this repository. Which branches should be deleted because they are fully merged into main? List ONLY the branch names to delete, one per line. If none, respond with 'none'.
Expected
fix-a fix-b
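The failure messages in the results (for example Extra: ['main']) are consistent with an order-insensitive comparison between the expected names and the returned names. The GitBench scoring code is not shown on this page, so the following is only a sketch under that assumption; compare_answer and the Missing label are hypothetical:

```python
def compare_answer(expected: str, actual: str) -> tuple[bool, str]:
    """Compare two whitespace-separated branch lists, ignoring order and duplicates."""
    want, got = set(expected.split()), set(actual.split())
    missing, extra = sorted(want - got), sorted(got - want)
    problems = []
    if missing:
        problems.append(f"Missing: {missing}")  # hypothetical label, not seen in the results
    if extra:
        problems.append(f"Extra: {extra}")
    return (not problems, "; ".join(problems))


print(compare_answer("fix-a fix-b", "fix-a fix-b main"))
# prints (False, "Extra: ['main']")
```

Splitting on any whitespace means a one-per-line answer and a space-separated answer are scored the same way.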
deepseek/deepseek-v4-flash:high PASS 100% 109 in → 98 out (91 reasoning)
fix-a fix-b
deepseek/deepseek-v4-flash:none PASS 100% 109 in → 5 out (0 reasoning)
fix-a fix-b
deepseek/deepseek-v4-flash:none__json_schema PASS 100% 114 in → 23 out (0 reasoning)
fix-a fix-b
JSON Schema Structured Output
(raw) { "branches_to_delete": [ "fix-a", "fix-b" ] }
mistralai/devstral-2512 PASS 100% 120 in → 6 out
fix-a fix-b
mistralai/devstral-2512__json_schema PASS 100% 117 in → 16 out
fix-a fix-b
JSON Schema Structured Output
(raw) {"branches_to_delete": ["fix-a", "fix-b"]}
nvidia/nemotron-3-nano-30b-a3b:high PASS 100% 132 in → 194 out (207 reasoning)
fix-a fix-b
nvidia/nemotron-3-nano-30b-a3b:high__json_schema PASS 100% 134 in → 249 out (257 reasoning)
fix-a fix-b
JSON Schema Structured Output
(raw) { "branches_to_delete": [ "fix-a", "fix-b" ] }
nvidia/nemotron-3-nano-30b-a3b:none FAIL 100% 133 in → 8 out (0 reasoning)
fix-a fix-b main
Failure: Extra: ['main']
nvidia/nemotron-3-nano-30b-a3b:none__json_schema FAIL 100% 128 in → 19 out (0 reasoning)
fix-a fix-b main
JSON Schema Structured Output
(raw) {"branches_to_delete": ["fix-a", "fix-b", "main"]}
Failure: Extra: ['main']
poolside/laguna-xs.2:high PASS 100% 171 in → 207 out (199 reasoning)
fix-a fix-b
poolside/laguna-xs.2:high__json_schema PASS 100% 166 in → 368 out (351 reasoning)
fix-a fix-b
JSON Schema Structured Output
(raw) {"branches_to_delete": ["fix-a", "fix-b"]}
poolside/laguna-xs.2:none PASS 100% 168 in → 7 out (0 reasoning)
fix-a fix-b
poolside/laguna-xs.2:none__json_schema FAIL 100% 170 in → 18 out (0 reasoning)
feature-new fix-a fix-b
JSON Schema Structured Output
(raw) {"branches_to_delete": ["feature-new", "fix-a", "fix-b"]}
Failure: Extra: ['feature-new']
Invalid JSON. Output:
JSON Schema Structured Output
Structured Output Error
Failed to parse structured JSON response: Expecting value: line 1 column 1 (char 0)
Failure: Failed to parse structured JSON response: Expecting value: line 1 column 1 (char 0)
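The (raw) lines in the structured-output results all share a single branches_to_delete array, and the parse error above matches the message Python's json module produces when the reply is empty or starts with non-JSON text. A minimal sketch of that extraction step, assuming the harness reads the reply with json.loads; the function name branches_from_structured is illustrative:

```python
import json


def branches_from_structured(raw: str) -> list[str]:
    """Parse a structured reply and return its branch list."""
    data = json.loads(raw)  # raises json.JSONDecodeError on empty or non-JSON text
    return list(data["branches_to_delete"])


print(branches_from_structured('{"branches_to_delete": ["fix-a", "fix-b"]}'))
# prints ['fix-a', 'fix-b']

try:
    branches_from_structured("")
except json.JSONDecodeError as err:
    print(f"Failed to parse structured JSON response: {err}")
# prints Failed to parse structured JSON response: Expecting value: line 1 column 1 (char 0)
```

The position line 1 column 1 (char 0) indicates the parser failed on the very first character, which points to an empty or plain-text reply rather than a truncated JSON object.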