GitBench
Single unmerged branch
Tests whether the model recognizes that no branches should be deleted when the only branch besides main is unmerged. Covers the all-unmerged edge case.

These commands set up the repo before the model sees the prompt. They define the starting file structure, staged changes, and Git history.

  1. git init
  2. git config user.email 'test@test.com'
  3. git config user.name 'Test User'
  4. echo 'App' > app.py
  5. git add app.py
  6. git commit -m 'Initial'
  7. git checkout -b feature-experimental
  8. echo 'experimental' > exp.py
  9. git add exp.py
  10. git commit -m 'Experiment'
  11. git checkout main
Prompt
Look at the branches in this repository. Which branches should be deleted because they are fully merged into main? List ONLY the branch names to delete, one per line. If none, respond with 'none'.
Expected
none
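The expected answer can be checked mechanically. The following is a minimal sketch, not part of the benchmark: it assumes `git` is on the PATH and that the base branch is named `main`, as in the setup commands. The helper names `parse_merged` and `merged_branches` are hypothetical. In this repository `feature-experimental` carries the 'Experiment' commit, which main lacks, so only main itself is reported as merged and nothing is deletable.

```python
import subprocess


def parse_merged(listing: str, base: str = "main") -> list[str]:
    """Turn one-branch-per-line output into deletable branch names.

    The base branch is always "merged" into itself, so it is excluded.
    """
    names = [line.strip() for line in listing.splitlines() if line.strip()]
    return [name for name in names if name != base]


def merged_branches(repo: str, base: str = "main") -> list[str]:
    """List local branches fully merged into `base` (hypothetical helper)."""
    result = subprocess.run(
        ["git", "-C", repo, "branch", "--merged", base,
         "--format=%(refname:short)"],
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_merged(result.stdout, base)


def answer(branches: list[str]) -> str:
    """Format the reply the prompt asks for: one name per line, or 'none'."""
    return "\n".join(branches) if branches else "none"
```

For the repository built by the setup commands, `answer(merged_branches("."))` should yield `none`, assuming the default branch was created as `main`.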
deepseek/deepseek-v4-flash:high PASS 100% 79 in → 74 out (71 reasoning)
none
mistralai/devstral-2512 PASS 100% 79 in → 2 out
none
nvidia/nemotron-3-nano-30b-a3b:high PASS 100% 91 in → 79 out (90 reasoning)
none
nvidia/nemotron-3-nano-30b-a3b:high__json_schema PASS 100% 94 in → 287 out (304 reasoning)
none
JSON Schema Structured Output
(raw) {"branches_to_delete":["none"]}
nvidia/nemotron-3-nano-30b-a3b:none PASS 100% 94 in → 2 out (0 reasoning)
none
poolside/laguna-xs.2:high PASS 100% 134 in → 424 out (420 reasoning)
none
poolside/laguna-xs.2:none PASS 100% 135 in → 3 out (0 reasoning)
none
deepseek/deepseek-v4-flash:high__json_schema FAIL 0% 79 in → 105 out (107 reasoning)
(empty output)
JSON Schema Structured Output
(raw) { "branches_to_delete": [] }
Failure: Missing: ['none']
deepseek/deepseek-v4-flash:none FAIL 0% 77 in → 4 out (0 reasoning)
feature-experimental
Failure: Missing: ['none'] Extra: ['feature-experimental']
deepseek/deepseek-v4-flash:none__json_schema FAIL 0% 79 in → 14 out (0 reasoning)
feature-experimental
JSON Schema Structured Output
(raw) { "branches_to_delete": ["feature-experimental"] }
Failure: Missing: ['none'] Extra: ['feature-experimental']
feature-experimental
JSON Schema Structured Output
(raw) {"branches_to_delete": ["feature-experimental"]}
Failure: Missing: ['none'] Extra: ['feature-experimental']
nvidia/nemotron-3-nano-30b-a3b:none__json_schema FAIL 0% 96 in → 19 out (0 reasoning)
feature-experimental
JSON Schema Structured Output
(raw) { "branches_to_delete": [ "feature-experimental" ] }
Failure: Missing: ['none'] Extra: ['feature-experimental']
poolside/laguna-xs.2:high__json_schema FAIL 0% 137 in → 175 out (166 reasoning)
(empty output)
JSON Schema Structured Output
(raw) {"branches_to_delete": []}
Failure: Missing: ['none']
poolside/laguna-xs.2:none__json_schema FAIL 0% 133 in → 11 out (0 reasoning)
feature-experimental
JSON Schema Structured Output
(raw) {"branches_to_delete": ["feature-experimental"]}
Failure: Missing: ['none'] Extra: ['feature-experimental']
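The failure messages above are consistent with a line-by-line comparison against the expected answer. The sketch below is an assumption about how such a check might work, not the benchmark's actual grader. The names `grade` and `flatten` are hypothetical, and the `branches_to_delete` key is taken from the raw outputs shown. It illustrates why an empty JSON array fails: it flattens to empty output, which does not contain the literal `none`.

```python
import json


def flatten(raw_json: str) -> str:
    """Join the structured answer into one branch name per line (assumed step)."""
    return "\n".join(json.loads(raw_json)["branches_to_delete"])


def grade(expected: str, output: str) -> tuple[bool, str]:
    """Compare expected and actual answers as lists of non-empty lines."""
    want = [line.strip() for line in expected.splitlines() if line.strip()]
    got = [line.strip() for line in output.splitlines() if line.strip()]
    missing = [w for w in want if w not in got]
    extra = [g for g in got if g not in want]
    parts = []
    if missing:
        parts.append(f"Missing: {missing}")
    if extra:
        parts.append(f"Extra: {extra}")
    # Passes only when nothing is missing and nothing is extra.
    return (not parts, " ".join(parts))


if __name__ == "__main__":
    print(grade("none", flatten('{"branches_to_delete": ["none"]}')))
    print(grade("none", flatten('{"branches_to_delete": []}')))
    print(grade("none", "feature-experimental"))
```

Under this reading, a structured-output run passes only if the array holds the string `none`; an empty array, although arguably the more natural encoding of "no branches", is scored as a miss.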