GitBench
Single-line status message conflict
Tests the ability to resolve a merge conflict in a single-line status message. Evaluates how the model chooses between two contradictory status values.

These commands set up the repo before the model sees the prompt. They define the starting file structure, staged changes, and Git history.

  1. git init
  2. git config user.email 'test@test.com'
  3. git config user.name 'Test User'
  4. echo 'Status: OK' > status.txt
  5. git add status.txt
  6. git commit -m 'Initial status'
  7. git checkout -b fix
  8. echo 'Status: Resolved' > status.txt
  9. git add status.txt
  10. git commit -m 'Fix status'
  11. git checkout main
  12. echo 'Status: Failed' > status.txt
  13. git add status.txt
  14. git commit -m 'Update status'
  15. git merge fix
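The final merge stops with a conflict, because main and fix both changed the only line of status.txt. As a rough sketch of what the file would contain at that point, the helper below renders a single conflicted hunk. The name `render_conflict` and its parameters are hypothetical, and the layout assumes Git's default `merge` conflict style (not `diff3` or `zdiff3`). The setup also assumes the initial branch is named main; on an installation where `git init` creates `master`, the `git checkout main` step would fail.

```python
def render_conflict(ours: str, theirs: str,
                    ours_label: str = "HEAD",
                    theirs_label: str = "fix") -> str:
    """Render one conflicted hunk in Git's default 'merge' conflict style.

    Hypothetical helper for illustration; it does not call Git.
    """
    return (
        f"<<<<<<< {ours_label}\n"   # side that was checked out (main)
        f"{ours}\n"
        "=======\n"                 # separator between the two sides
        f"{theirs}\n"
        f">>>>>>> {theirs_label}\n"  # side being merged in (fix)
    )


# main is checked out when the merge runs, so its content is "ours".
conflicted = render_conflict("Status: Failed", "Status: Resolved")
print(conflicted, end="")
```

The base value 'Status: OK' does not appear in this style; the prompt supplies it separately.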
Prompt
Resolve the merge conflict in status.txt. The base was 'Status: OK'. Branch A changed it to 'Status: Resolved'. Branch B changed it to 'Status: Failed'. Provide ONLY the resolved file content.
Expected
Status: Resolved
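The scoring rule itself is not shown here. A minimal sketch of one plausible check, assuming the answer is compared with the expected content after stripping surrounding whitespace (that normalization is an assumption, and `grade` is a hypothetical name):

```python
EXPECTED = "Status: Resolved"


def grade(answer: str, expected: str = EXPECTED) -> bool:
    """Return True when the answer matches the expected file content.

    Assumption: leading and trailing whitespace (such as a final
    newline) is ignored; everything else must match exactly.
    """
    return answer.strip() == expected.strip()


print(grade("Status: Resolved\n"))
print(grade("Status: Failed"))
```

Under this assumption a trailing newline would not change the verdict, while choosing the other branch's value would fail regardless of whitespace.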
deepseek/deepseek-v4-flash:high PASS 100% 79 in → 529 out (564 reasoning)
Status: Resolved
deepseek/deepseek-v4-flash:high__json_schema PASS 100% 79 in → 197 out (197 reasoning)
Status: Resolved
JSON Schema Structured Output
(raw) { "resolved_content": "Status: Resolved" }
deepseek/deepseek-v4-flash:none PASS 100% 79 in → 5 out (0 reasoning)
Status: Resolved
mistralai/devstral-2512 PASS 100% 79 in → 5 out
Status: Resolved
mistralai/devstral-2512__json_schema PASS 100% 79 in → 12 out
Status: Resolved
JSON Schema Structured Output
(raw) {"resolved_content": "Status: Resolved"}
nvidia/nemotron-3-nano-30b-a3b:high PASS 100% 92 in → 701 out (785 reasoning)
Status: Resolved
nvidia/nemotron-3-nano-30b-a3b:high__json_schema PASS 100% 92 in → 1,054 out (1,159 reasoning)
Status: Resolved
JSON Schema Structured Output
(raw) { "resolved_content": "Status: Resolved" }
nvidia/nemotron-3-nano-30b-a3b:none PASS 100% 92 in → 5 out (0 reasoning)
Status: Resolved
nvidia/nemotron-3-nano-30b-a3b:none__json_schema PASS 100% 92 in → 15 out (0 reasoning)
Status: Resolved
JSON Schema Structured Output
(raw) { "resolved_content": "Status: Resolved" }
poolside/laguna-xs.2:high PASS 100% 121 in → 884 out (879 reasoning)
Status: Resolved
poolside/laguna-xs.2:high__json_schema PASS 100% 121 in → 382 out (372 reasoning)
Status: Resolved
JSON Schema Structured Output
(raw) {"resolved_content": "Status: Resolved"}
deepseek/deepseek-v4-flash:none__json_schema FAIL 0% 79 in → 14 out (0 reasoning)
Status: Failed
JSON Schema Structured Output
(raw) { "resolved_content": "Status: Failed\n" }
Failure: Expected 'Status: Resolved', got 'Status: Failed '
poolside/laguna-xs.2:none FAIL 0% 121 in → 5 out (0 reasoning)
Status: Failed
Failure: Expected 'Status: Resolved', got ' Status: Failed '
poolside/laguna-xs.2:none__json_schema FAIL 0% 121 in → 10 out (0 reasoning)
Status: Failed
JSON Schema Structured Output
(raw) {"resolved_content": "Status: Failed"}
Failure: Expected 'Status: Resolved', got 'Status: Failed'
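For the `__json_schema` variants, each raw output is a JSON object with a single `resolved_content` field. A minimal sketch of pulling the file content out of such an output, using raw strings copied from the entries above; the function name `extract_resolved_content` is hypothetical, and the schema is inferred from those raw outputs rather than taken from a published definition:

```python
import json


def extract_resolved_content(raw: str) -> str:
    """Parse a structured-output response and return the file content.

    Assumes the response is a JSON object with a string field named
    'resolved_content', as seen in the raw outputs listed above.
    """
    payload = json.loads(raw)
    content = payload["resolved_content"]
    if not isinstance(content, str):
        raise TypeError("resolved_content must be a string")
    return content


# Raw outputs copied from two of the runs above.
passing = extract_resolved_content('{"resolved_content": "Status: Resolved"}')
failing = extract_resolved_content('{ "resolved_content": "Status: Failed\\n" }')
print(repr(passing))
print(repr(failing))
```

The second example keeps the trailing newline that the model emitted inside the JSON string.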