GitBench
Single-line greeting conflict
Tests the ability to resolve a simple single-line merge conflict in a greeting file, and evaluates basic comprehension of conflict markers.

These commands set up the repo before the model sees the prompt. They define the starting file structure, staged changes, and Git history.

  1. git init
  2. git config user.email 'test@test.com'
  3. git config user.name 'Test User'
  4. echo 'Hello, World!' > greeting.txt
  5. git add greeting.txt
  6. git commit -m 'Initial greeting'
  7. git checkout -b feature
  8. echo 'Hello, World!!!' > greeting.txt
  9. git add greeting.txt
  10. git commit -m 'Add excitement'
  11. git checkout main
  12. echo 'Hello, Planet!' > greeting.txt
  13. git add greeting.txt
  14. git commit -m 'Update greeting'
  15. git merge feature
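The setup steps above can be replayed locally to see the conflict the model is asked to resolve. The following is a minimal sketch, not the benchmark's own harness: the temporary directory, the `-q` flags, and the `git symbolic-ref` line are assumptions added here so the script runs quietly and so that `git checkout main` works even when the local default branch is not named `main`.

```shell
#!/bin/sh
# Rebuild the benchmark repository in a throwaway directory (assumption:
# the real harness may use a different location).
repo=$(mktemp -d)
cd "$repo" || exit 1

git init -q
# Assumption: point the unborn HEAD at 'main' so the later
# 'git checkout main' works regardless of init.defaultBranch.
git symbolic-ref HEAD refs/heads/main
git config user.email 'test@test.com'
git config user.name 'Test User'

# Base version, committed on main.
echo 'Hello, World!' > greeting.txt
git add greeting.txt
git commit -q -m 'Initial greeting'

# "Branch A" in the prompt: feature adds exclamation marks.
git checkout -q -b feature
echo 'Hello, World!!!' > greeting.txt
git add greeting.txt
git commit -q -m 'Add excitement'

# "Branch B" in the prompt: main changes the wording.
git checkout -q main
echo 'Hello, Planet!' > greeting.txt
git add greeting.txt
git commit -q -m 'Update greeting'

# The merge is expected to stop with a conflict, so a non-zero exit is normal.
git merge feature >/dev/null 2>&1 || echo 'merge stopped with a conflict'

# Print the conflicted file, including the conflict markers.
cat greeting.txt
```

After the merge stops, `greeting.txt` should contain both candidate lines between `<<<<<<<`, `=======`, and `>>>>>>>` markers. The exact marker layout depends on the local `merge.conflictStyle` setting.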
Prompt
Resolve the merge conflict in greeting.txt. The base version was 'Hello, World!'. Branch A changed it to 'Hello, World!!!'. Branch B changed it to 'Hello, Planet!'. Provide ONLY the resolved file content, nothing else.
Expected
Hello, Planet!!!
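The expected answer keeps both edits: the wording change from one branch and the punctuation change from the other. As an illustration only, each branch's change can be written as a substitution against the base line and the two applied together. The `sed` expressions below are an assumption made for this sketch, not part of the benchmark.

```shell
#!/bin/sh
base='Hello, World!'

# One branch replaces 'World' with 'Planet'; the other turns the
# trailing '!' into '!!!'. Applying both edits to the base line
# gives the combined resolution.
resolved=$(printf '%s\n' "$base" | sed -e 's/World/Planet/' -e 's/!$/!!!/')

printf '%s\n' "$resolved"   # prints: Hello, Planet!!!
```

The two edits touch different parts of the line, so they can be combined without either one overriding the other. Answers that keep only one branch's line are marked as failures.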
deepseek/deepseek-v4-flash:high__json_schema PASS 100% 84 in → 414 out (443 reasoning)
Hello, Planet!!!
JSON Schema Structured Output
(raw) { "resolved_content": "Hello, Planet!!!" }
mistralai/devstral-2512 PASS 100% 85 in → 5 out
Hello, Planet!!!
mistralai/devstral-2512__json_schema PASS 100% 85 in → 13 out
Hello, Planet!!!
JSON Schema Structured Output
(raw) {"resolved_content": "Hello, Planet!!!"}
poolside/laguna-xs.2:high PASS 100% 130 in → 284 out (277 reasoning)
Hello, Planet!!!
poolside/laguna-xs.2:high__json_schema PASS 100% 130 in → 625 out (608 reasoning)
Hello, Planet!!!
JSON Schema Structured Output
(raw) { "resolved_content": "Hello, Planet!!!" }
deepseek/deepseek-v4-flash:high FAIL 0% 84 in → 1,235 out (1,354 reasoning)
Hello, Planet!
Failure: Expected 'Hello, Planet!!!', got 'Hello, Planet!'
deepseek/deepseek-v4-flash:none FAIL 0% 84 in → 4 out (0 reasoning)
Hello, World!!!
Failure: Expected 'Hello, Planet!!!', got 'Hello, World!!!'
deepseek/deepseek-v4-flash:none__json_schema FAIL 0% 84 in → 16 out (0 reasoning)
Hello, World!!!
JSON Schema Structured Output
(raw) { "resolved_content": "Hello, World!!!" }
Failure: Expected 'Hello, Planet!!!', got 'Hello, World!!!'
nvidia/nemotron-3-nano-30b-a3b:high FAIL 0% 98 in → 2,074 out (2,356 reasoning)
Hello, World!!!
Failure: Expected 'Hello, Planet!!!', got 'Hello, World!!!'
nvidia/nemotron-3-nano-30b-a3b:high__json_schema FAIL 0% 98 in → 1,445 out (1,661 reasoning)
Hello, World!!!
JSON Schema Structured Output
(raw) { "resolved_content": "Hello, World!!!" }
Failure: Expected 'Hello, Planet!!!', got 'Hello, World!!!'
nvidia/nemotron-3-nano-30b-a3b:none FAIL 0% 98 in → 5 out (0 reasoning)
Hello, World!!!
Failure: Expected 'Hello, Planet!!!', got 'Hello, World!!!'
nvidia/nemotron-3-nano-30b-a3b:none__json_schema FAIL 0% 98 in → 15 out (0 reasoning)
Hello, World!!!
JSON Schema Structured Output
(raw) { "resolved_content": "Hello, World!!!" }
Failure: Expected 'Hello, Planet!!!', got 'Hello, World!!!'
poolside/laguna-xs.2:none FAIL 0% 130 in → 6 out (0 reasoning)
Hello, World!!!
Failure: Expected 'Hello, Planet!!!', got ' Hello, World!!! '
poolside/laguna-xs.2:none__json_schema FAIL 0% 130 in → 15 out (0 reasoning)
Hello, World!!!
JSON Schema Structured Output
(raw) { "resolved_content": "Hello, World!!!" }
Failure: Expected 'Hello, Planet!!!', got 'Hello, World!!!'
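The failure messages above suggest that answers are graded by comparing the returned content with the expected string. A minimal sketch of such a check follows, assuming an exact string comparison. The function name `check_resolution` is hypothetical, and the real grader may normalize whitespace or handle structured output differently.

```shell
#!/bin/sh
# Hypothetical helper: compare a model's answer with the expected content.
# Assumption: exact string equality, with no trimming or normalization.
check_resolution() {
  expected=$1
  actual=$2
  if [ "$actual" = "$expected" ]; then
    echo 'PASS'
  else
    echo "Failure: Expected '$expected', got '$actual'"
    return 1
  fi
}

# A combined resolution passes.
check_resolution 'Hello, Planet!!!' 'Hello, Planet!!!'

# Keeping only one branch's line fails; '|| true' keeps the script going.
check_resolution 'Hello, Planet!!!' 'Hello, World!!!' || true
```

Under this assumption, an answer with surrounding whitespace would also fail, because the comparison is on the raw string.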