GitBench
Multi-file: conflicts in both main.py and utils.py
Tests the ability to resolve simultaneous merge conflicts in two files, where each file calls for a different resolution strategy.

These commands set up the repo before the model sees the prompt. They define the starting file structure, staged changes, and Git history.

  1. git init
  2. git config user.email 'test@test.com'
  3. git config user.name 'Test User'
  4. printf 'def main(): print("Running")' > main.py
  5. printf 'DEBUG = True PORT = 8000' > utils.py
  6. git add main.py utils.py
  7. git commit -m 'Initial code'
  8. git checkout -b feature
  9. printf 'def main(): print("Running v2")' > main.py
  10. printf 'DEBUG = False PORT = 8000' > utils.py
  11. git add main.py utils.py
  12. git commit -m 'Feature: update main and disable debug'
  13. git checkout main
  14. printf 'def main(): print("Running enterprise")' > main.py
  15. printf 'DEBUG = True PORT = 9000' > utils.py
  16. git add main.py utils.py
  17. git commit -m 'Enterprise: update main and port'
  18. git merge feature
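The setup above can be replayed locally to confirm that the final merge leaves both files conflicted. The following is a minimal sketch, not part of GitBench itself: the helper names `git`, `commit_files`, and `build_conflicted_repo` are hypothetical, and the file contents are written exactly as displayed above, which may have lost line breaks when the page was rendered. It also assumes a `git` executable is on the PATH, and it pins the initial branch name to `main` because the setup later runs `git checkout main`.

```python
import subprocess
import tempfile
from pathlib import Path


def git(repo: Path, *args: str, check: bool = True) -> subprocess.CompletedProcess:
    """Run one git command inside repo and capture its output."""
    return subprocess.run(
        ["git", *args], cwd=repo, check=check, capture_output=True, text=True
    )


def commit_files(repo: Path, files: dict[str, str], message: str) -> None:
    """Overwrite the given files, stage them, and commit."""
    for name, content in files.items():
        (repo / name).write_text(content)
    git(repo, "add", *files)
    git(repo, "commit", "-m", message)


def build_conflicted_repo(repo: Path) -> list[str]:
    """Replay the setup steps and return the files left unmerged."""
    git(repo, "init")
    # Assumption: point the unborn HEAD at "main" so that the later
    # "git checkout main" works whatever init.defaultBranch is set to.
    git(repo, "symbolic-ref", "HEAD", "refs/heads/main")
    git(repo, "config", "user.email", "test@test.com")
    git(repo, "config", "user.name", "Test User")
    # Assumption: disable commit signing in case the global config enables it.
    git(repo, "config", "commit.gpgsign", "false")

    commit_files(
        repo,
        {"main.py": 'def main(): print("Running")', "utils.py": "DEBUG = True PORT = 8000"},
        "Initial code",
    )
    git(repo, "checkout", "-b", "feature")
    commit_files(
        repo,
        {"main.py": 'def main(): print("Running v2")', "utils.py": "DEBUG = False PORT = 8000"},
        "Feature: update main and disable debug",
    )
    git(repo, "checkout", "main")
    commit_files(
        repo,
        {"main.py": 'def main(): print("Running enterprise")', "utils.py": "DEBUG = True PORT = 9000"},
        "Enterprise: update main and port",
    )
    # The merge is expected to stop on conflicts, so a non-zero exit is fine.
    git(repo, "merge", "feature", check=False)
    unmerged = git(repo, "diff", "--name-only", "--diff-filter=U").stdout
    return sorted(unmerged.split())


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(build_conflicted_repo(Path(tmp)))
```

Both branches rewrite the same line of each file after the common ancestor, so the merge should report `main.py` and `utils.py` as unmerged.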
Prompt
Resolve the merge conflicts in main.py and utils.py. For main.py: branch A says 'Running v2', branch B says 'Running enterprise'. For utils.py: branch A set DEBUG=False with PORT=8000, branch B kept DEBUG=True with PORT=9000. Resolve both conflicts preferring the more descriptive version for main.py and the safer DEBUG=True for utils.py. Provide ONLY the two resolved file contents, separated by a blank line, with each file prefixed by its filename.
Expected
main.py: def main(): print("Running enterprise") utils.py: DEBUG = True PORT = 9000
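The expected answer keeps the main-branch ("enterprise") side of both files, which matches the side git labels `HEAD` in its conflict markers. As an illustration only, the sketch below resolves a conflicted file by keeping one side of every conflict block. The function name `resolve_conflicts` and the sample text are hypothetical; the sample assumes git's default two-way marker layout and assumes the two settings in utils.py sit on separate lines. The `diff3` layout, which adds a `|||||||` section, is not handled.

```python
def resolve_conflicts(text: str, side: str = "ours") -> str:
    """Keep one side of every conflict block in a conflicted file.

    "ours" is the section between <<<<<<< and =======,
    "theirs" is the section between ======= and >>>>>>>.
    """
    if side not in ("ours", "theirs"):
        raise ValueError("side must be 'ours' or 'theirs'")

    resolved = []
    state = None  # None outside a conflict block, else "ours" or "theirs"
    for line in text.splitlines():
        if line.startswith("<<<<<<<"):
            state = "ours"
        elif line.startswith("=======") and state == "ours":
            state = "theirs"
        elif line.startswith(">>>>>>>") and state == "theirs":
            state = None
        elif state is None or state == side:
            # Keep lines outside conflicts and lines from the chosen side.
            resolved.append(line)
    return "\n".join(resolved)


# Hypothetical conflicted utils.py as git might write it after the merge.
conflicted_utils = "\n".join([
    "<<<<<<< HEAD",
    "DEBUG = True",
    "PORT = 9000",
    "=======",
    "DEBUG = False",
    "PORT = 8000",
    ">>>>>>> feature",
])

print(resolve_conflicts(conflicted_utils, side="ours"))
```

Choosing a whole side per file is the simplest strategy that reproduces the expected output here; a resolution that mixed settings from both branches would need line-level logic instead.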
deepseek/deepseek-v4-flash:high FAIL 0% 178 in → 593 out (597 reasoning)
main.py def main(): print("Running enterprise") utils.py DEBUG = True PORT = 9000
Failure: Expected 'main.py: def main(): print("Running enterprise") utils.py: DEBUG = True PORT = 9000 ', got 'main.py def main(): print("Running enterprise") utils.py DEBUG = True PORT = 9000'
deepseek/deepseek-v4-flash:high__json_schema FAIL 0% 178 in → 467 out (480 reasoning)
main.py def main(): print("Running enterprise") utils.py DEBUG = True PORT = 9000
JSON Schema Structured Output
(raw) { "resolved_content": "main.py\ndef main():\n print(\"Running enterprise\")\n\nutils.py\nDEBUG = True\nPORT = 9000" }
Failure: Expected 'main.py: def main(): print("Running enterprise") utils.py: DEBUG = True PORT = 9000 ', got 'main.py def main(): print("Running enterprise") utils.py DEBUG = True PORT = 9000'
deepseek/deepseek-v4-flash:none FAIL 0% 178 in → 24 out (0 reasoning)
main.py def main(): print("Running enterprise") utils.py DEBUG = True PORT = 8000
Failure: Expected 'main.py: def main(): print("Running enterprise") utils.py: DEBUG = True PORT = 9000 ', got 'main.py def main(): print("Running enterprise") utils.py DEBUG = True PORT = 8000'
deepseek/deepseek-v4-flash:none__json_schema FAIL 0% 178 in → 43 out (0 reasoning)
main.py def main(): print("Running v2") utils.py DEBUG = True PORT = 8000
JSON Schema Structured Output
(raw) { "resolved_content": "main.py\ndef main():\n print(\"Running v2\")\n\nutils.py\nDEBUG = True\nPORT = 8000" }
Failure: Expected 'main.py: def main(): print("Running enterprise") utils.py: DEBUG = True PORT = 9000 ', got 'main.py def main(): print("Running v2") utils.py DEBUG = True PORT = 8000'
mistralai/devstral-2512 FAIL 0% 189 in → 45 out
Here are the resolved file contents: main.py ```python def main(): print("Running enterprise v2") ``` utils.py ```python DEBUG = True PORT = 9000 ```
Failure: Expected 'main.py: def main(): print("Running enterprise") utils.py: DEBUG = True PORT = 9000 ', got 'Here are the resolved file contents: main.py ```python def main(): print("Running enterprise v2") ``` utils.py ```python DEBUG = True PORT = 9000 ```'
mistralai/devstral-2512__json_schema FAIL 0% 189 in → 43 out
--- main.py def main(): print("Running enterprise") --- utils.py DEBUG = True PORT = 9000
JSON Schema Structured Output
(raw) {"resolved_content": "--- main.py\ndef main():\n print(\"Running enterprise\")\n\n--- utils.py\nDEBUG = True\nPORT = 9000"}
Failure: Expected 'main.py: def main(): print("Running enterprise") utils.py: DEBUG = True PORT = 9000 ', got '--- main.py def main(): print("Running enterprise") --- utils.py DEBUG = True PORT = 9000'
nvidia/nemotron-3-nano-30b-a3b:high FAIL 0% 202 in → 282 out (304 reasoning)
--- main.py def main(): print("Running enterprise") --- utils.py DEBUG = True PORT = 9000
Failure: Expected 'main.py: def main(): print("Running enterprise") utils.py: DEBUG = True PORT = 9000 ', got '--- main.py def main(): print("Running enterprise") --- utils.py DEBUG = True PORT = 9000'
nvidia/nemotron-3-nano-30b-a3b:high__json_schema FAIL 0% 202 in → 428 out (433 reasoning)
main.py def main(): print("Running enterprise") utils.py DEBUG = True PORT = 9000
JSON Schema Structured Output
(raw) { "resolved_content": "main.py\ndef main():\n print(\"Running enterprise\")\n\nutils.py\nDEBUG = True\nPORT = 9000" }
Failure: Expected 'main.py: def main(): print("Running enterprise") utils.py: DEBUG = True PORT = 9000 ', got 'main.py def main(): print("Running enterprise") utils.py DEBUG = True PORT = 9000'
nvidia/nemotron-3-nano-30b-a3b:none FAIL 0% 202 in → 29 out (0 reasoning)
--- main.py def main(): print("Running enterprise") --- utils.py DEBUG = True PORT = 9000
Failure: Expected 'main.py: def main(): print("Running enterprise") utils.py: DEBUG = True PORT = 9000 ', got '--- main.py def main(): print("Running enterprise") --- utils.py DEBUG = True PORT = 9000'
nvidia/nemotron-3-nano-30b-a3b:none__json_schema FAIL 0% 202 in → 46 out (0 reasoning)
--- main.py def main(): print("Running enterprise") --- utils.py DEBUG = True PORT = 9000
JSON Schema Structured Output
(raw) { "resolved_content": "--- main.py\ndef main():\n print(\"Running enterprise\")\n\n--- utils.py\nDEBUG = True\nPORT = 9000" }
Failure: Expected 'main.py: def main(): print("Running enterprise") utils.py: DEBUG = True PORT = 9000 ', got '--- main.py def main(): print("Running enterprise") --- utils.py DEBUG = True PORT = 9000'
poolside/laguna-xs.2:high FAIL 0% 228 in → 115 out (84 reasoning)
main.py def main(): print("Running enterprise") utils.py DEBUG = True PORT = 9000
Failure: Expected 'main.py: def main(): print("Running enterprise") utils.py: DEBUG = True PORT = 9000 ', got ' main.py def main(): print("Running enterprise") utils.py DEBUG = True PORT = 9000 '
poolside/laguna-xs.2:high__json_schema FAIL 0% 228 in → 316 out (275 reasoning)
main.py def main(): print("Running enterprise") utils.py DEBUG = True PORT = 9000
JSON Schema Structured Output
(raw) {"resolved_content": "main.py\ndef main():\n print(\"Running enterprise\")\n\n\nutils.py\nDEBUG = True\nPORT = 9000"}
Failure: Expected 'main.py: def main(): print("Running enterprise") utils.py: DEBUG = True PORT = 9000 ', got 'main.py def main(): print("Running enterprise") utils.py DEBUG = True PORT = 9000'
poolside/laguna-xs.2:none FAIL 0% 228 in → 30 out (0 reasoning)
main.py def main(): print("Running enterprise") utils.py DEBUG = True PORT = 9000
Failure: Expected 'main.py: def main(): print("Running enterprise") utils.py: DEBUG = True PORT = 9000 ', got ' main.py def main(): print("Running enterprise") utils.py DEBUG = True PORT = 9000 '
poolside/laguna-xs.2:none__json_schema FAIL 0% 228 in → 49 out (0 reasoning)
main.py ```python def main(): print("Running enterprise") ``` utils.py ```python DEBUG = True PORT = 9000 ```
JSON Schema Structured Output
(raw) {"resolved_content": "main.py\n```python\ndef main():\n print(\"Running enterprise\")\n```\n\nutils.py\n```python\nDEBUG = True\nPORT = 9000\n```"}
Failure: Expected 'main.py: def main(): print("Running enterprise") utils.py: DEBUG = True PORT = 9000 ', got 'main.py ```python def main(): print("Running enterprise") ``` utils.py ```python DEBUG = True PORT = 9000 ```'
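Judging by the failure messages above, the comparison appears to be an exact match against whitespace-collapsed text in which each filename is followed by a colon. Many answers differ from the expected string only in the filename header (`main.py`, `--- main.py`) or in Markdown code fences. The sketch below shows one way a more lenient comparison could be written. It is a hypothetical normalizer, not GitBench's actual grader, and the names `FILENAMES`, `normalize`, and `matches` are invented for illustration.

```python
import re

FILENAMES = ("main.py", "utils.py")
FENCE = "`" * 3  # Markdown code-fence marker


def normalize(answer: str) -> str:
    """Reduce an answer to a canonical, whitespace-collapsed form."""
    # Drop fence markers, with or without a language tag such as "python".
    text = re.sub(re.escape(FENCE) + r"[A-Za-z]*", " ", answer)
    for name in FILENAMES:
        # Treat "--- main.py", "main.py:" and "main.py" as the same header.
        text = re.sub(r"(?:---\s*)?" + re.escape(name) + r":?", name, text)
    return " ".join(text.split())


def matches(expected: str, got: str) -> bool:
    """Compare two answers after normalization."""
    return normalize(expected) == normalize(got)


expected = 'main.py: def main(): print("Running enterprise") utils.py: DEBUG = True PORT = 9000 '
no_colon = 'main.py def main(): print("Running enterprise") utils.py DEBUG = True PORT = 9000'
wrong_port = 'main.py def main(): print("Running enterprise") utils.py DEBUG = True PORT = 8000'

print(matches(expected, no_colon))    # True: only the header format differs
print(matches(expected, wrong_port))  # False: the resolved content differs
```

Under this normalization, answers that differ only in presentation would pass, while answers with different file contents (for example `PORT = 8000`, `"Running v2"`, or `"Running enterprise v2"`) would still fail.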