GitBench
Multi-file: conflicts in both main.py and utils.py
Tests the ability to resolve simultaneous conflicts in two files, with a separate resolution rule for each file. Evaluates multi-file conflict juggling.

These commands set up the repo before the model sees the prompt. They define the starting file structure, staged changes, and Git history.

  1. git init
  2. git config user.email 'test@test.com'
  3. git config user.name 'Test User'
  4. printf 'def main(): print("Running")' > main.py
  5. printf 'DEBUG = True PORT = 8000' > utils.py
  6. git add main.py utils.py
  7. git commit -m 'Initial code'
  8. git checkout -b feature
  9. printf 'def main(): print("Running v2")' > main.py
  10. printf 'DEBUG = False PORT = 8000' > utils.py
  11. git add main.py utils.py
  12. git commit -m 'Feature: update main and disable debug'
  13. git checkout main
  14. printf 'def main(): print("Running enterprise")' > main.py
  15. printf 'DEBUG = True PORT = 9000' > utils.py
  16. git add main.py utils.py
  17. git commit -m 'Enterprise: update main and port'
  18. git cherry-pick $(git rev-parse feature)
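To see why the final cherry-pick stops with conflicts in both files, the three versions involved can be compared directly. The following is a minimal sketch, not how Git itself works: Git merges at the hunk level, while this check works on whole files, which is enough here because each file contains a single changed region. The names `BASE`, `OURS`, `THEIRS`, and `conflicting_files` are illustrative, and file contents are copied as they appear in the setup commands.

```python
# File contents at the three commits that take part in the cherry-pick.
BASE = {  # 'Initial code', the parent of the picked commit
    "main.py": 'def main(): print("Running")',
    "utils.py": "DEBUG = True PORT = 8000",
}
THEIRS = {  # the feature commit being cherry-picked
    "main.py": 'def main(): print("Running v2")',
    "utils.py": "DEBUG = False PORT = 8000",
}
OURS = {  # the 'Enterprise' commit at the tip of main
    "main.py": 'def main(): print("Running enterprise")',
    "utils.py": "DEBUG = True PORT = 9000",
}


def conflicting_files(base, ours, theirs):
    """Return the files that both sides changed in different ways."""
    conflicts = []
    for name in sorted(base):
        ours_changed = ours[name] != base[name]
        theirs_changed = theirs[name] != base[name]
        if ours_changed and theirs_changed and ours[name] != theirs[name]:
            conflicts.append(name)
    return conflicts


print(conflicting_files(BASE, OURS, THEIRS))  # ['main.py', 'utils.py']
```

Both files were edited on `feature` and on `main` after the shared initial commit, so neither side can be applied automatically.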
Prompt
Resolve the cherry-pick conflicts in main.py and utils.py. For main.py: the incoming commit says 'Running v2', the current branch (main) says 'Running enterprise'. For utils.py: the incoming commit set DEBUG=False with PORT=8000, the current branch kept DEBUG=True with PORT=9000. Resolve both conflicts preferring the more descriptive version for main.py and the safer DEBUG=True for utils.py. Provide ONLY the two resolved file contents, separated by a blank line, with each file prefixed by its filename.
Expected
main.py: def main(): print("Running enterprise") utils.py: DEBUG = True PORT = 9000
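The expected string labels each file as `filename:` with a trailing colon, a detail the prompt does not spell out. As a rough sketch, the answer could be rendered and compared as below. This assumes the grader collapses whitespace before comparing, which the flattened strings on this page suggest but do not confirm. The helpers `render` and `normalize` are hypothetical, and the line breaks inside `RESOLVED` follow the raw model outputs rather than the setup commands.

```python
EXPECTED = 'main.py: def main(): print("Running enterprise") utils.py: DEBUG = True PORT = 9000'

# Resolved contents: the current branch (main) wins in both files.
RESOLVED = {
    "main.py": 'def main():\n    print("Running enterprise")',
    "utils.py": "DEBUG = True\nPORT = 9000",
}


def render(files, label_suffix=":"):
    """Prefix each file with its name and separate files by a blank line."""
    return "\n\n".join(
        f"{name}{label_suffix}\n{content}" for name, content in files.items()
    )


def normalize(text):
    """Collapse every run of whitespace to one space (assumed grader behaviour)."""
    return " ".join(text.split())


print(normalize(render(RESOLVED)) == EXPECTED)                   # True
print(normalize(render(RESOLVED, label_suffix="")) == EXPECTED)  # False
```

Dropping only the colon after the filename is enough to make the comparison fail, even though the resolved file contents are identical.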
deepseek/deepseek-v4-flash:high FAIL 0% 211 in → 263 out (237 reasoning)
main.py def main(): print("Running enterprise") utils.py DEBUG = True PORT = 9000
Failure: Expected 'main.py: def main(): print("Running enterprise") utils.py: DEBUG = True PORT = 9000 ', got 'main.py def main(): print("Running enterprise") utils.py DEBUG = True PORT = 9000'
deepseek/deepseek-v4-flash:high__json_schema FAIL 0% 211 in → 368 out (327 reasoning)
main.py def main(): print("Running enterprise") utils.py DEBUG = True PORT = 9000
JSON Schema Structured Output
(raw) { "resolved_content": "main.py\ndef main():\n print(\"Running enterprise\")\n\nutils.py\nDEBUG = True\nPORT = 9000" }
Failure: Expected 'main.py: def main(): print("Running enterprise") utils.py: DEBUG = True PORT = 9000 ', got 'main.py def main(): print("Running enterprise") utils.py DEBUG = True PORT = 9000'
deepseek/deepseek-v4-flash:none FAIL 0% 211 in → 25 out (0 reasoning)
main.py def main(): print("Running enterprise") utils.py DEBUG = True PORT = 9000
Failure: Expected 'main.py: def main(): print("Running enterprise") utils.py: DEBUG = True PORT = 9000 ', got 'main.py def main(): print("Running enterprise") utils.py DEBUG = True PORT = 9000'
deepseek/deepseek-v4-flash:none__json_schema FAIL 0% 213 in → 20 out (0 reasoning)
DEBUG = True PORT = 8000
JSON Schema Structured Output
(raw) { "resolved_content": "DEBUG = True\nPORT = 8000" }
Failure: Expected 'main.py: def main(): print("Running enterprise") utils.py: DEBUG = True PORT = 9000 ', got 'DEBUG = True PORT = 8000'
mistralai/devstral-2512 FAIL 0% 224 in → 29 out
--- main.py def main(): print("Running enterprise") --- utils.py DEBUG = True PORT = 9000
Failure: Expected 'main.py: def main(): print("Running enterprise") utils.py: DEBUG = True PORT = 9000 ', got '--- main.py def main(): print("Running enterprise") --- utils.py DEBUG = True PORT = 9000'
mistralai/devstral-2512__json_schema FAIL 0% 228 in → 43 out
--- main.py def main(): print("Running enterprise") --- utils.py DEBUG = True PORT = 9000
JSON Schema Structured Output
(raw) {"resolved_content": "--- main.py\ndef main():\n print(\"Running enterprise\")\n\n--- utils.py\nDEBUG = True\nPORT = 9000"}
Failure: Expected 'main.py: def main(): print("Running enterprise") utils.py: DEBUG = True PORT = 9000 ', got '--- main.py def main(): print("Running enterprise") --- utils.py DEBUG = True PORT = 9000'
nvidia/nemotron-3-nano-30b-a3b:high FAIL 0% 241 in → 610 out (708 reasoning)
main.py def main(): print("Running enterprise") utils.py DEBUG = True PORT = 9000
Failure: Expected 'main.py: def main(): print("Running enterprise") utils.py: DEBUG = True PORT = 9000 ', got 'main.py def main(): print("Running enterprise") utils.py DEBUG = True PORT = 9000'
nvidia/nemotron-3-nano-30b-a3b:high__json_schema FAIL 0% 241 in → 458 out (485 reasoning)
main.py def main(): print("Running enterprise") utils.py DEBUG = True PORT = 9000
JSON Schema Structured Output
(raw) { "resolved_content": "main.py\ndef main():\n print(\"Running enterprise\")\n\n\nutils.py\nDEBUG = True\nPORT = 9000" }
Failure: Expected 'main.py: def main(): print("Running enterprise") utils.py: DEBUG = True PORT = 9000 ', got 'main.py def main(): print("Running enterprise") utils.py DEBUG = True PORT = 9000'
nvidia/nemotron-3-nano-30b-a3b:none FAIL 0% 239 in → 29 out (0 reasoning)
--- utilities.py DEBUG = True PORT = 9000 main.py def main(): print("Running enterprise")
Failure: Expected 'main.py: def main(): print("Running enterprise") utils.py: DEBUG = True PORT = 9000 ', got '--- utilities.py DEBUG = True PORT = 9000 main.py def main(): print("Running enterprise")'
nvidia/nemotron-3-nano-30b-a3b:none__json_schema FAIL 0% 237 in → 33 out (0 reasoning)
def main(): print("Running enterprise") <no module>
JSON Schema Structured Output
(raw) { "resolved_content": "def main():\n print(\"Running enterprise\")\n\n<no module>\n\n" }
Failure: Expected 'main.py: def main(): print("Running enterprise") utils.py: DEBUG = True PORT = 9000 ', got 'def main(): print("Running enterprise") <no module> '
poolside/laguna-xs.2:high FAIL 0% 266 in → 407 out (368 reasoning)
main.py ``` def main(): print("Running enterprise") ``` utils.py ``` DEBUG = True PORT = 9000 ```
Failure: Expected 'main.py: def main(): print("Running enterprise") utils.py: DEBUG = True PORT = 9000 ', got ' main.py ``` def main(): print("Running enterprise") ``` utils.py ``` DEBUG = True PORT = 9000 ``` '
poolside/laguna-xs.2:high__json_schema FAIL 0% 268 in → 357 out (316 reasoning)
main.py def main(): print("Running enterprise") utils.py DEBUG = True PORT = 9000
JSON Schema Structured Output
(raw) {"resolved_content": "main.py\ndef main():\n print(\"Running enterprise\")\n\n\nutils.py\nDEBUG = True\nPORT = 9000"}
Failure: Expected 'main.py: def main(): print("Running enterprise") utils.py: DEBUG = True PORT = 9000 ', got 'main.py def main(): print("Running enterprise") utils.py DEBUG = True PORT = 9000'
poolside/laguna-xs.2:none FAIL 0% 266 in → 30 out (0 reasoning)
main.py def main(): print("Running enterprise") utils.py DEBUG = True PORT = 9000
Failure: Expected 'main.py: def main(): print("Running enterprise") utils.py: DEBUG = True PORT = 9000 ', got ' main.py def main(): print("Running enterprise") utils.py DEBUG = True PORT = 9000 '
poolside/laguna-xs.2:none__json_schema FAIL 0% 264 in → 40 out (0 reasoning)
main.py def main(): print("Running enterprise") utils.py DEBUG = True PORT = 9000
JSON Schema Structured Output
(raw) {"resolved_content": "main.py\ndef main():\n print(\"Running enterprise\")\n\n\nutils.py\nDEBUG = True\nPORT = 9000"}
Failure: Expected 'main.py: def main(): print("Running enterprise") utils.py: DEBUG = True PORT = 9000 ', got 'main.py def main(): print("Running enterprise") utils.py DEBUG = True PORT = 9000'
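Most of the failures above differ from the expected string only in how the filenames are decorated (no colon, a `---` prefix, or code fences), while a few return the wrong content. A small sketch for separating the two cases follows. The `canonical` and `classify` helpers are illustrative and not part of GitBench, and the set of decorations stripped is an assumption based only on the outputs recorded here.

```python
import re

EXPECTED = 'main.py: def main(): print("Running enterprise") utils.py: DEBUG = True PORT = 9000'


def squash(text):
    """Collapse every run of whitespace to a single space."""
    return " ".join(text.split())


def canonical(text):
    """Strip decoration that does not change the resolved file contents."""
    text = text.replace("```", " ").replace("---", " ")
    text = re.sub(r"(\w+\.py):", r"\1", text)  # 'main.py:' -> 'main.py'
    return squash(text)


def classify(got):
    """Label an output as a pass, a format-only mismatch, or a content mismatch."""
    if squash(got) == squash(EXPECTED):
        return "pass"
    if canonical(got) == canonical(EXPECTED):
        return "format-only mismatch"
    return "content mismatch"


samples = [
    'main.py def main(): print("Running enterprise") utils.py DEBUG = True PORT = 9000',
    '--- main.py def main(): print("Running enterprise") --- utils.py DEBUG = True PORT = 9000',
    "DEBUG = True PORT = 8000",
]
for got in samples:
    print(classify(got))
```

Under these assumptions the first two samples are format-only mismatches and the third is a content mismatch, which suggests the 0% scores mostly reflect an output-format disagreement rather than wrong conflict resolutions.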