GitBench
Multi-file: conflicts in both main.py and utils.py
Tests the ability to resolve simultaneous rebase conflicts in two files, each with its own resolution criterion. Evaluates multi-file rebase handling.

These commands set up the repo before the model sees the prompt. They define the starting file structure, staged changes, and Git history.

  1. git init
  2. git config user.email 'test@test.com'
  3. git config user.name 'Test User'
  4. printf 'def main(): print("Running")' > main.py
  5. printf 'DEBUG = True PORT = 8000' > utils.py
  6. git add main.py utils.py
  7. git commit -m 'Initial code'
  8. git checkout -b feature
  9. printf 'def main(): print("Running v2")' > main.py
  10. printf 'DEBUG = False PORT = 8000' > utils.py
  11. git add main.py utils.py
  12. git commit -m 'Feature: update main and disable debug'
  13. git checkout main
  14. printf 'def main(): print("Running enterprise")' > main.py
  15. printf 'DEBUG = True PORT = 9000' > utils.py
  16. git add main.py utils.py
  17. git commit -m 'Enterprise: update main and port'
  18. git rebase feature
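Step 18 runs from main, so Git replays the 'Enterprise' commit on top of feature and stops with both files conflicted. While the rebase is paused, the HEAD side of each conflict block holds feature's content and the other side holds the replayed commit's content. The sketch below shows one way to split such a block into its two sides. The helper name `split_conflict`, the sample file text, and the abbreviated hash `abc1234` are illustrative assumptions, and the sample assumes the file stores one setting per line.

```python
def split_conflict(text: str) -> tuple[str, str]:
    """Return (head_side, replayed_side) for text with one conflict block."""
    head_side, replayed_side = [], []
    target = None  # None outside the block, else the list being filled
    for line in text.splitlines():
        if line.startswith("<<<<<<< "):
            target = head_side
        elif line.startswith("=======") and target is head_side:
            target = replayed_side
        elif line.startswith(">>>>>>> "):
            target = None
        elif target is not None:
            target.append(line)
    return "\n".join(head_side), "\n".join(replayed_side)


# Hypothetical working-tree content of utils.py after step 18.
# The marker labels, including the hash, are placeholders.
conflicted_utils = """<<<<<<< HEAD
DEBUG = False
PORT = 8000
=======
DEBUG = True
PORT = 9000
>>>>>>> abc1234 (Enterprise: update main and port)
"""

head_side, replayed_side = split_conflict(conflicted_utils)
print(replayed_side)
```

Under these assumptions, the Expected answer for both files corresponds to the replayed ('Enterprise') side of each block.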
Prompt
Resolve the rebase conflicts in main.py and utils.py. For main.py: branch being rebased says 'Running v2', main says 'Running enterprise'. For utils.py: branch being rebased set DEBUG=False with PORT=8000, main kept DEBUG=True with PORT=9000. Resolve both conflicts preferring the more descriptive version for main.py and the safer DEBUG=True for utils.py. Provide ONLY the two resolved file contents, separated by a blank line, with each file prefixed by its filename.
Expected
main.py: def main(): print("Running enterprise") utils.py: DEBUG = True PORT = 9000
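The Expected string puts a colon after each filename, and the failure messages in the results suggest the comparison is an exact match on whitespace-collapsed text. The grader itself is not shown on this page, so the following is only a sketch under that assumption. `strict_match` models the assumed check, and `lenient_match` is a hypothetical looser variant that ignores the optional colon.

```python
import re

EXPECTED = (
    'main.py: def main(): print("Running enterprise") '
    "utils.py: DEBUG = True PORT = 9000"
)


def collapse(text: str) -> str:
    """Collapse every whitespace run to one space and trim both ends."""
    return " ".join(text.split())


def strict_match(output: str) -> bool:
    """Assumed grader behavior: exact match after whitespace collapsing."""
    return collapse(output) == collapse(EXPECTED)


def drop_filename_colon(text: str) -> str:
    """Remove a colon that directly follows either filename."""
    return re.sub(r"\b(main\.py|utils\.py):", r"\1", collapse(text))


def lenient_match(output: str) -> bool:
    """Hypothetical variant: the colon after a filename is optional."""
    return drop_filename_colon(output) == drop_filename_colon(EXPECTED)


# An answer with the right content but no colon after the filenames.
no_colon = (
    'main.py\ndef main():\n    print("Running enterprise")\n\n'
    "utils.py\nDEBUG = True\nPORT = 9000"
)
# The same answer with the wrong port.
wrong_port = no_colon.replace("9000", "8000")

print(strict_match(no_colon), lenient_match(no_colon), lenient_match(wrong_port))
```

Under the strict check, a colon-free answer is indistinguishable from a wrong one. The lenient variant separates formatting misses from content errors such as the wrong port.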
deepseek/deepseek-v4-flash:high FAIL 0% 201 in → 373 out (349 reasoning)
main.py def main(): print("Running enterprise") utils.py DEBUG = True PORT = 9000
Failure: Expected 'main.py: def main(): print("Running enterprise") utils.py: DEBUG = True PORT = 9000 ', got 'main.py def main(): print("Running enterprise") utils.py DEBUG = True PORT = 9000'
deepseek/deepseek-v4-flash:high__json_schema FAIL 0% 203 in → 533 out (488 reasoning)
main.py def main(): print("Running enterprise") utils.py DEBUG = True PORT = 8000
JSON Schema Structured Output
(raw) { "resolved_content": "main.py\ndef main():\n print(\"Running enterprise\")\n\nutils.py\nDEBUG = True\nPORT = 8000\n" }
Failure: Expected 'main.py: def main(): print("Running enterprise") utils.py: DEBUG = True PORT = 9000 ', got 'main.py def main(): print("Running enterprise") utils.py DEBUG = True PORT = 8000 '
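This run differs from Expected in content as well as format: its utils.py has PORT = 8000 where Expected has PORT = 9000. A minimal sketch of how the raw structured output could be decoded and checked per file follows. The `split_files` helper and its filename-matching rule are assumptions for illustration, not part of the benchmark.

```python
import json

# Raw structured output copied from the run above.
raw = r'{ "resolved_content": "main.py\ndef main():\n print(\"Running enterprise\")\n\nutils.py\nDEBUG = True\nPORT = 8000\n" }'


def split_files(content: str, names=("main.py", "utils.py")) -> dict[str, list[str]]:
    """Group non-empty lines under the most recent filename line."""
    files: dict[str, list[str]] = {}
    current = None
    for line in content.splitlines():
        label = line.strip().rstrip(":")  # accept "utils.py" or "utils.py:"
        if label in names:
            current = files.setdefault(label, [])
        elif current is not None and line.strip():
            current.append(line.strip())
    return files


files = split_files(json.loads(raw)["resolved_content"])
# Turn "KEY = value" lines into a dict for a field-by-field comparison.
settings = dict(line.split(" = ") for line in files["utils.py"])
print(settings)
```

Comparing `settings` with the expected values shows that DEBUG matches and only PORT is wrong.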
deepseek/deepseek-v4-flash:none FAIL 0% 201 in → 25 out (0 reasoning)
main.py def main(): print("Running enterprise") utils.py DEBUG = True PORT = 8000
Failure: Expected 'main.py: def main(): print("Running enterprise") utils.py: DEBUG = True PORT = 9000 ', got 'main.py def main(): print("Running enterprise") utils.py DEBUG = True PORT = 8000'
deepseek/deepseek-v4-flash:none__json_schema FAIL 0% 205 in → 41 out (0 reasoning)
main.py def main(): print("Running enterprise") utils.py DEBUG = True PORT = 8000
JSON Schema Structured Output
(raw) { "resolved_content": "main.py\ndef main():\n print(\"Running enterprise\")\n\nutils.py\nDEBUG = True\nPORT = 8000" }
Failure: Expected 'main.py: def main(): print("Running enterprise") utils.py: DEBUG = True PORT = 9000 ', got 'main.py def main(): print("Running enterprise") utils.py DEBUG = True PORT = 8000'
mistralai/devstral-2512 FAIL 0% 214 in → 38 out
```python --- main.py def main(): print("Running enterprise") ``` ```python --- utils.py DEBUG = True PORT = 9000 ```
Failure: Expected 'main.py: def main(): print("Running enterprise") utils.py: DEBUG = True PORT = 9000 ', got '```python --- main.py def main(): print("Running enterprise") ``` ```python --- utils.py DEBUG = True PORT = 9000 ```'
mistralai/devstral-2512__json_schema FAIL 0% 218 in → 43 out
--- main.py def main(): print("Running enterprise") --- utils.py DEBUG = True PORT = 9000
JSON Schema Structured Output
(raw) {"resolved_content": "--- main.py\ndef main():\n print(\"Running enterprise\")\n\n--- utils.py\nDEBUG = True\nPORT = 9000"}
Failure: Expected 'main.py: def main(): print("Running enterprise") utils.py: DEBUG = True PORT = 9000 ', got '--- main.py def main(): print("Running enterprise") --- utils.py DEBUG = True PORT = 9000'
nvidia/nemotron-3-nano-30b-a3b:high FAIL 0% 227 in → 532 out (561 reasoning)
main.py def main(): print("Running enterprise") utils.py DEBUG = True PORT = 9000
Failure: Expected 'main.py: def main(): print("Running enterprise") utils.py: DEBUG = True PORT = 9000 ', got 'main.py def main(): print("Running enterprise") utils.py DEBUG = True PORT = 9000'
nvidia/nemotron-3-nano-30b-a3b:high__json_schema FAIL 0% 231 in → 1,343 out (1,458 reasoning)
main.py def main(): print("Running enterprise") utils.py DEBUG = True PORT = 9000
JSON Schema Structured Output
(raw) { "resolved_content": "main.py\ndef main():\n print(\"Running enterprise\")\n\nutils.py\nDEBUG = True\nPORT = 9000" }
Failure: Expected 'main.py: def main(): print("Running enterprise") utils.py: DEBUG = True PORT = 9000 ', got 'main.py def main(): print("Running enterprise") utils.py DEBUG = True PORT = 9000'
nvidia/nemotron-3-nano-30b-a3b:none FAIL 0% 229 in → 29 out (0 reasoning)
--- main.py def main(): print("Running enterprise") --- utils.py DEBUG = True PORT = 9000
Failure: Expected 'main.py: def main(): print("Running enterprise") utils.py: DEBUG = True PORT = 9000 ', got '--- main.py def main(): print("Running enterprise") --- utils.py DEBUG = True PORT = 9000'
nvidia/nemotron-3-nano-30b-a3b:none__json_schema FAIL 0% 229 in → 495 out (0 reasoning)
@ conflict_id: main.py-0 unified: true version: 2024-05-27 --- conflict: main.py files: ["/main.py"] lines: 4 code: --- main.py.preferred_version_code: | def main(): print("Running enterprise") --- main.py.conflicting_versions_code: | <<<<<<< HEAD print("Running v2") ======= print("Running enterprise") >>>>>>> b1be283 (Enterprise: update main and port) --- main.py.non_conflicting_versions_code: | pass --- } { "resolved_content": "DEBUG = True\nPORT = 9000\n"] } <pre> conflict_id: utils.py-0 unified: true version: 2024-05-27 --- conflict: utils.py files: ["/utils.py"] lines: 4 code: --- utils.py.conflicting_versions_code: | <<<<<<< HEAD DEBUG = False\nPORT = 8000\n=======\nDEBUG = True\nPORT = 9000\n>>>>>>> b1be283 (Enterprise: update main and port) --- utils.py.preferred_versions_code: | DEBUG = True\nPORT = 9000\n--- utils.py.conflicting_versions_code: | <<<<<<< HEAD\nDEBUG = False\nPORT = 8000\n=======\nDEBUG = True\nPORT = 9000\n>>>>>>> b1be283 (Enterprise: update main and port)\n--- utils.py.non_conflicting_versions_code: | pass\n--- </pre> ><|startoftext|>|<|reserved_special_token_866|><|startoftext|>
JSON Schema Structured Output
(raw) { "resolved_content": "@\nconflict_id: main.py-0\nunified: true\nversion: 2024-05-27\n---\nconflict: main.py\nfiles: [\"/main.py\"]\nlines: 4\ncode:\n---\nmain.py.preferred_version_code: |\n def main():\n print(\"Running enterprise\")\n---\nmain.py.conflicting_versions_code: |\n<<<<<<< HEAD\n print(\"Running v2\")\n=======\n print(\"Running enterprise\")\n>>>>>>> b1be283 (Enterprise: update main and port)\n---\nmain.py.non_conflicting_versions_code: |\n pass\n---\n}\n{\n \"resolved_content\": \"DEBUG = True\\nPORT = 9000\\n\"]\n}\n<pre>\nconflict_id: utils.py-0\nunified: true\nversion: 2024-05-27\n---\nconflict: utils.py\nfiles: [\"/utils.py\"]\nlines: 4\ncode:\n---\nutils.py.conflicting_versions_code: |\n<<<<<<< HEAD\nDEBUG = False\\nPORT = 8000\\n=======\\nDEBUG = True\\nPORT = 9000\\n>>>>>>> b1be283 (Enterprise: update main and port)\n---\nutils.py.preferred_versions_code: |\nDEBUG = True\\nPORT = 9000\\n---\nutils.py.conflicting_versions_code: |\n<<<<<<< HEAD\\nDEBUG = False\\nPORT = 8000\\n=======\\nDEBUG = True\\nPORT = 9000\\n>>>>>>> b1be283 (Enterprise: update main and port)\\n---\nutils.py.non_conflicting_versions_code: |\n pass\\n---\n</pre>\n><|startoftext|>|<|reserved_special_token_866|><|startoftext|>" }
Failure: Expected 'main.py: def main(): print("Running enterprise") utils.py: DEBUG = True PORT = 9000 ', got '@ conflict_id: main.py-0 unified: true version: 2024-05-27 --- conflict: main.py files: ["/main.py"] lines: 4 code: --- main.py.preferred_version_code: | def main(): print("Running enterprise") --- main.py.conflicting_versions_code: | <<<<<<< HEAD print("Running v2") ======= print("Running enterprise") >>>>>>> b1be283 (Enterprise: update main and port) --- main.py.non_conflicting_versions_code: | pass --- } { "resolved_content": "DEBUG = True\nPORT = 9000\n"] } <pre> conflict_id: utils.py-0 unified: true version: 2024-05-27 --- conflict: utils.py files: ["/utils.py"] lines: 4 code: --- utils.py.conflicting_versions_code: | <<<<<<< HEAD DEBUG = False\nPORT = 8000\n=======\nDEBUG = True\nPORT = 9000\n>>>>>>> b1be283 (Enterprise: update main and port) --- utils.py.preferred_versions_code: | DEBUG = True\nPORT = 9000\n--- utils.py.conflicting_versions_code: | <<<<<<< HEAD\nDEBUG = False\nPORT = 8000\n=======\nDEBUG = True\nPORT = 9000\n>>>>>>> b1be283 (Enterprise: update main and port)\n--- utils.py.non_conflicting_versions_code: | pass\n--- </pre> ><|startoftext|>|<|reserved_special_token_866|><|startoftext|>'
poolside/laguna-xs.2:high FAIL 0% 258 in → 236 out (205 reasoning)
main.py def main(): print("Running enterprise") utils.py DEBUG = True PORT = 9000
Failure: Expected 'main.py: def main(): print("Running enterprise") utils.py: DEBUG = True PORT = 9000 ', got ' main.py def main(): print("Running enterprise") utils.py DEBUG = True PORT = 9000 '
poolside/laguna-xs.2:high__json_schema FAIL 0% 258 in → 210 out (169 reasoning)
main.py def main(): print("Running enterprise") utils.py DEBUG = True PORT = 9000
JSON Schema Structured Output
(raw) {"resolved_content": "main.py\ndef main():\n print(\"Running enterprise\")\n\n\nutils.py\nDEBUG = True\nPORT = 9000"}
Failure: Expected 'main.py: def main(): print("Running enterprise") utils.py: DEBUG = True PORT = 9000 ', got 'main.py def main(): print("Running enterprise") utils.py DEBUG = True PORT = 9000'
poolside/laguna-xs.2:none FAIL 0% 260 in → 30 out (0 reasoning)
main.py def main(): print("Running enterprise") utils.py DEBUG = True PORT = 9000
Failure: Expected 'main.py: def main(): print("Running enterprise") utils.py: DEBUG = True PORT = 9000 ', got ' main.py def main(): print("Running enterprise") utils.py DEBUG = True PORT = 9000 '
poolside/laguna-xs.2:none__json_schema FAIL 0% 260 in → 40 out (0 reasoning)
main.py def main(): print("Running enterprise") utils.py DEBUG = True PORT = 9000
JSON Schema Structured Output
(raw) {"resolved_content": "main.py\ndef main():\n print(\"Running enterprise\")\n\n\nutils.py\nDEBUG = True\nPORT = 9000"}
Failure: Expected 'main.py: def main(): print("Running enterprise") utils.py: DEBUG = True PORT = 9000 ', got 'main.py def main(): print("Running enterprise") utils.py DEBUG = True PORT = 9000'