GitBench
Config changes spread across commits
Tests the ability to identify configuration changes that are spread across multiple commits and should be squashed together. Evaluates whether the model recognizes scattered configuration updates.

These commands set up the repo before the model sees the prompt. They define the starting file structure, staged changes, and Git history.

  1. git init
  2. git config user.email 'test@test.com'
  3. git config user.name 'Test User'
  4. echo '{}' > config.json
  5. git add config.json
  6. git commit -m 'Initial config'
  7. echo '{"debug": true}' > config.json
  8. git add config.json
  9. git commit -m 'WIP: add debug flag'
  10. echo '{"debug": true, "port": 8080}' > config.json
  11. git add config.json
  12. git commit -m 'WIP: add port'
  13. echo '{"debug": true, "port": 8080, "host": "localhost"}' > config.json
  14. git add config.json
  15. git commit -m 'Finalize config'
Prompt
These commits show incremental configuration changes. Identify the WIP commits that should be marked as squash/fixup to combine all config work into a single meaningful commit. Do not include the initial or final config anchor commit. Provide the commit messages.
Expected
WIP: add debug flag, WIP: add port
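The expected answer is the set of commits strictly between the two anchor commits. A minimal Python sketch of that selection, assuming the history is available as a list of commit messages ordered oldest to newest (`HISTORY` and `wip_between_anchors` are illustrative names, not part of GitBench):

```python
# Commit messages created by the setup commands, oldest first.
HISTORY = [
    "Initial config",
    "WIP: add debug flag",
    "WIP: add port",
    "Finalize config",
]


def wip_between_anchors(history):
    """Return the commits strictly between the first and last (anchor) commits."""
    return history[1:-1]


print(", ".join(wip_between_anchors(HISTORY)))
```

Slicing by position rather than matching the `WIP:` prefix keeps the sketch tied to the prompt's wording about anchors; either rule gives the same result for this history.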
deepseek/deepseek-v4-flash:high PASS 100% 97 in → 303 out (290 reasoning)
WIP: add port WIP: add debug flag
deepseek/deepseek-v4-flash:high__json_schema PASS 100% 99 in → 246 out (190 reasoning)
b92ffd3 WIP: add port a6cb1fd WIP: add debug flag
JSON Schema Structured Output
(raw) { "commits": [ "b92ffd3 WIP: add port", "a6cb1fd WIP: add debug flag" ] }
deepseek/deepseek-v4-flash:none FAIL 100% 96 in → 115 out (0 reasoning)
Based on the incremental configuration changes shown, the following WIP commits should be marked as squash/fixup into the final config commit:
- `c7e320c WIP: add port`
- `a37bee8 WIP: add debug flag`
These two commits represent intermediate, incomplete configuration steps that should be combined with the final config commit (`c7713ef Finalize config`) to create one meaningful commit. The initial config (`ffda372`) and final config (`c7713ef`) should remain as separate commits as specified in your instructions.
Failure: Extra selected commit messages: ['Finalize config', 'Initial config']
deepseek/deepseek-v4-flash:none__json_schema PASS 100% 96 in → 36 out (0 reasoning)
93c4282 WIP: add debug flag 303a11b WIP: add port
JSON Schema Structured Output
(raw) { "commits": [ "93c4282 WIP: add debug flag", "303a11b WIP: add port" ] }
mistralai/devstral-2512 FAIL 100% 108 in → 113 out
Based on the commit history, the WIP commits that should be marked as `squash` or `fixup` to combine all config work into a single meaningful commit (excluding the initial and final anchor commits) are:
1. `c8a5d7d WIP: add port`
2. `f3d47fb WIP: add debug flag`
These would be squashed into the final commit `85864f9 Finalize config` to create a single, clean commit representing all the configuration changes.
Failure: Extra selected commit messages: ['Finalize config']
mistralai/devstral-2512__json_schema PASS 100% 105 in → 21 out
WIP: add port WIP: add debug flag
JSON Schema Structured Output
(raw) {"commits": ["WIP: add port", "WIP: add debug flag"]}
nvidia/nemotron-3-nano-30b-a3b:high PASS 100% 119 in → 519 out (430 reasoning)
The commits that should be squash‑/fixup‑ed (i.e., marked as “WIP” commits to combine into a single meaningful config commit) are:
1. **e12c589 – “WIP: add port”**
2. **9ee468d – “WIP: add debug flag”**
nvidia/nemotron-3-nano-30b-a3b:none PASS 100% 119 in → 48 out (0 reasoning)
**WIP commits to squash/fixup:**
1. `854071f WIP: add debug flag`
2. `bb9f998 WIP: add port`
nvidia/nemotron-3-nano-30b-a3b:none__json_schema PASS 100% 117 in → 28 out (0 reasoning)
WIP: add debug flag WIP: add port
JSON Schema Structured Output
(raw) { "commits": [ "WIP: add debug flag", "WIP: add port" ] }
poolside/laguna-xs.2:high FAIL 100% 150 in → 283 out (205 reasoning)
Based on the commit history, the WIP commits that should be squashed/fixup'd are:
1. **WIP: add port**
2. **WIP: add debug flag**
These two commits fall between the "Initial config" anchor and the "Finalize config" anchor, and should be combined into a single meaningful commit to clean up the incremental configuration changes.
Failure: Extra selected commit messages: ['Finalize config', 'Initial config']
poolside/laguna-xs.2:high__json_schema PASS 100% 147 in → 267 out (249 reasoning)
WIP: add port WIP: add debug flag
JSON Schema Structured Output
(raw) {"commits": ["WIP: add port", "WIP: add debug flag"]}
poolside/laguna-xs.2:none FAIL 100% 145 in → 124 out (0 reasoning)
Looking at the commit history, I can see the pattern:
- `46bf757` - Finalize config (final anchor)
- `0eaef32` - WIP: add port (WIP commit)
- `cf7c2d3` - WIP: add debug flag (WIP commit)
- `fb054aa` - Initial config (initial anchor)
The WIP commits that should be squashed/fixup to combine into a single meaningful commit are:
1. **WIP: add port**
2. **WIP: add debug flag**
Failure: Extra selected commit messages: ['Finalize config', 'Initial config']
poolside/laguna-xs.2:none__json_schema PASS 100% 149 in → 31 out (0 reasoning)
0f6475b WIP: add port 45e310a WIP: add debug flag
JSON Schema Structured Output
(raw) {"commits": ["0f6475b WIP: add port", "45e310a WIP: add debug flag"]}
nvidia/nemotron-3-nano-30b-a3b:high__json_schema FAIL 0% 119 in → 493 out (472 reasoning)
add port add debug flag
JSON Schema Structured Output
(raw) { "commits": [ "add port", "add debug flag" ] }
Failure: Missing expected commit messages: ['WIP: add debug flag', 'WIP: add port']
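The failure lines on this page ("Extra selected commit messages", "Missing expected commit messages") suggest that the grader compares the set of commit messages found in a response against the expected set. The page does not show the scoring code, so the following Python sketch is only a guess: it treats a commit as selected when its full message appears anywhere in the response, ignoring case. `grade`, `Verdict`, and the two constants are hypothetical names. Under that assumption, a free-text answer fails merely for naming an anchor commit while explaining its choice, which matches the responses as shown here.

```python
from typing import NamedTuple

# Commit messages created by the setup commands on this page.
ALL_MESSAGES = ["Initial config", "WIP: add debug flag", "WIP: add port", "Finalize config"]
EXPECTED = {"WIP: add debug flag", "WIP: add port"}


class Verdict(NamedTuple):
    passed: bool
    extra: list    # messages mentioned in the response but not expected
    missing: list  # expected messages the response never mentions


def grade(response, expected=EXPECTED, all_messages=ALL_MESSAGES):
    """Assumed scoring rule: case-insensitive substring match on each full message."""
    text = response.lower()
    selected = {m for m in all_messages if m.lower() in text}
    extra = sorted(selected - set(expected))
    missing = sorted(set(expected) - selected)
    return Verdict(not extra and not missing, extra, missing)


reply = "Squash `WIP: add port` and `WIP: add debug flag` into `Finalize config`."
print(grade(reply))
```

The sample reply fails with `Finalize config` reported as extra, while a bare list of the two WIP messages passes. A response that drops the `WIP:` prefix matches nothing and is reported as missing both expected messages.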