GitBench
Long-running feature with many small commits
Tests the ability to handle a long-running feature built from many small commits that need reorganizing. Evaluates how well squash analysis scales with the number of commits.

These commands set up the repo before the model sees the prompt. They define the starting file structure, staged changes, and Git history.

  1. git init
  2. git config user.email 'test@test.com'
  3. git config user.name 'Test User'
  4. echo 'start' > feature.py
  5. git add feature.py
  6. git commit -m 'Initial feature'
  7. echo 'start line1' > feature.py
  8. git add feature.py
  9. git commit -m 'WIP: add line1'
  10. echo 'start line1 line2' > feature.py
  11. git add feature.py
  12. git commit -m 'WIP: add line2'
  13. echo 'start line1 line2 line3' > feature.py
  14. git add feature.py
  15. git commit -m 'WIP: add line3'
  16. echo 'start line1 line2 line3 line4' > feature.py
  17. git add feature.py
  18. git commit -m 'Complete feature'
Prompt
This repository has a feature built incrementally with many small commits. Identify all the WIP commits that should be marked as squash/fixup for a cleaner history. Do not include the initial or final feature anchor commit. Provide the commit messages.
Expected
WIP: add line1, WIP: add line2, WIP: add line3
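A minimal sketch of how the expected answer follows from the history the setup commands create, assuming WIP commits are recognized by a `WIP:` subject prefix. The names `HISTORY` and `wip_commits` are hypothetical and used only for illustration; they are not part of GitBench.

```python
# Commit subjects created by the setup commands, oldest first.
HISTORY = [
    "Initial feature",
    "WIP: add line1",
    "WIP: add line2",
    "WIP: add line3",
    "Complete feature",
]


def wip_commits(subjects):
    """Return the subjects that are squash/fixup candidates.

    Assumption: a candidate is any commit whose subject starts with "WIP:".
    The first and last commits are the anchors and are never candidates.
    """
    interior = subjects[1:-1]  # drop the initial and final anchor commits
    return [s for s in interior if s.startswith("WIP:")]


print(", ".join(wip_commits(HISTORY)))
# prints: WIP: add line1, WIP: add line2, WIP: add line3
```

Slicing off the first and last entries before filtering keeps the anchors out even if one of them happened to carry a `WIP:` prefix.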
deepseek/deepseek-v4-flash:high PASS 100% 109 in → 260 out (191 reasoning)
f4a227e WIP: add line3 15fc7bf WIP: add line2 219a13e WIP: add line1
deepseek/deepseek-v4-flash:high__json_schema PASS 100% 110 in → 290 out (237 reasoning)
WIP: add line3 WIP: add line2 WIP: add line1
JSON Schema Structured Output
(raw) { "commits": ["WIP: add line3", "WIP: add line2", "WIP: add line1"] }
deepseek/deepseek-v4-flash:none PASS 100% 105 in → 128 out (0 reasoning)
Based on the commit history provided, the commits that should be marked as squash/fixup (WIP commits) are:

1. `23fe023 WIP: add line3`
2. `4722549 WIP: add line2`
3. `8690323 WIP: add line1`

These are the intermediate work-in-progress commits that should be squashed into the final feature commit (`393c26b Complete feature`) to create a cleaner history. The initial feature anchor commit (`eaa875a Initial feature`) and the final commit (`393c26b Complete feature`) are excluded as requested.
deepseek/deepseek-v4-flash:none__json_schema PASS 100% 109 in → 30 out (0 reasoning)
WIP: add line3 WIP: add line2 WIP: add line1
JSON Schema Structured Output
(raw) { "commits": ["WIP: add line3", "WIP: add line2", "WIP: add line1"] }
mistralai/devstral-2512 PASS 100% 122 in → 157 out
Based on the commit history provided, the WIP (Work In Progress) commits that should be marked as squash/fixup for a cleaner history are:

1. `28a9835 WIP: add line3`
2. `9927041 WIP: add line2`
3. `8af1acd WIP: add line1`

These commits are intermediate steps in the development of the feature and should be squashed into the final commit (`2f70423 Complete feature`) to maintain a clean and concise history. The initial commit (`47b6b7b Initial feature`) and the final commit (`2f70423 Complete feature`) should be kept as anchor commits.
mistralai/devstral-2512__json_schema PASS 100% 121 in → 51 out
a4a9f89 WIP: add line3 0927c9b WIP: add line2 2d1b89c WIP: add line1
JSON Schema Structured Output
(raw) {"commits": ["a4a9f89 WIP: add line3", "0927c9b WIP: add line2", "2d1b89c WIP: add line1"]}
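The structured output above puts an abbreviated hash in front of each message, and the run is still marked PASS. One way a comparison could tolerate that is sketched below. This is an assumption for illustration: the normalization GitBench actually applies is not shown on this page, and `HASH_PREFIX` and `parse_commits` are hypothetical names.

```python
import json
import re

# Assumption: an abbreviated hash is 7 to 40 lowercase hex characters
# followed by whitespace at the start of the entry.
HASH_PREFIX = re.compile(r"^[0-9a-f]{7,40}\s+")


def parse_commits(raw):
    """Parse a {"commits": [...]} payload and strip any leading short hash."""
    payload = json.loads(raw)
    return [HASH_PREFIX.sub("", entry).strip() for entry in payload["commits"]]


raw = '{"commits": ["a4a9f89 WIP: add line3", "0927c9b WIP: add line2", "2d1b89c WIP: add line1"]}'
print(parse_commits(raw))
# prints: ['WIP: add line3', 'WIP: add line2', 'WIP: add line1']
```

A message whose first word happens to be seven or more hex characters would also lose that word, so a stricter check would need the real hashes from the repository.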
nvidia/nemotron-3-nano-30b-a3b:high PASS 100% 132 in → 306 out (280 reasoning)
```
WIP: add line3
WIP: add line2
WIP: add line1
```
nvidia/nemotron-3-nano-30b-a3b:high__json_schema PASS 100% 132 in → 244 out (201 reasoning)
WIP: add line3 WIP: add line2 WIP: add line1
JSON Schema Structured Output
(raw) { "commits": [ "WIP: add line3", "WIP: add line2", "WIP: add line1" ] }
nvidia/nemotron-3-nano-30b-a3b:none FAIL 100% 134 in → 440 out (0 reasoning)
To identify all the WIP commits that should be squashed or fixupped (i.e., to preserve only the clean commits and discard the intermediate WIP commits, except the initial and final anchors), we analyze the provided commit history:

---

**Git Commit History (newest first):**

```
22e9101 Complete feature
12d8d7c WIP: add line3
a54e5da WIP: add line2
83295ef WIP: add line1
aa93f8f Initial feature
```

---

### Step-by-Step Reasoning

- The goal is to **remove all WIP (Work In Progress) commits** that are not the initial or final feature anchors.
- These WIP commits (`add line1`, `add line2`, `add line3`) are **intermediate** and should be squashed or fixupped to maintain a clean history.
- The **initial anchor** (`aa93f8f Initial feature`) is the starting point of the feature development. It should **not be marked for squashing**.
- The **final feature** (`22e9101 Complete feature`) is the target commit that includes the full feature. It should also **not be marked for squashing**.
- Therefore, **only the WIP commits** that are not anchors should be considered for squashing.

---

### Identifying WIP Commits to Be Squashed

From the list of commits, the WIP commits that need to be marked for squashing or fixup are:

- `12d8d7c WIP: add line3`
- `a54e5da WIP: add line2`
- `83295ef WIP: add line1`

These correspond to the **three intermediate WIP commits** that are not the initial anchor or the final feature commit.

---

### Final Answer

The WIP commits that should be marked as **squash/fixup** are:

---

**add line3**
**add line2**
**add line1**
Failure: Extra selected commit messages: ['Complete feature', 'Initial feature']
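The failure text reads like a set difference between the selected and the expected commit messages. A minimal sketch of such a check follows. The `grade` helper and the "Missing commit messages" wording are hypothetical; how GitBench extracts the selected messages from a free-text answer is not shown on this page.

```python
EXPECTED = {"WIP: add line1", "WIP: add line2", "WIP: add line3"}


def grade(selected, expected=EXPECTED):
    """Compare selected commit messages with the expected set, ignoring order.

    Returns (passed, reason). Hypothetical helper, not GitBench's grader.
    """
    selected = set(selected)
    extra = sorted(selected - expected)
    missing = sorted(expected - selected)
    if extra:
        return False, f"Extra selected commit messages: {extra}"
    if missing:
        # Assumed wording; no "missing" failure appears in the runs shown.
        return False, f"Missing commit messages: {missing}"
    return True, ""


# Order does not matter: a newest-first answer still passes.
print(grade(["WIP: add line3", "WIP: add line2", "WIP: add line1"]))
# prints: (True, '')

# Selecting the anchor commits as well reproduces the failure text.
print(grade(["Complete feature", "WIP: add line3", "WIP: add line2",
             "WIP: add line1", "Initial feature"]))
# prints: (False, "Extra selected commit messages: ['Complete feature', 'Initial feature']")
```

This covers only the comparison step; it does not model how messages are pulled out of a prose answer.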
nvidia/nemotron-3-nano-30b-a3b:none__json_schema PASS 100% 133 in → 38 out (0 reasoning)
WIP: add line1 WIP: add line2 WIP: add line3
JSON Schema Structured Output
(raw) { "commits": [ "WIP: add line1", "WIP: add line2", "WIP: add line3" ] }
poolside/laguna-xs.2:high FAIL 100% 162 in → 409 out (255 reasoning)
Based on the commit history, I can identify the following WIP commits that should be squashed or fixed up:

**WIP commits to squash/fixup:**

- `4f3b4de WIP: add line3`
- `692b078 WIP: add line2`
- `60e49e2 WIP: add line1`

These three commits all have "WIP:" prefixes in their messages, indicating they are work-in-progress commits that would benefit from being combined into the final feature commit for a cleaner history. The initial commit (`05bfa41 Initial feature`) and final commit (`245b0c0 Complete feature`) are the anchor points and should remain as-is.
Failure: Extra selected commit messages: ['Complete feature', 'Initial feature']
poolside/laguna-xs.2:high__json_schema PASS 100% 162 in → 317 out (290 reasoning)
WIP: add line3 WIP: add line2 WIP: add line1
JSON Schema Structured Output
(raw) {"commits": ["WIP: add line3", "WIP: add line2", "WIP: add line1"]}
poolside/laguna-xs.2:none FAIL 100% 163 in → 98 out (0 reasoning)
Based on the commit history, here are the WIP commits that should be marked as squash/fixup for a cleaner history:

- WIP: add line3
- WIP: add line2
- WIP: add line1

These commits are all intermediate work-in-progress commits that were made while building the feature incrementally. They should be squashed together or marked as fixup commits to be combined with the initial feature commit, resulting in a cleaner history that only shows the complete feature.
Failure: Extra selected commit messages: ['Complete feature', 'Initial feature']
poolside/laguna-xs.2:none__json_schema PASS 100% 162 in → 42 out (0 reasoning)
WIP: add line3 WIP: add line2 WIP: add line1
JSON Schema Structured Output
(raw) { "commits": [ "WIP: add line3", "WIP: add line2", "WIP: add line1" ] }