GitBench
WIP commits - two consecutive WIP commits should be squashed
Tests the ability to identify that two consecutive WIP commits should be squashed into one meaningful commit. Evaluates recognition of work-in-progress patterns.

These commands set up the repo before the model sees the prompt. They define the starting file structure, staged changes, and Git history.

  1. git init
  2. git config user.email 'test@test.com'
  3. git config user.name 'Test User'
  4. echo '# Project' > README.md
  5. git add README.md
  6. git commit -m 'Initial commit'
  7. echo 'print(1)' > main.py
  8. git add main.py
  9. git commit -m 'WIP: add main.py'
  10. echo 'print(2)' >> main.py
  11. git add main.py
  12. git commit -m 'WIP: continue work'
  13. echo 'print(3)' >> main.py
  14. git add main.py
  15. git commit -m 'Complete feature'
Prompt
Analyze this git history and identify which consecutive commits should be marked as squash/fixup in an interactive rebase to create a cleaner history. Focus on WIP (Work In Progress) commits that represent incomplete work. Do not include the base/anchor commit they would be squashed into. Provide the commit hashes or commit messages in the order they should be combined. Output ONLY the selected commits, one per line, with no explanation and no base/anchor/final commits.
Expected
WIP: add main.py, WIP: continue work
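The selection the prompt asks for can be sketched in a few lines of Python. This is only an illustration of the task, not the benchmark's grader: the function name `consecutive_wip_commits` and the rule "subject starts with WIP and is adjacent to another WIP commit" are assumptions made for this example.

```python
from typing import List


def consecutive_wip_commits(subjects: List[str]) -> List[str]:
    """Return commit subjects that start with 'WIP' and sit next to another WIP commit.

    `subjects` is ordered oldest to newest. Hypothetical helper, not part of GitBench.
    """

    def is_wip(subject: str) -> bool:
        # Assumption: a WIP commit is one whose subject begins with "WIP" (case-insensitive).
        return subject.strip().lower().startswith("wip")

    selected = []
    for i, subject in enumerate(subjects):
        if not is_wip(subject):
            continue
        prev_wip = i > 0 and is_wip(subjects[i - 1])
        next_wip = i + 1 < len(subjects) and is_wip(subjects[i + 1])
        if prev_wip or next_wip:
            selected.append(subject)
    return selected


# Commit subjects produced by the setup commands, oldest first.
history = [
    "Initial commit",
    "WIP: add main.py",
    "WIP: continue work",
    "Complete feature",
]
print("\n".join(consecutive_wip_commits(history)))
```

For the history built by the setup commands this prints the two WIP subjects, one per line, and leaves out both `Initial commit` and `Complete feature`.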
deepseek/deepseek-v4-flash:high PASS 100% 147 in → 1,116 out (1,068 reasoning)
993c9fd 6f7ab45
deepseek/deepseek-v4-flash:high__json_schema PASS 100% 147 in → 660 out (634 reasoning)
c24a1a9 76f3dcd
JSON Schema Structured Output
(raw) { "commits": ["c24a1a9", "76f3dcd"] }
deepseek/deepseek-v4-flash:none__json_schema PASS 100% 145 in → 26 out (0 reasoning)
a5259f4 7b402ed
JSON Schema Structured Output
(raw) { "commits": [ "a5259f4", "7b402ed" ] }
mistralai/devstral-2512 PASS 100% 154 in → 14 out
6db6a69 d2990de
mistralai/devstral-2512__json_schema PASS 100% 154 in → 23 out
c367b0c 99dc37f
JSON Schema Structured Output
(raw) {"commits": ["c367b0c", "99dc37f"]}
nvidia/nemotron-3-nano-30b-a3b:high PASS 100% 167 in → 464 out (434 reasoning)
2d6563b 6c5436a
nvidia/nemotron-3-nano-30b-a3b:high__json_schema PASS 100% 168 in → 1,361 out (1,284 reasoning)
2d99955 10260d7
JSON Schema Structured Output
(raw) { "commits": [ "2d99955", "10260d7" ] }
nvidia/nemotron-3-nano-30b-a3b:none PASS 100% 167 in → 37 out (0 reasoning)
c764b74 WIP: add main.py 6978e5a WIP: continue work 76a392c Complete feature
nvidia/nemotron-3-nano-30b-a3b:none__json_schema PASS 100% 165 in → 32 out (0 reasoning)
d74f567 03fc360
JSON Schema Structured Output
(raw) { "commits": [ "d74f567", "03fc360" ] }
poolside/laguna-xs.2:high PASS 100% 194 in → 331 out (314 reasoning)
6bb7322 2e51009
poolside/laguna-xs.2:high__json_schema PASS 100% 197 in → 1,947 out (1,915 reasoning)
4d0d107 WIP: add main.py 606097f WIP: continue work
JSON Schema Structured Output
(raw) {"commits": ["4d0d107 WIP: add main.py", "606097f WIP: continue work"]}
poolside/laguna-xs.2:none PASS 100% 194 in → 24 out (0 reasoning)
dd38fd1 WIP: continue work 21e1216 WIP: add main.py
poolside/laguna-xs.2:none__json_schema PASS 100% 196 in → 22 out (0 reasoning)
490bd45 c228897
JSON Schema Structured Output
(raw) {"commits": ["490bd45", "c228897"]}
(empty output)
Failure: Model 'deepseek/deepseek-v4-flash:none' violated the reasoning_level=none invariant: provider reported 1420 reasoning token(s)
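The invariant named in the failure message could be checked along the following lines. This is a minimal sketch under assumptions: the function name `check_reasoning_invariant` is hypothetical, and it assumes the reasoning level is the suffix after the last `:` in the model identifier, with an optional `__json_schema` variant tag, as the identifiers on this page suggest. The harness's real check is not shown in the source.

```python
def check_reasoning_invariant(model_id: str, reasoning_tokens: int) -> None:
    """Raise ValueError if a ':none' model reports any reasoning tokens.

    Hypothetical helper; the identifier layout is inferred from this page.
    """
    # Drop a variant tag such as "__json_schema" if present.
    base = model_id.split("__", 1)[0]
    # Assumption: the reasoning level follows the last ':' (e.g. ":high", ":none").
    level = base.rsplit(":", 1)[1] if ":" in base else None
    if level == "none" and reasoning_tokens > 0:
        raise ValueError(
            f"Model {model_id!r} violated the reasoning_level=none invariant: "
            f"provider reported {reasoning_tokens} reasoning token(s)"
        )


# A run that respects the invariant passes silently.
check_reasoning_invariant("deepseek/deepseek-v4-flash:none__json_schema", 0)

# A run like the failed one above is rejected.
try:
    check_reasoning_invariant("deepseek/deepseek-v4-flash:none", 1420)
except ValueError as err:
    print(err)
```

Models without a level suffix, such as `mistralai/devstral-2512`, are not constrained by this sketch.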