GitBench
Test commits first, then implementation
Tests the ability to squash test-first commits with their implementation commits, and evaluates recognition of TDD-style commit patterns.

These commands set up the repo before the model sees the prompt. They define the starting file structure, staged changes, and Git history.

  1. git init
  2. git config user.email 'test@test.com'
  3. git config user.name 'Test User'
  4. echo 'code' > app.py
  5. git add app.py
  6. git commit -m 'Initial app'
  7. echo 'import unittest' > test.py
  8. git add test.py
  9. git commit -m 'WIP: add test stub'
  10. echo 'import unittest class Test(unittest.TestCase): pass' > test.py
  11. git add test.py
  12. git commit -m 'WIP: expand test'
  13. echo 'import unittest class Test(unittest.TestCase): def test_ok(self): pass' > test.py
  14. git add test.py
  15. git commit -m 'Complete tests'
Prompt
Identify the WIP commits that should be marked as squash/fixup for a cleaner test file addition. Do not include the final complete-tests commit. Provide the commit messages.
Expected
WIP: add test stub, WIP: expand test
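The expected answer can be checked by hand against a rebuilt copy of the history. The following is a minimal sketch, not part of the benchmark itself: it assumes a throwaway directory from `mktemp -d`, uses simplified placeholder contents for `test.py`, and introduces the variable name `wip_subjects` purely for illustration. It lists the commit subjects that start with `WIP:`, oldest first.

```shell
#!/bin/sh
set -eu

# Rebuild the setup history in a throwaway directory (assumption of this sketch).
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email 'test@test.com'
git config user.name 'Test User'

echo 'code' > app.py
git add app.py
git commit -q -m 'Initial app'

# Placeholder file contents: only the commit messages matter for this check.
for msg in 'WIP: add test stub' 'WIP: expand test' 'Complete tests'; do
    echo "# $msg" >> test.py
    git add test.py
    git commit -q -m "$msg"
done

# Print only the subjects that begin with "WIP:", oldest commit first.
wip_subjects=$(git log --reverse --format=%s --grep='^WIP:')
printf '%s\n' "$wip_subjects"
```

Filtering on the `WIP:` prefix leaves out both `Initial app` and `Complete tests`, which matches the two messages listed under Expected.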
deepseek/deepseek-v4-flash:high PASS 100% 84 in → 563 out (604 reasoning)
- WIP: expand test - WIP: add test stub
deepseek/deepseek-v4-flash:high__json_schema PASS 100% 85 in → 291 out (259 reasoning)
71f5304 WIP: add test stub 895c158 WIP: expand test
JSON Schema Structured Output
(raw) { "commits": [ "71f5304 WIP: add test stub", "895c158 WIP: expand test" ] }
deepseek/deepseek-v4-flash:none FAIL 100% 85 in → 100 out (0 reasoning)
Based on the commit history, the WIP commits that should be marked as squash/fixup for a cleaner test file addition are:
- `60d77e4 WIP: expand test`
- `3cdb613 WIP: add test stub`
These two commits build incrementally toward the final test file, and they should be squashed together (keeping the message of one or creating a new combined message) before reaching the `7eeb0bf Complete tests` commit.
Failure: Extra selected commit messages: ['Complete tests']
deepseek/deepseek-v4-flash:none__json_schema PASS 100% 84 in → 29 out (0 reasoning)
d333d00 WIP: add test stub 0491ff8 WIP: expand test
JSON Schema Structured Output
(raw) {"commits": ["d333d00 WIP: add test stub", "0491ff8 WIP: expand test"]}
mistralai/devstral-2512 PASS 100% 93 in → 147 out
Based on the commit history, the WIP commits that should be marked as squash/fixup for a cleaner test file addition are:
1. `b3d6cad WIP: expand test` (should be squashed/fixup into the final commit)
2. `e799a10 WIP: add test stub` (should be squashed/fixup into the final commit)
These are intermediate work-in-progress commits that would be better consolidated into the final "Complete tests" commit for a cleaner history. The commit messages to mark would be:
```
b3d6cad WIP: expand test
e799a10 WIP: add test stub
```
mistralai/devstral-2512__json_schema PASS 100% 92 in → 28 out
WIP: expand test WIP: add test stub
JSON Schema Structured Output
(raw) { "commits": [ "WIP: expand test", "WIP: add test stub" ] }
nvidia/nemotron-3-nano-30b-a3b:high PASS 100% 106 in → 185 out (142 reasoning)
- **WIP: expand test** - **WIP: add test stub**
nvidia/nemotron-3-nano-30b-a3b:high__json_schema PASS 100% 107 in → 229 out (181 reasoning)
WIP: expand test WIP: add test stub
JSON Schema Structured Output
(raw) { "commits": [ "WIP: expand test", "WIP: add test stub" ] }
nvidia/nemotron-3-nano-30b-a3b:none FAIL 100% 107 in → 275 out (0 reasoning)
Based on the provided Git commit history and the requirement to identify **WIP commits for squash/fixup** (excluding `Complete tests`), the relevant WIP commits are:
```
WIP: add test stub
WIP: expand test
```
### Why these commits?
- **`2954762 Initial app`** is the **base commit** (final complete commit), so it is **excluded** per your instructions.
- **`594fd6c WIP: expand test`** and **`2b99a81 WIP: add test stub`** are the **partial WIP commits** that should be squashed/fixed up to clean up the test file addition.
- The final commit (`9221f34 Complete tests`) is already resolved and **should not be included** in the squash/fixup process.
### Target for squashing:
You should squash **both** WIP commits (`WIP: add test stub` and `WIP: expand test`) into a single commit (e.g., using `git rebase -i HEAD~3`), **excluding** `Complete tests`.
### Commit messages to provide (as requested):
```
WIP: add test stub
WIP: expand test
```
Failure: Extra selected commit messages: ['Complete tests', 'Initial app']
nvidia/nemotron-3-nano-30b-a3b:none__json_schema PASS 100% 107 in → 28 out (0 reasoning)
WIP: add test stub WIP: expand test
JSON Schema Structured Output
(raw) { "commits": [ "WIP: add test stub", "WIP: expand test" ] }
poolside/laguna-xs.2:high PASS 100% 135 in → 311 out (198 reasoning)
Based on the commit history, here are the WIP commits that should be marked as squash/fixup for a cleaner test file addition:
1. **WIP: expand test** (640917f)
2. **WIP: add test stub** (f19d399)
These two commits represent the incremental work that was ultimately completed in the final "Complete tests" commit. Squashing or marking them as fixup would create a cleaner project history by consolidating the work-in-progress changes into the final complete commit.
poolside/laguna-xs.2:high__json_schema PASS 100% 138 in → 220 out (202 reasoning)
WIP: expand test WIP: add test stub
JSON Schema Structured Output
(raw) {"commits": ["WIP: expand test", "WIP: add test stub"]}
poolside/laguna-xs.2:none PASS 100% 134 in → 109 out (0 reasoning)
I'll identify the WIP commits that should be squashed or marked as fixup for a cleaner test file addition.
Based on the commit history, the WIP commits that should be marked as squash/fixup are:
1. "WIP: expand test"
2. "WIP: add test stub"
These two commits represent work-in-progress changes that build upon each other and should be combined with the final "Complete tests" commit for a cleaner history. The "Initial app" commit is the starting point and should remain separate.
poolside/laguna-xs.2:none__json_schema PASS 100% 135 in → 17 out (0 reasoning)
WIP: expand test WIP: add test stub
JSON Schema Structured Output
(raw) {"commits": ["WIP: expand test", "WIP: add test stub"]}