GitBench
Refactor multiple files
Tests the ability to generate a commit message for a multi-file refactor. Evaluates how well the message describes a structural reorganization across modules.

These commands set up the repo before the model sees the prompt. They define the starting file structure, staged changes, and Git history.

  1. git init
  2. git config user.email 'test@test.com'
  3. git config user.name 'Test User'
  4. echo 'old code' > helpers.py
  5. echo 'old code' > utils.py
  6. git add helpers.py utils.py
  7. git commit -m 'Initial commit'
  8. rm helpers.py utils.py
  9. mkdir -p core
  10. echo 'new code' > core/helpers.py
  11. echo 'new code' > core/utils.py
  12. git add -A
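The setup steps above can be replayed locally to inspect what ends up staged. The following is a minimal sketch, not GitBench's own harness: the function names `build_fixture` and `staged_changes`, the use of a temporary directory, the isolation from global Git config, and the final `git diff --cached` call are all assumptions added for illustration.

```python
import os
import shutil
import subprocess
import tempfile
from pathlib import Path
from typing import List, Optional


def build_fixture(repo: Path) -> List[str]:
    """Replay the setup steps in `repo` and return the staged name-status lines."""
    # Assumption: ignore the user's global/system Git config so the run is reproducible.
    env = {**os.environ, "GIT_CONFIG_GLOBAL": os.devnull, "GIT_CONFIG_NOSYSTEM": "1"}

    def git(*args: str) -> str:
        result = subprocess.run(
            ["git", *args], cwd=repo, env=env,
            check=True, capture_output=True, text=True,
        )
        return result.stdout

    git("init")
    git("config", "user.email", "test@test.com")
    git("config", "user.name", "Test User")

    # Initial commit: two top-level modules (echo appends a trailing newline).
    for name in ("helpers.py", "utils.py"):
        (repo / name).write_text("old code\n")
    git("add", "helpers.py", "utils.py")
    git("commit", "-m", "Initial commit")

    # The change under test: delete the old files, recreate them under core/.
    for name in ("helpers.py", "utils.py"):
        (repo / name).unlink()
    (repo / "core").mkdir(parents=True, exist_ok=True)
    for name in ("helpers.py", "utils.py"):
        (repo / "core" / name).write_text("new code\n")
    git("add", "-A")

    # Not part of the setup list: show the staged changes without rename detection.
    return git("diff", "--cached", "--name-status", "--no-renames").splitlines()


def staged_changes() -> Optional[List[str]]:
    """Return the staged name-status lines, or None if git is not installed."""
    if shutil.which("git") is None:
        return None
    with tempfile.TemporaryDirectory() as tmp:
        return build_fixture(Path(tmp))


if __name__ == "__main__":
    print(staged_changes())
```

With rename detection disabled, the staged diff lists two deletions (`helpers.py`, `utils.py`) and two additions (`core/helpers.py`, `core/utils.py`).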
Prompt
Generate a concise commit message (max 50 characters) describing the changes in this diff. Focus on the action taken (e.g., 'Add', 'Fix', 'Update', 'Refactor').
Expected
Refactor helpers into core module
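The prompt's two explicit constraints (a 50-character limit and a leading action verb) can be checked mechanically. The sketch below is illustrative only: `check_commit_message` is a hypothetical helper, and nothing on this page states that GitBench scores responses this way.

```python
# Verbs quoted as examples in the prompt; the prompt does not say the list is exhaustive.
ACTION_VERBS = ("Add", "Fix", "Update", "Refactor")
MAX_LENGTH = 50  # "max 50 characters" from the prompt


def check_commit_message(message: str) -> dict:
    """Report how a candidate message fares against the prompt's stated constraints."""
    # Assumption: surrounding whitespace and double quotes are not part of the message.
    subject = message.strip().strip('"')
    return {
        "length": len(subject),
        "within_limit": len(subject) <= MAX_LENGTH,
        "starts_with_listed_verb": subject.startswith(ACTION_VERBS),
    }


if __name__ == "__main__":
    print(check_commit_message("Refactor helpers into core module"))
    print(check_commit_message(
        "Add core/helpers.py and core/utils.py, remove old helpers.py and utils.py"
    ))
```

The expected answer fits within the limit, while the longest response listed in the results does not.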
poolside/laguna-xs.2:none__json_schema PASS 94.7% 345 in → 16 out (0 reasoning)
Refactor: Move helpers and utils to core directory
JSON Schema Structured Output
(raw) {"commit_message": "Refactor: Move helpers and utils to core directory"}
poolside/laguna-xs.2:none PASS 93.3% 345 in → 11 out (0 reasoning)
Refactor: move helpers and utils to core module
deepseek/deepseek-v4-flash:none__json_schema PASS 91.7% 266 in → 16 out (0 reasoning)
Move helpers.py and utils.py to core/
JSON Schema Structured Output
(raw) {"commit_message": "Move helpers.py and utils.py to core/"}
mistralai/devstral-2512__json_schema PASS 91.7% 298 in → 16 out
Refactor and move helpers and utils to core
JSON Schema Structured Output
(raw) {"commit_message": "Refactor and move helpers and utils to core"}
deepseek/deepseek-v4-flash:none PASS 91% 266 in → 8 out (0 reasoning)
Move helpers & utils to core/
poolside/laguna-xs.2:high__json_schema PASS 90% 345 in → 255 out (241 reasoning)
Move helpers and utils to core
JSON Schema Structured Output
(raw) {"commit_message": "Move helpers and utils to core"}
nvidia/nemotron-3-nano-30b-a3b:none PASS 89.7% 311 in → 11 out (0 reasoning)
Add core helpers and utils; drop old helper files
deepseek/deepseek-v4-flash:high PASS 89.3% 266 in → 380 out (382 reasoning)
Move helpers.py and utils.py to core/
mistralai/devstral-2512 PASS 88.3% 298 in → 10 out
"Move helpers and utils to core module"
poolside/laguna-xs.2:high PASS 87.7% 345 in → 274 out (263 reasoning)
Add core helpers/utils, remove old files
nvidia/nemotron-3-nano-30b-a3b:high PASS 77.7% 311 in → 461 out (389 reasoning)
Add core helpers, delete legacy helpers
nvidia/nemotron-3-nano-30b-a3b:none__json_schema PASS 50% 311 in → 26 out (0 reasoning)
Add core/helpers.py and core/utils.py, remove old helpers.py and utils.py
JSON Schema Structured Output
(raw) { "commit_message": "Add core/helpers.py and core/utils.py, remove old helpers.py and utils.py" }
nvidia/nemotron-3-nano-30b-a3b:high__json_schema FAIL 48.3% 311 in → 1,046 out (846 reasoning)
Add core/helpers.py core/utils.py; delete helpers
JSON Schema Structured Output
(raw) { "commit_message": "Add core/helpers.py core/utils.py; delete helpers" }
Invalid JSON. Output: Move utils and helpers to core/
JSON Schema Structured Output
Structured Output Error
Failure: Failed to parse structured JSON response: Expecting value: line 1 column 1 (char 0)
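The parse error above is the message Python's `json` module produces when the input does not begin with a JSON value, for example when a model returns the bare commit message instead of an object. A minimal sketch of such a parsing step follows. `parse_commit_message` is a hypothetical helper, and the `commit_message` field name is taken from the raw outputs shown on this page; the actual harness may work differently.

```python
import json
from typing import Optional, Tuple


def parse_commit_message(raw: str) -> Tuple[Optional[str], Optional[str]]:
    """Return (commit_message, error); exactly one of the two is None."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError as exc:
        # Plain text such as "Move utils and helpers to core/" fails at the first character.
        return None, f"Failed to parse structured JSON response: {exc}"
    if not isinstance(payload, dict) or not isinstance(payload.get("commit_message"), str):
        return None, "Missing string field 'commit_message'"
    return payload["commit_message"], None


if __name__ == "__main__":
    print(parse_commit_message('{"commit_message": "Move helpers and utils to core"}'))
    print(parse_commit_message("Move utils and helpers to core/"))
```

The second call returns an error string that matches the failure reported above, which is consistent with the model having emitted unwrapped text rather than the JSON object.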