GitBench
Update configuration
Tests the ability to generate a commit message for a configuration change, evaluating how well the model describes infrastructure and config modifications.

These commands set up the repo before the model sees the prompt. They define the starting file structure, staged changes, and Git history.

  1. git init
  2. git config user.email 'test@test.com'
  3. git config user.name 'Test User'
  4. printf '{\n "version": "1.0",\n "debug": false\n}\n' > config.json
  5. git add config.json
  6. git commit -m 'Add config'
  7. printf '{\n "version": "1.1",\n "debug": true,\n "cache": true\n}\n' > config.json
  8. git add config.json
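
To inspect the change the setup leaves staged, the commands can be replayed in a scratch directory and the index compared against HEAD. This is a minimal sketch, not part of the benchmark harness: the temporary directory, the `-q` and `--no-pager` flags, and the `staged_diff` function name are assumptions added for illustration, and it presumes the diff shown to the model is the staged one.

```shell
#!/bin/sh
# staged_diff is a hypothetical helper, not part of GitBench.
# It replays the setup commands in a throwaway repo and prints the staged diff.
staged_diff() {
  repo=$(mktemp -d) || return 1
  (
    cd "$repo" || exit 1
    git init -q
    git config user.email 'test@test.com'
    git config user.name 'Test User'
    printf '{\n "version": "1.0",\n "debug": false\n}\n' > config.json
    git add config.json
    git commit -q -m 'Add config'
    printf '{\n "version": "1.1",\n "debug": true,\n "cache": true\n}\n' > config.json
    git add config.json
    # --cached compares the index (staged changes) against HEAD
    git --no-pager diff --cached
  )
  status=$?
  rm -rf "$repo"   # clean up the scratch repo
  return "$status"
}
```

Calling `staged_diff` prints a unified diff for config.json in which the `version` and `debug` lines change and a `cache` line is added.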
Prompt
Generate a concise commit message (max 50 characters) describing the changes in this diff. Focus on the action taken (e.g., 'Add', 'Fix', 'Update').
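
The 50-character limit stated in the prompt is easy to check mechanically. The sketch below is one way to do that in shell; `check_subject` is a hypothetical helper, and nothing here is claimed to reflect how GitBench actually scores a response.

```shell
#!/bin/sh
# check_subject is a hypothetical helper: it succeeds when the message
# is non-empty and at most 50 characters long (the limit from the prompt).
check_subject() {
  msg=$1
  len=${#msg}
  [ "$len" -gt 0 ] && [ "$len" -le 50 ]
}

# Report the length of two messages that appear on this page.
for msg in \
  'Update config with caching enabled' \
  'Update config: bump version, enable debug and cache'
do
  if check_subject "$msg"; then
    printf 'within limit (%d chars): %s\n' "${#msg}" "$msg"
  else
    printf 'over limit   (%d chars): %s\n' "${#msg}" "$msg"
  fi
done
```

Run as written, the loop reports the expected message for this task at 34 characters and the second message at 51 characters, one over the stated maximum.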
Expected
Update config with caching enabled
nvidia/nemotron-3-nano-30b-a3b:none__json_schema PASS 84.7% 155 in → 13 out (0 reasoning)
Update version, enable debug and cache
JSON Schema Structured Output
(raw) {"commit_message":"Update version, enable debug and cache"}
mistralai/devstral-2512__json_schema PASS 83.7% 142 in → 20 out
Update config: version 1.1, enable debug & cache
JSON Schema Structured Output
(raw) {"commit_message": "Update config: version 1.1, enable debug & cache"}
mistralai/devstral-2512 PASS 78.3% 142 in → 16 out
"Update config: version 1.1, enable debug and cache"
poolside/laguna-xs.2:none__json_schema PASS 70% 195 in → 22 out (0 reasoning)
Update config: bump version, enable debug and cache
JSON Schema Structured Output
(raw) { "commit_message": "Update config: bump version, enable debug and cache" }
deepseek/deepseek-v4-flash:high__json_schema PASS 64% 139 in → 643 out (623 reasoning)
Update config to v1.1, enable debug and cache
JSON Schema Structured Output
(raw) {"commit_message": "Update config to v1.1, enable debug and cache"}
nvidia/nemotron-3-nano-30b-a3b:none PASS 64% 155 in → 11 out (0 reasoning)
Update config: bump version, enable debug and cache
poolside/laguna-xs.2:none PASS 62.7% 195 in → 12 out (0 reasoning)
Update config: bump version, enable debug and cache
nvidia/nemotron-3-nano-30b-a3b:high__json_schema PASS 61.3% 155 in → 747 out (500 reasoning)
Update config: bump version, enable debug and cache
JSON Schema Structured Output
(raw) { "commit_message": "Update config: bump version, enable debug and cache" }
poolside/laguna-xs.2:high__json_schema PASS 61.3% 195 in → 198 out (178 reasoning)
Update config: v1.1, enable debug and cache
JSON Schema Structured Output
(raw) {"commit_message": "Update config: v1.1, enable debug and cache"}
poolside/laguna-xs.2:high PASS 59.3% 195 in → 404 out (388 reasoning)
Update config: v1.1, enable debug, add cache
deepseek/deepseek-v4-flash:none PASS 58.7% 137 in → 13 out (0 reasoning)
Update config to v1.1, enable debug and cache
nvidia/nemotron-3-nano-30b-a3b:high PASS 54.3% 155 in → 1,011 out (634 reasoning)
Bump version to 1.1, enable debug and cache
deepseek/deepseek-v4-flash:none__json_schema FAIL 49.3% 137 in → 21 out (0 reasoning)
Update config: version 1.1, debug on, add cache
JSON Schema Structured Output
(raw) {"commit_message": "Update config: version 1.1, debug on, add cache"}
deepseek/deepseek-v4-flash:high FAIL 43.3% 137 in → 344 out (330 reasoning)
Update config: bump version, add cache, enable debug
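
Entries tagged `__json_schema` return the message wrapped in a JSON object with a single `commit_message` field, as the (raw) lines show. The sketch below unwraps such a response with `sed`. It assumes a flat, single-line object with no escaped quotes in the value, and `extract_message` is a hypothetical helper; a real JSON parser would be the more robust choice.

```shell
#!/bin/sh
# extract_message is a hypothetical helper. It reads one raw response on
# stdin and prints the value of "commit_message".
# Assumption: a flat, single-line object with no escaped quotes in the value.
extract_message() {
  sed -n 's/^[[:space:]]*{[[:space:]]*"commit_message"[[:space:]]*:[[:space:]]*"\(.*\)"[[:space:]]*}[[:space:]]*$/\1/p'
}

# Two raw responses copied from the results above.
printf '%s\n' '{"commit_message":"Update version, enable debug and cache"}' | extract_message
printf '%s\n' '{ "commit_message": "Update config: v1.1, enable debug and cache" }' | extract_message
```

Input that does not match the assumed shape produces no output, so a caller can treat an empty result as a parse failure.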