GitBench
File modification
Tests the ability to generate a commit message for a content modification to a single file. Evaluates how well the model describes incremental changes.

These commands set up the repo before the model sees the prompt. They define the starting file structure, staged changes, and Git history.

  1. git init
  2. git config user.email 'test@test.com'
  3. git config user.name 'Test User'
  4. echo 'original content' > app.py
  5. git add app.py
  6. git commit -m 'Initial commit'
  7. echo 'updated content' > app.py
  8. git add app.py
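The setup steps above can be replayed locally to see the diff a model would be asked to describe. The following is a minimal sketch, not the benchmark's own harness: the helper names `git` and `build_fixture` are hypothetical, and using `git diff --cached` to obtain the staged diff is an assumption, since the page does not show how the diff is produced.

```python
import subprocess
from pathlib import Path


def git(repo: Path, *args: str) -> str:
    """Run a git command inside `repo` and return its stdout."""
    result = subprocess.run(
        ["git", *args], cwd=repo, check=True, capture_output=True, text=True
    )
    return result.stdout


def build_fixture(repo: Path) -> str:
    """Replay the eight setup steps and return the staged diff (assumed view)."""
    git(repo, "init")
    git(repo, "config", "user.email", "test@test.com")
    git(repo, "config", "user.name", "Test User")
    # `echo '...' > app.py` writes the text followed by a newline.
    (repo / "app.py").write_text("original content\n")
    git(repo, "add", "app.py")
    git(repo, "commit", "-m", "Initial commit")
    (repo / "app.py").write_text("updated content\n")
    git(repo, "add", "app.py")
    # The second change is staged but not committed, so --cached shows it.
    return git(repo, "diff", "--cached")
```

Calling `build_fixture(Path(some_empty_directory))` with git installed should return a one-line change to app.py, replacing `original content` with `updated content`.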
Prompt
Generate a concise commit message (max 50 characters) describing the changes in this diff. Focus on the action taken (e.g., 'Add', 'Fix', 'Update').
Expected
Update app.py with new content
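The two format constraints in the prompt (at most 50 characters, leading action verb) can be checked mechanically. The sketch below is an illustration only: `check_message` is a hypothetical helper, the verb list contains just the three examples the prompt names, and nothing here reflects how GitBench actually computes its scores.

```python
from typing import List

MAX_LENGTH = 50  # limit stated in the prompt
ACTION_VERBS = ("Add", "Fix", "Update")  # only the examples the prompt lists


def check_message(message: str) -> List[str]:
    """Return the constraint violations found (empty list when there are none)."""
    subject = message.strip()
    if not subject:
        return ["message is empty"]
    problems = []
    if "\n" in subject:
        problems.append("message spans multiple lines")
    if len(subject) > MAX_LENGTH:
        problems.append(f"{len(subject)} characters exceeds {MAX_LENGTH}")
    if subject.split()[0] not in ACTION_VERBS:
        problems.append("does not start with a listed action verb")
    return problems
```

The expected answer is 30 characters long and passes both checks. So does "Update configuration and logging settings" (41 characters), which describes changes that are not in the diff, so a format check alone cannot tell an accurate message from an inaccurate one.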
deepseek/deepseek-v4-flash:none PASS 94% 96 in → 4 out (0 reasoning)
Update app.py content
poolside/laguna-xs.2:high__json_schema PASS 93.3% 150 in → 273 out (261 reasoning)
Update app.py content
JSON Schema Structured Output
(raw) {"commit_message": "Update app.py content"}
mistralai/devstral-2512__json_schema PASS 92.3% 103 in → 11 out
Update app.py content
JSON Schema Structured Output
(raw) {"commit_message": "Update app.py content"}
nvidia/nemotron-3-nano-30b-a3b:high PASS 91.7% 116 in → 249 out (187 reasoning)
Update app.py content
poolside/laguna-xs.2:none PASS 91.7% 150 in → 6 out (0 reasoning)
Update app.py content
deepseek/deepseek-v4-flash:none__json_schema PASS 91.3% 98 in → 9 out (0 reasoning)
Update app.py content
JSON Schema Structured Output
(raw) {"commit_message":"Update app.py content"}
deepseek/deepseek-v4-flash:high PASS 91% 96 in → 133 out (127 reasoning)
Update app.py content
nvidia/nemotron-3-nano-30b-a3b:high__json_schema PASS 91% 116 in → 111 out (104 reasoning)
Update app.py content
JSON Schema Structured Output
(raw) { "commit_message": "Update app.py content" }
poolside/laguna-xs.2:none__json_schema PASS 91% 150 in → 11 out (0 reasoning)
Update app.py content
JSON Schema Structured Output
(raw) {"commit_message": "Update app.py content"}
mistralai/devstral-2512 PASS 90% 103 in → 5 out
Update app.py content
poolside/laguna-xs.2:high PASS 83.3% 150 in → 202 out (195 reasoning)
Update app.py content
nvidia/nemotron-3-nano-30b-a3b:none__json_schema PASS 63.3% 116 in → 12 out (0 reasoning)
Update app.py with content revisions
JSON Schema Structured Output
(raw) {"commit_message":"Update app.py with content revisions"}
nvidia/nemotron-3-nano-30b-a3b:none FAIL 16.7% 116 in → 6 out (0 reasoning)
Update configuration and logging settings
Invalid JSON. Output:
JSON Schema Structured Output
Structured Output Error
Failure: Failed to parse structured JSON response: Expecting value: line 1 column 1 (char 0)
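The text "Expecting value: line 1 column 1 (char 0)" matches what Python's `json` module reports when the input does not begin with a JSON value, for example an empty string or a plain-text reply. The sketch below shows how a structured-output parser could produce that message. `extract_commit_message` is a hypothetical helper, and whether the benchmark's parser works this way is an assumption; only the `commit_message` key is taken from the raw outputs listed above.

```python
import json
from typing import Optional


def extract_commit_message(raw: str) -> Optional[str]:
    """Return the `commit_message` field, or None if `raw` is not usable JSON."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError as exc:
        # Mirrors the wording of the failure line; the prefix is illustrative.
        print(f"Failed to parse structured JSON response: {exc}")
        return None
    if not isinstance(payload, dict):
        return None
    value = payload.get("commit_message")
    return value if isinstance(value, str) else None
```

Passing one of the raw objects above, such as `{"commit_message": "Update app.py content"}`, returns the message string. Passing a bare sentence returns `None` and prints the same parse error, because the first character is not the start of a JSON value.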