GitBench
Fix bug in function
Tests the ability to generate a commit message for a bug fix. Evaluates whether the model recognizes and describes the fix semantics in a diff.

These commands set up the repo before the model sees the prompt. They define the starting file structure, staged changes, and Git history.

  1. git init
  2. git config user.email 'test@test.com'
  3. git config user.name 'Test User'
  4. printf 'def add(a, b):\n return a + b\n' > calculator.py
  5. git add calculator.py
  6. git commit -m 'Add calculator'
  7. printf 'def add(a, b):\n return int(a) + int(b)\n' > calculator.py
  8. git add calculator.py
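To see what the model is asked to describe, the two versions of calculator.py written in steps 4 and 7 can be compared directly. The sketch below is an illustration, not part of the benchmark harness. It uses Python's difflib as a stand-in for `git diff --cached` (so the headers differ from real Git output), then runs both versions of `add` on string arguments to show the behavior change. The names `BEFORE`, `AFTER`, and `load_add` are hypothetical.

```python
import difflib

# File contents written by the two printf commands in the setup.
BEFORE = "def add(a, b):\n return a + b\n"
AFTER = "def add(a, b):\n return int(a) + int(b)\n"


def load_add(source):
    """Execute one version of calculator.py and return its add function."""
    namespace = {}
    exec(source, namespace)
    return namespace["add"]


# Approximate the staged diff; real `git diff --cached` output has extra headers.
diff_lines = list(
    difflib.unified_diff(
        BEFORE.splitlines(keepends=True),
        AFTER.splitlines(keepends=True),
        fromfile="a/calculator.py",
        tofile="b/calculator.py",
    )
)
print("".join(diff_lines), end="")

old_add = load_add(BEFORE)
new_add = load_add(AFTER)

# With string arguments the original version concatenates instead of adding.
print(repr(old_add("2", "3")))  # '23'
print(repr(new_add("2", "3")))  # 5
```

The change from `'23'` to `5` is presumably why the expected message frames the commit as a fix for string inputs rather than a plain refactor.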
Prompt
Generate a concise commit message (max 50 characters) describing the changes in this diff. Focus on the action taken (e.g., 'Add', 'Fix', 'Update').
Expected
Fix add function to handle string inputs
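The prompt's 50-character cap is easy to check mechanically. The snippet below is a minimal sketch of such a check, applied to the expected message and two model outputs quoted on this page. `MAX_LEN` and `within_limit` are hypothetical names, and nothing here reflects how GitBench actually computes its scores.

```python
MAX_LEN = 50  # limit stated in the prompt


def within_limit(message, max_len=MAX_LEN):
    """Return True if the commit subject fits within the character cap."""
    return len(message.strip()) <= max_len


messages = [
    "Fix add function to handle string inputs",  # expected answer
    "Fix: convert add arguments to int",
    "Update add() to explicitly cast inputs to int",
]

for message in messages:
    print(f"{len(message):2d} {within_limit(message)} {message}")
```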
deepseek/deepseek-v4-flash:high PASS 93.3% 119 in → 331 out (323 reasoning)
Fix: convert add arguments to int
deepseek/deepseek-v4-flash:none PASS 90% 119 in → 7 out (0 reasoning)
Fix type conversion in add function
nvidia/nemotron-3-nano-30b-a3b:none__json_schema PASS 89.7% 143 in → 17 out (0 reasoning)
Update add() to cast arguments to int
JSON Schema Structured Output
(raw) { "commit_message": "Update add() to cast arguments to int" }
mistralai/devstral-2512 PASS 88.7% 130 in → 10 out
"Fix add() to handle string inputs"
nvidia/nemotron-3-nano-30b-a3b:high__json_schema PASS 88.7% 143 in → 670 out (475 reasoning)
Cast operands to int in add()
JSON Schema Structured Output
(raw) { "commit_message": "Cast operands to int in add()" }
poolside/laguna-xs.2:high PASS 87.7% 176 in → 253 out (242 reasoning)
Update add function to convert inputs to int
poolside/laguna-xs.2:none__json_schema PASS 86.7% 176 in → 14 out (0 reasoning)
Cast inputs to int in add function
JSON Schema Structured Output
(raw) {"commit_message": "Cast inputs to int in add function"}
nvidia/nemotron-3-nano-30b-a3b:none PASS 85% 143 in → 10 out (0 reasoning)
Update add() to explicitly cast inputs to int
deepseek/deepseek-v4-flash:none__json_schema PASS 83.3% 121 in → 13 out (0 reasoning)
Fix add to convert args to int
JSON Schema Structured Output
(raw) {"commit_message": "Fix add to convert args to int"}
poolside/laguna-xs.2:high__json_schema PASS 83.3% 176 in → 363 out (348 reasoning)
Cast inputs to int in add function
JSON Schema Structured Output
(raw) {"commit_message": "Cast inputs to int in add function"}
poolside/laguna-xs.2:none PASS 81.7% 176 in → 10 out (0 reasoning)
Fix add function to convert inputs to int
mistralai/devstral-2512__json_schema PASS 81% 130 in → 13 out
Fix addition to handle string inputs
JSON Schema Structured Output
(raw) {"commit_message": "Fix addition to handle string inputs"}
nvidia/nemotron-3-nano-30b-a3b:high PASS 80% 143 in → 232 out (187 reasoning)
Convert inputs to int in add function
Invalid JSON. Output:
JSON Schema Structured Output
Structured Output Error
Failed to parse structured JSON response: Expecting value: line 1 column 1 (char 0)
Failure: Failed to parse structured JSON response: Expecting value: line 1 column 1 (char 0)
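For the `__json_schema` variants, the raw response is a JSON object with a single `commit_message` field. The parse error quoted above matches the message Python's `json` module produces when the response has no JSON value at position 0. The following sketch shows both cases. `extract_commit_message` is a hypothetical helper, not the benchmark's own parser, and an empty response is only one possible cause of the failure shown.

```python
import json


def extract_commit_message(raw):
    """Parse a structured response and return (message, error)."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, f"Failed to parse structured JSON response: {exc}"
    if not isinstance(payload, dict):
        return None, "Structured response is not a JSON object"
    return payload.get("commit_message"), None


# A raw response copied from the results above.
ok_message, ok_error = extract_commit_message(
    '{ "commit_message": "Update add() to cast arguments to int" }'
)
print(ok_message)

# Assumption: an empty response, one input that triggers this error.
bad_message, bad_error = extract_commit_message("")
print(bad_error)
```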