GitBench
Show commit touching multiple files
Tests whether the model can inspect a single commit that touches multiple files and report how many files it changed.

These commands set up the repo before the model sees the prompt. They define the starting file structure, staged changes, and Git history.

  1. git init
  2. git config user.email 'test@test.com'
  3. git config user.name 'Test User'
  4. echo 'src' > main.py
  5. echo 'test' > test_main.py
  6. echo 'docs' > README.md
  7. git add main.py test_main.py README.md
  8. git commit -m 'Add all project files'
Prompt
Using git show --stat, how many files were added in the commit 'Add all project files'? Output ONLY the number, nothing else.
Expected
3
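The expected answer can be checked locally by replaying the setup commands and reading the stat summary. The following is a minimal sketch, assuming a POSIX shell with git installed. The throwaway directory from `mktemp -d` and the `repo` and `count` variable names are assumptions for illustration; the benchmark's actual sandbox is not shown here.

```shell
set -e

# Work in a throwaway directory (assumption: the real sandbox layout is unknown).
repo=$(mktemp -d)
cd "$repo"

# Replay the setup commands listed above (-q only silences git's chatter).
git init -q
git config user.email 'test@test.com'
git config user.name 'Test User'
echo 'src' > main.py
echo 'test' > test_main.py
echo 'docs' > README.md
git add main.py test_main.py README.md
git commit -q -m 'Add all project files'

# Human-readable view: one line per file, then a "files changed" summary.
git show --stat HEAD

# Machine-readable count: --format= drops the commit header, and each
# per-file stat line contains a '|' separator, so count those lines.
count=$(git show --stat --format= HEAD | grep -c '|')
echo "$count"
```

Counting the `|` lines is one of several ways to get the number; reading the "files changed" summary line from `git show --stat` gives the same answer.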
deepseek/deepseek-v4-flash:high PASS 100% 292 in → 192 out (198 reasoning)
3
deepseek/deepseek-v4-flash:high__json_schema PASS 100% 354 in → 98 out (96 reasoning)
3
JSON Schema Structured Output
(raw) {"count":3}
deepseek/deepseek-v4-flash:none PASS 100% 292 in → 2 out (0 reasoning)
3
mistralai/devstral-2512 PASS 100% 332 in → 2 out
3
mistralai/devstral-2512__json_schema PASS 100% 336 in → 7 out
3
JSON Schema Structured Output
(raw) {"count": 3}
nvidia/nemotron-3-nano-30b-a3b:high PASS 100% 347 in → 81 out (83 reasoning)
3
nvidia/nemotron-3-nano-30b-a3b:high__json_schema PASS 100% 342 in → 104 out (99 reasoning)
3
JSON Schema Structured Output
(raw) { "count": 3 }
nvidia/nemotron-3-nano-30b-a3b:none PASS 100% 342 in → 2 out (0 reasoning)
3
nvidia/nemotron-3-nano-30b-a3b:none__json_schema PASS 100% 344 in → 15 out (0 reasoning)
3
JSON Schema Structured Output
(raw) { "count": 3 }
poolside/laguna-xs.2:high PASS 100% 386 in → 195 out (191 reasoning)
3
poolside/laguna-xs.2:high__json_schema PASS 100% 382 in → 144 out (136 reasoning)
3
JSON Schema Structured Output
(raw) {"count": 3}
poolside/laguna-xs.2:none PASS 100% 379 in → 3 out (0 reasoning)
3
poolside/laguna-xs.2:none__json_schema PASS 100% 382 in → 7 out (0 reasoning)
3
JSON Schema Structured Output
(raw) {"count": 3}
deepseek/deepseek-v4-flash:none__json_schema FAIL 0% 294 in → 7 out (0 reasoning)
0
JSON Schema Structured Output
(raw) {"count": 0}
Failure: Expected '3', got '0'
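The structured-output runs return the answer wrapped in a `{"count": N}` object with varying whitespace, and the failure above is a wrong value rather than a formatting problem. The harness's real grader is not shown, so the sketch below is only an assumption about how such a comparison could work: `extract_count` and `grade` are hypothetical helpers that pull the integer out of the raw payload and compare it with the expected string.

```shell
# Hypothetical helper: extract the integer from a {"count": N} payload,
# tolerating optional whitespace around the colon.
extract_count() {
  printf '%s\n' "$1" |
    sed -n 's/.*"count"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p'
}

# Hypothetical grader: exact string match against the expected answer.
grade() {
  if [ "$(extract_count "$1")" = "$2" ]; then
    echo PASS
  else
    echo FAIL
  fi
}

grade '{"count":3}' 3      # compact form seen in the results
grade '{ "count": 3 }' 3   # padded form seen in the results
grade '{"count": 0}' 3     # the failing payload
```

Under these assumptions the three raw payload styles seen in the results all normalize to the same value, so only the payload carrying `0` is rejected.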