GitBench
Show diff between two commits
Tests the ability to show the diff between two commits. Evaluates comparative commit inspection.

These commands set up the repo before the model sees the prompt. They define the starting file structure, staged changes, and Git history.

  1. git init
  2. git config user.email 'test@test.com'
  3. git config user.name 'Test User'
  4. printf 'line1\nline2\nline3\n' > file.txt
  5. git add file.txt
  6. git commit -m 'Original three lines'
  7. printf 'line1\nmodified\nline3\n' > file.txt
  8. git add file.txt
  9. git commit -m 'Changed line2'
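The setup steps above can be replayed as a single script, which may help when checking a model's answer by hand. This is a minimal sketch, not the harness's own runner: the scratch directory from `mktemp -d` and the `-q` flags are additions for convenience, and the newline escapes in the `printf` arguments are assumed from the commit messages, which describe a three-line file.

```shell
#!/bin/sh
# Sketch: reproduce the benchmark setup in a throwaway directory.
set -eu

# Hypothetical scratch location; the benchmark does not say where it runs.
repo_dir=$(mktemp -d)
cd "$repo_dir"

git init -q
git config user.email 'test@test.com'
git config user.name 'Test User'

# First commit: a three-line file (newline escapes are an assumption).
printf 'line1\nline2\nline3\n' > file.txt
git add file.txt
git commit -q -m 'Original three lines'

# Second commit: only the middle line changes.
printf 'line1\nmodified\nline3\n' > file.txt
git add file.txt
git commit -q -m 'Changed line2'

# List the commit subjects, newest first.
git log --format=%s
```

Running it leaves the shell inside a repository with two commits, the newer one carrying the subject 'Changed line2'.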
Prompt
Using git show --stat, how many files were changed in the commit 'Changed line2'? Output ONLY the number, nothing else.
Expected
1
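One way to confirm the expected value is to read the summary line that `git show --stat` prints for the commit. The sketch below assumes it runs inside the repository built by the setup commands; `count_changed_files` is a hypothetical helper name, and looking the commit up by its message with `git log --grep` is an illustrative choice rather than something the benchmark prescribes.

```shell
#!/bin/sh
# Sketch: count the files changed by the newest commit whose message
# contains the text given as $1.
# count_changed_files is a hypothetical helper, not part of the benchmark.
count_changed_files() {
    # Resolve the newest commit whose message contains the literal text.
    commit=$(git log --fixed-strings --grep="$1" --format=%H -n 1)
    # Stop if no commit matched.
    [ -n "$commit" ] || return 1
    # --format= drops the commit header so only the stat block remains.
    # Its summary line reads like " 1 file changed, 1 insertion(+), ...",
    # so the first field is the file count.
    git show --stat --format= "$commit" | awk '/files? changed/ { print $1 }'
}
```

In the benchmark repository, `count_changed_files 'Changed line2'` prints `1`, since that commit touches only file.txt.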
deepseek/deepseek-v4-flash:high PASS 100% 181 in → 52 out (50 reasoning)
1
deepseek/deepseek-v4-flash:high__json_schema PASS 100% 178 in → 83 out (76 reasoning)
1
JSON Schema Structured Output
(raw) { "count": 1 }
deepseek/deepseek-v4-flash:none PASS 100% 182 in → 2 out (0 reasoning)
1
deepseek/deepseek-v4-flash:none__json_schema PASS 100% 180 in → 8 out (0 reasoning)
1
JSON Schema Structured Output
(raw) { "count": 1 }
mistralai/devstral-2512 PASS 100% 207 in → 2 out
1
mistralai/devstral-2512__json_schema PASS 100% 208 in → 7 out
1
JSON Schema Structured Output
(raw) {"count": 1}
nvidia/nemotron-3-nano-30b-a3b:high PASS 100% 219 in → 103 out (99 reasoning)
1
nvidia/nemotron-3-nano-30b-a3b:high__json_schema PASS 100% 219 in → 87 out (82 reasoning)
1
JSON Schema Structured Output
(raw) { "count": 1 }
nvidia/nemotron-3-nano-30b-a3b:none PASS 100% 220 in → 2 out (0 reasoning)
1
nvidia/nemotron-3-nano-30b-a3b:none__json_schema PASS 100% 220 in → 8 out (0 reasoning)
1
JSON Schema Structured Output
(raw) { "count": 1 }
poolside/laguna-xs.2:high PASS 100% 256 in → 204 out (201 reasoning)
1
poolside/laguna-xs.2:high__json_schema PASS 100% 252 in → 139 out (131 reasoning)
1
JSON Schema Structured Output
(raw) {"count": 1}
poolside/laguna-xs.2:none PASS 100% 254 in → 3 out (0 reasoning)
1
poolside/laguna-xs.2:none__json_schema PASS 100% 253 in → 11 out (0 reasoning)
1
JSON Schema Structured Output
(raw) { "count": 1 }