GitBench
Count commits by a specific author
Tests the ability to count commits by a specific author. Evaluates whether the model can combine an author filter with counting.

These commands set up the repo before the model sees the prompt. They define the starting file structure, staged changes, and Git history.

  1. git init
  2. git config user.email 'alice@example.com'
  3. git config user.name 'Alice'
  4. echo 'a' > file.txt
  5. git add file.txt
  6. git commit -m 'Alice commit 1'
  7. git config user.email 'bob@example.com'
  8. git config user.name 'Bob'
  9. echo 'b' > file.txt
  10. git add file.txt
  11. git commit -m 'Bob commit 1'
  12. git config user.email 'alice@example.com'
  13. git config user.name 'Alice'
  14. echo 'c' > file.txt
  15. git add file.txt
  16. git commit -m 'Alice commit 2'
  17. git config user.email 'alice@example.com'
  18. git config user.name 'Alice'
  19. echo 'd' > file.txt
  20. git add file.txt
  21. git commit -m 'Alice commit 3'
  22. git config user.email 'bob@example.com'
  23. git config user.name 'Bob'
  24. echo 'e' > file.txt
  25. git add file.txt
  26. git commit -m 'Bob commit 2'
Prompt
How many commits were authored by Alice? Output ONLY the number, nothing else.
Expected
3
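The expected answer can be checked by replaying the setup in a scratch repository and letting Git do the counting. The following is a minimal sketch, not part of the benchmark itself: the `commit_as` helper and the temporary directory are assumptions introduced here for brevity, and `git rev-list --count --author=...` is just one of several ways to combine an author filter with counting.

```shell
#!/bin/sh
set -eu

# Work in a throwaway directory (assumption: the benchmark's real sandbox may differ).
repo=$(mktemp -d)
cd "$repo"
git init -q

# Hypothetical helper: set the identity, overwrite file.txt, and commit.
# $1 = user.name, $2 = user.email, $3 = file content, $4 = commit message
commit_as() {
    git config user.name "$1"
    git config user.email "$2"
    echo "$3" > file.txt
    git add file.txt
    git commit -q -m "$4"
}

# Same commit sequence as the setup commands above.
commit_as Alice alice@example.com a 'Alice commit 1'
commit_as Bob   bob@example.com   b 'Bob commit 1'
commit_as Alice alice@example.com c 'Alice commit 2'
commit_as Alice alice@example.com d 'Alice commit 3'
commit_as Bob   bob@example.com   e 'Bob commit 2'

# --author filters by a pattern matched against the author line; --count prints the total.
count=$(git rev-list --count --author='Alice' HEAD)
echo "$count"   # prints 3
```

An equivalent pipeline is `git log --author='Alice' --oneline | wc -l`, although `wc -l` may pad its output with spaces on some platforms.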
deepseek/deepseek-v4-flash:high PASS 100% 704 in → 88 out (86 reasoning)
3
deepseek/deepseek-v4-flash:high__json_schema PASS 100% 704 in → 79 out (70 reasoning)
3
JSON Schema Structured Output
(raw) { "count": 3 }
deepseek/deepseek-v4-flash:none PASS 100% 700 in → 2 out (0 reasoning)
3
deepseek/deepseek-v4-flash:none__json_schema PASS 100% 702 in → 9 out (0 reasoning)
3
JSON Schema Structured Output
(raw) { "count": 3 }
mistralai/devstral-2512 PASS 100% 888 in → 2 out
3
mistralai/devstral-2512__json_schema PASS 100% 892 in → 7 out
3
JSON Schema Structured Output
(raw) {"count": 3}
nvidia/nemotron-3-nano-30b-a3b:high PASS 100% 922 in → 91 out (74 reasoning)
3
nvidia/nemotron-3-nano-30b-a3b:high__json_schema PASS 100% 910 in → 68 out (55 reasoning)
3
JSON Schema Structured Output
(raw) { "count": 3 }
poolside/laguna-xs.2:high PASS 100% 938 in → 239 out (235 reasoning)
3
poolside/laguna-xs.2:high__json_schema PASS 100% 917 in → 885 out (884 reasoning)
3
JSON Schema Structured Output
(raw) {"count":3}
poolside/laguna-xs.2:none PASS 100% 938 in → 3 out (0 reasoning)
3
nvidia/nemotron-3-nano-30b-a3b:none FAIL 0% 914 in → 3 out (0 reasoning)
3 ways
Failure: Expected '3', got '3 ways'
nvidia/nemotron-3-nano-30b-a3b:none__json_schema FAIL 0% 909 in → 8 out (0 reasoning)
4
JSON Schema Structured Output
(raw) { "count": 4 }
Failure: Expected '3', got '4'
poolside/laguna-xs.2:none__json_schema FAIL 0% 938 in → 7 out (0 reasoning)
4
JSON Schema Structured Output
(raw) {"count": 4}
Failure: Expected '3', got '4'
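The failure messages above suggest the output is compared to the expected string exactly, which is why '3 ways' fails even though it begins with the right digit. The sketch below illustrates that behavior under that assumption. The `grade` function is hypothetical and not the benchmark's actual grader, which may, for example, trim whitespace before comparing.

```shell
#!/bin/sh

# Hypothetical exact-match grader.
# $1 = expected answer, $2 = model output
grade() {
    if [ "$1" = "$2" ]; then
        echo "PASS"
    else
        echo "Failure: Expected '$1', got '$2'"
    fi
}

grade 3 3          # prints PASS
grade 3 '3 ways'   # prints Failure: Expected '3', got '3 ways'
grade 3 4          # prints Failure: Expected '3', got '4'
```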