GitBench
Identify commits by a specific author
Tests the ability to identify commits by a specific author. Evaluates author filtering in git log.

These commands set up the repo before the model sees the prompt. They define the starting file structure, staged changes, and Git history.

  1. git init
  2. git config user.email 'alice@example.com'
  3. git config user.name 'Alice'
  4. echo 'v1' > code.py
  5. git add code.py
  6. git commit -m 'Initial code by Alice'
  7. git config user.email 'bob@example.com'
  8. git config user.name 'Bob'
  9. echo 'v2' > code.py
  10. git add code.py
  11. git commit -m 'Refactor by Bob'
  12. git config user.email 'alice@example.com'
  13. git config user.name 'Alice'
  14. echo 'v3' > code.py
  15. git add code.py
  16. git commit -m 'Bugfix by Alice'
  17. git config user.email 'bob@example.com'
  18. git config user.name 'Bob'
  19. echo 'v4' > code.py
  20. git add code.py
  21. git commit -m 'Feature by Bob'
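For reference, the setup steps above can be replayed and queried with the `--author` filter of `git log`. The following is a minimal Python sketch, not part of the benchmark itself. The helper names `git`, `build_repo`, and `commits_by` are hypothetical, and the sketch assumes `git` is installed and on PATH.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def git(repo, *args):
    """Run a git command inside `repo` and return its stdout (hypothetical helper)."""
    result = subprocess.run(
        ["git", *args], cwd=repo, check=True, capture_output=True, text=True
    )
    return result.stdout


def build_repo(repo):
    """Replay the setup: Alice and Bob alternate commits to code.py."""
    git(repo, "init")
    history = [
        ("Alice", "alice@example.com", "v1", "Initial code by Alice"),
        ("Bob", "bob@example.com", "v2", "Refactor by Bob"),
        ("Alice", "alice@example.com", "v3", "Bugfix by Alice"),
        ("Bob", "bob@example.com", "v4", "Feature by Bob"),
    ]
    for name, email, content, message in history:
        # The identity is switched via repo-local config before each commit.
        git(repo, "config", "user.email", email)
        git(repo, "config", "user.name", name)
        Path(repo, "code.py").write_text(content + "\n")
        git(repo, "add", "code.py")
        git(repo, "commit", "-m", message)


def commits_by(repo, author):
    """Return commit subjects whose author matches `author`, newest first."""
    # --author filters on the author header; --format=%s prints only the subject.
    return git(repo, "log", f"--author={author}", "--format=%s").splitlines()


if __name__ == "__main__":
    if shutil.which("git") is None:
        raise SystemExit("git not found on PATH")
    with tempfile.TemporaryDirectory() as repo:
        build_repo(repo)
        print("\n".join(commits_by(repo, "Alice")))
```

Because `git log` lists the newest commit first, this prints "Bugfix by Alice" before "Initial code by Alice".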
Prompt
Which commits were authored by Alice? Output ONLY the commit messages, one per line, nothing else.
Expected
Initial code by Alice Bugfix by Alice
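The page does not show how answers are graded. Since the passing outputs below list the two messages in the opposite order from the expected answer, an order-insensitive comparison is one plausible reading. The following Python sketch works under that assumption; `normalize` and `matches_expected` are hypothetical names, not the benchmark's grader.

```python
def normalize(output):
    """Split output into stripped, non-empty lines and sort them to ignore order."""
    return sorted(line.strip() for line in output.splitlines() if line.strip())


def matches_expected(expected, actual):
    """Assumed grading rule: same set of lines, in any order."""
    return normalize(expected) == normalize(actual)


expected = "Initial code by Alice\nBugfix by Alice"
actual = "Bugfix by Alice\nInitial code by Alice\n"
print(matches_expected(expected, actual))  # True
```

An answer that includes one of Bob's commits, or omits one of Alice's, would fail this check.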
deepseek/deepseek-v4-flash:high PASS 100% 582 in → 53 out (42 reasoning)
Bugfix by Alice Initial code by Alice
deepseek/deepseek-v4-flash:none PASS 100% 574 in → 10 out (0 reasoning)
Bugfix by Alice Initial code by Alice
deepseek/deepseek-v4-flash:none__json_schema PASS 100% 581 in → 31 out (0 reasoning)
Bugfix by Alice Initial code by Alice
JSON Schema Structured Output
(raw) { "commit_message_list": [ "Bugfix by Alice", "Initial code by Alice" ] }
mistralai/devstral-2512 PASS 100% 726 in → 10 out
Bugfix by Alice Initial code by Alice
mistralai/devstral-2512__json_schema PASS 100% 721 in → 26 out
Bugfix by Alice Initial code by Alice
JSON Schema Structured Output
(raw) { "commit_message_list": [ "Bugfix by Alice", "Initial code by Alice" ] }
nvidia/nemotron-3-nano-30b-a3b:high PASS 100% 733 in → 100 out (105 reasoning)
Bugfix by Alice Initial code by Alice
nvidia/nemotron-3-nano-30b-a3b:high__json_schema PASS 100% 742 in → 125 out (123 reasoning)
Bugfix by Alice Initial code by Alice
JSON Schema Structured Output
(raw) { "commit_message_list": [ "Bugfix by Alice", "Initial code by Alice" ] }
nvidia/nemotron-3-nano-30b-a3b:none PASS 100% 746 in → 10 out (0 reasoning)
Bugfix by Alice Initial code by Alice
nvidia/nemotron-3-nano-30b-a3b:none__json_schema PASS 100% 732 in → 29 out (0 reasoning)
Bugfix by Alice Initial code by Alice
JSON Schema Structured Output
(raw) { "commit_message_list": [ "Bugfix by Alice", "Initial code by Alice" ] }
poolside/laguna-xs.2:high PASS 100% 777 in → 372 out (360 reasoning)
Bugfix by Alice Initial code by Alice
poolside/laguna-xs.2:high__json_schema PASS 100% 751 in → 228 out (195 reasoning)
Bugfix by Alice Initial code by Alice
JSON Schema Structured Output
(raw) { "commit_message_list": [ "Bugfix by Alice", "Initial code by Alice" ] }
poolside/laguna-xs.2:none PASS 100% 767 in → 11 out (0 reasoning)
Bugfix by Alice Initial code by Alice
poolside/laguna-xs.2:none__json_schema PASS 100% 775 in → 19 out (0 reasoning)
Bugfix by Alice Initial code by Alice
JSON Schema Structured Output
(raw) {"commit_message_list": ["Bugfix by Alice", "Initial code by Alice"]}
Invalid JSON. Output: Bugfixby Alice Initial code by Alice
JSON Schema Structured Output
Structured Output Error
Failed to parse structured JSON response: Expecting value: line 1 column 1 (char 0)
Failure: Failed to parse structured JSON response: Expecting value: line 1 column 1 (char 0)
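The parse error above is what Python's `json` module reports when the response is plain text instead of a JSON object. The sketch below shows how a structured response could be validated. The field name `commit_message_list` comes from the raw outputs above, while the function name `parse_commit_messages` and the error wrapping are assumptions rather than the benchmark's actual code.

```python
import json


def parse_commit_messages(raw):
    """Parse a structured response and return its list of commit messages."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError as exc:
        # Plain-text answers land here because they do not start with a JSON value.
        raise ValueError(f"Failed to parse structured JSON response: {exc}") from exc
    messages = payload.get("commit_message_list") if isinstance(payload, dict) else None
    if not isinstance(messages, list) or not all(isinstance(m, str) for m in messages):
        raise ValueError("Missing or malformed 'commit_message_list'")
    return messages


good = '{"commit_message_list": ["Bugfix by Alice", "Initial code by Alice"]}'
print(parse_commit_messages(good))

try:
    parse_commit_messages("Bugfix by Alice\nInitial code by Alice")
except ValueError as err:
    print(err)
```

The second call fails at the first character, which matches the "line 1 column 1 (char 0)" position in the reported error.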