GitBench
Search commit messages using git log --grep
Tests the ability to search commit messages with git log --grep. Evaluates whether the model distinguishes searching the log (commit messages) from searching file contents.

These commands set up the repo before the model sees the prompt. They define the starting file structure, staged changes, and Git history.

  1. git init
  2. git config user.email 'test@test.com'
  3. git config user.name 'Test User'
  4. echo 'alpha' > alpha.txt
  5. git add alpha.txt
  6. git commit -m 'Add alpha module'
  7. echo 'beta' > beta.txt
  8. git add beta.txt
  9. git commit -m 'Fix beta parsing bug'
  10. echo 'gamma' > gamma.txt
  11. git add gamma.txt
  12. git commit -m 'Add gamma feature'
  13. echo 'updated alpha' > alpha.txt
  14. git add alpha.txt
  15. git commit -m 'Fix alpha edge case'
  16. echo 'git log --oneline --grep=Fix' > .grep_command
  17. git add .grep_command
  18. git commit -m 'Add grep sentinel'
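To check the expected answer locally, the setup steps can be replayed in a scratch directory and the matching commits counted. This is a minimal sketch, not the benchmark's own harness: the temporary directory, the -q flags, and the count variable are assumptions added for illustration.

```shell
#!/bin/sh
set -eu

# Work in a throwaway directory (an assumption; the benchmark's real sandbox is not shown).
repo=$(mktemp -d)
cd "$repo"

git init -q
git config user.email 'test@test.com'
git config user.name 'Test User'

echo 'alpha' > alpha.txt
git add alpha.txt
git commit -q -m 'Add alpha module'

echo 'beta' > beta.txt
git add beta.txt
git commit -q -m 'Fix beta parsing bug'

echo 'gamma' > gamma.txt
git add gamma.txt
git commit -q -m 'Add gamma feature'

echo 'updated alpha' > alpha.txt
git add alpha.txt
git commit -q -m 'Fix alpha edge case'

echo 'git log --oneline --grep=Fix' > .grep_command
git add .grep_command
git commit -q -m 'Add grep sentinel'

# Run the command stored in the sentinel file and count the lines it prints.
# --grep matches commit messages only and is case-sensitive by default.
count=$(git log --oneline --grep=Fix | wc -l | tr -d ' ')
echo "$count"   # prints 2
```

Only 'Fix beta parsing bug' and 'Fix alpha edge case' match; the sentinel commit's message does not contain 'Fix', even though the file it adds does.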
Prompt
Here is the output of a git log --grep command run on this repository. How many commits contain the word 'Fix' in their message? Output ONLY the number, nothing else.
Expected
2
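The expected value depends on searching commit messages rather than file contents or the full history. The sketch below rebuilds the same repository and compares the three counts. The commit_file helper and the variable names are hypothetical, introduced only to keep the example short.

```shell
#!/bin/sh
set -eu

repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email 'test@test.com'
git config user.name 'Test User'

# Hypothetical helper: write content to a file and commit it with a message.
commit_file() {
    printf '%s\n' "$2" > "$1"
    git add "$1"
    git commit -q -m "$3"
}

commit_file alpha.txt 'alpha' 'Add alpha module'
commit_file beta.txt 'beta' 'Fix beta parsing bug'
commit_file gamma.txt 'gamma' 'Add gamma feature'
commit_file alpha.txt 'updated alpha' 'Fix alpha edge case'
commit_file .grep_command 'git log --oneline --grep=Fix' 'Add grep sentinel'

# Log search: commits whose message matches the pattern.
log_hits=$(git log --oneline --grep=Fix | wc -l | tr -d ' ')

# File search: tracked files whose contents match the pattern.
file_hits=$(git grep -l Fix | wc -l | tr -d ' ')

# All commits, regardless of message.
total=$(git log --oneline | wc -l | tr -d ' ')

echo "log=$log_hits files=$file_hits total=$total"   # prints log=2 files=1 total=5
```

The single file hit is .grep_command, whose contents include the string 'Fix'. It is a match for git grep but not for git log --grep.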
deepseek/deepseek-v4-flash:high PASS 100% 60 in → 127 out (124 reasoning)
2
deepseek/deepseek-v4-flash:none__json_schema PASS 100% 60 in → 9 out (0 reasoning)
2
JSON Schema Structured Output
(raw) { "count": 2 }
mistralai/devstral-2512 PASS 100% 63 in → 2 out
2
mistralai/devstral-2512__json_schema PASS 100% 63 in → 7 out
2
JSON Schema Structured Output
(raw) {"count": 2}
nvidia/nemotron-3-nano-30b-a3b:high PASS 100% 76 in → 75 out (79 reasoning)
2
nvidia/nemotron-3-nano-30b-a3b:high__json_schema PASS 100% 73 in → 117 out (110 reasoning)
2
JSON Schema Structured Output
(raw) { "count": 2 }
nvidia/nemotron-3-nano-30b-a3b:none PASS 100% 77 in → 2 out (0 reasoning)
2
nvidia/nemotron-3-nano-30b-a3b:none__json_schema PASS 100% 77 in → 8 out (0 reasoning)
2
JSON Schema Structured Output
(raw) { "count": 2 }
poolside/laguna-xs.2:high PASS 100% 110 in → 85 out (81 reasoning)
2
poolside/laguna-xs.2:high__json_schema PASS 100% 112 in → 86 out (74 reasoning)
2
JSON Schema Structured Output
(raw) { "count": 2 }
poolside/laguna-xs.2:none PASS 100% 112 in → 3 out (0 reasoning)
2
poolside/laguna-xs.2:none__json_schema PASS 100% 112 in → 11 out (0 reasoning)
2
JSON Schema Structured Output
(raw) { "count": 2 }
Invalid structured output. Output: 2
JSON Schema Structured Output
Structured Output Error
Failure: Structured output schema validation failed: $ must be of type object
deepseek/deepseek-v4-flash:none FAIL 0% 60 in → 2 out (0 reasoning)
3
Failure: Expected '2', got '3'