GitBench
Search across branches using git grep on a specific ref
Tests the ability to search another branch with git grep by passing a specific ref, without checking that branch out. Evaluates revision-scoped search.

These commands set up the repo before the model sees the prompt. They define the starting file structure, staged changes, and Git history.

  1. git init
  2. git config user.email 'test@test.com'
  3. git config user.name 'Test User'
  4. echo 'version = 1.0' > version.txt
  5. git add version.txt
  6. git commit -m 'Initial version 1.0'
  7. git checkout -b feature/v2
  8. echo 'version = 2.0' > version.txt
  9. git add version.txt
  10. git commit -m 'Bump to version 2.0'
  11. git checkout main
  12. echo 'git grep version feature/v2' > .grep_command
  13. git add .grep_command
  14. git commit -m 'Add grep sentinel'
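The setup above can be reproduced as a single script to see what the sentinel command returns. This is a minimal sketch, not the benchmark's own harness: the temporary directory, the `-q` flags, the `grep_output` variable, and the `git symbolic-ref` line (added so that `git checkout main` works even when the default branch name is not `main`) are assumptions layered on top of the listed commands.

```shell
#!/bin/sh
set -eu

# Work in a throwaway directory (assumed; the benchmark's location is not stated).
repo=$(mktemp -d)
cd "$repo"

git init -q
# Assumption: force the unborn branch to be named 'main' so step 11 succeeds
# regardless of the local init.defaultBranch setting.
git symbolic-ref HEAD refs/heads/main
git config user.email 'test@test.com'
git config user.name 'Test User'

echo 'version = 1.0' > version.txt
git add version.txt
git commit -q -m 'Initial version 1.0'

git checkout -q -b feature/v2
echo 'version = 2.0' > version.txt
git add version.txt
git commit -q -m 'Bump to version 2.0'

git checkout -q main
echo 'git grep version feature/v2' > .grep_command
git add .grep_command
git commit -q -m 'Add grep sentinel'

# Search the tree of feature/v2 while main stays checked out.
# Matches are reported with a '<ref>:<path>:' prefix.
grep_output=$(git grep version feature/v2)
printf '%s\n' "$grep_output"
```

Because the ref is passed after the pattern, git grep reads the files as they exist in the `feature/v2` commit, so the working tree on `main` (where version.txt still says 1.0) is not consulted. The sentinel file is committed on `main` only, so it does not appear in the results for `feature/v2`.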
Prompt
Here is the output of a git grep command run against the feature/v2 branch. What version number is found? Output ONLY the version number, nothing else.
Expected
2.0
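As a sketch of how the expected answer follows from the search result: assuming the grep output is the single line `feature/v2:version.txt:version = 2.0` (an assumed sample, since the page does not show the actual output), the version number is whatever follows the `=` sign. The variable names here are hypothetical.

```shell
#!/bin/sh
set -eu

# Assumed sample of one git grep result line in '<ref>:<path>:<content>' form.
grep_line='feature/v2:version.txt:version = 2.0'

# Drop everything up to and including the last '=' and any spaces after it.
version=$(printf '%s\n' "$grep_line" | sed 's/.*= *//')
printf '%s\n' "$version"
```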
deepseek/deepseek-v4-flash:high PASS 100% 50 in → 48 out (44 reasoning)
2.0
deepseek/deepseek-v4-flash:high__json_schema PASS 100% 50 in → 73 out (60 reasoning)
2.0
JSON Schema Structured Output
(raw) { "version_number": "2.0" }
deepseek/deepseek-v4-flash:none PASS 100% 50 in → 4 out (0 reasoning)
2.0
deepseek/deepseek-v4-flash:none__json_schema PASS 100% 50 in → 13 out (0 reasoning)
2.0
JSON Schema Structured Output
(raw) { "version_number": "2.0" }
mistralai/devstral-2512 PASS 100% 49 in → 4 out
2.0
mistralai/devstral-2512__json_schema PASS 100% 49 in → 10 out
2.0
JSON Schema Structured Output
(raw) {"version_number": "2.0"}
nvidia/nemotron-3-nano-30b-a3b:high PASS 100% 62 in → 79 out (75 reasoning)
2.0
nvidia/nemotron-3-nano-30b-a3b:high__json_schema PASS 100% 62 in → 238 out (241 reasoning)
2.0
JSON Schema Structured Output
(raw) { "version_number": "2.0" }
nvidia/nemotron-3-nano-30b-a3b:none PASS 100% 62 in → 4 out (0 reasoning)
2.0
nvidia/nemotron-3-nano-30b-a3b:none__json_schema PASS 100% 62 in → 12 out (0 reasoning)
2.0
JSON Schema Structured Output
(raw) { "version_number": "2.0" }
poolside/laguna-xs.2:high PASS 100% 97 in → 59 out (53 reasoning)
2.0
poolside/laguna-xs.2:high__json_schema PASS 100% 97 in → 210 out (194 reasoning)
2.0
JSON Schema Structured Output
(raw) { "version_number": "2.0" }
poolside/laguna-xs.2:none PASS 100% 97 in → 5 out (0 reasoning)
2.0
poolside/laguna-xs.2:none__json_schema PASS 100% 97 in → 10 out (0 reasoning)
2.0
JSON Schema Structured Output
(raw) {"version_number": "2.0"}