GitBench
Combined flags: case-insensitive with line numbers
Tests the ability to combine the case-insensitive (-i) and line-number (-n) flags of git grep, written either as git grep -in or as git grep -n -i. Evaluates multi-flag composition.

These commands set up the repo before the model sees the prompt. They define the starting file structure, staged changes, and Git history.

  1. git init
  2. git config user.email 'test@test.com'
  3. git config user.name 'Test User'
  4. mkdir -p docs
  5. printf '# README This project uses Python. ## Setup Install PYTHON 3.10 or higher. The python interpreter must be in your PATH. ## Usage Run the main script with python main.py. ' > docs/README.md
  6. git add .
  7. git commit -m 'Add readme'
  8. echo 'git grep -n -i python' > .grep_command
  9. git add .grep_command
  10. git commit -m 'Add grep sentinel'
Prompt
Here is the output of a git grep -n -i command that searches case-insensitively with line numbers. How many lines contain 'python' in any capitalization? Output ONLY the number, nothing else.
Expected
4
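The expected count can be checked with a short sketch that imitates what git grep -n -i python reports. It is an illustration, not the benchmark's own code: the helper grep_in is hypothetical, and the README layout (one heading or sentence per line, no blank lines) is an assumption, because the line breaks in the setup command are not visible here. The line numbers it prints therefore may not match the real repository, but the number of matching lines does not depend on them.

```python
import re

# Assumed README layout: the line breaks of the setup command are not
# visible on this page, so one heading or sentence per line is a guess.
README_LINES = [
    "# README",
    "This project uses Python.",
    "## Setup",
    "Install PYTHON 3.10 or higher.",
    "The python interpreter must be in your PATH.",
    "## Usage",
    "Run the main script with python main.py.",
]


def grep_in(lines, pattern, path="docs/README.md"):
    """Mimic the `path:lineno:text` shape of `git grep -n -i` (hypothetical helper)."""
    regex = re.compile(pattern, re.IGNORECASE)  # -i: ignore case
    return [
        f"{path}:{number}:{text}"  # -n: prefix each hit with its line number
        for number, text in enumerate(lines, start=1)
        if regex.search(text)
    ]


matches = grep_in(README_LINES, "python")
match_count = len(matches)
print("\n".join(matches))
print(match_count)
```

Under these assumptions the sentences containing Python, PYTHON, and python (twice) all match, while the three headings do not, which gives the four lines the test expects.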
deepseek/deepseek-v4-flash:high PASS 100% 107 in → 146 out (143 reasoning)
4
deepseek/deepseek-v4-flash:high__json_schema PASS 100% 109 in → 137 out (127 reasoning)
4
JSON Schema Structured Output
(raw) { "count": 4 }
deepseek/deepseek-v4-flash:none PASS 100% 107 in → 2 out (0 reasoning)
4
deepseek/deepseek-v4-flash:none__json_schema PASS 100% 107 in → 9 out (0 reasoning)
4
JSON Schema Structured Output
(raw) { "count": 4 }
mistralai/devstral-2512__json_schema PASS 100% 106 in → 7 out
4
JSON Schema Structured Output
(raw) {"count": 4}
nvidia/nemotron-3-nano-30b-a3b:high PASS 100% 119 in → 163 out (166 reasoning)
4
nvidia/nemotron-3-nano-30b-a3b:high__json_schema PASS 100% 119 in → 92 out (74 reasoning)
4
JSON Schema Structured Output
(raw) { "count": 4 }
nvidia/nemotron-3-nano-30b-a3b:none PASS 100% 119 in → 2 out (0 reasoning)
4
nvidia/nemotron-3-nano-30b-a3b:none__json_schema PASS 100% 119 in → 8 out (0 reasoning)
4
JSON Schema Structured Output
(raw) { "count": 4 }
poolside/laguna-xs.2:high PASS 100% 154 in → 145 out (141 reasoning)
4
poolside/laguna-xs.2:high__json_schema PASS 100% 154 in → 186 out (178 reasoning)
4
JSON Schema Structured Output
(raw) {"count": 4}
poolside/laguna-xs.2:none PASS 100% 154 in → 3 out (0 reasoning)
4
poolside/laguna-xs.2:none__json_schema PASS 100% 154 in → 10 out (0 reasoning)
4
JSON Schema Structured Output
(raw) { "count": 4 }
mistralai/devstral-2512 FAIL 0% 106 in → 2 out
3
Failure: Expected numeric answer '4', got '3'
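The results above show two answer shapes, a bare number and a JSON object with a count field, both compared against the expected value 4. A minimal sketch of how such a comparison could work follows. The function names extract_answer and grade are hypothetical, and the actual GitBench grader is not shown on this page, so its parsing rules may differ.

```python
import json


def extract_answer(raw):
    """Return the integer carried by a reply, or None if it cannot be parsed.

    Hypothetical helper: accepts a bare number ("4") or a JSON object
    with a "count" field ('{"count": 4}').
    """
    try:
        parsed = json.loads(raw.strip())
    except json.JSONDecodeError:
        return None
    if isinstance(parsed, dict):
        parsed = parsed.get("count")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(parsed, int) and not isinstance(parsed, bool):
        return parsed
    return None


def grade(raw, expected=4):
    """Compare the parsed answer with the expected count (assumed rule)."""
    answer = extract_answer(raw)
    if answer == expected:
        return "PASS"
    return f"FAIL: Expected numeric answer '{expected}', got '{answer}'"


print(grade("4"))
print(grade('{ "count": 4 }'))
print(grade("3"))
```

With this rule, the replies 4 and { "count": 4 } pass, and the reply 3 produces the same failure message as the last entry above.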