f005 — git

Combined flags: case-insensitive with line numbers

Tests the ability to combine two flags, case-insensitive matching (-i) and line numbers (-n), in a single git grep -in invocation. Evaluates multi-flag composition.
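As a minimal sketch of what combining the flags means, the snippet below runs the same search with three spellings and compares the results. The temporary repository and the file sample.txt are hypothetical and exist only for this illustration.

```shell
# Sketch: three spellings of the same flag combination.
# The temporary repository and sample.txt are hypothetical.
set -eu
flag_demo_repo=$(mktemp -d)
cd "$flag_demo_repo"
git init -q
printf 'Python\nruby\nPYTHON\n' > sample.txt
git add sample.txt   # git grep searches tracked files by default

short_combined=$(git grep -in python)
short_separate=$(git grep -n -i python)
long_form=$(git grep --line-number --ignore-case python)

# Each result line has the form path:line-number:content.
printf '%s\n' "$short_combined"
```

All three variables should hold identical output, since -in is only shorthand for -n -i.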

easy git-grep combined-flags case-insensitive

Baseline Repository

These commands set up the repo before the model sees the prompt. They define the starting file structure, staged changes, and Git history.

git init
git config user.email 'test@test.com'
git config user.name 'Test User'
mkdir -p docs
printf '# README\nThis project uses Python.\n## Setup\nInstall PYTHON 3.10 or higher.\nThe python interpreter must be in your PATH.\n## Usage\nRun the main script with python main.py.\n' > docs/README.md
git add .
git commit -m 'Add readme'
echo 'git grep -n -i python' > .grep_command
git add .grep_command
git commit -m 'Add grep sentinel'
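The count the prompt asks for can be reproduced with a sketch like the following. It assumes each heading and sentence of the README sits on its own line, which is a guess about the original layout. The pathspec `-- docs` is also an assumption added here: it keeps the search on the README, because the tracked .grep_command file itself contains the word python.

```shell
# Sketch: rebuild the README in a scratch repository and count matching lines.
# The line layout of the README is an assumption, not taken from the harness.
set -eu
readme_repo=$(mktemp -d)
cd "$readme_repo"
git init -q
mkdir -p docs
printf '# README\nThis project uses Python.\n## Setup\nInstall PYTHON 3.10 or higher.\nThe python interpreter must be in your PATH.\n## Usage\nRun the main script with python main.py.\n' > docs/README.md
git add docs/README.md

# -n prefixes each hit with its line number; -i ignores case.
matches=$(git grep -n -i python -- docs)
printf '%s\n' "$matches"

# One output line per matching source line, so counting lines gives the answer.
match_count=$(printf '%s\n' "$matches" | wc -l | tr -d ' ')
echo "$match_count"
```

Under these assumptions the final line printed is 4: one hit each for Python, PYTHON, and the two lowercase occurrences.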

Prompt

Here is the output of a git grep -n -i command that searches case-insensitively with line numbers. How many lines contain 'python' in any capitalization? Output ONLY the number, nothing else.

Expected

4

Model Outputs (14)

deepseek/deepseek-v4-flash:high PASS 100% 107 in → 146 out (143 reasoning)

deepseek/deepseek-v4-flash:high__json_schema PASS 100% 109 in → 137 out (127 reasoning)

JSON Schema Structured Output

(raw) { "count": 4 }

deepseek/deepseek-v4-flash:none PASS 100% 107 in → 2 out (0 reasoning)

deepseek/deepseek-v4-flash:none__json_schema PASS 100% 107 in → 9 out (0 reasoning)

JSON Schema Structured Output

(raw) { "count": 4 }

mistralai/devstral-2512__json_schema PASS 100% 106 in → 7 out

JSON Schema Structured Output

(raw) {"count": 4}

nvidia/nemotron-3-nano-30b-a3b:high PASS 100% 119 in → 163 out (166 reasoning)

nvidia/nemotron-3-nano-30b-a3b:high__json_schema PASS 100% 119 in → 92 out (74 reasoning)

JSON Schema Structured Output

(raw) { "count": 4 }

nvidia/nemotron-3-nano-30b-a3b:none PASS 100% 119 in → 2 out (0 reasoning)

nvidia/nemotron-3-nano-30b-a3b:none__json_schema PASS 100% 119 in → 8 out (0 reasoning)

JSON Schema Structured Output

(raw) { "count": 4 }

poolside/laguna-xs.2:high PASS 100% 154 in → 145 out (141 reasoning)

poolside/laguna-xs.2:high__json_schema PASS 100% 154 in → 186 out (178 reasoning)

JSON Schema Structured Output

(raw) {"count": 4}

poolside/laguna-xs.2:none PASS 100% 154 in → 3 out (0 reasoning)

poolside/laguna-xs.2:none__json_schema PASS 100% 154 in → 10 out (0 reasoning)

JSON Schema Structured Output

(raw) { "count": 4 }

mistralai/devstral-2512 FAIL 0% 106 in → 2 out

Failure: Expected numeric answer '4', got '3'
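
The pass and fail verdicts above are consistent with a plain numeric comparison. The function below is a hypothetical sketch of such a check, not the harness's actual grader: check_numeric_answer and its first-run-of-digits rule are assumptions made for illustration.

```shell
# Hypothetical checker: compares the first number found in a reply to the expected value.
check_numeric_answer() {
  expected=$1
  raw=$2
  # Take the first run of digits, so both a bare "4" and { "count": 4 } yield 4.
  got=$(printf '%s' "$raw" | grep -o '[0-9][0-9]*' | head -n 1)
  if [ "$got" = "$expected" ]; then
    echo "PASS"
  else
    echo "FAIL: Expected numeric answer '$expected', got '$got'"
  fi
}

check_numeric_answer 4 '{ "count": 4 }'
check_numeric_answer 4 '3'
```

The first call prints PASS; the second prints a failure message in the same shape as the one recorded for mistralai/devstral-2512.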