GitBench
Word boundary search using git grep -w
Tests the ability to perform a word-boundary search with git grep -w. Evaluates whether the model distinguishes whole-word matches from substring matches.

These commands set up the repo before the model sees the prompt. They define the starting file structure, staged changes, and Git history.

  1. git init
  2. git config user.email 'test@test.com'
  3. git config user.name 'Test User'
  4. mkdir -p src
  5. echo 'data = load_data("input.csv")
result = process(data)
save_data(result, "output.csv")
data_loader = DataLoader()' > src/pipeline.py
  6. git add .
  7. git commit -m 'Add data pipeline'
  8. echo 'git grep -w data' > .grep_command
  9. git add .grep_command
  10. git commit -m 'Add grep sentinel'
Prompt
Here is the output of a git grep -w command that searches for the whole word 'data'. How many lines match? Output ONLY the number, nothing else.
Expected
2
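The expected count can be checked with a minimal Python sketch. It assumes src/pipeline.py holds four lines, one statement per line, and it approximates git grep -w with the regex word boundary \b rather than calling Git itself. The PIPELINE constant and the count_matching_lines helper are illustrative names, not part of the benchmark.

```python
import re

# Assumed content of src/pipeline.py, one statement per line.
PIPELINE = """data = load_data("input.csv")
result = process(data)
save_data(result, "output.csv")
data_loader = DataLoader()"""


def count_matching_lines(text: str, pattern: str) -> int:
    """Count lines containing at least one match of the regex pattern."""
    regex = re.compile(pattern)
    return sum(1 for line in text.splitlines() if regex.search(line))


# Whole-word search: "data" must not touch a letter, digit, or underscore,
# so load_data, save_data, and data_loader do not count.
whole_word = count_matching_lines(PIPELINE, r"\bdata\b")

# Plain substring search: any occurrence of "data" counts.
substring = count_matching_lines(PIPELINE, r"data")

print(whole_word, substring)
```

Under these assumptions the whole-word count is 2 (the first two lines), while a substring count gives 4, which is the answer a model produces if it ignores the -w flag.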
deepseek/deepseek-v4-flash:high PASS 100% 60 in → 88 out (85 reasoning)
2
deepseek/deepseek-v4-flash:high__json_schema PASS 100% 60 in → 230 out (220 reasoning)
2
JSON Schema Structured Output
(raw) { "count": 2 }
deepseek/deepseek-v4-flash:none__json_schema PASS 100% 60 in → 8 out (0 reasoning)
2
JSON Schema Structured Output
(raw) { "count": 2 }
mistralai/devstral-2512 PASS 100% 59 in → 2 out
2
mistralai/devstral-2512__json_schema PASS 100% 59 in → 7 out
2
JSON Schema Structured Output
(raw) {"count": 2}
nvidia/nemotron-3-nano-30b-a3b:high PASS 100% 72 in → 574 out (618 reasoning)
2
nvidia/nemotron-3-nano-30b-a3b:high__json_schema PASS 100% 72 in → 209 out (211 reasoning)
2
JSON Schema Structured Output
(raw) { "count": 2 }
nvidia/nemotron-3-nano-30b-a3b:none PASS 100% 72 in → 2 out (0 reasoning)
2
nvidia/nemotron-3-nano-30b-a3b:none__json_schema PASS 100% 72 in → 10 out (0 reasoning)
2
JSON Schema Structured Output
(raw) { "count": 2 }
poolside/laguna-xs.2:high PASS 100% 108 in → 84 out (80 reasoning)
2
poolside/laguna-xs.2:high__json_schema PASS 100% 108 in → 118 out (106 reasoning)
2
JSON Schema Structured Output
(raw) { "count": 2 }
poolside/laguna-xs.2:none PASS 100% 108 in → 3 out (0 reasoning)
2
poolside/laguna-xs.2:none__json_schema PASS 100% 108 in → 7 out (0 reasoning)
2
JSON Schema Structured Output
(raw) {"count": 2}
deepseek/deepseek-v4-flash:none FAIL 0% 60 in → 1 out (0 reasoning)
4
Failure: Expected '2', got '4'
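The results above mix plain answers with JSON Schema answers of the form {"count": 2}. How GitBench actually scores them is not shown here; the following is only a sketch of one way an exact-match check could handle both shapes. The extract_answer and grade functions are hypothetical names introduced for illustration.

```python
import json


def extract_answer(raw: str) -> str:
    """Return the answer as a string from plain text or a {"count": N} object."""
    text = raw.strip()
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError:
        return text
    # Structured output: pull the number out of the "count" field.
    if isinstance(parsed, dict) and "count" in parsed:
        return str(parsed["count"])
    # Anything else (for example a bare number) is compared as written.
    return text


def grade(raw: str, expected: str = "2") -> str:
    """Exact-match comparison against the expected answer (assumed scoring rule)."""
    answer = extract_answer(raw)
    if answer == expected:
        return "PASS"
    return f"Failure: Expected '{expected}', got '{answer}'"


print(grade('{ "count": 2 }'))
print(grade("4"))
```

Normalizing both shapes to a string before comparing keeps the check identical for the plain and the __json_schema variants of each model.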