GitBench
Path-limited search restricting to a subdirectory
Tests whether a search can be restricted to a subdirectory path, that is, path-limited search scoping.

These commands set up the repo before the model sees the prompt. They define the starting file structure, staged changes, and Git history.

  1. git init
  2. git config user.email 'test@test.com'
  3. git config user.name 'Test User'
  4. mkdir -p src tests
  5. echo 'def connect(): return True' > src/db.py
  6. echo 'def test_connect(): assert connect()' > tests/test_db.py
  7. echo 'def helper(): pass' > src/utils.py
  8. echo 'def test_helper(): assert helper()' > tests/test_utils.py
  9. git add .
  10. git commit -m 'Add source and test files'
  11. echo 'git grep def -- src/' > .grep_command
  12. git add .grep_command
  13. git commit -m 'Add grep sentinel'
Prompt
Here is the output of a git grep command limited to the src/ directory. How many function definitions appear in the output? Output ONLY the number, nothing else.
Expected
2
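The failure message further down this page ("Expected '2', got '3'") suggests the answer is compared with the expected value as a string. The page does not document the grader, so the following is only a minimal sketch of such a check; the `grade` function and the whitespace stripping are assumptions made for illustration.

```python
def grade(expected, output):
    """Hypothetical exact-match grader: returns (passed, failure_message)."""
    actual = output.strip()  # assumption: surrounding whitespace is ignored
    if actual == expected:
        return True, ""
    return False, f"Expected '{expected}', got '{actual}'"


print(grade("2", "2\n"))
print(grade("2", "3"))
```

An exact-match check of this kind would explain why the prompt asks for the number and nothing else: any extra text would fail the comparison.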
deepseek/deepseek-v4-flash:high PASS 100% 57 in → 90 out (87 reasoning)
2
deepseek/deepseek-v4-flash:none__json_schema PASS 100% 59 in → 7 out (0 reasoning)
2
JSON Schema Structured Output
(raw) { "count": 2 }
mistralai/devstral-2512 PASS 100% 55 in → 2 out
2
mistralai/devstral-2512__json_schema PASS 100% 55 in → 7 out
2
JSON Schema Structured Output
(raw) {"count": 2}
nvidia/nemotron-3-nano-30b-a3b:high PASS 100% 68 in → 51 out (52 reasoning)
2
nvidia/nemotron-3-nano-30b-a3b:high__json_schema PASS 100% 68 in → 111 out (109 reasoning)
2
JSON Schema Structured Output
(raw) { "count": 2 }
nvidia/nemotron-3-nano-30b-a3b:none PASS 100% 68 in → 2 out (0 reasoning)
2
nvidia/nemotron-3-nano-30b-a3b:none__json_schema PASS 100% 68 in → 10 out (0 reasoning)
2
JSON Schema Structured Output
(raw) { "count": 2 }
poolside/laguna-xs.2:high PASS 100% 101 in → 70 out (66 reasoning)
2
poolside/laguna-xs.2:high__json_schema PASS 100% 101 in → 109 out (97 reasoning)
2
JSON Schema Structured Output
(raw) { "count": 2 }
poolside/laguna-xs.2:none PASS 100% 101 in → 3 out (0 reasoning)
2
poolside/laguna-xs.2:none__json_schema PASS 100% 101 in → 11 out (0 reasoning)
2
JSON Schema Structured Output
(raw) { "count": 2 }
Invalid structured output. Output: 2
JSON Schema Structured Output
Structured Output Error
Structured output schema validation failed: $ must be of type object
Failure: Structured output schema validation failed: $ must be of type object
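The validation error above reports that the top-level value (`$`) was not a JSON object: the response was a bare `2` rather than something like `{"count": 2}`. The schema itself is not shown on this page, so the sketch below assumes an object with an integer `count` field, inferred from the raw outputs of the passing runs. `validate_count_output` is a hypothetical helper using only the standard library, and the `$.count` message is invented for the example.

```python
import json


def validate_count_output(raw):
    """Return (ok, error) for a raw model response.

    Assumed schema (not confirmed by the source): {"count": <integer>}.
    """
    try:
        value = json.loads(raw)
    except json.JSONDecodeError as exc:
        return False, f"invalid JSON: {exc.msg}"
    if not isinstance(value, dict):
        # A bare number such as 2 is valid JSON but not an object.
        return False, "$ must be of type object"
    count = value.get("count")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(count, bool) or not isinstance(count, int):
        return False, "$.count must be of type integer"
    return True, ""


print(validate_count_output('{ "count": 2 }'))
print(validate_count_output("2"))
```

With this check, a correct answer can still be rejected when it arrives in the wrong shape, which is consistent with the invalid run above where the output was `2`.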
deepseek/deepseek-v4-flash:none FAIL 0% 57 in → 2 out (0 reasoning)
3
Failure: Expected '2', got '3'