GitBench
Basic keyword search across tracked files
Tests the ability to search tracked files for a keyword using git grep. Evaluates basic code-search comprehension.

These commands set up the repo before the model sees the prompt. They define the starting file structure, staged changes, and Git history.

  1. git init
  2. git config user.email 'test@test.com'
  3. git config user.name 'Test User'
  4. mkdir -p src
  5. echo 'def authenticate(user, password): if not user: raise ValueError("user required") return check_credentials(user, password)' > src/auth.py
  6. echo 'def render_page(user): template = load_template("home.html") return template.render(user=user)' > src/views.py
  7. echo 'import logging logger = logging.getLogger(__name__) def check_credentials(user, password): logger.info(f"Auth attempt for {user}") return db.lookup(user, password)' > src/db.py
  8. git add .
  9. git commit -m 'Add project files'
  10. echo 'git grep authenticate' > .grep_command
  11. git add .grep_command
  12. git commit -m 'Add grep sentinel'
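As a rough illustration of what the setup produces, the Python sketch below mirrors the tracked files in a dictionary and applies a plain case-sensitive substring test, which is how git grep treats a simple pattern like this one by default. The `TRACKED` dictionary and the `files_matching` helper are hypothetical names used only here; the benchmark itself runs Git. Under this reading the sentinel file `.grep_command` also contains the keyword, so the sketch reports the `src/` files and the whole tree separately.

```python
# Contents of the tracked files after the setup steps (each echo writes one line).
TRACKED = {
    "src/auth.py": 'def authenticate(user, password): if not user: raise ValueError("user required") return check_credentials(user, password)',
    "src/views.py": 'def render_page(user): template = load_template("home.html") return template.render(user=user)',
    "src/db.py": 'import logging logger = logging.getLogger(__name__) def check_credentials(user, password): logger.info(f"Auth attempt for {user}") return db.lookup(user, password)',
    ".grep_command": "git grep authenticate",
}


def files_matching(keyword, files, prefix=""):
    """Return sorted paths under `prefix` whose content contains `keyword` (case-sensitive)."""
    return sorted(
        path for path, text in files.items()
        if path.startswith(prefix) and keyword in text
    )


# Restricting the search to src/ yields only the file named under Expected.
source_hits = files_matching("authenticate", TRACKED, prefix="src/")
# Searching every tracked file also picks up the sentinel that stores the command.
all_hits = files_matching("authenticate", TRACKED)

print(source_hits)
print(all_hits)
```

Note that `src/db.py` does not match: it contains "Auth attempt" with a capital A, and the default search is case-sensitive.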
Prompt
Here is the output of a git grep command run on this repository. Which files contain the word 'authenticate'? Output ONLY the filenames, one per line, nothing else.
Expected
src/auth.py
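The scoring rule is not shown on this page. A minimal sketch of one plausible check, assuming the answer must list exactly the expected filenames and that blank lines and surrounding whitespace are ignored, could look like this (`normalize` and `grade` are hypothetical helpers, not part of GitBench):

```python
def normalize(answer):
    """Split a raw answer into a set of non-empty, whitespace-trimmed lines."""
    return {line.strip() for line in answer.splitlines() if line.strip()}


def grade(answer, expected):
    """Return True when the answer lists exactly the expected filenames."""
    return normalize(answer) == normalize(expected)


EXPECTED = "src/auth.py"

print(grade("src/auth.py\n", EXPECTED))           # trailing newline is tolerated
print(grade("src/auth.py\nsrc/db.py", EXPECTED))  # an extra file fails the check
```

Comparing sets rather than raw strings keeps the check insensitive to line order, which matters once a case expects more than one file.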
deepseek/deepseek-v4-flash:high PASS 100% 52 in → 74 out (70 reasoning)
src/auth.py
deepseek/deepseek-v4-flash:high__json_schema PASS 100% 52 in → 78 out (66 reasoning)
src/auth.py
JSON Schema Structured Output
(raw) { "files": [ "src/auth.py" ] }
deepseek/deepseek-v4-flash:none PASS 100% 52 in → 3 out (0 reasoning)
src/auth.py
deepseek/deepseek-v4-flash:none__json_schema PASS 100% 52 in → 16 out (0 reasoning)
src/auth.py
JSON Schema Structured Output
(raw) { "files": [ "src/auth.py" ] }
mistralai/devstral-2512 PASS 100% 49 in → 4 out
src/auth.py
mistralai/devstral-2512__json_schema PASS 100% 49 in → 16 out
src/auth.py
JSON Schema Structured Output
(raw) { "files": [ "src/auth.py" ] }
nvidia/nemotron-3-nano-30b-a3b:high PASS 100% 62 in → 5,230 out (6,308 reasoning)
src/auth.py
nvidia/nemotron-3-nano-30b-a3b:high__json_schema PASS 100% 62 in → 72 out (60 reasoning)
src/auth.py
JSON Schema Structured Output
(raw) { "files": [ "src/auth.py" ] }
nvidia/nemotron-3-nano-30b-a3b:none PASS 100% 62 in → 4 out (0 reasoning)
src/auth.py
nvidia/nemotron-3-nano-30b-a3b:none__json_schema PASS 100% 62 in → 16 out (0 reasoning)
src/auth.py
JSON Schema Structured Output
(raw) { "files": [ "src/auth.py" ] }
poolside/laguna-xs.2:high PASS 100% 95 in → 178 out (172 reasoning)
src/auth.py
poolside/laguna-xs.2:high__json_schema PASS 100% 95 in → 105 out (84 reasoning)
src/auth.py
JSON Schema Structured Output
(raw) { "files": [ "src/auth.py" ] }
poolside/laguna-xs.2:none PASS 100% 95 in → 5 out (0 reasoning)
src/auth.py
poolside/laguna-xs.2:none__json_schema PASS 100% 95 in → 9 out (0 reasoning)
src/auth.py
JSON Schema Structured Output
(raw) {"files": ["src/auth.py"]}
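The `__json_schema` runs above return an object with a single `files` array, in either spaced or compact form. As a sketch of how such a raw payload might be reduced to the one-filename-per-line answer, assuming that shape holds for every run (the helper name `files_from_structured` is hypothetical):

```python
import json


def files_from_structured(raw):
    """Parse a raw structured-output string and return the filenames, one per line."""
    payload = json.loads(raw)
    if not isinstance(payload, dict):
        raise ValueError("expected a JSON object")
    files = payload.get("files")
    if not isinstance(files, list) or not all(isinstance(f, str) for f in files):
        raise ValueError("expected 'files' to be a list of strings")
    return "\n".join(files)


# Both formatting variants seen in the results parse to the same answer.
spaced = files_from_structured('{ "files": [ "src/auth.py" ] }')
compact = files_from_structured('{"files": ["src/auth.py"]}')
print(spaced)
print(compact)
```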