GitBench
Context lines with git grep -C
Tests the ability to interpret the context lines shown by git grep -C. Evaluates understanding of surrounding-line output.

These commands set up the repo before the model sees the prompt. They define the starting file structure, staged changes, and Git history.

  1. git init
  2. git config user.email 'test@test.com'
  3. git config user.name 'Test User'
  4. printf 'import sys import os CONFIG = { "host": "localhost", "port": 8080, "debug": True, "workers": 4, } def start_server(): host = CONFIG["host"] port = CONFIG["port"] print(f"Starting on {host}:{port}") ' > server.py
  5. git add .
  6. git commit -m 'Add server config'
  7. echo 'git grep -C 2 port' > .grep_command
  8. git add .grep_command
  9. git commit -m 'Add grep sentinel'
Prompt
Here is the output of a git grep -C 2 command showing 2 lines of context around matches for 'port'. How many total lines (including context) appear in the output? Output ONLY the number, nothing else.
Expected
13
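One way to arrive at a count like this is to merge the context windows around each match and add one line for every `--` separator printed between non-adjacent groups. The sketch below does that for a single file. It rests on assumptions: the page has flattened the newlines in the printf string, so the layout of server.py used here (a blank line after the imports and another after the CONFIG dict) is a guess that happens to be consistent with the expected value, not something the source confirms. The names `count_context_output_lines` and `ASSUMED_SERVER_PY` are hypothetical.

```python
import re

# ASSUMPTION: the original printf contained newline escapes that were lost on
# this page. This layout is a guess, not the confirmed file content.
ASSUMED_SERVER_PY = '''import sys
import os

CONFIG = {
    "host": "localhost",
    "port": 8080,
    "debug": True,
    "workers": 4,
}

def start_server():
    host = CONFIG["host"]
    port = CONFIG["port"]
    print(f"Starting on {host}:{port}")
'''


def count_context_output_lines(lines, pattern, context):
    """Count the lines a grep-style search with -C <context> prints for one file.

    Counts every matching line, every context line (overlapping windows are
    merged), and one "--" separator per gap between non-adjacent groups.
    """
    matches = [i for i, line in enumerate(lines) if re.search(pattern, line)]
    shown = set()
    for i in matches:
        first = max(0, i - context)               # clamp at start of file
        last = min(len(lines) - 1, i + context)   # clamp at end of file
        shown.update(range(first, last + 1))
    ordered = sorted(shown)
    # A gap between consecutive shown lines means a "--" separator is printed.
    separators = sum(1 for a, b in zip(ordered, ordered[1:]) if b > a + 1)
    return len(ordered) + separators


lines = ASSUMED_SERVER_PY.splitlines()
total = count_context_output_lines(lines, "port", 2)
print(total)
```

Note that "import" contains "port", so both import lines count as matches under this layout. The sketch covers server.py only; it does not model the tracked .grep_command file or any separator printed between files.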
deepseek/deepseek-v4-flash:high PASS 100% 156 in → 2,827 out (2,824 reasoning)
13
deepseek/deepseek-v4-flash:high__json_schema PASS 100% 219 in → 245 out (224 reasoning)
13
JSON Schema Structured Output
(raw) {"count": 13}
nvidia/nemotron-3-nano-30b-a3b:high PASS 100% 169 in → 2,343 out (2,350 reasoning)
13
nvidia/nemotron-3-nano-30b-a3b:high__json_schema PASS 100% 169 in → 1,285 out (1,286 reasoning)
13
JSON Schema Structured Output
(raw) { "count" : 13 }
poolside/laguna-xs.2:high PASS 100% 210 in → 823 out (818 reasoning)
13
poolside/laguna-xs.2:high__json_schema PASS 100% 210 in → 409 out (398 reasoning)
13
JSON Schema Structured Output
(raw) {"count": 13}
deepseek/deepseek-v4-flash:none FAIL 0% 156 in → 1 out (0 reasoning)
12
Failure: Expected numeric answer '13', got '12'
deepseek/deepseek-v4-flash:none__json_schema FAIL 0% 156 in → 11 out (0 reasoning)
6
JSON Schema Structured Output
(raw) { "count": 6 }
Failure: Expected numeric answer '13', got '6'
mistralai/devstral-2512 FAIL 0% 156 in → 3 out
12
Failure: Expected numeric answer '13', got '12'
12
JSON Schema Structured Output
(raw) {"count": 12}
Failure: Expected numeric answer '13', got '12'
nvidia/nemotron-3-nano-30b-a3b:none FAIL 0% 169 in → 2 out (0 reasoning)
5
Failure: Expected numeric answer '13', got '5'
nvidia/nemotron-3-nano-30b-a3b:none__json_schema FAIL 0% 169 in → 11 out (0 reasoning)
15
JSON Schema Structured Output
(raw) { "count": 15 }
Failure: Expected numeric answer '13', got '15'
poolside/laguna-xs.2:none FAIL 0% 210 in → 4 out (0 reasoning)
30
Failure: Expected numeric answer '13', got '30'
poolside/laguna-xs.2:none__json_schema FAIL 0% 210 in → 7 out (0 reasoning)
15
JSON Schema Structured Output
(raw) {"count":15}
Failure: Expected numeric answer '13', got '15'