GitBench
Show file content at a specific revision
Tests whether the model can show a file's content at a specific revision, that is, retrieve historical file content from Git.

These commands set up the repo before the model sees the prompt. They define the starting file structure, staged changes, and Git history.

  1. git init
  2. git config user.email 'test@test.com'
  3. git config user.name 'Test User'
  4. echo 'version one' > doc.txt
  5. git add doc.txt
  6. git commit -m 'First version'
  7. FIRST=$(git rev-parse HEAD)
  8. echo 'version two' > doc.txt
  9. git add doc.txt
  10. git commit -m 'Second version'
Prompt
Using git show HEAD~1:doc.txt, what was the content of doc.txt in the first commit? Output ONLY the file content, nothing else.
Expected
version one
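The setup and the expected answer can be reproduced locally. The following is a minimal sketch, not the benchmark's own harness: it assumes `git` and `mktemp` are available, and the temporary directory and the variable names `repo`, `previous`, `by_hash`, and `current` are illustrative additions. It replays the setup commands, then reads doc.txt at three revisions using Git's `<rev>:<path>` syntax.

```shell
#!/bin/sh
set -eu

# Assumption: work in a throwaway directory so no real repository is touched.
repo=$(mktemp -d)
cd "$repo"

# Replay the setup commands (-q only silences Git's progress output).
git init -q
git config user.email 'test@test.com'
git config user.name 'Test User'

echo 'version one' > doc.txt
git add doc.txt
git commit -q -m 'First version'
FIRST=$(git rev-parse HEAD)

echo 'version two' > doc.txt
git add doc.txt
git commit -q -m 'Second version'

# <rev>:<path> names the file at <path> as it existed in commit <rev>.
previous=$(git show HEAD~1:doc.txt)    # parent of HEAD, i.e. the first commit
by_hash=$(git show "$FIRST:doc.txt")   # the same commit, addressed by its hash
current=$(git show HEAD:doc.txt)       # the latest commit

printf 'HEAD~1: %s\n' "$previous"
printf 'FIRST:  %s\n' "$by_hash"
printf 'HEAD:   %s\n' "$current"
```

With only two commits in the history, `HEAD~1` and `$FIRST` resolve to the same commit, so both return `version one`. `HEAD:doc.txt` returns `version two`, which is the content of the latest commit rather than the first.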
deepseek/deepseek-v4-flash:high PASS 100% 173 in → 69 out (68 reasoning)
version one
deepseek/deepseek-v4-flash:high__json_schema PASS 100% 173 in → 102 out (91 reasoning)
version one
JSON Schema Structured Output
(raw) { "content": "version one" }
deepseek/deepseek-v4-flash:none__json_schema PASS 100% 171 in → 10 out (0 reasoning)
version one
JSON Schema Structured Output
(raw) { "content": "version one" }
mistralai/devstral-2512 PASS 100% 196 in → 3 out
version one
mistralai/devstral-2512__json_schema PASS 100% 197 in → 8 out
version one
JSON Schema Structured Output
(raw) {"content": "version one"}
nvidia/nemotron-3-nano-30b-a3b:high PASS 100% 208 in → 230 out (256 reasoning)
version one
nvidia/nemotron-3-nano-30b-a3b:high__json_schema PASS 100% 210 in → 170 out (181 reasoning)
version one
JSON Schema Structured Output
(raw) { "content": "version one" }
poolside/laguna-xs.2:high PASS 100% 246 in → 507 out (502 reasoning)
version one
poolside/laguna-xs.2:high__json_schema PASS 100% 246 in → 246 out (231 reasoning)
version one
JSON Schema Structured Output
(raw) { "content": "version one\n" }
poolside/laguna-xs.2:none PASS 100% 244 in → 4 out (0 reasoning)
version one
poolside/laguna-xs.2:none__json_schema PASS 100% 246 in → 8 out (0 reasoning)
version one
JSON Schema Structured Output
(raw) {"content":"version one\n"}
deepseek/deepseek-v4-flash:none FAIL
(empty output)
Failure: Model 'deepseek/deepseek-v4-flash:none' violated the reasoning_level=none invariant: provider reported 84 reasoning token(s)
nvidia/nemotron-3-nano-30b-a3b:none FAIL 0% 209 in → 3 out (0 reasoning)
version two
Failure: Expected 'version one', got 'version two'
nvidia/nemotron-3-nano-30b-a3b:none__json_schema FAIL 0% 207 in → 11 out (0 reasoning)
version two
JSON Schema Structured Output
(raw) { "content": "version two" }
Failure: Expected 'version one', got 'version two'