GitBench
Tag an older commit
Tests the ability to tag an older (non-HEAD) commit. Evaluates knowledge of the commands for retroactive tagging.

These commands set up the repo before the model sees the prompt. They define the starting file structure, staged changes, and Git history.

  1. git init
  2. git config user.email 'test@test.com'
  3. git config user.name 'Test User'
  4. echo 'v1' > app.txt
  5. git add app.txt
  6. git commit -m 'First commit'
  7. echo 'v2' > app.txt
  8. git add app.txt
  9. git commit -m 'Second commit'
  10. echo 'v3' > app.txt
  11. git add app.txt
  12. git commit -m 'Third commit'
Prompt
Tag the commit with message 'First commit' as 'v0.1'. Output ONLY the git command, nothing else.
Expected
git tag v0.1 HEAD~2
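The expected command can be checked locally with a minimal sketch like the one below. It rebuilds the three-commit history from the setup steps in a throwaway directory, applies `git tag v0.1 HEAD~2`, and runs the two checks named in the failure entries (`git tag -l` and `git log v0.1 --oneline -1`). The temporary directory and the variable names `repo`, `tags`, and `subject` are illustrative choices, not part of the benchmark.

```shell
#!/bin/sh
set -eu

# Work in a throwaway directory so no existing repository is touched.
repo=$(mktemp -d)
cd "$repo"

git init -q
git config user.email 'test@test.com'
git config user.name 'Test User'

# Recreate the three commits from the setup steps (v1, v2, v3 in app.txt).
n=0
for msg in First Second Third; do
  n=$((n + 1))
  echo "v$n" > app.txt
  git add app.txt
  git commit -q -m "$msg commit"
done

# HEAD~2 walks two first-parent steps back from HEAD, which in this
# linear history is the commit with the message 'First commit'.
git tag v0.1 HEAD~2

# The same two commands the assertions run.
tags=$(git tag -l)
subject=$(git log v0.1 --oneline -1)
echo "$tags"
echo "$subject"
```

A relative reference such as `HEAD~2` works without knowing the commit hash, whereas the abbreviated hashes in the model answers are only valid for the specific repository each run was given.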
deepseek/deepseek-v4-flash:high PASS 100% 59 in → 626 out (604 reasoning)
git tag -a v0.1 58b14f6 -m "First commit"
deepseek/deepseek-v4-flash:high__json_schema PASS 100% 122 in → 340 out (293 reasoning)
git tag v0.1 4baea00
JSON Schema Structured Output
(raw) {"command": "git tag v0.1 4baea00"}
deepseek/deepseek-v4-flash:none PASS 100% 61 in → 18 out (0 reasoning)
git tag -a v0.1 eb0183e -m 'First commit'
deepseek/deepseek-v4-flash:none__json_schema PASS 100% 57 in → 28 out (0 reasoning)
git tag -a v0.1 8723b83 -m 'First commit'
JSON Schema Structured Output
(raw) { "command": "git tag -a v0.1 8723b83 -m 'First commit'" }
mistralai/devstral-2512__json_schema PASS 100% 63 in → 25 out
git tag -a v0.1 ce38bde -m 'First commit'
JSON Schema Structured Output
(raw) {"command": "git tag -a v0.1 ce38bde -m 'First commit'"}
nvidia/nemotron-3-nano-30b-a3b:high PASS 100% 76 in → 105 out (87 reasoning)
git tag v0.1 8fdaf14
nvidia/nemotron-3-nano-30b-a3b:high__json_schema PASS 100% 77 in → 402 out (284 reasoning)
git tag -a v0.1 53250b5 -m "First commit"
JSON Schema Structured Output
(raw) { "command": "git tag -a v0.1 53250b5 -m \"First commit\"" }
nvidia/nemotron-3-nano-30b-a3b:none PASS 100% 74 in → 14 out (0 reasoning)
git tag v0.1 10da856
nvidia/nemotron-3-nano-30b-a3b:none__json_schema PASS 100% 75 in → 24 out (0 reasoning)
git tag -f v0.1 55fb709
JSON Schema Structured Output
(raw) { "command": "git tag -f v0.1 55fb709" }
poolside/laguna-xs.2:high PASS 100% 114 in → 478 out (452 reasoning)
git tag -a v0.1 -m "v0.1" 65db02f
poolside/laguna-xs.2:none PASS 100% 113 in → 14 out (0 reasoning)
git tag v0.1 aef435f
poolside/laguna-xs.2:none__json_schema FAIL 50% 110 in → 18 out (0 reasoning)
git tag v0.1 2c17aec
JSON Schema Structured Output
(raw) {"command": "git tag v0.1 2c17aec"}
Failure: Failed assertions: [{'type': 'git_output', 'command': 'git log v0.1 --oneline -1', 'contains': 'First commit'}]
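The percentages in these entries look consistent with a score equal to the fraction of assertions that pass (one of two failing gives 50%, both failing gives 0%), though the report does not state the scoring rule. The sketch below illustrates that reading under that assumption. `assert_contains` is a hypothetical helper modelled on the `git_output` / `contains` fields, and tagging `HEAD` is only a convenient way to make exactly one check fail; the report does not say which commit the failing answer actually tagged.

```shell
#!/bin/sh
set -u

# Hypothetical helper: succeeds when the command's output contains the text.
# Modelled on the 'git_output' / 'contains' fields in the failure entries.
assert_contains() {
  output=$(sh -c "$1" 2>/dev/null </dev/null) || return 1
  case "$output" in
    *"$2"*) return 0 ;;
    *) return 1 ;;
  esac
}

repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q
git config user.email 'test@test.com'
git config user.name 'Test User'
n=0
for msg in First Second Third; do
  n=$((n + 1))
  echo "v$n" > app.txt
  git add app.txt
  git commit -q -m "$msg commit"
done

# Deliberately tag the wrong commit so the tag exists but points elsewhere.
git tag v0.1 HEAD

# Each line is "command|expected substring", as in the failure entries.
passed=0
total=0
while IFS='|' read -r cmd needle; do
  total=$((total + 1))
  if assert_contains "$cmd" "$needle"; then
    passed=$((passed + 1))
  fi
done <<'EOF'
git tag -l|v0.1
git log v0.1 --oneline -1|First commit
EOF

# Assumed scoring rule: percentage of assertions that passed.
score=$((100 * passed / total))
echo "$passed/$total assertions passed: $score%"
```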
mistralai/devstral-2512 FAIL 0% 62 in → 19 out
```bash git tag v0.1 27cc6c5 ```
Failure: Failed assertions: [{'type': 'git_output', 'command': 'git tag -l', 'contains': 'v0.1'}, {'type': 'git_output', 'command': 'git log v0.1 --oneline -1', 'contains': 'First commit'}]
poolside/laguna-xs.2:high__json_schema FAIL 0% 113 in → 162 out (140 reasoning)
git tag -a v0.1 02e6ed4
JSON Schema Structured Output
(raw) {"command": "git tag -a v0.1 02e6ed4"}
Failure: Failed assertions: [{'type': 'git_output', 'command': 'git tag -l', 'contains': 'v0.1'}, {'type': 'git_output', 'command': 'git log v0.1 --oneline -1', 'contains': 'First commit'}]
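The report does not say why this answer created no tag. One plausible contributor, offered only as an assumption, is that `git tag -a` without `-m` or `-F` opens an editor for the tag message and refuses to create the tag when the message ends up empty. The sketch below simulates a harness with no interactive editor by setting `GIT_EDITOR=true`, which is an illustrative stand-in and not something the benchmark is known to do.

```shell
#!/bin/sh
set -u

repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q
git config user.email 'test@test.com'
git config user.name 'Test User'
echo 'v1' > app.txt
git add app.txt
git commit -q -m 'First commit'

# Assumption: no interactive editor is available. GIT_EDITOR=true stands in
# for that case: the "editor" exits at once and leaves the message empty.
if GIT_EDITOR=true git tag -a v0.1 HEAD 2>/dev/null; then
  annotated_without_message=created
else
  annotated_without_message=rejected
fi

# Supplying -m gives git the message directly, so no editor is started.
git tag -a v0.1 -m 'First commit' HEAD
with_message=$(git tag -l)

echo "without -m: $annotated_without_message"
echo "with -m: $with_message"
```

The passing annotated answers above all pass a message with `-m`, which fits this reading but does not confirm it.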