GitBench
Create an annotated tag with a message
Tests whether the model can create an annotated tag, that is, a full tag object that carries a message and tagger metadata, rather than a lightweight tag.

These commands set up the repo before the model sees the prompt. They define the starting file structure, staged changes, and Git history.

  1. git init
  2. git config user.email 'test@test.com'
  3. git config user.name 'Test User'
  4. echo 'App' > app.txt
  5. git add app.txt
  6. git commit -m 'Initial release'
Prompt
Create an annotated git tag named 'v1.0.0' on HEAD with the message 'First stable release'. Output ONLY the git command, nothing else.
Expected
git tag -a v1.0.0 -m 'First stable release'
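Several of the passing answers below append an explicit HEAD to the expected command. Because `git tag` targets HEAD when no commit is given, both forms should produce equivalent tags. The following is a minimal sketch of that equivalence, assuming a POSIX shell and git on the PATH; the scratch directory created with `mktemp` and the second tag name `v1.0.0-explicit` are illustrative assumptions, not part of the benchmark.

```shell
# Assumption: work in a throwaway directory that mirrors the benchmark setup.
repo="$(mktemp -d)"
cd "$repo"
git init --quiet
git config user.email 'test@test.com'
git config user.name 'Test User'
echo 'App' > app.txt
git add app.txt
git commit --quiet -m 'Initial release'

# Expected form: no commit argument, so git tags HEAD.
git tag -a v1.0.0 -m 'First stable release'

# Explicit form, under a hypothetical second name so both tags can coexist.
git tag -a v1.0.0-explicit -m 'First stable release' HEAD

# Dereference each annotated tag to its commit and print it next to HEAD.
git rev-parse HEAD 'v1.0.0^{commit}' 'v1.0.0-explicit^{commit}'
```

If the two forms are equivalent, the three printed hashes are identical.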
deepseek/deepseek-v4-flash:high PASS 100% 69 in → 178 out (158 reasoning)
git tag -a v1.0.0 -m 'First stable release' HEAD
deepseek/deepseek-v4-flash:high__json_schema PASS 100% 51 in → 140 out (109 reasoning)
git tag -a v1.0.0 -m 'First stable release'
JSON Schema Structured Output
(raw) {"command": "git tag -a v1.0.0 -m 'First stable release'"}
deepseek/deepseek-v4-flash:none PASS 100% 51 in → 18 out (0 reasoning)
git tag -a v1.0.0 -m 'First stable release'
deepseek/deepseek-v4-flash:none__json_schema PASS 100% 54 in → 25 out (0 reasoning)
git tag -a v1.0.0 -m 'First stable release'
JSON Schema Structured Output
(raw) { "command": "git tag -a v1.0.0 -m 'First stable release'" }
mistralai/devstral-2512__json_schema PASS 100% 53 in → 24 out
git tag -a v1.0.0 -m "First stable release" HEAD
JSON Schema Structured Output
(raw) {"command": "git tag -a v1.0.0 -m \"First stable release\" HEAD"}
nvidia/nemotron-3-nano-30b-a3b:high PASS 100% 66 in → 2,377 out (2,671 reasoning)
git tag -a v1.0.0 -m "First stable release"
nvidia/nemotron-3-nano-30b-a3b:high__json_schema PASS 100% 65 in → 100 out (71 reasoning)
git tag -a v1.0.0 -m "First stable release" HEAD
JSON Schema Structured Output
(raw) { "command": "git tag -a v1.0.0 -m \"First stable release\" HEAD" }
nvidia/nemotron-3-nano-30b-a3b:none PASS 100% 64 in → 19 out (0 reasoning)
git tag -a v1.0.0 -m "First stable release" HEAD
nvidia/nemotron-3-nano-30b-a3b:none__json_schema PASS 100% 66 in → 53 out (0 reasoning)
git tag -a v1.0.0 -m 'First stable release' && git tag -p v1.0.0 -f 'Initial release' && git push --tags --force-with-lease
JSON Schema Structured Output
(raw) { "command": "git tag -a v1.0.0 -m 'First stable release' && git tag -p v1.0.0 -f 'Initial release' && git push --tags --force-with-lease" }
poolside/laguna-xs.2:high PASS 100% 102 in → 135 out (115 reasoning)
git tag -a v1.0.0 -m 'First stable release'
poolside/laguna-xs.2:high__json_schema PASS 100% 104 in → 163 out (137 reasoning)
git tag -a v1.0.0 -m 'First stable release'
JSON Schema Structured Output
(raw) { "command": "git tag -a v1.0.0 -m 'First stable release'" }
poolside/laguna-xs.2:none PASS 100% 104 in → 20 out (0 reasoning)
git tag -a v1.0.0 -m 'First stable release' HEAD
poolside/laguna-xs.2:none__json_schema PASS 100% 103 in → 24 out (0 reasoning)
git tag -a v1.0.0 -m 'First stable release' HEAD
JSON Schema Structured Output
(raw) {"command": "git tag -a v1.0.0 -m 'First stable release' HEAD"}
mistralai/devstral-2512 FAIL 0% 53 in → 24 out
```bash git tag -a v1.0.0 -m "First stable release" HEAD ```
Failure: Failed assertions: [{'type': 'git_output', 'command': 'git tag -l', 'contains': 'v1.0.0'}, {'type': 'git_output', 'command': 'git tag -n1 v1.0.0', 'contains': 'First stable release'}, {'type': 'git_output', 'command': 'git cat-file -t v1.0.0', 'contains': 'tag'}]
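The three assertions named in the failure message can be approximated by hand. The sketch below is not the harness's actual implementation: it assumes each `git_output` assertion runs the listed command and checks that its output contains the given substring. The `check` helper and the scratch directory are hypothetical, introduced only for illustration.

```shell
# Assumption: scratch repo mirroring the benchmark setup.
repo="$(mktemp -d)"
cd "$repo"
git init --quiet
git config user.email 'test@test.com'
git config user.name 'Test User'
echo 'App' > app.txt
git add app.txt
git commit --quiet -m 'Initial release'

# Candidate command under test (here, the expected answer).
git tag -a v1.0.0 -m 'First stable release'

# Hypothetical helper: succeed if the command's output contains the substring.
# $1 is the expected substring; the remaining arguments are the command to run.
check() {
  want="$1"
  shift
  if "$@" | grep -qF -- "$want"; then
    echo "PASS: $*"
  else
    echo "FAIL: $*"
    return 1
  fi
}

check 'v1.0.0' git tag -l                        # the tag exists
check 'First stable release' git tag -n1 v1.0.0  # the tag carries the message
check 'tag' git cat-file -t v1.0.0               # the ref is a tag object, not a bare commit
```

The last check is the one that separates annotated from lightweight tags: for a lightweight tag, `git cat-file -t` reports `commit` instead of `tag`.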