GitBench
Count commits in a date range
Tests the ability to count commits within a date range, which requires combining time filters with aggregation.

These commands set up the repo before the model sees the prompt. They define the starting file structure, staged changes, and Git history.

  1. git init
  2. git config user.email 'test@test.com'
  3. git config user.name 'Test User'
  4. GIT_AUTHOR_DATE='2025-01-15T10:00:00' GIT_COMMITTER_DATE='2025-01-15T10:00:00' bash -c 'echo x > f.txt && git add f.txt && git commit -m "Mid-Jan commit"'
  5. GIT_AUTHOR_DATE='2025-02-10T10:00:00' GIT_COMMITTER_DATE='2025-02-10T10:00:00' bash -c 'echo y > f.txt && git add f.txt && git commit -m "Mid-Feb commit"'
  6. GIT_AUTHOR_DATE='2025-02-20T10:00:00' GIT_COMMITTER_DATE='2025-02-20T10:00:00' bash -c 'echo z > f.txt && git add f.txt && git commit -m "Late-Feb commit"'
  7. GIT_AUTHOR_DATE='2025-03-05T10:00:00' GIT_COMMITTER_DATE='2025-03-05T10:00:00' bash -c 'echo w > f.txt && git add f.txt && git commit -m "Early-Mar commit"'
  8. GIT_AUTHOR_DATE='2025-04-01T10:00:00' GIT_COMMITTER_DATE='2025-04-01T10:00:00' bash -c 'echo v > f.txt && git add f.txt && git commit -m "April commit"'
Prompt
How many commits were made between 2025-02-01 and 2025-03-31 (inclusive)? Output ONLY the number, nothing else.
Expected
3
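One way to check the expected value by hand is sketched below. It is not the benchmark's own reference solution. It rebuilds the five commit dates from the setup in a scratch directory (the `repo` and `count` variable names and the commit messages are illustrative) and counts with `git rev-list --count`. The sketch assumes the range is meant in the machine's local time zone and that committer dates are what matter; in this setup the author and committer dates are identical, so the distinction does not change the result. Explicit times are given on both bounds because Git may otherwise fill in the current time of day for a date-only value.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Scratch repo with the same five commit dates as the setup commands.
repo="$(mktemp -d)"
cd "$repo"
git init -q
git config user.email 'test@test.com'
git config user.name 'Test User'

for d in 2025-01-15 2025-02-10 2025-02-20 2025-03-05 2025-04-01; do
  echo "$d" > f.txt
  git add f.txt
  GIT_AUTHOR_DATE="${d}T10:00:00" GIT_COMMITTER_DATE="${d}T10:00:00" \
    git commit -q -m "commit on $d"
done

# --since/--until filter on the committer date. Spelling out 00:00:00 and
# 23:59:59 makes both end days fully inclusive.
count="$(git rev-list --count \
  --since='2025-02-01T00:00:00' \
  --until='2025-03-31T23:59:59' HEAD)"

echo "$count"   # prints 3: the Feb 10, Feb 20, and Mar 5 commits
```

The January and April commits fall outside the window, which leaves the three commits the expected answer counts.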
deepseek/deepseek-v4-flash:high PASS 100% 703 in → 143 out (102 reasoning)
3
deepseek/deepseek-v4-flash:high__json_schema PASS 100% 703 in → 139 out (130 reasoning)
3
JSON Schema Structured Output
(raw) { "count": 3 }
deepseek/deepseek-v4-flash:none__json_schema PASS 100% 703 in → 8 out (0 reasoning)
3
JSON Schema Structured Output
(raw) { "count": 3 }
mistralai/devstral-2512__json_schema PASS 100% 914 in → 7 out
3
JSON Schema Structured Output
(raw) {"count": 3}
nvidia/nemotron-3-nano-30b-a3b:high PASS 100% 927 in → 299 out (176 reasoning)
3
nvidia/nemotron-3-nano-30b-a3b:high__json_schema PASS 100% 927 in → 184 out (116 reasoning)
3
JSON Schema Structured Output
(raw) { "count": 3 }
poolside/laguna-xs.2:high PASS 100% 943 in → 295 out (291 reasoning)
3
poolside/laguna-xs.2:high__json_schema PASS 100% 943 in → 379 out (367 reasoning)
3
JSON Schema Structured Output
(raw) { "count": 3 }
poolside/laguna-xs.2:none PASS 100% 943 in → 3 out (0 reasoning)
3
poolside/laguna-xs.2:none__json_schema PASS 100% 943 in → 7 out (0 reasoning)
3
JSON Schema Structured Output
(raw) {"count": 3}
deepseek/deepseek-v4-flash:none FAIL 0% 703 in → 2 out (0 reasoning)
2
Failure: Expected '3', got '2'
mistralai/devstral-2512 FAIL 0% 914 in → 2 out
2
Failure: Expected '3', got '2'
nvidia/nemotron-3-nano-30b-a3b:none FAIL 0% 927 in → 2 out (0 reasoning)
2
Failure: Expected '3', got '2'
nvidia/nemotron-3-nano-30b-a3b:none__json_schema FAIL 0% 927 in → 12 out (0 reasoning)
2
JSON Schema Structured Output
(raw) { "count": 2 }
Failure: Expected '3', got '2'