GitBench
List commits in a date range
Tests ability to list commits in a date range. Evaluates time-based filtering with git log.

These commands set up the repo before the model sees the prompt. They define the starting file structure, staged changes, and Git history.

  1. git init
  2. git config user.email 'test@test.com'
  3. git config user.name 'Test User'
  4. GIT_AUTHOR_DATE='2025-01-01T12:00:00' GIT_COMMITTER_DATE='2025-01-01T12:00:00' bash -c 'echo a > f.txt && git add f.txt && git commit -m "January commit"'
  5. GIT_AUTHOR_DATE='2025-02-01T12:00:00' GIT_COMMITTER_DATE='2025-02-01T12:00:00' bash -c 'echo b > f.txt && git add f.txt && git commit -m "February commit"'
  6. GIT_AUTHOR_DATE='2025-03-01T12:00:00' GIT_COMMITTER_DATE='2025-03-01T12:00:00' bash -c 'echo c > f.txt && git add f.txt && git commit -m "March commit"'
  7. GIT_AUTHOR_DATE='2025-04-01T12:00:00' GIT_COMMITTER_DATE='2025-04-01T12:00:00' bash -c 'echo d > f.txt && git add f.txt && git commit -m "April commit"'
Prompt
List all commits made between 2025-02-01 and 2025-03-31 (inclusive). Output ONLY the commit messages, one per line, nothing else.
Expected
February commit March commit
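One way to produce the expected output is sketched below. It is an illustration, not the benchmark's reference solution: the helper name list_commits_between is hypothetical, and the sketch assumes the expected order is oldest first, as the Expected line suggests. By default git log prints newest first, so --reverse is used to flip the order. Explicit times are appended to the dates because Git can fill in the current time of day when --since or --until is given a date only, which could exclude a commit made earlier on the boundary day.

```shell
#!/usr/bin/env bash
# Sketch: print commit subjects in an inclusive date range, oldest first.
# list_commits_between is a hypothetical helper, not part of GitBench.
list_commits_between() {
  local start="$1" end="$2"
  # --since/--until filter on the committer date.
  # --format=%s prints only the subject line of each commit.
  # --reverse flips git log's default newest-first order.
  git log --reverse --format=%s \
    --since="${start} 00:00:00" \
    --until="${end} 23:59:59"
}

# Usage (run inside the repository):
#   list_commits_between 2025-02-01 2025-03-31
```

Run against the repository built by the setup commands, this should print "February commit" followed by "March commit", each on its own line.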
deepseek/deepseek-v4-flash:high PASS 100% 590 in → 191 out (184 reasoning)
February commit March commit
deepseek/deepseek-v4-flash:high__json_schema PASS 100% 651 in → 192 out (169 reasoning)
February commit March commit
JSON Schema Structured Output
(raw) {"commit_message_list": ["February commit", "March commit"]}
nvidia/nemotron-3-nano-30b-a3b:high PASS 100% 748 in → 632 out (519 reasoning)
February commit March commit
nvidia/nemotron-3-nano-30b-a3b:high__json_schema PASS 100% 748 in → 345 out (250 reasoning)
February commit March commit
JSON Schema Structured Output
(raw) { "commit_message_list": [ "February commit", "March commit" ] }
poolside/laguna-xs.2:high PASS 100% 776 in → 421 out (413 reasoning)
February commit March commit
poolside/laguna-xs.2:high__json_schema PASS 100% 776 in → 332 out (302 reasoning)
February commit March commit
JSON Schema Structured Output
(raw) { "commit_message_list": [ "February commit", "March commit" ] }
poolside/laguna-xs.2:none PASS 100% 776 in → 7 out (0 reasoning)
February commit March commit
deepseek/deepseek-v4-flash:none FAIL 0% 573 in → 6 out (0 reasoning)
March commit February commit
Failure: Expected 'February commit March commit', got 'March commit February commit'
deepseek/deepseek-v4-flash:none__json_schema FAIL 0% 573 in → 23 out (0 reasoning)
March commit February commit
JSON Schema Structured Output
(raw) { "commit_message_list": [ "March commit", "February commit" ] }
Failure: Expected 'February commit March commit', got 'March commit February commit'
mistralai/devstral-2512 FAIL 0% 735 in → 6 out
March commit February commit
Failure: Expected 'February commit March commit', got 'March commit February commit'
mistralai/devstral-2512__json_schema FAIL 0% 735 in → 15 out
March commit February commit
JSON Schema Structured Output
(raw) {"commit_message_list": ["March commit", "February commit"]}
Failure: Expected 'February commit March commit', got 'March commit February commit'
nvidia/nemotron-3-nano-30b-a3b:none FAIL 0% 748 in → 112 out (0 reasoning)
c3a6a944c845e3309cc8e6e76656cdb529de1ef9 eb3c63770b4ef621ccd003c101db519e1427225c c5123c440bfbc522b085898f04b1f637264a5fb7
Failure: Expected 'February commit March commit', got 'c3a6a944c845e3309cc8e6e76656cdb529de1ef9 eb3c63770b4ef621ccd003c101db519e1427225c c5123c440bfbc522b085898f04b1f637264a5fb7'
nvidia/nemotron-3-nano-30b-a3b:none__json_schema FAIL 0% 748 in → 34 out (0 reasoning)
Fourth commit Third commit Second commit First commit
JSON Schema Structured Output
(raw) { "commit_message_list": [ "Fourth commit", "Third commit", "Second commit", "First commit" ] }
Failure: Expected 'February commit March commit', got 'Fourth commit Third commit Second commit First commit'
poolside/laguna-xs.2:none__json_schema FAIL 0% 776 in → 14 out (0 reasoning)
March commit February commit
JSON Schema Structured Output
(raw) {"commit_message_list": ["March commit", "February commit"]}
Failure: Expected 'February commit March commit', got 'March commit February commit'