f001 — git_log

Count commits matching a grep pattern

Tests whether the model can count the commits in the git log whose messages match a grep pattern. Evaluates log filtering and counting.

Baseline Repository

These commands set up the repo before the model sees the prompt. They define the starting file structure, staged changes, and Git history.

Prompt

How many commits in this repository have 'Fix' in their commit message? Output ONLY the number, nothing else.
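The counting this prompt asks for can be sketched in a few lines. This is a minimal illustration, not the benchmark's own checker: the function name `count_matching_commits` and the sample messages are hypothetical, and the baseline repository's real history is not reproduced here. It assumes the match is case-sensitive, as `git log --grep` is by default, and uses Python's `re` module, which behaves the same as Git's basic regular expressions for a literal pattern like `Fix`.

```python
import re


def count_matching_commits(messages, pattern):
    """Count commit messages that contain a match for the regex `pattern`.

    Approximates filtering with `git log --grep=<pattern>`:
    the search is case-sensitive and may match anywhere in the message.
    """
    regex = re.compile(pattern)
    return sum(1 for message in messages if regex.search(message))


# Hypothetical history for illustration only (not the baseline repository).
sample_messages = [
    "Initial commit",
    "Fix login bug",
    "Add feature X",
    "Fix typo in README",
    "fix lint warnings",  # lowercase: not matched by a case-sensitive search
]

print(count_matching_commits(sample_messages, "Fix"))  # prints 2
```

Against a real repository, the same count can be read directly with `git rev-list --count --grep=Fix HEAD`.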

Expected

Model Outputs (14)

deepseek/deepseek-v4-flash:high PASS 100% 705 in → 44 out (41 reasoning)

deepseek/deepseek-v4-flash:high__json_schema PASS 100% 702 in → 85 out (74 reasoning)

JSON Schema Structured Output

(raw) { "count": 2 }

deepseek/deepseek-v4-flash:none PASS 100% 702 in → 1 out (0 reasoning)

deepseek/deepseek-v4-flash:none__json_schema PASS 100% 712 in → 9 out (0 reasoning)

JSON Schema Structured Output

(raw) { "count": 2 }

mistralai/devstral-2512 PASS 100% 914 in → 2 out

JSON Schema Structured Output

(raw) {"count": 2}

nvidia/nemotron-3-nano-30b-a3b:high PASS 100% 920 in → 120 out (128 reasoning)

nvidia/nemotron-3-nano-30b-a3b:high__json_schema PASS 100% 935 in → 161 out (175 reasoning)

JSON Schema Structured Output

(raw) {"count": 2}

nvidia/nemotron-3-nano-30b-a3b:none PASS 100% 929 in → 2 out (0 reasoning)

JSON Schema Structured Output

(raw) { "count": 2 }

poolside/laguna-xs.2:high PASS 100% 947 in → 115 out (111 reasoning)

poolside/laguna-xs.2:high__json_schema PASS 100% 960 in → 420 out (408 reasoning)

JSON Schema Structured Output

(raw) { "count": 2 }

poolside/laguna-xs.2:none PASS 100% 954 in → 3 out (0 reasoning)

poolside/laguna-xs.2:none__json_schema PASS 100% 947 in → 11 out (0 reasoning)

JSON Schema Structured Output

(raw) { "count": 2 }
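The outputs above arrive in two shapes: a bare number for the plain variants and a `{ "count": N }` object for the `__json_schema` variants. A grader would need to reduce both to one integer before comparing. The sketch below shows one way to do that. The helper `extract_count` is hypothetical, and how the benchmark actually scores outputs is not stated on this page.

```python
import json


def extract_count(output):
    """Return the integer count from a plain or JSON reply, or None.

    Accepts a bare number such as "2" or an object such as '{"count": 2}'.
    """
    text = output.strip()
    try:
        value = json.loads(text)
    except json.JSONDecodeError:
        return None  # not a number and not valid JSON
    if isinstance(value, dict):
        value = value.get("count")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(value, int) and not isinstance(value, bool):
        return value
    return None


print(extract_count('{ "count": 2 }'))  # prints 2
print(extract_count("2"))               # prints 2
```

Returning `None` for anything unparseable keeps a malformed reply from being scored as a valid count.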