f004 — git_log

Count commits by a specific author

Tests whether the model can count the commits made by a specific author, which requires combining an author filter with a count.
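
The operation under test can be sketched as below. This is a minimal illustration, assuming the history has been dumped with `git log --format=%an` (one author name per line). The function name `count_commits_by_author` and the sample history are hypothetical; the baseline repository's real setup commands are not reproduced on this page.

```python
from collections import Counter


def count_commits_by_author(log_text: str, author: str) -> int:
    """Count commits whose author name equals `author`.

    `log_text` is assumed to be the output of `git log --format=%an`:
    one author name per line, one line per commit.
    """
    names = [line.strip() for line in log_text.splitlines() if line.strip()]
    return Counter(names)[author]


# Hypothetical history, NOT the benchmark's actual baseline repository.
sample_log = """Alice
Bob
Alice
Carol
Alice
"""

print(count_commits_by_author(sample_log, "Alice"))  # prints 3
```

Against a real repository, `git rev-list --count --author=Alice HEAD` should give the count directly. Note that Git's `--author` option matches a pattern against the author's name and email, so it can be looser than the exact name comparison used in this sketch.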

Baseline Repository

These commands set up the repo before the model sees the prompt. They define the starting file structure, staged changes, and Git history.

Prompt

How many commits were authored by Alice? Output ONLY the number, nothing else.

Expected

3


Model Outputs (14)

deepseek/deepseek-v4-flash:high PASS 100% 704 in → 88 out (86 reasoning)

deepseek/deepseek-v4-flash:high__json_schema PASS 100% 704 in → 79 out (70 reasoning)

JSON Schema Structured Output

(raw) { "count": 3 }

deepseek/deepseek-v4-flash:none PASS 100% 700 in → 2 out (0 reasoning)

deepseek/deepseek-v4-flash:none__json_schema PASS 100% 702 in → 9 out (0 reasoning)

JSON Schema Structured Output

(raw) { "count": 3 }

mistralai/devstral-2512 PASS 100% 888 in → 2 out

JSON Schema Structured Output

(raw) {"count": 3}

nvidia/nemotron-3-nano-30b-a3b:high PASS 100% 922 in → 91 out (74 reasoning)

JSON Schema Structured Output

(raw) { "count": 3 }

poolside/laguna-xs.2:high PASS 100% 938 in → 239 out (235 reasoning)

poolside/laguna-xs.2:high__json_schema PASS 100% 917 in → 885 out (884 reasoning)

JSON Schema Structured Output

(raw) {"count":3}

poolside/laguna-xs.2:none PASS 100% 938 in → 3 out (0 reasoning)

nvidia/nemotron-3-nano-30b-a3b:none FAIL 0% 914 in → 3 out (0 reasoning)

3 ways

Failure: Expected '3', got '3 ways'

JSON Schema Structured Output

(raw) { "count": 4 }

Failure: Expected '3', got '4'

poolside/laguna-xs.2:none__json_schema FAIL 0% 938 in → 7 out (0 reasoning)

JSON Schema Structured Output

(raw) {"count": 4}

Failure: Expected '3', got '4'
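
The failure messages above are consistent with an exact string comparison against the expected value, which would explain why '3 ways' fails even though it contains the right number. The following is a rough sketch of such a check, not the benchmark's actual grader: the names `extract_answer` and `grade` are hypothetical, and it assumes structured runs return a JSON object with a single `count` field, as the raw outputs suggest.

```python
import json


def extract_answer(raw: str, json_schema: bool = False) -> str:
    """Return the answer string that is compared with the expected value."""
    if json_schema:
        # Assumption: structured runs return an object like {"count": 3}.
        return str(json.loads(raw)["count"])
    # Plain runs: only surrounding whitespace is ignored.
    return raw.strip()


def grade(raw: str, expected: str, json_schema: bool = False) -> str:
    """Exact-match grading; any extra text makes the answer fail."""
    got = extract_answer(raw, json_schema)
    if got == expected:
        return "PASS"
    return f"Failure: Expected '{expected}', got '{got}'"


print(grade("3", "3"))
print(grade("3 ways", "3"))
print(grade('{"count": 4}', "3", json_schema=True))
```

Under this reading, the prompt's "Output ONLY the number" instruction is enforced strictly: a correct count followed by any extra words is scored the same as a wrong count.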