GitBench
Deepseek / deepseek-v4-flash

none

Text: 79.4% (162 / 204 fixtures), 1 run
33,944 input tokens / 5,380 output tokens (2,342 of them reasoning), cost $0.00515977
JSON: 84.8% (173 / 204 fixtures), 1 run
34,566 input tokens / 4,068 output tokens (0 of them reasoning), cost $0.00536949
Pass Rate Delta
+5.4% Text: 79.4% → JSON: 84.8%
Gained (JSON pass / text fail): +19
Lost (text pass / JSON fail): −8
Unchanged Pass (both pass): 154
Unchanged Fail (both fail): 23
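
The four counts above can be reconciled with the headline pass rates. The following is a minimal sketch of that arithmetic; the variable names are illustrative and not part of GitBench.

```python
# Counts copied from the summary above; all names here are hypothetical.
gained = 19      # JSON pass / text fail
lost = 8         # text pass / JSON fail
both_pass = 154  # unchanged pass
both_fail = 23   # unchanged fail

total = gained + lost + both_pass + both_fail
text_pass = both_pass + lost    # fixtures that passed in text mode
json_pass = both_pass + gained  # fixtures that passed in JSON mode

text_rate = round(100 * text_pass / total, 1)
json_rate = round(100 * json_pass / total, 1)
delta = round(json_rate - text_rate, 1)  # difference of the rounded rates
changed = gained + lost

print(total, text_pass, json_pass)          # 204 162 173
print(text_rate, json_rate, delta, changed)  # 79.4 84.8 5.4 27
```

The totals match the 162 / 204 and 173 / 204 figures reported for the two modes, and the 27 changed fixtures are the 19 gained plus the 8 lost.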
Fixture Reliability Delta
Fixture Text JSON Delta
f011 100% (1/1) 0% (0/1) -100%
Benchmark Deltas
Benchmark         Text     JSON     Delta
commit_squash     50%      100%     +50%
git_grep          58.3%    83.3%    +25%
blame_forensics   58.3%    75%      +16.7%
git_log_format    83.3%    91.7%    +8.3%
merge_conflicts   66.7%    58.3%    -8.3%
submodule_usage   83.3%    91.7%    +8.3%
tag_management    83.3%    91.7%    +8.3%
worktree_usage    83.3%    91.7%    +8.3%
branch_cleanup    83.3%    75%      -8.3%
git_clean         83.3%    75%      -8.3%
reflog            100%     91.7%    -8.3%
cherry_pick       66.7%    66.7%    0%
commit_messages   91.7%    91.7%    0%
git_bisect        100%     100%     0%
git_show          91.7%    91.7%    0%
rebase            66.7%    66.7%    0%
stash_recovery    100%     100%     0%
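
The per-benchmark percentages can be cross-checked against the overall totals. The sketch below assumes every benchmark has 12 fixtures; the page does not state this, it is inferred from 204 fixtures across 17 benchmarks and from every percentage being a multiple of 1/12. The names are illustrative.

```python
# ASSUMPTION: 12 fixtures per benchmark (204 fixtures / 17 benchmarks).
FIXTURES_PER_BENCHMARK = 12

# (text %, JSON %) copied from the Benchmark Deltas table.
rates = {
    "commit_squash": (50, 100), "git_grep": (58.3, 83.3),
    "blame_forensics": (58.3, 75), "git_log_format": (83.3, 91.7),
    "merge_conflicts": (66.7, 58.3), "submodule_usage": (83.3, 91.7),
    "tag_management": (83.3, 91.7), "worktree_usage": (83.3, 91.7),
    "branch_cleanup": (83.3, 75), "git_clean": (83.3, 75),
    "reflog": (100, 91.7), "cherry_pick": (66.7, 66.7),
    "commit_messages": (91.7, 91.7), "git_bisect": (100, 100),
    "git_show": (91.7, 91.7), "rebase": (66.7, 66.7),
    "stash_recovery": (100, 100),
}

def to_count(pct: float) -> int:
    """Convert a rounded percentage back to a whole number of fixtures."""
    return round(pct / 100 * FIXTURES_PER_BENCHMARK)

text_total = sum(to_count(t) for t, _ in rates.values())
json_total = sum(to_count(j) for _, j in rates.values())

print(text_total, json_total, json_total - text_total)  # 162 173 11
```

Under that assumption the per-benchmark counts sum to the reported 162 and 173 passes, a net gain of 11 fixtures for JSON, with 6 of them from commit_squash alone.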
Changed Fixtures (27)