GitBench
Poolside / laguna-xs.2

none

Text: 74% (151 / 204 fixtures), 1 run(s)
48,954 input tokens / 7,954 total output tokens (0 reasoning), $0.00602300
JSON: 75% (153 / 204 fixtures), 1 run(s)
49,058 input tokens / 3,636 total output tokens (0 reasoning), $0.00516980
Pass Rate Delta: +1% (Text: 74% → JSON: 75%)
Gained (JSON pass / text fail): +19
Lost (text pass / JSON fail): −17
Unchanged Pass (both pass): 134
Unchanged Fail (both fail): 34
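
The four summary counts can be cross-checked against the headline pass rates. The sketch below only re-derives figures already shown on this page; the variable names and the `pct` helper are illustrative and not part of GitBench.

```python
# Counts copied from the summary above.
gained = 19      # JSON pass / text fail
lost = 17        # text pass / JSON fail
both_pass = 134  # pass in both modes
both_fail = 34   # fail in both modes

# Every fixture falls into exactly one of the four buckets.
total = gained + lost + both_pass + both_fail

# A mode's pass count is "both pass" plus the fixtures only that mode passed.
text_pass = both_pass + lost
json_pass = both_pass + gained

# Fixtures whose outcome differs between the two modes.
changed = gained + lost


def pct(passed: int, out_of: int) -> int:
    """Pass rate rounded to a whole percent (hypothetical helper)."""
    return round(100 * passed / out_of)


print(total, text_pass, json_pass, changed)
print(pct(text_pass, total), pct(json_pass, total))
```

The net change is 19 − 17 = 2 fixtures, which is why a 36-fixture churn shows up as only about one percentage point in the headline delta.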
Benchmark Deltas
Benchmark          Text     JSON     Delta
commit_squash      50%      100%     +50%
reflog             100%     50%      -50%
blame_forensics    41.7%    83.3%    +41.7%
branch_cleanup     83.3%    58.3%    -25%
git_log_format     100%     83.3%    -16.7%
git_clean          66.7%    58.3%    -8.3%
worktree_usage     58.3%    66.7%    +8.3%
git_grep           83.3%    75%      -8.3%
merge_conflicts    50%      58.3%    +8.3%
rebase             50%      58.3%    +8.3%
submodule_usage    66.7%    75%      +8.3%
cherry_pick        41.7%    41.7%    0%
commit_messages    100%     100%     0%
git_bisect         100%     100%     0%
git_show           91.7%    91.7%    0%
stash_recovery     100%     100%     0%
tag_management     75%      75%      0%
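
The per-benchmark percentages are all multiples of 1/12, and 17 benchmarks × 12 = 204, which suggests (as an assumption, since the page does not state it) that each benchmark has 12 fixtures. Under that assumption, the following sketch converts the percentages back to pass counts and checks that they add up to the overall totals of 151 and 153. The constant `FIXTURES_PER_BENCHMARK` and the function name are hypothetical.

```python
# Assumption: 12 fixtures per benchmark (inferred from the percentages, not stated).
FIXTURES_PER_BENCHMARK = 12

# (text %, JSON %) copied from the Benchmark Deltas table.
rates = {
    "commit_squash": (50, 100),
    "reflog": (100, 50),
    "blame_forensics": (41.7, 83.3),
    "branch_cleanup": (83.3, 58.3),
    "git_log_format": (100, 83.3),
    "git_clean": (66.7, 58.3),
    "worktree_usage": (58.3, 66.7),
    "git_grep": (83.3, 75),
    "merge_conflicts": (50, 58.3),
    "rebase": (50, 58.3),
    "submodule_usage": (66.7, 75),
    "cherry_pick": (41.7, 41.7),
    "commit_messages": (100, 100),
    "git_bisect": (100, 100),
    "git_show": (91.7, 91.7),
    "stash_recovery": (100, 100),
    "tag_management": (75, 75),
}


def to_count(percent: float) -> int:
    """Convert a rounded percentage back to a whole number of passing fixtures."""
    return round(percent / 100 * FIXTURES_PER_BENCHMARK)


text_total = sum(to_count(text) for text, _ in rates.values())
json_total = sum(to_count(js) for _, js in rates.values())

print(len(rates) * FIXTURES_PER_BENCHMARK, text_total, json_total)
```

Read this way, the two largest swings (commit_squash and reflog) are each 6 fixtures in opposite directions and cancel out, while blame_forensics contributes +5 and branch_cleanup −3.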
Changed Fixtures (36)