Poolside / laguna-xs.2
none
74%
151 / 204 fixtures
1 run(s)
48,954 input / 7,954 total output / 0 reasoning within output tokens; cost $0.00602300
Reliability by Benchmark (Text)
Text vs JSON Schema Comparison
Pass Rate Delta
+1%
Text: 74% → JSON: 75%
| Outcome | Fixtures | Meaning |
|---|---|---|
| Gained | +19 | JSON pass / text fail |
| Lost | −17 | Text pass / JSON fail |
| Unchanged Pass | 134 | Both pass |
| Unchanged Fail | 34 | Both fail |

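The four buckets above can be cross-checked against the headline pass rates. The following is a minimal sketch, assuming every fixture falls into exactly one bucket; the variable names are illustrative and not part of the benchmark tooling.

```python
# Bucket counts copied from the Text vs JSON Schema comparison.
gained = 19           # JSON pass / text fail
lost = 17             # text pass / JSON fail
unchanged_pass = 134  # both pass
unchanged_fail = 34   # both fail

# Every fixture is assumed to land in exactly one bucket.
total = gained + lost + unchanged_pass + unchanged_fail

# A fixture passes in text mode if it passes in both modes or was "lost" in JSON mode.
text_pass = unchanged_pass + lost
# A fixture passes in JSON mode if it passes in both modes or was "gained" in JSON mode.
json_pass = unchanged_pass + gained

text_rate = round(100 * text_pass / total)
json_rate = round(100 * json_pass / total)
changed = gained + lost

print(total, text_pass, json_pass)  # 204 151 153
print(text_rate, json_rate)         # 74 75
print(changed)                      # 36
```

Under that assumption the buckets reproduce the 151 / 204 text result, the 74% and 75% pass rates, and the 36 changed fixtures reported on this page.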
Fixture Reliability Delta
| Fixture | Text | JSON | Delta |
|---|---|---|---|
Benchmark Deltas
| Benchmark | Text | JSON | Delta |
|---|---|---|---|
| commit_squash | 50% | 100% | +50% |
| reflog | 100% | 50% | -50% |
| blame_forensics | 41.7% | 83.3% | +41.7% |
| branch_cleanup | 83.3% | 58.3% | -25% |
| git_log_format | 100% | 83.3% | -16.7% |
| git_clean | 66.7% | 58.3% | -8.3% |
| worktree_usage | 58.3% | 66.7% | +8.3% |
| git_grep | 83.3% | 75% | -8.3% |
| merge_conflicts | 50% | 58.3% | +8.3% |
| rebase | 50% | 58.3% | +8.3% |
| submodule_usage | 66.7% | 75% | +8.3% |
| cherry_pick | 41.7% | 41.7% | 0% |
| commit_messages | 100% | 100% | 0% |
| git_bisect | 100% | 100% | 0% |
| git_show | 91.7% | 91.7% | 0% |
| stash_recovery | 100% | 100% | 0% |
| tag_management | 75% | 75% | 0% |

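
The rounded percentages in the benchmark table map back to whole fixture counts. The following is a rough sketch, assuming 12 fixtures per benchmark (the gallery lists f001 to f012 for each of the 17 benchmarks, which gives 204); the `passes` helper is hypothetical and only illustrates the arithmetic.

```python
FIXTURES_PER_BENCHMARK = 12  # assumption: 17 benchmarks x 12 fixtures = 204


def passes(rate_percent: float, n: int = FIXTURES_PER_BENCHMARK) -> int:
    """Convert a rounded pass rate back to a whole number of passing fixtures."""
    return round(rate_percent / 100 * n)


# (text rate, JSON rate) pairs copied from the first four table rows.
rows = {
    "commit_squash": (50.0, 100.0),
    "reflog": (100.0, 50.0),
    "blame_forensics": (41.7, 83.3),
    "branch_cleanup": (83.3, 58.3),
}

deltas = {}
for name, (text_rate, json_rate) in rows.items():
    text_count, json_count = passes(text_rate), passes(json_rate)
    deltas[name] = json_count - text_count
    print(f"{name}: text {text_count}/12, JSON {json_count}/12, delta {deltas[name]:+d}")
```

Read this way, a delta of 8.3% corresponds to a single fixture changing outcome, so most of the smaller rows in the table reflect one-fixture differences from a single run.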
Changed Fixtures (36)
f006 +gain f008 +gain f009 +gain f011 +gain f012 +gain f002 -loss f009 -loss f010 -loss f002 -loss f008 +gain f011 -loss f012 +gain f003 +gain f005 +gain f008 +gain f009 +gain f011 +gain f012 +gain f005 -loss f006 -loss f004 -loss f005 -loss f007 +gain f008 +gain f003 -loss f004 -loss f005 -loss f007 -loss f008 -loss f012 -loss f002 +gain f006 -loss f009 +gain f006 +gain f011 -loss f012 +gain
Fixture Gallery (204)
| Fixture | Score | Status | Input tokens | Total output tokens | Reasoning tokens within output |
|---|---|---|---|---|---|
| blame_forensics f001 | 100% | Passed | 298 | 6 | 0 |
| blame_forensics f002 | 100% | Passed | 298 | 5 | 0 |
| blame_forensics f003 | 100% | Passed | 322 | 8 | 0 |
| blame_forensics f004 | 100% | Passed | 241 | 7 | 0 |
| blame_forensics f005 | 0% | Failed | 313 | 13 | 0 |
| blame_forensics f006 | 0% | Failed | 284 | 7 | 0 |
| blame_forensics f007 | 100% | Passed | 263 | 6 | 0 |
| blame_forensics f008 | 0% | Failed | 300 | 5 | 0 |
| blame_forensics f009 | 0% | Failed | 246 | 10 | 0 |
| blame_forensics f010 | 0% | Failed | 483 | 11 | 0 |
| blame_forensics f011 | 0% | Failed | 305 | 12 | 0 |
| blame_forensics f012 | 0% | Failed | 191 | 12 | 0 |

| Fixture | Score | Status | Input tokens | Total output tokens | Reasoning tokens within output |
|---|---|---|---|---|---|
| branch_cleanup f001 | 100% | Failed | 148 | 8 | 0 |
| branch_cleanup f002 | 100% | Passed | 168 | 7 | 0 |
| branch_cleanup f003 | 100% | Failed | 144 | 120 | 0 |
| branch_cleanup f004 | 100% | Passed | 180 | 7 | 0 |
| branch_cleanup f005 | 100% | Passed | 154 | 5 | 0 |
| branch_cleanup f006 | 100% | Passed | 173 | 11 | 0 |
| branch_cleanup f007 | 100% | Passed | 154 | 7 | 0 |
| branch_cleanup f008 | 100% | Passed | 172 | 11 | 0 |
| branch_cleanup f009 | 100% | Passed | 135 | 3 | 0 |
| branch_cleanup f010 | 100% | Passed | 163 | 7 | 0 |
| branch_cleanup f011 | 100% | Passed | 114 | 3 | 0 |
| branch_cleanup f012 | 100% | Passed | 204 | 11 | 0 |

| Fixture | Score | Status | Input tokens | Total output tokens | Reasoning tokens within output |
|---|---|---|---|---|---|
| cherry_pick f001 | 0% | Failed | 141 | 6 | 0 |
| cherry_pick f002 | 100% | Passed | 160 | 9 | 0 |
| cherry_pick f003 | 0% | Failed | 131 | 9 | 0 |
| cherry_pick f004 | 0% | Failed | 203 | 25 | 0 |
| cherry_pick f005 | 0% | Failed | 177 | 26 | 0 |
| cherry_pick f006 | 100% | Passed | 187 | 24 | 0 |
| cherry_pick f007 | 100% | Passed | 179 | 12 | 0 |
| cherry_pick f008 | 0% | Failed | 168 | 63 | 0 |
| cherry_pick f009 | 100% | Passed | 151 | 7 | 0 |
| cherry_pick f010 | 0% | Failed | 266 | 30 | 0 |
| cherry_pick f011 | 100% | Passed | 210 | 25 | 0 |
| cherry_pick f012 | 0% | Failed | 316 | 120 | 0 |

| Fixture | Score | Status | Input tokens | Total output tokens | Reasoning tokens within output |
|---|---|---|---|---|---|
| commit_messages f001 | 87% | Passed | 155 | 8 | 0 |
| commit_messages f002 | 85% | Passed | 274 | 11 | 0 |
| commit_messages f003 | 91.7% | Passed | 128 | 10 | 0 |
| commit_messages f004 | 93.7% | Passed | 160 | 6 | 0 |
| commit_messages f005 | 91.7% | Passed | 150 | 6 | 0 |
| commit_messages f006 | 92.7% | Passed | 191 | 8 | 0 |
| commit_messages f007 | 76.7% | Passed | 232 | 9 | 0 |
| commit_messages f008 | 86.7% | Passed | 122 | 6 | 0 |
| commit_messages f009 | 81.7% | Passed | 176 | 10 | 0 |
| commit_messages f010 | 93.3% | Passed | 345 | 11 | 0 |
| commit_messages f011 | 62.7% | Passed | 195 | 12 | 0 |
| commit_messages f012 | 90% | Passed | 172 | 7 | 0 |

| Fixture | Score | Status | Input tokens | Total output tokens | Reasoning tokens within output |
|---|---|---|---|---|---|
| commit_squash f001 | 100% | Passed | 194 | 24 | 0 |
| commit_squash f002 | 100% | Passed | 159 | 16 | 0 |
| commit_squash f003 | 100% | Failed | 160 | 214 | 0 |
| commit_squash f004 | 100% | Passed | 158 | 28 | 0 |
| commit_squash f005 | 100% | Failed | 153 | 148 | 0 |
| commit_squash f006 | 100% | Passed | 129 | 108 | 0 |
| commit_squash f007 | 100% | Passed | 134 | 109 | 0 |
| commit_squash f008 | 100% | Failed | 145 | 124 | 0 |
| commit_squash f009 | 100% | Failed | 140 | 113 | 0 |
| commit_squash f010 | 100% | Passed | 157 | 72 | 0 |
| commit_squash f011 | 100% | Failed | 164 | 92 | 0 |
| commit_squash f012 | 100% | Failed | 163 | 98 | 0 |

| Fixture | Score | Status | Input tokens | Total output tokens | Reasoning tokens within output |
|---|---|---|---|---|---|
| git_bisect f001 | 100% | Passed | 231 | 97 | 0 |
| git_bisect f002 | 100% | Passed | 261 | 106 | 0 |
| git_bisect f003 | 100% | Passed | 253 | 109 | 0 |
| git_bisect f004 | 100% | Passed | 253 | 108 | 0 |
| git_bisect f005 | 100% | Passed | 263 | 127 | 0 |
| git_bisect f006 | 100% | Passed | 279 | 151 | 0 |
| git_bisect f007 | 100% | Passed | 261 | 101 | 0 |
| git_bisect f008 | 100% | Passed | 261 | 155 | 0 |
| git_bisect f009 | 100% | Passed | 261 | 128 | 0 |
| git_bisect f010 | 100% | Passed | 257 | 144 | 0 |
| git_bisect f011 | 100% | Passed | 263 | 122 | 0 |
| git_bisect f012 | 100% | Passed | 279 | 96 | 0 |

| Fixture | Score | Status | Input tokens | Total output tokens | Reasoning tokens within output |
|---|---|---|---|---|---|
| git_clean f001 | 100% | Passed | 96 | 9 | 0 |
| git_clean f002 | 100% | Passed | 107 | 6 | 0 |
| git_clean f003 | 100% | Passed | 106 | 6 | 0 |
| git_clean f004 | 33.3% | Failed | 103 | 12 | 0 |
| git_clean f005 | 100% | Passed | 95 | 8 | 0 |
| git_clean f006 | 100% | Passed | 120 | 8 | 0 |
| git_clean f007 | 50% | Failed | 111 | 11 | 0 |
| git_clean f008 | 100% | Passed | 119 | 6 | 0 |
| git_clean f009 | 100% | Passed | 113 | 8 | 0 |
| git_clean f010 | 100% | Passed | 109 | 6 | 0 |
| git_clean f011 | 40% | Failed | 107 | 6 | 0 |
| git_clean f012 | 66.7% | Failed | 109 | 8 | 0 |

| Fixture | Score | Status | Input tokens | Total output tokens | Reasoning tokens within output |
|---|---|---|---|---|---|
| git_grep f001 | 100% | Passed | 95 | 5 | 0 |
| git_grep f002 | 100% | Passed | 112 | 3 | 0 |
| git_grep f003 | 100% | Passed | 130 | 3 | 0 |
| git_grep f004 | 100% | Passed | 116 | 7 | 0 |
| git_grep f005 | 100% | Passed | 154 | 3 | 0 |
| git_grep f006 | 100% | Passed | 84 | 3 | 0 |
| git_grep f007 | 0% | Failed | 210 | 4 | 0 |
| git_grep f008 | 100% | Passed | 101 | 3 | 0 |
| git_grep f009 | 100% | Passed | 97 | 5 | 0 |
| git_grep f010 | 100% | Passed | 108 | 3 | 0 |
| git_grep f011 | 0% | Failed | 153 | 3 | 0 |
| git_grep f012 | 100% | Passed | 108 | 10 | 0 |

| Fixture | Score | Status | Input tokens | Total output tokens | Reasoning tokens within output |
|---|---|---|---|---|---|
| git_log_format f001 | 100% | Passed | 954 | 3 | 0 |
| git_log_format f002 | 100% | Passed | 901 | 9 | 0 |
| git_log_format f003 | 100% | Passed | 767 | 11 | 0 |
| git_log_format f004 | 100% | Passed | 938 | 3 | 0 |
| git_log_format f005 | 100% | Passed | 776 | 7 | 0 |
| git_log_format f006 | 100% | Passed | 943 | 3 | 0 |
| git_log_format f007 | 100% | Passed | 960 | 8 | 0 |
| git_log_format f008 | 100% | Passed | 1,257 | 3 | 0 |
| git_log_format f009 | 100% | Passed | 768 | 3 | 0 |
| git_log_format f010 | 100% | Passed | 1,434 | 3 | 0 |
| git_log_format f011 | 100% | Passed | 477 | 4 | 0 |
| git_log_format f012 | 100% | Passed | 449 | 3 | 0 |

| Fixture | Score | Status | Input tokens | Total output tokens | Reasoning tokens within output |
|---|---|---|---|---|---|
| git_show f001 | 100% | Passed | 243 | 5 | 0 |
| git_show f002 | 0% | Failed | 245 | 41 | 0 |
| git_show f003 | 100% | Passed | 719 | 8 | 0 |
| git_show f004 | 100% | Passed | 521 | 8 | 0 |
| git_show f005 | 100% | Passed | 244 | 4 | 0 |
| git_show f006 | 100% | Passed | 254 | 3 | 0 |
| git_show f007 | 100% | Passed | 245 | 6 | 0 |
| git_show f008 | 100% | Passed | 254 | 39 | 0 |
| git_show f009 | 100% | Passed | 365 | 3 | 0 |
| git_show f010 | 100% | Passed | 239 | 3 | 0 |
| git_show f011 | 100% | Passed | 270 | 4 | 0 |
| git_show f012 | 100% | Passed | 379 | 3 | 0 |

| Fixture | Score | Status | Input tokens | Total output tokens | Reasoning tokens within output |
|---|---|---|---|---|---|
| merge_conflicts f001 | 0% | Failed | 130 | 6 | 0 |
| merge_conflicts f002 | 100% | Passed | 148 | 9 | 0 |
| merge_conflicts f003 | 0% | Failed | 121 | 5 | 0 |
| merge_conflicts f004 | 100% | Passed | 188 | 26 | 0 |
| merge_conflicts f005 | 0% | Failed | 162 | 26 | 0 |
| merge_conflicts f006 | 100% | Passed | 174 | 29 | 0 |
| merge_conflicts f007 | 0% | Failed | 178 | 12 | 0 |
| merge_conflicts f008 | 100% | Passed | 150 | 8 | 0 |
| merge_conflicts f009 | 100% | Passed | 147 | 7 | 0 |
| merge_conflicts f010 | 0% | Failed | 228 | 30 | 0 |
| merge_conflicts f011 | 100% | Passed | 190 | 25 | 0 |
| merge_conflicts f012 | 0% | Failed | 276 | 26 | 0 |

| Fixture | Score | Status | Input tokens | Total output tokens | Reasoning tokens within output |
|---|---|---|---|---|---|
| rebase f001 | 0% | Failed | 143 | 6 | 0 |
| rebase f002 | 100% | Passed | 166 | 9 | 0 |
| rebase f003 | 100% | Passed | 145 | 5 | 0 |
| rebase f004 | 100% | Passed | 201 | 26 | 0 |
| rebase f005 | 0% | Failed | 174 | 26 | 0 |
| rebase f006 | 0% | Failed | 188 | 32 | 0 |
| rebase f007 | 100% | Passed | 182 | 12 | 0 |
| rebase f008 | 0% | Failed | 164 | 37 | 0 |
| rebase f009 | 100% | Passed | 147 | 7 | 0 |
| rebase f010 | 0% | Failed | 260 | 30 | 0 |
| rebase f011 | 100% | Passed | 201 | 25 | 0 |
| rebase f012 | 0% | Failed | 312 | 107 | 0 |

| Fixture | Score | Status | Input tokens | Total output tokens | Reasoning tokens within output |
|---|---|---|---|---|---|
| reflog f001 | 100% | Passed | 289 | 37 | 0 |
| reflog f002 | 100% | Passed | 283 | 253 | 0 |
| reflog f003 | 100% | Passed | 264 | 96 | 0 |
| reflog f004 | 100% | Passed | 376 | 288 | 0 |
| reflog f005 | 100% | Passed | 434 | 149 | 0 |
| reflog f006 | 100% | Passed | 303 | 184 | 0 |
| reflog f007 | 100% | Passed | 324 | 229 | 0 |
| reflog f008 | 100% | Passed | 411 | 219 | 0 |
| reflog f009 | 100% | Passed | 385 | 139 | 0 |
| reflog f010 | 100% | Passed | 333 | 207 | 0 |
| reflog f011 | 100% | Passed | 386 | 202 | 0 |
| reflog f012 | 100% | Passed | 395 | 263 | 0 |

| Fixture | Score | Status | Input tokens | Total output tokens | Reasoning tokens within output |
|---|---|---|---|---|---|
| stash_recovery f001 | 100% | Passed | 178 | 62 | 0 |
| stash_recovery f002 | 100% | Passed | 339 | 85 | 0 |
| stash_recovery f003 | 100% | Passed | 166 | 35 | 0 |
| stash_recovery f004 | 100% | Passed | 161 | 71 | 0 |
| stash_recovery f005 | 100% | Passed | 172 | 63 | 0 |
| stash_recovery f006 | 100% | Passed | 226 | 49 | 0 |
| stash_recovery f007 | 100% | Passed | 254 | 62 | 0 |
| stash_recovery f008 | 100% | Passed | 222 | 96 | 0 |
| stash_recovery f009 | 100% | Passed | 253 | 103 | 0 |
| stash_recovery f010 | 100% | Passed | 170 | 72 | 0 |
| stash_recovery f011 | 100% | Passed | 166 | 63 | 0 |
| stash_recovery f012 | 100% | Passed | 174 | 85 | 0 |

| Fixture | Score | Status | Input tokens | Total output tokens | Reasoning tokens within output |
|---|---|---|---|---|---|
| submodule_usage f001 | 100% | Passed | 86 | 10 | 0 |
| submodule_usage f002 | 50% | Failed | 146 | 83 | 0 |
| submodule_usage f003 | 75% | Failed | 154 | 40 | 0 |
| submodule_usage f004 | 100% | Passed | 135 | 5 | 0 |
| submodule_usage f005 | 66.7% | Failed | 93 | 32 | 0 |
| submodule_usage f006 | 100% | Passed | 132 | 5 | 0 |
| submodule_usage f007 | 100% | Passed | 93 | 19 | 0 |
| submodule_usage f008 | 33.3% | Failed | 144 | 6 | 0 |
| submodule_usage f009 | 100% | Passed | 141 | 16 | 0 |
| submodule_usage f010 | 100% | Passed | 147 | 8 | 0 |
| submodule_usage f011 | 100% | Passed | 148 | 6 | 0 |
| submodule_usage f012 | 100% | Passed | 91 | 13 | 0 |

| Fixture | Score | Status | Input tokens | Total output tokens | Reasoning tokens within output |
|---|---|---|---|---|---|
| tag_management f001 | 100% | Passed | 105 | 18 | 0 |
| tag_management f002 | 100% | Passed | 104 | 20 | 0 |
| tag_management f003 | 50% | Failed | 109 | 12 | 0 |
| tag_management f004 | 100% | Passed | 118 | 6 | 0 |
| tag_management f005 | 100% | Passed | 130 | 9 | 0 |
| tag_management f006 | 100% | Passed | 113 | 14 | 0 |
| tag_management f007 | 100% | Passed | 109 | 23 | 0 |
| tag_management f008 | 100% | Passed | 113 | 26 | 0 |
| tag_management f009 | 0% | Failed | 115 | 23 | 0 |
| tag_management f010 | 100% | Passed | 94 | 9 | 0 |
| tag_management f011 | 0% | Failed | 99 | 8 | 0 |
| tag_management f012 | 100% | Passed | 138 | 10 | 0 |

| Fixture | Score | Status | Input tokens | Total output tokens | Reasoning tokens within output |
|---|---|---|---|---|---|
| worktree_usage f001 | 100% | Passed | 189 | 11 | 0 |
| worktree_usage f002 | 100% | Passed | 193 | 23 | 0 |
| worktree_usage f003 | 100% | Passed | 265 | 10 | 0 |
| worktree_usage f004 | 100% | Passed | 269 | 6 | 0 |
| worktree_usage f005 | 33.3% | Failed | 179 | 16 | 0 |
| worktree_usage f006 | 0% | Failed | 185 | 21 | 0 |
| worktree_usage f007 | 100% | Passed | 181 | 21 | 0 |
| worktree_usage f008 | 50% | Failed | 209 | 45 | 0 |
| worktree_usage f009 | 100% | Passed | 295 | 6 | 0 |
| worktree_usage f010 | 0% | Failed | 272 | 17 | 0 |
| worktree_usage f011 | 100% | Passed | 257 | 10 | 0 |
| worktree_usage f012 | 0% | Failed | 179 | 19 | 0 |