Poolside / laguna-xs.2
high
88.2%
180 / 204 fixtures
1 run(s)
Tokens: 49,024 input / 62,164 total output (54,469 of which are reasoning) · Cost: $0.01617040
Reliability by Benchmark (Text)
Text vs JSON Schema Comparison
Pass Rate Delta
+0.5%
Text: 88.2% →
JSON: 88.7%
+13
Gained
JSON pass / text fail
−12
Lost
Text pass / JSON fail
168
Unchanged Pass
Both pass
11
Unchanged Fail
Both fail
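The four transition counts above determine both headline pass rates. The following is a minimal sketch of that arithmetic, assuming a pass rate is simply passing fixtures divided by total fixtures; the variable names are illustrative and not taken from the benchmark tooling.

```python
# Counts copied from the Text vs JSON Schema comparison above.
gained = 13      # JSON pass / text fail
lost = 12        # text pass / JSON fail
both_pass = 168  # unchanged pass
both_fail = 11   # unchanged fail

# Every fixture falls into exactly one of the four buckets.
total = gained + lost + both_pass + both_fail

# A fixture passes in text mode if it passed in both modes or was "lost" under JSON.
text_pass = both_pass + lost
# A fixture passes in JSON mode if it passed in both modes or was "gained" under JSON.
json_pass = both_pass + gained

text_rate = 100 * text_pass / total
json_rate = 100 * json_pass / total

print(total, text_pass, json_pass)
print(f"{text_rate:.1f}% -> {json_rate:.1f}% ({json_rate - text_rate:+.1f} points)")
```

Under that assumption the counts reproduce the 180 / 204 text result and the 88.2% to 88.7% change, and the 25 changed fixtures are the 13 gained plus the 12 lost.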
Fixture Reliability Delta
| Fixture | Text | JSON | Delta |
|---|---|---|---|
Benchmark Deltas
| Benchmark | Text | JSON | Delta |
|---|---|---|---|
| commit_squash | 50% | 100% | +50% |
| reflog | 100% | 50% | -50% |
| rebase | 75% | 58.3% | -16.7% |
| worktree_usage | 83.3% | 100% | +16.7% |
| cherry_pick | 66.7% | 83.3% | +16.7% |
| git_show | 83.3% | 91.7% | +8.3% |
| branch_cleanup | 100% | 91.7% | -8.3% |
| git_clean | 75% | 83.3% | +8.3% |
| merge_conflicts | 83.3% | 75% | -8.3% |
| tag_management | 100% | 91.7% | -8.3% |
| blame_forensics | 100% | 100% | 0% |
| commit_messages | 100% | 100% | 0% |
| git_bisect | 100% | 100% | 0% |
| git_grep | 100% | 100% | 0% |
| git_log_format | 100% | 100% | 0% |
| stash_recovery | 100% | 100% | 0% |
| submodule_usage | 83.3% | 83.3% | 0% |
Changed Fixtures (25)
Fixture Gallery (204)
blame_forensics f001
100%
Passed
Tokens
297
Input
428
Total output
421
Reasoning within output
blame_forensics f002
100%
Passed
Tokens
294
Input
251
Total output
245
Reasoning within output
blame_forensics f003
100%
Passed
Tokens
326
Input
255
Total output
246
Reasoning within output
blame_forensics f004
100%
Passed
Tokens
244
Input
496
Total output
488
Reasoning within output
blame_forensics f005
100%
Passed
Tokens
311
Input
887
Total output
881
Reasoning within output
blame_forensics f006
100%
Passed
Tokens
283
Input
1,293
Total output
1,287
Reasoning within output
blame_forensics f007
100%
Passed
Tokens
260
Input
227
Total output
220
Reasoning within output
blame_forensics f008
100%
Passed
Tokens
301
Input
346
Total output
340
Reasoning within output
blame_forensics f009
100%
Passed
Tokens
253
Input
281
Total output
275
Reasoning within output
blame_forensics f010
100%
Passed
Tokens
489
Input
453
Total output
447
Reasoning within output
blame_forensics f011
100%
Passed
Tokens
303
Input
259
Total output
253
Reasoning within output
blame_forensics f012
100%
Passed
Tokens
194
Input
321
Total output
314
Reasoning within output
branch_cleanup f001
100%
Passed
Tokens
145
Input
173
Total output
167
Reasoning within output
branch_cleanup f002
100%
Passed
Tokens
171
Input
207
Total output
199
Reasoning within output
branch_cleanup f003
100%
Passed
Tokens
141
Input
374
Total output
370
Reasoning within output
branch_cleanup f004
100%
Passed
Tokens
179
Input
199
Total output
191
Reasoning within output
branch_cleanup f005
100%
Passed
Tokens
147
Input
132
Total output
126
Reasoning within output
branch_cleanup f006
100%
Passed
Tokens
179
Input
138
Total output
126
Reasoning within output
branch_cleanup f007
100%
Passed
Tokens
157
Input
128
Total output
120
Reasoning within output
branch_cleanup f008
100%
Passed
Tokens
169
Input
185
Total output
173
Reasoning within output
branch_cleanup f009
100%
Passed
Tokens
134
Input
424
Total output
420
Reasoning within output
branch_cleanup f010
100%
Passed
Tokens
164
Input
198
Total output
190
Reasoning within output
branch_cleanup f011
100%
Passed
Tokens
115
Input
337
Total output
333
Reasoning within output
branch_cleanup f012
100%
Passed
Tokens
203
Input
271
Total output
259
Reasoning within output
cherry_pick f001
100%
Passed
Tokens
141
Input
259
Total output
252
Reasoning within output
cherry_pick f002
100%
Passed
Tokens
161
Input
305
Total output
295
Reasoning within output
cherry_pick f003
100%
Passed
Tokens
133
Input
242
Total output
236
Reasoning within output
cherry_pick f004
0%
Failed
Tokens
202
Input
413
Total output
387
Reasoning within output
cherry_pick f005
0%
Failed
Tokens
177
Input
241
Total output
214
Reasoning within output
cherry_pick f006
100%
Passed
Tokens
189
Input
294
Total output
269
Reasoning within output
cherry_pick f007
100%
Passed
Tokens
180
Input
181
Total output
168
Reasoning within output
cherry_pick f008
100%
Passed
Tokens
168
Input
192
Total output
184
Reasoning within output
cherry_pick f009
100%
Passed
Tokens
151
Input
163
Total output
155
Reasoning within output
cherry_pick f010
0%
Failed
Tokens
266
Input
407
Total output
368
Reasoning within output
cherry_pick f011
100%
Passed
Tokens
210
Input
298
Total output
267
Reasoning within output
cherry_pick f012
0%
Failed
Tokens
312
Input
631
Total output
484
Reasoning within output
commit_messages f001
93%
Passed
Tokens
155
Input
197
Total output
191
Reasoning within output
commit_messages f002
91%
Passed
Tokens
274
Input
307
Total output
298
Reasoning within output
commit_messages f003
97.7%
Passed
Tokens
128
Input
537
Total output
526
Reasoning within output
commit_messages f004
91%
Passed
Tokens
160
Input
175
Total output
169
Reasoning within output
commit_messages f005
83.3%
Passed
Tokens
150
Input
202
Total output
195
Reasoning within output
commit_messages f006
94.3%
Passed
Tokens
191
Input
214
Total output
205
Reasoning within output
commit_messages f007
83.3%
Passed
Tokens
232
Input
260
Total output
250
Reasoning within output
commit_messages f008
93.3%
Passed
Tokens
122
Input
214
Total output
207
Reasoning within output
commit_messages f009
87.7%
Passed
Tokens
176
Input
253
Total output
242
Reasoning within output
commit_messages f010
87.7%
Passed
Tokens
345
Input
274
Total output
263
Reasoning within output
commit_messages f011
59.3%
Passed
Tokens
195
Input
404
Total output
388
Reasoning within output
commit_messages f012
90%
Passed
Tokens
172
Input
303
Total output
295
Reasoning within output
commit_squash f001
100%
Passed
Tokens
194
Input
331
Total output
314
Reasoning within output
commit_squash f002
100%
Passed
Tokens
157
Input
199
Total output
193
Reasoning within output
commit_squash f003
100%
Failed
Tokens
160
Input
490
Total output
370
Reasoning within output
commit_squash f004
100%
Passed
Tokens
160
Input
222
Total output
203
Reasoning within output
commit_squash f005
100%
Failed
Tokens
153
Input
424
Total output
300
Reasoning within output
commit_squash f006
100%
Passed
Tokens
132
Input
303
Total output
221
Reasoning within output
commit_squash f007
100%
Passed
Tokens
135
Input
311
Total output
198
Reasoning within output
commit_squash f008
100%
Failed
Tokens
150
Input
283
Total output
205
Reasoning within output
commit_squash f009
100%
Failed
Tokens
141
Input
325
Total output
174
Reasoning within output
commit_squash f010
100%
Failed
Tokens
153
Input
297
Total output
197
Reasoning within output
commit_squash f011
100%
Passed
Tokens
162
Input
327
Total output
267
Reasoning within output
commit_squash f012
100%
Failed
Tokens
162
Input
409
Total output
255
Reasoning within output
git_bisect f001
100%
Passed
Tokens
233
Input
295
Total output
195
Reasoning within output
git_bisect f002
100%
Passed
Tokens
263
Input
368
Total output
262
Reasoning within output
git_bisect f003
100%
Passed
Tokens
253
Input
437
Total output
302
Reasoning within output
git_bisect f004
100%
Passed
Tokens
261
Input
386
Total output
257
Reasoning within output
git_bisect f005
100%
Passed
Tokens
257
Input
382
Total output
291
Reasoning within output
git_bisect f006
100%
Passed
Tokens
289
Input
357
Total output
246
Reasoning within output
git_bisect f007
100%
Passed
Tokens
265
Input
415
Total output
295
Reasoning within output
git_bisect f008
100%
Passed
Tokens
261
Input
411
Total output
261
Reasoning within output
git_bisect f009
100%
Passed
Tokens
261
Input
499
Total output
371
Reasoning within output
git_bisect f010
100%
Passed
Tokens
257
Input
452
Total output
346
Reasoning within output
git_bisect f011
100%
Passed
Tokens
257
Input
361
Total output
216
Reasoning within output
git_bisect f012
100%
Passed
Tokens
287
Input
356
Total output
248
Reasoning within output
git_clean f001
100%
Passed
Tokens
96
Input
97
Total output
87
Reasoning within output
git_clean f002
100%
Passed
Tokens
107
Input
141
Total output
134
Reasoning within output
git_clean f003
100%
Passed
Tokens
106
Input
145
Total output
138
Reasoning within output
git_clean f004
66.7%
Failed
Tokens
103
Input
152
Total output
143
Reasoning within output
git_clean f005
100%
Passed
Tokens
95
Input
288
Total output
281
Reasoning within output
git_clean f006
100%
Passed
Tokens
120
Input
136
Total output
127
Reasoning within output
git_clean f007
100%
Passed
Tokens
111
Input
218
Total output
207
Reasoning within output
git_clean f008
100%
Passed
Tokens
119
Input
141
Total output
134
Reasoning within output
git_clean f009
100%
Passed
Tokens
113
Input
151
Total output
141
Reasoning within output
git_clean f010
100%
Passed
Tokens
109
Input
211
Total output
203
Reasoning within output
git_clean f011
60%
Failed
Tokens
107
Input
308
Total output
299
Reasoning within output
git_clean f012
66.7%
Failed
Tokens
109
Input
158
Total output
149
Reasoning within output
git_grep f001
100%
Passed
Tokens
95
Input
178
Total output
172
Reasoning within output
git_grep f002
100%
Passed
Tokens
110
Input
85
Total output
81
Reasoning within output
git_grep f003
100%
Passed
Tokens
130
Input
114
Total output
110
Reasoning within output
git_grep f004
100%
Passed
Tokens
116
Input
101
Total output
93
Reasoning within output
git_grep f005
100%
Passed
Tokens
154
Input
145
Total output
141
Reasoning within output
git_grep f006
100%
Passed
Tokens
84
Input
509
Total output
505
Reasoning within output
git_grep f007
100%
Passed
Tokens
210
Input
823
Total output
818
Reasoning within output
git_grep f008
100%
Passed
Tokens
101
Input
70
Total output
66
Reasoning within output
git_grep f009
100%
Passed
Tokens
97
Input
59
Total output
53
Reasoning within output
git_grep f010
100%
Passed
Tokens
108
Input
84
Total output
80
Reasoning within output
git_grep f011
100%
Passed
Tokens
153
Input
185
Total output
181
Reasoning within output
git_grep f012
100%
Passed
Tokens
108
Input
83
Total output
72
Reasoning within output
git_log_format f001
100%
Passed
Tokens
947
Input
115
Total output
111
Reasoning within output
git_log_format f002
100%
Passed
Tokens
934
Input
129
Total output
119
Reasoning within output
git_log_format f003
100%
Passed
Tokens
777
Input
372
Total output
360
Reasoning within output
git_log_format f004
100%
Passed
Tokens
938
Input
239
Total output
235
Reasoning within output
git_log_format f005
100%
Passed
Tokens
776
Input
421
Total output
413
Reasoning within output
git_log_format f006
100%
Passed
Tokens
943
Input
295
Total output
291
Reasoning within output
git_log_format f007
100%
Passed
Tokens
949
Input
146
Total output
137
Reasoning within output
git_log_format f008
100%
Passed
Tokens
1,231
Input
170
Total output
166
Reasoning within output
git_log_format f009
100%
Passed
Tokens
769
Input
137
Total output
133
Reasoning within output
git_log_format f010
100%
Passed
Tokens
1,435
Input
209
Total output
205
Reasoning within output
git_log_format f011
100%
Passed
Tokens
483
Input
124
Total output
119
Reasoning within output
git_log_format f012
100%
Passed
Tokens
452
Input
204
Total output
201
Reasoning within output
git_show f001
0%
Failed
Tokens
239
Input
225
Total output
176
Reasoning within output
git_show f002
0%
Failed
Tokens
243
Input
678
Total output
636
Reasoning within output
git_show f003
100%
Passed
Tokens
735
Input
216
Total output
207
Reasoning within output
git_show f004
100%
Passed
Tokens
521
Input
211
Total output
203
Reasoning within output
git_show f005
100%
Passed
Tokens
246
Input
507
Total output
502
Reasoning within output
git_show f006
100%
Passed
Tokens
256
Input
204
Total output
201
Reasoning within output
git_show f007
100%
Passed
Tokens
245
Input
171
Total output
165
Reasoning within output
git_show f008
100%
Passed
Tokens
255
Input
169
Total output
128
Reasoning within output
git_show f009
100%
Passed
Tokens
369
Input
385
Total output
381
Reasoning within output
git_show f010
100%
Passed
Tokens
236
Input
200
Total output
196
Reasoning within output
git_show f011
100%
Passed
Tokens
270
Input
106
Total output
101
Reasoning within output
git_show f012
100%
Passed
Tokens
386
Input
195
Total output
191
Reasoning within output
merge_conflicts f001
100%
Passed
Tokens
130
Input
284
Total output
277
Reasoning within output
merge_conflicts f002
100%
Passed
Tokens
148
Input
574
Total output
564
Reasoning within output
merge_conflicts f003
100%
Passed
Tokens
121
Input
884
Total output
879
Reasoning within output
merge_conflicts f004
100%
Passed
Tokens
188
Input
1,234
Total output
1,208
Reasoning within output
merge_conflicts f005
100%
Passed
Tokens
162
Input
369
Total output
345
Reasoning within output
merge_conflicts f006
100%
Passed
Tokens
174
Input
554
Total output
529
Reasoning within output
merge_conflicts f007
100%
Passed
Tokens
178
Input
258
Total output
245
Reasoning within output
merge_conflicts f008
100%
Passed
Tokens
150
Input
142
Total output
133
Reasoning within output
merge_conflicts f009
100%
Passed
Tokens
147
Input
131
Total output
123
Reasoning within output
merge_conflicts f010
0%
Failed
Tokens
228
Input
115
Total output
84
Reasoning within output
merge_conflicts f011
100%
Passed
Tokens
190
Input
235
Total output
209
Reasoning within output
merge_conflicts f012
0%
Failed
Tokens
276
Input
1,860
Total output
1,835
Reasoning within output
rebase f001
100%
Passed
Tokens
140
Input
509
Total output
502
Reasoning within output
rebase f002
100%
Passed
Tokens
169
Input
236
Total output
227
Reasoning within output
rebase f003
100%
Passed
Tokens
146
Input
215
Total output
210
Reasoning within output
rebase f004
100%
Passed
Tokens
202
Input
923
Total output
896
Reasoning within output
rebase f005
0%
Failed
Tokens
174
Input
362
Total output
335
Reasoning within output
rebase f006
100%
Passed
Tokens
188
Input
318
Total output
288
Reasoning within output
rebase f007
100%
Passed
Tokens
182
Input
214
Total output
201
Reasoning within output
rebase f008
100%
Passed
Tokens
162
Input
197
Total output
188
Reasoning within output
rebase f009
100%
Passed
Tokens
145
Input
416
Total output
408
Reasoning within output
rebase f010
0%
Failed
Tokens
258
Input
236
Total output
205
Reasoning within output
rebase f011
100%
Passed
Tokens
198
Input
329
Total output
303
Reasoning within output
rebase f012
0%
Failed
Tokens
312
Input
279
Total output
252
Reasoning within output
reflog f001
100%
Passed
Tokens
303
Input
310
Total output
273
Reasoning within output
reflog f002
100%
Passed
Tokens
291
Input
442
Total output
274
Reasoning within output
reflog f003
100%
Passed
Tokens
253
Input
408
Total output
280
Reasoning within output
reflog f004
100%
Passed
Tokens
388
Input
678
Total output
378
Reasoning within output
reflog f005
100%
Passed
Tokens
434
Input
245
Total output
138
Reasoning within output
reflog f006
100%
Passed
Tokens
295
Input
356
Total output
176
Reasoning within output
reflog f007
100%
Passed
Tokens
344
Input
495
Total output
295
Reasoning within output
reflog f008
100%
Passed
Tokens
415
Input
492
Total output
311
Reasoning within output
reflog f009
100%
Passed
Tokens
393
Input
482
Total output
290
Reasoning within output
reflog f010
100%
Passed
Tokens
330
Input
268
Total output
195
Reasoning within output
reflog f011
100%
Passed
Tokens
382
Input
520
Total output
299
Reasoning within output
reflog f012
100%
Passed
Tokens
373
Input
896
Total output
539
Reasoning within output
stash_recovery f001
100%
Passed
Tokens
179
Input
197
Total output
128
Reasoning within output
stash_recovery f002
100%
Passed
Tokens
342
Input
173
Total output
97
Reasoning within output
stash_recovery f003
100%
Passed
Tokens
166
Input
189
Total output
124
Reasoning within output
stash_recovery f004
100%
Passed
Tokens
161
Input
253
Total output
174
Reasoning within output
stash_recovery f005
100%
Passed
Tokens
172
Input
311
Total output
174
Reasoning within output
stash_recovery f006
100%
Passed
Tokens
226
Input
263
Total output
178
Reasoning within output
stash_recovery f007
100%
Passed
Tokens
252
Input
299
Total output
186
Reasoning within output
stash_recovery f008
100%
Passed
Tokens
222
Input
273
Total output
141
Reasoning within output
stash_recovery f009
100%
Passed
Tokens
253
Input
229
Total output
121
Reasoning within output
stash_recovery f010
100%
Passed
Tokens
170
Input
281
Total output
207
Reasoning within output
stash_recovery f011
100%
Passed
Tokens
165
Input
260
Total output
175
Reasoning within output
stash_recovery f012
100%
Passed
Tokens
174
Input
245
Total output
184
Reasoning within output
submodule_usage f001
100%
Passed
Tokens
86
Input
84
Total output
74
Reasoning within output
submodule_usage f002
100%
Passed
Tokens
147
Input
119
Total output
109
Reasoning within output
submodule_usage f003
100%
Passed
Tokens
152
Input
563
Total output
533
Reasoning within output
submodule_usage f004
100%
Passed
Tokens
137
Input
85
Total output
79
Reasoning within output
submodule_usage f005
66.7%
Failed
Tokens
93
Input
271
Total output
246
Reasoning within output
submodule_usage f006
100%
Passed
Tokens
133
Input
139
Total output
133
Reasoning within output
submodule_usage f007
100%
Passed
Tokens
93
Input
117
Total output
97
Reasoning within output
submodule_usage f008
33.3%
Failed
Tokens
144
Input
98
Total output
91
Reasoning within output
submodule_usage f009
100%
Passed
Tokens
138
Input
319
Total output
313
Reasoning within output
submodule_usage f010
100%
Passed
Tokens
147
Input
352
Total output
343
Reasoning within output
submodule_usage f011
100%
Passed
Tokens
147
Input
56
Total output
49
Reasoning within output
submodule_usage f012
100%
Passed
Tokens
91
Input
177
Total output
164
Reasoning within output
tag_management f001
100%
Passed
Tokens
106
Input
121
Total output
113
Reasoning within output
tag_management f002
100%
Passed
Tokens
102
Input
135
Total output
115
Reasoning within output
tag_management f003
100%
Passed
Tokens
111
Input
275
Total output
265
Reasoning within output
tag_management f004
100%
Passed
Tokens
116
Input
247
Total output
242
Reasoning within output
tag_management f005
100%
Passed
Tokens
130
Input
217
Total output
207
Reasoning within output
tag_management f006
100%
Passed
Tokens
114
Input
478
Total output
452
Reasoning within output
tag_management f007
100%
Passed
Tokens
109
Input
160
Total output
147
Reasoning within output
tag_management f008
100%
Passed
Tokens
112
Input
73
Total output
48
Reasoning within output
tag_management f009
100%
Passed
Tokens
116
Input
215
Total output
199
Reasoning within output
tag_management f010
100%
Passed
Tokens
96
Input
60
Total output
50
Reasoning within output
tag_management f011
100%
Passed
Tokens
98
Input
94
Total output
86
Reasoning within output
tag_management f012
100%
Passed
Tokens
136
Input
397
Total output
385
Reasoning within output
worktree_usage f001
100%
Passed
Tokens
189
Input
139
Total output
128
Reasoning within output
worktree_usage f002
100%
Passed
Tokens
190
Input
478
Total output
432
Reasoning within output
worktree_usage f003
100%
Passed
Tokens
267
Input
152
Total output
141
Reasoning within output
worktree_usage f004
100%
Passed
Tokens
261
Input
90
Total output
83
Reasoning within output
worktree_usage f005
33.3%
Failed
Tokens
180
Input
347
Total output
337
Reasoning within output
worktree_usage f006
100%
Passed
Tokens
184
Input
139
Total output
124
Reasoning within output
worktree_usage f007
100%
Passed
Tokens
190
Input
871
Total output
849
Reasoning within output
worktree_usage f008
0%
Failed
Tokens
207
Input
694
Total output
638
Reasoning within output
worktree_usage f009
100%
Passed
Tokens
296
Input
210
Total output
203
Reasoning within output
worktree_usage f010
100%
Passed
Tokens
262
Input
227
Total output
211
Reasoning within output
worktree_usage f011
100%
Passed
Tokens
264
Input
129
Total output
118
Reasoning within output
worktree_usage f012
100%
Passed
Tokens
181
Input
118
Total output
104
Reasoning within output