Deepseek / deepseek-v4-flash
high
90.7%
185 / 204 fixtures
1 run(s)
Tokens: 34,780 input / 66,687 total output / 63,976 reasoning within output. Cost: $0.02002330
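As a rough illustration of how the token line above breaks down, the following sketch computes the share of output spent on reasoning and the average cost per fixture. The variable names are illustrative, and it assumes (as the label suggests) that reasoning tokens are counted inside the total output.

```python
# Run-level figures copied from the summary line above.
input_tokens = 34_780
total_output_tokens = 66_687
reasoning_tokens = 63_976      # assumed to be a subset of total output
cost_usd = 0.02002330
fixtures = 204

# Share of the output budget that went to reasoning, in percent.
reasoning_share = round(100 * reasoning_tokens / total_output_tokens, 1)

# Output tokens left over for the visible answer.
answer_tokens = total_output_tokens - reasoning_tokens

# Average cost of one fixture in this run.
cost_per_fixture = cost_usd / fixtures

print(f"reasoning share: {reasoning_share}%")
print(f"answer tokens:   {answer_tokens}")
print(f"cost/fixture:    ${cost_per_fixture:.8f}")
```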
Reliability by Benchmark (Text)
Text vs JSON Schema Comparison
Pass Rate Delta: -28.9% (Text: 90.7% → JSON: 61.8%)
| Outcome | Fixtures | Meaning |
|---|---|---|
| Gained | +7 | JSON pass / text fail |
| Lost | -66 | Text pass / JSON fail |
| Unchanged Pass | 119 | Both pass |
| Unchanged Fail | 12 | Both fail |
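
The four transition counts above should reconcile with the headline pass rates. A minimal sketch of that check, using only the figures reported in this section (variable names are illustrative):

```python
# Transition counts as reported in the Text vs JSON comparison.
gained = 7       # JSON pass / text fail
lost = 66        # text pass / JSON fail
both_pass = 119
both_fail = 12

total = gained + lost + both_pass + both_fail

# A fixture passes in text mode if it was lost or passed in both modes;
# it passes in JSON mode if it was gained or passed in both modes.
text_pass = lost + both_pass
json_pass = gained + both_pass

text_rate = round(100 * text_pass / total, 1)
json_rate = round(100 * json_pass / total, 1)
delta = round(json_rate - text_rate, 1)   # percentage points

print(total, text_pass, json_pass, text_rate, json_rate, delta)
```

The changed-fixture count follows the same way: `gained + lost` gives 73.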
Fixture Reliability Delta
| Fixture | Text | JSON | Delta |
|---|---|---|---|
| f011 | 0% (0/1) | 100% (1/1) | +100% |
Benchmark Deltas
| Benchmark | Text | JSON | Delta |
|---|---|---|---|
| branch_cleanup | 100% | 25% | -75% |
| reflog | 100% | 50% | -50% |
| blame_forensics | 100% | 58.3% | -41.7% |
| commit_messages | 91.7% | 50% | -41.7% |
| tag_management | 91.7% | 58.3% | -33.3% |
| git_grep | 100% | 66.7% | -33.3% |
| git_log_format | 100% | 66.7% | -33.3% |
| stash_recovery | 100% | 66.7% | -33.3% |
| worktree_usage | 83.3% | 50% | -33.3% |
| cherry_pick | 75% | 50% | -25% |
| commit_squash | 91.7% | 66.7% | -25% |
| git_bisect | 100% | 75% | -25% |
| merge_conflicts | 75% | 50% | -25% |
| git_clean | 91.7% | 83.3% | -8.3% |
| git_show | 91.7% | 100% | +8.3% |
| rebase | 75% | 66.7% | -8.3% |
| submodule_usage | 75% | 66.7% | -8.3% |
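
The Delta column above is the JSON pass rate minus the Text pass rate, in percentage points. A small sketch of that arithmetic, assuming 12 fixtures per benchmark (f001 to f012 in the gallery); the per-benchmark pass counts used in the example are inferred from the percentages rather than reported directly, and the helper names are hypothetical:

```python
FIXTURES_PER_BENCHMARK = 12  # assumption: f001..f012 for every benchmark


def pass_rate(passes, n=FIXTURES_PER_BENCHMARK):
    """Pass rate in percent, rounded to one decimal as in the table."""
    return round(100 * passes / n, 1)


def delta_points(text_passes, json_passes, n=FIXTURES_PER_BENCHMARK):
    """JSON minus Text pass rate, in percentage points."""
    return round(100 * (json_passes - text_passes) / n, 1)


# Inferred counts: branch_cleanup 12 -> 3, commit_messages 11 -> 6,
# git_show 11 -> 12.
for name, text_p, json_p in [
    ("branch_cleanup", 12, 3),
    ("commit_messages", 11, 6),
    ("git_show", 11, 12),
]:
    print(name, pass_rate(text_p), pass_rate(json_p),
          delta_points(text_p, json_p))
```

Computing the delta from raw counts, rather than from the rounded percentages, avoids small rounding mismatches between columns.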
Changed Fixtures (73)
f003 -loss f005 -loss f006 -loss f010 -loss f011 -loss f001 -loss f002 -loss f003 -loss f004 -loss f005 -loss f006 -loss f009 -loss f010 -loss f011 -loss f002 -loss f004 -loss f012 -loss f004 -loss f005 -loss f006 -loss f008 -loss f009 -loss f010 -loss f011 +gain f004 -loss f009 -loss f011 -loss f004 -loss f011 -loss f012 -loss f003 -loss f005 -loss f011 +gain f002 -loss f003 -loss f008 -loss f011 -loss f002 -loss f003 -loss f008 -loss f009 -loss f002 +gain f001 +gain f005 -loss f006 -loss f008 -loss f012 -loss f004 -loss f002 -loss f004 -loss f006 -loss f008 -loss f011 -loss f012 -loss f003 -loss f010 -loss f011 -loss f012 -loss f001 +gain f002 -loss f003 +gain f009 -loss f010 -loss f001 -loss f004 -loss f007 +gain f008 -loss f009 -loss f010 -loss f002 -loss f003 -loss f006 -loss f009 -loss
Fixture Gallery (204)

| Fixture | Score | Status | Input tokens | Total output tokens | Reasoning tokens (within output) |
|---|---|---|---|---|---|
| blame_forensics f001 | 100% | Passed | 217 | 346 | 340 |
| blame_forensics f002 | 100% | Passed | 199 | 399 | 392 |
| blame_forensics f003 | 100% | Passed | 228 | 287 | 235 |
| blame_forensics f004 | 100% | Passed | 163 | 403 | 413 |
| blame_forensics f005 | 100% | Passed | 231 | 108 | 101 |
| blame_forensics f006 | 100% | Passed | 204 | 352 | 347 |
| blame_forensics f007 | 100% | Passed | 191 | 205 | 191 |
| blame_forensics f008 | 100% | Passed | 201 | 240 | 234 |
| blame_forensics f009 | 100% | Passed | 174 | 162 | 164 |
| blame_forensics f010 | 100% | Passed | 368 | 670 | 666 |
| blame_forensics f011 | 100% | Passed | 206 | 258 | 253 |
| blame_forensics f012 | 100% | Passed | 123 | 224 | 224 |

| Fixture | Score | Status | Input tokens | Total output tokens | Reasoning tokens (within output) |
|---|---|---|---|---|---|
| branch_cleanup f001 | 100% | Passed | 103 | 248 | 242 |
| branch_cleanup f002 | 100% | Passed | 109 | 98 | 91 |
| branch_cleanup f003 | 100% | Passed | 93 | 62 | 67 |
| branch_cleanup f004 | 100% | Passed | 125 | 156 | 149 |
| branch_cleanup f005 | 100% | Passed | 101 | 164 | 159 |
| branch_cleanup f006 | 100% | Passed | 118 | 268 | 276 |
| branch_cleanup f007 | 100% | Passed | 101 | 246 | 260 |
| branch_cleanup f008 | 100% | Passed | 116 | 94 | 82 |
| branch_cleanup f009 | 100% | Passed | 79 | 74 | 71 |
| branch_cleanup f010 | 100% | Passed | 116 | 182 | 175 |
| branch_cleanup f011 | 100% | Passed | 68 | 66 | 63 |
| branch_cleanup f012 | 100% | Passed | 152 | 188 | 174 |

| Fixture | Score | Status | Input tokens | Total output tokens | Reasoning tokens (within output) |
|---|---|---|---|---|---|
| cherry_pick f001 | 0% | Failed | 94 | 1,235 | 1,230 |
| cherry_pick f002 | 100% | Passed | 136 | 260 | 250 |
| cherry_pick f003 | 100% | Passed | 87 | 385 | 379 |
| cherry_pick f004 | 100% | Passed | 163 | 451 | 427 |
| cherry_pick f005 | 0% | Failed | 129 | 606 | 583 |
| cherry_pick f006 | 100% | Passed | 134 | 338 | 317 |
| cherry_pick f007 | 100% | Passed | 128 | 116 | 106 |
| cherry_pick f008 | 100% | Passed | 123 | 167 | 160 |
| cherry_pick f009 | 100% | Passed | 107 | 43 | 42 |
| cherry_pick f010 | 0% | Failed | 211 | 263 | 237 |
| cherry_pick f011 | 100% | Passed | 162 | 146 | 124 |
| cherry_pick f012 | 100% | Passed | 260 | 526 | 506 |

| Fixture | Score | Status | Input tokens | Total output tokens | Reasoning tokens (within output) |
|---|---|---|---|---|---|
| commit_messages f001 | 93.3% | Passed | 100 | 133 | 128 |
| commit_messages f002 | 90% | Passed | 213 | 273 | 267 |
| commit_messages f003 | 96% | Passed | 81 | 105 | 91 |
| commit_messages f004 | 92.7% | Passed | 108 | 104 | 97 |
| commit_messages f005 | 91% | Passed | 96 | 133 | 127 |
| commit_messages f006 | 93.3% | Passed | 131 | 170 | 173 |
| commit_messages f007 | 91.7% | Passed | 180 | 118 | 110 |
| commit_messages f008 | 81.7% | Passed | 66 | 164 | 160 |
| commit_messages f009 | 93.3% | Passed | 119 | 331 | 323 |
| commit_messages f010 | 89.3% | Passed | 266 | 380 | 382 |
| commit_messages f011 | 43.3% | Failed | 137 | 344 | 330 |
| commit_messages f012 | 89.3% | Passed | 120 | 369 | 361 |

| Fixture | Score | Status | Input tokens | Total output tokens | Reasoning tokens (within output) |
|---|---|---|---|---|---|
| commit_squash f001 | 100% | Passed | 147 | 1,116 | 1,068 |
| commit_squash f002 | 100% | Passed | 113 | 127 | 119 |
| commit_squash f003 | 100% | Passed | 115 | 3,832 | 3,818 |
| commit_squash f004 | 100% | Passed | 107 | 226 | 228 |
| commit_squash f005 | 50% | Failed | 93 | 523 | 559 |
| commit_squash f006 | 100% | Passed | 85 | 255 | 245 |
| commit_squash f007 | 100% | Passed | 84 | 563 | 604 |
| commit_squash f008 | 100% | Passed | 97 | 303 | 290 |
| commit_squash f009 | 100% | Passed | 85 | 451 | 462 |
| commit_squash f010 | 100% | Passed | 119 | 349 | 323 |
| commit_squash f011 | 100% | Passed | 115 | 180 | 159 |
| commit_squash f012 | 100% | Passed | 109 | 260 | 191 |

| Fixture | Score | Status | Input tokens | Total output tokens | Reasoning tokens (within output) |
|---|---|---|---|---|---|
| git_bisect f001 | 100% | Passed | 179 | 130 | 124 |
| git_bisect f002 | 100% | Passed | 191 | 175 | 168 |
| git_bisect f003 | 100% | Passed | 191 | 106 | 106 |
| git_bisect f004 | 100% | Passed | 189 | 130 | 130 |
| git_bisect f005 | 100% | Passed | 181 | 114 | 110 |
| git_bisect f006 | 100% | Passed | 210 | 124 | 119 |
| git_bisect f007 | 100% | Passed | 187 | 172 | 167 |
| git_bisect f008 | 100% | Passed | 193 | 199 | 186 |
| git_bisect f009 | 100% | Passed | 189 | 146 | 140 |
| git_bisect f010 | 100% | Passed | 193 | 101 | 83 |
| git_bisect f011 | 100% | Passed | 206 | 110 | 104 |
| git_bisect f012 | 100% | Passed | 218 | 165 | 158 |

| Fixture | Score | Status | Input tokens | Total output tokens | Reasoning tokens (within output) |
|---|---|---|---|---|---|
| git_clean f001 | 100% | Passed | 49 | 277 | 268 |
| git_clean f002 | 100% | Passed | 59 | 223 | 217 |
| git_clean f003 | 100% | Passed | 57 | 69 | 63 |
| git_clean f004 | 100% | Passed | 57 | 126 | 120 |
| git_clean f005 | 100% | Passed | 47 | 758 | 731 |
| git_clean f006 | 100% | Passed | 74 | 473 | 465 |
| git_clean f007 | 100% | Passed | 69 | 333 | 322 |
| git_clean f008 | 100% | Passed | 72 | 241 | 236 |
| git_clean f009 | 100% | Passed | 67 | 641 | 632 |
| git_clean f010 | 100% | Passed | 63 | 148 | 141 |
| git_clean f011 | 60% | Failed | 60 | 187 | 178 |
| git_clean f012 | 100% | Passed | 63 | 943 | 934 |

| Fixture | Score | Status | Input tokens | Total output tokens | Reasoning tokens (within output) |
|---|---|---|---|---|---|
| git_grep f001 | 100% | Passed | 52 | 74 | 70 |
| git_grep f002 | 100% | Passed | 60 | 127 | 124 |
| git_grep f003 | 100% | Passed | 84 | 40 | 38 |
| git_grep f004 | 100% | Passed | 72 | 52 | 40 |
| git_grep f005 | 100% | Passed | 107 | 146 | 143 |
| git_grep f006 | 100% | Passed | 37 | 198 | 217 |
| git_grep f007 | 100% | Passed | 156 | 2,827 | 2,824 |
| git_grep f008 | 100% | Passed | 57 | 90 | 87 |
| git_grep f009 | 100% | Passed | 50 | 48 | 44 |
| git_grep f010 | 100% | Passed | 60 | 88 | 85 |
| git_grep f011 | 100% | Passed | 114 | 139 | 136 |
| git_grep f012 | 100% | Passed | 64 | 102 | 86 |

| Fixture | Score | Status | Input tokens | Total output tokens | Reasoning tokens (within output) |
|---|---|---|---|---|---|
| git_log_format f001 | 100% | Passed | 705 | 44 | 41 |
| git_log_format f002 | 100% | Passed | 710 | 66 | 57 |
| git_log_format f003 | 100% | Passed | 582 | 53 | 42 |
| git_log_format f004 | 100% | Passed | 704 | 88 | 86 |
| git_log_format f005 | 100% | Passed | 590 | 191 | 184 |
| git_log_format f006 | 100% | Passed | 703 | 143 | 102 |
| git_log_format f007 | 100% | Passed | 711 | 156 | 151 |
| git_log_format f008 | 100% | Passed | 935 | 82 | 80 |
| git_log_format f009 | 100% | Passed | 561 | 71 | 68 |
| git_log_format f010 | 100% | Passed | 1,069 | 84 | 81 |
| git_log_format f011 | 100% | Passed | 349 | 68 | 64 |
| git_log_format f012 | 100% | Passed | 318 | 46 | 44 |

| Fixture | Score | Status | Input tokens | Total output tokens | Reasoning tokens (within output) |
|---|---|---|---|---|---|
| git_show f001 | 100% | Passed | 173 | 74 | 67 |
| git_show f002 | 0% | Failed | 165 | 1,315 | 1,391 |
| git_show f003 | 100% | Passed | 551 | 106 | 98 |
| git_show f004 | 100% | Passed | 388 | 63 | 49 |
| git_show f005 | 100% | Passed | 173 | 69 | 68 |
| git_show f006 | 100% | Passed | 181 | 52 | 50 |
| git_show f007 | 100% | Passed | 162 | 65 | 59 |
| git_show f008 | 100% | Passed | 176 | 114 | 86 |
| git_show f009 | 100% | Passed | 263 | 120 | 96 |
| git_show f010 | 100% | Passed | 168 | 115 | 116 |
| git_show f011 | 100% | Passed | 195 | 49 | 47 |
| git_show f012 | 100% | Passed | 292 | 192 | 198 |

| Fixture | Score | Status | Input tokens | Total output tokens | Reasoning tokens (within output) |
|---|---|---|---|---|---|
| merge_conflicts f001 | 0% | Failed | 84 | 1,235 | 1,354 |
| merge_conflicts f002 | 100% | Passed | 109 | 634 | 604 |
| merge_conflicts f003 | 100% | Passed | 79 | 529 | 564 |
| merge_conflicts f004 | 100% | Passed | 134 | 913 | 886 |
| merge_conflicts f005 | 100% | Passed | 117 | 491 | 547 |
| merge_conflicts f006 | 100% | Passed | 123 | 239 | 213 |
| merge_conflicts f007 | 100% | Passed | 130 | 113 | 103 |
| merge_conflicts f008 | 100% | Passed | 106 | 77 | 69 |
| merge_conflicts f009 | 100% | Passed | 104 | 207 | 214 |
| merge_conflicts f010 | 0% | Failed | 178 | 593 | 597 |
| merge_conflicts f011 | 0% | Failed | 142 | 281 | 253 |
| merge_conflicts f012 | 100% | Passed | 225 | 1,108 | 1,139 |

| Fixture | Score | Status | Input tokens | Total output tokens | Reasoning tokens (within output) |
|---|---|---|---|---|---|
| rebase f001 | 100% | Passed | 93 | 737 | 820 |
| rebase f002 | 100% | Passed | 129 | 106 | 96 |
| rebase f003 | 100% | Passed | 120 | 120 | 115 |
| rebase f004 | 100% | Passed | 145 | 710 | 716 |
| rebase f005 | 0% | Failed | 128 | 271 | 291 |
| rebase f006 | 100% | Passed | 135 | 145 | 124 |
| rebase f007 | 100% | Passed | 131 | 141 | 130 |
| rebase f008 | 100% | Passed | 118 | 116 | 108 |
| rebase f009 | 100% | Passed | 103 | 95 | 88 |
| rebase f010 | 0% | Failed | 201 | 373 | 349 |
| rebase f011 | 100% | Passed | 153 | 335 | 312 |
| rebase f012 | 0% | Failed | 254 | 2,479 | 2,459 |

| Fixture | Score | Status | Input tokens | Total output tokens | Reasoning tokens (within output) |
|---|---|---|---|---|---|
| reflog f001 | 100% | Passed | 202 | 200 | 173 |
| reflog f002 | 100% | Passed | 188 | 99 | 67 |
| reflog f003 | 100% | Passed | 167 | 281 | 217 |
| reflog f004 | 100% | Passed | 278 | 805 | 573 |
| reflog f005 | 100% | Passed | 320 | 1,128 | 1,084 |
| reflog f006 | 100% | Passed | 217 | 1,887 | 1,954 |
| reflog f007 | 100% | Passed | 244 | 212 | 145 |
| reflog f008 | 100% | Passed | 305 | 371 | 295 |
| reflog f009 | 100% | Passed | 268 | 1,079 | 894 |
| reflog f010 | 100% | Passed | 230 | 170 | 147 |
| reflog f011 | 100% | Passed | 269 | 887 | 698 |
| reflog f012 | 100% | Passed | 283 | 1,575 | 1,122 |

| Fixture | Score | Status | Input tokens | Total output tokens | Reasoning tokens (within output) |
|---|---|---|---|---|---|
| stash_recovery f001 | 100% | Passed | 123 | 104 | 91 |
| stash_recovery f002 | 100% | Passed | 269 | 252 | 209 |
| stash_recovery f003 | 100% | Passed | 111 | 69 | 54 |
| stash_recovery f004 | 100% | Passed | 110 | 90 | 70 |
| stash_recovery f005 | 100% | Passed | 120 | 114 | 97 |
| stash_recovery f006 | 100% | Passed | 167 | 101 | 83 |
| stash_recovery f007 | 100% | Passed | 192 | 192 | 175 |
| stash_recovery f008 | 100% | Passed | 163 | 176 | 159 |
| stash_recovery f009 | 100% | Passed | 215 | 78 | 71 |
| stash_recovery f010 | 100% | Passed | 118 | 105 | 98 |
| stash_recovery f011 | 100% | Passed | 108 | 185 | 172 |
| stash_recovery f012 | 100% | Passed | 127 | 87 | 71 |

| Fixture | Score | Status | Input tokens | Total output tokens | Reasoning tokens (within output) |
|---|---|---|---|---|---|
| submodule_usage f001 | 0% | Failed | 40 | 100 | 90 |
| submodule_usage f002 | 100% | Passed | 92 | 218 | 207 |
| submodule_usage f003 | 75% | Failed | 94 | 2,384 | 2,512 |
| submodule_usage f004 | 100% | Passed | 82 | 213 | 208 |
| submodule_usage f005 | 66.7% | Failed | 47 | 275 | 257 |
| submodule_usage f006 | 100% | Passed | 79 | 215 | 209 |
| submodule_usage f007 | 100% | Passed | 48 | 125 | 104 |
| submodule_usage f008 | 100% | Passed | 87 | 124 | 136 |
| submodule_usage f009 | 100% | Passed | 82 | 165 | 159 |
| submodule_usage f010 | 100% | Passed | 92 | 138 | 129 |
| submodule_usage f011 | 100% | Passed | 93 | 70 | 65 |
| submodule_usage f012 | 100% | Passed | 46 | 143 | 140 |

| Fixture | Score | Status | Input tokens | Total output tokens | Reasoning tokens (within output) |
|---|---|---|---|---|---|
| tag_management f001 | 100% | Passed | 54 | 45 | 37 |
| tag_management f002 | 100% | Passed | 69 | 178 | 158 |
| tag_management f003 | 100% | Passed | 58 | 87 | 76 |
| tag_management f004 | 100% | Passed | 63 | 67 | 64 |
| tag_management f005 | 100% | Passed | 75 | 77 | 64 |
| tag_management f006 | 100% | Passed | 59 | 626 | 604 |
| tag_management f007 | 66.7% | Failed | 59 | 351 | 260 |
| tag_management f008 | 100% | Passed | 65 | 629 | 498 |
| tag_management f009 | 100% | Passed | 67 | 104 | 85 |
| tag_management f010 | 100% | Passed | 45 | 43 | 34 |
| tag_management f011 | 100% | Passed | 43 | 53 | 46 |
| tag_management f012 | 100% | Passed | 80 | 713 | 702 |

| Fixture | Score | Status | Input tokens | Total output tokens | Reasoning tokens (within output) |
|---|---|---|---|---|---|
| worktree_usage f001 | 100% | Passed | 126 | 113 | 108 |
| worktree_usage f002 | 100% | Passed | 136 | 280 | 273 |
| worktree_usage f003 | 100% | Passed | 192 | 175 | 165 |
| worktree_usage f004 | 100% | Passed | 193 | 77 | 71 |
| worktree_usage f005 | 33.3% | Failed | 119 | 213 | 200 |
| worktree_usage f006 | 100% | Passed | 121 | 175 | 161 |
| worktree_usage f007 | 100% | Passed | 122 | 308 | 287 |
| worktree_usage f008 | 0% | Failed | 170 | 1,113 | 1,070 |
| worktree_usage f009 | 100% | Passed | 230 | 110 | 104 |
| worktree_usage f010 | 100% | Passed | 194 | 110 | 93 |
| worktree_usage f011 | 100% | Passed | 194 | 150 | 147 |
| worktree_usage f012 | 100% | Passed | 122 | 132 | 117 |