Deepseek / deepseek-v4-flash
none
Pass rate: 79.4% (162 / 204 fixtures)
1 run(s)
Tokens: 33,944 input / 5,380 total output / 2,342 reasoning within output
Cost: $0.00515977
Reliability by Benchmark (Text)
Text vs JSON Schema Comparison
Pass Rate Delta: +5.4% (Text: 79.4% → JSON: 84.8%)
| Change | Fixtures | Meaning |
|---|---|---|
| Gained | +19 | JSON pass / text fail |
| Lost | −8 | Text pass / JSON fail |
| Unchanged Pass | 154 | Both pass |
| Unchanged Fail | 23 | Both fail |
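
The four comparison counts are enough to reproduce the headline pass rates. The following is a minimal sketch of that arithmetic; the variable names are illustrative and not taken from the dashboard.

```python
# Counts copied from the Text vs JSON Schema comparison.
gained = 19      # JSON pass / text fail
lost = 8         # text pass / JSON fail
both_pass = 154  # pass in both modes
both_fail = 23   # fail in both modes

total = gained + lost + both_pass + both_fail
text_pass = both_pass + lost    # fixtures that pass in text mode
json_pass = both_pass + gained  # fixtures that pass in JSON mode

text_rate = 100 * text_pass / total
json_rate = 100 * json_pass / total
delta = json_rate - text_rate

print(f"fixtures: {total}")
print(f"text: {text_pass}/{total} = {text_rate:.1f}%")
print(f"json: {json_pass}/{total} = {json_rate:.1f}%")
print(f"delta: {delta:+.1f} points")
```

The gained and lost counts also sum to the 27 changed fixtures reported further down.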
Fixture Reliability Delta
| Fixture | Text | JSON | Delta |
|---|---|---|---|
| f011 | 100% (1/1) | 0% (0/1) | -100% |
Benchmark Deltas
| Benchmark | Text | JSON | Delta |
|---|---|---|---|
| commit_squash | 50% | 100% | +50% |
| git_grep | 58.3% | 83.3% | +25% |
| blame_forensics | 58.3% | 75% | +16.7% |
| git_log_format | 83.3% | 91.7% | +8.3% |
| merge_conflicts | 66.7% | 58.3% | -8.3% |
| submodule_usage | 83.3% | 91.7% | +8.3% |
| tag_management | 83.3% | 91.7% | +8.3% |
| worktree_usage | 83.3% | 91.7% | +8.3% |
| branch_cleanup | 83.3% | 75% | -8.3% |
| git_clean | 83.3% | 75% | -8.3% |
| reflog | 100% | 91.7% | -8.3% |
| cherry_pick | 66.7% | 66.7% | 0% |
| commit_messages | 91.7% | 91.7% | 0% |
| git_bisect | 100% | 100% | 0% |
| git_show | 91.7% | 91.7% | 0% |
| rebase | 66.7% | 66.7% | 0% |
| stash_recovery | 100% | 100% | 0% |
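
The per-benchmark rates can be cross-checked against the overall totals. The sketch below assumes every benchmark has 12 fixtures (f001 to f012, so 17 × 12 = 204), which matches the fixture gallery; the helper `to_count` is hypothetical and only inverts the rounded percentages.

```python
# Assumption: every benchmark has 12 fixtures (f001 to f012), 17 x 12 = 204.
FIXTURES_PER_BENCHMARK = 12

# (text %, JSON %) copied from the Benchmark Deltas table.
RATES = {
    "commit_squash": (50.0, 100.0),
    "git_grep": (58.3, 83.3),
    "blame_forensics": (58.3, 75.0),
    "git_log_format": (83.3, 91.7),
    "merge_conflicts": (66.7, 58.3),
    "submodule_usage": (83.3, 91.7),
    "tag_management": (83.3, 91.7),
    "worktree_usage": (83.3, 91.7),
    "branch_cleanup": (83.3, 75.0),
    "git_clean": (83.3, 75.0),
    "reflog": (100.0, 91.7),
    "cherry_pick": (66.7, 66.7),
    "commit_messages": (91.7, 91.7),
    "git_bisect": (100.0, 100.0),
    "git_show": (91.7, 91.7),
    "rebase": (66.7, 66.7),
    "stash_recovery": (100.0, 100.0),
}


def to_count(rate_pct: float) -> int:
    """Recover a pass count from a percentage rounded to one decimal."""
    return round(rate_pct * FIXTURES_PER_BENCHMARK / 100)


text_total = sum(to_count(text) for text, _ in RATES.values())
json_total = sum(to_count(json_rate) for _, json_rate in RATES.values())
fixture_total = len(RATES) * FIXTURES_PER_BENCHMARK

print(f"text passes: {text_total}/{fixture_total}")
print(f"JSON passes: {json_total}/{fixture_total}")
```

Under that assumption the recovered counts agree with the summary: 162 text passes and 173 JSON passes out of 204 fixtures.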
Changed Fixtures (27)
Fixture Gallery (204)
blame_forensics
| Fixture | Score | Result | Input tokens | Total output tokens | Reasoning tokens (within output) |
|---|---|---|---|---|---|
| f001 | 0% | Failed | 217 | 9 | 0 |
| f002 | 100% | Passed | 210 | 4 | 0 |
| f003 | 100% | Passed | 232 | 7 | 0 |
| f004 | 100% | Passed | 165 | 5 | 0 |
| f005 | 0% | Failed | 242 | 6 | 0 |
| f006 | 0% | Failed | 204 | 8 | 0 |
| f007 | 100% | Passed | 188 | 6 | 0 |
| f008 | 100% | Passed | 202 | 5 | 0 |
| f009 | 0% | Failed | 170 | 8 | 0 |
| f010 | 0% | Failed | 372 | 8 | 0 |
| f011 | 100% | Passed | 202 | 4 | 0 |
| f012 | 100% | Passed | 132 | 5 | 0 |

branch_cleanup
| Fixture | Score | Result | Input tokens | Total output tokens | Reasoning tokens (within output) |
|---|---|---|---|---|---|
| f001 | 100% | Passed | 118 | 4 | 0 |
| f002 | 100% | Passed | 109 | 5 | 0 |
| f003 | 0% | Failed | 89 | 6 | 0 |
| f004 | 100% | Passed | 123 | 7 | 0 |
| f005 | 100% | Passed | 118 | 3 | 0 |
| f006 | 100% | Passed | 136 | 7 | 0 |
| f007 | 100% | Passed | 97 | 5 | 0 |
| f008 | 100% | Passed | 115 | 11 | 0 |
| f009 | 0% | Failed | 77 | 4 | 0 |
| f010 | 100% | Passed | 111 | 6 | 0 |
| f011 | 100% | Passed | 69 | 2 | 0 |
| f012 | 100% | Passed | 146 | 13 | 0 |

cherry_pick
| Fixture | Score | Result | Input tokens | Total output tokens | Reasoning tokens (within output) |
|---|---|---|---|---|---|
| f001 | 0% | Failed | | | |
| f002 | 100% | Passed | 121 | 9 | 0 |
| f003 | 100% | Passed | 87 | 4 | 0 |
| f004 | 100% | Passed | 147 | 23 | 0 |
| f005 | 0% | Failed | 128 | 23 | 0 |
| f006 | 100% | Passed | 137 | 26 | 0 |
| f007 | 100% | Passed | 129 | 10 | 0 |
| f008 | 100% | Passed | 122 | 7 | 0 |
| f009 | 100% | Passed | 106 | 6 | 0 |
| f010 | 0% | Failed | 211 | 25 | 0 |
| f011 | 100% | Passed | 161 | 22 | 0 |
| f012 | 0% | Failed | 258 | 23 | 0 |

commit_messages
| Fixture | Score | Result | Input tokens | Total output tokens | Reasoning tokens (within output) |
|---|---|---|---|---|---|
| f001 | 99% | Passed | 100 | 4 | 0 |
| f002 | 89.3% | Passed | 196 | 5 | 0 |
| f003 | 0% | Failed | | | |
| f004 | 91.7% | Passed | 108 | 4 | 0 |
| f005 | 94% | Passed | 96 | 4 | 0 |
| f006 | 93.3% | Passed | 131 | 7 | 0 |
| f007 | 89.3% | Passed | 180 | 3 | 0 |
| f008 | 91.7% | Passed | 66 | 6 | 0 |
| f009 | 90% | Passed | 119 | 7 | 0 |
| f010 | 91% | Passed | 266 | 8 | 0 |
| f011 | 58.7% | Passed | 137 | 13 | 0 |
| f012 | 61.3% | Passed | 120 | 6 | 0 |

commit_squash
| Fixture | Score | Result | Input tokens | Total output tokens | Reasoning tokens (within output) |
|---|---|---|---|---|---|
| f001 | 0% | Failed | | | |
| f002 | 100% | Passed | 110 | 6 | 0 |
| f003 | 100% | Failed | 112 | 138 | 0 |
| f004 | 100% | Passed | 108 | 28 | 0 |
| f005 | 100% | Failed | 95 | 90 | 0 |
| f006 | 100% | Failed | 82 | 104 | 0 |
| f007 | 100% | Failed | 85 | 100 | 0 |
| f008 | 100% | Failed | 96 | 115 | 0 |
| f009 | 100% | Passed | 89 | 98 | 0 |
| f010 | 100% | Passed | 103 | 43 | 0 |
| f011 | 100% | Passed | 110 | 56 | 0 |
| f012 | 100% | Passed | 105 | 128 | 0 |

git_bisect
| Fixture | Score | Result | Input tokens | Total output tokens | Reasoning tokens (within output) |
|---|---|---|---|---|---|
| f001 | 100% | Passed | 166 | 11 | 0 |
| f002 | 100% | Passed | 195 | 40 | 0 |
| f003 | 100% | Passed | 204 | 4 | 0 |
| f004 | 100% | Passed | 195 | 55 | 0 |
| f005 | 100% | Passed | 183 | 44 | 0 |
| f006 | 100% | Passed | 208 | 50 | 0 |
| f007 | 100% | Passed | 189 | 80 | 0 |
| f008 | 100% | Passed | 189 | 54 | 0 |
| f009 | 100% | Passed | 189 | 45 | 0 |
| f010 | 100% | Passed | 183 | 65 | 0 |
| f011 | 100% | Passed | 195 | 50 | 0 |
| f012 | 100% | Passed | 208 | 57 | 0 |

git_clean
| Fixture | Score | Result | Input tokens | Total output tokens | Reasoning tokens (within output) |
|---|---|---|---|---|---|
| f001 | 100% | Passed | 49 | 5 | 0 |
| f002 | 100% | Passed | 59 | 4 | 0 |
| f003 | 100% | Passed | 57 | 4 | 0 |
| f004 | 33.3% | Failed | 57 | 5 | 0 |
| f005 | 75% | Failed | 47 | 7 | 0 |
| f006 | 100% | Passed | 74 | 7 | 0 |
| f007 | 100% | Passed | 69 | 10 | 0 |
| f008 | 100% | Passed | 72 | 4 | 0 |
| f009 | 100% | Passed | 67 | 9 | 0 |
| f010 | 100% | Passed | 63 | 6 | 0 |
| f011 | 100% | Passed | 60 | 5 | 0 |
| f012 | 100% | Passed | 63 | 9 | 0 |

git_grep
| Fixture | Score | Result | Input tokens | Total output tokens | Reasoning tokens (within output) |
|---|---|---|---|---|---|
| f001 | 100% | Passed | 52 | 3 | 0 |
| f002 | 0% | Failed | 60 | 2 | 0 |
| f003 | 100% | Passed | 84 | 2 | 0 |
| f004 | 100% | Passed | 72 | 5 | 0 |
| f005 | 100% | Passed | 107 | 2 | 0 |
| f006 | 0% | Failed | 37 | 2 | 0 |
| f007 | 0% | Failed | 156 | 1 | 0 |
| f008 | 0% | Failed | 57 | 2 | 0 |
| f009 | 100% | Passed | 50 | 4 | 0 |
| f010 | 0% | Failed | 60 | 1 | 0 |
| f011 | 100% | Passed | 114 | 2 | 0 |
| f012 | 100% | Passed | 64 | 10 | 0 |

git_log_format
| Fixture | Score | Result | Input tokens | Total output tokens | Reasoning tokens (within output) |
|---|---|---|---|---|---|
| f001 | 100% | Passed | 702 | 1 | 0 |
| f002 | 100% | Passed | 699 | 8 | 0 |
| f003 | 100% | Passed | 574 | 10 | 0 |
| f004 | 100% | Passed | 700 | 2 | 0 |
| f005 | 0% | Failed | 573 | 6 | 0 |
| f006 | 0% | Failed | 703 | 2 | 0 |
| f007 | 100% | Passed | 705 | 4 | 0 |
| f008 | 100% | Passed | 926 | 2 | 0 |
| f009 | 100% | Passed | 557 | 1 | 0 |
| f010 | 100% | Passed | 1,054 | 2 | 0 |
| f011 | 100% | Passed | 345 | 3 | 0 |
| f012 | 100% | Passed | 319 | 2 | 0 |

git_show
| Fixture | Score | Result | Input tokens | Total output tokens | Reasoning tokens (within output) |
|---|---|---|---|---|---|
| f001 | 100% | Passed | 167 | 6 | 0 |
| f002 | 100% | Passed | 168 | 3 | 0 |
| f003 | 100% | Passed | 567 | 7 | 0 |
| f004 | 100% | Passed | 384 | 7 | 0 |
| f005 | 0% | Failed | | | |
| f006 | 100% | Passed | 182 | 2 | 0 |
| f007 | 100% | Passed | 162 | 5 | 0 |
| f008 | 100% | Passed | 174 | 19 | 0 |
| f009 | 100% | Passed | 258 | 2 | 0 |
| f010 | 100% | Passed | 169 | 2 | 0 |
| f011 | 100% | Passed | 197 | 3 | 0 |
| f012 | 100% | Passed | 292 | 2 | 0 |

merge_conflicts
| Fixture | Score | Result | Input tokens | Total output tokens | Reasoning tokens (within output) |
|---|---|---|---|---|---|
| f001 | 0% | Failed | 84 | 4 | 0 |
| f002 | 100% | Passed | 109 | 8 | 0 |
| f003 | 100% | Passed | 79 | 5 | 0 |
| f004 | 100% | Passed | 134 | 25 | 0 |
| f005 | 0% | Failed | 117 | 23 | 0 |
| f006 | 100% | Passed | 123 | 26 | 0 |
| f007 | 100% | Passed | 130 | 10 | 0 |
| f008 | 100% | Passed | 106 | 6 | 0 |
| f009 | 100% | Passed | 104 | 6 | 0 |
| f010 | 0% | Failed | 178 | 24 | 0 |
| f011 | 100% | Passed | 142 | 25 | 0 |
| f012 | 0% | Failed | 225 | 22 | 0 |

rebase
| Fixture | Score | Result | Input tokens | Total output tokens | Reasoning tokens (within output) |
|---|---|---|---|---|---|
| f001 | 0% | Failed | 94 | 5 | 0 |
| f002 | 100% | Passed | 127 | 9 | 0 |
| f003 | 100% | Passed | 104 | 3 | 0 |
| f004 | 100% | Passed | 145 | 26 | 0 |
| f005 | 0% | Failed | 127 | 23 | 0 |
| f006 | 100% | Passed | 135 | 25 | 0 |
| f007 | 100% | Passed | 130 | 10 | 0 |
| f008 | 100% | Passed | 135 | 6 | 0 |
| f009 | 100% | Passed | 103 | 6 | 0 |
| f010 | 0% | Failed | 201 | 25 | 0 |
| f011 | 100% | Passed | 152 | 26 | 0 |
| f012 | 0% | Failed | 256 | 22 | 0 |

reflog
| Fixture | Score | Result | Input tokens | Total output tokens | Reasoning tokens (within output) |
|---|---|---|---|---|---|
| f001 | 100% | Passed | 204 | 24 | 0 |
| f002 | 100% | Passed | 186 | 82 | 0 |
| f003 | 100% | Passed | 170 | 131 | 0 |
| f004 | 100% | Passed | 279 | 249 | 0 |
| f005 | 100% | Passed | 345 | 35 | 0 |
| f006 | 100% | Passed | 227 | 201 | 0 |
| f007 | 100% | Passed | 237 | 62 | 0 |
| f008 | 100% | Passed | 299 | 161 | 0 |
| f009 | 100% | Passed | 272 | 146 | 0 |
| f010 | 100% | Passed | 224 | 79 | 0 |
| f011 | 100% | Passed | 275 | 251 | 0 |
| f012 | 100% | Passed | 263 | 306 | 0 |

stash_recovery
| Fixture | Score | Result | Input tokens | Total output tokens | Reasoning tokens (within output) |
|---|---|---|---|---|---|
| f001 | 100% | Passed | 139 | 5 | 0 |
| f002 | 100% | Passed | 272 | 100 | 0 |
| f003 | 100% | Passed | 111 | 55 | 0 |
| f004 | 100% | Passed | 110 | 32 | 0 |
| f005 | 100% | Passed | 119 | 61 | 0 |
| f006 | 100% | Passed | 184 | 15 | 0 |
| f007 | 100% | Passed | 194 | 95 | 0 |
| f008 | 100% | Passed | 163 | 85 | 0 |
| f009 | 100% | Passed | 198 | 42 | 0 |
| f010 | 100% | Passed | 118 | 33 | 0 |
| f011 | 100% | Passed | 107 | 131 | 0 |
| f012 | 100% | Passed | 127 | 58 | 0 |

submodule_usage
| Fixture | Score | Result | Input tokens | Total output tokens | Reasoning tokens (within output) |
|---|---|---|---|---|---|
| f001 | 100% | Passed | 40 | 10 | 0 |
| f002 | 100% | Passed | 88 | 12 | 0 |
| f003 | 100% | Passed | 96 | 20 | 0 |
| f004 | 100% | Passed | 83 | 5 | 0 |
| f005 | 0% | Failed | 47 | 33 | 0 |
| f006 | 100% | Passed | 77 | 4 | 0 |
| f007 | 100% | Passed | 48 | 20 | 0 |
| f008 | 100% | Passed | 87 | 7 | 0 |
| f009 | 0% | Failed | 81 | 10 | 0 |
| f010 | 100% | Passed | 90 | 8 | 0 |
| f011 | 100% | Passed | 96 | 4 | 0 |
| f012 | 100% | Passed | 46 | 13 | 0 |

tag_management
| Fixture | Score | Result | Input tokens | Total output tokens | Reasoning tokens (within output) |
|---|---|---|---|---|---|
| f001 | 100% | Passed | 55 | 7 | 0 |
| f002 | 100% | Passed | 51 | 18 | 0 |
| f003 | 100% | Passed | 56 | 10 | 0 |
| f004 | 100% | Passed | 62 | 3 | 0 |
| f005 | 100% | Passed | 75 | 9 | 0 |
| f006 | 100% | Passed | 61 | 18 | 0 |
| f007 | 66.7% | Failed | 60 | 45 | 0 |
| f008 | 100% | Passed | 65 | 26 | 0 |
| f009 | 100% | Passed | 83 | 17 | 0 |
| f010 | 100% | Passed | 45 | 8 | 0 |
| f011 | 0% | Failed | | | |
| f012 | 100% | Passed | 77 | 10 | 0 |

worktree_usage
| Fixture | Score | Result | Input tokens | Total output tokens | Reasoning tokens (within output) |
|---|---|---|---|---|---|
| f001 | 100% | Passed | 129 | 10 | 0 |
| f002 | 100% | Passed | 133 | 22 | 0 |
| f003 | 100% | Passed | 191 | 9 | 0 |
| f004 | 0% | Failed | | | |
| f005 | 100% | Passed | 118 | 14 | 0 |
| f006 | 100% | Passed | 124 | 16 | 0 |
| f007 | 100% | Passed | 125 | 21 | 0 |
| f008 | 100% | Passed | 148 | 49 | 0 |
| f009 | 100% | Passed | 230 | 4 | 0 |
| f010 | 0% | Failed | 200 | 20 | 0 |
| f011 | 100% | Passed | 197 | 9 | 0 |
| f012 | 100% | Passed | 121 | 14 | 0 |