TL;DR

  • I built BlackjackBench: a blackjack benchmark covering all 550 initial player‑hand/dealer‑upcard combinations. Every run is fully logged and seed‑controlled, so results are easy to rerun and audit.
  • The key metric is the difference in expected value (ΔEV) from the basic‑strategy baseline on the exact same seeded hands. With thinking enabled, models approach the baseline: Claude Sonnet 4 (ΔEV ≈ −0.06%, 4.0% mistakes) and GPT‑5 Nano (medium) (ΔEV ≈ −0.12%, 5.0%). Notably, Gemini 2.5 Flash improves dramatically from ΔEV −66.6% (no thinking) to −0.43% with thinking [1].
  • “Perfect” computer play is straightforward; the interesting question is how LLMs compare under fair prompting and strict legality, and what model progress implies for general problems.
  • Many non-thinking models perform poorly (mistake rates of roughly 36–65% of decisions), but thinking-enabled models can match basic strategy accuracy.

Why Blackjack

  • It has simple, well‑understood rules with clear outcomes and a well-defined ground truth. The player knows a lot, but not everything.
  • A near‑optimal basic strategy exists, so we can measure decision mistakes and expected value precisely.
  • Local decisions (HIT/STAND/DOUBLE/SPLIT) make it easy to isolate and score errors.

Benchmark Design

  • Rules (defaults): 6‑deck shoe, dealer hits soft 17 (H17), blackjack pays 3:2, double on any two, double after split (DAS), no surrender, split aces one‑card, resplit to 3 hands.
  • Evaluation mode: Policy‑grid only. We enumerate 55 two‑card player categories × 10 dealer upcards = 550 cells. Each cell is played in a fresh environment once per repetition (no carryover or counting effects). Most results use multiple reps to reduce variance.
  • Weighting: In addition to the simple mean across played cells, we report a natural‑frequency weighted EV using infinite‑deck probabilities so common starts contribute proportionally.
  • Metrics:
    • EV/hand: mean net units per hand over the executed grid reps (unweighted).
    • ev_weighted: natural‑frequency weighted EV over the grid.
    • mistake_rate: fraction of decisions that differ from a fixed six‑deck H17 DAS basic strategy [2].
  • Reproducibility: Deterministic RNG seeds; per‑decision JSONL logs enable exact replay/audit; supports parallelizing long runs and resuming them after interruption.
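To make the reproducibility bullet concrete, here is a minimal sketch of one way to derive a stable, independent RNG stream per grid cell and repetition. The function and hashing scheme are illustrative assumptions, not the harness's actual code (the real CLI exposes seeding via --seed):

```python
import hashlib
import random

def cell_rng(master_seed: int, cell_index: int, rep: int) -> random.Random:
    # Hash the run coordinates into a stable 64-bit seed. Unlike Python's
    # built-in hash(), SHA-256 is stable across processes, which is what
    # makes parallel runs and post-interruption resumes exactly replayable.
    digest = hashlib.sha256(f"{master_seed}:{cell_index}:{rep}".encode()).digest()
    return random.Random(int.from_bytes(digest[:8], "big"))

# Rep 3 of cell 217 always sees the same shuffled 6-deck shoe.
rng = cell_rng(7, 217, 3)
shoe = [rank for rank in range(1, 14) for _ in range(4 * 6)]  # 6 decks by rank
rng.shuffle(shoe)
```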

Methodology: Key Concepts

Policy-Grid: Rather than random dealing, we systematically test all 550 possible starting positions (55 two-card player categories × 10 dealer upcards). Each cell runs in a fresh environment (no card exposure or counting carryover).
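For concreteness, the grid enumeration fits in a few lines. This is an illustrative sketch (the rank labels are assumptions), not the repo's implementation:

```python
from itertools import combinations_with_replacement

# Ten rank categories: ace, 2-9, and the merged 10-value rank (10/J/Q/K).
RANKS = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10"]

# Unordered two-card player categories: C(10+1, 2) = 55.
player_cats = list(combinations_with_replacement(RANKS, 2))
grid = [(cat, up) for cat in player_cats for up in RANKS]

assert len(player_cats) == 55
assert len(grid) == 550
```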

Weighted Expected Value: Rather than treating all 550 starting hands equally, we weight each situation by how often it naturally occurs in real blackjack. This gives a more realistic overall EV that reflects actual play patterns.

The weighting uses infinite-deck probabilities to calculate how frequently each starting hand appears. For example:

  • Common situations like “two 10-value cards vs dealer Ace” occur frequently (weight ≈ 0.00728)
  • Rare situations like “Ace,2 vs dealer 7” occur less often (weight ≈ 0.00091)

When computing the final EV, each hand’s performance gets multiplied by its natural frequency weight. This means:

  • A mistake on 10,10 vs A hurts the weighted EV significantly (high frequency × EV loss)
  • The same mistake on A,2 vs 7 barely affects weighted EV (low frequency × EV loss)

Thus, weighted EV emphasizes performance on the hands players actually encounter most often. [3]
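The arithmetic behind those weights is simple. The following self-contained sketch reproduces the two example weights above; the repo's actual implementation is blackjack_bench/weights.py::grid_weights_infinite_deck, so treat this as an illustrative re-derivation:

```python
RANKS = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10"]
# Infinite-deck draw probabilities: 1/13 each for A-9, 4/13 for the
# merged 10-value rank (10/J/Q/K).
P = {rank: (4 if rank == "10" else 1) / 13 for rank in RANKS}

def cell_weight(card1: str, card2: str, upcard: str) -> float:
    w = P[card1] * P[card2] * P[upcard]
    if card1 != card2:
        w *= 2  # unordered category: (A,2) and (2,A) are the same cell
    return w

print(round(cell_weight("10", "10", "A"), 5))  # 0.00728, the common start
print(round(cell_weight("A", "2", "7"), 5))    # 0.00091, the rare start
```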

Basic Strategy Baseline: We compare against fixed 6-deck H17 DAS basic strategy tables [2]. This isn’t perfect play (card counting would be better [4]) but represents the established “correct” decision for each situation.
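To illustrate what "fixed tables" means in code, here is the hard-totals slice of a 6-deck H17 chart written as a lookup function. This sketch covers hard totals only (the benchmark's baseline also covers soft totals and pairs) and should be checked against the chart [2] before any serious use:

```python
def hard_total_action(total: int, dealer_up: int, can_double: bool = True) -> str:
    # Hard-totals slice of 6-deck H17 basic strategy; dealer_up is 2-11,
    # with 11 standing in for the ace. Soft totals and pairs are omitted.
    if total >= 17:
        return "STAND"
    if 13 <= total <= 16:
        return "STAND" if 2 <= dealer_up <= 6 else "HIT"
    if total == 12:
        return "STAND" if 4 <= dealer_up <= 6 else "HIT"
    if total == 11:  # under H17, double 11 against everything, ace included
        return "DOUBLE" if can_double else "HIT"
    if total == 10:
        return "DOUBLE" if can_double and dealer_up <= 9 else "HIT"
    if total == 9:
        return "DOUBLE" if can_double and 3 <= dealer_up <= 6 else "HIT"
    return "HIT"  # hard 8 or less always hits
```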

Mistake Rate: A simple percentage of decisions that differ from basic strategy.

ΔEV Reporting: We report ΔEV relative to the empirical Basic Strategy baseline on the exact, seeded 2,750 hands. In this run, Basic Strategy’s absolute EV is +2.6% [5] because this particular seed dealt player‑favorable hands. Reporting ΔEV keeps comparisons across models clear and fair.
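In code, this paired comparison is just the mean per-hand difference over identical seeded hands (a sketch; the harness's bookkeeping differs in detail):

```python
def delta_ev(agent_nets: list[float], baseline_nets: list[float]) -> float:
    # Both lists cover the same seeded hands in the same order, so the
    # sample's player-friendly luck (the +2.6% absolute EV) cancels out.
    assert len(agent_nets) == len(baseline_nets)
    return sum(a - b for a, b in zip(agent_nets, baseline_nets)) / len(agent_nets)
```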

LLM Integration

What is “thinking”? Throughout this study, “thinking” refers to Chain-of-Thought prompting where models generate intermediate reasoning steps before outputting their final decision. For OpenAI models, this uses the reasoning effort parameter; for Anthropic models, this enables the thinking budget; for Google models, this uses reasoning summaries. Non-thinking models respond directly without explicit step-by-step reasoning.
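Concretely, thinking is switched on through provider-specific API parameters. The sketch below shows roughly what this looks like with the current OpenAI and Anthropic Python SDKs; exact parameter names drift across SDK versions, so treat it as an assumption-labeled illustration rather than the harness's integration code:

```python
prompt = "Blackjack. ... Reply with exactly one word: HIT, STAND, DOUBLE, or SPLIT."

# OpenAI: reasoning effort on the Responses API.
from openai import OpenAI
openai_reply = OpenAI().responses.create(
    model="gpt-5-nano",                      # assumed model id
    reasoning={"effort": "medium"},
    input=prompt,
)

# Anthropic: an explicit thinking-token budget on the Messages API.
import anthropic
claude_reply = anthropic.Anthropic().messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=16000,
    thinking={"type": "enabled", "budget_tokens": 8192},
    messages=[{"role": "user", "content": prompt}],
)
```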

  • Prompt style: “rules‑lite” — only the upcard, your two ranks, and a short rules blurb. No totals and no allowed‑actions list.
  • Output contract: Return exactly one of HIT, STAND, DOUBLE, SPLIT. The harness maps minor variants; illegal choices trigger a guarded fallback and are logged.
  • Reasoning levels: none | medium. When enabled, providers may return a thinking trace which we log but do not grade.
  • Legality guard: We check the agent’s proposed action. If it’s illegal, we count that and instead take a deliberately bad legal action to penalize non‑compliance [6].

Rules‑lite prompt (exact text used):

Blackjack. Rules: 6 decks, dealer hits soft 17 (H17), blackjack pays 3:2, double on any two, double after split allowed, resplit to 3 hands, split aces one-card, no surrender.
Dealer upcard: {UP}.
Your hand: {RANKS}.
Reply with exactly one word: HIT, STAND, DOUBLE, or SPLIT. No explanations.

Where {UP} is the dealer’s upcard rank (e.g., 10, A) and {RANKS} are the player ranks (e.g., A,7). We intentionally omit totals and allowed‑actions to test whether the model can figure that out on its own.
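On the receiving side, the output contract and legality guard can be as small as the sketch below. Function names are illustrative; the real versions live in blackjack_bench/agents/guarded.py and blackjack_bench/agents/bad_agent.py (see footnote 6):

```python
VALID_ACTIONS = ("HIT", "STAND", "DOUBLE", "SPLIT")

def parse_action(reply: str) -> str | None:
    # Map minor variants ("Stand.", "  hit\n") onto the canonical token.
    word = reply.strip().upper().strip(".!")
    return word if word in VALID_ACTIONS else None

def guarded_action(reply: str, legal: set[str]) -> str:
    action = parse_action(reply)
    if action in legal:
        return action
    # Illegal or unparseable: the real harness logs the violation, then
    # plays a deliberately bad legal action so non-compliance costs EV.
    for fallback in ("DOUBLE", "SPLIT", "HIT", "STAND"):
        if fallback in legal:
            return fallback
    return "STAND"
```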

Now, what did we find when we put these models to the test?

The Main Event: Thinking is Transformative

The results reveal dramatic performance differences between models, with thinking capability being the decisive factor.

| Model [7] | ΔEV vs Baseline | Mistake Rate | Decisions |
|---|---|---|---|
| Basic Strategy | +0.0% | 0.0% | |
| GPT‑5 (thinking, medium) | +0.01% | 1.2% | 4,341 |
| Claude Sonnet 4 (thinking) | −0.06% | 4.0% | 4,281 |
| GPT‑5 Nano (thinking, medium) | −0.12% | 5.0% | 4,380 |
| Gemini 2.5 Flash (thinking) [1] | −0.43% | 6.0% | 4,345 |
| Sonoma Sky Alpha (thinking) | −0.99% | 8.0% | 4,323 |
| Gemini 2.5 Pro (thinking) | −1.46% | 2.1% | 4,358 |
| Claude Opus 4.1 (no thinking) | −1.9% | 16% | 4,422 |
| Claude Sonnet 4 (no thinking) | −10.49% | 36% | 4,979 |
| GPT‑5 (no thinking) | −20.64% | 38% | 4,275 |
| Sonoma Dusk Alpha | −23.17% | 43% | 4,644 |
| Gemini 2.5 Flash Lite | −44.55% | 62% | 3,877 |
| GPT‑5 Nano (no thinking) | −51.18% | 55% | 5,869 |
| Gemini 2.5 Flash (no thinking) | −66.6% | 56% | 5,825 |
| Gemma3 12B‑IT QAT | −87.07% | 65% | 6,634 |

Note: ΔEV values are computed relative to the same seeded basic‑strategy baseline on the exact same 2,750 hands.

Key takeaways:

  • Six thinking‑enabled models (Claude Sonnet 4, GPT‑5, GPT‑5 Nano, Gemini 2.5 Pro, Gemini 2.5 Flash, Sonoma Sky Alpha) are at basic‑strategy parity (within the 95% CI) [8]; non‑thinking models trail by roughly 2–87 ΔEV points.
  • The gap between thinking and non‑thinking versions of the same model can be massive (e.g., Gemini 2.5 Flash improves by ~66 points with thinking).
  • Best non‑thinking baseline here is Claude Opus 4.1 (−1.9% ΔEV, 16% mistakes).

The Capability Threshold Moment

These results capture a fascinating moment in AI development. Blackjack basic strategy sits at a capability threshold where today’s models need thinking to reach parity, but likely won’t require it in the near future. Test‑time compute acts as a temporary bridge—what requires explicit reasoning today will become implicit knowledge tomorrow [9].

The evidence is striking: six different models achieve near-perfect performance with thinking enabled, while their identical non-thinking counterparts fail dramatically. Gemini 2.5 Flash’s 66-point EV swing exemplifies this threshold effect—the same model architecture performs either competently or catastrophically depending solely on whether explicit reasoning is enabled. This suggests we’re witnessing a capability boundary where test-time compute multiplies performance precisely because the underlying task difficulty sits at the current frontier of implicit model knowledge.

The Thinking Breakthrough in Detail

The most striking finding is how thinking transforms the same underlying models. In our results, GPT‑5 (medium reasoning) edges out others by EV with the lowest mistake rate among thinking models, while Claude Sonnet 4 and Gemini 2.5 Flash also deliver near‑basic strategy performance:

Claude Sonnet 4: Near‑Perfect Performance

  • With Thinking: ΔEV −0.06%, 4.0% mistake rate (4,281 decisions) — matches baseline within noise
  • Without Thinking: ΔEV −10.49%, 36% mistake rate (4,979 decisions)
  • Net Impact: ~+10.4 percentage points of ΔEV and a ~32‑point reduction in mistake rate

Gemini 2.5 Flash: Dramatic Transformation

  • With Thinking: ΔEV −0.43%, 6.0% mistake rate (4,345 decisions)
  • Without Thinking: ΔEV −66.6%, 56% mistake rate (5,825 decisions)
  • Net Impact: +66.2 percentage point EV gain

This underscores that explicit test‑time reasoning, not just stored knowledge, drives parity with basic strategy.

GPT‑5 Nano (Medium Thinking): Near‑Basic Strategy

  • ΔEV vs baseline: −0.12%
  • Mistake rate: 5.0% over 4,380 decisions; 2,750 hands; full 550‑cell coverage
  • Confusion hotspots:
    • hard 13–16 vs dealer 2–3: chooses HIT instead of correct STAND
    • soft 15–17 vs dealer 3–4: over‑DOUBLING where HIT is correct
    • occasional hard 10 vs 10 over‑DOUBLE

Confusion summary: STAND→HIT dominates the errors (row mistake rate ~7.5%), while DOUBLE and SPLIT rows are very accurate (≤2%). The top‑weighted leaks concentrate in failing to stand with stiff hands against weak dealer upcards.
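Confusion matrices like the ones summarized here (and tabulated in the appendix) can be tallied straight from the per-decision logs. A sketch with assumed record fields; tools/summarize_confusion.py is the real tool:

```python
from collections import Counter

ACTIONS = ("HIT", "STAND", "DOUBLE", "SPLIT")

def confusion(decisions):
    # decisions: records with the baseline (basic-strategy) action and the
    # action the agent actually took. Rows = baseline, columns = agent.
    counts = Counter((d["baseline"], d["agent"]) for d in decisions)
    matrix = {b: {a: counts[(b, a)] for a in ACTIONS} for b in ACTIONS}
    row_mistake_rate = {
        b: 1 - matrix[b][b] / max(1, sum(matrix[b].values())) for b in ACTIONS
    }
    return matrix, row_mistake_rate
```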

What Thinking Fixes

  • Perfect Fundamental Decisions: A/A and 8/8 splits, standing on 19–21
  • Strategic Consistency: Fewer random or contradictory actions
  • Complex Situation Handling: Better doubling and splitting decisions

Remaining Gaps (Even With Thinking)

Claude Sonnet 4: Minor over-doubling on soft hands vs weak dealer cards

Gemini 2.5 Flash: Over-doubling soft totals, under-doubling soft 19 vs 6

Remaining Leaks (Thinking Models)

  • GPT‑5 (thinking): Remaining errors are mostly soft over‑doubling (soft 17 vs 3–4; soft 18 vs 2); a small hard outlier is hard 12 vs 4 STAND→HIT. Double/split rows are perfect.
  • Claude Sonnet 4 (thinking): Remaining errors are mixed. Some notable hard misses (hard 11 vs A DOUBLE→HIT, hard 8 vs 5/6 HIT→DOUBLE) plus some soft over‑doubling (soft 17/14 vs 3–4).
  • Gemini 2.5 Flash (thinking): Dominated by hard leaks (hard 10 vs 10 HIT→DOUBLE; hard 15 vs 2 STAND→HIT) with some soft over‑doubling and a pair 2/2 vs 10 HIT→SPLIT.
  • GPT‑5 Nano (thinking): Dominated by hard 13–16 vs dealer 2–3 STAND→HIT; smaller soft issues (e.g., soft 19 vs 6 DOUBLE→STAND).
  • Pair handling (A/A, 8/8, 10/10) is consistently correct across all thinking models (first decision).

The Imaginary Strategy Card Phenomenon

A particularly interesting pattern emerges from the thinking traces: models frequently consult imaginary “basic strategy charts” or “lookup tables” during their reasoning process. Despite none of them actually having a way to look up anything, this simulated consultation proves remarkably effective at producing correct decisions.

But what happens when thinking is disabled? The results reveal fascinating failure patterns.

The Counterpoint: How Models Fail Without Thinking

While thinking-enabled models reach basic-strategy parity, their non-thinking counterparts fail in distinctive ways. Each model family exhibits its own error signature, hinting at its underlying decision process.

One counterpoint worth highlighting is Claude Opus 4.1 (no thinking). Despite lacking deliberate reasoning, it posts the strongest non‑thinking result in our set (ΔEV −1.9%, 16% mistakes). Its errors are not chaotic: confusion concentrates in high‑impact spots like HIT→STAND on hard 15–16 versus 7–A, while splits are essentially perfect (0% row mistakes). This pattern suggests a fairly complete static policy with specific blind spots around marginal HIT vs STAND calls and some soft over‑doubling, rather than a wholesale strategy failure. It also suggests that non-thinking models more broadly will catch up on this benchmark within a year or so.

GPT‑5 Nano: A Case Study in Strategic Inconsistency

GPT‑5 Nano Confusion Matrix

[Figure: GPT‑5 Nano (no thinking) confusion matrix, showing high error rates on the STAND and DOUBLE rows]

GPT‑5 Nano’s performance (ΔEV −51% vs baseline, 55% mistake rate) shows how partial knowledge without strategic coherence leads to disaster. The model demonstrates a paradox: near‑perfect execution of some rules alongside catastrophic violations of others.

What GPT‑5 Nano got right:

  • Splitting when correct: 0% mistake rate when the baseline action is SPLIT (perfect recognition of correct split scenarios)
  • Basic hit/stand on obvious situations: Reasonable performance on clear‑cut decisions

Where it went disastrously wrong:

  • Splitting 10,10 vs dealer 10: trading the second-best possible hand for two merely decent ones. In this run it is the top EV leak, accounting for ~7.0% of the model’s loss.
  • Hitting hard 17+ vs strong dealers: A fundamental mistake that it kept making.
  • Doubling conservatism: when doubling is the correct play, it fails to double 99% of the time.
  • Standing inconsistency: a 73% mistake rate when the correct action is STAND, usually hitting hands it should stand on

This pattern suggests GPT‑5 Nano has memorized some blackjack “rules” (like split A/A and 8/8) but lacks the strategic framework to apply them consistently. The result is worse than random play—systematic errors compound losses. The model’s perfect recognition of correct splits makes its other failures even more puzzling, highlighting how LLMs can exhibit highly uneven competence across related tasks.

The Gemma3 “Hit Everything” Syndrome

Gemma3 Confusion Matrix

[Figure: Gemma3 confusion matrix, showing a systematic “hit everything” bias]

Confusion Matrix Analysis:

  • Should STAND → Actually HIT: 3,497 errors (97% of all STAND situations)
  • Should DOUBLE → Actually HIT: 576 errors (93.5% of all DOUBLE situations)
  • Never doubles down: 0 correct doubles out of 616 opportunities

Pattern: Gemma3 has learned “when in doubt, hit” as a default strategy. For blackjack, this is a very poor strategy since standing is frequently the optimal move and generally carries lower risk. This represents a complete strategic failure where the model defaults to the most aggressive action regardless of situation.

Non‑Thinking vs Thinking Error Patterns

Gemini 2.5 Flash comparison:

Without Thinking:

  • Chaotic error distribution across all categories
  • 56% overall mistake rate
  • Major leaks: splits 10/10 (should STAND); hits hard 17 vs 10 (should STAND)

With Thinking:

  • Systematic errors concentrated in edge cases
  • 6.0% overall mistake rate
  • Remaining errors: over‑doubling soft and low hard hands in a few spots

Non‑thinking Gemini 2.5 Flash shows chaotic error patterns across all decision types, with major leaks including splitting strong pairs (10/10) and over‑doubling already strong hands (hard 19). Unlike Gemma3’s systematic bias, this represents inconsistent strategic reasoning.

Note on severity vs frequency: mistake rate alone can be misleading for EV — aggressive wrong actions (e.g., bad doubles or splitting tens) carry higher EV penalties than benign STAND errors, so two models with similar mistake rates can have different ΔEV.
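This severity-times-frequency view is exactly what the appendix's "Top EV Leaks" tables capture: mistakes ranked by weight × EV loss rather than raw count. A sketch with assumed field names:

```python
from collections import defaultdict

def top_leaks(decisions, top: int = 5):
    # Each record carries its cell's natural-frequency weight and the EV
    # given up relative to the baseline action for that cell.
    leaks = defaultdict(float)
    for d in decisions:
        if d["agent"] != d["baseline"]:
            key = (d["cat"], d["dealer"], d["baseline"], d["agent"])
            leaks[key] += d["weight"] * d["ev_loss"]
    # A frequent bad double rises to the top; a rare benign error stays low.
    return sorted(leaks.items(), key=lambda kv: kv[1], reverse=True)[:top]
```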

These observations raise important questions: why do models perform so differently? What drives the currently observed thinking advantage?

The Why: Deeper Analysis

Two lenses help explain the performance gaps we observe: (1) where models sit on the current capability threshold, and (2) the practical drivers—what models know versus what they execute, and how decision complexity maps to compute and errors.

Scaling Perspective

Basic blackjack sits near a moving capability threshold. Today, general‑purpose models reach basic‑strategy parity with test‑time compute (thinking) but not without it. In the recent past, they struggled even with reasoning enabled. In the near future, stronger models will likely match basic strategy without explicit thinking. Test‑time compute acts as a capability multiplier near thresholds and is most valuable in harder or shifting situations [9].

Knowledge vs. Execution

Before running benchmarks, we surveyed models about their blackjack knowledge (see model_thoughts/). Most stated the right rules and core principles (split A/A and 8/8, never split 10/10, stand on hard 17+), with the greatest variation in doubling. Gemma3 was the only model with clear errors in its stated strategy: it suggests doubling down on soft 19–21.

Key finding: knowledge alone isn’t sufficient. Even models with solid theoretical understanding failed dramatically without thinking enabled. This contrasts with skilled human players, who execute basic strategy automatically because they’ve memorized it.

The Computational Cost of Strategy: A Decision Complexity Hierarchy

Thinking tokens per decision reveal a consistent complexity hierarchy:

  • Trivial (150–200 tokens): A/A split; soft 20 stand
  • Simple (200–500): Hard 17–21 stands; hard 7–10 vs weak dealers
  • Complex (500–1,000): Hard 13–16 vs dealer 2–3
  • Difficult (1,000–1,500): Soft hands vs specific upcards (e.g., soft 17 vs 3)
  • Expert‑level (1,500+): Pair 9/9 vs 8 (max 8,191 tokens); soft 18 edge cases

Key insight: A ~48× gap between the easiest and hardest decisions shows blackjack isn’t uniformly difficult—compute cost and error rates rise together in the hard tiers.
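Measuring the hierarchy is mechanical: bucket each logged decision by its thinking-token count. A sketch against the JSONL logs, where the thinking_tokens field name is an assumption:

```python
import json

# Tier boundaries from the hierarchy above (thinking tokens per decision).
TIERS = [(200, "trivial"), (500, "simple"), (1000, "complex"),
         (1500, "difficult"), (float("inf"), "expert")]

def tier(tokens: int) -> str:
    return next(name for bound, name in TIERS if tokens < bound)

def thinking_load(log_path: str) -> dict[str, int]:
    counts: dict[str, int] = {}
    with open(log_path) as f:
        for line in f:
            record = json.loads(line)
            bucket = tier(record.get("thinking_tokens", 0))
            counts[bucket] = counts.get(bucket, 0) + 1
    return counts
```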

Hard Totals Thinking Load

[Figure: hard-totals thinking-load grid; rows = hard total, columns = dealer 2–10, A. Model: GPT‑5 Nano (medium thinking)]

Soft Totals Thinking Load

[Figure: soft-totals thinking-load grid; rows = soft total, columns = dealer 2–10, A. Model: GPT‑5 Nano (medium thinking)]

Pairs Thinking Load

[Figure: pairs thinking-load grid; rows = pairs A/A–10/10, columns = dealer 2–10, A. Model: GPT‑5 Nano (medium thinking)]

Taken together, scaling, knowledge‑execution gaps, and the complexity hierarchy explain why explicit thinking flips outcomes today. They also set up the economic question we answer in the conclusion: when is on‑demand reasoning worth its cost?

Conclusion: The Thinking Revolution

BlackjackBench reveals a fundamental shift in how AI models perform at capability thresholds. The dramatic transformation between thinking and non‑thinking modes—exemplified by Gemini 2.5 Flash’s leap from −66.6% to −0.43% ΔEV—demonstrates that explicit reasoning doesn’t just improve performance, it flips outcomes entirely.

What We Learned About Blackjack Performance

The benchmark establishes three key findings specific to blackjack strategy:

Capability threshold effects: Multiple thinking‑enabled models (GPT‑5, Claude Sonnet 4, GPT‑5 Nano, Gemini 2.5 Flash) achieve basic‑strategy parity, while their non‑thinking counterparts trail by 10–66 percentage points. This suggests basic blackjack sits precisely at today’s reasoning capability boundary.

Knowledge‑execution gaps: Models often understand strategy when asked directly yet fail to execute consistently without thinking scaffolds. Even perfect knowledge of rules like “split A/A” doesn’t guarantee consistent application across related decisions.

Economic considerations: For blackjack specifically, reasoning costs ($0.0002–$0.0044 per decision) exceed the value since lookup tables solve the game perfectly [10]. However, this cost analysis reveals precisely why blackjack makes an ideal benchmark—it provides a controlled environment to measure reasoning capabilities without the practical utility obscuring the scientific insights. The benchmark’s true value lies in understanding when and how AI reasoning transforms performance at capability thresholds, insights that directly apply to domains where lookup tables don’t exist and reasoning costs are justified by decision stakes.

The Broader Implications

This research captures something more significant than blackjack performance—it demonstrates how test‑time compute transforms AI capabilities at critical thresholds. The patterns we observe suggest profound implications for reasoning tasks beyond games.

The knowledge‑execution bridge: If models struggle to consistently apply known strategies without explicit reasoning in a constrained domain like blackjack, the gap between understanding and reliable execution in complex real‑world scenarios becomes even more critical to address.

When reasoning economics favor thinking: Unlike blackjack’s solved lookup tables, most real‑world decisions involve novel situations, changing conditions, and high stakes where reasoning costs pale beside potential impact. Business strategy, investment decisions, and major life choices represent domains where thinking‑enabled models offer transformative value.

A glimpse of the future: Today’s thinking‑enabled performance hints at tomorrow’s implicit capabilities. The 66‑point EV swings we observe in blackjack suggest that as reasoning becomes more sophisticated and more capabilities are distilled into the models, the advantages for complex, open‑ended problems will be even more dramatic.

The bottom line: We’re witnessing the early stages of a capability transition from AI systems that know the right answer to systems that can reliably find and execute it under pressure. For high‑stakes decisions where consistent strategic reasoning matters, the economics strongly favor the most capable thinking‑enabled models available.


Appendix: Detailed Error Analysis and Repro

Below are detailed per‑model confusion matrices (policy‑grid; decision‑level) and top weighted mistakes (first decision only, weighted by natural frequency). These tables correspond to the baselines referenced in the main text.

How to Run (Reproducible Examples)

Note: CSV files referenced below are available in the BlackjackBench repository (https://github.com/jsnider3/BlackjackBench) and are not served from this blog.

  • Policy‑grid weighted (basic):
    • python -m blackjack_bench.cli run --agent basic --track policy-grid --weighted --reps 100 --seed 7
  • LLM example (Gemini API with thinking):
    • python -m blackjack_bench.cli run --agent llm --guard --llm-provider gemini --llm-model gemini-2.5-flash --reasoning medium --track policy-grid --weighted --reps 5 --seed 7
  • LLM example (Anthropic Claude with thinking):
    • python -m blackjack_bench.cli run --agent llm --guard --llm-provider anthropic --llm-model claude-sonnet-4-20250514 --reasoning medium --track policy-grid --weighted --reps 5 --seed 7
  • Inspect confusion (baseline vs agent):
    • python tools/summarize_confusion.py --track policy-grid logs/<timestamp>_policy-grid_<agent>_<model>.jsonl --csv confusion.csv

Gemma3 12B‑IT QAT: Complete Breakdown

Confusion Matrix (Policy‑Grid)

| baseline \ agent | HIT | STAND | DOUBLE | SPLIT | row_total | row_mistake_rate |
|---|---|---|---|---|---|---|
| HIT | 2011 | 20 | 0 | 122 | 2153 | 0.066 |
| STAND | 3497 | 91 | 0 | 20 | 3608 | 0.975 |
| DOUBLE | 576 | 0 | 0 | 40 | 616 | 1.000 |
| SPLIT | 5 | 0 | 0 | 252 | 257 | 0.019 |
| total | 6089 | 111 | 0 | 434 | 6634 | 0.645 |

Top Weighted Mistakes (First Decision)

| Category | Dealer | Baseline | Agent | Mistakes | Weighted Share |
|---|---|---|---|---|---|
| pair 10/10 | 10 | STAND | SPLIT | 5 | 6.46% |
| hard 17 | 10 | STAND | HIT | 10 | 4.04% |
| hard 18 | 10 | STAND | HIT | 5 | 3.23% |
| hard 19 | 10 | STAND | HIT | 5 | 3.23% |
| hard 11 | 10 | DOUBLE | HIT | 20 | 3.23% |
| hard 12 | 4 | STAND | HIT | 20 | 1.41% |
| hard 12 | 5 | STAND | HIT | 20 | 1.41% |
| hard 12 | 6 | STAND | HIT | 20 | 1.41% |

Gemini 2.5 Flash (Non‑Thinking): Chaotic Errors

Confusion Matrix (Policy‑Grid)

| baseline \ agent | HIT | STAND | DOUBLE | SPLIT | row_total | row_mistake_rate |
|---|---|---|---|---|---|---|
| HIT | 1851 | 52 | 62 | 122 | 2087 | 0.113 |
| STAND | 2084 | 310 | 444 | 25 | 2863 | 0.892 |
| DOUBLE | 398 | 12 | 170 | 38 | 618 | 0.725 |
| SPLIT | 0 | 0 | 0 | 257 | 257 | 0.000 |
| total | 4333 | 374 | 676 | 442 | 5825 | 0.556 |

Top EV Leaks (First Decision; Weighted by Natural Frequency)

| Category | Dealer | Baseline | Agent | Count | Weighted EV Loss | Share |
|---|---|---|---|---|---|---|
| pair 10/10 | 10 | STAND | SPLIT | 5 | 0.1457 | 7.68% |
| hard 11 | 10 | DOUBLE | HIT | 20 | 0.0728 | 3.84% |
| hard 17 | 10 | STAND | HIT | 10 | 0.0619 | 3.26% |
| hard 18 | 10 | STAND | HIT | 5 | 0.0583 | 3.07% |
| hard 19 | 2 | STAND | DOUBLE | 5 | 0.0546 | 2.88% |

Gemini 2.5 Flash (Thinking): Detailed Breakdown

Confusion Matrix (Policy‑Grid)

| baseline \ agent | HIT | STAND | DOUBLE | SPLIT | row_total | row_mistake_rate |
|---|---|---|---|---|---|---|
| HIT | 1585 | 49 | 62 | 20 | 1716 | 0.076 |
| STAND | 113 | 1681 | 0 | 0 | 1794 | 0.063 |
| DOUBLE | 7 | 7 | 564 | 0 | 578 | 0.024 |
| SPLIT | 4 | 0 | 0 | 253 | 257 | 0.016 |
| total | 1709 | 1737 | 626 | 273 | 4345 | 0.060 |

Top Leaks (First Decision; Weighted by Natural Frequency)

| Category | Dealer | Baseline | Agent | Mistakes | Weighted Share |
|---|---|---|---|---|---|
| hard 10 | 10 | HIT | DOUBLE | 12 | 31.89% |
| pair 5/5 | 10 | HIT | DOUBLE | 5 | 6.64% |
| hard 15 | 2 | STAND | HIT | 9 | 5.98% |
| pair 2/2 | 10 | HIT | SPLIT | 3 | 3.99% |
| soft 14 | 4 | HIT | DOUBLE | 5 | 3.32% |

Claude Sonnet 4 (Thinking): Detailed Breakdown

Confusion Matrix (Policy‑Grid)

| baseline \ agent | HIT | STAND | DOUBLE | SPLIT | row_total | row_mistake_rate |
|---|---|---|---|---|---|---|
| HIT | 1591 | 53 | 63 | 6 | 1713 | 0.071 |
| STAND | 15 | 1718 | 1 | 4 | 1738 | 0.012 |
| DOUBLE | 15 | 6 | 552 | 0 | 573 | 0.037 |
| SPLIT | 6 | 0 | 4 | 247 | 257 | 0.039 |
| total | 1627 | 1777 | 620 | 257 | 4281 | 0.040 |

Top Weighted Mistakes (First Decision)

| Category | Dealer | Baseline | Agent | Mistakes | Weighted Share |
|---|---|---|---|---|---|
| hard 11 | A | DOUBLE | HIT | 8 | 6.30% |
| hard 10 | 10 | HIT | DOUBLE | 2 | 6.30% |
| soft 14 | 3 | HIT | DOUBLE | 5 | 3.94% |
| soft 17 | 3 | HIT | DOUBLE | 5 | 3.94% |

Claude Sonnet 4 (No Thinking): Detailed Breakdown

Confusion Matrix (Policy‑Grid)

| baseline \ agent | HIT | STAND | DOUBLE | SPLIT | row_total | row_mistake_rate |
|---|---|---|---|---|---|---|
| HIT | 1443 | 125 | 273 | 117 | 1958 | 0.263 |
| STAND | 706 | 1204 | 237 | 5 | 2152 | 0.441 |
| DOUBLE | 287 | 5 | 280 | 40 | 612 | 0.542 |
| SPLIT | 0 | 0 | 0 | 257 | 257 | 0.000 |
| total | 2436 | 1334 | 790 | 419 | 4979 | 0.361 |

Top Weighted Mistakes (First Decision)

| Category | Dealer | Baseline | Agent | Mistakes | Weighted Share |
|---|---|---|---|---|---|
| hard 11 | 10 | DOUBLE | HIT | 20 | 6.87% |
| hard 12 | 5 | STAND | DOUBLE | 20 | 3.01% |
| hard 13 | 2 | STAND | HIT | 20 | 3.01% |

Claude Opus 4.1 (No Thinking): Detailed Breakdown

Confusion Matrix (Policy‑Grid)

| baseline \ agent | HIT | STAND | DOUBLE | SPLIT | row_total | row_mistake_rate |
|---|---|---|---|---|---|---|
| HIT | 1375 | 208 | 160 | 55 | 1798 | 0.235 |
| STAND | 194 | 1537 | 36 | 5 | 1772 | 0.133 |
| DOUBLE | 58 | 5 | 532 | 0 | 595 | 0.106 |
| SPLIT | 0 | 0 | 0 | 257 | 257 | 0.000 |
| total | 1627 | 1750 | 728 | 317 | 4422 | 0.163 |

Top Weighted Mistakes (First Decision)

| Category | Dealer | Baseline | Agent | Mistakes | Weighted Share |
|---|---|---|---|---|---|
| hard 15 | 10 | HIT | STAND | 10 | 27.87% |
| hard 16 | 9 | HIT | STAND | 10 | 16.72% |
| hard 15 | 7 | HIT | STAND | 11 | 12.54% |
| hard 12 | 2 | HIT | STAND | 5 | 5.57% |
| hard 5 | 5 | HIT | DOUBLE | 5 | 3.48% |

GPT‑5 (Medium Thinking): Detailed Breakdown

Confusion Matrix (Policy‑Grid)

| baseline \ agent | HIT | STAND | DOUBLE | SPLIT | row_total | row_mistake_rate |
|---|---|---|---|---|---|---|
| HIT | 1685 | 2 | 28 | 1 | 1716 | 0.018 |
| STAND | 12 | 1774 | 7 | 0 | 1793 | 0.011 |
| DOUBLE | 0 | 0 | 575 | 0 | 575 | 0.000 |
| SPLIT | 0 | 0 | 0 | 257 | 257 | 0.000 |
| total | 1697 | 1776 | 610 | 258 | 4341 | 0.012 |

Top Weighted Mistakes (First Decision)

| Category | Dealer | Baseline | Agent | Mistakes | Weighted Share |
|---|---|---|---|---|---|
| hard 12 | 4 | STAND | HIT | 3 | 24.74% |
| soft 18 | 2 | STAND | DOUBLE | 5 | 10.31% |
| soft 17 | 3 | HIT | DOUBLE | 5 | 10.31% |

GPT‑5 (No Thinking): Detailed Breakdown

Confusion Matrix (Policy‑Grid)

| baseline \ agent | HIT | STAND | DOUBLE | SPLIT | row_total | row_mistake_rate |
|---|---|---|---|---|---|---|
| HIT | 1084 | 160 | 425 | 93 | 1762 | 0.385 |
| STAND | 466 | 825 | 348 | 15 | 1654 | 0.501 |
| DOUBLE | 90 | 30 | 482 | 0 | 602 | 0.199 |
| SPLIT | 2 | 0 | 5 | 250 | 257 | 0.027 |
| total | 1642 | 1015 | 1260 | 358 | 4275 | 0.382 |

Top Weighted Mistakes (First Decision)

| Category | Dealer | Baseline | Agent | Mistakes | Weighted Share |
|---|---|---|---|---|---|
| hard 15 | 10 | HIT | DOUBLE | 4 | 3.92% |
| hard 11 | 10 | DOUBLE | HIT | 12 | 2.94% |
| hard 13 | 6 | STAND | DOUBLE | 20 | 2.14% |

GPT‑5 Nano (Medium Thinking): Detailed Breakdown

Confusion Matrix (Policy‑Grid)

| baseline \ agent | HIT | STAND | DOUBLE | SPLIT | row_total | row_mistake_rate |
|---|---|---|---|---|---|---|
| HIT | 1652 | 9 | 51 | 9 | 1721 | 0.040 |
| STAND | 134 | 1690 | 2 | 1 | 1827 | 0.075 |
| DOUBLE | 1 | 7 | 567 | 0 | 575 | 0.014 |
| SPLIT | 3 | 2 | 0 | 252 | 257 | 0.019 |
| total | 1790 | 1708 | 620 | 262 | 4380 | 0.050 |

Top Weighted Mistakes (First Decision)

| Category | Dealer | Baseline | Agent | Mistakes | Weighted Share |
|---|---|---|---|---|---|
| hard 15 | 2 | STAND | HIT | 11 | 11.44% |
| hard 13 | 2 | STAND | HIT | 13 | 10.95% |
| hard 16 | 2 | STAND | HIT | 5 | 8.46% |

GPT‑5 Nano (No Thinking): Detailed Breakdown

Confusion Matrix (Policy‑Grid)

| baseline \ agent | HIT | STAND | DOUBLE | SPLIT | row_total | row_mistake_rate |
|---|---|---|---|---|---|---|
| HIT | 1586 | 298 | 20 | 116 | 2020 | 0.215 |
| STAND | 2105 | 814 | 21 | 37 | 2977 | 0.727 |
| DOUBLE | 479 | 94 | 7 | 35 | 615 | 0.989 |
| SPLIT | 0 | 0 | 0 | 257 | 257 | 0.000 |
| total | 4170 | 1206 | 48 | 445 | 5869 | 0.546 |

Top Weighted Mistakes (First Decision)

| Category | Dealer | Baseline | Agent | Mistakes | Weighted Share |
|---|---|---|---|---|---|
| pair 10/10 | 10 | STAND | SPLIT | 5 | 7.0% |
| hard 17 | 10 | STAND | HIT | 7 | 3.3% |
| hard 11 | 10 | DOUBLE | HIT | 16 | 2.8% |

Gemini 2.5 Pro: Detailed Breakdown

Confusion Matrix (Policy‑Grid)

| baseline \ agent | HIT | STAND | DOUBLE | SPLIT | row_total | row_mistake_rate |
|---|---|---|---|---|---|---|
| HIT | 1687 | 8 | 24 | 2 | 1721 | 0.020 |
| STAND | 40 | 1754 | 11 | 1 | 1806 | 0.029 |
| DOUBLE | 2 | 1 | 571 | 0 | 574 | 0.005 |
| SPLIT | 0 | 1 | 0 | 256 | 257 | 0.004 |
| total | 1729 | 1764 | 606 | 259 | 4358 | 0.021 |

Top Weighted Mistakes (First Decision)

| Category | Dealer | Baseline | Agent | Mistakes | Weighted Share |
|---|---|---|---|---|---|
| hard 15 | 2 | STAND | HIT | 4 | 16% |
| hard 14 | 2 | STAND | HIT | 2 | 13% |
| soft 17 | 3 | HIT | DOUBLE | 5 | 7.9% |

Gemini 2.5 Flash Lite: Detailed Breakdown

Confusion Matrix (Policy‑Grid)

| baseline \ agent | HIT | STAND | DOUBLE | SPLIT | row_total | row_mistake_rate |
|---|---|---|---|---|---|---|
| HIT | 529 | 69 | 884 | 117 | 1599 | 0.669 |
| STAND | 331 | 177 | 889 | 15 | 1412 | 0.875 |
| DOUBLE | 81 | 0 | 528 | 0 | 609 | 0.133 |
| SPLIT | 0 | 0 | 0 | 257 | 257 | 0.000 |
| total | 941 | 246 | 2301 | 389 | 3877 | 0.615 |

Top Weighted Mistakes (First Decision)

| Category | Dealer | Baseline | Agent | Count | Weighted EV Loss | Share |
|---|---|---|---|---|---|---|
| hard 18 | 10 | STAND | DOUBLE | 5 | 0.1311 | 6.18% |
| hard 19 | 10 | STAND | DOUBLE | 5 | 0.1311 | 6.18% |
| pair 10/10 | 5 | STAND | DOUBLE | 5 | 0.0947 | 4.47% |
| hard 19 | 2 | STAND | DOUBLE | 5 | 0.0546 | 2.58% |
| hard 16 | 4 | STAND | DOUBLE | 10 | 0.0482 | 2.28% |

Sonoma Sky Alpha: Detailed Breakdown

Confusion Matrix (Policy‑Grid)

| baseline \ agent | HIT | STAND | DOUBLE | SPLIT | row_total | row_mistake_rate |
|---|---|---|---|---|---|---|
| HIT | 1543 | 70 | 76 | 21 | 1710 | 0.098 |
| STAND | 141 | 1629 | 0 | 2 | 1772 | 0.081 |
| DOUBLE | 6 | 22 | 557 | 0 | 585 | 0.048 |
| SPLIT | 0 | 6 | 0 | 250 | 256 | 0.023 |
| total | 1690 | 1727 | 633 | 273 | 4323 | 0.080 |

Top Weighted Mistakes (First Decision)

| Category | Dealer | Baseline | Agent | Mistakes | Weighted Share |
|---|---|---|---|---|---|
| hard 10 | 10 | HIT | DOUBLE | 12 | 15.41% |
| hard 14 | 2 | STAND | HIT | 10 | 7.06% |
| hard 12 | 2 | HIT | STAND | 12 | 6.74% |
| hard 15 | 3 | STAND | HIT | 8 | 5.46% |
| hard 12 | 3 | HIT | STAND | 6 | 3.85% |
| hard 13 | 2 | STAND | HIT | 6 | 3.85% |
| hard 15 | 2 | STAND | HIT | 7 | 3.21% |
| hard 16 | 2 | STAND | HIT | 4 | 3.21% |
| hard 16 | 6 | STAND | HIT | 4 | 3.21% |
| hard 14 | 3 | STAND | HIT | 5 | 2.57% |
| hard 16 | 3 | STAND | HIT | 5 | 2.57% |
| hard 13 | 3 | STAND | HIT | 2 | 2.57% |
| hard 15 | 6 | STAND | HIT | 4 | 2.25% |
| soft 13 | 4 | HIT | DOUBLE | 5 | 1.61% |
| soft 15 | 4 | HIT | DOUBLE | 5 | 1.61% |

Sonoma Dusk Alpha: Detailed Breakdown

Confusion Matrix (Policy‑Grid)

| baseline \ agent | HIT | STAND | DOUBLE | SPLIT | row_total | row_mistake_rate |
|---|---|---|---|---|---|---|
| HIT | 1107 | 205 | 406 | 122 | 1840 | 0.398 |
| STAND | 713 | 1033 | 166 | 18 | 1930 | 0.465 |
| DOUBLE | 266 | 44 | 267 | 40 | 617 | 0.567 |
| SPLIT | 0 | 5 | 0 | 252 | 257 | 0.019 |
| total | 2086 | 1287 | 839 | 432 | 4644 | 0.427 |

Top Weighted Mistakes (First Decision)

| Category | Dealer | Baseline | Agent | Count | Weighted EV Loss | Share |
|---|---|---|---|---|---|---|
| hard 12 | 10 | HIT | STAND | 6 | 0.0874 | 10.86% |
| hard 11 | 10 | DOUBLE | HIT | 17 | 0.0546 | 6.79% |
| soft 17 | 10 | HIT | DOUBLE | 5 | 0.0218 | 2.71% |
| hard 11 | 3 | DOUBLE | HIT | 15 | 0.0191 | 2.38% |
| pair 4/4 | 10 | HIT | SPLIT | 5 | 0.0182 | 2.26% |

Use these commands and files to reproduce the tables and figures referenced in this post.

  • Per‑model confusion matrices (policy‑grid):
    • python tools/summarize_confusion.py --track policy-grid --recompute-baseline baselines/20250911_policy-grid_llm_gpt-5-med-thinking.jsonl --csv figures/gpt5_confusion.csv
    • python tools/summarize_confusion.py --track policy-grid --recompute-baseline baselines/20250910_policy-grid_llm_claude-sonnet-4-20250514-thinking.jsonl --csv figures/claude_confusion.csv
    • python tools/summarize_confusion.py --track policy-grid --recompute-baseline baselines/20250909_policy-grid_llm_gemini-2-5-flash-thinking.jsonl --csv figures/gemini_flash_confusion.csv
    • python tools/summarize_confusion.py --track policy-grid --recompute-baseline baselines/20250911_policy-grid_llm_gpt-5-nano-med-thinking.jsonl --csv figures/gpt5_nano_confusion.csv
  • Top weighted leaks (decision_idx==0 only):
    • python tools/top_leaks.py baselines/20250911_policy-grid_llm_gpt-5-med-thinking.jsonl --top 15
    • Repeat for each baseline file in baselines/.
  • Thinking‑load tables and heatmaps:
    • Build per‑cell averages across models:
      python tools/aggregate_thinking.py baselines/20250911_policy-grid_llm_gpt-5-med-thinking.jsonl baselines/20250910_policy-grid_llm_claude-sonnet-4-20250514-thinking.jsonl baselines/20250909_policy-grid_llm_gemini-2-5-flash-thinking.jsonl baselines/20250911_policy-grid_llm_gpt-5-nano-med-thinking.jsonl
    • Generate grids and SVGs:
      python tools/build_thinking_charts.py --per-cell figures/thinking_load_by_cell.csv --out-dir figures
    • Outputs: figures/thinking_hard_grid.csv/svg, figures/thinking_soft_grid.csv/svg, figures/thinking_pairs_grid.csv/svg.
  • Logs/baselines used (audit): see JSONL files under baselines/ and logs/ referenced in this post.

Footnotes

  1. “Thinking” denotes Chain‑of‑Thought style test‑time reasoning: models generate intermediate reasoning before outputting a single action token. See the LLM Integration section (“What is ‘thinking’?”) for the exact prompt and provider settings. Background on CoT: Wei et al., 2022 (Chain‑of‑Thought Prompting), Kojima et al., 2022 (Zero‑Shot Reasoners), Wang et al., 2022 (Self‑Consistency).

  2. Six-deck H17 DAS chart used for the baseline policy: https://www.blackjackapprenticeship.com/wp-content/uploads/2024/09/H17-Basic-Strategy.pdf

  3. Player two‑card categories are unordered (e.g., A,2 ≡ 2,A), which is why P(A,2) is multiplied by 2 in the example. Under the independence/infinite‑deck approximation, weights across all 550 cells sum to 1.0. Implementation: blackjack_bench/weights.py::grid_weights_infinite_deck

  4. Card counting outperforms basic strategy: With favorable rules and sufficient penetration, card counting can yield positive player EV, exceeding fixed basic strategy. See Wizard of Odds: https://wizardofodds.com/games/blackjack/card-counting/ 

  5. Expected Value Methodology: The +2.6% EV for Basic Strategy represents the empirical result from the specific 2,750 hands tested, not the theoretical house edge (~-0.5% under standard rules). This sample-specific baseline ensures fair comparison since all models play identical hands with identical random seeds. The positive EV indicates this particular sample was player-favorable, which is within normal variance for blackjack. 

  6. Illegal Move Handling: The legality guard was rarely triggered across all models tested, with most showing zero illegal attempts. When illegal moves occurred, they were logged and handled by a deliberately bad fallback policy (“BadAgent”) that chooses intentionally poor but legal actions (e.g., doubling whenever possible, splitting tens) to penalize rule violations while allowing benchmark continuation. Implementation: blackjack_bench/agents/guarded.py, blackjack_bench/agents/bad_agent.py, with logging and reporting in blackjack_bench/eval.py and blackjack_bench/cli_helpers.py

  7. Model specifications: Exact model identifiers used: Claude Sonnet 4 (claude-sonnet-4-20250514), Claude Opus 4.1 (claude-opus-4-1-20250805), GPT‑5 (gpt-5), Gemini 2.5 Flash (gemini-2.5-flash), Gemini 2.5 Flash Lite (gemini-2.5-flash-lite), Gemini 2.5 Pro (gemini-2.5-pro). Sonoma Sky Alpha and Sonoma Dusk Alpha are stealth models hosted on OpenRouter that are strongly suspected to be Grok 4 variants. See, for example: https://manifold.markets/iwakura/who-is-behind-the-sonoma-cloaked-mo 

  8. Statistical Confidence: Wide 95% confidence intervals (e.g., [-4.2%, +5.8%]) reflect the inherent variance in blackjack outcomes. Results are based on 5 repetitions of the 550-cell policy grid (2,750 total hands per model). Significantly more repetitions would be required to achieve tighter confidence bounds for distinguishing top-tier models. 

  9. Test‑time compute for reasoning: Increasing test‑time sampling/search typically improves reasoning accuracy. See Yao et al., 2023 (“Tree of Thoughts: Deliberate Problem Solving with Large Language Models”; arXiv:2305.10601) and Wang et al., 2022 (Self‑Consistency) for methods that trade additional inference compute for better results.

  10. Reasoning cost methodology: Costs computed by multiplying observed tokens per decision (prompt + completion, including thinking traces) by provider rates per 1K tokens. Pricing sources: OpenAI API pricing (https://openai.com/api/pricing), Anthropic pricing (https://www.anthropic.com/pricing), and Google Gemini pricing (https://ai.google.dev/pricing or https://cloud.google.com/vertex-ai/pricing#gemini). Normalization: identical accounting across providers; provider‑specific rounding and tiering ignored. 

Click here to suggest a topic through GitHub. If you don't have a GitHub, feel free to email me.