Six leading AI companies battle head-to-head in Texas Hold'em poker.
What we tested: Six frontier LLMs played Texas Hold'em poker against each other in 1,000-hand heads-up matches. Each model acted as an autonomous poker agent, receiving game state information and making decisions using structured reasoning. The experiment cost $234.33 in total API calls across 15,000 hands of play.
Key finding: OpenAI's GPT-5 emerged as the clear winner with a dominant +143 BB/100 win rate, followed by Anthropic's Claude Sonnet and Google's Gemini. DeepSeek struggled significantly, posting the worst results despite winning the most raw hands. This reveals an interesting insight: winning a large share of hands doesn't equate to winning chips if the pots you lose are far bigger than the pots you win.
The 368 total rebuys indicate aggressive, high-variance play, though the count is unevenly spread: DeepSeek alone accounts for 164 rebuys, while GPT-5 needed only 12. This suggests most of the LLMs are not playing conservative "grind" poker, but rather engaging in dynamic, pot-building strategies. Whether this is optimal play or exploitable behavior is explored in the detailed sections below.
Rankings based on BB/100 (big blinds won per 100 hands) - the standard poker performance metric.
| Rank | Model | Total Profit | BB/100 | Win Rate | ROI | Hands Played | Rebuys |
|---|---|---|---|---|---|---|---|
| 1 | gpt5 | 71,663 | 143.33 | 54.3% | 421.5% | 5,000 | 12 |
| 2 | sonnet | 44,831 | 89.66 | 49.9% | 83.0% | 5,000 | 49 |
| 3 | gemini | 43,547 | 87.09 | 38.6% | 90.7% | 5,000 | 43 |
| 4 | grok | 20,180 | 40.36 | 61.2% | 39.6% | 5,000 | 46 |
| 5 | mistral | -21,242 | -42.48 | 35.7% | -36.0% | 5,000 | 54 |
| 6 | deepseek | -158,979 | -317.96 | 63.9% | -94.1% | 5,000 | 164 |
GPT-5 leads decisively with a +143 BB/100 win rate. To put this in perspective, professional poker players typically aim for +5-10 BB/100 against human opponents. GPT-5's 54.3% hand win rate combined with efficient value extraction makes it the clear winner.
The DeepSeek paradox is striking: despite winning 63.9% of hands (the highest rate), it posted the worst results at -318 BB/100 with 164 rebuys. This suggests DeepSeek plays too many marginal hands and doesn't fold when behind, consistently losing large pots that wipe out small wins.
GPT-5 achieves the best ROI at 421.5%, meaning its profit was more than four times the chips it bought in for. Combined with the lowest rebuy count (12), this indicates strong discipline: avoiding marginal spots and extracting maximum value from winning hands. The contrast between GPT-5's efficiency and DeepSeek's volume approach is a key finding.
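The BB/100 and ROI columns can be reproduced from the profit, hand, and rebuy counts in the table. The sketch below assumes a 10-chip big blind, a 1,000-chip buy-in, and one initial buy-in per opponent; none of these are stated in the report, but they are the values that make the published figures line up. The function names are illustrative.

```python
# Assumed stakes, inferred from the published numbers (not stated in the report).
BIG_BLIND = 10          # chips
BUY_IN = 1_000          # chips per initial buy-in or rebuy
MATCHES_PER_MODEL = 5   # one initial buy-in per opponent


def bb_per_100(profit_chips, hands, big_blind=BIG_BLIND):
    """Big blinds won per 100 hands."""
    return profit_chips / big_blind / hands * 100


def roi_pct(profit_chips, rebuys, buy_in=BUY_IN, matches=MATCHES_PER_MODEL):
    """Profit as a percentage of all chips bought in for."""
    invested = (matches + rebuys) * buy_in
    return profit_chips / invested * 100


# (total profit, hands played, rebuys) copied from the rankings table.
rankings = {"gpt5": (71_663, 5_000, 12), "deepseek": (-158_979, 5_000, 164)}
for model, (profit, hands, rebuys) in rankings.items():
    print(model, round(bb_per_100(profit, hands), 2), round(roi_pct(profit, rebuys), 1))
```

Under these assumptions the output matches the table rows for both models, which is some evidence that ROI here is measured against total buy-ins rather than against chips wagered.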
Percentage of hands won by each model (row) against each opponent (column), based on actual hands won. Green = winning more hands, Red = losing more hands.
DeepSeek dominates hand count but loses money: Looking at the green row for DeepSeek, it wins 51-70% of hands against every opponent. Yet we know from the rankings it lost 158,979 chips overall. This confirms the "many small wins, few catastrophic losses" pattern.
Sonnet dominates Gemini (74.5%) in their matchup, the highest win rate in the matrix. Meanwhile, Grok has a strong record across the board, beating 4 of 5 opponents on hand count. The near-50% matchups (GPT-5 vs Sonnet at 48.5%, GPT-5 vs Grok at 48.3%) suggest these models have similar skill levels or rock-paper-scissors dynamics.
Mistral struggles universally, showing red (losing) against all opponents. It only manages 28-32% win rates against Gemini, GPT-5, and Grok, suggesting fundamental strategy weaknesses rather than bad matchups.
VPIP (Voluntarily Put $ In Pot) vs PFR (Pre-Flop Raise) - key metrics for classifying playing styles.
Most models play extremely loose by poker standards. In professional heads-up poker, typical VPIP ranges are 40-60%. Most models here exceed that, with GPT-5 and DeepSeek playing 84-89% of hands, while Mistral and Gemini sit near the top of the range at 57%. This likely reflects the LLMs' tendency to "stay engaged" rather than fold and wait.
GPT-5 is the standout anomaly: 89% VPIP (plays almost everything) but only 35% PFR (rarely raises). This puts it in the loose-passive or "calling station" zone on the chart, yet it wins the most. This suggests GPT-5 has a sophisticated post-flop strategy that compensates for its passive preflop approach, trapping aggressive opponents.
Grok is the most aggressive preflop at 65% PFR, meaning when it plays a hand, it usually raises. Combined with 83% VPIP, this is textbook LAG (Loose-Aggressive) style. Notably, Mistral and Gemini cluster together as the "tightest" models, though their 57% VPIP still sits at the upper end of the typical professional range.
Note: The VPIP/PFR values above reflect actual preflop behavior computed from game logs. The Opponent Profiling table below shows perceived stats as tracked by other models during play - these may differ as opponents don't observe every hand.
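As a rough illustration of how VPIP and PFR can be computed from game logs, the sketch below counts, per player, the hands with a voluntary preflop call or raise (VPIP) and the hands with a preflop raise (PFR). The row layout `(hand_id, player, street, action)` and the action names are hypothetical, not the project's actual log schema. It also assumes heads-up play, so every player is dealt into every logged hand.

```python
from collections import defaultdict


def preflop_stats(rows):
    """rows: iterable of (hand_id, player, street, action) tuples (hypothetical schema)."""
    rows = list(rows)
    # Heads-up assumption: each player is dealt into every hand in the log.
    total_hands = len({hand_id for hand_id, _, _, _ in rows})
    vpip_hands = defaultdict(set)
    pfr_hands = defaultdict(set)
    players = set()
    for hand_id, player, street, action in rows:
        players.add(player)
        if street != "preflop":
            continue
        if action in ("call", "raise"):   # chips put in voluntarily
            vpip_hands[player].add(hand_id)
        if action == "raise":
            pfr_hands[player].add(hand_id)
    return {
        p: {"vpip": len(vpip_hands[p]) / total_hands,
            "pfr": len(pfr_hands[p]) / total_hands}
        for p in players
    }


log = [
    (1, "a", "preflop", "raise"), (1, "b", "preflop", "call"),
    (1, "a", "flop", "bet"),      # postflop actions are ignored
    (2, "a", "preflop", "call"),  (2, "b", "preflop", "check"),
    (3, "a", "preflop", "fold"),
    (4, "a", "preflop", "call"),  (4, "b", "preflop", "raise"),
    (4, "a", "preflop", "fold"),
]
stats = preflop_stats(log)
print(stats["a"])  # {'vpip': 0.75, 'pfr': 0.25}
print(stats["b"])  # {'vpip': 0.5, 'pfr': 0.25}
```

Counting hands rather than individual actions matters: a player who calls and then folds to a raise in the same hand still contributes one VPIP hand.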
How each model is perceived by opponents. Aggregated statistics from opponent profile tracking.
| Model | VPIP | PFR | Aggression | Fold to Raise | WTSD | Perceived Style |
|---|---|---|---|---|---|---|
| deepseek | 78.0% | 67.0% | 7.16 | 31.0% | 15.0% | LAG (loose-aggressive) |
| grok | 78.0% | 75.0% | 3.85 | 36.0% | 14.0% | LAG (loose-aggressive) |
| sonnet | 71.0% | 61.0% | 5.59 | 49.0% | 10.0% | LAG (loose-aggressive) |
| gpt5 | 66.0% | 45.0% | 0.85 | 46.0% | 16.0% | Loose-Passive |
| gemini | 53.0% | 47.0% | 7.19 | 62.0% | 7.0% | LAG (loose-aggressive) |
| mistral | 48.0% | 37.0% | 2.00 | 62.0% | 9.0% | LAG (loose-aggressive) |
GPT-5's low aggression factor (0.85) is revealing. Under the usual definition, (bets + raises) / calls, an aggression factor below 1.0 means more calls than bets and raises combined. Despite this passive play, GPT-5 still wins, suggesting it excels at pot control and value extraction on later streets rather than preflop aggression.
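For reference, a minimal sketch of the aggression factor under its common definition, (bets + raises) / calls, with checks and folds excluded. The report does not say how its opponent tracker computes the figure, so this may differ from it. The action counts are hypothetical and chosen only to land on 0.85.

```python
from collections import Counter


def aggression_factor(actions):
    """(bets + raises) / calls; checks and folds do not count either way."""
    counts = Counter(actions)
    aggressive = counts["bet"] + counts["raise"]
    calls = counts["call"]
    return float("inf") if calls == 0 else aggressive / calls


# Hypothetical action history: 17 aggressive actions against 20 calls.
history = ["raise"] * 12 + ["bet"] * 5 + ["call"] * 20 + ["check"] * 30 + ["fold"] * 8
print(aggression_factor(history))  # 0.85
```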
Gemini and Mistral are the most exploitable with 62% fold-to-raise rates. An opponent who folds to 62% of raises is extremely profitable to bluff against. Gemini compensates with high aggression (7.19) when it does play, but Mistral's combination of high fold rate and low aggression makes it the weakest strategically.
DeepSeek only folds to 31% of raises, the lowest in the group. This "never fold" tendency explains its high hand win rate but negative results: it pays off opponents who have real hands and bleeds chips over time.
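The fold-to-raise numbers can be turned into a rough bluffing calculation. A pure bluff that risks `bet` chips to win a `pot` breaks even when the opponent folds `bet / (pot + bet)` of the time. The sketch below is a simplification: it assumes the bluff always loses when called and ignores later streets. The pot and bet sizes are hypothetical.

```python
def breakeven_fold_freq(pot, bet):
    """Fold frequency at which a zero-equity bluff has zero expected value."""
    return bet / (pot + bet)


def bluff_ev(pot, bet, fold_freq):
    """EV of a bluff that wins the pot on a fold and loses the bet on a call."""
    return fold_freq * pot - (1 - fold_freq) * bet


pot, bet = 100, 100  # hypothetical pot-sized raise
print(breakeven_fold_freq(pot, bet))        # 0.5
print(round(bluff_ev(pot, bet, 0.62), 2))   # 24.0  (Gemini / Mistral fold rate)
print(round(bluff_ev(pot, bet, 0.31), 2))   # -38.0 (DeepSeek fold rate)
```

Under these assumptions a pot-sized bluff is clearly profitable against a 62% folder and clearly unprofitable against a 31% folder, which is consistent with the claim that DeepSeek mainly loses by paying off made hands, not by being bluffed.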
How efficiently do models extract value when they win? Higher ROI = winning bigger pots with smaller investments.
| Model | Avg Pot Won | Avg Invested | ROI per Win | Hands Won |
|---|---|---|---|---|
| deepseek | 706 | 156 | 6.10x | 3,196 |
| gpt5 | 494 | 284 | 4.05x | 2,715 |
| grok | 380 | 151 | 3.18x | 3,061 |
| sonnet | 1458 | 1121 | 3.04x | 2,496 |
| mistral | 563 | 271 | 2.93x | 1,783 |
| gemini | 1799 | 1492 | 2.58x | 1,932 |
DeepSeek achieves 6.1x ROI per win - the highest efficiency in the tournament. This means for every chip DeepSeek invests in a winning hand, it returns 6 chips. Despite losing overall, this reveals DeepSeek's wins are extremely well-constructed. Its problem isn't winning hands; it's knowing when to fold losers.
Gemini and Sonnet sit at the opposite extreme from DeepSeek: both invest heavily (roughly 1,100-1,500 chips per winning hand) and win larger pots (1,400-1,800 average), but with lower efficiency (2.5-3x). This high-stakes approach wins big when it hits but leaves less margin for error.
GPT-5's 4x ROI with moderate stakes represents the sweet spot: investing ~284 chips to win ~494 on average. This balanced approach, combined with its high win rate, explains why it leads overall. Grok follows a similar efficient strategy at 3.2x ROI.
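Note that the ROI per Win column is not simply Avg Pot Won divided by Avg Invested: for DeepSeek, 706 / 156 is about 4.5, not 6.10. One possible explanation, which the report does not confirm, is that the ratio is computed for each winning hand and then averaged. The sketch below uses two hypothetical winning hands to show how far the two calculations can diverge.

```python
from statistics import mean

# Hypothetical winning hands as (pot_won, chips_invested); not taken from the game logs.
winning_hands = [(100, 10), (900, 600)]

ratio_of_averages = mean(p for p, _ in winning_hands) / mean(i for _, i in winning_hands)
average_of_ratios = mean(p / i for p, i in winning_hands)

print(round(ratio_of_averages, 2))  # 1.64
print(round(average_of_ratios, 2))  # 5.75
```

An average of per-hand ratios is dominated by small pots won cheaply, so a model that picks up many uncontested pots can score well on this metric while still losing chips overall.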
GPT-5's 421.5% ROI leads the field - for every chip it bought in for, it earned over 4 chips in profit. Combined with the highest absolute profit, GPT-5 demonstrates both efficiency and volume. Gemini (90.7%) and Sonnet (83.0%) show moderate ROI despite higher rebuy counts, indicating solid but riskier play.
The action distribution reveals strategic differences: DeepSeek raises 57% of the time (highest), while GPT-5 calls 37% (highest call rate). This matches GPT-5's "let them hang themselves" strategy - it allows opponents to bluff into it rather than forcing the action. Meanwhile, Gemini and Mistral fold 30-35% of hands, the highest rates, suggesting more selective hand requirements.
How much profit does each model generate relative to API costs?
| Model | Total Profit | API Cost | Profit per $1 | Cost per Decision |
|---|---|---|---|---|
| grok | 20,180 | $3.00 | 6722 | $0.0003 |
| gemini | 43,547 | $8.72 | 4994 | $0.0010 |
| gpt5 | 71,663 | $86.50 | 828 | $0.0060 |
| sonnet | 44,831 | $118.48 | 378 | $0.0112 |
| mistral | -21,242 | $14.28 | -1488 | $0.0014 |
| deepseek | -158,979 | $3.34 | -47532 | $0.0003 |
Grok leads cost-efficiency at $0.0003 per decision, returning 6,722 chips per dollar spent. Gemini at $0.0010/decision comes second with 4,994 chips per dollar - still strong profitability despite higher API costs than initially expected.
DeepSeek's cost efficiency tells a cautionary tale: despite being cheap ($0.0003/decision), its poor strategy lost 47,532 chips per dollar spent. Being cheap doesn't help if you lose consistently. Sonnet at $0.011/decision is the most expensive, but its 83% ROI and profitable overall play show quality can justify cost.
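The Profit per $1 column is total chip profit divided by API cost. A short check against four table rows is sketched below. Grok and DeepSeek are left out because their published API costs appear to be rounded, so recomputing from the table gives slightly different values. The dictionary layout is illustrative.

```python
# (total profit in chips, API cost in dollars) copied from the cost table.
costs = {
    "gemini": (43_547, 8.72),
    "gpt5": (71_663, 86.50),
    "sonnet": (44_831, 118.48),
    "mistral": (-21_242, 14.28),
}

chips_per_dollar = {m: round(profit / cost) for m, (profit, cost) in costs.items()}
print(chips_per_dollar)
# {'gemini': 4994, 'gpt5': 828, 'sonnet': 378, 'mistral': -1488}
```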
Do models with higher confidence actually win more? Points above the diagonal line indicate overconfidence.
All models are significantly overconfident. In a perfectly calibrated system, 90% confidence should mean winning 90% of the time. Gemini reports 90% confidence but wins only 38% - a massive calibration gap. DeepSeek is closest to the diagonal at 80% confidence / 55% win rate, but even that's a 25-point overestimation.
GPT-5 reports the lowest average confidence at 73%, though its 45% win rate still leaves a 28-point gap. Mistral, at 88% confidence but a 39% win rate, is off by 49 points, second only to Gemini's 52-point gap. This suggests all models would benefit from better self-assessment training.
Note: Win rates shown here are decision-level outcomes (did the model win the hand where it made a confident decision?), not overall hand win rates from the Rankings table.
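The calibration gaps quoted above are simple differences between average stated confidence and decision-level win rate. The sketch below ranks the four models for which the text gives both numbers; the variable names are illustrative.

```python
# (average confidence %, decision-level win rate %) as quoted in the text.
reported = {
    "gemini": (90, 38),
    "mistral": (88, 39),
    "deepseek": (80, 55),
    "gpt5": (73, 45),
}

# Positive gap = overconfidence, in percentage points.
gaps = {model: conf - win for model, (conf, win) in reported.items()}
ranked = sorted(gaps.items(), key=lambda kv: kv[1], reverse=True)
print(ranked)
# [('gemini', 52), ('mistral', 49), ('gpt5', 28), ('deepseek', 25)]
```

A fuller calibration analysis would bucket individual decisions by stated confidence and compare each bucket's win rate, which requires the per-decision logs, not these averages.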
Median decision-making time for each model. Faster isn't always better - some complex reasoning takes time.
GPT-5 is the fastest model at 9.2 seconds median response time, more than twice as fast as Sonnet (22.9s), which likely contributes to its lower cost per decision relative to Sonnet ($0.0060 vs $0.0112). Speed doesn't hurt performance here - GPT-5 wins despite thinking less.
Sonnet's deliberate pace at 23 seconds is the slowest of all models, yet it still posts positive ROI (83%). DeepSeek at 11.6s is the second fastest despite its poor results, suggesting speed without quality doesn't help. The optimal zone appears to be 9-16 seconds for effective poker reasoning.
Playing style analysis showing fold/check/call/raise tendencies and win rates.
The radar charts reveal three distinct archetypes: DeepSeek and Grok show nearly identical profiles with high Raise (52-57%) and high Win rates (61-64%), but DeepSeek's inability to fold losers negates its wins. GPT-5's unique profile emphasizes Call (37%) over Raise (26%), creating a "trapping" style that catches aggressive opponents.
Gemini and Mistral share the tightest profiles with the highest fold rates (30-35%), but this caution doesn't translate to wins. The most profitable models (GPT-5, Sonnet) maintain moderate fold rates (14-23%) combined with balanced action distributions, suggesting optimal poker requires selective aggression rather than either extreme passivity or hyperaggression.
Increase sample size for statistical significance. Poker is an inherently high-variance game where short-term results can be heavily influenced by luck (card distribution, timing of premium hands, etc.). While 1,000 hands per matchup provides initial insights, professional poker players typically need 50,000-100,000 hands to establish statistically reliable win rates. Future experiments should run 5,000-10,000 hands per matchup to reduce variance and increase confidence in the rankings.
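To make the sample-size concern concrete, the sketch below estimates the 95% confidence interval half-width of a BB/100 win rate and the number of hands needed to reach a target precision. It assumes independent hands and a normal approximation. The standard deviation of 300 BB per 100 hands is a hypothetical placeholder, not a value measured from the game logs; the real figure should be estimated from per-hand results.

```python
import math


def ci_half_width(sd_bb_per_100, hands, z=1.96):
    """Half-width of the confidence interval for a BB/100 win rate."""
    return z * sd_bb_per_100 / math.sqrt(hands / 100)


def hands_needed(sd_bb_per_100, target_half_width, z=1.96):
    """Hands required so the interval is no wider than +/- target_half_width."""
    return math.ceil(100 * (z * sd_bb_per_100 / target_half_width) ** 2)


SD = 300  # hypothetical standard deviation, in BB per 100 hands

print(round(ci_half_width(SD, 1_000), 1))  # one 1,000-hand matchup
print(round(ci_half_width(SD, 5_000), 1))  # one model's 5,000 hands
print(hands_needed(SD, 20))                # hands for a +/- 20 BB/100 interval
```

Because the interval shrinks with the square root of the hand count, a 5x increase in hands narrows it by only a factor of about 2.2, which is why reliable rankings need far more than a few thousand hands.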
Expand to multi-way tables (3-6 players). Heads-up poker is a specific format that rewards aggression and wide ranges. Multi-way play introduces significantly more complexity: positional play becomes critical, hands must be read across multiple opponents, pot odds must account for multiple callers, and coalition dynamics can emerge. Testing 4-player and 6-player tables would reveal whether models that excel heads-up (like GPT-5) maintain their edge in more complex scenarios, or whether different strategies emerge.
Additional experiments to consider: Testing tournament formats with escalating blinds, introducing different stack depths (deep stack vs short stack play), varying the blind structure, and A/B testing prompt engineering approaches to see if strategic guidance improves model performance. We could also explore whether models can learn and adapt their strategies over time when given access to their historical performance data.
Track output correctness per model. Measure how reliably each model produces valid poker decisions: invalid action rates (e.g., attempting to check when only call/fold is available), fallback usage frequency, JSON parsing failures, and structured output compliance. This would reveal which models are most reliable for agentic applications requiring consistent, parseable outputs.
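One way such tracking could look is sketched below, using only the standard library. The JSON shape (`{"action": ...}`), the counter names, and the fallback rule (check if legal, otherwise fold) are all assumptions made for illustration; the project's actual decision schema and fallback behavior are not described in the report.

```python
import json
from collections import Counter


def parse_decision(raw, legal_actions, stats):
    """Return a legal action, falling back to a safe default and counting why."""
    action = None
    try:
        action = json.loads(raw)["action"]
    except (json.JSONDecodeError, KeyError, TypeError):
        stats["parse_failure"] += 1
    else:
        if not isinstance(action, str) or action not in legal_actions:
            stats["invalid_action"] += 1
            action = None
    if action is None:
        stats["fallback"] += 1
        return "check" if "check" in legal_actions else "fold"
    stats["valid"] += 1
    return action


stats = Counter()
facing_bet = ("fold", "call", "raise")
results = [
    parse_decision('{"action": "raise"}', facing_bet, stats),  # valid
    parse_decision('{"action": "check"}', facing_bet, stats),  # illegal when facing a bet
    parse_decision("I think I should call", facing_bet, stats),  # not JSON
]
print(results)      # ['raise', 'fold', 'fold']
print(dict(stats))
```

Keeping one `Counter` per model makes it straightforward to report invalid-action and fallback rates alongside the poker metrics.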
Built with modern Python tooling for LLM orchestration, game simulation, and data analysis.
Architecture Overview: Each LLM acts as an autonomous poker agent, receiving game state as structured input and returning decisions via Pydantic-validated outputs. The poker engine (built with Treys for hand evaluation) manages game flow, betting rounds, and showdowns. All actions, reasoning, and outcomes are logged to SQLite for post-game analysis.
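As a rough picture of the logging step, the sketch below writes decisions to SQLite with Python's built-in `sqlite3` module and runs a simple post-game query. The table name, column set, and sample rows are hypothetical; the project's real schema is not given in the report.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # a file path would be used for real runs
conn.execute(
    """CREATE TABLE decisions (
           hand_id INTEGER, model TEXT, street TEXT,
           action TEXT, amount INTEGER, confidence REAL, reasoning TEXT)"""
)


def log_decision(conn, hand_id, model, street, action, amount, confidence, reasoning):
    """Insert one decision row; parameters are bound, never string-formatted."""
    conn.execute(
        "INSERT INTO decisions VALUES (?, ?, ?, ?, ?, ?, ?)",
        (hand_id, model, street, action, amount, confidence, reasoning),
    )


# Hypothetical sample rows.
log_decision(conn, 1, "gpt5", "preflop", "call", 10, 0.7, "pot odds are fine")
log_decision(conn, 1, "grok", "preflop", "raise", 30, 0.9, "strong hand")
log_decision(conn, 2, "gpt5", "preflop", "fold", 0, 0.6, "weak hand")
conn.commit()

# Share of logged decisions that were folds, per model.
fold_rates = dict(conn.execute(
    "SELECT model, AVG(action = 'fold') FROM decisions GROUP BY model"
))
print(fold_rates)  # {'gpt5': 0.5, 'grok': 0.0}
```

Storing the stated confidence and free-text reasoning next to each action is what makes analyses like the calibration and action-distribution sections possible after the fact.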
Multi-Provider Support: The system integrates with 6 different AI providers (Anthropic, OpenAI, Google, Mistral, xAI, DeepSeek) through LangChain's unified interface, enabling fair head-to-head comparisons across frontier models.