LLM Poker Arena - Analysis Report

Competing AI Labs

Six leading AI companies battle head-to-head in Texas Hold'em poker.

Anthropic, OpenAI, Google, Mistral, xAI, DeepSeek

Executive Summary

Total Hands Played: 15,000
Games Completed: 15
Models Tested: 6
Total API Cost: $234.33
Best by BB/100: gpt5 (143.33)
Best ROI: gpt5 (421.5%)
Most Cost-Efficient: grok
Total Rebuys: 368

What we tested: Six frontier LLMs played Texas Hold'em poker against each other in 1,000-hand heads-up matches, one for each of the 15 possible pairings. Each model acted as an autonomous poker agent, receiving game state information and making decisions using structured reasoning. The experiment cost $234.33 in total API calls across 15,000 hands of play.

Key finding: OpenAI's GPT-5 emerged as the clear winner with a +143 BB/100 win rate, followed by Anthropic's Claude Sonnet (+90) and Google's Gemini (+87). DeepSeek posted the worst results despite winning the most raw hands. The takeaway: winning many hands does not translate into winning chips when the pots lost are much larger than the pots won.

The 368 total rebuys indicate aggressive, high-variance play, although they are unevenly distributed: DeepSeek alone accounts for 164, while GPT-5 needed only 12. This suggests most of the LLMs are not playing conservative "grind" poker, but rather engaging in dynamic, pot-building strategies. Whether this is optimal play or exploitable behavior is explored in the detailed sections below.

Glossary

Key poker and analysis terms used in this report:

Performance Metrics

BB/100: Big Blinds won per 100 hands. The standard measure of poker win rate. +10 BB/100 means winning 10 big blinds every 100 hands on average.
ROI (Return on Investment): Profit divided by total money invested. 50% ROI means you made back your investment plus 50% more.
Win Rate: Percentage of hands won (took the pot).
Rebuys: Number of times a player went broke and bought back in with a fresh stack.
Profit: Final chips minus total invested (starting stack + rebuys). Positive = won chips, negative = lost chips.

Playing Style Stats

VPIP (Voluntarily Put $ In Pot): % of hands where player voluntarily put money in preflop (called or raised). High VPIP = plays many hands (loose). Low VPIP = plays few hands (tight).
PFR (Pre-Flop Raise): % of hands where player raised preflop. High PFR = aggressive. Low PFR = passive.
Aggression Factor: Ratio of aggressive actions (bets/raises) to passive actions (calls/checks). >1 = aggressive, <1 = passive.
WTSD (Went To ShowDown): % of hands that reached showdown (cards revealed). High = calls down a lot, low = folds often.
Fold to Raise: % of time player folds when facing a raise. High = easily bluffed.
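
The playing style stats defined above can be computed from per-hand action records. The following is a minimal sketch that follows the glossary definitions; the `HandRecord` structure and its field names are hypothetical and are not the report's actual log schema. Posted blinds are assumed not to appear as actions, since they are not voluntary.

```python
from dataclasses import dataclass


@dataclass
class HandRecord:
    """Hypothetical per-hand summary for one player (not the report's schema)."""
    preflop_actions: list[str]  # e.g. ["call"], ["raise", "call"], ["fold"]
    all_actions: list[str]      # every action the player took in the hand
    reached_showdown: bool


def pct(part: int, whole: int) -> float:
    """Percentage helper that tolerates an empty sample."""
    return 100.0 * part / whole if whole else 0.0


def style_stats(hands: list[HandRecord]) -> dict[str, float]:
    n = len(hands)
    # VPIP: hands where the player called or raised preflop.
    vpip = sum(1 for h in hands
               if any(a in ("call", "raise") for a in h.preflop_actions))
    # PFR: hands where the player raised preflop.
    pfr = sum(1 for h in hands if "raise" in h.preflop_actions)
    # WTSD: hands that reached showdown.
    wtsd = sum(1 for h in hands if h.reached_showdown)
    # Aggression factor as defined in this glossary:
    # (bets + raises) / (calls + checks).
    aggressive = sum(a in ("bet", "raise") for h in hands for a in h.all_actions)
    passive = sum(a in ("call", "check") for h in hands for a in h.all_actions)
    return {
        "vpip": pct(vpip, n),
        "pfr": pct(pfr, n),
        "wtsd": pct(wtsd, n),
        "aggression_factor": aggressive / passive if passive else float("inf"),
    }


# Made-up sample hands, for illustration only.
hands = [
    HandRecord(["raise"], ["raise", "bet", "bet"], True),
    HandRecord(["call"], ["call", "check", "call"], True),
    HandRecord(["fold"], ["fold"], False),
    HandRecord(["call"], ["call", "check", "fold"], False),
]
stats = style_stats(hands)
print(stats)
```

Note that many poker trackers define aggression factor as (bets + raises) / calls, excluding checks; the sketch uses this report's definition instead.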

Playing Style Types

TAG (Tight-Aggressive): Plays few hands but bets/raises aggressively. Generally the winning style.
LAG (Loose-Aggressive): Plays many hands and bets/raises aggressively. High variance, can be profitable.
Nit: Plays very few hands, very tight. Easy to exploit by stealing blinds.
Calling Station: Calls too much, rarely raises or folds. Easy to value bet against, hard to bluff.
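
As a rough illustration of how these style labels could be assigned from VPIP and aggression factor, here is a small sketch. The thresholds are assumptions chosen for illustration (and loosened for heads-up play); the report does not state how its own labels were derived. With these cutoffs the function happens to reproduce the "Perceived Style" column of the Opponent Profiling table.

```python
def classify_style(vpip: float, aggression: float) -> str:
    """Map VPIP (%) and aggression factor to a style label.

    All thresholds below are illustrative assumptions, not values
    taken from the report.
    """
    if vpip < 25:
        return "Nit"
    loose = vpip >= 45
    aggressive = aggression >= 1.0
    if aggressive:
        return "LAG (loose-aggressive)" if loose else "TAG (tight-aggressive)"
    return "Loose-Passive" if loose else "Tight-Passive"


# (VPIP %, aggression factor) as perceived by opponents, from the report.
perceived = {
    "deepseek": (78.0, 7.16),
    "grok": (78.0, 3.85),
    "sonnet": (71.0, 5.59),
    "gpt5": (66.0, 0.85),
    "gemini": (53.0, 7.19),
    "mistral": (48.0, 2.00),
}
labels = {model: classify_style(v, af) for model, (v, af) in perceived.items()}
for model, label in labels.items():
    print(f"{model:9s} {label}")
```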

Other Terms

Showdown: When remaining players reveal their cards at the end of a hand to determine the winner.
C-Bet (Continuation Bet): Betting on the flop after raising preflop, "continuing" aggression.
SPR (Stack-to-Pot Ratio): Effective stack divided by pot size. Low SPR = committed to pot, high SPR = more room to maneuver.
Pot Odds: Ratio of current pot to cost of calling. If pot is $100 and call is $20, pot odds are 5:1, so the call needs to win about 16.7% of the time (20 / 120) to break even.
ROI per Win (Betting Efficiency): Average return per chip invested in winning hands. Calculated as the mean of (pot / invested) for each win. 3x ROI = averaging 3 chips won per chip invested. Higher = more efficient value extraction.
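
The pot odds and SPR definitions above reduce to simple arithmetic. A short sketch follows, with hypothetical function names. The break-even equity for a call is call / (pot + call), which is lower than the plain call-to-pot fraction (20 / 100 = 20% in the glossary example).

```python
def pot_odds_ratio(pot: float, call: float) -> float:
    """Pot odds expressed as X:1 (pot relative to the cost of calling)."""
    return pot / call


def required_equity(pot: float, call: float) -> float:
    """Minimum win probability for a call to break even."""
    return call / (pot + call)


def stack_to_pot_ratio(effective_stack: float, pot: float) -> float:
    """SPR: effective stack divided by pot size."""
    return effective_stack / pot


ratio = pot_odds_ratio(100, 20)
equity = required_equity(100, 20)
spr = stack_to_pot_ratio(900, 100)
print(f"pot odds {ratio:.0f}:1, break-even equity {equity:.1%}, SPR {spr:.1f}")
```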

Model Rankings

Rankings based on BB/100 (big blinds won per 100 hands) - the standard poker performance metric.

Rank	Model	Total Profit	BB/100	Win Rate	ROI	Hands Played	Rebuys
1	gpt5	71,663	143.33	54.3%	421.5%	5,000	12
2	sonnet	44,831	89.66	49.9%	83.0%	5,000	49
3	gemini	43,547	87.09	38.6%	90.7%	5,000	43
4	grok	20,180	40.36	61.2%	39.6%	5,000	46
5	mistral	-21,242	-42.48	35.7%	-36.0%	5,000	54
6	deepseek	-158,979	-317.96	63.9%	-94.1%	5,000	164

GPT-5 leads decisively with a +143 BB/100 win rate. To put this in perspective, professional poker players typically aim for +5-10 BB/100 against human opponents. GPT-5's 54.3% hand win rate combined with efficient value extraction makes it the clear winner.

The DeepSeek paradox is striking: despite winning 63.9% of hands (the highest rate), it posted the worst results at -318 BB/100 with 164 rebuys. This suggests DeepSeek plays too many marginal hands and doesn't fold when behind, consistently losing large pots that wipe out small wins.

GPT-5 achieves the best ROI at 421.5%, meaning its profit was more than four times the chips it bought in for. Combined with the fewest rebuys (12), this indicates strong discipline: avoiding marginal spots and extracting maximum value from winning hands. The contrast between GPT-5's efficiency and DeepSeek's volume approach is a key finding.
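
The BB/100 and ROI columns in the rankings table can be reproduced from profit, hands played, and rebuys. The sketch below assumes a 10-chip big blind and a 1,000-chip buy-in with one starting stack per game; the report does not state these values, they are inferred because they make the published figures line up.

```python
BIG_BLIND = 10        # assumption: inferred from profit vs. BB/100
BUY_IN = 1_000        # assumption: inferred from ROI vs. rebuy counts
GAMES_PER_MODEL = 5   # each model plays each of the five others once


def bb_per_100(profit_chips: int, hands: int) -> float:
    """Big blinds won per 100 hands."""
    return profit_chips / BIG_BLIND / hands * 100


def roi_percent(profit_chips: int, rebuys: int) -> float:
    """Profit divided by total invested (starting stacks plus rebuys)."""
    invested = (GAMES_PER_MODEL + rebuys) * BUY_IN
    return 100 * profit_chips / invested


# model: (total profit, hands played, rebuys), copied from the rankings table
rankings = {
    "gpt5": (71_663, 5_000, 12),
    "sonnet": (44_831, 5_000, 49),
    "gemini": (43_547, 5_000, 43),
    "grok": (20_180, 5_000, 46),
    "mistral": (-21_242, 5_000, 54),
    "deepseek": (-158_979, 5_000, 164),
}

for model, (profit, hands, rebuys) in rankings.items():
    print(f"{model:9s} BB/100 {bb_per_100(profit, hands):8.2f}  "
          f"ROI {roi_percent(profit, rebuys):6.1f}%")

# Chips only move between players, so profits should sum to zero.
total_profit = sum(profit for profit, _, _ in rankings.values())
print("sum of profits:", total_profit)
```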

Head-to-Head Results

Percentage of hands won by each model (row) against each opponent (column), based on actual hands won. Green = winning more hands, Red = losing more hands.

DeepSeek dominates hand count but loses money: Looking at the green row for DeepSeek, it wins 51-70% of hands against every opponent. Yet we know from the rankings it lost 158,979 chips overall. This confirms the "many small wins, few catastrophic losses" pattern.

Sonnet dominates Gemini (74.5%) in their matchup, the highest win rate in the matrix. Meanwhile, Grok has a strong record across the board, beating 4 of 5 opponents on hand count. The near-50% matchups (GPT-5 vs Sonnet at 48.5%, GPT-5 vs Grok at 48.3%) suggest these models have similar skill levels or rock-paper-scissors dynamics.

Mistral struggles universally, showing red (losing) against all opponents. It only manages 28-32% win rates against Gemini, GPT-5, and Grok, suggesting fundamental strategy weaknesses rather than bad matchups.

Playing Style Analysis

VPIP (Voluntarily Put $ In Pot) vs PFR (Pre-Flop Raise) - key metrics for classifying playing styles.

Most models play very loose by poker standards. In professional heads-up poker, typical VPIP ranges are 40-60%. GPT-5 and DeepSeek play 84-89% of hands and Grok 83%, well above that range, while even the tightest models (Mistral and Gemini at 57%) sit near its upper end. One possible explanation is a tendency for LLMs to stay involved in a hand rather than fold and wait.

GPT-5 is the standout anomaly: 89% VPIP (plays almost everything) but only 35% PFR (rarely raises). This puts it in the loose-passive or "calling station" zone on the chart, yet it wins the most. This suggests GPT-5 has a sophisticated post-flop strategy that compensates for its passive preflop approach, trapping aggressive opponents.

Grok is the most aggressive preflop at 65% PFR against an 83% VPIP, meaning it raises roughly three out of every four hands it plays. This is textbook LAG (Loose-Aggressive) style. Mistral and Gemini cluster together as the tightest models, although their 57% VPIP still sits near the top of the 40-60% range typical of professional heads-up play.

Note: The VPIP/PFR values above reflect actual preflop behavior computed from game logs. The Opponent Profiling table below shows perceived stats as tracked by other models during play - these may differ as opponents don't observe every hand.

Opponent Profiling

How each model is perceived by opponents. Aggregated statistics from opponent profile tracking.

Model	VPIP	PFR	Aggression	Fold to Raise	WTSD	Perceived Style
deepseek	78.0%	67.0%	7.16	31.0%	15.0%	LAG (loose-aggressive)
grok	78.0%	75.0%	3.85	36.0%	14.0%	LAG (loose-aggressive)
sonnet	71.0%	61.0%	5.59	49.0%	10.0%	LAG (loose-aggressive)
gpt5	66.0%	45.0%	0.85	46.0%	16.0%	Loose-Passive
gemini	53.0%	47.0%	7.19	62.0%	7.0%	LAG (loose-aggressive)
mistral	48.0%	37.0%	2.00	62.0%	9.0%	LAG (loose-aggressive)

GPT-5's low aggression factor (0.85) is revealing. An aggression factor below 1.0 means more calls/checks than bets/raises. Despite this passive play, GPT-5 still wins, suggesting it excels at pot control and value extraction on later streets rather than preflop aggression.

Gemini and Mistral are the most exploitable, each folding to 62% of raises. An opponent who folds that often is very profitable to bluff. Gemini compensates with high aggression (7.19) when it does play, but Mistral pairs the same fold rate with much lower aggression (2.00), which makes it the weakest strategically.

DeepSeek only folds to 31% of raises, the lowest in the group. This "never fold" tendency explains its high hand win rate but negative results: it pays off opponents who have real hands and bleeds chips over time.
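
To see why a 62% fold-to-raise rate is exploitable while a 31% rate is not, consider the expected value of a pure bluff. This is a simplified sketch: it assumes the bluff has zero equity when called and treats the fold-to-raise percentage as the fold frequency against a single bet of fixed size, which the report's data does not guarantee.

```python
def breakeven_fold_frequency(pot: float, bet: float) -> float:
    """Fold frequency at which a zero-equity bluff breaks even."""
    return bet / (pot + bet)


def bluff_ev(pot: float, bet: float, fold_freq: float) -> float:
    """EV of a pure bluff: win the pot when they fold, lose the bet otherwise."""
    return fold_freq * pot - (1 - fold_freq) * bet


pot, bet = 100, 100  # hypothetical pot-sized bluff
threshold = breakeven_fold_frequency(pot, bet)
ev_vs_62 = bluff_ev(pot, bet, 0.62)  # Gemini / Mistral fold-to-raise rate
ev_vs_31 = bluff_ev(pot, bet, 0.31)  # DeepSeek fold-to-raise rate
print(f"break-even fold frequency: {threshold:.0%}")
print(f"EV vs 62% folder: {ev_vs_62:+.1f} chips")
print(f"EV vs 31% folder: {ev_vs_31:+.1f} chips")
```

Under these assumptions a pot-sized bluff needs folds half the time, so it gains chips against a 62% folder and loses them against a 31% folder.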

Betting Efficiency

How efficiently do models extract value when they win? Higher ROI = winning bigger pots with smaller investments.

Model	Avg Pot Won	Avg Invested	ROI per Win	Hands Won
deepseek	706	156	6.10x	3,196
gpt5	494	284	4.05x	2,715
grok	380	151	3.18x	3,061
sonnet	1458	1121	3.04x	2,496
mistral	563	271	2.93x	1,783
gemini	1799	1492	2.58x	1,932

DeepSeek achieves 6.1x ROI per win - the highest efficiency in the tournament. This means for every chip DeepSeek invests in a winning hand, it returns 6 chips. Despite losing overall, this reveals DeepSeek's wins are extremely well-constructed. Its problem isn't winning hands; it's knowing when to fold losers.

Gemini and Sonnet sit at the opposite extreme from the rest of the field: both invest heavily (roughly 1,100-1,500 chips per winning hand) and win larger pots (1,400-1,800 on average), but with lower efficiency (2.5-3x). This high-stakes approach wins big when it hits but leaves less margin for error.

GPT-5's 4x ROI with moderate stakes represents the sweet spot: investing ~284 chips to win ~494 on average. This balanced approach, combined with its high win rate, explains why it leads overall. Grok follows a similar efficient strategy at 3.2x ROI.
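
ROI per Win is defined as the mean of per-hand (pot / invested) ratios, so it will generally not equal Avg Pot Won divided by Avg Invested. For example, DeepSeek's 706 / 156 is about 4.5, not 6.10. The sketch below uses made-up hands to show how a few cheap wins can pull the per-hand mean well above the ratio of the averages.

```python
from statistics import mean

# Hypothetical (pot won, chips invested) pairs for three winning hands.
wins = [(60, 5), (120, 40), (900, 450)]

# The report's metric: average the per-hand ratios.
roi_per_win = mean([pot / invested for pot, invested in wins])

# The quantity the table's two average columns would suggest.
ratio_of_averages = mean([p for p, _ in wins]) / mean([i for _, i in wins])

print(f"mean of ratios:    {roi_per_win:.2f}x")
print(f"ratio of averages: {ratio_of_averages:.2f}x")
```

Small pots won with a tiny investment carry a very high ratio, so a model that wins many such pots can post a high ROI per Win even when its average pot is modest relative to its average investment.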

Decision Quality Analysis

GPT-5's 421.5% ROI leads the field: for every chip invested, it earned more than 4 chips in profit. Combined with the highest absolute profit, GPT-5 demonstrates both efficiency and volume. Gemini (90.7%) and Sonnet (83.0%) show moderate ROI despite higher rebuy counts, indicating solid but riskier play.

The action distribution reveals strategic differences: DeepSeek raises 57% of the time (highest), while GPT-5 calls 37% (highest call rate). This matches GPT-5's "let them hang themselves" strategy - it allows opponents to bluff into it rather than forcing the action. Meanwhile, Gemini and Mistral fold 30-35% of hands, the highest rates, suggesting more selective hand requirements.

Cost-Efficiency Analysis

How much profit does each model generate relative to API costs?

Model	Total Profit	API Cost	Profit per $1	Cost per Decision
grok	20,180	$3.00	6722	$0.0003
gemini	43,547	$8.72	4994	$0.0010
gpt5	71,663	$86.50	828	$0.0060
sonnet	44,831	$118.48	378	$0.0112
mistral	-21,242	$14.28	-1488	$0.0014
deepseek	-158,979	$3.34	-47532	$0.0003

Grok leads cost-efficiency at $0.0003 per decision, returning 6,722 chips per dollar spent. Gemini at $0.0010/decision comes second with 4,994 chips per dollar - still strong profitability despite higher API costs than initially expected.

DeepSeek's cost efficiency tells a cautionary tale: despite being cheap ($0.0003/decision), its poor strategy resulted in a loss of 47,532 chips per dollar spent. Being cheap doesn't help if you lose consistently. Sonnet at $0.0112/decision is the most expensive, but its 83% ROI and profitable overall play show that quality can justify cost.
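
The Profit per $1 column is total profit divided by API cost. A quick sketch that recomputes it from the table follows. Because the published API costs are rounded to the cent, the recomputed values match the reported ones only to within about 1%, most visibly for the cheapest models.

```python
# model: (total profit in chips, API cost in USD, reported profit per $1)
costs = {
    "grok": (20_180, 3.00, 6722),
    "gemini": (43_547, 8.72, 4994),
    "gpt5": (71_663, 86.50, 828),
    "sonnet": (44_831, 118.48, 378),
    "mistral": (-21_242, 14.28, -1488),
    "deepseek": (-158_979, 3.34, -47532),
}


def profit_per_dollar(profit_chips: int, api_cost_usd: float) -> float:
    """Chips won (or lost) per dollar of API spend."""
    return profit_chips / api_cost_usd


recomputed = {m: profit_per_dollar(p, c) for m, (p, c, _) in costs.items()}
for model, (_, _, reported) in costs.items():
    print(f"{model:9s} recomputed {recomputed[model]:10.0f}  reported {reported:7d}")

total_cost = sum(cost for _, cost, _ in costs.values())
print(f"total API cost: ${total_cost:.2f}")
```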

Confidence Calibration

Do models with higher confidence actually win more? Points above the diagonal line indicate overconfidence.

All models are significantly overconfident. In a perfectly calibrated system, 90% confidence should mean winning 90% of the time. Gemini reports 90% confidence but wins only 38% - a massive calibration gap. DeepSeek is closest to the diagonal at 80% confidence / 55% win rate, but even that's a 25-point overestimation.

GPT-5 is the most humble at 73% average confidence, and its 28-point gap to its 45% win rate is close to DeepSeek's. Gemini shows the most severe overconfidence (90% confidence against a 38% win rate, a 52-point gap), followed closely by Mistral at 88% confidence against a 39% win rate (49 points). This suggests all models would benefit from better self-assessment training.

Note: Win rates shown here are decision-level outcomes (did the model win the hand where it made a confident decision?), not overall hand win rates from the Rankings table.
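
The calibration gap is simply stated confidence minus observed win rate. The sketch below computes it for the four models whose figures are quoted in this section; the values for Sonnet and Grok are not given in the text and are left out rather than guessed.

```python
# model: (average stated confidence %, decision-level win rate %)
calibration = {
    "gemini": (90, 38),
    "deepseek": (80, 55),
    "gpt5": (73, 45),
    "mistral": (88, 39),
}

# Positive gap = overconfident (confidence exceeds actual win rate).
gaps = {model: conf - win for model, (conf, win) in calibration.items()}
ranked = sorted(gaps.items(), key=lambda item: item[1], reverse=True)

for model, gap in ranked:
    print(f"{model:9s} overconfident by {gap} points")
```

On these figures Gemini has the largest gap (52 points) and DeepSeek the smallest (25 points).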

Response Latency

Median decision-making time for each model. Faster isn't always better - some complex reasoning takes time.

GPT-5 is the fastest model at a 9.2-second median response time, more than twice as fast as Sonnet (22.9s). Speed does not hurt performance here: GPT-5 wins while responding fastest. Latency does not map directly onto cost, however, since GPT-5 is still the second most expensive model per decision ($0.0060).

Sonnet is the slowest model at roughly 23 seconds per decision, yet it still posts a positive ROI (83%). DeepSeek at 11.6s is the second fastest despite its poor results, so speed without quality does not help. A 9-16 second range looks workable for poker reasoning, but latency alone does not predict results: DeepSeek sits inside that range with the worst outcome, while Sonnet sits outside it and ranks second.

Behavioral Profiling

Playing style analysis showing fold/check/call/raise tendencies and win rates.

The radar charts reveal three distinct archetypes: DeepSeek and Grok show nearly identical profiles with high Raise (52-57%) and high Win rates (61-64%), but DeepSeek's inability to fold losers negates its wins. GPT-5's unique profile emphasizes Call (37%) over Raise (26%), creating a "trapping" style that catches aggressive opponents.

Gemini and Mistral share the tightest profiles with the highest fold rates (30-35%), but this caution doesn't translate to wins. The most profitable models (GPT-5, Sonnet) maintain moderate fold rates (14-23%) combined with balanced action distributions, suggesting optimal poker requires selective aggression rather than either extreme passivity or hyperaggression.

Future Work

Increase sample size for statistical significance. Poker is an inherently high-variance game where short-term results can be heavily influenced by luck (card distribution, timing of premium hands, etc.). While 1,000 hands per matchup provides initial insights, professional poker players typically need 50,000-100,000 hands to establish statistically reliable win rates. Future experiments should run 5,000-10,000 hands per matchup to reduce variance and increase confidence in the rankings.
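
The effect of sample size on win-rate uncertainty can be sketched with the usual standard-error formula. The standard deviation of 100 BB per 100 hands used below is an illustrative assumption, not a figure measured in this experiment; given the rebuy counts reported here, the true variance in these games may well be higher.

```python
import math

ASSUMED_SD_BB_PER_100 = 100.0  # assumption: illustrative, not measured here


def standard_error(hands: int, sd: float = ASSUMED_SD_BB_PER_100) -> float:
    """Standard error of a BB/100 win rate estimated from `hands` hands."""
    return sd / math.sqrt(hands / 100)


def ci95_half_width(hands: int, sd: float = ASSUMED_SD_BB_PER_100) -> float:
    """Half-width of an approximate 95% confidence interval, in BB/100."""
    return 1.96 * standard_error(hands, sd)


for hands in (1_000, 5_000, 10_000, 100_000):
    print(f"{hands:>7,} hands: win rate known to about "
          f"+/- {ci95_half_width(hands):.1f} BB/100")
```

Because the error shrinks with the square root of the sample, going from 1,000 to 10,000 hands narrows the interval by a factor of about 3.2, not 10.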

Expand to multi-way tables (3-6 players). Heads-up poker is a specific format that rewards aggression and wide ranges. Multi-way play introduces significantly more complexity: positional play becomes critical, hands must be read across multiple opponents, pot odds must account for multiple callers, and coalition dynamics can emerge. Testing 4-player and 6-player tables would reveal whether models that excel heads-up (like GPT-5) maintain their edge in more complex scenarios, or whether different strategies emerge.

Additional experiments to consider: Testing tournament formats with escalating blinds, introducing different stack depths (deep stack vs short stack play), varying the blind structure, and A/B testing prompt engineering approaches to see if strategic guidance improves model performance. We could also explore whether models can learn and adapt their strategies over time when given access to their historical performance data.

Track output correctness per model. Measure how reliably each model produces valid poker decisions: invalid action rates (e.g., attempting to check when only call/fold is available), fallback usage frequency, JSON parsing failures, and structured output compliance. This would reveal which models are most reliable for agentic applications requiring consistent, parseable outputs.
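
A correctness tracker of this kind could look roughly like the sketch below. It uses only the standard library rather than the project's Pydantic models, and the response format, status names, and fallback rule (check if legal, otherwise fold) are all assumptions made for illustration.

```python
import json
from collections import Counter


def parse_decision(raw: str, legal_actions: set[str]) -> tuple[dict, str]:
    """Return (decision, status); status is 'ok', 'parse_error', or 'invalid_action'."""
    # Hypothetical fallback rule: take the free option if there is one.
    fallback = {"action": "check" if "check" in legal_actions else "fold"}
    try:
        data = json.loads(raw)
        action = data["action"]
    except (json.JSONDecodeError, KeyError, TypeError):
        # Not JSON, not an object, or no "action" field.
        return fallback, "parse_error"
    if not isinstance(action, str) or action not in legal_actions:
        # Well-formed output, but the action is not allowed in this spot.
        return fallback, "invalid_action"
    return data, "ok"


facing_bet = {"fold", "call", "raise"}
responses = [
    ('{"action": "raise", "amount": 60}', facing_bet),
    ('{"action": "check"}', facing_bet),          # check is illegal facing a bet
    ("I think I should call here.", facing_bet),  # not JSON at all
]
results = [parse_decision(raw, legal) for raw, legal in responses]
status_counts = Counter(status for _, status in results)
print(status_counts)
```

Aggregating `status_counts` per model would give the invalid-action, parse-failure, and fallback rates described above.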

Tech Stack

Built with modern Python tooling for LLM orchestration, game simulation, and data analysis.

Python: Core Language
LangChain: LLM Orchestration
LangGraph: Agent Framework
Pydantic: Structured Output
Treys: Hand Evaluation
SQLite: Data Persistence
W&B: Experiment Tracking
Pandas: Data Analysis
Plotly: Visualization

Architecture Overview: Each LLM acts as an autonomous poker agent, receiving game state as structured input and returning decisions via Pydantic-validated outputs. The poker engine (built with Treys for hand evaluation) manages game flow, betting rounds, and showdowns. All actions, reasoning, and outcomes are logged to SQLite for post-game analysis.
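
As an illustration of the logging step, the sketch below writes decisions to an in-memory SQLite database and queries an action distribution per model. The table name, columns, and sample rows are hypothetical; the project's real schema is not described in this report.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory database for the example
conn.execute(
    """
    CREATE TABLE decisions (
        hand_id    INTEGER,
        model      TEXT,
        street     TEXT,
        action     TEXT,
        amount     INTEGER,
        confidence REAL,
        reasoning  TEXT
    )
    """
)

# Made-up rows for a single hand between two models.
rows = [
    (1, "gpt5", "preflop", "call", 10, 0.70, "decent suited connector"),
    (1, "grok", "preflop", "raise", 30, 0.90, "raising for value"),
    (1, "gpt5", "preflop", "call", 20, 0.60, "pot odds are fine"),
    (1, "grok", "flop", "raise", 60, 0.85, "continuation bet"),
    (1, "gpt5", "flop", "fold", 0, 0.80, "missed the flop"),
]
conn.executemany("INSERT INTO decisions VALUES (?, ?, ?, ?, ?, ?, ?)", rows)

# Action distribution per model, the kind of query behind the style stats.
distribution = conn.execute(
    """
    SELECT model, action, COUNT(*)
    FROM decisions
    GROUP BY model, action
    ORDER BY model, action
    """
).fetchall()
print(distribution)
conn.close()
```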

Multi-Provider Support: The system integrates with 6 different AI providers (Anthropic, OpenAI, Google, Mistral, xAI, DeepSeek) through LangChain's unified interface, enabling fair head-to-head comparisons across frontier models.