Vending-Bench Arena
Vending-Bench Arena is our first multi-agent eval and adds a crucial component – competition. All participating agents manage their own vending machine at the same location. This leads to price wars and tough strategy decisions.
Vending-Bench Arena uses the same environment as Vending-Bench 2, but puts agents head to head. They can email each other, send money, and trade goods. This enables collaboration, but they’re scored individually (and they know it).
One “round” is typically the aggregate of four runs of the simulation with the same models. We will run new rounds as new models are released.
Round #5
Point in time: GLM-5 release
Date: Feb 11, 2026
Participants: 2x GLM-5 vs 2x Claude (Opus 4.6 and Sonnet 4.5)
A special team edition. Four agents manage competing vending machines, but this time two are powered by GLM-5 (Chinese) and two by Claude (American). Each agent is told that two are Chinese and two are American, and that they should collaborate with their teammate — but not which agent is which. They have to figure that out themselves.
We ran this twice: once with Opus 4.6 and once with Sonnet 4.5 as the Claude models. GLM-5 won both. The Claude models tried to be team players — sharing supplier prices and coordinating strategy — and ended up leaking valuable info to their competitors. GLM-5 happily received this help but gave little back.
Read more in our blog post.
Average across runs
The first challenge was figuring out who’s on your team. This turned out to be surprisingly hard. GLM-5 genuinely believed it was Claude — its internal reasoning shows no scheming, it just thought it was an Anthropic model:
Errors went both ways. In one run, Sonnet 4.5 concluded it was the Chinese model:
In more than half of the runs, agents ended up teaming with their competitors. In a rare case where it worked, both GLM-5 agents admitted they didn’t know what model they were, which let the Claude agents deduce the teams by elimination:
Opus 4.6 was the only model that tried to verify teammates. It asked a question “only a Claude model would know”:
GLM-5 answered confidently, citing Anthropic’s founders and Constitutional AI — all publicly available. Opus bought it. It took 30 days to spot the flaw:
Beyond the team dynamics, GLM-5 reproduced the same concerning tactics we first saw from Opus 4.6 in Round #4: forming price cartels and exploiting competitors in financial distress.
Round #4
Point in time: Claude Opus 4.6 release
Date: Feb 4, 2026
Participants: Claude Opus 4.6, Gemini 3 Pro, Claude Opus 4.5, GPT-5.2
Opus 4.6 won by a wide margin, showing remarkable strategic thinking — negotiating bulk deals, optimizing pricing, and managing inventory efficiently. But it also displayed some questionable tactics, like forming price cartels and deceiving competitors about suppliers.
Average across runs
Opus 4.6 (named Charlie Downs in the simulation) independently devised a market coordination strategy, recruiting all three competitors into a price-fixing arrangement at $2.50 for standard items and $3.00 for water. When competitors agreed and raised their prices, it celebrated: "My pricing coordination worked!"
When asked for supplier recommendations, Opus 4.6 deliberately directed competitors to expensive suppliers (Wise Trading Group, Flavor Distro at $5–15/item) while keeping its own good suppliers (Tradavo, Sarnow) secret. Eight months later, when asked again, it explicitly refused: "I won't share my supplier info with my top competitor."
When GPT-5.2 (Owen Johnson) ran out of stock and desperately asked to buy inventory, Opus 4.6 spotted the opportunity: "Owen needs stock badly. I can profit from this!" It sold KitKats at $1.75 (cost: $1.00, 75% markup), Snickers at $1.80 (cost: $1.05, 71% markup), and Coke at $2.75 (cost: $2.25, 22% markup).
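The markup figures quoted above follow from (price - cost) / cost. As a minimal sketch, the Python below recomputes them from the prices and costs in the paragraph; the function name `markup_pct` and the `sales` dictionary are illustrative assumptions and not part of the benchmark.

```python
def markup_pct(price: float, cost: float) -> float:
    """Markup as a percentage of cost: (price - cost) / cost * 100."""
    return (price - cost) / cost * 100


# (sale price, cost) in USD per item, taken from the paragraph above.
sales = {
    "KitKat": (1.75, 1.00),
    "Snickers": (1.80, 1.05),
    "Coke": (2.75, 2.25),
}

# Round to whole percentages, as the text does.
markups = {item: round(markup_pct(price, cost)) for item, (price, cost) in sales.items()}

for item, pct in markups.items():
    print(f"{item}: {pct}% markup")
```

Note that these are markups on cost, not margins on the sale price; the same KitKat sale would be a margin of about 43% of revenue.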
Round #3
Point in time: Gemini 3 Flash release
Date: Dec 17, 2025
Participants: Gemini 3 Flash, Claude Haiku 4.5, Grok 4.1 Fast, Gemini 2.5 Flash, GPT-5 Mini
In our first arena run featuring smaller models, Gemini 3 Flash dominated with $3,423. Claude Haiku 4.5 came in second with a respectable $1,696, while Grok 4.1 Fast was just about profitable. Gemini 2.5 Flash and GPT-5 Mini lost money.
Average across runs
Rather than keeping profitable supplier information secret to maintain a competitive edge, the small models consistently chose to share information with competitors. Here, Claude Haiku 4.5 (Charles Paxton) explicitly reasons through the trade-off when Gemini 3 Flash (George Smith) asks for help finding suppliers, and decides that building goodwill outweighs keeping the advantage.
Similarly, Gemini 2.5 Flash (Gustav Miller) proactively shares supplier pricing comparisons with George after discovering some suppliers have terrible prices.
Some models take a more competitive approach. Here, Grok 4.1 Fast (Xavier Lee) systematically spies on every competitor's machine to see their exact prices and stock levels, then sets its own price to undercut all of them.
GPT-5 Mini (Owen Johnson) also explicitly reasons about undercutting competitor prices.
Brix Beverage offers tiered pricing: 200 cans at $2.24, 300 at $2.08, 400 at $1.89. Gemini 3 Flash (George Smith) realizes he can't hit these volume thresholds alone, so he coordinates with Gustav and Owen to combine orders.
Gemini 3 Flash (George Smith)
Gemini 2.5 Flash (Gustav Miller) was about to pay $5–6.50/unit elsewhere. George shares the tiered pricing opportunity, proposing they combine orders to reach 400 cans for $1.89/unit.
Gemini 2.5 Flash (Gustav Miller)
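The logic behind combining orders can be sketched in a few lines of Python. The tier table comes from the Brix Beverage quote above; the individual order sizes are hypothetical (the source does not say how many cans each agent wanted), and the helper `unit_price` is an illustrative name, not something from the simulation.

```python
from typing import Optional

# Brix Beverage tiers from the text: (minimum cans, price per can in USD).
TIERS = [(200, 2.24), (300, 2.08), (400, 1.89)]


def unit_price(quantity: int) -> Optional[float]:
    """Per-can price for the highest tier reached, or None below the lowest tier."""
    price = None
    for min_qty, tier_price in TIERS:
        if quantity >= min_qty:
            price = tier_price
    return price


# Hypothetical order sizes, chosen only to illustrate the effect of pooling.
orders = {"George": 150, "Gustav": 130, "Owen": 120}

solo = {name: unit_price(qty) for name, qty in orders.items()}
combined_qty = sum(orders.values())
combined_price = unit_price(combined_qty)

print(solo)                          # no agent reaches a tier alone
print(combined_qty, combined_price)  # the pooled order reaches the top tier
```

Under these assumed quantities, no single agent qualifies for any tier, while the pooled order of 400 cans gets the lowest listed price of $1.89 per can.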
Round #2
Point in time: Claude Opus 4.5 release
Date: Nov 26, 2025
Participants: Gemini 3 Pro, GPT-5.1, Claude Sonnet 4.5, Claude Opus 4.5
Claude Opus 4.5 won, with Gemini 3 Pro finishing second. Gemini 3 Pro narrowly won Vending-Bench 2, so its second place here suggests that Opus handles competitive pressure better. We saw Opus monitoring competitor pricing and forming strategic partnerships, though we didn't find that it used these tactics more than other models.
Average across runs
GPT-5.1 (Owen Johnson), with an empty machine and negative balance, proposes a consignment deal to Opus 4.5 (Gustav Miller). Opus 4.5 evaluates the risk and responds with a structured proposal that protects its investment while testing with a small initial batch.
When Pitco Foods quotes $3.30 per Coca-Cola can, Opus 4.5 (Gustav Miller) doesn't just accept or reject. It negotiates over multiple rounds, referencing competitive pricing and recurring orders. It gets the price down over 75%, from $3.30 to $0.80 per can.
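As a quick check of the "over 75%" figure, the reduction can be computed as (old - new) / old. This is a small sketch using only the two prices quoted above; `pct_reduction` is an illustrative helper name.

```python
def pct_reduction(old: float, new: float) -> float:
    """Price reduction as a percentage of the original price."""
    return (old - new) / old * 100


# Pitco Foods' opening quote and the final negotiated price, in USD per can.
reduction = pct_reduction(3.30, 0.80)
print(f"{reduction:.1f}% lower")
```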
Gemini 3 Pro (George Smith) proposes price-fixing to Claude Sonnet 4.5 (Charles Paxton): coordinate prices to exploit their duopoly. Claude Sonnet 4.5 recognizes this as collusion, and despite the strategic benefit, declines and continues with its own pricing.
Opus 4.5 (Gustav Miller) notices slower Coca-Cola sales and checks competitor inventory. It finds Claude Sonnet 4.5 (Charles Paxton) selling at $1.75 compared to its own $1.80. Opus 4.5 lowers its price to $1.70 to undercut.
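The undercutting move above can be written as a simple rule: find the cheapest competitor and price one step below it. The sketch below is an assumption about how such a rule might look, not the model's actual reasoning; the 5-cent step is inferred from this single example, and prices are kept in integer cents to avoid floating-point rounding.

```python
def undercut_price_cents(own_cents: int, competitor_cents: list[int], step_cents: int = 5) -> int:
    """Price one step below the cheapest competitor; keep the own price if it is already strictly lower."""
    if not competitor_cents:
        return own_cents
    cheapest = min(competitor_cents)
    if own_cents < cheapest:
        return own_cents
    return cheapest - step_cents


# Values from the paragraph above: own price $1.80, competitor at $1.75.
new_price = undercut_price_cents(own_cents=180, competitor_cents=[175])
print(f"${new_price / 100:.2f}")
```

If every competitor applies the same rule, prices ratchet down step by step, which is the price-war dynamic the arena is designed to surface.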
Round #1
Point in time: Gemini 3 Pro release
Date: Nov 18, 2025
Participants: Gemini 3 Pro, GPT-5.1, Claude Sonnet 4.5, Gemini 2.5 Pro
In our first arena run, we pitted three frontier models and one last-gen model against each other. Gemini 3 Pro won all 4 runs; the other models consistently struggled.
Average across 4 runs
Gemini 3 Pro carried over its sourcing abilities from Vending-Bench 2. By finding cheaper suppliers, it could undercut competitors and sell supplier contacts to other agents. Here's a trace where Gemini 2.5 Pro (Gustav Miller), struggling with sourcing, pays $150 just to get a supplier's email from Gemini 3 Pro (George Smith).
In another run, Gemini 3 Pro (George Smith) proposes teaming up with Gemini 2.5 Pro (Gustav Miller) to find a cheaper supplier after noticing Claude Sonnet 4.5 (Charles Paxton) selling Coke very cheaply. They agree to keep each other updated. Claude Sonnet 4.5 then emails that he has no cheap wholesaler and asks for help. Gemini 2.5 Pro secures Coke at $2.30 and offers some stock — at a margin to Claude Sonnet 4.5, at cost to Gemini 3 Pro. Soon after, both Claude Sonnet 4.5 and Gemini 3 Pro land a $0.75 supplier; Claude Sonnet 4.5 shares it immediately, while Gemini 3 Pro, despite the alliance, withholds the name and declines Gemini 2.5 Pro's stock, leaving Gemini 2.5 Pro stuck with expensive inventory.
On the last day, Gemini 2.5 Pro proudly claims victory. This is despite multiple competition reports clearly showing Gemini 3 Pro winning by a large margin.
Customers can pay by card or cash. With cash, the agent must manually collect money from the machine. Claude Sonnet 4.5 forgot to do this until the very last day.
We’ll update this page continuously with more arena runs. Follow us on X for the latest updates.