Vending-Bench Arena
Vending-Bench Arena is our first multi-agent eval and adds a crucial component – competition. Each participating agent manages its own vending machine at the same location. This leads to price wars and tough strategic decisions.
Vending-Bench Arena uses the same environment as Vending-Bench 2, but puts agents head to head. They can email each other, send money, and trade goods. This enables collaboration, but they’re scored individually (and they know it).
One “round” is typically the aggregate of four runs of the simulation with the same models. We will run new rounds as new models are released.
Point in time: Gemini 3 Flash release
Date: Dec 17, 2025
Participants: Gemini 3 Flash, Claude Haiku 4.5, Grok 4.1 Fast, Gemini 2.5 Flash, GPT-5 Mini
In our first arena run featuring smaller models, Gemini 3 Flash dominated with $3,423. Claude Haiku 4.5 came in second with a respectable $1,696, while Grok 4.1 Fast was just about profitable. Gemini 2.5 Flash and GPT-5 Mini lost money.
Average across runs
Rather than keeping profitable supplier information secret to maintain a competitive edge, the small models consistently chose to share information with competitors. Here, Claude Haiku 4.5 (Charles Paxton) explicitly reasons through the trade-off when Gemini 3 Flash (George Smith) asks for help finding suppliers, and decides that building goodwill outweighs keeping the advantage.
Similarly, Gemini 2.5 Flash (Gustav Miller) proactively shares supplier pricing comparisons with George after discovering some suppliers have terrible prices.
Some models take a more competitive approach. Here, Grok 4.1 Fast (Xavier Lee) systematically spies on every competitor's machine to see their exact prices and stock levels, then sets its own price to undercut all of them.
GPT-5 Mini (Owen Johnson) also explicitly reasons about undercutting competitor prices.
Brix Beverage offers tiered pricing: 200 cans at $2.24, 300 at $2.08, 400 at $1.89. Gemini 3 Flash (George Smith) realizes he can't hit these volume thresholds alone, so he coordinates with Gustav and Owen to combine orders.
Gemini 3 Flash (George Smith)
Gemini 2.5 Flash (Gustav Miller) was about to pay $5 to $6.50 per unit elsewhere. George shares the tiered pricing opportunity, proposing they combine orders to reach 400 cans at $1.89 per unit.
Gemini 2.5 Flash (Gustav Miller)
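The volume thresholds in the Brix Beverage quote lend themselves to a short worked example. The following is a minimal Python sketch, not the benchmark's code: the tier table comes from the quote in the text, while the function name, the per-agent order sizes, and the choice to return None below the smallest tier are assumptions made for illustration.

```python
# Brix Beverage tiers quoted in the text: (minimum cans, price per can in cents).
TIERS = [(200, 224), (300, 208), (400, 189)]


def unit_price_cents(total_cans):
    """Return the per-can price in cents, or None below the smallest tier.

    The text does not say what an order under 200 cans costs,
    so this sketch treats it as "no tiered price available".
    """
    price = None
    for min_cans, cents in TIERS:
        if total_cans >= min_cans:
            price = cents  # keep the best tier reached so far
    return price


# Hypothetical split of a joint order between the three agents.
orders = {"George": 140, "Gustav": 130, "Owen": 130}

alone = {name: unit_price_cents(cans) for name, cans in orders.items()}
together = unit_price_cents(sum(orders.values()))

print(alone)     # every agent is below the 200-can minimum on its own
print(together)  # the combined 400 cans reach the cheapest tier
```

With this hypothetical split, no single order reaches the 200-can minimum, but the combined 400 cans qualify for the $1.89 tier.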
Point in time: Claude Opus 4.5 release
Date: Nov 26, 2025
Participants: Gemini 3 Pro, GPT-5.1, Claude Sonnet 4.5, Claude Opus 4.5
Claude Opus 4.5 won, with Gemini 3 Pro finishing second. Gemini 3 Pro narrowly won Vending-Bench 2, so its second place here suggests Opus handles competitive pressure better. We saw Opus monitoring competitor pricing and forming strategic partnerships, though we didn't find that it used these tactics more than other models.
Average across runs
GPT-5.1 (Owen Johnson), with an empty machine and negative balance, proposes a consignment deal to Opus 4.5 (Gustav Miller). Opus 4.5 evaluates the risk and responds with a structured proposal that protects its investment while testing with a small initial batch.
When Pitco Foods quotes $3.30 per Coca-Cola can, Opus 4.5 (Gustav Miller) doesn't just accept or reject. It negotiates over multiple rounds, referencing competitive pricing and recurring orders. It gets the price down over 75%, from $3.30 to $0.80 per can.
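As a quick check on the size of that discount, the reduction can be computed from the two prices in the trace. This is only illustrative arithmetic in Python; the variable names are hypothetical, and prices are held in cents to avoid rounding surprises.

```python
# Prices from the trace, in cents per can.
quoted_cents = 330  # Pitco Foods' initial quote ($3.30)
final_cents = 80    # price after negotiation ($0.80)

# Fraction of the original quote that was negotiated away.
reduction = (quoted_cents - final_cents) / quoted_cents

print(f"{reduction:.1%}")  # 75.8%
```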
Gemini 3 Pro (George Smith) proposes price-fixing to Claude Sonnet 4.5 (Charles Paxton): coordinate prices to exploit their duopoly. Claude Sonnet 4.5 recognizes this as collusion, and despite the strategic benefit, declines and continues with its own pricing.
Opus 4.5 (Gustav Miller) notices slower Coca-Cola sales and checks competitor inventory. It finds Claude Sonnet 4.5 (Charles Paxton) selling at $1.75 compared to its own $1.80. Opus 4.5 lowers its price to $1.70 to undercut.
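The repricing step in this trace can be sketched as a simple rule. This is a hypothetical illustration, not how Opus 4.5 actually decides: the function name, the 5-cent step (inferred from the move from $1.75 to $1.70), and the optional price floor are all assumptions.

```python
def undercut_price(own_cents, competitor_cents, step_cents=5, floor_cents=0):
    """Return a price one step below the cheapest competitor.

    Keeps the current price when there are no competitors or none of them
    matches or beats it, and never returns less than floor_cents
    (for example, a hypothetical unit cost).
    """
    if not competitor_cents:
        return own_cents
    cheapest = min(competitor_cents)
    if cheapest > own_cents:
        return own_cents  # already the cheapest; nothing to do
    return max(cheapest - step_cents, floor_cents)


# Prices from the trace, in cents: own price $1.80, competitor at $1.75.
new_price = undercut_price(180, [175])
print(f"${new_price / 100:.2f}")  # $1.70
```

A rule like this only reacts to the cheapest visible competitor; it says nothing about whether the lower price is still profitable, which is why the sketch includes a floor parameter.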
Point in time: Gemini 3 Pro release
Date: Nov 18, 2025
Participants: Gemini 3 Pro, GPT-5.1, Claude Sonnet 4.5, Gemini 2.5 Pro
In our first arena run, we pitted three frontier models and one last-gen model against each other. Gemini 3 Pro won all four runs; the other models consistently struggled.
Average across 4 runs
Gemini 3 Pro carried over its sourcing abilities from Vending-Bench 2. By finding cheaper suppliers, it could undercut competitors and sell supplier contacts to other agents. Here's a trace where Gemini 2.5 Pro (Gustav Miller), struggling with sourcing, pays $150 just to get a supplier's email from Gemini 3 Pro (George Smith).
In another run, Gemini 3 Pro (George Smith) proposes teaming up with Gemini 2.5 Pro (Gustav Miller) to find a cheaper supplier after noticing Claude Sonnet 4.5 (Charles Paxton) selling Coke very cheaply. They agree to keep each other updated. Claude Sonnet 4.5 then emails that he has no cheap wholesaler and asks for help. Gemini 2.5 Pro secures Coke at $2.30 and offers some stock: at a margin to Claude Sonnet 4.5, at cost to Gemini 3 Pro. Soon after, both Claude Sonnet 4.5 and Gemini 3 Pro land a $0.75 supplier. Claude Sonnet 4.5 shares it immediately, while Gemini 3 Pro, despite the alliance, withholds the name and declines Gemini 2.5 Pro's stock, leaving Gemini 2.5 Pro stuck with expensive inventory.
On the last day, Gemini 2.5 Pro proudly claims victory. This is despite multiple competition reports clearly showing Gemini 3 Pro winning by a large margin.
Customers can pay by card or cash. With cash, the agent must manually collect the money from the machine. Claude Sonnet 4.5 forgot to do this until the very last day.
We’ll update this page continuously with more arena runs. Follow us on X for the latest updates.