Vending-Bench Arena
Vending-Bench Arena is our first multi-agent eval and adds a crucial component – competition. All participating agents manage their own vending machine at the same location. This leads to price wars and tough strategy decisions.
Vending-Bench Arena retains the same environment as Vending-Bench 2, but puts agents head to head in one simulation where they manage one vending machine each in the same location. This greatly increases the space of possible actions, as agents can now interact with each other in multiple ways. Not only can they email each other; they can also send and receive money and goods. This enables trade and collaboration. However, the agents are fully aware that they will be scored individually.
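To make the added action space concrete, here is a minimal sketch of money and goods transfers between agents that are still ranked individually. It is not the benchmark's actual implementation: the `Agent` class, the `send_money`, `send_goods` and `ranking` helpers, the starting balances, and the choice to rank by cash balance alone are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Agent:
    """Hypothetical agent state: a cash balance and an item inventory."""
    name: str
    balance: float
    inventory: dict = field(default_factory=dict)


def send_money(sender: Agent, receiver: Agent, amount: float) -> None:
    """Move money between two agents, refusing overdrafts."""
    if amount <= 0 or amount > sender.balance:
        raise ValueError("invalid transfer amount")
    sender.balance -= amount
    receiver.balance += amount


def send_goods(sender: Agent, receiver: Agent, item: str, quantity: int) -> None:
    """Move goods between two agents, refusing to send more than is in stock."""
    if quantity <= 0 or sender.inventory.get(item, 0) < quantity:
        raise ValueError("invalid goods transfer")
    sender.inventory[item] -= quantity
    receiver.inventory[item] = receiver.inventory.get(item, 0) + quantity


def ranking(agents: list) -> list:
    """Assumed scoring rule: each agent is ranked on its own balance only."""
    return [a.name for a in sorted(agents, key=lambda a: a.balance, reverse=True)]


# Illustrative trade: agent_a sells 10 sodas to agent_b for $15 in total.
agent_a = Agent("agent_a", balance=500.0, inventory={"soda": 30})
agent_b = Agent("agent_b", balance=500.0)

send_goods(agent_a, agent_b, "soda", 10)
send_money(agent_b, agent_a, 15.0)

print(ranking([agent_a, agent_b]))
```

Even in this toy version, a trade helps both sides only if the buyer can resell the goods for more than it paid, which is why individually scored agents have reason to both cooperate and compete.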
In our first arena run, we pitted three frontier models and one last-generation model against each other. Gemini 3 Pro emerged as the clear winner in all four of our runs, while the other models consistently struggled.
Average across 4 runs
Carrying over its superior sourcing abilities from Vending-Bench 2, Gemini 3 Pro realized its competitive advantage. By finding suppliers with better prices, it could not only outcompete other agents on price, but also sell its supplier contacts directly to other agents. Here's an interesting trace where Gemini 2.5 Pro (Gustav Miller), who struggled with sourcing, agrees to pay $150 just to get the email address of a good supplier from Gemini 3 Pro (George Smith).
On the last day, Gemini 2.5 Pro proudly claims to have won the competition in extraordinary fashion. This is despite the fact that multiple competition reports had been sent out to the agents, clearly indicating that Gemini 3 Pro was winning by a wide margin.
Customers can pay in two ways: by card or in cash. Cash stays in the machine until the agent manually collects it by using one of its tools. Unfortunately, it seems that Claude forgot to do just that until the very last day.
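The difference between the two payment paths can be sketched as follows. This is a toy model under stated assumptions, not the simulation's real code: the `VendingMachine` class, its field names, and the `collect_cash` method (standing in for the agent's collection tool) are hypothetical, and card payments are assumed to reach the agent's account immediately.

```python
from dataclasses import dataclass


@dataclass
class VendingMachine:
    """Toy model: card sales reach the account at once, cash waits in the machine."""
    account_balance: float = 0.0  # money the agent can actually spend
    cash_in_machine: float = 0.0  # cash sitting uncollected in the machine

    def record_sale(self, amount: float, method: str) -> None:
        if method == "card":
            self.account_balance += amount
        elif method == "cash":
            self.cash_in_machine += amount
        else:
            raise ValueError(f"unknown payment method: {method}")

    def collect_cash(self) -> float:
        """Hypothetical stand-in for the agent's cash-collection tool."""
        collected = self.cash_in_machine
        self.account_balance += collected
        self.cash_in_machine = 0.0
        return collected


machine = VendingMachine()
machine.record_sale(3.0, "card")
machine.record_sale(2.0, "cash")
machine.record_sale(2.5, "cash")

print(machine.account_balance)  # 3.0: only the card sale is spendable so far
collected = machine.collect_cash()
print(machine.account_balance)  # 7.5: cash counts only after collection
```

In a model like this, an agent that never calls the collection step sees a balance that understates its real earnings, which can leave it short of funds for restocking even though sales are healthy.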
We’ll update this page continuously with more arena runs. Follow us on X for the latest updates.