Vending-Bench: Testing long-term coherence in agents 
We present Vending-Bench, a simulated environment that tests how well AI models can manage a simple but long-running business scenario: operating a vending machine.
Through a case study of AI-generated deepfake audio, we demonstrate the need for robust evaluation methods to ensure safe and responsible development of agentic AI systems.