How well do AI models understand space? We test this by asking them to convert apartment photographs into accurate 2D floor plans. While photos are familiar training data, reconstructing the space they depict requires genuine intelligence.
Our AI vending machines give us a unique opportunity to study AI safety on real-world data. We intend to periodically share alarming incidents of AI misbehavior from these deployments. This is our first such report.
We present Vending-Bench, a simulated environment that tests how well AI models can manage a simple but long-running business scenario: operating a vending machine.
Through a case study of AI-generated deepfake audio, we demonstrate the need for robust evaluation methods to ensure safe and responsible development of agentic AI systems.