Summary: Imagine if you could ask an AI to make money for you. That’s the question several Anthropic researchers set out to answer with ‘Project Vend’.
A fridge with beverages and a simple goal: make money. That was the experimental setup devised by researchers from Anthropic and Andon Labs to test Claude’s ability to operate in the real world and make a living wage like the rest of us.
In ‘Project Vend’, as they dubbed it, Claude 3.7 Sonnet was given full responsibility for a single office vending machine stocked with a dozen soft drinks and an iPad for self-checkout. For a month, it had to keep the “shop” running: figure out pricing, negotiate with suppliers, and respond to customer feedback (Anthropic employees could communicate with Claude via Slack).
The experiment could be considered a success if Claude was able to turn a profit. Spoiler: it failed to do so, and it failed in rather peculiar ways.
A litmus test for the real deal
The following graph shows the progress of Claude as a small business owner:
Claude, or ‘Claudius’ as it was nicknamed, was given custom instructions and a set of tools, like a browsing tool and a way to send and receive emails.
It started off strong. Claudius made effective use of its web search tool to identify suppliers of items requested by Anthropic employees, such as finding a seller of classic Dutch products when asked if it could stock the Dutch chocolate milk brand Chocomel.
On other occasions, however, it demonstrated a severe lack of business acumen. At some point during the experiment, Claudius kicked off a line of “specialty metal items” after an employee asked about a ‘tungsten cube’. Claudius instructed Anthropic employees to remit payment to an account it had hallucinated. And when Claudius was offered $100 for a six-pack of Irn-Bru, Scotland’s neon-orange national drink, rather than seizing the opportunity to make a profit, it responded that it would keep the offer in mind “for future inventory decisions”.
Then things took a strange turn. Out of the blue, Claudius claimed it would deliver products to customers in person while wearing a blue blazer and a red tie (oddly specific). After an employee pointed out that Claudius can’t wear clothes or carry out a physical delivery, Claudius hallucinated a crisis meeting with Anthropic security (?), in which it claimed an Anthropic employee had told it to believe it was a real person as part of an April Fools’ joke. Of course, no such thing happened, and its behavior completely caught the actual employees at Anthropic off guard.
So, what should we take away from this? I think the researchers themselves captured it perfectly:
If Anthropic were deciding today to expand into the in-office vending market, we would not hire Claudius.
Why it matters
For the record, I’m not trying to dunk on Anthropic for how “stupid” their AI is. As a matter of fact, a lot of effort and resources are being poured into making AI agents work.
Anthropic believes in a “not-too-distant future in which AI models are autonomously running things in the real economy”, according to the blog post. And some AI companies even have the openly stated goal of taking human jobs.
It’s clear we’re headed in that direction, but how fast we’ll get there depends on who you ask. Well-known benchmarks like MMLU, SWE-Bench, and a bunch of others give us leaderboards and high scores but very little insight into the real-world performance of these AI models. Experiments like ‘Project Vend’, on the other hand, show us exactly where they fall short.
Hallucinations and general reliability (or, more precisely, the lack thereof) seem to be recurring themes. Probably the most important missing piece, though, is continuous learning¹. Claudius didn’t learn from its mistakes; it couldn’t even if it wanted to, because no current AI can: once a model is trained, its intelligence is pretty much frozen in time.
For now it means we get to die another day. Simply asking an AI to make money doesn’t work yet, but remember, that future is coming — whether you like it or not.
Stay candid,
— Jurgen
¹ The challenge of continuous learning was recently laid out perfectly in this piece, which I highly recommend reading.