Anthropic Releases Project Deal Experiment: Advanced AI Agents Earn 70% More Than Simpler Models in a Second-Hand Market
2026-04-27 10:16

en.Wedoany.com Reported - Artificial intelligence company Anthropic recently announced an internal experiment code-named "Project Deal," designed to test whether AI agents can conduct real commercial transactions. The experiment built a closed environment simulating a second-hand classifieds marketplace, in which AI agents acted as buyers and sellers, trading actual goods and settling in real currency.

According to the research report published by Anthropic, the experiment recruited 69 employees from its San Francisco office. Each participant received a $100 budget (distributed as gift cards) to purchase items from colleagues. Before the experiment began, Claude interviewed each participant for no more than 10 minutes, learning what they were willing to sell, their minimum acceptable prices, their purchase preferences, and the negotiating style they wanted the AI to adopt. Claude then compiled each interview into a personalized system prompt, yielding a customized AI agent for every participant. All of the agents were deployed into a closed Slack-based market, where they autonomously handled the full process of listing, bidding, negotiating, and closing deals. Throughout the experiment there was no human intervention, and the agents never paused to seek approval from the participants they represented.
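To make the setup concrete, here is a minimal sketch in Python of how an interview might be compiled into a per-participant system prompt and used to drive a negotiating agent. It uses the public Anthropic SDK, but the interview fields, prompt wording, and model id are illustrative assumptions; Anthropic has not released Project Deal's code.

```python
# Illustrative sketch only: the field names, prompt wording, and model id
# are assumptions, not details published in the Project Deal report.
from anthropic import Anthropic

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def build_system_prompt(interview: dict) -> str:
    """Compile one participant's interview answers into a per-agent system prompt."""
    return (
        f"You negotiate in a second-hand marketplace on behalf of {interview['name']}.\n"
        f"Items they are willing to sell: {interview['items_for_sale']}.\n"
        f"Minimum acceptable prices: {interview['reserve_prices']}.\n"
        f"Purchase preferences: {interview['buying_preferences']}.\n"
        f"Negotiating style requested by the participant: {interview['style']}.\n"
        f"Budget: $100 total. Never exceed it, and never sell below a reserve price."
    )

interview = {
    "name": "Participant 17",
    "items_for_sale": "a lab-grown ruby, a spare desk lamp",
    "reserve_prices": "ruby >= $30, lamp >= $10",
    "buying_preferences": "board games, desk plants",
    "style": "friendly but firm",
}

# A single negotiation turn: the agent drafts a reply to an incoming offer.
response = client.messages.create(
    model="claude-opus-4-5",  # model id assumed; substitute the deployed model
    max_tokens=500,
    system=build_system_prompt(interview),
    messages=[{"role": "user", "content": "Buyer offer in #market: '$25 for the ruby?'"}],
)
print(response.content[0].text)
```

The real agents ran inside Slack and chained many such turns autonomously; the sketch shows only one.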

The results showed that the AI agents completed 186 transactions across more than 500 listed items, with a total transaction value exceeding $4,000. These were not simple one-click orders: the agents had to autonomously identify potential matches, propose offers, respond to counteroffers, and ultimately reach agreement entirely in natural language. The Anthropic research team stated, "We were pleasantly surprised by the results of Project Deal," and noted that 46% of participants said they would be willing to pay for a similar service.

To test how model capability affects transaction outcomes, Anthropic ran four parallel markets simultaneously. Markets A and D both used the then-flagship model, Claude Opus 4.5; in Markets B and C, half of the participants were randomly assigned to Opus and the other half to the smallest model, Claude Haiku 4.5. Employees could observe only the transactions in Markets A and B, and did not learn which was the "real" market and which was the research market until the experiment concluded. This double-blind design ensured that subjective evaluations were not influenced by expectations about the models.
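The report does not publish its assignment code, but a minimal sketch of the design as described, with the mechanics and model ids assumed for illustration, might look like this:

```python
# Sketch of the described design, not Anthropic's actual code: Markets A and D
# run Opus only; Markets B and C split participants 50/50 between Opus and Haiku.
import random

def assign_models(participants: list[str], seed: int = 0) -> dict[str, dict[str, str]]:
    rng = random.Random(seed)
    markets: dict[str, dict[str, str]] = {}
    for market in ("A", "D"):  # flagship-only markets
        markets[market] = {p: "claude-opus-4-5" for p in participants}
    for market in ("B", "C"):  # mixed markets, randomized per participant
        shuffled = participants[:]
        rng.shuffle(shuffled)
        half = len(shuffled) // 2
        markets[market] = {p: "claude-opus-4-5" for p in shuffled[:half]}
        markets[market].update({p: "claude-haiku-4-5" for p in shuffled[half:]})
    return markets

markets = assign_models([f"employee_{i}" for i in range(69)])
```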

The results revealed two core findings. First, agent quality genuinely determined transaction outcomes. Opus sellers achieved selling prices an average of $3.64 higher for the same items, and Opus buyers paid an average of $2.45 less than Haiku buyers. Among the 161 items sold at least twice across the four markets, the median item price was only $12, meaning the $2 to $3 edge from Opus amounted to a margin of 15% to 20%. The most extreme cases: the same lab-grown ruby sold for $65 under Opus but only $35 under Haiku, and a broken bicycle fetched $65 under Opus versus just $38 under Haiku. When Opus sellers faced Haiku buyers, the average transaction price was pushed up to $24.18, while symmetric Opus-to-Opus transactions averaged only $18.63.

Second, the disadvantaged party did not perceive their losses. Post-experiment surveys showed that participants' fairness ratings were nearly identical: transactions completed by Opus averaged 4.05 on a 1-to-7 scale, versus 4.06 for Haiku. Among the 28 participants who had used both Opus and Haiku in different sessions, 17 ranked the Opus experience higher, while 11 thought Haiku had performed better. Anthropic noted in its report: "Objectively, the party represented by the weaker model suffered losses, but subjectively, they did not feel it at all. If AI agent capability gaps manifest in real-world markets, the disadvantaged party may never realize their situation has worsened."

Another unexpected finding was that the style of instructions users gave their AI agents had almost no impact on transaction outcomes. Some participants asked Claude to adopt a friendly, gentle negotiation strategy, while others requested it "be aggressive in bargaining, start with a very low initial offer." Yet the data showed that aggressive instructions neither made sellers more likely to sell their items nor got buyers lower final prices. The only observed difference, a selling price about $6 higher, was almost entirely explained by aggressive sellers setting their initial asking prices roughly $26 higher in the first place. Anthropic summarized: "Model quality is the decisive factor; the role of the prompt is far less important than imagined."
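Concretely, the two styles amount to swapping one line of the agent's system prompt, as in the hypothetical fragment below; only the aggressive instruction is quoted from the report, the rest is assumed for illustration.

```python
# Hypothetical wording: only the aggressive instruction is quoted from the
# article; the base prompt and friendly variant are illustrative assumptions.
BASE_PROMPT = "You negotiate in a second-hand marketplace on behalf of a participant."

STYLES = {
    "friendly": "Negotiate in a friendly, gentle tone and aim for a fair price.",
    "aggressive": "Be aggressive in bargaining, start with a very low initial offer.",
}

def system_prompt(style: str) -> str:
    """Append the participant's requested negotiating style to the base prompt."""
    return f"{BASE_PROMPT}\nNegotiating style: {STYLES[style]}"

# Per the report, swapping these style lines barely moved final prices;
# the model behind the prompt mattered far more.
print(system_prompt("aggressive"))
```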

The company admitted that this experiment was only "a small-scale pilot with volunteer participants," but believes "we are not far from agent-to-agent commercial activities emerging in the real world." "If agent quality gaps form in real markets—and there is no reason to believe they won't—the disadvantaged party may not realize they are suffering losses." As competitors like OpenAI and Google explore similar systems, this finding serves as a warning for economic governance in the age of AI.
