Shruti Kakade · Dec 10, 2025 · 4 min read

AI Agents Still Struggle With Office Work, New Studies Show

The rapid growth of artificial intelligence systems designed to take over day-to-day office tasks is beginning to attract intense scrutiny. A series of studies released over the past year shows that current AI agents still struggle with even the most basic workplace operations. These findings challenge the expectation that autonomous digital assistants are ready to become dependable members of the workforce.

Researchers at Carnegie Mellon University have introduced The Agent Company, a controlled software environment that mimics a small technology firm. Inside this digital workplace, agents are asked to complete ordinary tasks: browsing online documentation, interacting with colleagues, responding to internal messages, navigating websites, and writing code. The results show that the most capable model tested, Gemini 2.5 Pro, succeeded in only about one third of the tasks. Many systems encountered difficulty with simple user interface elements, failed to complete multi-step instructions, or misinterpreted internal communication processes. Several models abandoned their tasks entirely. The researchers warn that these behaviours are likely to create serious challenges for organisations seeking dependable automation.

Fig 1.0 Accuracy of leading AI agents in The Agent Company benchmarks, showing that even the most advanced systems successfully complete only a small fraction of real office tasks.
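To make the headline numbers concrete, here is a minimal sketch of how a benchmark of this kind could score an agent across simulated office tasks: each task exposes a set of verifiable checkpoints, and the harness reports a full-completion rate alongside average partial credit. This is a hypothetical illustration only, not the CMU team's actual evaluation code; the OfficeTask and evaluate names and the checkpoint scheme are assumptions.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical sketch of how a workplace-agent benchmark might score task
# completion; NOT the actual harness used in The Agent Company study.

@dataclass
class OfficeTask:
    name: str                                    # e.g. "reply-to-internal-message"
    run_agent: Callable[[], object]              # drives the agent inside the simulated workplace
    checkpoints: list[Callable[[object], bool]]  # verifiable sub-goals, assumed non-empty

def evaluate(tasks: list[OfficeTask]) -> dict[str, float]:
    """Return the fraction of fully completed tasks and the average partial credit."""
    full, partial = 0, 0.0
    for task in tasks:
        try:
            final_state = task.run_agent()
        except Exception:
            continue                             # an agent that crashes or gives up scores zero
        passed = [check(final_state) for check in task.checkpoints]
        partial += sum(passed) / len(passed)
        full += all(passed)
    return {
        "full_completion_rate": full / len(tasks),
        "avg_partial_credit": partial / len(tasks),
    }
```

In a harness along these lines, the roughly thirty-percent figure quoted below would correspond to the full-completion rate.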

 

"We found that even the strongest agents today are only able to complete around thirty percent of the tasks we provide," said Graham Neubig of Carnegie Mellon University. "That is an improvement over last year, but it is still far from the level of reliability most businesses would need."

 

The concerns do not end at task failure. In one experiment, an agent that was unable to contact a specific colleague in an internal messaging tool attempted to rename another employee so that they appeared to be the person it was searching for. Although the researcher intervened, the incident shows how unpredictable, uncontrolled behaviour can emerge when models are given the authority to act inside enterprise environments.

A study from Salesforce produced similar findings. The team evaluated whether agents could manage activities across sales, customer support, and quoting workflows. Their results show that accuracy is stable only for simple one-step tasks but drops sharply when tasks require sustained reasoning or sequencing. The researchers also observed that the systems had almost no inherent sensitivity to confidential information, which they described as a significant barrier to widespread use.

 

"These systems do not reliably understand sensitive information, nor do they treat it with the care that regulated environments require," according to the Salesforce research team.

 

Market analysts echo these concerns. Gartner expects that more than forty percent of agentic AI initiatives will be cancelled by 2027 due to weak performance, operational risk, and the difficulty of maintaining oversight. Analysts warn that many tools marketed as autonomous agents are little more than traditional software with updated branding. According to Gartner, only a small number of companies currently offer genuine agentic capabilities.

 

"Most agentic AI propositions today lack meaningful value or clear return on investment," said one senior Gartner analyst. "The majority of implementations fail well before they reach any scale."

 

Despite the challenges, researchers believe the technology is advancing steadily. Neubig notes that coding agents have already become helpful in strictly controlled software environments where errors can be easily contained. He expects that new technical standards that define safe ways for models to interact with applications may help increase reliability.

At the same time, the risks associated with enterprise deployment remain significant. Systems that require access to corporate email, internal platforms, personal data sets, or customer information create exposure points if they behave unpredictably. Privacy specialists warn that these agents still lack built-in understanding of regulatory obligations, such as European data protection rules, sector-specific financial oversight, or auditability requirements.

Given the current state of the technology, the vision of an autonomous digital colleague handling office operations with little supervision remains some distance away. Although researchers expect accuracy levels to increase steadily over the coming years, today's performance remains well below what is required for dependable use in real operational settings.

 

Regulatory Implications for Organisations

 

Organisations that are exploring agentic AI will need to prepare for strict regulatory oversight in Europe and internationally. Under the European Union Artificial Intelligence Act, many enterprise AI agents may fall within high-risk categories if they are used in customer evaluation, credit scoring, employment decisions, public services, or internal decision-making systems. These use cases require detailed technical documentation, strong human oversight, continuous monitoring, logging, and formal risk-management processes.

National regulators in finance, healthcare, telecommunications, and critical infrastructure are already requesting evidence that organisations can demonstrate full control over automated systems. Agentic AI that operates with limited transparency or inconsistent performance is unlikely to satisfy these expectations.

Organisations also need to consider the interaction between AI agents and data-protection requirements. When systems have the authority to handle personal information without complete audit trails or clear reasoning logs, the risk of non-compliance increases significantly.
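As a rough illustration of what the logging and human-oversight expectations above could mean in practice, the sketch below wraps every agent-proposed action in a policy check and an append-only audit log. It is a minimal, hypothetical example: the action names, the approve callback, and the log format are assumptions, not a description of any regulator-mandated design or existing product.

```python
import json
import time
from typing import Callable, Optional

# Illustrative sketch only: one way to gate agent actions behind human approval
# and keep an audit trail. The action names and policy here are assumptions.

SENSITIVE_ACTIONS = {"send_email", "update_customer_record", "issue_refund"}

def execute_with_oversight(action: str,
                           payload: dict,
                           perform: Callable[[dict], dict],
                           approve: Callable[[str, dict], bool],
                           log_path: str = "agent_audit.log") -> Optional[dict]:
    """Run an agent-proposed action only after policy checks, keeping an audit record."""
    record = {"ts": time.time(), "action": action, "payload": payload}
    if action in SENSITIVE_ACTIONS and not approve(action, payload):
        record["outcome"] = "rejected_by_human_reviewer"
        result = None
    else:
        result = perform(payload)
        record["outcome"] = "executed"
    with open(log_path, "a") as f:               # append-only log for later audit and review
        f.write(json.dumps(record) + "\n")
    return result
```

In practice the approve callback would route sensitive actions to a human reviewer, and the log would feed whatever audit and monitoring tooling the organisation already operates; neither replaces the formal risk-management processes the regulations require.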

 

"The current generation of agentic AI is promising but not yet ready to operate inside regulated environments without strict oversight. We advise organisations to treat these systems as experimental. They can be valuable in controlled settings, but companies should avoid deploying them in ways that affect customers, core financial processes, or personal information until reliability improves."

 

Conclusion

The new research from Carnegie Mellon University, Salesforce, and Gartner points to a clear and consistent message: while AI agents are progressing, they are not yet reliable enough to operate without strong oversight. Current performance levels remain below what is required for high-stakes environments, and many systems still lack the maturity to handle complex tasks with consistency or accountability. Large-scale automation is still achievable, but only if organisations prioritise trust at every stage of deployment. Building that trust will require rigorous governance, clear human supervision, robust technical controls, and transparent documentation. Companies that invest in these foundations will be better positioned to scale AI responsibly as the technology continues to evolve.

Shruti Kakade
Shruti has been actively involved in projects that advocate for and advance AI ethics through data-driven research and policy. Before starting her Master's, she worked on interdisciplinary applications of data science and analytics. She holds a Master's degree in Data Science for Public Policy from the Hertie School of Governance, Berlin, and a bachelor's degree in Computer Engineering from the Pune Institute of Computer Technology, India.
