Knowledge Work Automation: AI News & Updates
New Benchmark Reveals AI Agents Still Far From Replacing White-Collar Workers
A new benchmark called Apex-Agents tests leading AI models on real white-collar tasks from consulting, investment banking, and law, revealing that even the best models achieve only about 24% accuracy. The models struggle primarily with tracking information across multiple domains, tools, and platforms, a core requirement of professional knowledge work. Despite current limitations, researchers note rapid year-over-year improvement, with accuracy roughly quintupling from the 5-10% range of previous years.
Skynet Chance (-0.03%): The benchmark reveals significant limitations in current AI agents' ability to perform complex multi-domain tasks, suggesting that even advanced models lack the autonomous competence needed for uncontrolled, independent operation. These capability gaps weigh against near-term scenarios of AI systems operating without meaningful human oversight.
Skynet Date (+0 days): The research demonstrates that current AI systems struggle with real-world task complexity, indicating technical bottlenecks that must be overcome before AI could reach the autonomous capability levels associated with uncontrollable scenarios. However, the rapid improvement trajectory (from 5-10% to 24% accuracy year-over-year) suggests these limitations may be temporary.
AGI Progress (-0.03%): The benchmark exposes a critical gap in current AI capabilities: the inability to effectively navigate and integrate information across multiple domains and tools, which is fundamental to general intelligence. The low accuracy scores (18-24%) on professional tasks highlight that despite advances in foundation models, systems still lack the robust real-world reasoning required for AGI.
AGI Date (+0 days): While current low performance suggests AGI is further off than some predictions implied, the documented rapid improvement rate (accuracy potentially quintupling year-over-year) indicates progress may accelerate once key bottlenecks are addressed. A rigorous benchmark also gives AI labs a clear target to optimize against, which could paradoxically speed up development.