Benchmarking AI News & Updates
New Benchmark Reveals AI Agents Still Far From Replacing White-Collar Workers
A new benchmark called Apex-Agents tests leading AI models on real white-collar tasks from consulting, investment banking, and law, revealing that even the best models achieve only about 24% accuracy. The models struggle primarily with tracking information across multiple domains, tools, and platforms, a core requirement of professional knowledge work. Despite current limitations, researchers note rapid year-over-year improvement, with accuracy potentially quintupling year over year.
Skynet Chance (-0.03%): The benchmark reveals significant current limitations in AI agents' ability to perform complex multi-domain tasks, suggesting even advanced models lack the autonomous competence needed for uncontrolled, independent operation. These capability gaps weigh against near-term scenarios of AI systems operating without meaningful human oversight.
Skynet Date (+0 days): The research demonstrates that current AI systems struggle with real-world task complexity, indicating existing technical bottlenecks that must be overcome before AI could achieve the autonomous capability levels associated with uncontrollable scenarios. However, the noted rapid improvement trajectory (5-10% to 24% accuracy year-over-year) suggests these limitations may be temporary.
AGI Progress (-0.03%): The benchmark exposes a critical gap in current AI capabilities: the inability to effectively navigate and integrate information across multiple domains and tools, which is fundamental to general intelligence. The low accuracy scores (18-24%) on professional tasks highlight that despite advances in foundation models, systems still lack the robust real-world reasoning required for AGI.
AGI Date (+0 days): While the current low performance suggests AGI capabilities are further away than some predictions implied, the documented rapid improvement rate (potentially quintupling accuracy year-over-year) indicates progress may accelerate once key bottlenecks are addressed. The establishment of this rigorous benchmark provides a clear target for AI labs to optimize against, which could paradoxically accelerate development.
Laude Institute Launches Slingshots Grant Program to Accelerate AI Research and Evaluation
The Laude Institute announced its first Slingshots grants program, providing fifteen AI research projects with funding, compute resources, and engineering support. The initial cohort focuses heavily on AI evaluation challenges, including projects like Terminal Bench, ARC-AGI, and new benchmarks for code optimization and white-collar AI agents.
Skynet Chance (-0.03%): Investment in rigorous AI evaluation and benchmarking infrastructure strengthens our ability to assess AI capabilities and limitations, contributing marginally to safer AI development. The focus on third-party, non-company-specific benchmarks helps maintain transparency and reduces risks of unmonitored capability advances.
Skynet Date (+0 days): Enhanced evaluation frameworks may slow deployment of inadequately tested AI systems by establishing higher standards for capability assessment. However, the impact on the timeline is modest, as this is primarily infrastructure building rather than direct safety intervention.
AGI Progress (+0.02%): The program accelerates AI research by providing compute and resources typically unavailable in academic settings, with projects targeting key AGI-relevant challenges like code optimization and general reasoning (ARC-AGI). Better evaluation tools also help identify and address capability gaps more effectively.
AGI Date (+0 days): By removing resource constraints for promising AI research projects and focusing on capability evaluation that drives progress, the program modestly accelerates the pace of AI development. The emphasis on benchmarking helps researchers identify and pursue productive research directions more efficiently.
OpenAI's GPT-5 Shows Near-Human Performance Across Professional Tasks in New Economic Benchmark
OpenAI released GDPval, a new benchmark testing AI models against human professionals across 44 occupations in nine major industries. GPT-5 performed at or above human expert level 40.6% of the time, while Anthropic's Claude Opus 4.1 led at 49%, representing significant progress from GPT-4o's 13.7% score just 15 months prior.
Skynet Chance (+0.04%): AI models approaching human-level performance across diverse professional tasks suggests rapid capability advancement that could lead to unforeseen emergent behaviors. However, the limited scope of current testing and acknowledgment of gaps provides some reassurance about maintaining oversight.
Skynet Date (-1 day): The dramatic improvement from 13.7% to 40.6% human-level performance in just 15 months indicates an accelerating pace of AI capability development. This rapid progress suggests potential risks may emerge sooner than previously expected.
AGI Progress (+0.04%): Demonstrating near-human performance across diverse professional domains represents significant progress toward AGI's goal of general intelligence across multiple fields. The benchmark directly measures economically valuable cognitive work, a key component of human-level general intelligence.
AGI Date (-1 day): The rapid improvement trajectory shown in GDPval results, with performance nearly tripling in 15 months, suggests AGI development is accelerating faster than anticipated. OpenAI's systematic approach to measuring progress across economic sectors indicates focused advancement toward general capabilities.
Former Intel CEO Pat Gelsinger Launches Flourishing AI Benchmark for Human Values Alignment
Former Intel CEO Pat Gelsinger has partnered with faith tech company Gloo to launch the Flourishing AI (FAI) benchmark, designed to test how well AI models align with human values. The benchmark is based on The Global Flourishing Study from Harvard and Baylor University and evaluates AI models across seven categories: character, relationships, happiness, meaning, health, financial stability, and faith.
Skynet Chance (-0.08%): The development of new alignment benchmarks focused on human values represents a positive step toward ensuring AI systems remain beneficial and controllable. While modest in scope, such tools contribute to better measurement and mitigation of AI alignment risks.
Skynet Date (+0 days): The introduction of alignment benchmarks may slow deployment of AI systems as developers incorporate additional safety evaluations. However, the impact is minimal as this is one benchmark among many emerging safety tools.
AGI Progress (0%): This benchmark focuses on value alignment rather than advancing core AI capabilities or intelligence. It represents a safety tool rather than a technical breakthrough that would accelerate AGI development.
AGI Date (+0 days): The benchmark addresses alignment concerns but doesn't fundamentally change the pace of AGI research or development. It's a complementary safety tool rather than a factor that would significantly accelerate or decelerate AGI timelines.