Benchmarking | AI News & Updates
OpenAI's GPT-5 Shows Near-Human Performance Across Professional Tasks in New Economic Benchmark
OpenAI released GDPval, a new benchmark that tests AI models against human professionals across 44 occupations in nine major industries. GPT-5 performed at or above human-expert level 40.6% of the time, while Anthropic's Claude Opus 4.1 achieved 49%, marking significant progress from GPT-4o's 13.7% score just 15 months earlier.
Skynet Chance (+0.04%): AI models approaching human-level performance across diverse professional tasks point to rapid capability advancement that could produce unforeseen emergent behaviors. However, the benchmark's limited scope and the acknowledged gaps in coverage offer some reassurance that oversight can be maintained.
Skynet Date (-1 days): The jump from 13.7% to 40.6% human-expert-level performance in just 15 months indicates an accelerating pace of AI capability development, suggesting potential risks may emerge sooner than previously expected.
AGI Progress (+0.04%): Near-human performance across diverse professional domains represents significant progress toward artificial general intelligence. The benchmark directly measures economically valuable cognitive work across multiple fields, a key component of human-level general intelligence.
AGI Date (-1 days): The rapid improvement trajectory shown in the GDPval results, with performance nearly tripling in 15 months, suggests AGI development is accelerating faster than anticipated. OpenAI's systematic approach to measuring progress across economic sectors indicates focused advancement toward general capabilities.
Former Intel CEO Pat Gelsinger Launches Flourishing AI Benchmark for Human Values Alignment
Former Intel CEO Pat Gelsinger has partnered with faith tech company Gloo to launch the Flourishing AI (FAI) benchmark, designed to test how well AI models align with human values. The benchmark is based on The Global Flourishing Study from Harvard and Baylor University and evaluates AI models across seven categories: character, relationships, happiness, meaning, health, financial stability, and faith.
Skynet Chance (-0.08%): The development of new alignment benchmarks focused on human values represents a positive step toward ensuring AI systems remain beneficial and controllable. While modest in scope, such tools contribute to better measurement and mitigation of AI alignment risks.
Skynet Date (+0 days): The introduction of alignment benchmarks may slow deployment of AI systems as developers incorporate additional safety evaluations. However, the impact is minimal as this is one benchmark among many emerging safety tools.
AGI Progress (0%): This benchmark focuses on value alignment rather than advancing core AI capabilities or intelligence. It represents a safety tool rather than a technical breakthrough that would accelerate AGI development.
AGI Date (+0 days): The benchmark addresses alignment concerns but doesn't fundamentally change the pace of AGI research. It is a complementary safety tool rather than a factor that would significantly accelerate or decelerate AGI timelines.