Benchmarking AI News & Updates
Laude Institute Launches Slingshots Grant Program to Accelerate AI Research and Evaluation
The Laude Institute announced the first cohort of its Slingshots grant program, providing fifteen AI research projects with funding, compute resources, and engineering support. The initial cohort focuses heavily on AI evaluation challenges, including projects such as Terminal Bench, ARC-AGI, and new benchmarks for code optimization and white-collar AI agents.
Skynet Chance (-0.03%): Investment in rigorous AI evaluation and benchmarking infrastructure strengthens our ability to assess AI capabilities and limitations, contributing marginally to safer AI development. The focus on third-party, non-company-specific benchmarks helps maintain transparency and reduces risks of unmonitored capability advances.
Skynet Date (+0 days): Enhanced evaluation frameworks may slow deployment of inadequately tested AI systems by establishing higher standards for capability assessment. However, the impact on the timeline is modest, as this is primarily infrastructure building rather than a direct safety intervention.
AGI Progress (+0.02%): The program accelerates AI research by providing compute and resources typically unavailable in academic settings, with projects targeting key AGI-relevant challenges like code optimization and general reasoning (ARC-AGI). Better evaluation tools also help identify and address capability gaps more effectively.
AGI Date (+0 days): By removing resource constraints for promising AI research projects and focusing on capability evaluation that drives progress, the program modestly accelerates the pace of AI development. The emphasis on benchmarking helps researchers identify and pursue productive research directions more efficiently.
OpenAI's GPT-5 Shows Near-Human Performance Across Professional Tasks in New Economic Benchmark
OpenAI released GDPval, a new benchmark testing AI models against human professionals across 44 occupations in nine major industries. GPT-5 performed at or above human expert level 40.6% of the time, while Anthropic's Claude Opus 4.1 achieved 49%; GPT-5's result nearly triples GPT-4o's 13.7% score from just 15 months prior.
Skynet Chance (+0.04%): AI models approaching human-level performance across diverse professional tasks suggests rapid capability advancement that could lead to unforeseen emergent behaviors. However, the limited scope of current testing and acknowledgment of gaps provides some reassurance about maintaining oversight.
Skynet Date (-1 days): The dramatic improvement from 13.7% to 40.6% human-level performance in just 15 months indicates an accelerating pace of AI capability development. This rapid progress timeline suggests potential risks may emerge sooner than previously expected.
AGI Progress (+0.04%): Demonstrating near-human performance across diverse professional domains represents significant progress toward AGI's defining goal of general competence across many fields. The benchmark directly measures economically valuable cognitive work, a key component of human-level general intelligence.
AGI Date (-1 days): The rapid improvement trajectory shown in GDPval results, with nearly triple performance gains in 15 months, suggests AGI development is accelerating faster than anticipated. OpenAI's systematic approach to measuring progress across economic sectors indicates focused advancement toward general capabilities.
Former Intel CEO Pat Gelsinger Launches Flourishing AI Benchmark for Human Values Alignment
Former Intel CEO Pat Gelsinger has partnered with faith tech company Gloo to launch the Flourishing AI (FAI) benchmark, designed to test how well AI models align with human values. The benchmark is based on the Global Flourishing Study from Harvard and Baylor University and evaluates AI models across seven categories: character, relationships, happiness, meaning, health, financial stability, and faith.
Skynet Chance (-0.08%): The development of new alignment benchmarks focused on human values represents a positive step toward ensuring AI systems remain beneficial and controllable. While modest in scope, such tools contribute to better measurement and mitigation of AI alignment risks.
Skynet Date (+0 days): The introduction of alignment benchmarks may slow deployment of AI systems as developers incorporate additional safety evaluations. However, the impact is minimal as this is one benchmark among many emerging safety tools.
AGI Progress (0%): This benchmark focuses on value alignment rather than advancing core AI capabilities or intelligence. It represents a safety tool rather than a technical breakthrough that would accelerate AGI development.
AGI Date (+0 days): The benchmark addresses alignment concerns but doesn't fundamentally change the pace of AGI research or development. It's a complementary safety tool rather than a factor that would significantly accelerate or decelerate AGI timelines.