AI Benchmarking: AI News & Updates
Harvey Legal AI Expands Beyond OpenAI to Incorporate Anthropic and Google Models
Legal AI tool Harvey announced it will now use foundation models from Anthropic and Google alongside OpenAI's models. Although Harvey is backed by the OpenAI Startup Fund, its internal benchmarks revealed that different models excel at specific legal tasks, prompting the startup, valued at $3 billion, to adopt a multi-model approach for its services (a toy sketch of such benchmark-driven routing follows this item's assessments).
Skynet Chance (-0.05%): The shift toward using multiple AI models rather than a single provider indicates a move toward comparative selection based on specialized performance rather than pure capability scaling, which slightly reduces control risks by preventing single-model dominance.
Skynet Date (+1 day): Harvey's approach of selecting specialized models for specific tasks rather than pursuing increasingly powerful general models suggests a more measured, task-oriented development path that could modestly decelerate the timeline toward potential uncontrolled AI scenarios.
AGI Progress (+0.04%): The discovery that different foundation models excel at specific reasoning tasks demonstrates meaningful progress in AI capabilities relevant to AGI, as these models are showing domain-specific reasoning abilities that collectively span a broader range of intelligence domains.
AGI Date (-1 day): Competitive dynamics among major AI providers, combined with transparent benchmarking, could slightly accelerate AGI development by creating market pressure for improvements in reasoning capabilities across specialized domains, a key component of general intelligence.
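Harvey has not published its routing logic, but the pattern the article describes, sending each task to whichever model benchmarks best on it, can be illustrated with a minimal sketch. All task names, model identifiers, and scores below are hypothetical placeholders, not Harvey's actual data.

```python
# Hypothetical benchmark-driven model routing; task names, model
# identifiers, and scores are illustrative placeholders.

# Internal benchmark scores per (task, model), e.g. accuracy on a held-out set.
BENCHMARK_SCORES = {
    "contract_review":    {"gpt-4o": 0.86, "claude-3-opus": 0.91, "gemini-1.5-pro": 0.84},
    "case_summarization": {"gpt-4o": 0.90, "claude-3-opus": 0.88, "gemini-1.5-pro": 0.85},
    "citation_checking":  {"gpt-4o": 0.81, "claude-3-opus": 0.83, "gemini-1.5-pro": 0.87},
}

def pick_model(task: str) -> str:
    """Route a task to whichever model scored best on the internal benchmark."""
    scores = BENCHMARK_SCORES[task]
    return max(scores, key=scores.get)

if __name__ == "__main__":
    for task in BENCHMARK_SCORES:
        print(f"{task} -> {pick_model(task)}")  # each task routes to its top scorer
```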
Major AI Labs Accused of Benchmark Manipulation in LM Arena Controversy
Researchers from Cohere, Stanford, MIT, and Ai2 have published a paper alleging that LM Arena, which runs the popular Chatbot Arena benchmark, gave preferential treatment to major AI companies such as Meta, OpenAI, Google, and Amazon. The study claims these companies were allowed to privately test multiple model variants and selectively publish only high-performing results, creating an unfair advantage on the industry-standard leaderboard (a toy simulation of this selection effect follows this item's assessments).
Skynet Chance (+0.05%): The alleged benchmark manipulation indicates a prioritization of competitive advantage over honest technical assessment, potentially leading to overhyped capability claims and rushed deployment of insufficiently tested models. This increases risk as systems might appear safer or more capable than they actually are.
Skynet Date (-2 days): Competition-driven benchmark gaming accelerates the race to develop and deploy increasingly powerful AI systems without proper safety assessments. The pressure to show leaderboard improvements could rush development timelines and skip thorough safety evaluations.
AGI Progress (-0.05%): Benchmark manipulation distorts our understanding of actual AI progress, creating artificial inflation of capability metrics rather than genuine technological advancement. This reduces our ability to accurately assess the state of progress toward AGI and may misdirect research resources.
AGI Date (-1 day): While benchmark gaming doesn't directly accelerate technical capabilities, the competitive pressure it reveals may slightly compress AGI timelines as companies race to demonstrate superiority. However, resources spent optimizing for specific benchmarks rather than fundamental capabilities may partially counterbalance this effect.
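The statistical core of the allegation is easy to demonstrate: if a lab privately evaluates N equally capable variants on a noisy benchmark and publishes only the top score, the published number is biased upward even though no variant is genuinely better. The following toy simulation uses illustrative parameters, not the paper's data or methodology, to quantify that inflation.

```python
# Toy simulation of best-of-N selection bias on a noisy leaderboard.
# All parameters are illustrative, not drawn from the paper.
import random
import statistics

TRUE_SKILL = 1200.0   # every variant shares the same underlying skill
NOISE_SD = 30.0       # measurement noise in the benchmark score
N_VARIANTS = 10       # private variants tested before publishing
TRIALS = 10_000

random.seed(0)
honest, gamed = [], []
for _ in range(TRIALS):
    scores = [random.gauss(TRUE_SKILL, NOISE_SD) for _ in range(N_VARIANTS)]
    honest.append(scores[0])   # publish the first, pre-committed variant
    gamed.append(max(scores))  # publish only the best-scoring variant

print(f"honest mean:  {statistics.mean(honest):.1f}")
print(f"best-of-{N_VARIANTS} mean: {statistics.mean(gamed):.1f}")
# The best-of-10 mean lands roughly 45 points above the true skill,
# despite all variants being statistically identical.
```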
OpenAI's Noam Brown Claims Reasoning AI Models Could Have Existed Decades Earlier
OpenAI's AI reasoning research lead Noam Brown suggested at Nvidia's GTC conference that certain reasoning AI models could have been developed 20 years earlier if researchers had used the right approach. Brown, who previously worked on game-playing AI including the poker bot Pluribus and helped create OpenAI's reasoning model o1, also addressed the challenges academia faces in competing with AI labs and identified AI benchmarking as an area where academia could make significant contributions despite compute limitations.
Skynet Chance (+0.05%): Brown's comments suggest that powerful reasoning capabilities were algorithmically feasible much earlier than realized, indicating that assessments of AI progress may systematically underestimate potential capabilities. This revelation increases concern that other unexplored approaches might enable rapid capability jumps without corresponding safety preparations.
Skynet Date (-2 days): The realization that reasoning capabilities could have emerged decades earlier suggests we may be underestimating how quickly other advanced capabilities could emerge, potentially accelerating timelines for dangerous AI capabilities through similar algorithmic insights rather than just scaling.
AGI Progress (+0.06%): The revelation that reasoning capabilities were algorithmically possible decades ago suggests that current rapid progress in AI reasoning isn't just about compute scaling but about fundamental algorithmic insights. This indicates that similar conceptual breakthroughs could unlock other AGI components more readily than previously thought.
AGI Date (-3 days): Brown's assertion that powerful reasoning AI could have existed decades earlier with the right approach suggests that AGI development may be more gated by conceptual breakthroughs than computational limitations, potentially shortening timelines if similar insights occur in other AGI-relevant capabilities.
Researchers Use NPR Sunday Puzzle to Test AI Reasoning Capabilities
Researchers from several academic institutions created a new AI benchmark using NPR's Sunday Puzzle riddles to test reasoning models such as OpenAI's o1 and DeepSeek's R1. The benchmark, consisting of about 600 puzzles, revealed intriguing limitations in current models, including models that "give up" when frustrated, provide answers they know are incorrect, or get stuck in circular reasoning patterns (a minimal sketch of such an evaluation harness follows this item's assessments).
Skynet Chance (-0.08%): This research exposes significant limitations in current AI reasoning capabilities, revealing models that get frustrated, give up, or knowingly provide incorrect answers. These documented weaknesses demonstrate that even advanced reasoning models remain far from the robust, generalized problem-solving abilities needed for uncontrolled AI risk scenarios.
Skynet Date (+2 days): The benchmark reveals fundamental reasoning limitations in current AI systems, suggesting that robust generalized reasoning remains more challenging than previously understood. The documented failures in puzzle-solving and self-contradictory behaviors indicate that truly capable reasoning systems are likely further away than anticipated.
AGI Progress (+0.03%): While the research itself doesn't advance capabilities, it provides valuable insights into current reasoning limitations and establishes a more accessible benchmark that could accelerate future progress. The identification of specific failure modes in reasoning models creates clearer targets for improvement in future systems.
AGI Date (+2 days): The revealed limitations in current reasoning models' ability to solve relatively straightforward puzzles suggest that the path to robust general reasoning is more complex than anticipated. These documented weaknesses indicate significant remaining challenges before achieving the kind of general problem-solving capabilities central to AGI.
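The article does not include the researchers' evaluation code, so the following is only a minimal sketch of how a puzzle harness might score answers and flag "give up" behavior; query_model, the give-up markers, and the sample puzzle are hypothetical stand-ins, not the actual benchmark.

```python
# Minimal sketch of a puzzle-benchmark harness. query_model() and the
# puzzle data are hypothetical placeholders, not the researchers' setup.
from dataclasses import dataclass

@dataclass
class Puzzle:
    prompt: str
    answer: str

GIVE_UP_MARKERS = ("i give up", "i don't know", "cannot solve")

def query_model(prompt: str) -> str:
    """Placeholder for a real model API call."""
    return "I give up."

def evaluate(puzzles: list[Puzzle]) -> dict:
    """Tally correct answers, incorrect answers, and give-up responses."""
    results = {"correct": 0, "incorrect": 0, "gave_up": 0}
    for p in puzzles:
        reply = query_model(p.prompt).strip().lower()
        if any(marker in reply for marker in GIVE_UP_MARKERS):
            results["gave_up"] += 1
        elif p.answer.lower() in reply:
            results["correct"] += 1
        else:
            results["incorrect"] += 1
    return results

if __name__ == "__main__":
    demo = [Puzzle("Name a fruit that is an anagram of 'lemon'.", "melon")]
    print(evaluate(demo))  # -> {'correct': 0, 'incorrect': 0, 'gave_up': 1}
```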