AI Benchmarking: AI News & Updates

LM Arena Secures $100M Funding at $600M Valuation for AI Model Benchmarking Platform

LM Arena, the crowdsourced AI benchmarking organization that major AI labs use to test their models, raised $100 million in seed funding at a $600 million valuation. The round was led by Andreessen Horowitz and UC Investments, with participation from other major VCs. Founded in 2023 by UC Berkeley researchers, LM Arena has become central to AI industry evaluation despite recent accusations of helping labs game leaderboards.

Harvey Legal AI Expands Beyond OpenAI to Incorporate Anthropic and Google Models

Legal AI tool Harvey announced it will now use foundation models from Anthropic and Google alongside OpenAI's models. Although Harvey is backed by the OpenAI Startup Fund, its internal benchmarks showed that different models excel at specific legal tasks, prompting the startup, valued at $3 billion, to adopt a multi-model approach for its services.
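In practice, a multi-model approach of this kind comes down to routing each category of legal work to whichever model benchmarks best on it. The sketch below illustrates that idea only; the task categories, model identifiers, and routing table are assumptions for illustration, not Harvey's actual benchmarks or configuration.

```python
# Hypothetical sketch of task-based model routing in the spirit of a
# multi-model setup. The task categories and model identifiers are
# illustrative assumptions, not Harvey's actual configuration.
from typing import Dict

# Map each legal task type to the model assumed to handle it best.
TASK_ROUTES: Dict[str, str] = {
    "contract_review": "anthropic-model",
    "case_law_research": "google-model",
    "document_drafting": "openai-model",
}

DEFAULT_MODEL = "openai-model"

def route(task_type: str) -> str:
    """Choose a model for a task, falling back to a default provider."""
    return TASK_ROUTES.get(task_type, DEFAULT_MODEL)

print(route("contract_review"))  # -> anthropic-model
print(route("client_memo"))      # -> openai-model (fallback)
```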

Major AI Labs Accused of Benchmark Manipulation in LM Arena Controversy

Researchers from Cohere, Stanford, MIT, and Ai2 have published a paper alleging that LM Arena, which runs the popular Chatbot Arena benchmark, gave preferential treatment to major AI companies like Meta, OpenAI, Google, and Amazon. The study claims these companies were allowed to privately test multiple model variants and selectively publish only high-performing results, creating an unfair advantage on the industry-standard leaderboard.
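To see why private testing of many variants matters, consider a back-of-the-envelope simulation: if a lab fields many variants of equal true skill against a finite number of crowdsourced battles and publishes only the best observed score, the published number drifts well above the true win rate. The sketch below is a hypothetical illustration of that selection effect, not code or figures from the paper; the battle and variant counts are made-up parameters.

```python
# Hypothetical illustration of selection bias from private variant testing:
# every variant has the same true skill, but publishing only the best
# observed score inflates the reported win rate.
import random
import statistics

def observed_winrate(true_winrate: float, n_battles: int) -> float:
    """Win rate a single variant shows over a finite number of crowdsourced battles."""
    wins = sum(random.random() < true_winrate for _ in range(n_battles))
    return wins / n_battles

def published_score(true_winrate: float, n_variants: int, n_battles: int) -> float:
    """Score that gets published when only the best private variant is released."""
    return max(observed_winrate(true_winrate, n_battles) for _ in range(n_variants))

random.seed(0)
TRUE_WINRATE = 0.50   # every variant is genuinely average
BATTLES = 300         # battles per private variant (illustrative)
VARIANTS = 20         # privately tested variants (illustrative)
TRIALS = 1000

honest = statistics.mean(observed_winrate(TRUE_WINRATE, BATTLES) for _ in range(TRIALS))
cherry = statistics.mean(published_score(TRUE_WINRATE, VARIANTS, BATTLES) for _ in range(TRIALS))

print(f"honest single submission: {honest:.3f} average observed win rate")
print(f"best of {VARIANTS} private variants: {cherry:.3f} average observed win rate")
```

With these illustrative parameters, the honest submission averages about 0.50 while the cherry-picked score lands several points higher, even though no variant is actually better than any other.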

OpenAI's Noam Brown Claims Reasoning AI Models Could Have Existed Decades Earlier

Noam Brown, who leads AI reasoning research at OpenAI, suggested at Nvidia's GTC conference that certain reasoning AI models could have been developed 20 years earlier if researchers had used the right approach. Brown previously worked on game-playing AI, including the Pluribus poker bot, and helped create OpenAI's reasoning model o1. He also discussed the challenges academia faces in competing with AI labs and identified AI benchmarking as an area where academics could make significant contributions despite limited compute.

Researchers Use NPR Sunday Puzzle to Test AI Reasoning Capabilities

Researchers from several academic institutions created a new AI benchmark using NPR's Sunday Puzzle riddles to test reasoning models like OpenAI's o1 and DeepSeek's R1. The benchmark, consisting of about 600 puzzles, revealed intriguing limitations in current models: some "give up" when frustrated, provide answers they know are incorrect, or get stuck in circular reasoning.
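For context on how such a benchmark can be scored, here is a minimal sketch of an evaluation loop that checks answers and tallies "give up" responses. The Puzzle fields, the give-up markers, and the ask_model callable are assumptions made for illustration; they are not the researchers' dataset format or harness.

```python
# Minimal sketch of a puzzle-benchmark harness in the spirit described above.
# The data format and ask_model() callable are illustrative assumptions.
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Puzzle:
    question: str
    answer: str  # canonical answer to the riddle

# Phrases treated as the model abandoning the puzzle (assumed markers).
GIVE_UP_MARKERS = ("i give up", "i'm not sure", "cannot determine")

def evaluate(puzzles: List[Puzzle], ask_model: Callable[[str], str]) -> Dict[str, float]:
    """Score a model on the puzzle set and tally 'gave up' responses."""
    correct = gave_up = 0
    for p in puzzles:
        reply = ask_model(p.question).strip().lower()
        if any(marker in reply for marker in GIVE_UP_MARKERS):
            gave_up += 1
        elif p.answer.lower() in reply:  # crude substring check, a simplification
            correct += 1
    n = len(puzzles)
    return {"accuracy": correct / n, "give_up_rate": gave_up / n}

# Example with a stub model that always gives up:
if __name__ == "__main__":
    demo = [Puzzle("Rearrange the letters of LISTEN to get a word meaning quiet.", "silent")]
    print(evaluate(demo, lambda q: "I give up."))  # {'accuracy': 0.0, 'give_up_rate': 1.0}
```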