AI Benchmarking AI News & Updates

Harvey Legal AI Expands Beyond OpenAI to Incorporate Anthropic and Google Models

Legal AI startup Harvey announced that it will now use foundation models from Anthropic and Google alongside OpenAI's. Although Harvey is backed by the OpenAI Startup Fund, its internal benchmarks showed that different models excel at different legal tasks, prompting the startup, valued at $3 billion, to adopt a multi-model approach for its services.

Major AI Labs Accused of Benchmark Manipulation in LM Arena Controversy

Researchers from Cohere, Stanford, MIT, and Ai2 have published a paper alleging that LM Arena, which runs the popular Chatbot Arena benchmark, gave preferential treatment to major AI companies such as Meta, OpenAI, Google, and Amazon. The study claims these companies were allowed to privately test multiple model variants and publish only the high-performing results, giving them an unfair advantage on the industry-standard leaderboard.
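
To see why publishing only the best of several private tests could skew a leaderboard, here is a minimal simulation sketch. It is not the paper's methodology and does not reproduce how Chatbot Arena computes ratings. It assumes every variant has the same underlying skill and that each measured score is that skill plus random noise; the function names and all numbers (`true_skill`, `noise_sd`, the variant counts) are hypothetical values chosen only for illustration.

```python
import random
import statistics


def observed_score(true_skill: float, noise_sd: float, rng: random.Random) -> float:
    """One noisy measurement of a model whose real skill is `true_skill`."""
    return rng.gauss(true_skill, noise_sd)


def mean_published_score(
    n_variants: int,
    trials: int = 2000,
    true_skill: float = 1200.0,  # hypothetical rating, identical for every variant
    noise_sd: float = 20.0,      # hypothetical measurement noise
    seed: int = 0,
) -> float:
    """Average published score when only the best of `n_variants` private tests is released."""
    rng = random.Random(seed)
    published = [
        # Test n_variants equally skilled variants privately, publish only the top score.
        max(observed_score(true_skill, noise_sd, rng) for _ in range(n_variants))
        for _ in range(trials)
    ]
    return statistics.mean(published)


if __name__ == "__main__":
    for n in (1, 5, 10):
        print(f"variants tested: {n:2d}  mean published score: {mean_published_score(n):.1f}")
```

Under these assumptions the published score rises with the number of private variants even though no variant is actually better, which is the kind of selection effect the study alleges.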

OpenAI's Noam Brown Claims Reasoning AI Models Could Have Existed Decades Earlier

Noam Brown, who leads AI reasoning research at OpenAI, suggested at Nvidia's GTC conference that certain reasoning AI models could have been developed 20 years earlier if researchers had used the right approach. Brown, who previously worked on game-playing AI, including the poker AI Pluribus, and helped create OpenAI's reasoning model o1, also addressed the challenges academia faces in competing with AI labs. He identified AI benchmarking as an area where academia could make significant contributions despite its limited compute.

Researchers Use NPR Sunday Puzzle to Test AI Reasoning Capabilities

Researchers from several academic institutions created a new AI benchmark that uses riddles from NPR's Sunday Puzzle to test reasoning models such as OpenAI's o1 and DeepSeek's R1. The benchmark, which consists of about 600 puzzles, revealed intriguing limitations in current models: some "give up" when frustrated, some provide answers they know are incorrect, and some get stuck in circular reasoning patterns.