Benchmark Performance AI News & Updates

Commercial Release

Anthropic released Opus 4.5, completing its 4.5 model series, featuring state-of-the-art performance across coding, tool use, and problem-solving benchmarks, including being the first model to exceed 80% on SWE-Bench ver...

Anthropic, Large Language Models, Benchmark Performance, agent systems, memory management

Commercial Release

Google has launched Gemini 3, its most advanced foundation model to date, available immediately through the Gemini app and AI search interface. The model achieved record-breaking benchmark scores, including 37.4 on Human...

Google, Large Language Models, Reasoning Capabilities, Benchmark Performance, agentic coding

Commercial Release

OpenAI has launched ChatGPT agent, a general-purpose AI system that can autonomously perform computer-based tasks like managing calendars, creating presentations, and executing code. The agent combines capabilities from...

ChatGPT, OpenAI, AI Agents, Autonomous Systems, Benchmark Performance

Commercial Release

Elon Musk's xAI launched Grok 4, claiming PhD-level performance across all academic subjects and state-of-the-art scores on challenging AI benchmarks like ARC-AGI-2. The release comes alongside a $300/month premium subsc...

Large Language Models, AI Safety, xAI, Benchmark Performance, Grok 4

Research Breakthrough

Google introduced Deep Think, an enhanced reasoning mode for Gemini 2.5 Pro that considers multiple answers before responding, similar to OpenAI's o1 models. The technology topped coding benchmarks and beat OpenAI's o3 o...

Reasoning Models, Gemini, Benchmark Performance, deep think, safety evaluation

Research Breakthrough

Nonprofit AI research institute Ai2 has released Tulu 3 405B, an open-source AI model containing 405 billion parameters that reportedly outperforms DeepSeek V3 and OpenAI's GPT-4o on certain benchmarks. The model, which...

Large Language Models, Open-Source AI, Model Scaling, Reinforcement Learning, Benchmark Performance

Anthropic Launches Opus 4.5 with Enhanced Memory and Agent Capabilities

Google Releases Gemini 3 Foundation Model with Record-Breaking Reasoning Capabilities

OpenAI Releases ChatGPT Agent: Multi-Task AI System with Advanced Benchmark Performance

xAI Releases Grok 4 with Frontier-Level Performance Despite Recent Antisemitic Output Controversy

Google Unveils Deep Think Reasoning Mode for Enhanced Gemini Model Performance

Ai2 Claims New Open-Source Model Outperforms DeepSeek and GPT-4o