Model Scaling AI News & Updates

DeepSeek Updates Prover V2 for Advanced Mathematical Reasoning

Chinese AI lab DeepSeek has released an upgraded version of its mathematics-focused AI model, Prover V2, built on its 671-billion-parameter V3 model, which uses a mixture-of-experts architecture. The company, which previously offered Prover for formal theorem proving and mathematical reasoning, is reportedly considering raising outside funding for the first time while continuing to update its model lineup.
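The mixture-of-experts idea mentioned above is what lets a 671-billion-parameter model run affordably: a router activates only a few expert subnetworks per token, so most parameters sit idle on any given forward pass. Below is a minimal illustrative sketch of top-k expert routing in NumPy; it is not DeepSeek's implementation, and the layer sizes, expert count, and single-matrix "experts" are toy assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

class ToyMoELayer:
    """Toy mixture-of-experts layer: a learned router scores all experts,
    but only the top-k experts actually process the token, and their
    outputs are combined with renormalized routing weights."""

    def __init__(self, d_model=8, n_experts=4, top_k=2):
        self.top_k = top_k
        self.router = rng.standard_normal((d_model, n_experts))
        # Each "expert" is just one linear map here; in a real model it
        # would be a full feed-forward block.
        self.experts = [rng.standard_normal((d_model, d_model))
                        for _ in range(n_experts)]

    def forward(self, x):
        scores = softmax(x @ self.router)           # routing probabilities
        top = np.argsort(scores)[-self.top_k:]      # indices of top-k experts
        weights = scores[top] / scores[top].sum()   # renormalize over the chosen experts
        # Only top_k of n_experts matrices are touched: sparse activation.
        return sum(w * (x @ self.experts[i]) for w, i in zip(weights, top))

layer = ToyMoELayer()
out = layer.forward(rng.standard_normal(8))
print(out.shape)  # (8,)
```

The design point is that compute per token scales with `top_k`, not with the total number of experts, which is how total parameter count and inference cost are decoupled.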

OpenAI's Reasoning Models Show Increased Hallucination Rates

OpenAI's new reasoning models, o3 and o4-mini, exhibit higher hallucination rates than their predecessors: o3 hallucinates 33% of the time on OpenAI's PersonQA benchmark, and o4-mini 48%. Researchers are puzzled by the increase, since scaling up reasoning models appears to exacerbate hallucination, potentially undermining their usefulness despite gains in areas like coding and math.

OpenAI Expands GPT-4.5 Access Despite High Operational Costs

OpenAI has begun rolling out its largest AI model, GPT-4.5, to ChatGPT Plus subscribers, with the rollout expected to take one to three days. Despite offering deeper world knowledge and higher emotional intelligence, GPT-4.5 is extremely expensive to run, costing 30 times as much for input tokens and 15 times as much for output tokens as GPT-4o, raising questions about its long-term viability in the API.
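The 30x/15x multipliers fall directly out of the per-token API prices. The sketch below uses the launch-time list prices as widely reported (USD per million tokens); treat the exact figures as a snapshot and check OpenAI's pricing page for current values.

```python
# Launch-time API list prices, USD per million tokens (reported figures;
# verify against OpenAI's current pricing page before relying on them).
PRICES = {
    "gpt-4o":  {"input": 2.50,  "output": 10.00},
    "gpt-4.5": {"input": 75.00, "output": 150.00},
}

def request_cost(model, input_tokens, output_tokens):
    """Cost in USD of one request with the given token counts."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# The article's multipliers are just the price ratios:
print(PRICES["gpt-4.5"]["input"] / PRICES["gpt-4o"]["input"])    # 30.0
print(PRICES["gpt-4.5"]["output"] / PRICES["gpt-4o"]["output"])  # 15.0

# Example: a single request with 10k input tokens and 1k output tokens.
print(round(request_cost("gpt-4.5", 10_000, 1_000), 3))  # 0.9
print(round(request_cost("gpt-4o", 10_000, 1_000), 3))   # 0.035
```

At these rates a high-volume application would pay roughly 25x more overall for a typical input-heavy workload, which is the core of the viability concern.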

Ai2 Claims New Open-Source Model Outperforms DeepSeek and GPT-4o

Nonprofit AI research institute Ai2 has released Tulu 3 405B, an open-source, 405-billion-parameter AI model that reportedly outperforms DeepSeek V3 and OpenAI's GPT-4o on certain benchmarks. Trained on 256 GPUs, the model uses reinforcement learning with verifiable rewards (RLVR) and posts superior results on specialized knowledge questions and grade-school math problems.
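The "verifiable rewards" in RLVR are the key distinction from standard RLHF: instead of a learned reward model, the training signal comes from a deterministic checker that compares the model's final answer against ground truth (natural for math problems with known solutions). A minimal sketch of such a checker follows; the "Answer: <value>" output convention is a hypothetical assumption for illustration, not Ai2's actual format.

```python
import re

def verifiable_reward(completion: str, gold_answer: str) -> float:
    """RLVR-style reward: 1.0 only if the model's final answer can be
    mechanically verified against ground truth, else 0.0. No learned
    reward model is involved, which is the point of 'verifiable'.

    Assumes (hypothetically) that the model ends its completion with
    a line of the form 'Answer: <number>'."""
    m = re.search(r"Answer:\s*(-?\d+(?:\.\d+)?)\s*$", completion.strip())
    if m is None:
        return 0.0  # unparseable output earns no reward
    return 1.0 if m.group(1) == gold_answer else 0.0

print(verifiable_reward("3 + 4 = 7.\nAnswer: 7", "7"))      # 1.0
print(verifiable_reward("I think it's 8.\nAnswer: 8", "7")) # 0.0
```

Because the reward is binary and exact, it cannot be gamed the way a learned reward model can, though it only applies to domains where correctness is mechanically checkable.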