Inference Optimization: AI News & Updates

OpenAI Launches Faster Codex Model Powered by Cerebras' Dedicated AI Chip

OpenAI released GPT-5.3-Codex-Spark, a lightweight version of its Codex coding model designed for faster inference and real-time collaboration. The model runs on Cerebras' Wafer Scale Engine 3 chip, marking the first milestone in the two companies' $10 billion partnership announced last month and a significant step toward integrating specialized hardware into OpenAI's infrastructure for ultra-low-latency responses.

Microsoft Unveils Maia 200 Chip to Accelerate AI Inference and Reduce Dependency on NVIDIA

Microsoft has launched Maia 200, a chip designed specifically for AI inference that packs more than 100 billion transistors and delivers up to 10 petaflops of compute. The chip is central to Microsoft's effort to rein in AI operating costs and reduce reliance on NVIDIA GPUs, competing with similar custom silicon from Google and Amazon. Maia 200 already powers Microsoft's AI models and Copilot, and the company is opening access to developers and AI labs.

SGLang Spins Out as RadixArk at $400M Valuation Amid Inference Infrastructure Boom

RadixArk, a commercial startup built around SGLang, the popular open-source tool for AI model inference optimization, has raised funding at a $400 million valuation in a round led by Accel. The company was founded by former xAI engineer Ying Sheng and grew out of the UC Berkeley lab of Databricks co-founder Ion Stoica; its focus is making AI models run faster and more efficiently. The raise follows a broader wave of capital flowing into inference infrastructure, with vLLM pursuing $160M at a $1B valuation and Baseten securing $300M at a $5B valuation.

Nvidia Unveils Rubin Architecture: Next-Generation AI Computing Platform Enters Full Production

Nvidia officially launched its Rubin computing architecture at CES, describing the state-of-the-art AI hardware as now in full production. Rubin delivers 3.5x faster model training and 5x faster inference than the previous Blackwell generation, and major cloud providers and AI labs have already committed to deploying it. The platform comprises six integrated chips addressing compute, storage, and interconnect bottlenecks, with a particular focus on supporting agentic AI workflows.

DeepSeek Introduces Sparse Attention Model Cutting Inference Costs by Half

DeepSeek released V3.2-exp, an experimental model whose "Sparse Attention" mechanism uses a lightning indexer and fine-grained token selection to sharply reduce inference costs for long-context operations. In preliminary testing, API costs fell by roughly 50% in long-context scenarios, a direct attack on the serving costs that dominate the economics of operating pre-trained models. The open-weight model is freely available on Hugging Face for independent verification and testing.
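The core idea is easy to sketch: a cheap scoring pass (the indexer) picks a small subset of cached tokens, and full attention runs only over that subset, so per-query cost scales with the subset size rather than the full context length. The sketch below is a minimal illustration in NumPy; the function names and the simple dot-product indexer are assumptions made for clarity, not DeepSeek's published implementation.

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def sparse_attention(q, keys, values, idx_q, idx_keys, top_k=64):
    """Attend from one query to only the top_k highest-scoring cached tokens.

    idx_q / idx_keys are small "indexer" projections used for the cheap
    scoring pass (a stand-in for DeepSeek's lightning indexer; the real
    scoring function is learned and more involved).
    """
    scores = idx_keys @ idx_q                  # cheap pass over all L tokens
    keep = np.argsort(scores)[-top_k:]         # indices of the top-k tokens

    # Expensive pass: standard scaled dot-product attention, but only
    # over the selected subset, so cost is O(top_k) instead of O(L).
    attn = softmax(keys[keep] @ q / np.sqrt(q.shape[0]))
    return attn @ values[keep]

# Toy usage: a 4096-token cache, attention computed over just 64 tokens.
rng = np.random.default_rng(0)
L, d, d_idx = 4096, 128, 16
out = sparse_attention(rng.normal(size=d), rng.normal(size=(L, d)),
                       rng.normal(size=(L, d)), rng.normal(size=d_idx),
                       rng.normal(size=(L, d_idx)))
print(out.shape)  # (128,)
```

With 4,096 cached tokens and top_k=64, the expensive attention pass touches under 2% of the cache, which is the lever behind the reported cost reduction; DeepSeek's indexer is trained rather than random, so it keeps the tokens that actually matter.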

Spanish Startup Raises $215M for AI Model Compression Technology Reducing LLM Size by 95%

Spanish startup Multiverse Computing raised a €189 million ($215 million) Series B for its CompactifAI technology, which uses quantum-computing-inspired compression to shrink LLMs by up to 95% without, the company says, any performance loss. It offers compressed versions of open-source models such as Llama and Mistral that run 4x-12x faster and cut inference costs by 50%-80%, enabling deployment on everything from PCs to a Raspberry Pi. Founded by quantum physics professor Román Orús and former banking executive Enrique Lizaso Olmos, the company claims 160 patents and 100 customers worldwide.
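Multiverse has not published CompactifAI's internals in full, but the family it draws on, tensor-network factorization, can be illustrated with its simplest relative: a truncated low-rank factorization that replaces a large weight matrix with two thin factors, cutting both parameters and per-token FLOPs. The sketch below is a hedged stand-in under that assumption, not the company's method, which reportedly factorizes weights into richer tensor-network structures.

```python
import numpy as np

def low_rank_compress(W, keep_ratio=0.05):
    """Replace W (m x n) with thin factors A (m x r) and B (r x n).

    Truncated SVD is the simplest member of the tensor-network family
    CompactifAI is reported to draw on; it is used here only to show
    the parameter-count mechanics, not as the company's algorithm.
    """
    m, n = W.shape
    # Pick rank r so the factors hold ~keep_ratio of W's parameters.
    r = max(1, int(keep_ratio * m * n / (m + n)))
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :r] * s[:r]          # (m, r), singular values folded in
    B = Vt[:r, :]                 # (r, n)
    return A, B

# Toy usage: compress a 4096x4096 layer to ~5% of its parameters.
W = np.random.default_rng(0).normal(size=(4096, 4096))
A, B = low_rank_compress(W, keep_ratio=0.05)
print(A.size + B.size, "params vs", W.size)   # ~836k vs ~16.8M
x = np.random.default_rng(1).normal(size=4096)
y = A @ (B @ x)                               # forward pass uses the factors
```

At keep_ratio=0.05 the two factors hold about 5% of the original parameters, matching the headline 95% reduction; in practice, factorization this aggressive typically requires a brief retraining pass to recover accuracy.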