Multimodal AI News & Updates

OpenAI Launches GPT-5 Pro, Sora 2 Video Model, and Cost-Efficient Voice API at Dev Day

OpenAI announced major API updates at its Dev Day, introducing GPT-5 Pro for high-accuracy reasoning tasks, Sora 2 for advanced video generation with synchronized audio, and a cheaper voice model called gpt-realtime mini. These releases target developers across finance, legal, healthcare, and creative industries, aiming to expand OpenAI's developer ecosystem with more powerful and cost-effective tools.
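For developers, the headline model is reachable through OpenAI's standard Python SDK. A minimal sketch, assuming the model is exposed under the identifier gpt-5-pro via the Responses API:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# "gpt-5-pro" is the identifier implied by the announcement; treat it
# as an assumption until it appears in the official model listing.
response = client.responses.create(
    model="gpt-5-pro",
    input="Review this loan agreement clause for ambiguous liability terms.",
)
print(response.output_text)
```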

OpenAI Launches Sora 2 Video Generator with TikTok-Style Social Platform

OpenAI released Sora 2, an advanced audio and video generation model with improved physics simulation, alongside a new social app called Sora. The platform features a "cameos" function allowing users to insert their own likeness into AI-generated videos and share them on a TikTok-style feed. The app raises significant safety concerns regarding non-consensual content and misuse of personal likenesses.

Mistral Launches Voxtral: Open-Source Speech AI Models Challenge Closed Corporate Systems

French AI startup Mistral has released Voxtral, its first open-source audio model family designed for speech transcription and understanding. The models offer multilingual capabilities, can process up to 30 minutes of audio, and are positioned as affordable alternatives to closed corporate systems at less than half the price of comparable solutions.
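Since Voxtral is served through Mistral's hosted API as well as released as open weights, a transcription call can be sketched with plain HTTP. The endpoint path, form fields, and model name below are assumptions based on Mistral's OpenAI-style REST conventions, not confirmed API details:

```python
import os
import requests

# Hypothetical sketch: endpoint, form fields, and model identifier
# are assumptions modeled on OpenAI-style transcription APIs.
resp = requests.post(
    "https://api.mistral.ai/v1/audio/transcriptions",
    headers={"Authorization": f"Bearer {os.environ['MISTRAL_API_KEY']}"},
    files={"file": open("meeting.mp3", "rb")},
    data={"model": "voxtral-mini-latest"},
)
resp.raise_for_status()
print(resp.json()["text"])
```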

Google Deploys Veo 3 Video Generation AI Model to Global Gemini Users

Google has rolled out its Veo 3 video generation model to Gemini users in over 159 countries, allowing paid subscribers to create 8-second videos from text prompts. The service is limited to 3 videos per day for AI Pro plan subscribers, with image-to-video capabilities planned for future release.

Google Launches Real-Time Voice Conversations with AI-Powered Search

Google has introduced Search Live, enabling back-and-forth voice conversations with its AI Mode search feature using a custom version of Gemini. Users can now engage in free-flowing voice dialogues with Google Search, receiving AI-generated audio responses and exploring web links conversationally. The feature supports multitasking and background operation, with plans to add real-time camera-based queries in the future.

Google Integrates Project Astra's Real-Time Multimodal AI Across Search and Developer APIs

Google announced Project Astra will power new real-time, multimodal AI experiences across Search, Gemini, and developer tools through its Live API. The technology enables low-latency voice and visual interactions, with plans for smart glasses partnerships with Samsung and Warby Parker, though no launch date is set.
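The Live API referenced here is already exposed in the google-genai Python SDK as a bidirectional streaming session. A minimal text-mode sketch, assuming a Live-capable model identifier such as gemini-2.0-flash-live-001:

```python
import asyncio
from google import genai

client = genai.Client()  # reads GEMINI_API_KEY from the environment

async def main():
    # Model name is an assumption; any Live-API-capable Gemini model applies.
    async with client.aio.live.connect(
        model="gemini-2.0-flash-live-001",
        config={"response_modalities": ["TEXT"]},
    ) as session:
        await session.send_client_content(
            turns={"role": "user", "parts": [{"text": "What does low latency enable here?"}]},
            turn_complete=True,
        )
        async for message in session.receive():
            if message.text:
                print(message.text, end="")

asyncio.run(main())
```

In a production Astra-style client, the same session would carry streaming audio or camera frames instead of text, which is where the low-latency design matters.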

Amazon Releases Nova Premier: High-Context AI Model with Mixed Benchmark Performance

Amazon has launched Nova Premier, its most capable AI model in the Nova family, which can process text, images, and videos with a context length of 1 million tokens. While it performs well on knowledge retrieval and visual understanding tests, it lags behind competitors like Google's Gemini on coding, math, and science benchmarks and lacks reasoning capabilities found in models from OpenAI and DeepSeek.
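Nova Premier is served through Amazon Bedrock, so the standard boto3 Converse API applies. A minimal sketch; the model identifier below is an assumption based on Bedrock's naming conventions:

```python
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

# Model ID is an assumption following Bedrock's "<region>.<vendor>.<model>" pattern.
response = bedrock.converse(
    modelId="us.amazon.nova-premier-v1:0",
    messages=[{"role": "user", "content": [{"text": "Summarize the attached filing in three bullets."}]}],
    inferenceConfig={"maxTokens": 1024},
)
print(response["output"]["message"]["content"][0]["text"])
```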

OpenAI Releases Advanced AI Reasoning Models with Enhanced Visual and Coding Capabilities

OpenAI has launched o3 and o4-mini, new AI reasoning models designed to pause and think through questions before responding, with significant improvements in math, coding, reasoning, science, and visual understanding capabilities. The models outperform previous iterations on key benchmarks, can integrate with tools like web browsing and code execution, and uniquely can "think with images" by analyzing visual content during their reasoning process.
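"Thinking with images" means visual inputs enter the model's reasoning chain rather than being captioned once and discarded. Through the Responses API that looks like an ordinary multimodal request; the chart URL below is a placeholder:

```python
from openai import OpenAI

client = OpenAI()

# The image URL is a placeholder; o4-mini inspects the image while reasoning.
response = client.responses.create(
    model="o4-mini",
    input=[{
        "role": "user",
        "content": [
            {"type": "input_text", "text": "What trend does this chart suggest for Q3?"},
            {"type": "input_image", "image_url": "https://example.com/q3-chart.png"},
        ],
    }],
)
print(response.output_text)
```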

Google Plans to Combine Gemini Language Models with Veo Video Generation Capabilities

Google DeepMind CEO Demis Hassabis announced plans to eventually merge their Gemini AI models with Veo video-generating models to create more capable multimodal systems with better understanding of the physical world. This aligns with the broader industry trend toward "omni" models that can understand and generate multiple forms of media, with Hassabis noting that Veo's physical world understanding comes largely from training on YouTube videos.

Meta Launches Advanced Llama 4 AI Models with Multimodal Capabilities and Trillion-Parameter Variant

Meta has released its new Llama 4 family of AI models, including Scout, Maverick, and the unreleased Behemoth, featuring multimodal capabilities and a more efficient mixture-of-experts architecture, sketched below. The models boast improvements in reasoning, coding, and document processing with expanded context windows, while Meta has also adjusted them to refuse fewer controversial questions and achieve better political balance.
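A mixture-of-experts layer gains its efficiency by activating only a subset of parameters per token. The sketch below is a generic top-1 router, not Llama 4's actual design; Meta has not published expert counts or routing details at this granularity:

```python
import torch
import torch.nn as nn

class MoELayer(nn.Module):
    """Generic top-1 mixture-of-experts feed-forward layer (illustrative only)."""

    def __init__(self, d_model: int, d_ff: int, n_experts: int):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, d_model). Each token runs through only its best-scoring
        # expert, so per-token compute stays flat as total parameters grow.
        weights, idx = self.router(x).softmax(dim=-1).max(dim=-1)
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = idx == e
            if mask.any():
                out[mask] = weights[mask].unsqueeze(-1) * expert(x[mask])
        return out

layer = MoELayer(d_model=64, d_ff=256, n_experts=8)
print(layer(torch.randn(10, 64)).shape)  # torch.Size([10, 64])
```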