Multimodal AI News & Updates

Mistral AI Launches Open-Source Voxtral TTS Model for Real-Time Speech Generation

Mistral AI released Voxtral TTS, an open-source text-to-speech model that supports nine languages and can run on edge devices such as smartphones and smartwatches. The model adapts to a new voice from a five-second sample, delivers real-time performance with a 90 ms time-to-first-audio, and preserves voice characteristics across languages. The launch positions Mistral to compete with ElevenLabs, Deepgram, and OpenAI in enterprise voice AI applications such as customer support and sales.

Luma Launches Multimodal AI Agents with Unified Intelligence Architecture

AI video startup Luma has launched Luma Agents, powered by its new Unified Intelligence (Uni-1) model family and designed to handle end-to-end creative work across text, image, video, and audio. The agents can plan, generate, and self-critique multimodal content while coordinating with other AI models, targeting ad agencies, marketing teams, and enterprises. Early deployments with companies such as Publicis Groupe and Adidas show significant cost and time savings: in one case, a $15 million, year-long campaign was reproduced as localized ads in 40 hours for under $20,000.

Moonshot AI Launches Multimodal Open-Source Model Kimi K2.5 with Advanced Coding Capabilities

China's Moonshot AI released Kimi K2.5, a new open-source multimodal model trained on 15 trillion tokens that processes text, images, and video. The model demonstrates competitive performance against proprietary models like GPT-5.2 and Gemini 3 Pro, particularly excelling in coding benchmarks and video understanding tasks. Moonshot also launched Kimi Code, an open-source coding tool that accepts multimodal inputs and integrates with popular development environments.

Meta Developing "Mango" Image/Video Model and "Avocado" Text Model Under New Superintelligence Lab for 2026 Release

Meta is developing two new AI models under its superintelligence lab: "Mango" for image and video generation, and "Avocado" for text-based tasks with improved coding capabilities, both planned for release in the first half of 2026. The company is also exploring world models that can understand visual information and reason without exhaustive training. This effort comes amid leadership changes, researcher departures, and Meta falling behind competitors like OpenAI and Anthropic in the AI race.

Google Releases Gemini 3 Flash as Default Model, Intensifying Competition with OpenAI

Google has launched Gemini 3 Flash, a fast and cost-effective AI model that outperforms its predecessor Gemini 2.5 Flash and matches frontier models like GPT-5.2 on several benchmarks. The model is now the default in Google's Gemini app and features enhanced multimodal capabilities, reasoning, and visual content generation. This release continues the intense competition between Google and OpenAI, with Google processing over 1 trillion tokens daily through its API.

OpenAI Launches GPT-5 Pro, Sora 2 Video Model, and Cost-Efficient Voice API at Dev Day

OpenAI announced major API updates at its Dev Day, introducing GPT-5 Pro for high-accuracy reasoning tasks, Sora 2 for advanced video generation with synchronized audio, and a lower-cost voice model called gpt-realtime mini. These releases target developers across the finance, legal, healthcare, and creative industries, aiming to expand OpenAI's developer ecosystem with more powerful and cost-effective tools.

OpenAI Launches Sora 2 Video Generator with TikTok-Style Social Platform

OpenAI released Sora 2, an advanced audio and video generation model with improved physics simulation, alongside a new social app called Sora. The platform features a "cameos" function allowing users to insert their own likeness into AI-generated videos and share them on a TikTok-style feed. The app raises significant safety concerns regarding non-consensual content and misuse of personal likenesses.

Mistral Launches Voxtral: Open-Source Speech AI Models Challenge Closed Corporate Systems

French AI startup Mistral has released Voxtral, its first open-source audio model family, designed for speech transcription and understanding. The models offer multilingual capabilities, can process up to 30 minutes of audio, and are positioned as affordable alternatives to closed corporate systems, at less than half the price of comparable solutions.

Google Deploys Veo 3 Video Generation AI Model to Global Gemini Users

Google has rolled out its Veo 3 video generation model to Gemini users in over 159 countries, allowing paid subscribers to create 8-second videos from text prompts. The service is limited to 3 videos per day for AI Pro plan subscribers, with image-to-video capabilities planned for future release.

Google Launches Real-Time Voice Conversations with AI-Powered Search

Google has introduced Search Live, enabling back-and-forth voice conversations with its AI Mode search feature using a custom version of Gemini. Users can now engage in free-flowing voice dialogues with Google Search, receiving AI-generated audio responses and exploring web links conversationally. The feature supports multitasking and background operation, with plans to add real-time camera-based queries in the future.