Multimodal AI News & Updates

Google Plans to Combine Gemini Language Models with Veo Video Generation Capabilities

Google DeepMind CEO Demis Hassabis announced plans to eventually merge the company's Gemini AI models with its Veo video-generating models to create more capable multimodal systems with a better understanding of the physical world. This aligns with the broader industry trend toward "omni" models that can understand and generate multiple forms of media, with Hassabis noting that Veo's physical world understanding comes largely from training on YouTube videos.

Meta Launches Advanced Llama 4 AI Models with Multimodal Capabilities and Trillion-Parameter Variant

Meta has introduced its new Llama 4 family of AI models, releasing Scout and Maverick and previewing the still-unreleased Behemoth. The models feature multimodal capabilities and a more efficient mixture-of-experts architecture. They boast improvements in reasoning, coding, and document processing with expanded context windows, while Meta has also adjusted them to refuse fewer controversial questions and achieve better political balance.

Microsoft Enhances Copilot with Web Browsing, Action Capabilities, and Improved Memory

Microsoft has significantly upgraded its Copilot AI assistant with new capabilities including performing actions on websites, remembering user preferences, analyzing real-time video, and creating podcast-like content summaries. These features, similar to those offered by competitors like OpenAI's Operator and Google's Gemini, allow Copilot to complete tasks such as booking tickets and reservations across partner websites.

Elon Musk's xAI Acquires Hotshot to Accelerate Video Generation Capabilities

Elon Musk's AI company, xAI, has acquired Hotshot, a startup specializing in AI-powered video generation technologies similar to OpenAI's Sora. The acquisition positions xAI to integrate video generation capabilities into its Grok platform, with Musk previously indicating that a "Grok Video" model could be released within months.

Baidu Unveils Ernie 4.5 and Ernie X1 Models with Multimodal Capabilities

Chinese tech giant Baidu has launched two new AI models: Ernie 4.5, featuring enhanced emotional intelligence for understanding memes and satire, and Ernie X1, a reasoning model claimed to match DeepSeek R1's performance at half the cost. Both models offer multimodal capabilities for processing text, images, video, and audio, with plans for a more advanced Ernie 5 model later this year.

Google DeepMind Launches Gemini Robotics Models for Advanced Robot Control

Google DeepMind has announced new AI models called Gemini Robotics designed to control physical robots for tasks like object manipulation and environmental navigation via voice commands. The models reportedly demonstrate generalization capabilities across different robotics hardware and environments, with DeepMind releasing a slimmed-down version called Gemini Robotics-ER for researchers along with a safety benchmark named Asimov.

Amazon Unveils 'Model Agnostic' Alexa+ with Agentic Capabilities

Amazon introduced Alexa+, a new AI assistant that uses a 'model agnostic' approach to select the best AI model for each specific task. The system utilizes Amazon's Bedrock cloud platform, its in-house Nova models, and partnerships with companies like Anthropic, enabling new capabilities such as website navigation, service coordination, and interaction with thousands of devices and services.

Amazon Launches AI-Powered Alexa+ with Enhanced Personalization and Capabilities

Amazon has announced Alexa+, a comprehensively redesigned AI assistant powered by generative AI that offers enhanced personalization and contextual understanding. The upgraded assistant can access personal data like schedules and preferences, interpret visual information, understand tone, process documents, and integrate deeply with Amazon's smart home ecosystem.

Alibaba Launches Qwen2.5-VL Models with PC and Mobile Control Capabilities

Alibaba's Qwen team released new AI models called Qwen2.5-VL that can perform various text and image analysis tasks as well as control PCs and mobile devices. According to benchmarks, the top model outperforms offerings from OpenAI, Anthropic, and Google on various evaluations, though it appears to have content restrictions aligned with Chinese regulations.