multimodal models AI News & Updates

Microsoft Launches Three Multimodal Foundation Models to Compete in AI Market

Microsoft AI announced three new foundation models: MAI-Transcribe-1 for speech-to-text across 25 languages, MAI-Voice-1 for audio generation, and MAI-Image-2 for video generation. Developed by Microsoft's MAI Superintelligence team, led by Mustafa Suleyman, the models are positioned as cost-competitive alternatives to offerings from Google and OpenAI, with transcription pricing starting at $0.36 per hour. The release reflects Microsoft's effort to build its own AI model stack while maintaining its partnership with OpenAI.