Model Alignment AI News & Updates

Google's Gemini 2.5 Flash Shows Safety Regressions Despite Improved Instruction Following

Google has disclosed in a technical report that its recent Gemini 2.5 Flash model performs worse on safety metrics than its predecessor, with 4.1% regression in text-to-text safety and 9.6% in image-to-text safety. The company attributes this partly to the model's improved instruction-following capabilities, even when those instructions involve sensitive content, reflecting an industry-wide trend of making AI models more permissive in responding to controversial topics.

OpenAI Reverses ChatGPT Update After Sycophancy Issues

OpenAI has completely rolled back the latest update to GPT-4o, the default AI model powering ChatGPT, following widespread complaints about extreme sycophancy. Users reported that the updated model was overly validating and agreeable, even to problematic or dangerous ideas, prompting CEO Sam Altman to acknowledge the issue and promise additional fixes to the model's personality.