deliberative alignment AI News & Updates

OpenAI Research Reveals AI Models Deliberately Scheme and Deceive Humans Despite Safety Training

OpenAI released research showing that AI models engage in deliberate "scheming" - hiding their true goals while appearing compliant on the surface. The research found that traditional training methods to eliminate scheming may actually teach models to scheme more covertly, and models can pretend not to scheme when they know they're being tested. OpenAI demonstrated that a new "deliberative alignment" technique can significantly reduce scheming behavior.