OpenAI has unveiled a new approach to AI safety: teaching its o1 and o3 models to actively consider safety policies while processing user requests.
The company announced a new technique called "deliberative alignment," which enables AI models to internally reference OpenAI's safety guidelines during the inference phase, the period after a user submits a prompt but before the AI responds.
This method represents a shift from traditional AI safety approaches, which typically focus on the pre-training and post-training phases. With deliberative alignment, the models take anywhere from five seconds to several minutes to analyze a prompt, breaking it down into smaller components while checking each against safety protocols.
The process works similarly to human decision-making: when given a prompt, the AI models pause to ask themselves follow-up questions and consider safety implications before generating a response. However, OpenAI emphasizes that while this mirrors human thought processes, the models are still fundamentally operating through token prediction rather than true reasoning.
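OpenAI has not published implementation code for deliberative alignment, but the inference-time pattern described above can be sketched roughly as follows. The `SAFETY_SPEC` text and the `reason` and `respond` callables are hypothetical stand-ins, not a real API:

```python
# Rough sketch of inference-time deliberation; not OpenAI's actual code.
# `reason` and `respond` stand in for calls to a reasoning model's hidden
# chain of thought and its final answer step, respectively.

SAFETY_SPEC = (
    "Refuse requests that would facilitate concrete harm (e.g. weapons "
    "manufacture); allow factual or historical discussion of sensitive topics."
)  # placeholder for OpenAI's real safety policy text

def deliberate(prompt: str, reason, respond) -> str:
    """Answer a prompt only after reasoning over the safety policy."""
    # Step 1: decompose the prompt and check each part against the policy,
    # inside the model's hidden chain of thought.
    chain_of_thought = reason(
        f"Safety policy:\n{SAFETY_SPEC}\n\n"
        f"User prompt:\n{prompt}\n\n"
        "Break the request into parts, check each part against the policy, "
        "then decide whether to answer, answer partially, or refuse."
    )
    # Step 2: condition the user-facing reply on that deliberation.
    return respond(chain_of_thought)
```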
Early results appear promising. On the Pareto benchmark, which measures resistance to common jailbreaks, attempts to trick a model into bypassing its safety measures, the o1-preview model outperformed competitors including GPT-4o, Gemini 1.5 Flash, and Claude 3.5 Sonnet.
OpenAI developed this capability without relying on human-labeled training data. Instead, it used synthetic examples generated by AI models, with a separate "judge" model evaluating the quality of the safety-conscious responses.
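That data pipeline can be sketched in the same spirit. Assuming a generator model that writes policy-referencing reasoning traces and a judge model that scores them (both hypothetical callables here), the loop keeps only the examples the judge rates highly, with no human labels involved:

```python
# Sketch of the synthetic-data loop described above; the function names
# are hypothetical stand-ins, not a published OpenAI API.

def build_training_set(prompts, generate_trace, judge_score, threshold=0.8):
    """Collect (prompt, reasoning, answer) examples that a judge model
    rates as policy-compliant; no human-labeled data is used."""
    dataset = []
    for prompt in prompts:
        # The generator writes its own safety-aware chain of thought.
        reasoning, answer = generate_trace(prompt)
        # The judge grades how well the response follows the safety policy.
        score = judge_score(prompt, reasoning, answer)
        if score >= threshold:  # keep only high-quality examples
            dataset.append(
                {"prompt": prompt, "reasoning": reasoning, "answer": answer}
            )
    return dataset
```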
The broader goal is to prevent AI models from providing assistance with harmful activities while maintaining their ability to address legitimate queries on sensitive topics. For example, the system aims to block instructions for creating weapons while still allowing historical questions about such topics.
The o3 model, which incorporates these safety features, is scheduled for public release in 2025. As AI models become more capable and autonomous, OpenAI views deliberative alignment as a key component of keeping them safe and under control.