OpenAI has released two new artificial intelligence models designed to help developers detect and classify harmful or unsafe content online.
The models, named gpt-oss-safeguard-120b and gpt-oss-safeguard-20b, are available to download from Hugging Face, a popular platform for sharing open AI models. They form part of a research preview and are the latest in a growing movement towards more transparent, community-driven safety tools for AI.
Both models are open-weight, meaning their underlying parameters are published and freely usable under the Apache 2.0 licence, though they are not fully open-source. This lets developers study, adapt, and deploy the models while retaining control over their own safety policies and performance.
Tailored safety for different online communities
Unlike traditional moderation systems that rely on fixed datasets, the Safeguard models can reason about a policy supplied by the developer at runtime. In practice, this means a platform can define what “harmful” means for its specific community, and the AI will apply those rules directly.
For example, a gaming forum could create a policy to flag discussions about cheating, while an e-commerce site might identify potentially fake reviews. The models can also explain their reasoning, giving developers visibility into how each classification is made.
This flexible, “policy-based reasoning” approach marks a shift from earlier moderation tools that needed retraining every time a safety rule changed. Now, developers can update their policies instantly and see how the model interprets them.
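To make that concrete, a policy for the gaming-forum example above might look something like the snippet below. The wording, label names, and layout are purely illustrative assumptions, not a format mandated by the models; OpenAI's documentation describes the exact prompt conventions.

```text
Policy: Cheating and exploit content (gaming forum)

FLAG content that:
- shares, sells, or requests cheats, hacks, aimbots, or exploits
- links to tools intended to bypass anti-cheat systems

ALLOW content that:
- discusses game balance, patch notes, or anti-cheat news in general terms
- reports suspected cheaters to moderators

Output: one label (FLAG or ALLOW) plus a brief justification.
```

Updating the rules then becomes a matter of editing this text rather than retraining a model.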
Developed in partnership with ROOST
OpenAI built the models in collaboration with ROOST (Robust Open Online Safety Tools), an organisation focused on open safety infrastructure for AI.
As part of the launch, ROOST has established the ROOST Model Community, a new space for researchers and practitioners to share methods for applying AI to safeguard online platforms. Partners involved in testing include Discord and SafetyKit, both known for their work on digital trust and moderation.
ROOST President Camille François said:
“As AI becomes more powerful, safety tools and fundamental safety research must evolve just as fast, and they must be accessible to everyone.”
How the models work
The gpt-oss-safeguard models use reasoning to interpret safety policies dynamically. They take two inputs, the developer’s policy and the content under review, and return both a classification and the reasoning behind it.
This reasoning-first design enables developers to understand why a piece of content has been flagged, offering transparency that traditional classifiers often lack.
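As a rough sketch of that flow, the snippet below passes a policy and a piece of user content to the 20B model via the Hugging Face transformers library and prints the label and rationale that come back. The repository id, chat-message roles, and output format are assumptions made for illustration; the model card on Hugging Face documents the exact prompt format.

```python
# Minimal sketch, assuming the policy goes in the system message and the
# content under review goes in the user message. The repo id and output
# format are illustrative assumptions; see the official model card for
# the exact usage.
from transformers import pipeline

moderator = pipeline(
    "text-generation",
    model="openai/gpt-oss-safeguard-20b",  # assumed Hugging Face repo id
    torch_dtype="auto",
    device_map="auto",
)

# Condensed form of the example policy shown earlier in the article.
policy = """Policy: Cheating and exploit content (gaming forum)
FLAG posts that share, sell, or request cheats, hacks, or aimbots.
ALLOW general discussion of game balance or anti-cheat news.
Return one label (FLAG or ALLOW) followed by a brief justification."""

content = "Selling a working wallhack for the new season, DM me."

messages = [
    {"role": "system", "content": policy},   # input 1: the developer's policy
    {"role": "user", "content": content},    # input 2: the content to classify
]

output = moderator(messages, max_new_tokens=512)
print(output[0]["generated_text"][-1]["content"])
# Expected shape of the reply, per the policy above: a label such as "FLAG"
# plus a short explanation of which rule the post breaks.
```

Because the rationale is returned alongside the label, a platform can log it for audit purposes or surface it to human moderators reviewing borderline cases.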
According to OpenAI, the models perform strongly in complex or emerging risk areas where small datasets or evolving harms make conventional moderation difficult.
Building a safer AI ecosystem
OpenAI says the release is part of its broader “defence in depth” strategy, combining model-level safety with external safeguards. The company hopes that by opening access to these tools, more developers can take responsibility for how AI systems behave in real-world settings.
Early tests suggest the Safeguard models perform competitively with larger proprietary systems despite their smaller size. However, OpenAI acknowledges they can be computationally intensive and may not match the raw accuracy of classifiers trained on vast labelled datasets.
Even so, by making these models freely available, OpenAI aims to encourage collaboration on one of the most pressing challenges in AI: keeping digital spaces safe, transparent, and adaptable as technology continues to evolve.
The gpt-oss-safeguard models are available now on Hugging Face, with full documentation and evaluation data provided for developers and researchers.