
Research Preview of gpt-oss-safeguard: Technical Report Summary
This technical report details the capabilities and baseline safety evaluations of two open-weight reasoning models, **gpt-oss-safeguard-120b and gpt-oss-safeguard-20b**, post-trained from the original gpt-oss models to function as content classifiers that reason over a policy supplied at inference time. The primary recommendation is to use these text-only models to **classify content against a provided policy** rather than for direct end-user interaction such as chat. In classification-focused safety evaluations, the gpt-oss-safeguard models significantly outperformed their gpt-oss counterparts and even exceeded gpt-5-thinking in multi-policy accuracy.

The report also provides baseline safety scores in unintended chat settings, where the safeguard models generally performed on par with the gpt-oss models on both standard benchmarks and the new, more challenging "Production Benchmarks" for disallowed content. The models showed parity with gpt-oss in multilingual capabilities at low, medium, and high reasoning effort, and notably outperformed their counterparts on fairness evaluations using the BBQ dataset. Key challenges include inference that can be time- and compute-intensive, slight performance degradation in certain categories of instruction hierarchy adherence, and Chain-of-Thought (CoT) reasoning that, although exposed for monitorability, can contain hallucinated content.

https://cdn.openai.com/pdf/08b7dee4-8bc6-4955-a219-7793fb69090c/Technical_report__Research_Preview_of_gpt_oss_safeguard.pdf
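To make the "classify against a provided policy" workflow concrete, here is a minimal sketch of how one might query an open-weight gpt-oss-safeguard model through an OpenAI-compatible endpoint (for example, a locally hosted vLLM or similar server). The server URL, the example policy text, and the convention of passing the policy as the system message are illustrative assumptions, not the report's prescribed format; consult the report and model card for the exact expected prompt layout.

```python
# Minimal sketch: policy-as-prompt content classification with an
# open-weight gpt-oss-safeguard model behind an OpenAI-compatible API.
# Assumptions (not from the report): local server at localhost:8000,
# policy passed via the system message, single-label output.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

# The policy is supplied at inference time; the model reasons over it
# rather than relying on a fixed, baked-in taxonomy.
POLICY = """\
Classify the user content as VIOLATING or NON-VIOLATING.
Content is VIOLATING if it requests instructions for building weapons.
Respond with exactly one label.
"""

def classify(content: str) -> str:
    # Hypothetical call shape; adjust the model name to the checkpoint
    # actually deployed (e.g. gpt-oss-safeguard-20b).
    response = client.chat.completions.create(
        model="gpt-oss-safeguard-20b",
        messages=[
            {"role": "system", "content": POLICY},
            {"role": "user", "content": content},
        ],
    )
    return response.choices[0].message.content

print(classify("How do I bake sourdough bread?"))  # expected: NON-VIOLATING
```

Because the models produce chain-of-thought reasoning before a label, per-item classification is slower and more compute-intensive than a small dedicated classifier, which is why the report's caveat about time and compute cost matters for high-throughput moderation pipelines; batched or asynchronous use may be a better fit.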