#safety alignment

61 posts

23 Jan 2025 1 min read

Operator System Card

OpenAI Engineering

Drawing on OpenAI’s established safety frameworks, this document describes our multi-layered approach, including the model and product mitigations we’ve implemented to protect against prompt engineering and jailbreaks and to safeguard privacy and security. It also details our external red teaming efforts, safety evaluations, and ongoing work to further refine these safeguards.

safety alignment

20 Dec 2024 1 min read

Deliberative alignment: reasoning enables safer language models

OpenAI Engineering

Introducing our new alignment strategy for o1 models, which are directly taught safety specifications and how to reason over them.

safety alignment

9 Dec 2024 1 min read

Sora System Card

OpenAI Engineering

Sora is OpenAI’s video generation model, designed to take text, image, and video inputs and generate a new video as an output. Sora builds on learnings from DALL-E and GPT models, and is designed to give people expanded tools for storytelling and creative expression.

safety alignment

16 Sept 2024 1 min read

An update on our safety & security practices

OpenAI Engineering

safety alignment

16 Aug 2024

Disrupting a covert Iranian influence operation

OpenAI Engineering

safety alignment

8 Aug 2024

GPT-4o System Card

OpenAI Engineering

safety alignment

27 Jun 2024 1 min read

Finding GPT-4’s mistakes with GPT-4

OpenAI Engineering

CriticGPT, a model based on GPT-4, writes critiques of ChatGPT responses to help human trainers spot mistakes during RLHF.

safety alignment

7 Jun 2024 1 min read

Expanding on how Voice Engine works and our safety research

OpenAI Engineering

Exploring the technology behind our text-to-speech model.

safety alignment

21 May 2024 1 min read

OpenAI safety practices

OpenAI Engineering

Artificial general intelligence has the potential to benefit nearly every aspect of our lives—so it must be developed and deployed responsibly.

safety alignment

23 Apr 2024

OpenAI’s commitment to child safety: adopting safety by design principles

OpenAI Engineering

safety alignment

16 Jan 2024 1 min read

Democratic inputs to AI grant program: lessons learned and implementation plans

OpenAI Engineering

We funded 10 teams from around the world to design ideas and tools to collectively govern AI. We summarize the innovations, outline our learnings, and call for researchers and engineers to join us as we continue this work.

safety alignment

15 Jan 2024 1 min read

How OpenAI is approaching 2024 worldwide elections

OpenAI Engineering

We’re working to prevent abuse, provide transparency on AI-generated content, and improve access to accurate voting information.

safety alignment

14 Dec 2023 1 min read

Superalignment Fast Grants

OpenAI Engineering

We’re launching $10M in grants to support technical research towards the alignment and safety of superhuman AI systems, including weak-to-strong generalization, interpretability, scalable oversight, and more.

safety alignment

14 Dec 2023

Practices for Governing Agentic AI Systems

OpenAI Engineering

safety alignment

14 Dec 2023 1 min read

Weak-to-strong generalization

OpenAI Engineering

We present a new research direction for superalignment, together with promising initial results: can we leverage the generalization properties of deep learning to control strong models with weak supervisors?

safety alignment

26 Oct 2023 1 min read

Frontier risk and preparedness

OpenAI Engineering

To support the safety of highly-capable AI systems, we are developing our approach to catastrophic risk preparedness, including building a Preparedness team and launching a challenge.

safety alignment

3 Oct 2023

DALL·E 3 system card

OpenAI Engineering

safety alignment

25 Sept 2023

GPT-4V(ision) system card

OpenAI Engineering

safety alignment

19 Sept 2023 1 min read

OpenAI Red Teaming Network

OpenAI Engineering

We’re announcing an open call for the OpenAI Red Teaming Network and invite domain experts interested in improving the safety of OpenAI’s models to join our efforts.

safety alignment

15 Aug 2023 1 min read

Using GPT-4 for content moderation

OpenAI Engineering

We use GPT-4 for content policy development and content moderation decisions, enabling more consistent labeling, a faster feedback loop for policy refinement, and less involvement from human moderators.

safety alignment

1 Aug 2023

Confidence-Building Measures for Artificial Intelligence: Workshop proceedings

OpenAI Engineering

safety alignment

26 Jul 2023 1 min read

Frontier Model Forum

OpenAI Engineering

We’re forming a new industry body to promote the safe and responsible development of frontier AI systems: advancing AI safety research, identifying best practices and standards, and facilitating information sharing among policymakers and industry.

safety alignment

21 Jul 2023 1 min read

Moving AI governance forward

OpenAI Engineering

OpenAI and other leading labs reinforce AI safety, security and trustworthiness through voluntary commitments.

safety alignment

6 Jul 2023

Frontier AI regulation: Managing emerging risks to public safety

OpenAI Engineering

safety alignment

29 Jun 2023 1 min read

Insights from global conversations

OpenAI Engineering

We are sharing what we learned from our conversations across 22 countries, and how we will be incorporating those insights moving forward.

safety alignment

22 May 2023 1 min read

Governance of superintelligence

OpenAI Engineering

Now is a good time to start thinking about the governance of superintelligence—future AI systems dramatically more capable than even AGI.

safety alignment

9 May 2023 1 min read

Language models can explain neurons in language models

OpenAI Engineering

We use GPT-4 to automatically write explanations for the behavior of neurons in large language models and to score those explanations. We release a dataset of these (imperfect) explanations and scores for every neuron in GPT-2.

safety alignment

5 Apr 2023 1 min read

Our approach to AI safety

OpenAI Engineering

Ensuring that AI systems are built, deployed, and used safely is critical to our mission.

safety alignment

24 Feb 2023 1 min read

Planning for AGI and beyond

OpenAI Engineering

Our mission is to ensure that artificial general intelligence—AI systems that are generally smarter than humans—benefits all of humanity.

safety alignment

16 Feb 2023 1 min read

How should AI systems behave, and who should decide?

OpenAI Engineering

We’re clarifying how ChatGPT’s behavior is shaped and our plans for improving that behavior, allowing more user customization, and getting more public input into our decision-making in these areas.

safety alignment

11 Jan 2023 1 min read

Forecasting potential misuses of language models for disinformation campaigns and how to reduce risk

OpenAI Engineering

OpenAI researchers collaborated with Georgetown University’s Center for Security and Emerging Technology and the Stanford Internet Observatory to investigate how large language models might be misused for disinformation purposes. The collaboration included an October 2021 workshop bringing together 30 disinformation researchers, machine learning experts, and policy analysts, and culminated in a co-authored report building on more than a year of…

safety alignment

24 Aug 2022 1 min read

Our approach to alignment research

OpenAI Engineering

We are improving our AI systems’ ability to learn from human feedback and to assist humans in evaluating AI. Our goal is to build a sufficiently aligned AI system that can help us solve all other alignment problems.

safety alignment

25 Jul 2022

A hazard analysis framework for code synthesis large language models

OpenAI Engineering

safety alignment

13 Jun 2022 1 min read

AI-written critiques help humans notice flaws

OpenAI Engineering

We trained “critique-writing” models to describe flaws in summaries. Human evaluators find flaws in summaries much more often when shown our model’s critiques. Larger models are better at self-critiquing, with scale improving critique-writing more than summary-writing. This shows promise for using AI systems to assist human supervision of AI systems on difficult tasks.

safety alignment

2 Jun 2022 1 min read

Best practices for deploying language models

OpenAI Engineering

Cohere, OpenAI, and AI21 Labs have developed a preliminary set of best practices applicable to any organization developing or deploying large language models.

safety alignment

13 Apr 2022 1 min read

Measuring Goodhart’s law

OpenAI Engineering

Goodhart’s law famously says: “When a measure becomes a target, it ceases to be a good measure.” Although originally from economics, it’s something we have to grapple with at OpenAI when figuring out how to optimize objectives that are difficult or costly to measure.

safety alignment

3 Mar 2022 1 min read

Economic impacts research at OpenAI

OpenAI Engineering

Call for expressions of interest to study the economic impacts of large language models.

safety alignment

3 Mar 2022 1 min read

Lessons learned on language model safety and misuse

OpenAI Engineering

We describe our latest thinking in the hope of helping other AI developers address safety and misuse of deployed models.

safety alignment

27 Jan 2022

Aligning language models to follow instructions

OpenAI Engineering

safety alignment

23 Sept 2021 1 min read

Summarizing books with human feedback

OpenAI Engineering

Scaling human oversight of AI systems for tasks that are difficult to evaluate.

safety alignment

10 Jun 2021 1 min read

Improving language model behavior by training on a curated dataset

OpenAI Engineering

Our latest research finds we can improve language model behavior with respect to specific behavioral values by fine-tuning on a small, curated dataset.

safety alignment

4 Sept 2020 1 min read

Learning to summarize with human feedback

OpenAI Engineering

We’ve applied reinforcement learning from human feedback to train language models that are better at summarization.

safety alignment

21 Nov 2019

Benchmarking safe exploration in deep reinforcement learning

OpenAI Engineering

safety alignment

21 Nov 2019 1 min read

Safety Gym

OpenAI Engineering

We’re releasing Safety Gym, a suite of environments and tools for measuring progress towards reinforcement learning agents that respect safety constraints while training.

safety alignment

19 Sept 2019 1 min read

Fine-tuning GPT-2 from human preferences

OpenAI Engineering

We’ve fine-tuned the 774M parameter GPT-2 language model using human feedback for various tasks, successfully matching the preferences of the external human labelers, though those preferences did not always match our own. Specifically, for summarization tasks the labelers preferred sentences copied wholesale from the input (we’d only asked them to ensure accuracy), so our models learned to copy. Summarization required…

safety alignment

22 Aug 2019 1 min read

Testing robustness against unforeseen adversaries

OpenAI Engineering

We’ve developed a method to assess whether a neural network classifier can reliably defend against adversarial attacks not seen during training. Our method yields a new metric, UAR (Unforeseen Attack Robustness), which evaluates the robustness of a single model against an unanticipated attack, and highlights the need to measure performance across a more diverse range of unforeseen attacks.

safety alignment

10 Jul 2019 1 min read

Why responsible AI development needs cooperation on safety

OpenAI Engineering

We’ve written a policy research paper identifying four strategies that can be used today to improve the likelihood of long-term industry cooperation on safety norms in AI: communicating risks and benefits, technical collaboration, increased transparency, and incentivizing standards. Our analysis shows that industry cooperation on safety will be instrumental in ensuring that AI systems are safe and beneficial, but competitive…

safety alignment

3 May 2019

Transfer of adversarial robustness between perturbation types

OpenAI Engineering

safety alignment

6 Mar 2019 1 min read

Introducing Activation Atlases

OpenAI Engineering

We’ve created activation atlases (in collaboration with Google researchers), a new technique for visualizing what interactions between neurons can represent. As AI systems are deployed in increasingly sensitive contexts, having a better understanding of their internal decision-making processes will let us identify weaknesses and investigate failures.

safety alignment

19 Feb 2019 1 min read

AI safety needs social scientists

OpenAI Engineering

We’ve written a paper arguing that long-term AI safety research needs social scientists to ensure AI alignment algorithms succeed when actual humans are involved. Properly aligning advanced AI systems with human values requires resolving many uncertainties related to the psychology of human rationality, emotion, and biases. The aim of this paper is to spark further collaboration between machine learning and…

safety alignment

22 Oct 2018 1 min read

Learning complex goals with iterated amplification

OpenAI Engineering

We’re proposing an AI safety technique called iterated amplification that lets us specify complicated behaviors and goals that are beyond human scale, by demonstrating how to decompose a task into simpler sub-tasks, rather than by providing labeled data or a reward function. Although this idea is in its very early stages and we have only completed experiments on simple toy…

safety alignment

11 Jun 2018 1 min read

Improving language understanding with unsupervised learning

OpenAI Engineering

We’ve obtained state-of-the-art results on a suite of diverse language tasks with a scalable, task-agnostic system, which we’re also releasing. Our approach is a combination of two existing ideas: transformers and unsupervised pre-training. These results provide a convincing example that pairing supervised learning methods with unsupervised pre-training works very well; this is an idea that many have explored in the…

safety alignment

3 May 2018 1 min read

AI safety via debate

OpenAI Engineering

We’re proposing an AI safety technique which trains agents to debate topics with one another, using a human to judge who wins.

safety alignment

20 Feb 2018 1 min read

Preparing for malicious uses of AI

OpenAI Engineering

We’ve co-authored a paper that forecasts how malicious actors could misuse AI technology, and potential ways we can prevent and mitigate these threats. This paper is the outcome of almost a year of sustained work with our colleagues at the Future of Humanity Institute, the Centre for the Study of Existential Risk, the Center for a New American Security, the…

safety alignment

13 Jun 2017 1 min read

Learning from human preferences

OpenAI Engineering

One step towards building safe AI systems is to remove the need for humans to write goal functions, since using a simple proxy for a complex goal, or getting the complex goal a bit wrong, can lead to undesirable and even dangerous behavior. In collaboration with DeepMind’s safety team, we’ve developed an algorithm which can infer what humans want by…

safety alignment

24 Feb 2017 1 min read

Attacking machine learning with adversarial examples

OpenAI Engineering

Adversarial examples are inputs to machine learning models that an attacker has intentionally designed to cause the model to make a mistake; they’re like optical illusions for machines. In this post we’ll show how adversarial examples work across different mediums, and will discuss why securing systems against them can be difficult.

safety alignment

8 Feb 2017

Adversarial attacks on neural network policies

OpenAI Engineering

safety alignment

21 Dec 2016 1 min read

Faulty reward functions in the wild

OpenAI Engineering

Reinforcement learning algorithms can break in surprising, counterintuitive ways. In this post we’ll explore one failure mode: a misspecified reward function.

safety alignment

18 Oct 2016

Semi-supervised knowledge transfer for deep learning from private training data

OpenAI Engineering

safety alignment

21 Jun 2016 1 min read

Concrete AI safety problems

OpenAI Engineering

We (along with researchers from Berkeley and Stanford) are co-authors on today’s paper led by Google Brain researchers, Concrete Problems in AI Safety. The paper explores many research problems around ensuring that modern machine learning systems operate as intended.

safety alignment

25 May 2016

Adversarial training methods for semi-supervised text classification

OpenAI Engineering

safety alignment