The last year has been a transformative year for AI. ChatGPT was released November 30th, 2022 and sparked a media firestorm. By February of this year, a little over 2 months after launch, ChatGPT had 100 million active users.
Experts recognise the risk that AI poses to our economy, society, and existence. How should we respond? What solutions have been proposed? How can we each contribute to a safe future? In this article, we’ll explore how researchers and policymakers are working to make AI safe, and how you can help.
Artificial intelligence threatens to overwhelm our economy, transform our society, and even to end our existence.
AI is already more capable than humans at many tasks, such as data processing, pattern recognition, and games like chess, Go, and Starcraft. Recently, we’ve even seen advanced AI start to approach human-level skill at more creative tasks, such as writing and generating images. While AI can’t currently write or create art better than the best humans, it can write and create art just about as well as most humans. Crucially, AI can do all these tasks much faster and cheaper than a person can, and we’re already seeing this affect jobs and economies.
In the near future, we may witness the emergence of AI models that are more capable than humans at most tasks – that are, for most intents and purposes, smarter than us. This could lead to artificial general intelligence, or AGI: AI that can learn, that can develop its own goals and take actions to achieve them as flexibly as humans. Such an AGI could overthrow humanity as the dominant force on Earth, as custodians of our own future. This will be a profound and unprecedented change in humanity’s story – it could even be the end of our story.
So, what should we do about it?
Some argue that the safest approach to AI is to stop developing it. The problem is, so many now recognise the opportunities that AI can afford them that this seems infeasible – like asking Rockefeller to leave the oil in the ground and move on to some other project. Further, convincing amenable organizations to stop developing may just mean that the most powerful AI is developed by the most heedless and reckless developers.
That’s not to say that coordination and efforts to slow development aren’t worthwhile. Like many concerned with the future of AI, I signed the petition calling for a pause to large AI experiments, and I believe most AI labs do recognise and care about the risk of AI. I also believe we shouldn’t dismiss AI’s promise: it will be transformative, and if we develop and deploy it safely, that transformation can be for the good of humanity.
It’s naive to rely on people not simply choosing not to develop dangerous AI. AI development will continue, and risks will arise. Instead, we are morally obligated to figure out how to safely navigate a future that features powerful AI. In this article, I hope to convince you that the problem truly is tractable; that there are promising avenues of technical research, practical policy proposals, and plenty for you to contribute towards ensuring humanity not only survives but flourishes in an age of AI.
Technical AI safety research is focused on how we build, train, deploy, test, and monitor AI safely and ethically. The field is too broad to fully cover here, but we’ll go through some of the most significant areas of study and explore how ethical questions are entwined with technical work. This research may seem esoteric, but the next few decades of AI safety research will be pivotal for the future of humanity, and the more people understand about it, the better equipped we’ll be to advocate for and prioritize the right research. For context, we’ll start with a brief summary of how AI works.
At its broadest, artificial intelligence simply refers to computerized problem-solving, but most people use the term to refer to machine learning, where models algorithmically learn from data rather than being completely pre-programmed. The AI models that’ve been in the news this year – Large Language Models (LLMs) such as OpenAI’s GPT-4, Google’s Bard, and Anthropic’s Claude – are part of a subfield called deep learning, which focuses on a specific type of model called a neural network, inspired by the structure of the human brain. There’s plenty of variation in how each AI learns, but the key is that AI models are trained to spot patterns in data.
Language models like GPT-4 are trained on vast swathes of text from humanity’s collective corpus, from Shakespeare to Wikipedia to Facebook and beyond. The model processes the text and spots patterns, which it then uses to try to predict the next word in a sequence. When it makes a mistake, its internal structure is adjusted, improving its success rate.
This is followed by reinforcement learning from human feedback (RLHF), where engineers and testers interact with the model and give it positive and negative feedback. This is necessary to fine-tune the model’s outward behavior, ensuring that it aligns more closely with human values and produces more contextually appropriate responses. After that, the model is wrapped up in some user-friendly interface and deployed into the world (for example, the ChatGPT interface is built around the core GPT-4 model).
There are opportunities to improve safety throughout the development, training, and deployment phases: some focus on the underlying structure of the model, some on the training period and how the AI learns, some on the RLHF, and so on. Below, we’ll summarize three of the most popular avenues of research, but again, this isn’t exhaustive.
The goal of interpretability research is to give us tools and techniques to better understand why AIs do what they do.
Many powerful AIs are called black boxes, as their inner machinations are beyond human comprehension due to their sheer complexity. As with the human brain, we can identify individual neurons and when they activate, we can recognise large-scale structures and theorize how they might interact, but we generally can’t answer questions like “why did it make that decision?”.
This can cause problems. For example, people often have a right to an explanation. If your insurer denies you coverage, they may be legally obliged to explain why. As well as a moral right, this ensures the insurer isn’t discriminating against you based on protected characteristics such as race, gender, age, religion, disability, and so on. These characteristics may, sometimes, be correlated with outcomes of interest, but as a society, we’ve deemed it immoral and illegal to discriminate based on them. It’s hard to ensure an AI is being fair when we don’t know how it makes decisions.
Interpretability is a diverse field. For example, one branch focuses on explainable AI, where models are designed to provide better insight into their decision-making process, increasing transparency and accountability. Another branch focuses on mechanistic interpretability, where researchers dig into the nuts and bolts of a specific model and figure out how it works, neuron by neuron.
However, most interpretability research isn’t focused on safety specifically. Insights into how models work are useful for building more powerful models, and so companies that develop AI often invest in interpretability research to improve their products.
The goal of AI alignment is to develop AI that reflects humanity’s values.
AIs learn to pursue goals. During training, their programming is adjusted to fit a reward function. For example, GPT-4 was trained to produce coherent text by “rewarding” it when it correctly predicted the next word in a sequence from the training data. AI models adapt to maximize these rewards, defining the goals they’ll pursue in deployment.
Problems arise when these goals don’t align with human goals. For example, GPT-4 was trained to predict the next word, not to create truthful, correct sentences. Even if the corpus it was trained on contained no false information, the model operates probabilistically. It can construct new, false sentences from patterns derived from old, correct sentences. This is a simple example of misalignment: users might want to write a great essay, but GPT-4 only wants to predict the most likely next word.
If we develop a superintelligence or an AGI, we need its goals to align with ours. A classic (if extreme) example of the alternative is the paperclip maximizer: a superintelligence whose goal is simply to build paperclips. A paperclip maximizer would quickly recognize that life will only be an inconvenience for its plans, and so we may live just long enough to see it wrest control of the internet from us and 3D-print its first drones-that convert-all-matter-to-paperclips factories.
In reality, misalignment can be much subtler than this, though even subtle misalignment could be catastrophic with a superintelligence. Most goals are easier to achieve the more power and control the model can gain, leading to emergent power-seeking behaviour that could harm humanity. Researchers try to solve this problem by improving our ability to instill complex values into the AI during building and training, while avoiding power-seeking behaviour or other issues like AI that deceives users in order to achieve its “true” goals.
The goal of threat modeling is to predict and understand the future of AI.
One focus is on predicting the uses and misuses of particular models, which is difficult due to the flexibility of powerful models. LLMs may only predict the next word in a sequence, but if you prompt one with “The ingredients for making dynamite at home are…”, the model can become a recipe book for wannabe terrorists. Through RLHF, GPT and other LLMs have safeguards to prevent this kind of malicious usage, but these aren’t as robust as we’d like: it’s still possible to manipulate them into providing dangerous information. As AI grows more powerful, it’ll become more and more important to avoid these unintended uses and behaviours, and so we need to get better at predicting what models are capable of.
One approach to this is developing evaluations or “evals”; standardized tests that measure various aspects of models’ behaviour. Some evals focus on measuring the AI’s capabilities – how powerful and flexible they are. Others focus on measuring how well a model is aligned with its users. Some involve red-teaming, a popular technique in cybersecurity and airport security, in which people deliberately try to find and exploit vulnerabilities, prompting dangerous or otherwise unforeseen responses to expose how users could misuse the AI. Evals are a new but promising technique, and may be crucial in governance, as we’ll discuss below.
Another aspect of threat modeling is longer term forecasting or scenario modeling, in which researchers predict the future of AI development. How much more powerful will AI be in ten years? When could we see the first AGI? What areas of safety research will yield the most practical results in the short term? Of course, predicting even the near future of AI is difficult – just think about the dramatic changes in generative AI we’ve witnessed over the last 3 years – but this long-term forecasting is essential for us to determine what research and governance will be effective.
Technical research will be necessary for a safe future with AI, but it won’t be sufficient on its own. If we figure out how to build perfectly interpretable, aligned, eval-passing AI, how do we ensure everyone builds their AI with that technique? How do we stop people deliberately misusing AI, or building their own malicious AI? What happens if catastrophically dangerous AI arises before we’ve resolved the technical problems?
To actually make AI safe, we need a holistic approach that combines technical research, ethics, and governance. We need AI developers and corporations to be incentivized towards safety, we need mechanisms to stop and prevent misuse, we need to be able to hold people liable for AI harm they introduce, and to achieve any of that, we need politicians to recognise the danger of AI and the necessity and political value of reducing that danger.
In this section, we’ll go through some of the governance and policy proposals from AI safety experts, and explore how they intertwine with technical research and ethics. We won’t go into detail on the institutions and organizations that will need to enact these interventions, but we can recommend Matthijs Maas and José Jaime Villalobos’s International AI Institutions: A Literature Review of Models, Examples, and Proposals.
The goal here is to monitor, regulate, and license the resources necessary to build and run powerful AI.
If dangerous or uncontrollable AI emerges, it’ll likely emerge from the largest and most powerful experiments, which will be those with the most computing power, or “compute”, and trained on the most data. Powerful models require staggering amounts of both. For example, GPT-3, which was released in 2020 and was the basis of the first iteration of ChatGPT, is estimated to have cost around $4.6 million US dollars to train. Its training data was half a trillion words, and, if run on a single GPU rather than many in parallel, would’ve taken 355 years to train.
So, if we can regulate access to these resources, we can regulate powerful AI. A useful analogy here is nuclear weapons, and the global effort to monitor and regulate the production and trading of weapons-grade nuclear material.
There are subtleties here. AI is so widely used that immense resources are poured into R&D. This means that models grow exponentially larger over time, but also exponentially more efficient. The data and compute required to train a certain model to a certain standard goes down: they can do more with less, over time. For a practical example, the computing power required to train a neural network to a certain standard performance is estimated to halve every 9 months. This means that any constraints we place on data and computing power must become stricter as efficiency increases.
Enforcing limits on computing power may require hardware mandates, which we’ll discuss below, but it may be sufficient just to facilitate inspection and compliance, similar to how the US and USSR’s agreement to disarm nuclear weapons involved mutual inspection of nuclear sites, or other military “trust but verify” monitoring systems.
The goal of hardware mandates is to control the physical infrastructure that AI is built on.
A complementary target for legislation is the hardware underpinning AI, the physical silicon chips like CPUs and GPUs. These could be designed with built-in “circuit breakers” to halt operation if they detect risky behavior, or identify that they’re in an illegally powerful neural network. Chips could also support real-time monitoring to flag anomalies to regulatory bodies and transparency features for auditing the AI’s data processing or decision-making process.
Chip manufacturers are unlikely to adopt these changes voluntarily as they will increase cost, and legal mandates may only be effective if they are global, meaning truly effective hardware mandates will require international cooperation. Nonetheless, this is a promising avenue for governance, and such global hardware control does have precedents: GPS receivers must be built to have limited functionality above an altitude of 18km or a speed of 1,900km/h, so that GPS can’t be used in intercontinental ballistic missiles. By incorporating such safeguards into the AI’s physical infrastructure, we add a robust and immediate mechanism to manage potential risks.
The goal here is to ensure AI developers can be held legally liable for the harms caused by AI, and are thus incentivized to maintain conscientious safety.
People and companies can be held liable when they sell dangerous products, which incentivizes companies to diligently research and mitigate the side effects and misuses of the products they sell. This is more difficult for AI due to its flexibility. For instance, AI can learn from new data and can be exposed to situations it was never trained for. GPT started as a language model, but people have already started fiddling with it, using plug-ins and other pieces of software to connect it directly to other computer tools, for example allowing GPT to autonomously purchase airline tickets online. If someone uses those tools to commit a crime, who is responsible? Who is legally accountable?
These questions are complex and though there are proposals and discussions addressing them, they haven’t yet been tested in court. They are crucial, though. If our legal system doesn’t hold AI developers accountable, there’ll be fewer incentives to ensure their product isn’t prone to misuse.
The goal of mandatory evaluations is to rigorously test AI systems before they’re deployed.
As we explored in the Technical Research section, researchers are developing “evals” to measure the alignment and capabilities of AI. If these are further developed and standardized, they could be used as mandatory tests before models can legally be released, providing a simple safety net against dangerous models. Such mandatory evaluations would need to be continually refined since outdated or easily manipulable tests could lead to false confidence in dangerous models.
Regulating dangerous information
The goal here is to restrict the spread of hazardous information.
Most fields benefit from open communication and sharing information, and programmers and researchers often share work open source, meaning anyone can freely access and use their code however they like. In general, this encourages innovation and community building, but we’ve reached the point where information about building AI can be hazardous.
We can teach people precisely how Oppenheimer and his colleagues built an atomic bomb because people don’t have the opportunity to weaponize that information – they can’t build one at home. AI is different, though. Individual people and small research groups can download and run code, instantiating powerful AIs on their home computers and adapting them however they like. We’ve already seen individual people hack together tools based on GPT-4 for purposes its creators didn’t intend. It will soon be reckless to allow these tools to be freely available to anyone. We need to develop systems to prevent and punish the sharing of hazardous information before then.
The Future of Life Institute goes further, arguing that we should mandate much stricter cybersecurity for AI developers, citing the danger introduced by recent cybersecurity breaches, such as when hackers leaked Meta’s LLaMa model to the internet. For example, one specific proposal from the FLI is that we mandate “watermarking” of AI models, enabling the identification and prosecution of people leaking hazardous info.
What you can do
We’ve looked at the technical research and governance necessary for a safe future with AI. Next, we’ll look at how you as an individual can contribute.
AI is already a polarizing issue, and is likely to become more so as global angst rises and incentives clash harder. Predictions about AI vary wildly, and even those who agree on the broad strokes get into heated debates about different scenarios and timelines. All this makes it hard to get reliable information and, therefore, hard to know how to respond to the threat of AI.
Fortunately, there are still reliable sources of information. For a list of great sources of information, and more in-depth discussion of information in the modern age, we recommend Dr. Christopher DiCarlo’s article, Empowering Yourself Against Misinformation, Disinformation, and Conspiracy Theories, in the previous issue of Humanist Perspectives.
You can also participate in discussions about technical AI safety and policy on rationalist and AI discussion forums such as LessWrong, the Effective Altruism Forum, and the AI Alignment Forum, where many AI researchers share their research and debate the issues of AI safety.
Voice your concerns.
On social media, to politicians, to industry leaders, we need to communicate that the danger that AI poses is real, acute, urgent, and tractable. AI is increasingly recognized as a danger, and organizations and governments are increasingly promoting and advocating for AI safety research and governance. This is a critical opportunity and we need to use this rising pressure to establish safe, practical governance.
We encourage you to write to politicians and industry leaders, to voice your concerns, and to vote and advocate for politicians that recognise the critical need for governance and funding for AI safety research. In 2020, billions of dollars were spent on advancing AI capabilities, but only $50 million was spent on reducing catastrophic risks from AI. We need governments to increase funding for this vital research.
We also encourage you to vote with your wallet. You can support AI safety organizations directly by donating to AI-focused non-profits such as the Centre for AI Safety, the Future of Life Institute, and the Machine Intelligence Research Institute. Further, we should boycott companies that develop AI recklessly. We’ve seen excellent successes from this kind of public action just this year, with author John Green’s successful campaigns to persuade Johnson & Johnson to allow generic versions of their tuberculosis drug, and Cepheid & Danaher to lower the cost of their tuberculosis detection cartridges in lower and middle income countries.
Work directly on AI safety.
Given the dangers we face, how many people do you think should be working on technical AI safety? Currently, there are probably fewer than 1,000 people working full-time on AI safety – around three quarters of whom work on technical research, the rest split between strategy and governance. Do you think that’s enough?
We don’t think so. There’s plenty of opportunities for more people to have huge impact by focusing on AI safety. If you agree, you could consider joining the field yourself. It might seem a daunting consideration, especially if you, like me, lack a technical background in programming, computing, or politics. However, there are probably more opportunities than you expect. AI safety is increasingly recognised as vital for our future and so it’s receiving increasing investment from charities, companies, and governments, generating more opportunities to work directly on the problem.
The charity 80,000 Hours promote the idea that people’s working lives can be their greatest positive impact, and they especially encourage and provide support for working to reduce “problems that threaten human extinction or could make the long-run future worse”. On the topic of AI, they conclude “The biggest way you could help would be to pursue a career in either [technical AI safety research or AI strategy/policy research and implementation], or in a supporting area.” If you’re curious whether there are roles suitable for you, check out 80,000 Hours excellent career guide, or their AI-specific Preventing an AI-related catastrophe: What can you do concretely to help?.
We face a unique situation. Artificial intelligence has an unprecedented potential to help us flourish, as well as to end life on Earth. How we navigate the next few decades will profoundly change the story of humanity. As Toby Ord says in The Precipice:
If all goes well, human history is just beginning. Humanity is about two hundred thousand years old. But the Earth will remain habitable for hundreds of millions more—enough time for millions of future generations; enough to end disease, poverty and injustice forever; enough to create heights of flourishing unimaginable today.
It is our responsibility to work towards the future where AI is a powerful tool for good, where humanity flourishes. We know what work is necessary: we need to enact governance and foster international cooperation, we need to promote and fund technical research, and we need to resolve the ethical quandaries that are baked into these fields. To do any of that, we need politicians, investors, researchers, and the public to recognise how important this work is. We can all contribute to this, from the ballot box, from your email account, or through your working life. We hope this article has given you a basic understanding of how people are working to ensure a safe future with AI, so that you can make an informed decision, however you choose to contribute.