EA Talks: Priorities in AGI governance research | Jade Leung | EA Global: SF 22

Tags: podcast
Date: Aug 29, 2023 2:03 PM
Welcome. Thanks for coming. Today Jade Leung is going to talk about priorities in AGI governance research. Jade is the governance lead at OpenAI. She previously helped to co-found and lead the Centre for the Governance of AI when it was at the Future of Humanity Institute at the University of Oxford. Keep in mind while she's talking that you can submit questions on Swapcard and I will see them here. Cool. Please welcome Jade. [APPLAUSE]

Imagine you're the coach of a football team, and your team is going to play the championship game. But for some weird reasons in this world, mostly because I need it for this metaphor to work, you can't practice before the game. You don't know when the game is going to start. You don't know what team you're playing. What are you going to do to ensure that your team has the best chance of winning?

You might spend some time looking at the history of football: look up past games, look at other teams, write up little historical case studies. You might try to recruit better players, or improve your players by sending them to the gym, having them run laps, building up their career capital by sending them to football school. You might get some football pundits to help you forecast how they think the game's going to go, depending on what players you put on, how soon the game's going to start, what happens if the ball suddenly starts moving at ten times the speed, all these kinds of classic football scenarios.

But really, what you would be spending a large chunk of your time on is coming up with an actual plan for what your team should do when they're on the field playing the actual game. You'd be coming up with strategies for how you think they would best win, given the game that you expect to pan out. And you'd have a bunch of other strategies in case the game goes in different directions, depending on what actually happens on the day. You'd have written all of this out in a lot of detail, probably in a Google Doc, and then you'd have shared it with all your expert football coach friends, and they'd have red-teamed it and improved it with you. And then you'd have sat your players down and said, "Here's our plan. Mr. Fast Man who runs along the sides, here's what you're supposed to do. Mr. Quarterback person dude, here's what you're supposed to do. Mr. Big Center charging in, here's what you're supposed to do."

I'm not going to talk about football, thankfully for you, because I know absolutely nothing about football. This is all an extended metaphor for the field of AGI governance research, and an extended way of making my core claim for this talk, which is that I think a very important and surprisingly neglected goal of the field is to come up with actual plans and execute on them. As a field, we've made a whole lot of progress in the last couple of years, no doubt. But I'm worried that it's not on track to be enough. I'm worried that it's not good enough to make better arguments about what the problems are. I'm worried that it's not good enough to recruit better people into the field. I'm worried that it's not good enough to just spread good memes. I strongly suspect we need actual plans. We need to stress test them. We need to argue about them. And then ultimately, we need to act on them.

So what I'm going to cover today are basically two things. I'm going to say a bit more about what I mean by a plan, and then I'll say a little bit more about the features of what I would call a good plan that this field should focus on generating. A couple of upfront notes and caveats.
This talk is very meta. It's about prioritization and methodology. You'll be very disappointed that I will not actually be talking about object-level plans, and you can heckle me for that if you would like. Secondly, this is targeted at folks who are in or interested in AGI governance research. So for folks who are on the practitioner side or the field-building side, everything I say will apply much less, or not at all, to you. Thirdly, everything I say is my opinion only, not necessarily endorsed by organizations I'm affiliated with. And finally, I did consider whether I was above doing the cheap AI-talk thing of using DALL·E to keep you engaged, and realized I'm definitely not above it. So everything that you see in the slides is credited to DALL·E.

All right. What do I mean by a plan? In short, what I mean is a theory of victory for humanity come AGI. How do we collectively ensure that the long-term future goes well, conditional on the assumption that it's possible to develop powerful, generally capable AI systems? What I mean in more extended form is something closer to a governance playbook rather than a plan. The word plan sounds very rigid and blueprint-like, a big grand-master-plan kind of vibe. That's not really what I mean. What I mean is a playbook in the sense of having a portfolio of best-guess strategies that optimize for success given different assumptions. The portfolio should ideally cover the space of scenarios that we think the world could plausibly go down. And across the portfolio, we should collectively have confidence that no matter how the world goes down, we have at least a running best guess of what actions we should take to optimize for success. I'll say a lot more about this playbook-portfolio idea in a bit, but I want to start off by unpacking a couple of terms that I'm going to use a whole lot.

First, powerful, generally capable AI systems. By this I mean systems that can perform at human level or above across a wide range of cognitive tasks. I am particularly interested in systems that can outperform the best humans on tasks that require a high amount of cognitive skill and problem solving and that grant power in today's world: things like scientific research, business strategy, military and political strategy. Governance playbook I'll say only a tiny bit about, 'cause I'll get to it much more later, but think about it as a portfolio of strategies that optimize for a desired outcome. Governance, in my head, I mostly think about as being about shifting and setting up incentive structures for actions to be taken to achieve a desired outcome. These can be very explicit incentive structures, things like contracts, legislation, agreements, treaties, and they can also be much less explicit, things like norms, market pressure, and demand. What I don't mean to include in a governance playbook is a technical alignment plan: the scientific work of figuring out how we align eventually arbitrarily powerful AI systems. That's for our good friends in the technical AI alignment community to solve, she says, with a slight hint of desperation. But the governance playbook should be responsive to different outcomes in the technical alignment space, depending on how long it takes and how hard it is. And it should also address the fact that we need incentives for those technical alignment solutions to be taken up once they are actually discovered. Finally, I want to say a bit more about what I mean by ensuring that things go well.
To a first-order approximation, what I mean when I say that throughout this talk is ensuring that the first few deployments of powerful, generally capable AI systems are aligned and safe, and that we reach a state of stabilization. I want to acknowledge upfront that I take a very existential-risk-focused perspective in this talk and in my work. So when I say ensuring things go well, I basically mean avoiding these kinds of high-stakes risks. If you personally place low credence on x-risk from advanced AI, then ensuring things go well could very well mean a very different thing to you. That said, when I say reach a state of stabilization, what I mean is that we get to a world where existential risks from advanced AI are very low, we have no major power transitions on the horizon between human and AI actors, and we can collectively turn our attention to ensuring the future goes well, ideally with aligned AI capabilities on hand.

You might ask why I focus on the first few deployments rather than deployments forever in time. The core argument here is something like: the first couple of deployments are unusually high-stakes and unusually high-leverage events. If we navigate that period well, we'll get to a point where ideally we've reached the state of stabilization that I've described, and you can imagine that we're a much more technologically advanced civilization on the other side, and we can control deployments much better and much more effectively. If we don't navigate this period well, then we'll be locked into a world with misaligned and/or unsafe AGI, and that makes things very hard to correct after that. So ensuring things go well by thinking about the first couple of deployments, I think, does a lot of the heavy lifting of ensuring that the trajectory thereafter goes well, although it doesn't do all the lifting that's required.

I want to insert a really big, meaty, academic kind of caveat here, in that I think there are plausibly just a lot of issues with the way that I've defined ensuring that things go well. I think it's a useful simplification to think about the first couple of deployments. But I'm very aware that there's something suspicious about separating stabilization from literally everything that happens afterwards, and there are probably a lot of interdependencies between them and a lot of fuzziness in how these phases are cut up. I think there's also a likelihood that deployment events won't be very discrete or very observable, and it just won't be clear which AI systems we actually are concerned about at the time. So that's all a long way of saying: I'm very aware that this is an oversimplification, I'm in the market for better ones, but I will use it for this talk in any case.

All right, so I've said a little bit more about what a plan means. Now I want to turn my attention to covering the features of plans that I'm most interested in seeing. I'm going to cover three features. On target: by this I mean plans that actually take a big chunk out of the problem of existential risk from advanced AI. Playbook-like: revisiting this term, basically thinking about it as a portfolio of strategies that covers a scenario space. And finally, credible, in the sense that they pass some basic reality checks. So I'll spend a bit of time on each one of these.
On target. The bottom line up front here is that a thing I think AGI governance researchers should do for every single governance intervention they're thinking of is ask themselves the question: what is the link between this intervention and a reduction in existential risk from advanced AI? One way of thinking about this is that one of the main problems we want to solve is existential risk from advanced AI, and you can split it up into a couple of different threat models that lead to this expected x-risk. And then, using those threat models, we should always press ourselves to ask the question: how am I addressing a particular chunk of this problem very head-on?

To make this a little bit more concrete: for me, the two threat models that loom the largest are, first, misaligned, power-seeking systems (MAPS) accidentally being deployed, and I'll call the corresponding goal MAPS prevention; and second, existential misuse, in the sense of misuse by actors that causes existential harm, with the corresponding goal of existential misuse prevention. The most salient examples there are probably things like widespread, robust totalitarianism, or things like destabilizing AI weaponization. These images are absolutely not informative, don't pay too much attention to them. DALL·E was a bit of a trip.

So those are the two problems that loom the largest for me. But there could be a whole bunch of other problems that loom large for you, depending on how you think stuff's going to go down. And again, if you're not that concerned about existential risk from advanced AI, good for you. Then ensuring things go well, or the big problems, could be something like: how does humanity construct a deliberative process to address some big question in the long-term future? So there could be a bunch of different things. The key point here is that there is a task we need to put ourselves to, which is breaking down this problem into a couple of big chunks and making sure that we're running pretty head-on at those as the key prize that we're aiming for.

This is all a little bit abstract, so I wanna spend a bit of time pointing out ways in which I think things can be not on target, in the sense that we can run at things a little bit sideways and kinda nibble away at the edges of things without really realizing it. One example of this is using instrumental or proxy goals. Really common ones in AGI governance research are proxy goals like preventing racing, stopping AGI development altogether, or stopping certain actors from getting access to AGI. And there are very good stories about why you would think these correlate with a reduction in existential risk; I think there are very good stories to unpack there. The real nudge here is just to make those stories explicit, so that we're making sure we're running at the right problems for the right reasons.

So, to steelman preventing racing: usually the story there is that there are certain types of race dynamics which can lead to actors counterfactually investing less in alignment and safety compared to what they otherwise would have, and that leads to an increased likelihood of MAPS accidentally being deployed. That's a much more long-winded way of saying "preventing racing", but that's essentially what's going on. And focusing on x-risk has us focus on the right race dynamics to prevent. Another example of this is thinking about stopping actors from getting access to AGI.
The story there is that there are certain types of actors in the world that are more likely to misuse AI in the ways we're concerned about, or more likely to accidentally develop MAPS. So there's a set of hypotheses that you can unpack and test there, and making sure that those actually pass muster before running at this goal is basically the key ask.

Another way in which I think it's possible to run a little bit sideways, or nibble at the edges of things rather than being on target, is thinking about broad interventions. When I think about broad interventions, usually they're pretty good ideas, because they just feel like very sensible things to do. And I think that's absolutely the case. But I think it's also the case that you should apply the lens of how this broad intervention plausibly leads to a reduction in x-risk, in order to focus that broad intervention. Take things like communicating and awareness-raising about AGI risks. Usually the story there is that certain plans only go through if you convince certain types of actors of certain things. If you actually had a plan in mind that reduces x-risk, then you could be a lot more targeted about who you're trying to convince of what. Another example is raising the waterline of responsible behavior among AI developers. Again, there's a story that passes muster here, which is that it's probably generally a good thing for AI developers to be trustworthy, cautious, and secure. But if you really think about it from an x-risk perspective, there are some behaviors in labs that are much more important to engender than others. For example, if you're concerned about misuse, you'd probably be much more concerned about how a lab engages with governments than about how they monitor individual cases of misuse. The key point here, just to sum up the on-target point, is that I would love us to shift from a mode of "generally do sensible things" to asking ourselves the question: how does this take a big chunk out of the problem of existential risk from advanced AI? And to make sure that there is a direct line between the work that we're doing and a reduction in existential risk.

All right. Second feature: playbook-like. DALL·E can't spell, FYI. Back when I was a football coach (I've never touched a football in my life), I would have a playbook. And the playbook would have a mainline play. The mainline play is: given my best guess about how I think the game's gonna go, what's the strategy that optimizes for success there? And then I would basically never expect the game to go exactly as I expected, so I would also have alternate plays for the ways the game could otherwise go. And when I'm actually playing the game, I would see what game is actually unfolding, flip through my playbook, and pick what I think the optimal play actually is to ask my players to proceed with.

So in this analogy, AGI governance researchers are the playbook writers. And we have essentially the same problem that football coaches have, which is that we are planning for a game that hasn't yet happened, and we are planning for a game that could go in so many different directions. In our case, it's even worse, because we're playing a very weird game that we've never played before. It's an incredibly hard game, and if we don't win the game, then let's just not talk about what happens if we don't win the game. So what do we do in the face of all of that uncertainty?
My first preferred angle of attack on this is basically to draw bounds around the space of worlds that we care about having plans for, and only optimize for success within that bounded space of worlds. So to illustrate: I personally care the most about having plans for worlds where things are not by default going to go well, because if they're by default gonna go well, then things are probably fine and you can just kind of tinker away at the edges. I also care about having plans for worlds that are modally realistic, in the sense that they're likely to be the way we think reality is actually going to go, given the information that we have. These things combined make me personally work on planning for worlds that meet the following constraints.

Assuming we get AGI by 2050. The reasons are that it's much more difficult to predict what's going to happen much further out than that, it's difficult to have leverage much further out than that, and also timeline forecasting, unfortunately, suggests that you should probably put a fairly large probability mass on sub-2050 timelines. Assuming that we get AGI via the deep learning paradigm, i.e. there doesn't need to be some alternative paradigm shift, and so there probably isn't an impending game-board shift, as it were, in which actors are more likely to develop AGI. Assuming that AGI will be very resource-intensive, particularly in the sense of requiring a lot of compute. This assumption leads to a couple of interesting downstream assumptions: things like assuming that only very few actors have the willingness to pay and the level of resourcing to actually pursue AGI projects, so you're probably talking about worlds where there's something more in the span of two to ten actors, rather than dozens and dozens of actors, that can develop AGI. And then finally, as a corollary of the above, assuming that AGI is probably gonna be developed by a big company or a state or some hybrid in between.

You could have very different assumptions to everything I've laid out here. The key point is that these are the bounds that you could imagine drawing around a space of worlds that you care about having plans for. And that actually means it's a lot easier to make plans, 'cause you're not dealing with epic amounts of uncertainty. So now you're like, "Thanks, but no thanks, really, 'cause there's still like a million different parameters that can combine within this space of worlds, and so I don't actually really know what game I'm supposed to be planning for in any case."

At this point my angle of attack feels a lot more unstable, I wanna say. It's kind of a method that my team and I are trying, so we can report back in a year if we fail. But the current main angle of attack that I'm a fan of is first generating a mainline play, as I've said. The mainline play is basically: what is the most robust way that you can optimize for success across the space of worlds, without conditioning on anything more specific than these constraints? So assuming you don't have any more prescience than this, basically. And then you look at the space of worlds and ask: what worlds have I not covered? What worlds does my mainline play basically just totally fail in? In which case, I need alternate plays, I just need a different plan. And so you can imagine your playbook basically has this mainline play and then a couple of alternate plays to cover the rest of the space. So roughly, generating them might look something like this.
This is a caricatured step-by-step process, and the world is a lot messier than this. You'd start off, again on the theme of being on target, by picking a particular problem that contributes to existential risk from advanced AI that you care the most about running at. For this example, I'm gonna use MAPS prevention, where basically the goal is ensuring that the first few deployments of powerful, generally capable AI systems in particular are not misaligned, power-seeking models.

With that goal in mind, you then ask the question: what's the most robust way to achieve that goal if I cannot condition on anything more specific than AGI by 2050, deep learning, et cetera, et cetera? When I think through this example, I naturally start to generate a list of the kinds of things that I think my mainline play needs to achieve. You would get a list that looks something like: the mainline play needs to apply to the actors that can afford resource-intensive AGI, so probably in this realm of two to ten; they can be both state and non-state actors, so you have to cover that whole space; but you basically don't need to cover much more than that set of actors, because of the assumptions that we've made. You might also get to something like: well, if we're talking about MAPS prevention as the goal, we ultimately need a way of defining and evaluating whether a system is misaligned or power-seeking or not, so we need a way of saying "this is the thing to avoid doing" and clarifying that in particular terms. Once we've got a way of defining and evaluating what it is that we don't want developers to do, we can translate that into requirements of developers. And then finally, the kind of component you need there is monitoring whether, in fact, developers are doing the thing that you don't want them to do, and if so, penalizing that. So this is all very abstract and high-level, but those are kind of the cogs of the mainline play that you probably need in most worlds.

When I think about this example, I'm personally pulled towards investigating things that look like international treaties and regulatory regimes, where the main challenge is how you garner the effective participation of, essentially, the US and China for most of the cases that you're worried about. There are probably components of compute governance, in the sense that if we're making this resource-intensiveness assumption, then compute is one of the main inputs. And it probably requires some component of independent expert evaluation of MAPS risk in systems that are being trained and deployed. This is all just for illustration. But you could imagine ultimately landing on something like a mainline play which is some international treaty regime.

And then, basically, you want to move on to the step of trying to break it a lot. By breaking it, you're basically looking for improvements to make it more robust, or you're looking for worlds where your mainline play just totally sucks and you need a different, alternate play. Questions you might ask at this point would be something like: what are the riskiest assumptions that constitute my mainline play that just seem unlikely to work? Or what could happen in the world that could totally undermine the mainline play actually working? So, for example, I would quickly expect to find a need for something like nonproliferation as a complementary element of the mainline play.
Because if you can severely restrict the number of actors that can develop AGI, that makes your coordination problem a lot easier. I would also probably quickly expect to find a need for some kind of plan around raising awareness about MAPS risks among the key decision-makers, because MAPS is just very far outside the Overton window for basically anyone at the moment, and so you need to do some targeted work there. Examples of worlds where you might find the need for an alternate play would be, for example, very short-timeline worlds. Say we're in a world where it's AGI three to five years from now; honestly, God help us. In those worlds, you just don't have time to negotiate complicated international treaties, and so you probably need an alternate play that relies a lot more on decisive action that can move quickly.

And the final step (I say "step 300" in the sense that there's obviously a lot of work to do to investigate, red-team, and improve things): imagine you get to the point where you have a robust mainline play and you have alternate plays, and they're all anchored on the kinds of worlds that you think they'll work for. Then we turn to doing a couple of things simultaneously. The first is thinking about what the trade-offs are between your mainline and your alternate plays. Because often there are trade-offs, and if you're going to start moving towards implementing the mainline play, then you want to know what option value you're taking off the table. For example, if we're riffing again on this international-treaty mainline play, then plausibly it comes with pro-social actors making very credible commitments to slow down in order to get other people to the table, and that would trade off pretty directly against alternate plays which rely on pro-social actors having a long lead. So being aware of those trade-offs is very important. Then we would ideally start laying the groundwork for the mainline play, and this requires working with many, many, many more actors than the research community that we're talking about here, and this community writ large. And unlike with a football game, ideally we're doing as much work in advance of the game as we possibly can. And finally, we want to consistently be doing decision-relevant forecasting, where the decision that is most relevant here is whether we are trending towards worlds that need alternate plays. So if we're implementing the mainline play, what are the signals that would tell us we're supposed to be pivoting to other things?

All right, so that's basically the summary, I think, of what I mean by playbook-like in this instance. The bottom line that I do wanna emphasize before moving on is that I don't wanna overemphasize what I think playbooking will do for us. You can think about playbooking as front-loading the cognitive effort of thinking about what to do under different scenarios, and trying to do as much work as we can to prepare. But ultimately, as with a football game, players always have to respond to on-the-field conditions. They're always improvising. The coach always has to be adjusting to what's working and what's not. And we always need to maintain strategic awareness, adaptability, and flexibility. So in the same way, in AGI governance, I don't expect things to go exactly as we planned, and I expect it to be very important to have good people in good positions, for example.
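As a purely illustrative aside on the playbook structure described above (a mainline play plus alternate plays, each anchored to the world assumptions it depends on, and forecasting signals that tell you when to pivot), here is a minimal sketch in Python of how that decision logic could be modeled. Everything in it, including the class names, the example assumptions, and the example plays, is hypothetical and not from the talk; it is a toy rendering of the idea, not a real plan.

```python
from dataclasses import dataclass, field

@dataclass
class Play:
    name: str
    assumptions: dict  # world features this play depends on, e.g. {"timeline": "by-2050"}
    strategy: str      # prose summary of the intended actions

@dataclass
class Playbook:
    mainline: Play
    alternates: list = field(default_factory=list)

    def select(self, observed: dict) -> Play:
        """Return the mainline play unless an observed signal contradicts one of its
        assumptions; in that case, fall back to the first alternate whose assumptions
        remain consistent with what has been observed."""
        def violated(play: Play) -> bool:
            # An assumption is violated if we have observed a value for that feature
            # and it differs from what the play assumes.
            return any(
                observed.get(key) is not None and observed.get(key) != value
                for key, value in play.assumptions.items()
            )

        if not violated(self.mainline):
            return self.mainline
        for play in self.alternates:
            if not violated(play):
                return play
        # No prepared play fits: return the mainline and treat this as a signal
        # that new planning work is needed.
        return self.mainline

# Toy usage, caricaturing the example from the talk: an international treaty regime
# as the mainline play, and fast decisive action as an alternate for very short timelines.
playbook = Playbook(
    mainline=Play(
        name="international-treaty-regime",
        assumptions={"timeline": "by-2050", "paradigm": "deep-learning", "developers": "2-10 actors"},
        strategy="treaty + compute governance + independent MAPS evaluation",
    ),
    alternates=[
        Play(
            name="short-timeline-decisive-action",
            assumptions={"timeline": "3-5 years"},
            strategy="rely on fast, decisive action by a small number of pro-social actors",
        )
    ],
)

print(playbook.select({"timeline": "3-5 years"}).name)  # -> short-timeline-decisive-action
```

In this toy model, "flipping through the playbook" is just checking which play's assumptions are still consistent with what has been observed; as the talk emphasizes, the real work is in writing good plays and good signals in the first place, and in improvising when no prepared play fits.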
But I do sure as heck think that we should be trying our best to foresee and plan for our best game, plan for known unknowns, and lay the groundwork with appropriate caution, for sure, but also appropriate urgency. And that's ultimately, I think, the work that we're called to do here, even though it's not necessarily gonna solve the whole problem, absolutely.

All right, the last feature I wanna cover is credible. By credible, I mean something like passing some basic reality checks. I mean things like not having plans that rely on unrealistic assumptions: not assuming that there'll be sudden worldwide convergence on the importance of existential risk, not assuming that there's perfect operational security that labs can achieve. Also, plans that assume actors will take actions that are very far outside their Overton window are ones I think we should be very suspicious of. So when I see plans that involve actors egregiously violating the law, or risking their lives, or risking the lives of others, I view those with suspicion by default. Other things that are important about credibility are reflecting and responding to things that are important to other actors in the world that don't necessarily resonate all that deeply with people like me and you. Things like history, culture, injustice, reputation, honor: these things matter to a lot of people, and plans should appropriately reflect that. And then finally, when I say credible, I mean they are just not half-baked. They should be detailed and well thought through. Experts who are looking at your playbook should not laugh at it. If Deng Xiaoping and Kissinger looked at your playbook, they wouldn't laugh at it. It basically just looks like a thing that has been well done and well thought through, as a kind of basic point.

All right, to wrap up. I'll admit I don't really care for football all that much, and I don't really know what game we're playing there and why we're all topless and overweight. But I do really, really care about this particular championship game. And the thing that I want AGI governance researchers to do is get to work vigorously investigating bids for plays to put in our playbook. These bids should be concrete. They should be things that we can argue about and stress test and trade off against each other. They should have clear-eyed paths to impact: we should know how they take a big chunk out of existential risk from advanced AI. They should pass some basic reality checks. And we should have several of them, to cover the space of worlds that we care about. Let's get to work. Thanks. [APPLAUSE]
