How Can AI Labs Incorporate Risks From AI Accelerating AI Progress Into Their Responsible Scaling Policies?

Released on 24th March 2025
Tom Davidson
This is a rough research note – we’re sharing it for feedback and to spark discussion. We’re less confident in its methods and conclusions.

Summary

I describe a threat model for AI accelerating AI progress, a specific dangerous capability threshold, early warning signs for detecting that threshold, and the protective measures needed to continue development safely past that threshold. Labs should start measuring for the warning signs today. If they observe them, they should pause AI development unless they have implemented the protective measures.
Threat model
AI that can significantly accelerate AI progress poses multiple severe risks. It could destabilize critical efforts to proceed cautiously, rapidly lead to superintelligent AI that poses catastrophic risks, or allow an actor to gain illegitimate amounts of power.
My specific dangerous capability threshold is AI that could accelerate the lab’s pace of software1 progress by 3X. So if a lab’s training compute requirements previously halved every 12 months, any AI that causes them to halve every 4 months should be viewed as very dangerous.
Early warning signs
Labs can get early warning signs of this dangerous capability by observing either of the following:
  • AI has already doubled the pace of software progress: the lab has made 3 years’ worth of software progress in 18 months, and this is plausibly due to AI.
    • If progress accelerates very suddenly, there will be little time between 2X and 3X acceleration and this warning sign won’t be effective. The next warning sign is designed to detect scenarios like this.
  • AI can autonomously perform many AI R&D tasks, enough that it might soon enable 3X faster software progress.
Labs should start measuring for both of these warning signs today. If they detect either warning sign, they should pause development unless they have adopted the protective measures.
Protective measures
I think this is among the most dangerous capabilities an AI can possess, and the protective measures needed to handle it safely are likely to be very significant on many dimensions. Here’s a really brief sketch of some of the key protective measures, but this needs more work:
  • Info security that prevents a state actor from stealing the weights and accelerating their own AI development.
  • Alignment, boxing, and internal monitoring that are good enough to ensure that misaligned AI doesn’t “poison” the AI development process.
  • External oversight: a publicly legible mechanism to ensure a responsible 3rd party signs off on high-stakes decisions, including any decision to rapidly improve AI capabilities. This improves decision making and provides labs with mutual assurance that other labs are behaving responsibly.
  • AI development speed limits: a Responsible Scaling Policy that explicitly evaluates how fast the lab could advance its overall AI capabilities while still being safe – given the lab’s current processes – and commits not to advance faster than that.
    • Exception: if following through on the commitment would make the lab fall behind a demonstrably less cautious competitor.
  • Pro-social governance: a governance structure that empowers the lab to act in the public interest even if this is opposed to shareholder value.
  • Better measurement: comprehensive measurements and forecasts of the pace of software progress.

Threat model

Recent AI progress has been very fast; we should proceed more slowly when developing systems that could pose extreme risks. But AI has already started to accelerate AI progress, and as we approach AGI the degree of acceleration will increase, perhaps dramatically. If this unlocks the technological potential for extremely rapid progress at exactly the time when labs should be most cautious, that could be extremely destabilizing.
Particularly scary is the possibility of very rapid AI software2 progress without additional computing power. Though this may well never happen, it would pose a significant challenge to AI governance because advances in algorithms and data are hard to measure and control.3
Particular risks include:
  • Actors rapidly develop AI with dangerous capabilities.
    • Some of the most extreme risks may only arise after AI is able to significantly accelerate AI progress. In particular, risks stemming from AI significantly surpassing human experts in the domains of persuasion, hacking, strategy, coordination, and R&D.
    • Rapid AI progress could make it much harder to manage extreme risks.
  • Actors rapidly gain illegitimate amounts of power, faster than regulators can keep up.
    • Even if a lab develops AI safely, they could illegitimately gain huge amounts of power by tightly controlling an AI workforce whose cognitive output exceeds that of millions of top human experts.
  • Actors believe they must “rush or fall behind” and become less cautious.
    • Today the reliance of AI progress on more compute tempers competitive “race to the bottom” dynamics between labs in two ways:
      • There are natural delays in AI development while labs wait for a new generation of chips to be released or raise investment to buy more compute.
      • One lab can easily rule out the hypothesis that another lab is making extremely rapid AI progress, because it can roughly observe the other lab’s stock of compute.
    • Very rapid AI software progress undermines both these mechanisms, and could exacerbate race dynamics.
The above risks could be posed by insufficiently cautious labs, by bad actors that exfiltrate a lab’s weights, or by misaligned AI influencing activities within a lab.
My dangerous capability threshold is AI that could drive 3X faster software progress than in 2020-23. Why is 3X dangerous?
  • A 3X increase in the pace of progress would likely require extremely capable AI that can automate a large majority (>75%) of AI R&D tasks.4
  • If the lab currently halves its training compute requirements each year, then 3X faster progress corresponds to an 8X increase in effective compute in 1 year from better training algorithms alone (see the short calculation after this list).5 That’s like going from GPT-3.5 to GPT-4 in just one year while keeping training compute constant. But it’s much scarier than that because it’s starting from AI that’s already extremely capable and may soon pose extreme risks.
  • In addition, there would be 3X faster improvements in prompting, agent scaffolds, fine-tuning techniques, and improving runtime efficiency.
  • Soon after 3X acceleration is observed, progress may be significantly faster still.
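To make the arithmetic above concrete, here is a minimal sketch (in Python) of how an acceleration factor and a baseline halving time translate into an effective-compute multiplier over a given horizon. The function and the example numbers are illustrative only: they mirror the figures used above (training requirements halving every 12 months at baseline, 3X acceleration), not any lab’s actual metrics.

```python
def effective_compute_multiplier(accel_factor: float,
                                 baseline_halving_months: float = 12.0,
                                 horizon_months: float = 12.0) -> float:
    """Effective-compute gain from better training algorithms alone.

    Assumes software progress compounds as repeated halvings of training
    compute requirements: at baseline, requirements halve once every
    `baseline_halving_months`; under acceleration, halvings arrive
    `accel_factor` times as often.
    """
    halvings = accel_factor * horizon_months / baseline_halving_months
    return 2 ** halvings

# Baseline pace (1X): one halving per year -> 2X effective compute per year.
print(effective_compute_multiplier(1.0))  # 2.0
# 3X acceleration: halving every 4 months -> 2^3 = 8X effective compute per year.
print(effective_compute_multiplier(3.0))  # 8.0
```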
I believe a 3X acceleration could be extremely dangerous. So in the next section I recommend pausing when a 2X acceleration is detected unless the protective measures (the final section) are in place.

Early warning signs

Labs can get early warning signs of this dangerous capability by observing that either:
  • AI has already doubled the pace of software progress: the lab has made 3 years’ worth of software progress in 18 months, and this is plausibly due to AI.
  • AI can autonomously perform many AI R&D tasks, enough that it might soon enable 3X faster software progress than in 2020-23.
Labs should start measuring for both of these warning signs today. They should commit to pause their AI development when they observe either warning sign, unless they have adopted the protective measures sketched below.

AI has already doubled the pace of software progress

The simplest early warning sign for AI accelerating software progress by 3X is AI accelerating progress by a smaller amount. This works as long as acceleration happens somewhat gradually.
I propose that the warning sign is triggered when the lab observes software progress that is 2X faster than in 2020-23 – see fn for justification of “2X”6 – and this acceleration is plausibly due to AI’s contributions.
Many labs are already measuring the pace at which algorithmic improvements are reducing the compute needed to develop AI of a given capability level. These measurements should be extended to account for other sources of software progress (better prompts, scaffolds, tools, fine-tuning procedures). See fn for details.7
More precisely, this warning sign is triggered when two conditions hold:
  1. 3 years’ worth of software progress, at the pace of 2020-23, occurs in 18 months or less.
    1. For example, if the lab was previously halving its training requirements every year, the threshold would be “an 8X reduction in training requirements over an 18-month period”.
    2. Looking at 3 years’ worth of progress makes this less likely to trigger due to a “fluke” period of faster progress.
    3. Labs should be measuring their software progress every 2 months to detect this (a minimal check along these lines is sketched at the end of this subsection).
  2. This faster pace of software progress is plausibly due to AI.
    1. I expect that this will be pretty obvious to people in the lab.
    2. RCTs and employee surveys can help resolve any uncertainty, and give external actors visibility into the judgment call. See fn for details.8
    3. What if the faster software progress isn’t due to AI? Many of the protective measures above may still be needed, if the “natural” pace of software progress is increasing, though I won’t explore this further here.
    4. Labs should commit to pausing at least temporarily if they observe 3 years’ worth of software progress in 18 months, while they evaluate whether the acceleration is due to AI and whether they expect progress to slow down again.
If we strengthen condition 2, then we can weaken condition 1 to compensate. In particular, suppose it is not only plausible but also highly likely that the faster progress is due to AI. In this case we don’t have to wonder if the faster progress is due to luck, so the warning sign could instead be triggered if 1 year of software progress, at the pace of 2020-23, occurs in 6 months or less.
Without delay, labs should start measuring the pace of software progress, conducting RCTs, and doing internal surveys.
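Here is a minimal sketch of how the trigger conditions above could be checked against such measurements. It assumes the lab summarizes its software progress as a series of effective-compute doublings measured over consecutive 2-month windows (a hypothetical internal metric); the data format and baseline value are assumptions for illustration, not a proposed standard.

```python
from typing import Sequence

def main_trigger(doublings_per_2mo: Sequence[float],
                 baseline_doublings_per_year: float = 1.0) -> bool:
    """Condition 1: 3 years' worth of 2020-23-pace software progress
    within the trailing 18 months (9 consecutive 2-month measurements)."""
    window = 9                                   # 9 x 2 months = 18 months
    threshold = 3 * baseline_doublings_per_year  # e.g. 3 effective-compute doublings
    if len(doublings_per_2mo) < window:
        return False
    return sum(doublings_per_2mo[-window:]) >= threshold

def weakened_trigger(doublings_per_2mo: Sequence[float],
                     baseline_doublings_per_year: float = 1.0) -> bool:
    """Weakened condition: 1 year's worth of baseline-pace progress within
    the trailing 6 months. Only applicable when it is highly likely (not
    just plausible) that the acceleration is due to AI."""
    window = 3                                   # 3 x 2 months = 6 months
    if len(doublings_per_2mo) < window:
        return False
    return sum(doublings_per_2mo[-window:]) >= baseline_doublings_per_year
```

Either trigger firing would then feed into condition 2: a judgment call, informed by RCTs and surveys, about whether the acceleration is plausibly (or highly likely) due to AI.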

AI can autonomously perform many AI R&D tasks

The above warning sign will fail if AI acceleration is sudden, which is a particularly scary scenario. To guard against this possibility, labs should run an eval to detect whether new models might significantly accelerate software progress.
The eval doesn’t need to accurately measure how much the new model will accelerate software progress. It just needs to rule out the possibility of sudden and significant acceleration. In particular, it must be very implausible that an AI could fail the eval but still make AI software progress 3X faster than in 2020-23.
The eval could assess whether AI can do ten AI R&D tasks autonomously, as with ARC’s ARA eval. Then the lab could argue: “if the AI can’t autonomously do five of these ten tasks, it won’t significantly accelerate progress (even if it increases the pace of progress by 50%)”. A minimal sketch of such a scoring harness appears at the end of this subsection.
  • As usual, the eval should give AI access to the best possible scaffolds and tools to avoid underestimating its capabilities.
  • The initial eval could be very conservative, just testing some basic AI R&D tasks. When it is close to passing, harder evals can be designed (it must still be very implausible that an AI could fail the new eval but make AI software progress 3X faster than in 2020-23).
  • Labs should also build a “trigger eval”, such that they intend to pause if the model passes it.
It may be that the model passes the trigger eval, but the lab believes AI could not significantly accelerate progress. They could pause development and investigate further by granting a small number of employees access to the model and doing an RCT and survey on its productivity effects; see fn for details.9 If they decide to proceed, they should i) revisit the decision each month for the next 6 months (measuring the pace of software progress), and ii) create a new eval or RCT that, if passed, would cause them to pause.
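To illustrate the shape of such a trigger eval, here is a minimal sketch that scores a model on a fixed suite of autonomous AI R&D tasks and applies a pass threshold (five of ten, mirroring the example above). The task list, scoring interface, and threshold are placeholders; a real eval would need carefully designed tasks, the best available scaffolding and tools, and expert review of borderline cases.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class RnDTask:
    name: str
    # Returns True if the model completed the task autonomously,
    # given the best available scaffolds and tools.
    run: Callable[[], bool]

def trigger_eval(tasks: List[RnDTask], pass_threshold: int = 5) -> bool:
    """Return True if the model passes enough tasks that the lab should
    pause (or pause and investigate further, as described above)."""
    passed = [task.name for task in tasks if task.run()]
    print(f"Autonomously completed {len(passed)}/{len(tasks)} tasks: {passed}")
    return len(passed) >= pass_threshold
```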

Protective measures

If labs detect either of the above warning signs, they should pause development unless they have adopted suitable protective measures. Below I sketch out the high-level requirements of these protective measures and justify them. I am not describing specific ways that a lab could actually implement the measures. Much more work is needed to flesh out these measures.
To continue to scale safely once they have observed one of the two warning signs, labs should have the following in place:
  • Info security that is robust to a well-resourced state actor and to the lab’s best AIs.
    • Why? Around this time, it may be necessary to pause AI development globally, for example if we cannot find a scalable alignment solution. If a state actor steals the weights and is not cautious, coordinating a pause could become extremely difficult.
    • Info security must also prevent AI from within the lab from hacking the server or taking disallowed actions (see “boxing” below). This must remain the case even as AI develops superhuman hacking abilities.
  • Alignment, boxing, and internal monitoring that are collectively good enough to guarantee that misaligned AI is not poisoning the process of AI development.
    • Why? A key threat model is that misaligned AI builds its influence by subtly poisoning the AI development process so that future AIs are misaligned.
    • Alignment could prevent models from wanting to do this.
    • Boxing could limit the actions that AI could take to poison the development process, e.g. preventing it from running code on large clusters.
    • Internal monitoring could verify that the AI’s outputs are harmless and truthful.
    • To be clear, the threat of poisoning begins well before AI significantly accelerates AI progress, and protective measures should be ramped up in line with AI’s contributions to AI development. But by this point those protective measures should be extremely effective, significantly beyond what is in place today. (Describing how those measures should evolve over time before this point is beyond the scope of this document.)
  • External oversight: a publicly legible mechanism that ensures a responsible 3rd party organization signs off on high-stakes decisions, including a decision to rapidly improve AI capabilities.
    • Why? Two reasons:
      • Firstly, a lab with AI this capable will be making many high-stakes decisions. This includes decisions about whether and how quickly to advance its AI capabilities, how to use its abundant AI labor force, and how to mitigate particular risks from new AI systems. External oversight ensures the lab makes these decisions responsibly.
      • Secondly, other labs may take risks if they believe a lab (with AI this capable) is not being responsible. A legible external oversight mechanism provides reassurance to all actors that no one is advancing AI dangerously quickly, despite the possibility of AI acceleration.
The remaining protective measures feel somewhat lower priority to me:
  • AI development speed limits: an RSP that explicitly evaluates how fast the lab could improve its overall AI capabilities with its current processes while still being safe.
    • Why?
      • Existing mechanisms in the RSP may break down if AI progress becomes too fast. For example:
        • The lab may need to re-run evals more regularly when progress is faster.
        • The lab may need a better process for quickly and regularly getting external oversight over important decisions if progress is faster.
        • Risks that the lab is not explicitly testing for may be more likely to emerge if the lab has less time to explore its new models’ capabilities.
        • There may not be enough time for whistleblowers to speak up if the lab’s RSP is not being followed.
      • A speed limit gives a clear and legible guarantee to other labs, and the world at large, that the lab won’t quickly and recklessly gain huge amounts of power. This prevents other labs from feeling they must “rush or fall behind”.
      • Without a speed limit, external parties might reasonably worry a lab will find a way to interpret its RSP in a way that allows it to advance extremely quickly.
    • This doesn’t fully preclude going over the speed limit:
      • There can be an exception if following through on the commitment would make the lab fall behind a demonstrably less cautious competitor.
      • The lab can make a new RSP with a higher speed limit.
  • Pro-social governance: a governance structure that empowers the lab to act in the public interest even if this is opposed to short-term shareholder value.
    • Why? Pursuing shareholder value could impose large externalities on the rest of society in these high-stakes decisions. For example, the lab should resist the temptation to advance its capabilities extremely rapidly if this would pose even a small risk of catastrophe.
  • Better measurement and forecasts of the pace of software progress.
    • Why? The risks of AI acceleration may well become more severe over time. The lab can better mitigate these risks if they can accurately predict the pace of their future progress.
      • This includes hypothetical predictions. How fast would progress be if they went all out on improving capabilities? What if they also trained a model with 10X more compute?
      • This can also inform regulatory decisions. If the lab predicts that with a training run of 1e28 FLOP, they could accelerate software progress by 3X and with a training run of 1e29 FLOP they could accelerate software progress by 5X, this could inform regulations about the safety standards needed for actors to access that amount of compute.
  • Prioritizing alignment automation: for example, a commitment to use (say) >2X as much inference compute to solve AI alignment as to improve AI capabilities. (Compute used for training or for serving external customers isn’t included here; I’m just including inference compute used internally. A rough accounting sketch appears after this list.)
    • Why? Once AI is this powerful and accelerating software progress this much, it is critical that it is used primarily to solve AI alignment rather than to further accelerate capabilities progress.
    • Exceptions:
      • If there is consensus that the lab has found an indefinitely scalable solution to alignment, then there is no longer an urgent need to prioritize solving AI alignment.
      • If the lab demonstrably risks falling behind a less cautious actor that is using a smaller fraction of their compute for alignment, they can declare that they are in a dire situation and drop this commitment.
    • The specific number I’ve chosen here is somewhat arbitrary, as is the specific metric (amount of inference compute used on alignment vs capabilities). What matters is that the lab prioritizes using its massive AI workforce to reduce AI risk.
    • This is the protective measure I’m least sure about. It might be better to just insist on a high absolute level of alignment, and it might be hard to cleanly distinguish alignment from non-alignment work.
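As a rough illustration of how the inference-compute commitment could be monitored, here is a minimal sketch that tallies internal inference compute by workload tag and checks the alignment-to-capabilities ratio against the (illustrative) 2X threshold. The tagging scheme and units are assumptions, and as noted above, cleanly classifying workloads as alignment vs capabilities is itself a hard problem.

```python
from collections import defaultdict
from typing import Iterable, Tuple

def alignment_to_capabilities_ratio(workloads: Iterable[Tuple[str, float]]) -> float:
    """`workloads`: (tag, inference FLOP) pairs for internal workloads only,
    tagged "alignment" or "capabilities". Training compute and compute used
    to serve external customers are excluded, per the commitment above."""
    totals = defaultdict(float)
    for tag, flop in workloads:
        totals[tag] += flop
    if totals["capabilities"] == 0:
        return float("inf")
    return totals["alignment"] / totals["capabilities"]

def commitment_met(workloads: Iterable[Tuple[str, float]],
                   min_ratio: float = 2.0) -> bool:
    return alignment_to_capabilities_ratio(workloads) >= min_ratio
```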

Footnotes
