Rewriting Threshold Classification under Uncertainty

An Anthropic Risk-Report Sample in AI Safety Documentation

Saman Samadi, PhD (Cantab)

Technical rewriting sample | 27 May 2026

Focus: Technical rewriting; threshold classification; audience-sensitive documentation; uncertainty communication.

This page presents a documentation rewriting sample rather than an independent safety assessment of Anthropic’s models. Its purpose is to show how a dense passage from an AI safety risk report can be revised for different audiences while preserving the force, limits, and uncertainty of the original evidence.

The source passage comes from Anthropic’s Risk Report: February 2026, section 4.4, “Current state of model capabilities.” It follows a table comparing chemical and biological weapons-related evaluations for Claude Opus 4.6 and Claude Sonnet 4, then explains how Anthropic uses ASL-3 safeguards, automated evaluations, uplift trials, and uncertainty language when classifying models near a capability threshold.

Source and Locator

Source: Anthropic, Risk Report: February 2026, section 4.4, pp. 66–68.

Passage under review: the two paragraphs on p. 68 beginning “Since the launch of Claude Opus 4…” and continuing through Anthropic’s discussion of uncertainty, threat-actor characteristics, tacit knowledge, proxy tasks, and dangerous real-world tasks.

Report link: https://www-cdn.anthropic.com/08eca2757081e850ed2ad490e5253e940240ca4f.pdf

Context

Anthropic’s February 2026 Risk Report evaluates several categories of catastrophic risk in relation to the company’s AI systems and the mitigations surrounding them. The selected passage belongs to the section on non-novel chemical and biological weapons production. In that context, Anthropic distinguishes between models protected under ASL-2 and ASL-3 safeguards. ASL-2 represents a baseline level of deployment safeguards for models whose capabilities do not yet warrant enhanced protection; ASL-3 represents a stronger safeguard regime for models that could provide meaningful uplift on priority threat models if fully accessible.

The passage is valuable because it performs a difficult documentary task. It has to explain why certain models are placed under ASL-3 safeguards while also keeping the uncertainty of that classification visible. The decision is based on evaluation evidence, but the evidence does not map cleanly onto real-world risk. A model may perform strongly on concrete evaluations without that performance translating directly into practical assistance for a threat actor attempting a complex project over months. The document therefore has to make a threshold decision legible before the underlying uncertainty can be resolved.

Diagnostic Reading

The source passage performs three kinds of documentary work.

It first states a safeguard rule. Anthropic treats Claude Opus 4 as a practical reference point: models with similar or greater general capabilities receive ASL-3 safeguards. This gives the threshold an operational form, but the phrase “general capabilities” carries more weight than the sentence can immediately explain. A reader has to infer how general model capability relates to the chemical and biological threat model under discussion.

It then explains evaluation triage. Automated evaluations provide the scalable comparison across frontier models, while more expensive uplift trials are reserved for models below the frontier that require closer assessment around the ASL-2 / ASL-3 boundary. The logic is reasonable, but compressed. The document could make the relation more explicit: automated evaluations support recurring comparison; uplift trials become more important when the safeguard classification remains unsettled.

The passage finally qualifies the classification. Anthropic states that there is significant uncertainty about the real level of risk posed by Claude Opus 4 and more advanced models. This uncertainty arises from the difficulty of translating evaluation performance into real-world usefulness for a threat actor. Threat actors differ in skill and resources; experts disagree about the importance of tacit knowledge; and the relevant evaluations must use proxy tasks because the actual dangerous tasks cannot be tested directly.

The strongest feature of the passage is its refusal to let threshold classification become false precision. The classification is firm enough to guide safeguards, while the uncertainty remains attached to the public claim. The main weakness is structural. The rule and the uncertainty appear in sequence, but the relation between them could be made clearer. A more explicit version would present ASL-3 classification as a precautionary safeguard decision supported by imperfect but decision-relevant evidence.
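
The structural point can be made concrete with a small sketch. It is an illustration of the documentary logic only, not Anthropic's procedure: the report does not publish a numeric rule, and the scalar `score`, the `reference` level standing in for Claude Opus 4, the `margin` defining "near the boundary", and the ASL-2 default for models well below that margin are all assumptions introduced here. What the sketch shows is the decision and its caveats travelling together in one record, so the classification cannot be read without the uncertainty attached to it.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Tuple


class Safeguard(Enum):
    ASL_2 = "ASL-2"
    ASL_3 = "ASL-3"


# Uncertainty sources named in the source passage. They are attached to
# every decision by default so that no classification appears without them.
UNCERTAINTY_SOURCES: Tuple[str, ...] = (
    "threat actors differ in skill and resources",
    "experts disagree about the importance of tacit knowledge",
    "evaluations use proxy tasks, not the actual dangerous tasks",
)


@dataclass(frozen=True)
class Classification:
    # None means automated evaluations alone do not settle the level.
    safeguard: Optional[Safeguard]
    needs_uplift_trial: bool
    caveats: Tuple[str, ...] = UNCERTAINTY_SOURCES


def classify(score: float, reference: float, margin: float) -> Classification:
    """Hypothetical triage rule; all three parameters are illustrative."""
    if score >= reference:
        # At or above the reference model: precautionary ASL-3.
        return Classification(Safeguard.ASL_3, needs_uplift_trial=False)
    if score >= reference - margin:
        # Below the reference but close to the boundary: unsettled,
        # so the more expensive uplift trial is requested.
        return Classification(None, needs_uplift_trial=True)
    # Assumption: models well below the boundary keep the ASL-2 baseline.
    return Classification(Safeguard.ASL_2, needs_uplift_trial=False)


if __name__ == "__main__":
    for s in (0.9, 0.75, 0.5):
        print(s, classify(s, reference=0.8, margin=0.1))
```

The design choice that matters is the default `caveats` field: a reader of any `Classification` sees the safeguard level and the unresolved translation problem in the same place, which is the relation the source passage leaves implicit.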

Rewrite for Policy and Governance Readers

Anthropic applies ASL-3 safeguards to models whose general capability is comparable to or greater than Claude Opus 4. This classification is based primarily on automated evaluations, which allow the company to compare model capability across releases. More expensive uplift trials are reserved for cases in which a model falls below the current frontier but may still sit close enough to the ASL-2 / ASL-3 boundary to require further assessment.

This should be read as a safeguard classification made under uncertainty. Strong performance on chemical and biological evaluations may indicate that enhanced safeguards are warranted, yet those evaluations do not directly establish how useful the model would be to a real threat actor pursuing a complex project over months. The remaining uncertainty concerns threat-actor skill, the role of tacit practical knowledge, and the use of proxy tasks in place of actual dangerous tasks. The ASL-3 decision therefore functions as a risk-management judgement: evaluation evidence is strong enough to trigger a higher safeguard regime, while the real-world translation of that evidence remains partly unresolved.

Rewrite for Informed Public or Civil-Society Readers

Anthropic says that models at roughly Claude Opus 4’s capability level, or above it, receive stronger ASL-3 safeguards. The company uses automated tests to compare model capabilities across releases. When a model is less capable than the frontier but still close to the boundary, Anthropic may run more expensive trials to decide whether the model needs ASL-3 protections or the lighter ASL-2 baseline.

The important point is that these tests do not measure real-world danger directly. They help Anthropic decide when stronger safeguards are needed, but the company still reports significant uncertainty. A model can do well on controlled evaluations without that result translating neatly into practical help for a harmful actor. Real-world risk depends on the actor’s skill, the amount of practical know-how required, and the fact that safety evaluations must use representative proxy tasks rather than actual dangerous tasks.

Rewrite for an Internal Safety-Documentation Lead

The passage should frame ASL-3 classification as a safeguard decision made from imperfect but action-relevant evidence. The present wording states the operative rule clearly: models at or above Claude Opus 4’s general capability level receive ASL-3 safeguards; automated evaluations are used for scalable comparison; uplift trials are reserved for below-frontier models near the ASL-2 / ASL-3 boundary.

The evidentiary status of that rule should be made more explicit. Automated evaluations support a capability comparison. Uplift trials provide additional evidence when the classification boundary is less settled. Neither resolves the full translation problem from evaluation performance to real-world threat-actor uplift. The uncertainty paragraph should therefore sit in closer relation to the classification paragraph, so that ASL-3 does not read as a simple consequence of benchmark strength. It is better presented as a precautionary safeguard classification in a domain where threat-actor skill, tacit knowledge, and proxy-task validity remain active sources of uncertainty.

Commentary on What Changed

The three rewrites preserve the central claims of the source passage. Anthropic still applies ASL-3 safeguards to models with general capabilities comparable to or greater than Claude Opus 4. Automated evaluations remain the main scalable comparison method. Uplift trials remain reserved for cases where a below-frontier model needs closer assessment near the ASL-2 / ASL-3 boundary. The uncertainty about real-world chemical and biological risk remains central.

The main revision concerns the evidence-to-claim relation. In the source passage, the safeguard rule appears first and the uncertainty statement follows. In the rewrites, the classification and the uncertainty are placed into the same documentary movement. ASL-3 becomes visible as a safeguard decision supported by evaluation evidence, rather than as a direct measurement of catastrophic risk. This distinction matters because the public reader should not infer that automated evaluations alone establish how much real-world harm a model would enable.

The policy-facing version foregrounds governance. It uses the language of classification, threshold, safeguard regime, and risk-management judgement. That register is appropriate for readers concerned with accountability, auditability, and decision procedure.

The public-facing version reduces institutional compression. It explains ASL-3 as stronger safeguards, separates automated tests from more expensive trials, and states the limitation plainly: controlled evaluations can guide safeguards, but they do not reproduce real-world misuse conditions.

The internal documentation version keeps the technical and procedural vocabulary closer to the original. It is written for someone responsible for revising or reviewing the report itself. Its concern is structural: the passage would become stronger if the classification rule and the uncertainty attached to it were brought into closer relation.

Across the three rewrites, no stronger safety claim is introduced. The revisions do not state that ASL-3 safeguards eliminate risk. They do not claim that automated evaluations prove real-world misuse capability. They do not treat proxy tasks as direct evidence of actual dangerous action. The rewriting clarifies the claim while keeping uncertainty attached to the evidence.

Portfolio Value

This sample demonstrates audience-sensitive technical rewriting in AI safety documentation. The task is not stylistic simplification alone. It is claim discipline: preserving the relation between capability evidence, threshold classification, safeguard assignment, and residual uncertainty.

For external-artifacts and system-card roles, the sample shows how release-facing language can be clarified without making it promotional. For responsible AI governance roles, it shows how a safeguard decision can be presented as a documented judgement rather than a bare institutional assertion. For model-evaluation communication roles, it shows how evaluation evidence can be translated into public prose while retaining the limits of proxy measurement.

The underlying skill is directly relevant to AI safety documentation: a public claim becomes stronger when the reader can see what evidence supports it, what decision it authorises, and what uncertainty remains.

Reference

Anthropic. Risk Report: February 2026. Anthropic, February 2026. See especially section 4.4, “Current state of model capabilities,” pp. 66–68, and Appendix 7.6, “ASL-2 and ASL-3 protection levels,” pp. 99–100. https://www-cdn.anthropic.com/08eca2757081e850ed2ad490e5253e940240ca4f.pdf