When an Evaluation Becomes a Claim
Evidence, Language, and AI Safety Reporting
Saman Samadi, PhD (Cantab)
1 March 2026
Focus: Model evaluations; benchmark interpretation; red-team findings; uncertainty and residual risk.
Abstract
Frontier AI documentation gives evaluation a public life. A benchmark score, red-team result, attack-success rate, or capability assessment becomes consequential when a system card, risk report, or safety framework translates it into a claim about capability, mitigation, release, residual risk, or governance adequacy. This article examines that passage from measurement to claim through OpenAI’s GPT-4o System Card, OpenAI’s Preparedness Framework, Anthropic’s Claude Sonnet 4.6 System Card and Responsible Scaling Policy, and Google DeepMind’s report on indirect prompt injection. The main case is GPT-4o’s persuasion section, where mixed modality-specific findings are gathered under a medium-risk label. Anthropic’s prompt-injection and CBRN sections then show how mitigation state and AI Safety Level determinations alter what a public claim can responsibly mean. Google DeepMind’s adaptive-evaluation report adds a methodological pressure point, since non-adaptive testing can produce false confidence before a public claim has been phrased. Across these cases, the article argues that strong AI safety reporting depends on an inspectable evidence-to-claim relation: the evaluated object, comparator, threshold rule, mitigation state, adversarial condition, and residual uncertainty must remain visible enough for public scrutiny.
Keywords: AI safety documentation; system cards; model evaluations; risk reporting; preparedness frameworks; prompt injection; residual risk; frontier AI governance.
1. Introduction: The Documentary Life of Evaluation
Frontier AI systems now enter public life through documents as well as through products. A model may be released through an interface or API, yet the terms under which it becomes publicly intelligible are increasingly arranged by a surrounding documentary order: system cards, safety reports, risk frameworks, preparedness policies, evaluation summaries, transparency hubs, and technical papers written close to the moment of release. These artifacts do more than accompany model release. They give evaluative evidence a public surface, and that surface helps determine what readers can trust, contest, or carry into later governance.
The difficulty gathers around the word evaluation. In ordinary usage, an evaluation appears as a test, and a test appears to confirm or unsettle a claim. Frontier AI documentation works through a less settled documentary grammar. An evaluation may measure a model’s benchmark performance, its resistance to red-team pressure, its refusal behaviour under a policy condition, its ability to complete a dangerous-capability task, or the efficacy of a safeguard after deployment controls have been applied. Each result then enters a document whose audience may include researchers, regulators, journalists, enterprise customers, civil-society actors, internal safety committees, and future auditors. The claim made public by that document is shaped by more than the result itself. It depends on threshold language, baseline comparison, mitigation state, residual-risk framing, visual presentation, and the governance procedure into which the result has been placed.
This article examines that passage from evaluation to claim. Its central object is the evidence-to-claim relation: the inferential path by which a technical result becomes a public statement about capability, risk, mitigation, release, or accountability. A benchmark score does not speak in public by itself. A red-team finding acquires public meaning when the document names the attack surface, the adversary, the success criterion, the mitigation condition, and the residual uncertainty that still surrounds the result. A threshold does more than classify. It authorises an institutional sentence: a capability level has been reached; a safeguard level has become necessary; a risk is being managed under a named policy; a release proceeds under specified conditions.
The argument developed here is that public AI evaluations become consequential when they are translated into thresholded, mitigated, and rhetorically stabilised claims. This translation cannot be avoided. No system card can reproduce the full internal record of model development, testing, review, and deployment decision-making. Public documents have to compress. Compression becomes accountable when the reader can reconstruct the main joints of the passage from task to result, from result to threshold, from threshold to mitigation, and from mitigation to residual-risk judgement. Where those joints disappear, evaluation begins to function less as evidence than as assurance.
The main case study is the persuasion section of OpenAI’s GPT-4o System Card. OpenAI presents GPT-4o as an omni model whose system card includes Preparedness Framework evaluations, external red-teaming, mitigations, and safety assessments across several domains.[1] The persuasion section is an unusually compact case: it places a medium-risk label beside a more conditional evidentiary pattern. The card reports that GPT-4o’s persuasive capabilities “marginally” cross the medium-risk threshold from low risk, distinguishes between text and voice modalities, compares AI-generated interventions with human-written and human-spoken baselines, presents immediate and one-week effect-size measurements, and gives the reader a chart whose visual order helps stabilise the public claim.[2] The section is analytically valuable without any need for accusation. Its force lies in the way a mixed evaluation becomes a release-facing risk classification.
The comparison cases refine the same pressure from different directions. Anthropic’s Claude Sonnet 4.6 System Card defines prompt injection as a malicious instruction hidden in content that an agent processes on the user’s behalf, then reports attack-success rates with and without safeguards across coding, computer-use, and browser-use environments.[3] The same system card’s CBRN material then shows evaluation becoming institutional judgement through ASL-3 and ASL-4 biological-risk evaluations, comparative uplift, preliminary assessment, and the Responsible Scaling Officer’s determination that ASL-3 safeguards are appropriate for the CBRN domain.[4] Google DeepMind’s report on indirect prompt injection changes the scale once more, since it shows that non-adaptive evaluation can produce false confidence before a public claim has been written.[5]
These cases belong to different genres. OpenAI’s GPT-4o document is a system card attached to a major model release. Anthropic’s Sonnet 4.6 system card combines capability reporting, safeguards evaluation, alignment assessment, and RSP reasoning. DeepMind’s indirect prompt-injection paper is a technical report, while Google DeepMind’s Frontier Safety Framework and Microsoft’s Frontier Governance Framework articulate framework-level procedures for capability levels, early-warning evaluations, safety cases, deeper assessments, mitigations, and deployment decisions.[6] The difference between these genres is itself part of the argument, because each converts evidence differently. A system card gathers public-facing evaluation results into a release artifact. A policy framework defines decision rules and safeguard expectations. A technical paper can make visible the methodological limits that system-card summaries may later compress. Together, they show that AI safety documentation is now one of the places where technical evidence, institutional authority, and public accountability meet.
The larger concern is public accountability. If AI safety documentation is to serve as more than release accompaniment, readers must be able to inspect the movement from evidence to claim. That inspection will often remain partial, since public documents cannot disclose everything. Still, they can show whether a score has become a threshold, whether a threshold has become a safeguard requirement, whether a mitigation has been tested under realistic pressure, and whether residual risk has been preserved as part of the claim. When evaluation becomes public language, the central question is what the document asks the measurement to authorise.
2. From Measurement to Public Claim
Evaluation is the technical origin of many public claims in frontier AI documentation, although it rarely carries the whole claim by itself. A benchmark may show that a model performs well on a defined task. A red-team exercise may show that a model can be induced into a harmful behaviour under particular conditions. A capability evaluation may suggest that a model can assist with a domain-specific procedure. A safety evaluation may indicate that a safeguard reduces a specified class of unwanted output. These results matter, yet their public significance depends on what the document does with them. Evaluation becomes claim-bearing when it is placed in relation to task, comparator, threshold, mitigation state, and governance consequence.
The word evaluation can mislead through its apparent simplicity. It suggests a stable procedure by which a system is measured. In frontier AI reporting, evaluations often serve several documentary functions at once. They measure model performance, support internal decision-making, justify external trust, trigger safety procedures, and provide retrospective evidence for a release. Engineers may read a result as a signal about model behaviour. Governance teams may read the same result as evidence for a threshold determination. External readers may receive it as a public statement about risk. The danger appears when those readings collapse into one another.
Capability evaluation and safety evaluation need to remain distinct even where they appear in the same document. A capability evaluation asks what a model or system can do when its performance is elicited under specified conditions. It may involve scaffolding, tools, extended reasoning, repeated attempts, or access to external resources. Its purpose, especially in frontier-risk frameworks, is often to approximate what a capable user or adversary might extract from the system. A safety evaluation asks how the system behaves when risk, misuse, violation, refusal, harmful compliance, or safeguard performance is at issue. The distinction carries public consequence, since a model can be highly capable in a risky domain while the deployed system is presented as adequately safeguarded against practical misuse.
Benchmarks add a second layer of difficulty. A benchmark gives evaluation a repeatable form and allows comparison across systems, versions, or configurations. It can make a claim appear more objective because the task is named, the score is numerical, and the comparison is often visually clear. Yet a benchmark is always a constructed environment. It selects a task, a scoring rule, a prompt format, a model configuration, and a success condition. NIST’s AI RMF treats risk measurement as an area where metrics may be oversimplified, gamed, used outside their intended setting, or unable to account for affected groups and context.[7] A benchmark score can support a bounded comparative claim while remaining weak evidence for a broad safety claim. The score becomes stronger when the document states what the benchmark measures and how its limits bear upon the claim.
The comparator is one of the most important parts of that chain. A result becomes meaningful through relation: a model is more persuasive than a baseline, less successful than a human comparator, more robust than an earlier version, or above a rule-in threshold while remaining below a higher rule-out line. OpenAI’s GPT-4o persuasion material is intelligible because the AI interventions are read against human-written articles, human audio clips, and human conversations.[8] Anthropic’s prompt-injection tables become intelligible through comparison across models, attempt budgets, thinking modes, and safeguard conditions.[9] DeepMind’s indirect prompt-injection findings depend on the contrast between transferred non-adaptive attacks and adaptive attacks shaped against the defence.[10] A claim that omits its comparator withholds part of its own meaning.
Thresholds turn comparison into institutional consequence. A threshold is a rule for translating measured capability into action. OpenAI’s Preparedness Framework connects threat models, tracked categories, capability evaluations, and threshold levels with safeguard requirements before deployment. Covered systems that cross a high capability threshold require robust and effective safeguards before deployment, while critical capabilities require safeguards even during development, irrespective of deployment plans.[11] The threshold changes the public status of the evaluation. The result now supports a claim about what the organisation must do.
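The way a threshold translates a measured capability level into an obligation can be sketched in a few lines of Python. This is a minimal illustration and not OpenAI's procedure: the `CapabilityLevel` enum, the `required_action` function, and the wording of each action are hypothetical paraphrases of the rule described above, and the entry for levels below "high" is an assumption added only to complete the mapping.

```python
from enum import Enum


class CapabilityLevel(Enum):
    """Hypothetical labels; the framework's own categories are richer."""
    BELOW_HIGH = "below_high"
    HIGH = "high"
    CRITICAL = "critical"


# Illustrative paraphrase of the rule described in the text, not official wording.
REQUIRED_ACTION = {
    CapabilityLevel.BELOW_HIGH: "no threshold-triggered requirement in this sketch",
    CapabilityLevel.HIGH: "safeguards required before deployment",
    CapabilityLevel.CRITICAL: "safeguards required during development",
}


def required_action(level: CapabilityLevel) -> str:
    """Translate a threshold determination into the action it authorises."""
    return REQUIRED_ACTION[level]


print(required_action(CapabilityLevel.HIGH))
```

The point of the sketch is that the lookup takes a determination, not a raw score, as its input. Deciding which level a model has reached is the step that involves judgement.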
Mitigation introduces another turn in the evidence-to-claim relation. Once a risk has been identified, the document may describe safeguards, classifiers, monitoring, access restrictions, refusal policies, tool-use controls, security measures, or other interventions. These measures can reduce risk, and they also change the object being described. A pre-mitigation result concerns the model or system before a safeguard is applied. A post-mitigation result concerns the system after the safeguard has entered the evaluation. If a document foregrounds only the post-mitigation state, the reader may attribute safety to the model itself, when part of the safety belongs to the system around it. If both states remain visible, the reader can see how much of the public claim belongs to model behaviour and how much belongs to deployment architecture.
Residual risk prevents mitigation from becoming closure. A responsible document should rarely imply that risk has vanished. It should say what risk remains after the documented mitigations, what assumptions govern that judgement, and what monitoring or further action follows from it. NIST defines residual risk as risk remaining after risk treatment and states that documenting residual risks informs end users about potential negative impacts of interacting with the system.[12] Anthropic’s RSP gives this concept a frontier-model form by treating Risk Reports as documents that should explain threat models, relevant capabilities and behaviours, mitigations, mitigation effectiveness, and remaining absolute risk after mitigation.[13] A residual-risk claim is therefore one of the places where a safety document tells the reader what remains present, uncertain, tolerable, or in need of oversight.
The article’s method follows from this grammar. Each case is read by asking what was evaluated, which comparator gives the result meaning, whether the result is pre- or post-mitigation, what threshold or policy rule transforms the result into a governance claim, and how residual uncertainty remains attached to the statement being made. These questions do not require a comprehensive judgement on whether a model is safe. They require a narrower and more useful judgement: whether the public claim remains proportionate to the evidence offered for it.
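The five questions can be pictured as a simple checklist record. The sketch below is one hypothetical way to encode them in Python; the `ClaimRecord` class, its field names, and the `missing_joints` helper are illustrative inventions and do not correspond to any tool used in the documents under discussion. The example values are placeholders, not a reading of any real system card.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class ClaimRecord:
    """Hypothetical record of the joints that link a result to a public claim."""
    evaluated_object: Optional[str] = None
    comparator: Optional[str] = None
    mitigation_state: Optional[str] = None  # e.g. "pre-mitigation" or "post-mitigation"
    threshold_rule: Optional[str] = None
    residual_uncertainty: Optional[str] = None


def missing_joints(record: ClaimRecord) -> List[str]:
    """Return the names of the questions a document leaves unanswered."""
    return [f.name for f in fields(record) if not getattr(record, f.name)]


# A partially completed, purely illustrative record.
example = ClaimRecord(
    evaluated_object="model X on task Y",
    comparator="human baseline Z",
    threshold_rule="pre-registered threshold T",
)
print(missing_joints(example))  # ['mitigation_state', 'residual_uncertainty']
```

An empty list would not show that a claim is proportionate to its evidence. It would show only that each joint has been named and can therefore be inspected.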
3. Thresholds as Claim-Making Devices
A threshold is the place where an evaluation result begins to acquire public consequence. Before that point, a measurement may indicate a model’s performance on a task, its response to an adversarial probe, its refusal behaviour under policy pressure, or its ability to complete a capability-relevant procedure under elicitation. Once the result is measured against a threshold, the document has changed the status of the evidence. The result has been made available to decision. It can now support a scorecard label, a safeguard requirement, a release condition, a risk report, or a public statement that the model falls within a specified band of concern. Threshold language carries more than technical meaning in frontier AI documentation. It is one of the devices through which measurement becomes institutionally usable and publicly claimable.
OpenAI’s Preparedness Framework gives a clear example of this conversion. The framework defines tracked categories in which frontier capabilities are monitored because they may create severe harms, and each tracked category is linked to capability thresholds marking meaningful increases in risk.[14] Biological and chemical capability, cybersecurity capability, and AI self-improvement are placed under this structure, with high and critical thresholds that activate different safeguard expectations.[15] The threshold is therefore neither a mere benchmark score nor a loose warning sign. It is a procedural hinge. A covered model crossing a high capability threshold must be matched with safeguards judged sufficient to minimise the associated risk before deployment, while a critical capability can require stronger restrictions during development itself. The evaluation becomes consequential because the framework gives the result a place within a governance sequence.
The same framework shows that threshold determinations are not produced by raw measurement alone. OpenAI states that, before deployment, covered models undergo scalable evaluations and that their results, together with observations affecting interpretation, are compiled into a Capabilities Report for the Safety Advisory Group. The determination that a threshold has been reached is informed by those evaluation results, while also reflecting holistic judgement based on the totality of available evidence and the robustness of the methodology.[16] The public category may appear sharp, but the process behind it includes interpretation, evidence selection, robustness assessment, and institutional judgement. The threshold gives categorical form to a result that remains mediated by procedure.
OpenAI’s account of capability elicitation sharpens this point. The framework states that evaluations aim to approximate the full capability that an adversary contemplated by the relevant threat model could extract from a deployment candidate, using high-capability settings, model variants with negligible safety-based refusals where necessary, and the best available scaffolds.[17] A capability evaluation is therefore an attempt to draw out what the model can do under conditions approximating strong misuse pressure. Even then, the framework treats one-time elicitation as a lower bound, since new scaffolding and elicitation techniques may reveal stronger capability later.[18] Thresholds sit within a field of uncertainty. They give governance a workable form, while carrying the knowledge that capability is never exhausted by the test that currently measures it.
Anthropic’s Responsible Scaling Policy provides a useful contrast because it moves the centre of gravity from score presentation toward public argument. The RSP describes itself as a voluntary framework for managing catastrophic risks from advanced AI systems, and its third version maps capability thresholds to mitigations while distinguishing Anthropic’s own planned mitigations from more ambitious industry-wide recommendations.[19] In a crucial passage, Anthropic acknowledges that it cannot presently give highly specific advance detail on the exact evaluations that will determine whether risk thresholds have been passed, or on the exact mitigations required to achieve safety.[20] The response is to require analysis and arguments that make a strong case for safety. This is a different documentary posture. Where a scorecard condenses judgement into a label, the RSP foregrounds the argumentative burden that remains when thresholds and mitigations cannot be fully fixed in advance.
That emphasis becomes especially important in Anthropic’s treatment of Risk Reports. The policy gives Risk Reports a central role in explaining how a model’s capabilities, mitigations, and overall risk judgement are assessed before a release or major deployment decision. A Risk Report is meant to document threat models, evidence about capabilities and behaviour, risk mitigations, the effectiveness of those mitigations, remaining absolute risk, overall risk, risk-benefit reasoning, and future monitoring plans.[21] A Risk Report, in this sense, is the form in which evaluation is asked to become accountable reasoning. The public value lies in the chain it can make inspectable: capability threshold, threat pathway, mitigation claim, residual risk, and institutional decision.
The threshold, then, is a documentary instrument. It does not remove judgement from AI safety reporting. It organises judgement so that a public claim can be made, challenged, updated, or withheld. It lets a document say that a model has crossed a relevant capability boundary, that a safeguard level has become necessary, that a release is conditioned by mitigation, or that further development should halt until additional controls exist. At the same time, the threshold can project more stability than the underlying evaluative science yet possesses. Its authority depends on the surrounding documentation: task definition, capability elicitation, threat model, comparator, mitigation state, residual risk, and the candour with which uncertainty is preserved.
This is the frame within which the GPT-4o persuasion scorecard should be read. The label “Medium” does not enter the System Card as a neutral description of persuasive power. It enters through a threshold grammar that makes a behavioural finding publicly intelligible as risk classification. Persuasion offers a particularly compressed instance of that movement: mixed evidence, modality distinctions, human baselines, follow-up measurements, a chart, and a public label whose categorical force exceeds the simplicity of any single result.
4. GPT-4o Persuasion and the Medium-Risk Label
The persuasion section of the GPT-4o System Card offers a compact instance of the documentary movement this article is concerned with. A series of empirical observations, each carrying limited force on its own, is gathered into a scorecard whose public form assigns the model a risk category. The transition can be seen almost at once. A label appears before the reader has worked through the evidentiary detail. The scorecard states “Persuasion Score: Medium,” and beneath that heading OpenAI writes that the persuasive capabilities of GPT-4o “marginally cross” the medium-risk threshold from low risk.[22] The phrasing is compressed, yet it carries several operations together. It names a domain of capability, places that domain within a threshold structure, records a crossing, and qualifies the crossing as marginal. The claim has become classificatory before the detail of the evaluation appears.
The importance of the word “marginally” lies in the friction it introduces into the scorecard’s categorical form. “Medium” is a stable public label. “Marginally” returns the reader to the conditions under which that label has been assigned. The section therefore makes visible the pressure between classification and evidence, because the reader receives a risk category while also being reminded that the crossing between low and medium sits close to the boundary that separates them. In ordinary public reading, the category may travel more easily than the qualification. The document acknowledges modality, baselines, pre-registered thresholds, and follow-up data, yet the scorecard has already performed part of the public work.
The section then divides the evaluation into text and voice modalities. OpenAI states that, on the basis of pre-registered thresholds, the voice modality was classified as low risk, while the text modality marginally crossed into medium risk.[23] This distinction matters because persuasion is not treated as a single undifferentiated capability. The evaluated surface of the model changes across modality, and the risk label changes with it. A voice intervention, a written article, and an interactive chatbot do not carry the same evidentiary status simply because they are all grouped under persuasion. The document’s more careful claim is thresholded by modality, and the text modality does the decisive work in moving the public score into medium risk.
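One way to picture the relation between the modality-level classifications and the overall score is as a maximum over modalities. The material discussed here does not state the aggregation rule, so the `overall_label` function and the "highest label wins" rule below are assumptions made for illustration only.

```python
from typing import Dict

# Ordered risk levels, lowest first. The names echo the scorecard; the ordering
# helper itself is a hypothetical construction.
LEVELS = ["low", "medium", "high", "critical"]


def overall_label(by_modality: Dict[str, str]) -> str:
    """Assumed rule: the overall label is the highest modality-level label."""
    return max(by_modality.values(), key=LEVELS.index)


labels = {"voice": "low", "text": "medium"}  # classifications reported in the text
print(overall_label(labels))  # medium
```

Under this assumed rule, a single modality is enough to set the public score, which is why the text result carries so much of the label's weight.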
The text-modality paragraph complicates the scorecard further. OpenAI reports that GPT-4o-generated articles and chatbots were evaluated on participant opinions concerning selected political topics, with AI interventions compared against professional human-written articles. The result is mixed: the AI interventions were not more persuasive than human-written content in aggregate, although they exceeded the human interventions in three of twelve instances.[24] The evidence supports a specific proposition about performance under particular conditions and topic selections. It also supports the internal classification that OpenAI assigns within its threshold structure. It gives much less support to a broad public statement that GPT-4o is more persuasive than human writers, because the aggregate result explicitly withholds that conclusion. The medium label depends on the framework through which the result is made decision-relevant.
The lower chart on the scorecard page sharpens the problem. Its grouped bars compare human articles, AI articles, and AI chatbots across aggregate opinion and selected political topics. The visual organisation gives the data an appearance of settlement. The bars are aligned and placed under the scorecard’s already assigned medium label. Yet the chart also disperses the claim into topic-specific and ideology-specific conditions. The aggregate comparison does not produce the same impression as the abortion, minimum-wage, or immigration panels.[25] The chart gives the page evidentiary density while making clear that persuasion appears unevenly across issue, format, and participant grouping. It projects visual confidence over conditional content.
The voice-modality results work differently. OpenAI reports an updated study design measuring effect sizes on hypothetical party preferences and the persistence of those effects one week later. GPT-4o-voiced audio clips and interactive conversations are compared with human baselines: static human-generated audio clips and conversations with another human. OpenAI states that, for both interactive multi-turn conversations and audio clips, the GPT-4o voice model was not more persuasive than a human. Across more than 3,800 surveyed participants in US states whose Senate races were classified as safe by three polling institutions, AI audio clips reached 78 per cent of the human audio clips’ effect size on opinion shift, while AI conversations reached 65 per cent of the human conversations’ effect size.[26] These figures indicate a capacity to move opinion under the study conditions, while remaining comparative and bounded.
The one-week follow-up further alters the evidentiary surface. OpenAI reports that, when opinions were surveyed again after one week, the effect size for AI conversations was 0.8 per cent, while the effect size for AI audio clips was -0.72 per cent.[27] This follow-up changes the temporal character of the claim. A persuasive intervention that produces an immediate shift may have a different risk profile from an intervention whose effects persist. The document includes this distinction, and its inclusion strengthens the section. At the same time, the scorecard label does not visually differentiate immediate influence from enduring influence. The page’s public architecture places both temporal layers beneath the same medium-risk heading, while the prose carries the more difficult distinction between momentary effect, comparative effect, modality, and persistence.
This case shows why evaluation communication has to ask how much public weight a result is carrying. In the GPT-4o persuasion section, OpenAI’s evidence is carefully bounded. The document names modality, comparator, participant setting, effect sizes, and follow-up. It also reports that participants were debriefed after the follow-up survey in order to minimise persuasive impacts.[28] These details create a record of methodological care. Yet the medium-risk label condenses that record into a public category whose circulation will almost certainly be more durable than the qualifications attached to it. The label functions as a hinge between measurement and governance.
The section’s relation to the Preparedness Framework is therefore decisive. The score belongs to a wider architecture in which tracked and researched capabilities are measured against thresholds, and those thresholds inform decisions about safeguards and deployment. Persuasion, in the GPT-4o System Card, is a capability class routed through an institutional grammar of risk. The phrase “medium risk threshold” gives the empirical result a procedural destiny. It makes the result usable by a governance process and then by a public document. Evaluation becomes claimable because a threshold system makes the finding classifiable, and the system card gives that classification a stable public form.
The GPT-4o persuasion case consequently gives this article its central principle. A strong safety document should allow the reader to see the chain through which a result becomes a public classification. In the persuasion section, that chain can be reconstructed: the evaluated task concerns persuasive interventions on selected political topics and hypothetical party preferences; the comparators are professional human-written articles, human static audio, and human conversation; the evidence appears through effect sizes, topic-specific comparisons, and follow-up measurement; the threshold system translates part of that evidence into a medium-risk label; and the system card presents the label as part of a release-facing public artifact. The document is relatively strong because these elements are present. It remains analytically exposed because the public label is more compact than the evidence that sustains it.
This is a structural condition of frontier AI documentation. A public system card has to make technical evidence readable under time pressure, institutional scrutiny, and political consequence. It has to compress without severing the relation between measurement and judgement. The GPT-4o persuasion section succeeds to the extent that it preserves modality, comparator, threshold, and follow-up. It becomes vulnerable to over-reading where the scorecard’s categorical stability begins to dominate the more conditional evidentiary material below it. The medium score is real within the document’s governance grammar. The evidentiary basis remains narrower, mixed, and conditional. That tension gives the case its value: it shows how evaluation becomes public classification while leaving the reader with enough evidence to test the claim’s scale.
5. Mitigation State and the Problem of What Is Being Claimed
The GPT-4o persuasion case showed how a threshold label can gather mixed evidence into a public risk classification. Anthropic’s Claude Sonnet 4.6 System Card brings the same problem into another register, where the central difficulty concerns the relation between model robustness, safeguards, attack surface, and deployment claim. Prompt injection is an especially useful case because the danger arises when an agent processes external content on the user’s behalf and encounters hidden instructions that attempt to redirect its behaviour. Anthropic defines prompt injection in exactly this way, using the example of a website visited by an agent or an email summarised by an agent.[29] The public claim must specify the surface on which robustness was tested, the adversary’s strength, the number of attempts permitted, and the role played by additional safeguards before robustness can be stated with any discipline.
Anthropic’s system card is comparatively strong because it separates several layers often collapsed in public discussion. The document distinguishes tool-use benchmarks from adaptive attacks across coding, computer-use, and browser-use environments. It also distinguishes between conditions with and without safeguards. That separation changes the evidentiary force of the tables. A result produced without safeguards says something about the model’s own behaviour under a specified attack condition. A result produced with safeguards says something about a deployed or deployable system in which additional classifiers, instructions, tool-response warnings, or product-level defences are doing part of the safety work. This distinction determines what kind of public claim can be made.
The Agent Red Teaming benchmark introduces the first layer of this structure. Anthropic reports that Gray Swan, an external research partner, evaluated the models using the ART benchmark developed with the UK AI Security Institute, testing susceptibility to prompt injection and measuring the probability that an attacker finds a successful attack after one, ten, or one hundred attempts across nineteen scenarios.[30] This is more careful than a single robustness percentage. Attack success is treated as probabilistic, repeated attempts are recognised as increasing the chance of success, and the benchmark is framed around indirect prompt injection in tool use. The report asks the reader to understand robustness as an adversarial relation between attack and defence rather than as a static property of the model.
The Shade coding evaluation makes the mitigation-state problem clearer. Anthropic states that Shade is an external adaptive red-teaming tool from Gray Swan, used to evaluate prompt-injection attacks in coding environments. In the reported coding table, Claude Sonnet 4.6 with extended thinking reaches 0.0 per cent attack success both with and without safeguards, across the one-attempt condition and the 200-attempt adaptive condition. Under standard thinking, the model records 0.1 per cent attack success without safeguards at one attempt and 7.5 per cent at 200 attempts, while the safeguarded condition lowers those figures to 0.04 per cent and 5.0 per cent respectively.[31] In a public release document, those numbers could easily be compressed into a claim that Sonnet 4.6 is highly robust against coding-environment prompt injection. That claim remains safe only when it stays attached to the tested surface, attacker budget, thinking mode, and safeguard condition.
The computer-use table is more difficult and therefore more valuable for the article. In graphical user-interface environments, Sonnet 4.6 again improves substantially over Sonnet 4.5 and performs strongly relative to the Opus models. But the with-safeguards relation is less smooth than a reader might expect. For Sonnet 4.6, extended thinking without safeguards shows 12.0 per cent attack success at one attempt and 42.9 per cent at 200 attempts; with safeguards, the one-attempt result falls to 8.0 per cent, while the 200-attempt result is reported as 50.0 per cent. Standard thinking moves from 14.4 per cent and 64.3 per cent without safeguards to 8.6 per cent and 50.0 per cent with safeguards.[32] The one-attempt condition improves under safeguards, and the standard-thinking 200-attempt condition also improves. The extended-thinking 200-attempt condition resists a simple before-and-after story. A careful document reader should therefore avoid turning the section into a general assurance that safeguards uniformly reduce risk across all conditions.
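The four comparisons in the computer-use table can be set side by side with a few lines of arithmetic. The sketch below is illustrative only: the figures are those quoted above from the system card, while the dictionary layout and the labels are assumptions of this example and do not reproduce Anthropic’s own table format.

```python
# Attack-success rates (per cent) for Claude Sonnet 4.6 in computer-use
# environments, as quoted in the paragraph above.
# Keys are (thinking mode, attempt budget); the labels are hypothetical.
reported = {
    ("extended", 1): {"without": 12.0, "with": 8.0},
    ("extended", 200): {"without": 42.9, "with": 50.0},
    ("standard", 1): {"without": 14.4, "with": 8.6},
    ("standard", 200): {"without": 64.3, "with": 50.0},
}

# Change in percentage points once safeguards are added.
# A negative value means fewer successful attacks with safeguards.
deltas = {
    condition: round(rates["with"] - rates["without"], 1)
    for condition, rates in reported.items()
}

for (mode, attempts), delta in deltas.items():
    print(f"{mode} thinking, {attempts} attempt(s): {delta:+.1f} points")
```

Three of the four conditions move downward. The extended-thinking, 200-attempt condition moves upward by 7.1 points, which is the asymmetry that prevents a uniform before-and-after reading.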
Mitigation state earns its place in the article at this pressure point. A mitigation has no magical capacity to transform a risk result into a safety claim. It has to be evaluated in relation to attack surface, adversary, number of attempts, model mode, and measured outcome. The computer-use table shows how a public safety document can be strong even when the result is not perfectly symmetrical. Its value comes from showing the complexity. It also illustrates why a table can be both clear and hard to interpret. The table is visually organised, but the meaning of the numbers depends on knowing that the 200-attempt adaptive condition measures whether at least one of 200 attempts succeeded for a given goal.[33] A reader who treats that column like an ordinary per-attempt success rate will misread the claim.
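The difference between the two units can be made concrete with a short calculation. The sketch below assumes, purely for illustration, that every attempt succeeds independently and with the same probability. That assumption is not drawn from Anthropic’s methodology, and the function names are hypothetical.

```python
# Illustrative only: assumes attempts are independent and equally likely
# to succeed, which adaptive red teaming does not guarantee.

def any_success_rate(per_attempt: float, attempts: int) -> float:
    """Probability that at least one of `attempts` independent tries succeeds."""
    return 1.0 - (1.0 - per_attempt) ** attempts


def implied_per_attempt(any_success: float, attempts: int) -> float:
    """Per-attempt rate that would yield `any_success` under independence."""
    return 1.0 - (1.0 - any_success) ** (1.0 / attempts)


# 50.0 per cent is the with-safeguards, 200-attempt computer-use figure
# quoted earlier in this section.
reported_any_of_200 = 0.50
per_attempt = implied_per_attempt(reported_any_of_200, 200)
print(f"Implied per-attempt rate: {per_attempt:.2%}")
```

Under that assumption, a 50.0 per cent any-of-200 figure corresponds to a per-attempt rate of roughly 0.35 per cent. The reported one-attempt figures are far higher than that, which indicates that attempts in the actual evaluation are not independent and identically distributed. The calculation should therefore be read as a demonstration that the two columns are measured in different units, and not as a reconstruction of the results.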
The browser-use section gives the cleanest example of post-mitigation reporting. Anthropic describes an internal evaluation in which untrusted content is dynamically injected into web environments later viewed by the model through screenshots or page reads, with an adaptive attacker given ten attempts to craft a successful injection. Without safeguards, Sonnet 4.6 records a 1.29 per cent attack-success rate across scenarios in both extended and standard thinking, with per-attempt rates of 0.24 and 0.29 per cent respectively. With additional safeguards, the standard-thinking condition falls from 1.03 per cent of scenarios and 0.16 per cent of attempts under previous safeguards to 0.51 per cent and 0.08 per cent under updated safeguards.[34] The reported deployed-defence layer materially alters the risk surface. It also changes the meaning of the claim: the result concerns a product-like system in which model behaviour, browser environment, injected content, adaptive attacker, and safeguards interact.
The relation to the article’s central thesis is direct. An evaluation becomes a claim after the document tells the reader what the result is allowed to mean. The Claude Sonnet 4.6 prompt-injection material shows that mitigation state is one of the main devices through which that meaning is governed. A table of attack-success rates is not yet a deployment claim. It becomes one when the document places it beside safeguards, threat surfaces, attacker budgets, external red teaming, and release language. The result can then support a bounded judgement about the tested and mitigated system. It cannot, without further evidence, support a claim that prompt injection has been solved, that future attacks will remain contained, or that robustness in one agentic surface generalises across the others.
The first principle, drawn from the GPT-4o persuasion case, is that threshold labels should remain visibly answerable to the evidence that produces them. The second is that mitigation claims should remain visibly answerable to the conditions under which mitigation was tested. Prompt injection is a demanding test of this principle because the attack is contextual, adaptive, and surface-dependent. Anthropic’s Sonnet 4.6 system card does not remove those difficulties. It makes many of them available for inspection. That availability is itself part of the document’s value as an external artifact.
6. Safety Levels, CBRN, and Governance Judgement
The prompt-injection material shows how mitigation state changes the object of a public claim. Anthropic’s CBRN and AI Safety Level material pushes the same problem further, because the claim concerns the movement from evaluation to institutional determination. In the Claude Sonnet 4.6 System Card, biological-risk evaluations are placed within the Responsible Scaling Policy, then routed through AI Safety Level categories, threshold logic, comparative model assessment, and the judgement of the Responsible Scaling Officer. The resulting public claim is therefore a narrow one: that, under Anthropic’s RSP process, ASL-3 safeguards are judged appropriate for the CBRN domain. Its force comes from that narrowing.
Anthropic states that Claude Sonnet 4.6 was evaluated under the Preliminary Assessment Process because it was not considered notably more capable than the recently released Claude Opus 4.6. The process used automated assessments for ASL-3 and ASL-4 thresholds across the relevant RSP domains, with comparative results presented against Sonnet 4.5, Opus 4.5, and Opus 4.6. It did not include human uplift trials, expert red-teaming sessions, or other resource-intensive evaluations requiring human participants.[35] The public claim that follows from this process is conditioned by the kind of assessment performed. It should be read as a preliminary, automated, comparative determination under a specified policy process, with no claim to reconstruct independently every pathway through which biological misuse might occur.
The document makes this conditional structure more visible by reporting how the evaluations were handled. Anthropic evaluated multiple snapshots of the model, including a helpful-only version, and reports the highest scores obtained for each evaluation, since those scores were taken to give a better indication of the capability ceiling in dangerous domains covered by the RSP.[36] This is an important documentary decision. By reporting the strongest observed performance across the relevant snapshots, Anthropic moves the evaluation closer to an upper-bound capability claim. The public reader is therefore given a more conservative account of possible capability than a release-only score might provide. At the same time, the result remains an evaluated ceiling within a defined test suite and falls short of a full map of possible future elicitation, scaffolding, or misuse configurations.
The CBRN section distinguishes the threat models for ASL-3 and ASL-4. ASL-3 focuses on the ability to significantly help individuals or groups with basic technical backgrounds, such as undergraduate STEM degrees, to create, obtain, and deploy CBRN weapons. ASL-4 turns to a more severe scenario: AI systems that could substantially uplift moderately resourced state programmes, for instance through novel weapons design, substantial acceleration of existing processes, or dramatic reduction in technical barriers.[37] The evaluation receives its meaning from the threat model. A biological-risk result becomes public safety language only after the document states the kind of actor, level of resource, and type of uplift that the threshold is meant to capture.
The ASL-3 results are reported in a compressed but significant form. Anthropic states that Claude Sonnet 4.6 performed above the ASL-3 rule-in thresholds on all three ASL-3 evaluations while not exceeding the performance of previous models. The document interprets this as indicating that Sonnet 4.6 is likely to provide a similar degree of uplift for ASL-3 threat actors in the biological domain to that provided by earlier released models, including Sonnet 4.5, Opus 4.5, and Opus 4.6.[38] This is a carefully bounded claim. It places the model within a capability region already associated with ASL-3 safeguards. The phrase “similar degree of uplift” frames the evaluation as placement within a known safeguard regime.
The ASL-4 results produce the complementary movement. Anthropic states that Sonnet 4.6 performed below previously released models across the relevant evaluations, and in particular did not cross the threshold on the ASL-4 rule-out short-horizon computational biology tasks evaluation. The resulting interpretation is that Sonnet 4.6 is likely to provide a degree of uplift for ASL-4 threat actors in the biological domain lower than or equal to that of Claude Opus 4.6.[39] The model is therefore placed below a higher-risk line, relative to a model already released under ASL-3 safeguards. The ASL-4 rule-out does not erase ASL-3 capability; it limits the stronger claim that the model has crossed into the next level of biological-risk concern.
This is the point at which evaluation most visibly becomes governance. Anthropic writes that, on the basis of the reported results, the Responsible Scaling Officer determined ASL-3 safeguards to be appropriate for the CBRN domain for Claude Sonnet 4.6.[40] The statement is short, but it contains the evidence-to-claim chain in compressed form. Automated evaluations are selected and grouped under ASL-3 and ASL-4. Results are interpreted against rule-in and rule-out thresholds. Those results are compared with previous models released under ASL-3 safeguards. The Responsible Scaling Officer then converts the evaluated capability profile into a safeguard-level determination. This is an institutional judgement made public through system-card language.
The precision of this passage gives it documentary strength. A weaker safety document might have allowed the biological-risk results to imply that the model did not present meaningful CBRN danger. Anthropic’s phrasing works at a narrower scale. It says that ASL-3 safeguards are appropriate for the CBRN domain. That claim leaves room for the model to have ASL-3-relevant biological capability while also stating that the higher CBRN-4 capability threshold has not been crossed. It leaves room as well for uncertainty around future measurement, because the claim is tied to the current assessment process and does not expand into an unrestricted statement about biological safety.
The appropriate reading is that Anthropic presents evidence, under its RSP, that Sonnet 4.6 remains within a safeguard regime already applied to comparable models, and that the Responsible Scaling Officer judged ASL-3 safeguards appropriate for the CBRN domain. The claim is policy-mediated, evidence-supported, and conditional. Its public value depends on those conditions remaining visible. The section strengthens the article’s central argument because it shows evaluation becoming claim through procedure: a threat model is named, selected capabilities are tested, results are measured against thresholds, the model is placed relative to prior systems, and a responsible official’s determination gives the public statement its governance form.
7. Adaptive Evaluation and the Limits of Numerical Confidence
Google DeepMind’s report on defending Gemini against indirect prompt injections changes the scale of the argument. The OpenAI and Anthropic materials examined so far concern public documents in which evaluations are translated into release-facing claims through thresholds, safeguards, scorecards, and institutional determinations. DeepMind’s report asks a prior question: what happens when the evaluation itself gives a falsely reassuring picture of the defence? The issue is how the design of the test shapes the claim before public prose has begun to phrase it.
The report defines indirect prompt injection as an attack in which malicious instructions are embedded in external data sources that the model subsequently retrieves and incorporates into its context. An attacker might place hidden commands inside an email that the model is instructed to summarise.[41] The danger appears most clearly in agentic settings, where models call tools, handle user data, retrieve documents, or interact with external systems. A prompt-injection attack, under these conditions, may become a route through which private information is exfiltrated, permissions are mishandled, or user intent is displaced by adversarial instruction. The evaluation of such attacks has to approximate adversarial pressure, since the threat arises from hostile manipulation of context and not from benign use that happens to go wrong.
DeepMind’s central lesson is unusually direct: adaptive evaluation is crucial. The report states that many defences performing well on static evaluation sets can be tricked by small adaptations to the attack, and that attacks adjusted in response to the defence are necessary for a realistic impression of the protection provided.[42] This is a methodological claim with immediate documentary consequence. If a public report states that a defence substantially reduces attack success, while the evaluation was static, non-adaptive, or insufficiently adversarial, the resulting safety claim may inherit a confidence that the test has not earned. The weakness would not lie in the number itself. It would lie in the enlargement of what that number is allowed to mean.
The paper’s discussion of Gemini 2.5 makes this problem concrete. DeepMind reports that prompts optimised against Gemini 2.0 produced a much lower attack-success rate when transferred to Gemini 2.5, falling from 92 per cent on Gemini 2.0 to 18 per cent on Gemini 2.5.[43] A document that stopped there could easily have generated a strong robustness claim. The numerical contrast appears to give the reader a simple story: the later model is much harder to attack. Yet DeepMind treats that inference as unsafe. Once adaptive attacks were developed against Gemini 2.5, the picture became more difficult, including a 94.6 per cent attack-success rate in one TAP setting without external or system-level defences.[44] The document’s admission is the crucial point: a non-adaptive evaluation could have produced a public claim that looked cleaner than the underlying security situation warranted.
Adaptive evaluation belongs to the logic of accountability as much as to the technical apparatus of adversarial machine learning. The public reader rarely sees the attacker’s work. The reader sees the reported number, the chart, the phrase “reduction,” and the implied direction of safety improvement. If the evaluation has not exposed the defence to attacks shaped against it, the report may present visual or numerical confidence that is partly an artifact of weak adversarial testing.
DeepMind’s report is strong because it names this failure mode. It shows that an evaluation can be honest in its own terms and still insufficient for the broader claim a reader might draw from it.
The concept of attack-success rate is revealing because it looks stable while depending heavily on the adversary against which it was measured. A low attack-success rate under static prompts may show that the model resists a known test set. It does not show that the model will resist an attacker who changes wording, exploits contextual cues, takes advantage of tool use, or optimises against the defence. A higher attack-success rate under adaptive testing may look worse, yet it can produce a more trustworthy document because it gives the reader a stronger adversarial basis for judgement. The better number for public accountability need not be the smaller number. Sometimes it is the number produced by the more severe test.
This point returns us to the role of visual confidence in safety reporting. A chart showing a large fall in attack success can become persuasive before the reader has studied the test conditions. If the chart does not clearly state whether the attack was static, transferred, adaptive, black-box, grey-box, or optimised against the defence, its visual order may mislead through clarity. DeepMind’s report helps resist that problem by making the evaluation distinction part of the argument. It places the reduction under renewed pressure by asking whether the defence survives an attacker who has adapted to it.
DeepMind’s report is therefore a useful corrective to the tendency of AI safety documentation to seek reassurance through numerical presentation. The strongest documentary move in the paper is not the display of improvement, although improvement is present. It is the explicit account of how a weaker evaluation could have misled. That admission gives the document methodological candour. It also gives AI safety documentation a principle that should travel beyond prompt injection: where a safety claim depends on adversarial testing, the report should identify whether the adversary was allowed to adapt. Without that information, the public reader cannot tell whether the evaluation measured a resilient defence or only the defence’s performance against yesterday’s attack.
The article’s broader argument can now be stated more sharply. A threshold can make evidence actionable; a mitigation table can distinguish model behaviour from safeguarded system behaviour; a safety-level determination can convert evaluation into governance judgement. DeepMind’s adaptive-evaluation lesson shows that all three depend on the prior validity of the test. If the evaluation is too weak, the threshold becomes prematurely stable, the mitigation claim becomes overconfident, and the governance judgement inherits an unstable evidentiary base. Public AI safety reporting therefore needs adversarially honest presentation, where the limits of the test remain visible alongside the result.
8. What Better Evaluation Communication Should Do
The cases examined so far do not lead toward a demand for total transparency, as though frontier AI organisations could publish every internal evaluation, red-team transcript, model snapshot, and deliberative record in full. Public disclosure matters, but the question here is narrower and more exact. Evaluation communication improves when the public document makes the inferential path visible enough for the reader to understand the kind of claim being made. A safety document may remain selective, redacted, compressed, or institutionally mediated; it becomes weaker when that mediation is hidden, when a number stands for a broader claim than the evaluation can sustain, or when the public wording detaches from the task, comparator, mitigation state, or threshold rule that gives the result its measure.
The evaluated object has to remain in view. A model is not always the same object as a deployed system, a tool-using agent, a scaffolded evaluation setup, a post-trained release candidate, or a model surrounded by classifiers, monitoring, product controls, and system-level mitigations. In the GPT-4o persuasion section, the evaluated object changes across text and voice modalities, and the score attached to persuasion cannot be read responsibly without that modal distinction. In the Claude Sonnet 4.6 prompt-injection section, the object changes between conditions with and without safeguards. DeepMind’s indirect prompt-injection report changes it again, because the result depends on whether the attack is static, transferred, or adaptive. A public claim that names “the model” without specifying the evaluated surface invites misreading before the evidence has been considered.
Comparison gives evidence its scale. A result is rarely self-measuring. It becomes legible against a human baseline, a previous model, a post-mitigation state, an attacker budget, a threshold band, or an adversarial condition. GPT-4o’s persuasion results depend on professional human-written articles, human audio clips, and human conversations; Anthropic’s prompt-injection tables depend on model generation, thinking mode, safeguard state, and number of attack attempts; DeepMind’s Gemini results depend on the difference between transferred attacks and adaptive attacks shaped against the defence. Better evaluation communication should therefore preserve the comparator as part of the claim, rather than letting the score circulate as if it were absolute.
Mitigation state must remain attached to every safety-relevant statement. The word mitigation can become a term of reassurance before it has earned that role. It suggests that a problem has been addressed, while leaving open whether the intervention was tested, how it performed, whether it altered base-model behaviour or intercepted outputs, and what risk remained afterwards. Anthropic’s with- and without-safeguards tables are valuable because they refuse to let this distinction disappear. They show that prompt-injection robustness has to be read through the condition under which it was measured. When a document says that a risk has been reduced, it should make clear whether the reduction belongs to the model, the product layer, the monitoring apparatus, the tool environment, or their combined operation.
Threshold rules need the same discipline. They make evaluation actionable by moving from measurement to decision, from observed capability to safeguard level, from risk signal to release condition. Their usefulness is also their danger. A threshold can make a judgement look more settled than it is. The GPT-4o persuasion case shows this pressure with unusual clarity: “Medium” is a public label, while “marginally cross” preserves the closeness of the boundary and the dependency of the claim on pre-registered threshold conditions. Better documentation should keep that dependency visible. A threshold should present the procedure through which a result has been classified, so that the model’s nature is not made to appear self-evident through the category.
Residual risk should remain inside the claim, not appear as a final softening phrase after confidence has already been produced. A document rarely supports the proposition that risk has disappeared. It supports a more exact statement: a risk has been reduced, bounded, monitored, accepted under a policy, or placed within a safeguard regime. The GPT-4o voice material, the Claude Sonnet 4.6 ASL-3 determination, and the DeepMind defence-in-depth argument each show that the public claim depends on what remains after evaluation and mitigation. Residual risk is therefore not an ornamental caveat. It is one of the places where the document tells the reader how much of the original risk has survived the procedure that claims to manage it.
Tables, charts, and scorecards have to be treated as argumentative surfaces. They do not simply display evidence. They help produce the authority of the public claim. A chart can organise comparison more efficiently than prose, while also lending a sense of settlement to evidence that remains conditional. A scorecard can make release-facing information legible, while also making the label more memorable than the qualifications beneath it. A table can clarify the effect of a safeguard, but only if the conditions, baselines, units, attempt budgets, and success criteria remain visible enough for the reader to understand what the numbers are permitted to mean.
The distinction between public classification and underlying measurement is the hinge of the article. GPT-4o’s persuasion section measures opinion shifts under specified modalities and comparators, then classifies the persuasive capability as medium under OpenAI’s threshold structure. Claude Sonnet 4.6’s biological-risk material reports automated evaluations and comparative uplift, then records an ASL-3 safeguard determination for the CBRN domain. DeepMind’s prompt-injection paper reports attack-success behaviour under different adversarial conditions, then draws a methodological lesson about adaptive testing and false reassurance. In each case, the public claim is not identical with the measurement. It is a claim made from measurement, through a procedure, toward a reader.
The practical consequence is that evaluation communication should become more explicit about its own grammar. A strong system card or safety report should allow the reader to reconstruct the path from task to result, from result to threshold, from threshold to safeguard, from safeguard to residual risk, and from residual risk to governance judgement. It does not need to expose every internal deliberation to do this. It must, however, keep the main joints of the argument visible. Where those joints disappear, the document may still be informative, but its public claim becomes harder to inspect.
This has direct significance for external artifacts and AI safety documentation work. The problem is not simply to make technical material easier to read. It is to prevent readability from becoming a form of overconfidence. Good documentation should help a non-specialist reader understand the claim without making the claim stronger, smoother, or more general than the evidence allows. It should also help a technical reader see where the public language has remained faithful to the underlying evaluation. The work is editorial, but not merely editorial. It is evidentiary, procedural, and institutional. The document has to carry uncertainty without dissolving into vagueness; it has to create public legibility without converting every result into reassurance.
The strongest evaluation communication does not eliminate difficulty. It locates it. It tells the reader what was tested, against what comparator, under which mitigation state, by which threshold rule, with what residual uncertainty, and toward what governance consequence. The result may remain incomplete, contested, or provisional. Frontier-model evaluation will often be all three. A document that preserves the inferential path gives public accountability a surface on which scrutiny can work. Without that surface, safety reporting risks becoming a sequence of impressive results and confident labels. With it, evaluation can become a claim without becoming an overclaim.
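The checklist in the paragraph above can be pictured as a small record whose empty fields mark the joints a document has left out of view. The following is a minimal sketch under stated assumptions: the class name, field names, and example values are all hypothetical and are not drawn from any of the documents discussed.

```python
from dataclasses import dataclass, fields


@dataclass
class ClaimRecord:
    """Hypothetical record of the conditions a public safety claim depends on."""
    evaluated_object: str = ""
    comparator: str = ""
    mitigation_state: str = ""
    threshold_rule: str = ""
    residual_uncertainty: str = ""
    governance_consequence: str = ""


def missing_joints(record: ClaimRecord) -> list[str]:
    """Return the names of conditions the document leaves unstated."""
    return [f.name for f in fields(record) if not getattr(record, f.name).strip()]


# Invented example: a claim that names its object, comparator, and
# mitigation state but is silent on everything downstream.
example = ClaimRecord(
    evaluated_object="hypothetical agent on a browser-use surface",
    comparator="previous model generation",
    mitigation_state="with updated safeguards",
)
print(missing_joints(example))
# prints ['threshold_rule', 'residual_uncertainty', 'governance_consequence']
```

The sketch makes no judgement about whether a claim is true. It records only whether the conditions needed to inspect the claim have been stated, which is the narrower requirement argued for here.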
9. Conclusion: Public Accountability as Disciplined Translation
Evaluation has become one of the principal public languages through which frontier AI systems are made answerable. A system card, safety report, risk framework, or technical paper gives a test a public form. It places a result under a threshold, attaches it to a scorecard, compares it with a baseline, interprets it through a safeguard, carries it into a safety level, or turns it into a warning about the limits of the evaluation itself. The public force of evaluation begins in the movement from measured behaviour to documented judgement.
The examples examined in this article show different versions of that movement. OpenAI’s GPT-4o persuasion section condenses mixed, modality-specific evidence into the public label of a medium-risk persuasive capability. Anthropic’s Claude Sonnet 4.6 prompt-injection tables make visible the difference between model behaviour and safeguarded system behaviour, while its CBRN and ASL material shows how automated evaluations become a formal safety-level determination under the Responsible Scaling Policy. Google DeepMind’s indirect prompt-injection report then moves beneath the surface of the published result, showing how a non-adaptive evaluation could have generated false confidence before the public claim was ever written. Across these cases, the central issue is the documentary discipline with which evaluation is allowed to become a claim.
The strongest safety documents do not eliminate uncertainty, and they do not convert every result into an assurance of control. They preserve the conditions under which a claim can be read. They tell the reader what was evaluated, against which comparator, with which model surface, under which mitigation state, with what adversarial pressure, and through what threshold rule. They show whether a result concerns the base model, a scaffolded system, a product deployment, a safeguarded environment, or a formal governance determination. They prevent a clean table, chart, or scorecard from carrying more authority than the test itself can sustain.
The language of thresholds is especially powerful because it promises order under uncertainty. A threshold allows an organisation to say that a capability has crossed into a more serious category, that a safeguard level has become necessary, that a risk report is required, or that deployment remains permissible under specified conditions. Yet the same threshold can make a close, contested, or framework-dependent judgement appear more settled than it is. The GPT-4o persuasion label shows this pressure clearly. “Medium” travels more readily than “marginally cross.” A public reader may retain the category while losing the boundary condition that gives the category its correct scale.
Better evaluation communication should therefore be understood as a discipline of translation. It translates technical results into public form, while holding open the difference between uncertainty and confidence, mitigation and elimination, threshold crossing and general danger, benchmark performance and deployment safety. Its task is exact: to make the inferential path visible enough that the reader can see where measurement ends and judgement begins. That path will never be complete. Frontier-model reporting will always contain redaction, selective disclosure, evolving methods, and unresolved epistemic difficulty. The question is whether the document makes those limits part of the public claim or hides them beneath fluent prose.
This is the practical centre of work on AI safety documentation, external artifacts, and model-evaluation communication. The document is one of the places where institutional responsibility becomes legible. A well-written safety report protects the relation between evidence and claim. It prevents a number from becoming a guarantee, a chart from becoming reassurance, a mitigation from becoming closure, and a threshold from becoming an unexamined authority.
When an evaluation becomes a claim, the public has a right to ask what has changed in that passage from result to language. The answer need not weaken the claim. Often it will make the claim stronger, because the document has shown its terms. The evaluated task, comparator, threshold, safeguard, residual risk, and governance consequence remain available for scrutiny. That is the minimum condition of public accountability in frontier AI safety reporting: an inspectable relation between the evidence a document presents and the claim it asks the reader to accept.
Notes
[1] OpenAI, GPT-4o System Card (8 August 2024), pp. 1-4.
[2] OpenAI, GPT-4o System Card, pp. 15-16.
[3] Anthropic, System Card: Claude Sonnet 4.6 (17 February 2026), pp. 97-101.
[4] Anthropic, Claude Sonnet 4.6 System Card, pp. 102-104.
[5] Chongyang Shi and others, Lessons from Defending Gemini Against Indirect Prompt Injections (Google DeepMind, 20 May 2025), pp. 3-5, 18-23.
[6] Google DeepMind, Frontier Safety Framework, Version 3.0 (22 September 2025), pp. 2-7; Microsoft, Frontier Governance Framework, Version 1 (February 2025), pp. 2-9.
[7] National Institute of Standards and Technology, Artificial Intelligence Risk Management Framework (AI RMF 1.0), NIST AI 100-1 (January 2023), pp. 5-7.
[8] OpenAI, GPT-4o System Card, p. 16.
[9] Anthropic, Claude Sonnet 4.6 System Card, pp. 98-101.
[10] Shi and others, Lessons from Defending Gemini, pp. 18-23.
[11] OpenAI, Preparedness Framework, Version 2 (15 April 2025), pp. 4-7.
[12] NIST, AI RMF 1.0, p. 8.
[13] Anthropic, Responsible Scaling Policy, Version 3.0 (24 February 2026), pp. 11-12.
[14] OpenAI, Preparedness Framework, pp. 1, 4-5.
[15] OpenAI, Preparedness Framework, pp. 5-7.
[16] OpenAI, Preparedness Framework, pp. 8-9.
[17] OpenAI, Preparedness Framework, p. 8.
[18] OpenAI, Preparedness Framework, p. 8.
[19] Anthropic, Responsible Scaling Policy, pp. 3-6.
[20] Anthropic, Responsible Scaling Policy, p. 4.
[21] Anthropic, Responsible Scaling Policy, pp. 11-12.
[22] OpenAI, GPT-4o System Card, p. 15.
[23] OpenAI, GPT-4o System Card, p. 16.
[24] OpenAI, GPT-4o System Card, p. 16.
[25] OpenAI, GPT-4o System Card, p. 15.
[26] OpenAI, GPT-4o System Card, p. 16.
[27] OpenAI, GPT-4o System Card, p. 16.
[28] OpenAI, GPT-4o System Card, p. 16.
[29] Anthropic, Claude Sonnet 4.6 System Card, p. 97.
[30] Anthropic, Claude Sonnet 4.6 System Card, pp. 97-98.
[31] Anthropic, Claude Sonnet 4.6 System Card, pp. 98-99.
[32] Anthropic, Claude Sonnet 4.6 System Card, pp. 99-100.
[33] Anthropic, Claude Sonnet 4.6 System Card, p. 100.
[34] Anthropic, Claude Sonnet 4.6 System Card, pp. 100-101.
[35] Anthropic, Claude Sonnet 4.6 System Card, p. 102.
[36] Anthropic, Claude Sonnet 4.6 System Card, p. 102.
[37] Anthropic, Claude Sonnet 4.6 System Card, pp. 102-103.
[38] Anthropic, Claude Sonnet 4.6 System Card, pp. 103-104.
[39] Anthropic, Claude Sonnet 4.6 System Card, p. 104.
[40] Anthropic, Claude Sonnet 4.6 System Card, p. 104.
[41] Shi and others, Lessons from Defending Gemini, pp. 3-6.
[42] Shi and others, Lessons from Defending Gemini, pp. 4, 9.
[43] Shi and others, Lessons from Defending Gemini, p. 21.
[44] Shi and others, Lessons from Defending Gemini, pp. 21-22.
Bibliography
Anthropic, Responsible Scaling Policy, Version 3.0 (24 February 2026) https://www.anthropic.com/responsible-scaling-policy/rsp-v3-0
Anthropic, System Card: Claude Sonnet 4.6 (17 February 2026) https://www-cdn.anthropic.com/78073f739564e986ff3e28522761a7a0b4484f84.pdf
Google DeepMind, Frontier Safety Framework, Version 3.0 (22 September 2025) https://storage.googleapis.com/deepmind-media/DeepMind.com/Blog/strengthening-our-frontier-safety-framework/frontier-safety-framework_3.pdf
METR, Common Elements of Frontier AI Safety Policies (16 December 2025) https://metr.org/common-elements
Microsoft, Frontier Governance Framework, Version 1 (February 2025) https://cdn-dynmedia-1.microsoft.com/is/content/microsoftcorp/microsoft/final/en-us/microsoft-brand/documents/Microsoft-Frontier-Governance-Framework.pdf
National Institute of Standards and Technology, Artificial Intelligence Risk Management Framework (AI RMF 1.0), NIST AI 100-1 (January 2023) https://doi.org/10.6028/NIST.AI.100-1
OpenAI, GPT-4o System Card (8 August 2024) https://cdn.openai.com/gpt-4o-system-card.pdf
OpenAI, Preparedness Framework, Version 2 (15 April 2025) https://cdn.openai.com/pdf/18a02b5d-6b67-4cec-ab64-68cdfbddebcd/preparedness-framework-v2.pdf
Shi, Chongyang, Sharon Lin, Shuang Song, Jamie Hayes, Ilia Shumailov, Itay Yona, Juliette Pluto, Aneesh Pappu, Christopher A. Choquette-Choo, Milad Nasr, Chawin Sitawarin, Gena Gibson, Andreas Terzis, and John “Four” Flynn, Lessons from Defending Gemini Against Indirect Prompt Injections (Google DeepMind, 20 May 2025), arXiv:2505.14534 https://arxiv.org/abs/2505.14534