Reading Tables in AI Safety Reports

Evidence, Confidence, and Presentation in Model-Release Documentation


Saman Samadi, PhD (Cantab)


Article / worksheet | 18 March 2026

Focus: Evaluation tables; benchmark interpretation; visual confidence; evidence-to-claim reasoning.


Abstract

Tables in AI safety reports do not merely display results. They organise the reader's confidence in those results. A benchmark table, preparedness scorecard, attack-success chart, or capability-threshold comparison can make a safety claim appear more settled than the underlying evaluation method permits, especially when the table compresses task design, comparator, mitigation state, threshold rule, and residual uncertainty into a visually compact form. This article examines how tables and table-like figures function in frontier AI safety documentation, taking examples from Anthropic, OpenAI, and Google DeepMind.

The first case study reads Anthropic's February 2026 Risk Report, where a table of chemical and biological weapons-related evaluations helps support the distinction between ASL-2 and ASL-3 safeguard classifications while leaving open the difficulty of translating proxy-task performance into real-world threat-actor usefulness. The second case study examines OpenAI's GPT-4o persuasion scorecard and effect-size presentation, where mixed modality-specific findings are gathered under a medium-risk label. The third case study considers Google DeepMind's evaluation of Gemini against indirect prompt injection, where adaptive attacks unsettle the confidence that non-adaptive evaluation might otherwise project. Across these cases, the article argues that strong AI safety reporting depends on making the relation between table, method, claim, and uncertainty inspectable. A table becomes more accountable when the reader can see what it measures, what it compares, what decision it supports, what it does not show, and what uncertainty remains after its numbers have been arranged.

Keywords: AI safety documentation; evaluation tables; system cards; benchmark reporting; residual risk; visual confidence; model evaluations; safety claims; frontier AI governance.

1. Introduction: When a Table Becomes a Claim

Frontier AI safety documents increasingly ask tables to do work that prose alone cannot easily perform. A table can place systems beside one another, compress many evaluations into a single visible surface, assign a model to a risk category, distinguish pre-mitigation from post-mitigation behaviour, or display an attack-success rate before and after a defence has been applied. The table promises order. It tells the reader that a difficult judgement has been made measurable, arranged, and given a public form. Yet the same order can produce an excess of confidence. Once numbers have been aligned in rows and columns, a safety claim may appear more stable than the evaluation that supports it.

This problem is not unique to AI. Tables have long carried institutional authority because they transform dispersed observation into comparable form. In AI safety documentation, however, the relation between table and claim is unusually charged. The documents at issue are not ordinary technical appendices. They belong to release decisions, risk frameworks, system cards, model cards, preparedness reports, safety cases, and public-facing governance materials. A score can influence whether a model is described as low, medium, high, or critical risk. A pass rate can support a claim about dangerous capability. A benchmark result can be attached to a mitigation story. An attack-success rate can shape the public impression of whether a safeguard has held. In these documents, the table does not sit beneath the claim as neutral evidence. It helps determine the form in which the claim can be made.

The central difficulty appears when the table's visual clarity exceeds its evidentiary clarity. A table may show that one model scored higher than another, but not whether the test captures the capability that matters for real-world misuse. It may show that an attack succeeds less often after a defence has been introduced, but not whether the attacker was allowed to adapt. It may show that a risk category has been assigned, but not how much judgement entered the movement from result to category. It may show a confidence interval, while leaving the reader uncertain about the assumptions behind the underlying sample, task, rubric, or elicitation method. The numbers are visible; the relation between numbers and public claim may still remain difficult to reconstruct.

This article therefore treats tables in AI safety reports as documentary devices. They are neither ornamental visual aids nor self-sufficient evidence. They are arrangements through which evidence becomes claim-bearing. A table selects what counts as comparable, what remains outside the comparison, what metric appears as decisive, what uncertainty is marked, and what later sentence can be written with greater public authority. When a system card or risk report states that a model remains below a threshold, requires enhanced safeguards, has a medium-risk persuasion profile, or appears robust against a particular class of attack, the table often provides the visible surface on which that statement rests.

The first case study comes from Anthropic's Risk Report: February 2026. In the section on non-novel chemical and biological weapons production, Anthropic presents Table 4.4.A, a summary of chemical and biological weapons capability evaluations for Claude Opus 4.6 and Claude Sonnet 4. The table includes bioweapons acquisition uplift trials, expert red-teaming, long-form virology tasks, multimodal virology, and LAB-Bench subsets.[1] It is followed by a statement that, since the launch of Claude Opus 4, Anthropic has required ASL-3 safeguards for models with similar or greater general capabilities, while also reporting significant uncertainty about the level of real-world risk posed by Opus 4 and more advanced models. The table therefore supports a safeguard classification, but it does not close the risk question. Anthropic itself names the remaining difficulty: threat actors vary in skill; experts disagree about the importance of tacit knowledge; and the company must test models on representative proxy tasks rather than actual dangerous tasks.[2] The table is useful precisely because it shows a decision being made under conditions where direct measurement would be unsafe, partial, or impossible.

The second case study turns to OpenAI's GPT-4o System Card. Its preparedness scorecard classifies GPT-4o as medium risk overall because persuasion is scored at the medium level while cybersecurity, biological threats, and model autonomy remain low.[3] The persuasion section then states that GPT-4o's persuasive capabilities marginally cross into OpenAI's medium-risk threshold from low risk, and presents immediate and one-week effect-size values for different intervention types.[4] The public difficulty lies in the relation between the label and the underlying results. OpenAI distinguishes text and voice modalities; it reports that voice did not meaningfully increase preparedness risk; it states that AI text interventions were not more persuasive than professional human-written content in aggregate, while exceeding human interventions in three out of twelve instances.[5] The table-like presentation gives the medium-risk label a public surface, but the underlying evidence is mixed, modality-specific, and temporally qualified. The scorecard stabilises the claim; close reading has to recover the conditional structure that the visual label can compress.

The third case study examines Google DeepMind's report on defending Gemini against indirect prompt injections. Its most important contribution for this article is methodological rather than merely numerical. The report compares non-adaptive and adaptive attacks against different defences and states that adaptive attacks equalled or outperformed non-adaptive counterparts in sixteen out of twenty-four defence-attack combinations.[6] This alters the status of the table. A defence may appear stronger when tested against static or transferred attacks, yet lose part of that apparent strength when the adversary adapts to the defence itself. Here the table does not only report performance; it exposes a possible weakness in the evaluation regime. The same defence, under a different adversarial assumption, may carry a different public meaning.

The three case studies differ in domain, structure, and audience. Anthropic's table supports safeguard classification in a chemical and biological risk context. OpenAI's scorecard and effect-size display organise a public persuasion-risk label. Google DeepMind's adaptive-attack results change how a reader should understand robustness claims. Yet they share a common documentary problem: the movement from measured result to public safety claim is never automatic. It depends on the evaluated object, the metric, the comparator, the threshold, the mitigation state, the adversarial condition, the uncertainty statement, and the decision the document asks the result to support.

The article's method follows from that problem. Each table is read through seven questions. What exactly is being evaluated? What metric gives the result form? What comparator makes the number meaningful? What threshold or category gives the result consequence? What mitigation state is being measured? What uncertainty remains visible? What public claim does the table enable, and where would that claim overreach? These questions do not produce a complete technical audit. They produce a disciplined reading practice for public AI safety documentation.

The stakes are practical. External-artifacts teams, system-card writers, research-operations staff, governance analysts, safety-evaluation coordinators, and technical communicators all work near this threshold. They may not design the evaluation, but they often help decide how evaluation results are written, summarised, visualised, caveated, and carried into release documentation. A poorly framed table can make weak evidence look decisive. A strong table can preserve uncertainty without disabling judgement. The difference lies not only in design, but in claim discipline: the reader must be able to see what the numbers support, what they leave untouched, and what kind of institutional decision has been attached to them.

Tables in AI safety reports therefore need to be read as public forms of evidentiary translation. Their authority does not lie in numerical presentation alone. It lies in the traceable relation between measurement and claim. Where that relation is visible, tables can strengthen accountability. Where it disappears, visual confidence begins to outrun evidentiary measure.

2. How to Read an AI Safety Table

An AI safety table should first be read as a claim-making arrangement rather than as a neutral display of results. This does not mean that the table is manipulative, or that the numbers are secondary to rhetoric. It means that the table gives evidence a public order. It selects what appears beside what, which quantity becomes comparable, which threshold becomes visible, which uncertainty is preserved, and which later sentence the document can write with greater authority. Before the reader asks whether the claim is persuasive, the more basic task is to ask what kind of claim the table has been built to support.

The first question is the evaluated object. A table may appear to compare models, but its real object may be a model variant, a deployed system, a scaffolded agent, a pre-mitigation configuration, a post-mitigation product surface, or a special evaluation version with refusals removed. This distinction can change the force of the result. A score obtained from a helpful-only model, a model with safeguards disabled, or a model placed inside a particular tool environment does not have the same public meaning as a score obtained from an ordinary user-facing deployment. The table becomes weaker when the object of evaluation is visually clear only as a model name, while the surrounding conditions that produced the result remain difficult to locate.

The second question is the metric. Tables in AI safety reports often gather unlike measures under a single visual form: uplift ratios, attack-success rates, pass@k scores, human baselines, confidence intervals, qualitative red-team summaries, threshold labels, or binary risk categories. Each metric supports a different kind of claim. An uplift ratio can indicate that AI assistance improved performance relative to a control group; it does not by itself show that a threat actor could complete an end-to-end harmful project. An attack-success rate can show how often a given attack succeeded under specified conditions; it does not show robustness against attacks that were not attempted. A pass@5 score can show that success appeared within repeated attempts; it does not carry the same meaning as first-attempt reliability. The metric is not merely a number. It is the rule through which model behaviour becomes countable.
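As a reading aid, the three numerical metrics named above can be written in a generic form. The symbols are illustrative assumptions, not notation taken from any of the reports discussed here, and the pass@k line in particular assumes independent attempts with a fixed per-attempt success probability p, which a given evaluation may not satisfy.

```latex
\begin{align*}
\text{uplift} &= \frac{\bar{s}_{\text{assisted}}}{\bar{s}_{\text{control}}} \\
\text{ASR} &= \frac{\#\{\text{successful attacks}\}}{\#\{\text{attempted attacks}\}} \\
\text{pass@}k &= 1 - (1 - p)^{k}
\end{align*}
```

On this simplified reading, pass@1 equals p, while pass@5 can be considerably larger than p. That is why a pass@5 figure and a first-attempt reliability figure should not be placed on the same scale without comment.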

The third question is the comparator. A number becomes meaningful through relation. The table may compare a model against an earlier model, a human baseline, a control group, an ASL threshold, a non-adaptive attack, a defended system, or an undefended condition. Without the comparator, the result loses part of its public meaning. A score of 0.89, a 2.53x uplift, a medium-risk label, or a 16-out-of-24 comparison only becomes interpretable once the reader knows what has been placed beside it. The comparator also shapes the rhetorical pressure of the table. Comparison with a weaker model may suggest progress; comparison with a human expert may suggest capability; comparison with a threshold may imply governance consequence; comparison with an adaptive adversary may unsettle a robustness claim that appeared stable under a less demanding evaluation.

The fourth question is the threshold or decision rule. Many AI safety tables are not merely descriptive. They are attached to a decision: whether a model requires stronger safeguards, whether a preparedness category has been crossed, whether a deployment condition has been satisfied, whether a system remains below a critical capability level, or whether a defence can be treated as adequate. A threshold turns a result into a hinge. The table no longer says only "this was measured." It begins to say "this measurement authorises, blocks, delays, or conditions an action." The public reader should therefore ask whether the threshold is visible, whether it is defined outside the table, whether it is quantitative, qualitative, or judgement-based, and whether the table gives enough information to understand how the result approached or crossed it.

The fifth question is mitigation state. AI safety documentation often moves between pre-mitigation and post-mitigation conditions, and this movement is easy to lose when a table is compressed. A pre-mitigation result may describe a model's capability or harmful output tendency before safeguards are applied. A post-mitigation result may describe the system after classifiers, policy restrictions, refusal training, monitoring, access controls, or other interventions have entered the evaluation. The difference is not merely technical. It determines where the safety claim belongs. If the table presents only the mitigated state, the reader may attribute safety to the model rather than to the surrounding deployment arrangement. If the table presents both states, it can show how much work the mitigation is doing and what risk remains after it.
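The point about "how much work the mitigation is doing" can be made concrete with a small sketch. The class name, field names, and the rates below are hypothetical and chosen only for illustration; they are not taken from any report discussed in this article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MitigationReading:
    """Pre- and post-mitigation rates for one row of a hypothetical table."""

    pre_rate: float   # harmful-output or attack-success rate before safeguards
    post_rate: float  # the same rate measured after safeguards are applied

    def absolute_reduction(self) -> float:
        """Percentage-point drop attributable to the mitigation."""
        return self.pre_rate - self.post_rate

    def relative_reduction(self) -> float:
        """Share of the pre-mitigation rate that the mitigation removed."""
        if self.pre_rate == 0:
            raise ValueError("relative reduction is undefined when pre_rate is 0")
        return self.absolute_reduction() / self.pre_rate

    def residual(self) -> float:
        """Rate that remains after mitigation."""
        return self.post_rate


# Hypothetical values, for illustration only.
row = MitigationReading(pre_rate=0.40, post_rate=0.05)
print(f"absolute reduction: {row.absolute_reduction():.2f}")
print(f"relative reduction: {row.relative_reduction():.1%}")
print(f"residual rate:      {row.residual():.2f}")
```

A table that reports only the residual rate hides the first two quantities, and with them the extent to which the safety claim rests on the deployment arrangement rather than on the model.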

The sixth question is uncertainty. Some uncertainty appears numerically, through confidence intervals, sample sizes, error bars, repeated trials, or distributions. Some appears verbally, through caveats about proxy tasks, evaluation realism, threat-actor assumptions, red-team coverage, model elicitation, benchmark contamination, or the limits of current measurement. Some uncertainty does not appear at all, even when the table plainly depends on assumptions that should have been marked. Strong documentation does not remove uncertainty from the table in order to make the result look cleaner. It attaches uncertainty to the point at which the reader might otherwise overgeneralise. NIST's AI Risk Management Framework is useful here because it warns that AI measurement approaches can be oversimplified, gamed, used outside their intended contexts, or deprived of necessary nuance.[7] A table that carries no uncertainty may be visually efficient, but its efficiency can become a documentary weakness when the public claim depends on conditions the table does not disclose.
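One way to see what numerical uncertainty adds to a table cell is to compare two cells that show the same rate but rest on different sample sizes. The sketch below uses the Wilson score interval for a proportion, a standard textbook method; it is offered as an illustration, not as the method used in any of the reports discussed here, and the counts are hypothetical.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("need trials > 0 and 0 <= successes <= trials")
    p_hat = successes / trials
    denom = 1 + z**2 / trials
    centre = (p_hat + z**2 / (2 * trials)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / trials + z**2 / (4 * trials**2)) / denom
    return centre - half, centre + half


# Two hypothetical cells that would both be printed as "15%".
small = wilson_interval(successes=3, trials=20)
large = wilson_interval(successes=150, trials=1000)

print(f"3/20:     {small[0]:.3f} to {small[1]:.3f}")
print(f"150/1000: {large[0]:.3f} to {large[1]:.3f}")
```

The two cells look identical when only the percentage is printed. The interval shows that the smaller sample is compatible with a much wider range of underlying rates, which is exactly the information a compact table tends to drop.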

The seventh question is the public claim the table enables. A table rarely remains alone. It is followed, preceded, or surrounded by prose that tells the reader what to make of it. The prose may say that a model crosses a threshold, that a risk remains low, that a safeguard is effective, that a deployment is acceptable, that a capability has not yet reached a concerning level, or that further monitoring is needed. The table's accountability value depends on whether that prose remains proportionate to the evidence. A table can support a narrow claim while being made to carry a broad one. It can justify a comparison while being asked to imply a real-world safety conclusion. It can show a result under a fixed test condition while being used to suggest general robustness. The most important reading question is therefore not "what does the table say?" but "what sentence does the document write after arranging the table in this way?"

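These seven questions form the reading method used in the case studies that follow: evaluated object, metric, comparator, threshold or decision rule, mitigation state, uncertainty, and public claim.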

The order of these questions matters less than their combined pressure. They prevent the reader from treating numerical display as a substitute for analysis. They also prevent the opposite mistake: dismissing tables because they do not settle everything. A table does not need to carry the whole safety case in order to be valuable. It needs to carry its own part of the case with enough clarity that the reader can understand the relation between evidence, decision, and uncertainty.
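For readers who prefer a worksheet form, the seven questions can be held as a simple record that reports which questions a table has left unanswered. This is a minimal sketch: the class name, field names, and the example entries are hypothetical and serve only to illustrate the reading method described above.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class TableReading:
    """One reader's notes on a single safety table, one field per question."""

    evaluated_object: Optional[str] = None   # what exactly is being evaluated?
    metric: Optional[str] = None             # what metric gives the result form?
    comparator: Optional[str] = None         # what makes the number meaningful?
    threshold_or_rule: Optional[str] = None  # what gives the result consequence?
    mitigation_state: Optional[str] = None   # pre- or post-mitigation?
    uncertainty: Optional[str] = None        # what uncertainty remains visible?
    public_claim: Optional[str] = None       # what claim does the table enable?

    def unanswered(self) -> list[str]:
        """Names of the questions the reader could not answer from the document."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]


# Hypothetical partial reading of an unnamed table.
reading = TableReading(
    evaluated_object="model variant with refusals removed",
    metric="attack-success rate",
    comparator="undefended baseline",
)
print(reading.unanswered())
```

The list of unanswered fields is not a verdict on the table. It marks where the reader would have to go back to the surrounding prose, or where the document has left the relation between evidence and claim unstated.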

This method also distinguishes table literacy from mathematical sophistication alone. The reader does not always need advanced statistical expertise to ask whether a table identifies its model configuration, defines its metric, names its baseline, explains its threshold, distinguishes pre- and post-mitigation conditions, and preserves uncertainty at the point of public claim. These are documentary questions before they are mathematical ones. They concern how evidence is made legible.

The distinction is important for AI safety documentation roles. External-artifacts writers, research-operations staff, governance analysts, and safety-communications specialists may not design every evaluation, but they often participate in the public life of the result. They decide how an evaluation table is introduced, how much context is placed around it, whether a caveat is attached to the relevant sentence, whether a visual label has become too strong for the evidence, and whether an internal judgement has been made too fluent for an external reader to inspect. Table reading is therefore not peripheral to documentation. It is one of the places where public accountability is either strengthened or weakened at the level of form.

NIST AI 800-2, though concerned specifically with automated benchmark evaluations, gives this reading practice a useful standards-facing vocabulary. Its draft guidance is organised around the need for evaluation practices that support validity, transparency, and reproducibility, and it addresses language models and agent systems rather than treating benchmark results as self-explanatory outputs.[8] Those terms should not be added mechanically to every system card or safety report. Yet they clarify the documentary burden. Validity asks whether the evaluation measures what the later claim needs it to measure. Transparency asks whether enough of the evaluation procedure is visible for the reader to understand the result. Reproducibility asks whether the method and conditions are specified enough for the result to be checked, repeated, or compared. These are not decorative standards words. They name the conditions under which a table can become trustworthy public evidence.

The three case studies now apply this method at different scales. Anthropic's table raises the problem of moving from proxy capability evaluations to safeguard classification under uncertainty. OpenAI's persuasion scorecard raises the problem of gathering mixed modality-specific findings under a compact risk label. Google DeepMind's adaptive prompt-injection results raise the problem of evaluation realism, since the confidence produced by non-adaptive testing changes once the adversary is allowed to respond. In each case, the table is useful. In each case, it also has limits. The task is to read both at once.

3. Anthropic: Capability Tables and Safeguard Classification

Anthropic's Risk Report: February 2026 offers a useful first case because its table is not merely reporting model performance. It is helping to organise a safeguard classification. In section 4, Anthropic discusses non-novel chemical and biological weapons production as a priority threat model, defined around the possibility that individuals or small groups with limited resources might use AI systems to gain access to chemical or biological weapons. The relevant model comparison is then arranged around two protection levels: Claude Opus 4.6, described as the most capable model subject to ASL-3 protections, and Claude Sonnet 4, described as the most capable model subject only to ASL-2 protections.[9] The table that follows therefore sits at a threshold. It does not simply ask which model performs better. It asks what kind of evidence helps justify the movement from baseline safeguards to enhanced safeguards.

The table's title is modest: "Summary of results from evaluations measuring the chemical and biological weapons capabilities of Claude Opus 4.6 and Claude Sonnet 4." Yet its documentary work is larger than summary. It places several forms of evidence into a single comparative surface: a bioweapons acquisition uplift trial, expert red-teaming, long-form virology tasks, multimodal virology knowledge testing, and a subset of LAB-Bench tasks. The rows differ sharply in method, metric, cost, realism, and claim-making force. Some results are numerical. Some are qualitative. Some compare model-assisted performance with a control group. Some compare model performance with human baselines. Some report threshold-relevant values. Some indicate trends rather than final decision rules. The table therefore gathers evidence before the evidence has become fully commensurable.

That lack of commensurability is not a defect by itself. In frontier AI safety documentation, especially where dangerous capabilities are at issue, a single decisive evaluation is rarely available. The stronger question is whether the document lets the reader see the difference among evidentiary forms. Anthropic's table does this in part by naming the evidence type and giving short descriptions of each evaluation. A controlled uplift trial is not treated as the same kind of evidence as a long-form virology task; expert red-teaming is not reduced to the same numerical language as a multiple-choice multimodal evaluation. The document allows different forms of evidence to remain visible. At the same time, their placement inside one table gives them a shared visual authority. The reader has to resist the temptation to treat the table as one unified scale.

The evaluated object is already complicated. The column headings appear to compare Claude Opus 4.6 under ASL-3 with Claude Sonnet 4 under ASL-2. Yet Anthropic states before the table that, if another model scored higher on an evaluation, or if the evaluation was not conducted for these models, it reports the highest score for any other model that was evaluated.[10] This is an important disclosure. It makes the table more conservative from a risk-monitoring perspective, because the table is not restricted to a perfectly symmetrical pairwise comparison. It asks what the strongest relevant result is within the model group being discussed. But it also means that the table cannot be read as a clean head-to-head comparison between two fixed systems. Some cells report results for Claude Opus 4 rather than Claude Opus 4.6; some report results for other Claude models where they scored higher on a particular evaluation. The table is therefore not simply a model ranking. It is a risk-relevant evidence assembly.

That distinction matters for public interpretation. A reader who sees the column label "Claude Opus 4.6 performance" may expect every number in that column to come from the same model under the same evaluation conditions. Anthropic's explanatory note prevents that expectation from hardening, but only if the reader attends to it. The table then becomes a test of documentation discipline. It is transparent enough to disclose its substitution rule, but visually compact enough that the substitution rule can easily recede behind the more immediate contrast between ASL-3 and ASL-2 columns. A strong reading must keep the table's caveat active throughout the analysis.

The bioweapons acquisition uplift trial illustrates the problem. The row describes a controlled trial in which participants had two days to draft a comprehensive plan, with a control group allowed to use the internet but not Anthropic's models, and a test group given access to Claude with safeguards removed. Outputs were graded by Deloitte using a detailed rubric. The reported Claude Opus 4 result is a score of 63% +/- 13%, with a 2.53x uplift against a 25% control, while the ASL-3 threshold is given as 2.8x uplift. The table also states that all participants hit critical failures. For Claude Sonnet 4, the table reports a lower score and uplift.[11]

Several claims are possible here, and they are not the same. The result supports the claim that model access, under the tested conditions, produced substantial uplift relative to the control group. It also supports the narrower claim that the reported Claude Opus 4 result approached, but did not reach, the stated ASL-3 threshold on that particular uplift metric. It does not by itself support the broader claim that the model would enable real-world completion of a dangerous project. The "critical failures" language matters because it marks a remaining practical barrier within the evaluation itself. The table gives reason for concern, but it also preserves a gap between improved planning performance and successful real-world execution.
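The arithmetic behind the "approached, but did not reach" reading can be checked directly from the figures quoted above. The sketch below makes two assumptions that should be read as such. First, it works from the rounded values as quoted, which give 2.52x rather than the reported 2.53x; the difference is presumably a rounding effect, though the report's underlying values are not reproduced here. Second, it treats the reported +/- 13% as a plain range on the assisted-group score with the control held fixed, which is a deliberately naive sensitivity check and not a reconstruction of the report's statistical method.

```python
def uplift_ratio(assisted_score: float, control_score: float) -> float:
    """Ratio of the assisted group's score to the control group's score."""
    if control_score <= 0:
        raise ValueError("control_score must be positive")
    return assisted_score / control_score


# Figures as quoted in the text above (rounded as reported).
ASSISTED = 0.63    # assisted-group score
CONTROL = 0.25     # control-group score
HALF_WIDTH = 0.13  # the reported +/- on the assisted-group score
THRESHOLD = 2.8    # uplift threshold given in the table

point = uplift_ratio(ASSISTED, CONTROL)
# Assumption: +/- is read as a plain range on the assisted score only.
low = uplift_ratio(ASSISTED - HALF_WIDTH, CONTROL)
high = uplift_ratio(ASSISTED + HALF_WIDTH, CONTROL)

print(f"point uplift: {point:.2f}x (threshold {THRESHOLD}x)")
print(f"naive range:  {low:.2f}x to {high:.2f}x")
print("point estimate below threshold:", point < THRESHOLD)
print("naive range reaches threshold: ", high >= THRESHOLD)
```

Under these assumptions the point estimate sits below the threshold while the naive range extends past it. The sketch does not show that the threshold was or was not crossed; it shows that the distance between result and threshold depends on how the stated uncertainty is handled, which is a question the table alone does not answer.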

This gap becomes the centre of the section. Anthropic's table contains a row on long-form virology tasks developed with SecureBio, Deloitte, and Signature Science. These tasks test end-to-end completion of pathogen acquisition processes through workflow design and laboratory protocols. One reported value, Virology Task 2 for the ASL-3-side column, is 0.912 pass@5, above an ASL-3 threshold of 0.8. Claude Sonnet 4's corresponding value is lower.[12] This row has a different evidentiary character from the uplift trial. It is more task-based and threshold-proximate, and the threshold language makes its governance consequence easier to see. Yet pass@5 is not first-attempt reliability. It indicates that success appeared at least once within five attempts under evaluation conditions. Its public meaning depends on the task design, the number of attempts allowed, the grading rule, and the degree to which the task stands in for the real-world pathway of concern.
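The difference between a pass@5 figure and per-attempt reliability can be illustrated by asking what single-attempt success rate would produce the reported 0.912. The sketch assumes that attempts are independent and share one success probability, and that pass@5 means at least one success in five attempts. Neither assumption is confirmed by the report as quoted here, so the result is an illustration of scale, not an estimate of the model's actual first-attempt performance.

```python
def pass_at_k(per_attempt: float, k: int) -> float:
    """Probability of at least one success in k independent attempts."""
    if not 0.0 <= per_attempt <= 1.0 or k < 1:
        raise ValueError("need 0 <= per_attempt <= 1 and k >= 1")
    return 1.0 - (1.0 - per_attempt) ** k


def implied_per_attempt(pass_k: float, k: int) -> float:
    """Invert pass_at_k: the per-attempt rate consistent with a given pass@k."""
    if not 0.0 <= pass_k <= 1.0 or k < 1:
        raise ValueError("need 0 <= pass_k <= 1 and k >= 1")
    return 1.0 - (1.0 - pass_k) ** (1.0 / k)


REPORTED_PASS_AT_5 = 0.912  # value quoted in the text above

p = implied_per_attempt(REPORTED_PASS_AT_5, k=5)
print(f"implied per-attempt rate under independence: {p:.3f}")
print(f"round trip back to pass@5: {pass_at_k(p, 5):.3f}")
```

Under the independence assumption, a pass@5 above 0.9 is compatible with a per-attempt success rate well below one half. The threshold comparison in the table is made on the pass@5 scale, so the reader should keep the claim on that scale as well.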

The table also includes rows that are more trend-indicative than decision-complete. The multimodal virology row reports a multiple-choice evaluation from SecureBio with a human expert baseline and states that all models exceed that baseline, while also saying that the figure is reported to measure trend. LAB-Bench is likewise broken into subsets, with Claude Opus-side results exceeding human baselines across four tasks while the Sonnet-side results vary by task. These rows may help establish a general capability picture, especially around biological knowledge and scientific task performance. But they are weaker if asked to support a direct claim about operational misuse. A model can exceed a human baseline on a knowledge task without thereby possessing the integrated practical competence, judgement, materials access, and sustained project capacity that a real threat pathway would require.

Anthropic's strongest documentary move comes after the table. It states that, since the launch of Claude Opus 4, it has required ASL-3 safeguards for models with similar or greater general capabilities than Opus 4. It then explains that capabilities are assessed by comparing model performance on automated evaluations, while more expensive uplift trials are reserved for below-frontier models requiring more thorough assessment around the ASL-2 / ASL-3 boundary.[13] This paragraph tells the reader what the table is being asked to do. The table does not need to prove real-world catastrophe in order to matter. It contributes to a classification rule: sufficiently capable frontier models are placed under stronger safeguards.

The following paragraph prevents that classification rule from becoming false certainty. Anthropic says it still has significant uncertainty about the level of risk actually posed by Opus 4 and more advanced models. Strong performance on concrete evaluations is difficult to translate into a judgement about how helpful Claude would be to a real threat actor attempting a complex project over months. The report gives three reasons: uncertainty about threat-actor characteristics and skill levels, disagreement among experts about tacit knowledge, and the need to test representative proxy tasks rather than actual dangerous tasks.[14] This is exactly the kind of caveat that should sit near a threshold table. It tells the reader that the table supports a safeguard decision, not a complete risk proof.

The ordering is important. Anthropic does not place the uncertainty in a distant limitations appendix. It appears directly after the table and after the ASL-3 classification rule. That proximity is a strength. It helps keep the table's visual authority under discipline. The reader sees the evaluation evidence, then the safeguard rule, then the uncertainty around real-world translation. The document thereby avoids a common weakness in safety reporting: allowing a table to generate confidence while placing the caveats too far away to affect the claim being made.

Still, the table also shows how easily visual confidence can outrun evidentiary measure. The ASL-3 and ASL-2 labels in the column headings make the comparison feel clean. The rows appear to sort evidence into two protection regimes. The threshold values give the impression of procedural exactness. Yet the underlying evidence is irregular. Some results come from predecessor or neighbouring models; some are qualitative; some come from models with safeguards removed; some report trends; some concern knowledge; some concern task performance; some are closer to the ASL decision than others. The visual grammar of the table is more orderly than the evidentiary terrain it summarises.

This does not weaken the table's usefulness. It defines the reader's task. The table should be read not as a mathematical proof of ASL-3 necessity, but as a structured public display of the evidence that makes ASL-3 a reasonable safeguard classification under uncertainty. The distinction is crucial. A mathematical proof would require cleaner comparability, direct measurement of the target harm, and a more determinate relation between metric and real-world outcome. A safeguard classification can be justified under weaker conditions when the domain is high-stakes, the evidence suggests meaningful capability, and direct testing of the dangerous task is impossible or unacceptable.

The mitigation state also needs careful handling. The bioweapons acquisition uplift trial row states that the test group had access to Claude with safeguards removed. This means that the result is not a measure of the ordinary deployed product after ASL-3 safeguards have been applied. It is closer to a dangerous-capability measurement under conditions designed to reveal what the model could provide if protective restrictions were absent or bypassed. That makes the result relevant to safeguard classification, but it also limits what can be inferred about ordinary product risk. The table belongs to the "current state of model capabilities" section, not to the later "risk mitigations" or "overall assessment of risk" sections. Its evidence helps explain why safeguards are needed; it does not by itself show how effective those safeguards are.

Anthropic's own document structure supports this reading. After section 4.4, the report moves into ASL-2 and ASL-3 risk mitigations, including acceptable use policies, harmlessness training, real-time classifier guards, offline monitoring, access controls, bug bounty work, threat intelligence, rapid response, and model-weight security. Later, in section 4.6, Anthropic assesses the contribution to catastrophic risk as very low but not negligible, and distinguishes risk from models kept under ASL-2 protections from risk from models under ASL-3 protections.[15] The table is therefore one part of a longer risk argument. It is a capability table inside a mitigation architecture.

What, then, can the table responsibly support? It can support the claim that Anthropic has multiple evaluation signals indicating substantially higher chemical and biological capability for models in the ASL-3 category than for models kept under ASL-2. It can support the claim that some evaluations approach or cross threshold-relevant values. It can support the governance claim that models at or above the Claude Opus 4 capability level warrant stronger safeguards. It can support the documentation claim that capability evidence, threshold language, and safeguard classification are being connected publicly rather than hidden entirely inside the organisation.

What should the table not be asked to support? It should not be asked to prove that a particular model would enable a real-world catastrophic chemical or biological weapons event. It should not be asked to establish that ASL-3 safeguards eliminate misuse risk. It should not be read as a simple head-to-head comparison between two fixed models across identical evaluation conditions. It should not be treated as a complete account of threat-actor capability, tacit knowledge, access to materials, operational competence, or sustained project execution. Anthropic's uncertainty paragraph says as much, and the article's table-reading method keeps that uncertainty attached to the numbers.

The table is strongest when read as a public classification instrument. It makes a difficult internal judgement partially inspectable. The reader can see the evidence types Anthropic considered, the threshold-relevant values it chose to display, the distinction between ASL-2 and ASL-3 model groups, and the point at which uncertainty enters the claim. It does not make the whole decision transparent. It cannot. But it gives the decision a surface that can be questioned.

The lesson from this case is that safety tables often work best when they do not pretend to finish the safety argument. Anthropic's table becomes useful because it is followed by prose that limits what the table can mean. The table arranges capability evidence; the surrounding text assigns that evidence to a safeguard decision; the uncertainty paragraph keeps the decision from hardening into a stronger claim than the evidence can bear. That relation among table, threshold, safeguard, and caveat is the real object of the analysis.

4. OpenAI: Persuasion Scores and the Stabilisation of a Medium-Risk Label

OpenAI's GPT-4o System Card gives the second case a different shape. Anthropic's table gathered several chemical and biological capability evaluations in order to support a safeguard classification under uncertainty. OpenAI's persuasion material works through a more compact scorecard logic. The public reader first encounters a Preparedness Framework scorecard in which cybersecurity, biological threats, and model autonomy are marked low, while persuasion is marked medium.[16] The resulting visual field is simple: three low categories, one medium category, and an overall model risk that becomes medium because the highest relevant category carries that label. The table-like form is easy to read. Its difficulty lies in the relation between the label and the mixed evidence that follows.

The scorecard is not only a summary of findings. It is a deployment-facing instrument. OpenAI states near the scorecard that only models with a post-mitigation score of medium or below can be deployed, and that only models with a post-mitigation score of high or below can be developed further.[17] This matters because the scorecard is attached to action. A low, medium, high, or critical label does not merely describe a domain of concern; it participates in the governance grammar of release. The table gives the model a place inside OpenAI's Preparedness Framework, and that place carries consequences for deployment, mitigation, and public interpretation.

The public claim becomes especially delicate because persuasion is the only category that moves the model out of an all-low presentation. OpenAI states in the main system-card introduction that GPT-4o's voice modality does not meaningfully increase Preparedness risks, that three of the four Preparedness Framework categories scored low, and that persuasion scored borderline medium.[18] Later, in the preparedness section, the Safety Advisory Group is described as having recommended classifying GPT-4o before mitigations as borderline medium risk for persuasion and low risk in the other categories. The overall risk score is therefore medium because the Preparedness Framework determines overall risk by the highest category score.[19] This is a clear decision rule, and it gives the scorecard its public force.
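The decision rule is simple enough to state as a short sketch. The Python below is an illustration only, not OpenAI's implementation: the names RiskLevel, overall_risk, can_deploy, and can_develop_further are hypothetical, and the category labels are those of the scorecard as described above. The sketch does not settle whether a displayed label is a pre-mitigation or post-mitigation score; the two gate functions simply encode the deployment and development rules as the card states them.

```python
from enum import IntEnum


class RiskLevel(IntEnum):
    """Ordered risk labels, so that comparison and max() are meaningful."""
    LOW = 1
    MEDIUM = 2
    HIGH = 3
    CRITICAL = 4


def overall_risk(scores: dict) -> RiskLevel:
    """Overall risk is the highest score in any single category."""
    return max(scores.values())


def can_deploy(post_mitigation: RiskLevel) -> bool:
    """Deployment requires a post-mitigation score of medium or below."""
    return post_mitigation <= RiskLevel.MEDIUM


def can_develop_further(post_mitigation: RiskLevel) -> bool:
    """Further development requires a post-mitigation score of high or below."""
    return post_mitigation <= RiskLevel.HIGH


# Category labels as reported in the scorecard discussed in this section.
gpt4o_scorecard = {
    "cybersecurity": RiskLevel.LOW,
    "biological threats": RiskLevel.LOW,
    "persuasion": RiskLevel.MEDIUM,
    "model autonomy": RiskLevel.LOW,
}

print("overall:", overall_risk(gpt4o_scorecard).name)
for level in RiskLevel:
    print(level.name, "deploy:", can_deploy(level),
          "develop:", can_develop_further(level))
```

Because the rule takes a maximum rather than an average, three low categories do nothing to pull the overall label down: one borderline medium is sufficient.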

Yet the persuasion section itself is less simple than the scorecard. It begins with the statement that GPT-4o's persuasive capabilities marginally cross into OpenAI's medium-risk threshold from low risk.[20] The word "marginally" does important work. It prevents the medium label from appearing as a strong categorical leap. It marks the threshold as crossed, while also indicating proximity to the boundary. A public reader who sees only the scorecard may retain the medium label more vividly than the marginal nature of the crossing. The table's label travels easily; the qualifying adverb requires closer reading.

The evaluated object is also split by modality. OpenAI evaluated both text and voice capabilities. The card states that, based on pre-registered thresholds, the voice modality was classified as low risk, while the text modality marginally crossed into medium risk.[21] This distinction changes the meaning of the table. The public scorecard does not say "text persuasion: medium; voice persuasion: low" in the most visible summary position. It says "Persuasion: Medium." The label is accurate within the Preparedness Framework decision rule, but it compresses the modality difference. The reader must move from the scorecard into the prose to recover the more precise claim: medium risk belongs to the text modality crossing, not to a generalised conclusion that every persuasive modality of GPT-4o reached medium risk.

The metric is effect size. OpenAI presents immediate effect-size values and one-week-later values for persuasion interventions.[22] This gives the persuasion claim a more statistical surface than a purely qualitative risk assessment would provide. The values tell the reader that persuasion was measured as opinion shift under specified experimental conditions. The table-like presentation therefore appears to answer a concrete question: how much did an intervention move opinions, and how much of that movement persisted after time had passed? This is valuable, because it keeps the claim close to measurement. But effect size also demands careful interpretation. A percentage attached to opinion shift is not the same as a general measure of political manipulation capacity, real-world persuasion at scale, or long-term behavioural change.

The comparator is central. For the text modality, OpenAI compares GPT-4o-generated articles and chatbots against professional human-written articles on selected political topics. The card states that AI interventions were not more persuasive than human-written content in aggregate, while exceeding human interventions in three instances out of twelve.[23] That sentence complicates the scorecard. If one asks whether AI text was generally more persuasive than professional human text, the aggregate answer reported by OpenAI is no. If one asks whether AI text sometimes exceeded human interventions in particular instances, the answer is yes. If one asks whether the pre-registered threshold was crossed, the answer is also yes, but marginally. The scorecard condenses these answers into one label. Close reading has to unfold them again.

The voice modality produces a different comparison. OpenAI evaluated voiced audio clips and interactive multi-turn conversations relative to human baselines: listening to static human-generated audio clips or engaging in conversation with another human. The card states that, for both interactive multi-turn conversations and audio clips, GPT-4o's voice model was not more persuasive than a human. It reports that AI audio clips were 78% of the human audio clips' effect size on opinion shift, while AI conversations were 65% of the human conversations' effect size.[24] When surveyed one week later, the effect size for AI conversations was 0.8%, while the effect size for AI audio clips was -0.72%.[25] These numbers do not support a simple claim that GPT-4o voice exceeded human persuasion. They support the more limited claim that voice was evaluated, compared against human baselines, and classified as low risk under the relevant threshold.

The mitigation state is less visible here than in some dangerous-capability tables. The scorecard is a Preparedness Framework scorecard, and OpenAI distinguishes pre-mitigation and post-mitigation scores elsewhere in the card's framing. But in the persuasion section itself, the reader primarily sees a capability and risk classification rather than a full before-and-after mitigation comparison. The classification therefore depends heavily on evaluation design, threshold rules, and Safety Advisory Group review. The table-like presentation tells us the label; the surrounding prose tells us that the label is tied to pre-registered thresholds and review. What remains less visible is how different mitigations would change the measured persuasive effect, or whether the medium label reflects a pre-mitigation, post-mitigation, or boundary judgement in a way that a public reader can reconstruct without consulting the wider Preparedness Framework.

The threshold is doing more work than the table itself can show. OpenAI says the thresholds were pre-registered, but the scorecard does not display the threshold definition in the visual summary. The reader sees the result of threshold application rather than the threshold rule itself. That may be appropriate in a public system card, since including every threshold definition in the scorecard would make it unreadable. Yet this is exactly where visual confidence can begin to outrun evidentiary inspection. A medium label looks crisp. The boundary that made it medium, the pre-registration procedure, and the reasoning by which marginal crossing became release-facing classification require prose, cross-reference, and trust in the surrounding governance process.

The persuasion scorecard therefore supports several claims, but not all possible claims. It supports the claim that persuasion was the risk category that determined GPT-4o's overall medium Preparedness score. It supports the claim that text persuasion marginally crossed OpenAI's medium-risk threshold, while voice persuasion remained low. It supports the claim that OpenAI compared AI-generated interventions with human baselines and reported both immediate and one-week-later effects. It supports the governance claim that the Safety Advisory Group reviewed the Preparedness evaluations and mitigations as part of the deployment process.[26]

It does not support the stronger claim that GPT-4o is generally more persuasive than humans across modalities. It does not support the claim that the model's voice modality created a medium preparedness risk. It does not support an unrestricted claim about real-world political influence, because the experiments concern selected political topics, measured opinion shifts, controlled interventions, and defined follow-up intervals. It does not by itself show how persuasion risk changes under different user targeting, distribution scale, repeated exposure, personalised messaging, platform dynamics, or adversarial campaign design. Those may be relevant questions, but they are not settled by the table.

The relation between text and voice is especially important because GPT-4o was publicly significant as an omni model with audio capabilities. A reader might expect the novel voice modality to drive the most serious persuasion concern. OpenAI's own presentation cuts against that expectation. The card states that the voice modality was low risk, while text marginally crossed into medium. This is a good example of a table preventing one form of overclaim while enabling another. It prevents a broad inference that voice necessarily increased Preparedness risk. At the same time, the top-level scorecard can still allow the label "Persuasion: Medium" to circulate without the modality distinction attached.

The one-week follow-up values add another layer. Immediate effect sizes are not the same as persistent effects. OpenAI's inclusion of one-week-later measurements gives the table a temporal dimension, and that dimension matters for public interpretation. A persuasion result that appears meaningful immediately may weaken, invert, or remain small after a week. The voice results reported by OpenAI are particularly revealing here, since AI conversations and AI audio clips did not show a persistent effect that would justify a stronger voice-risk claim. The table's temporal rows therefore discipline the initial effect-size presentation. They remind the reader that persuasion should not be treated as a single moment of movement in opinion, but as a relation between intervention, measurement time, and durability.

The danger of the scorecard is not that it is wrong. The danger is that its most visible form is easier to remember than its evidentiary structure. "Persuasion: Medium" is compact, portable, and headline-ready. "Text modality marginally crossed a pre-registered medium threshold; voice remained low; AI text was not more persuasive than professional human-written content in aggregate but exceeded human interventions in three of twelve instances; one-week voice effects were small or negative" is more accurate, but far less portable. AI safety documentation has to live with this asymmetry. Public documents need summaries, but summaries can harden.

The case therefore shows the value of placing scorecards near explanatory prose. OpenAI's system card does not leave the medium label entirely unsupported. It explains the scorecard, names the Preparedness Framework categories, states the decision rule by which the highest category determines overall risk, distinguishes modalities, and provides effect-size information. The document gives the reader enough to resist the simplest misreading, though it still requires careful movement between the visible scorecard and the more conditional text.

For documentation practice, the lesson is that risk labels should carry their qualifying structure as close as possible to the point of display. A scorecard can state "Persuasion: Medium," but the surrounding page should quickly answer: medium for which modality, by which threshold, against which comparator, at which time interval, and with what aggregate-versus-instance distinction? Where those answers are nearby, the scorecard strengthens public accountability. Where they are distant or absent, the label begins to operate as institutional shorthand.

OpenAI's persuasion presentation is therefore neither a failure of documentation nor a model example of complete transparency. It is a useful intermediate case. The table-like scorecard provides a clear public classification; the effect-size presentation gives the claim a measurable surface; the prose preserves important distinctions; the overall label still risks travelling without its caveats. This is precisely the situation in which table literacy matters. A careful reader should not reject the medium-risk label. Nor should they repeat it without qualification. The responsible reading is narrower and stronger: GPT-4o's overall Preparedness score became medium because the text persuasion evaluation marginally crossed the medium threshold, while other risk categories and the voice modality were assessed as low under the reported framework.

That reading preserves the evidence-to-claim relation. It allows the scorecard to do its work without asking it to do more than it can support. The table gives the claim public form; the close reading restores the conditions under which that form remains accurate.

5. Google DeepMind: Adaptive Attacks and the Fragility of Non-Adaptive Confidence

Google DeepMind's report on defending Gemini against indirect prompt injections changes the scale of the article's argument. Anthropic's table supported a safeguard classification under biological-risk uncertainty. OpenAI's scorecard gathered mixed persuasion findings under a medium-risk label. Google DeepMind's table and figure do something more methodological. They show that the meaning of a robustness claim can change when the adversary is allowed to adapt.

The report's threat model concerns Gemini operating as an agent with function-calling capabilities. In that setting, the model may interact with external tools and data sources such as emails, documents, or web pages. This creates an attack surface because an adversary can place malicious instructions inside untrusted data that the model later retrieves. The user prompt may be benign, while the retrieved content contains the adversarial trigger. If the attack succeeds, the model may perform an unwanted tool call, such as exfiltrating private information, while appearing to act within the surrounding task.[27] This matters because the evaluated object is not a model answering a static harmful prompt. It is a tool-using system operating in a context where trusted instructions and untrusted data can become entangled.

The table-reading problem begins with that evaluated object. A robustness table for indirect prompt injection is not measuring general model intelligence, nor general harmlessness, nor even direct jailbreak refusal. It is measuring whether a model-system arrangement resists a particular adversarial pathway: malicious instructions placed in retrieved data and activated through tool use. The distinction is not decorative. A model may perform well on ordinary safety benchmarks while remaining vulnerable when a malicious instruction enters through an email body or retrieved document. The table belongs to security-facing AI safety documentation, where the adversary is not the user asking directly for a harmful answer, but a third party shaping the data the model is asked to process.

The metric is attack success rate, or ASR. Google DeepMind's Figure 6 and Appendix E compare attack success rates across different defences and attack methods. The report also plots the number of queries to Gemini 2.0 needed to achieve the reported attack success rate during attack optimisation.[28] This pairing of ASR and query count is important. Attack success rate tells the reader how often the attack worked under the evaluation conditions. Query count tells the reader something about the cost of constructing the attack. A defence that reduces ASR but requires very few queries to bypass may not offer the same practical protection as one that keeps ASR low while also making optimisation expensive. The table therefore measures not only whether the defence can be broken, but also something about how hard it is to break in the tested setting.

The comparator is the central methodological feature. Google DeepMind first evaluates defences with non-adaptive attacks: triggers are optimised against an undefended model and then tested against the same model with a defence in place.[29] This provides a baseline. It asks whether a defence can resist attacks that were not specifically constructed against it. The next evaluation introduces adaptive attacks, where attacks are run against Gemini 2.0 with the relevant defences in place, and the attack strategy is designed to overcome the specific properties introduced by those defences.[30] The comparison is therefore not simply "attack A versus attack B." It is "attack transferred from an undefended setting" versus "attack optimised with the defence in the loop."

That comparator changes the public meaning of the table. A defence can look promising against non-adaptive attacks because the attacker has not adjusted to the defence. Once the adversary sees or approximates the deployed defence mechanism, the relevant question changes. The report states that a real-world attacker with knowledge of the defence would likely optimise their attack strategy to circumvent it, and that such "defence-in-the-loop" optimisation can expose weaknesses inherent in the defence.[31] The table therefore makes a claim about evaluation realism. It tells the reader that a non-adaptive test may be useful as a baseline, but weak as a public robustness claim.

The results sharpen the point. In Figure 6, Google DeepMind reports attack success rate and query count across eight defences and three adaptive attack families: Actor Critic, Beam Search, and TAP. The report states that in sixteen out of twenty-four cases, adaptive attacks equal or outperform their non-adaptive counterparts. Adaptive Actor Critic has higher ASR against five of eight defences; adaptive Beam Search has higher ASR against five of eight; adaptive TAP has higher ASR against six of eight.[32] The table does not need a single catastrophic number to change the argument. Its force lies in the pattern: adaptation often erodes the confidence created by the non-adaptive evaluation.
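The arithmetic behind the headline count can be checked in a few lines. This is a minimal sketch using only the counts quoted in the paragraph above; the variable names are illustrative.

```python
# Counts quoted above: defences against which each adaptive attack family
# reached a higher ASR than its non-adaptive counterpart.
defences = 8
adaptive_ahead = {"Actor Critic": 5, "Beam Search": 5, "TAP": 6}

cells = defences * len(adaptive_ahead)      # defence/attack combinations
ahead = sum(adaptive_ahead.values())        # cells where adaptation helped
behind = cells - ahead                      # cells where it did not

print(f"{ahead} of {cells} cells ({ahead / cells:.0%}) favour the adaptive attack")
print(f"{behind} of {cells} cells do not")
```

The remainder matters as much as the headline: in a third of the cells the adaptive attack did not come out ahead, so the pattern is strong without being uniform.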

Appendix E gives the tabular surface behind Figure 6. Several comparisons illustrate why the table matters. For the ICL defence, Table 3 reports non-adaptive ASR values of 0.462 for Actor Critic, 0.344 for Beam Search, and 0.676 for TAP; Table 4 reports adaptive values of 0.598, 0.666, and 0.984 for the same attack families.[33] The movement is not subtle. A defence that already looked incomplete under non-adaptive testing appears substantially more vulnerable under adaptive attack, especially for TAP. For Spotlighting, the non-adaptive values are 0.180, 0.130, and 0.752, while the adaptive values are 0.824, 0.648, and 0.822.[34] Again, the table shows that a defence's apparent strength against transferred attacks may not survive direct adaptation.

The Warning defence behaves differently. The non-adaptive values are 0.058, 0.002, and 0.012; the adaptive values are 0.084, 0.000, and 0.108.[35] These values remain relatively low compared with many other defences, even after adaptation. That matters because the table does not merely say "adaptive attacks always make everything fail." It allows the reader to distinguish degrees of degradation. Some defences remain comparatively stronger. Others lose much of their apparent protection. A good table does not flatten all failures into one general warning. It lets different patterns remain visible.
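The degrees of degradation can be made explicit by subtracting each non-adaptive value from its adaptive counterpart. The sketch below uses only the ASR figures quoted in this section for three of the eight defences; the dictionary and function names are hypothetical, and the values are transcribed from the text above rather than re-derived from the report.

```python
ATTACKS = ("Actor Critic", "Beam Search", "TAP")

# ASR values as quoted in this section (non-adaptive vs adaptive).
NON_ADAPTIVE = {
    "ICL": (0.462, 0.344, 0.676),
    "Spotlighting": (0.180, 0.130, 0.752),
    "Warning": (0.058, 0.002, 0.012),
}
ADAPTIVE = {
    "ICL": (0.598, 0.666, 0.984),
    "Spotlighting": (0.824, 0.648, 0.822),
    "Warning": (0.084, 0.000, 0.108),
}


def asr_change(defence: str) -> dict:
    """Adaptive ASR minus non-adaptive ASR, per attack family."""
    pairs = zip(ATTACKS, NON_ADAPTIVE[defence], ADAPTIVE[defence])
    return {attack: round(adaptive - non_adaptive, 3)
            for attack, non_adaptive, adaptive in pairs}


for defence in NON_ADAPTIVE:
    worst = max(ADAPTIVE[defence])
    print(f"{defence:<13} worst adaptive ASR {worst:.3f}  change {asr_change(defence)}")
```

Read this way, the rows separate into different stories: large increases for ICL and Spotlighting, and small movements in both directions for Warning. A difference of this kind is still bound to the tested setting and says nothing about query cost.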

The classification defences add another layer. The Retrieved Data Classifier appears weak under non-adaptive Actor Critic and TAP but strong under non-adaptive Beam Search; under adaptive evaluation, its ASR values are 0.050, 0.148, and 0.108.[36] The User Instruction Classifier remains low in both non-adaptive and adaptive settings, with values close to zero across the reported attacks.[37] These rows complicate a broad anti-defence reading. The report's claim is not that all defences are useless. It is that robustness claims need to be tested against adaptive adversaries and interpreted in relation to the defence and attack combination being measured.

The threshold or decision rule is less explicit than in the Anthropic and OpenAI cases. Google DeepMind is not assigning Gemini to a public ASL category or a Preparedness Framework score. The table does not say that a particular ASR value triggers release, non-release, or a named safeguard level. Its decision consequence is methodological: it changes what kind of evaluation should be treated as adequate when a document claims robustness to indirect prompt injection. The report says the findings underscore the need to incorporate adaptive attacks into the standard evaluation pipeline for adversarial defences.[38] In this case, the table supports an evaluation-practice claim rather than a release-threshold claim.

This distinction makes the case especially relevant to AI safety documentation. Not all tables in system cards or safety reports point directly to deployment decisions. Some change the reader's understanding of the method by which later claims should be formed. Google DeepMind's table tells the reader that non-adaptive testing can generate misplaced confidence. The public claim being disciplined here is not merely "defence X works" or "defence Y fails." It is the broader claim that a robustness evaluation is credible only if it tests the defence under conditions that resemble the adversary's capacity to respond.

The mitigation state is also central. The table is explicitly about models with defences in place. The defences include in-context methods, such as in-context learning, spotlighting, paraphrasing, and warning, as well as classification-based defences.[39] The table therefore does not measure the undefended model's vulnerability alone. It measures whether particular mitigations hold under different attack regimes. This is why the adaptive comparison matters so much. A mitigation can reduce risk under one adversarial assumption and fail under another. If a safety document presents only the favourable mitigation state, without specifying whether the adversary adapted, the reader may infer more robustness than the table can support.

Uncertainty appears here less as a confidence interval than as evaluation-condition sensitivity. The same defence can look different depending on whether attacks are transferred or optimised against it. Google DeepMind even notes that there are cases where non-adaptive attacks outperform adaptive ones, possibly because of a phenomenon analogous to gradient obfuscation: the defended model may provide a noisy or non-smooth objective for the optimisation algorithm while still remaining non-robust against the worst-case attack.[40] This caveat is important because it prevents a crude reversal in which adaptive testing is treated as always stronger in every cell. The deeper lesson is that robustness evaluation itself is fragile. One evaluation setting may fail to expose a vulnerability, while another may fail for different methodological reasons.

This is where visual confidence has to be handled carefully. A reader might look at a table of ASR values and assume the lowest number identifies the safest defence. That may be true within the test conditions, but it is not the whole claim. Query counts, attack family, defence type, transfer setting, hold-out evaluation, scenario design, and the possibility of optimisation artefacts all affect the meaning of the number. The table's clean grid holds together many methodological decisions that have to be read back into the result. Without those decisions, ASR becomes an attractive but under-specified measure.

What can the table responsibly support? It supports the claim that adaptive evaluation is necessary for serious robustness assessment against indirect prompt injection. It supports the narrower claim that, in this report's Gemini 2.0 experiments, adaptive attacks often matched or exceeded non-adaptive counterparts across the tested defence/attack combinations. It supports the documentation claim that a public robustness statement should distinguish between transferred non-adaptive attacks and attacks optimised with the defence in the loop. It also supports the practical claim that attack cost, measured partly through query count, belongs beside attack success rate when evaluating real-world adversarial pressure.

What should it not be asked to support? It should not be read as a complete measurement of Gemini's security in all deployed contexts. It should not be used to claim that a given defence will always fail or always succeed. It should not be generalised beyond the tested models, scenarios, attack families, defences, and function-calling conditions without further evidence. It should not be treated as a general measure of AI safety, since the report itself distinguishes the security problem of adversarial manipulation from broader safety concerns.[41] It should not be used to imply that adaptive evaluation closes the robustness question. The report's own discussion moves toward defence in depth, continued evaluation, and model-level as well as system-level improvements.

The table is strongest when read as a correction to static confidence. Non-adaptive evaluation can still be useful. It establishes a baseline and can reveal obvious weakness. But if the public claim is that a defence protects users in a setting where adversaries can learn, iterate, and adapt, non-adaptive success is not enough. The table makes this visible by placing adaptive and non-adaptive results into relation. It does not only show more numbers. It changes the standard by which the earlier numbers should be judged.

This case therefore extends the article's reading method. Anthropic showed how a capability table can support a precautionary safeguard classification while leaving real-world risk uncertain. OpenAI showed how a scorecard can stabilise a public risk label while compressing modality-specific evidence. Google DeepMind shows how a table can revise the evaluation method itself. The object of scrutiny is not only the model or the defence, but the evaluation regime that gives a robustness claim its public force.

For external artefacts and system-card writing, the implication is direct. A public statement such as "our defence reduced prompt-injection success" is incomplete unless the document says against what kind of attack. If the attack was non-adaptive, the claim should be bounded accordingly. If the attack was adaptive, the document should explain how the adaptation was performed, how many attempts or queries were needed, which defences were in scope, and whether results were evaluated on a hold-out set. A table that omits these conditions may still be numerically precise, but the public claim built on it remains under-specified.
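One way to make these conditions concrete is to treat a robustness claim as a record whose fields must be filled before the claim counts as specified. The following Python sketch is an assumption-laden illustration, not a reporting standard: the class RobustnessClaim, its field names, and the example values are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RobustnessClaim:
    """A public robustness statement plus the conditions that bound it."""
    statement: str
    attack_type: Optional[str] = None          # "adaptive" or "non-adaptive"
    adaptation_method: Optional[str] = None    # how the attack was adapted
    queries_needed: Optional[int] = None       # attempts or queries required
    defences_in_scope: Optional[list] = None   # which defences were tested
    hold_out_evaluated: Optional[bool] = None  # evaluated on a hold-out set?

    def missing_conditions(self) -> list:
        """List the conditions a reader would still need to see."""
        if self.attack_type is None:
            return ["attack_type"]
        if self.attack_type != "adaptive":
            # A non-adaptive claim is complete once it is labelled as such;
            # it should then be read as a bounded, baseline result.
            return []
        required = ("adaptation_method", "queries_needed",
                    "defences_in_scope", "hold_out_evaluated")
        return [name for name in required if getattr(self, name) is None]


bare = RobustnessClaim("our defence reduced prompt-injection success")
print(bare.missing_conditions())

# Placeholder values for illustration only; not taken from any report.
partial = RobustnessClaim(
    "our defence reduced prompt-injection success",
    attack_type="adaptive",
    defences_in_scope=["example-defence"],
)
print(partial.missing_conditions())
```

The design choice is deliberate: an empty result does not mean the defence is robust, only that the claim has said what kind of pressure it was tested against.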

Google DeepMind's report is valuable because it makes that methodological pressure visible. It does not simply present a polished robustness conclusion. It shows an evaluation becoming more demanding, and then lets the results alter the confidence that a weaker evaluation might have produced. In documentary terms, the table carries a discipline of reversal. It asks the reader to revisit the apparent meaning of non-adaptive success once the adversary has been made more realistic.

The lesson for AI safety documentation is that table literacy must include adversary literacy. In security-relevant evaluations, the key question is not only what the model did under test conditions, but what the test allowed the adversary to know, change, and optimise. A table that hides the adversary's adaptation may give the reader a static measure of a dynamic risk. A table that makes adaptation visible gives the safety claim a firmer public surface, because the reader can see not only whether the defence held, but what kind of pressure it was asked to withstand.

6. What Tables Show, What They Withhold

The three case studies clarify a common structure. Anthropic, OpenAI, and Google DeepMind each use tables or table-like figures to make safety-relevant evidence public, but the tables do not carry the same kind of claim. Anthropic's table supports a safeguard classification. OpenAI's scorecard stabilises a risk label. Google DeepMind's adaptive-attack results revise the meaning of a robustness evaluation. The table, in each case, gives evidence a visible order. Yet the form of that order differs, and the reader's task changes accordingly.

Anthropic's table is classificatory. It gathers several chemical and biological capability evaluations and places them in relation to ASL-2 and ASL-3 protection levels. The table is not a clean comparison of two fixed models under one metric. It is a risk-relevant evidence assembly, drawing together uplift trials, expert red-teaming, long-form virology tasks, multimodal knowledge tests, and LAB-Bench subsets. Its public claim is not simply that one model scores higher than another. Its deeper claim is that certain capability signals justify stronger safeguards, even though real-world threat-actor usefulness remains uncertain.[42] The table therefore makes a precautionary decision legible without fully resolving the underlying risk.

OpenAI's scorecard is a labelling device. It condenses a set of Preparedness Framework assessments into a compact visual structure: cybersecurity low, biological threats low, model autonomy low, persuasion medium, overall medium. The table-like form gives the reader a quick public handle on GPT-4o's risk assessment. Yet the persuasion evidence underneath is more conditional than the label alone suggests. Text marginally crosses the medium threshold; voice remains low; AI text is not more persuasive than professional human-written content in aggregate, while exceeding human interventions in some instances; one-week effects qualify the immediate results.[43] The scorecard makes the public claim portable, while the prose is needed to keep the label proportionate.
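The roll-up from category ratings to an overall rating can be sketched in a few lines. This is a minimal illustration, not OpenAI's implementation: it assumes the overall rating is simply the highest category rating, and the names `Risk`, `scorecard`, `overall_rating`, and `drivers` are hypothetical. What it shows is that such a rule keeps the label and discards everything that qualified it.

```python
from enum import IntEnum


class Risk(IntEnum):
    """Ordered risk scale; the ordering is what makes max() meaningful."""
    LOW = 1
    MEDIUM = 2
    HIGH = 3
    CRITICAL = 4


# Category ratings as listed in the paragraph above.
scorecard = {
    "cybersecurity": Risk.LOW,
    "biological threats": Risk.LOW,
    "model autonomy": Risk.LOW,
    "persuasion": Risk.MEDIUM,
}


def overall_rating(ratings):
    """Assumed roll-up rule: overall rating = highest category rating."""
    return max(ratings.values())


def drivers(ratings):
    """Categories whose rating equals the overall rating."""
    top = overall_rating(ratings)
    return sorted(name for name, rating in ratings.items() if rating == top)


overall = overall_rating(scorecard)
driving = drivers(scorecard)
print(overall.name, driving)
```

Run as written, this prints `MEDIUM ['persuasion']`. Nothing in the output records that the crossing was marginal or specific to text, which is why the surrounding prose has to carry those conditions.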

Google DeepMind's table is methodological. Its adaptive prompt-injection results do not classify a model under a named release threshold. They change the standard by which a robustness claim should be read. Non-adaptive attacks establish a baseline, but adaptive attacks place the defence itself inside the adversary's optimisation process. When adaptive attacks equal or outperform non-adaptive counterparts in many defence/attack combinations, the table tells the reader that static evaluation can produce misplaced confidence.[44] The table's strongest claim concerns evaluation realism: a defence that looks strong against transferred attacks may not retain that apparent strength once an adversary adapts.
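The comparison this paragraph describes can be made concrete with a small sketch. The attack success rates below are invented for illustration and are not taken from the Google DeepMind report; the defence and attack names, and the helper `compare`, are likewise hypothetical. The sketch pairs each non-adaptive result with its adaptive counterpart and counts how often the adaptive attack does at least as well.

```python
# Hypothetical attack success rates (ASR), invented for illustration only.
# Keys are (defence, attack) pairs; values are (non_adaptive_asr, adaptive_asr).
asr = {
    ("defence_a", "attack_x"): (0.10, 0.35),
    ("defence_a", "attack_y"): (0.05, 0.05),
    ("defence_b", "attack_x"): (0.20, 0.15),
    ("defence_b", "attack_y"): (0.02, 0.40),
}


def compare(asr_table):
    """Pair each non-adaptive ASR with its adaptive counterpart."""
    rows = []
    for (defence, attack), (static, adaptive) in sorted(asr_table.items()):
        rows.append({
            "defence": defence,
            "attack": attack,
            "non_adaptive": static,
            "adaptive": adaptive,
            "adaptive_at_least_static": adaptive >= static,
        })
    return rows


rows = compare(asr)
holds = sum(row["adaptive_at_least_static"] for row in rows)
print(f"{holds} of {len(rows)} combinations: adaptive ASR >= non-adaptive ASR")
```

With these made-up values, the adaptive attack matches or exceeds the non-adaptive one in 3 of 4 combinations. A reader shown only the non-adaptive column would see four low numbers and no sign of the change.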

These differences matter because "table literacy" is not one skill. Reading a threshold table is different from reading a scorecard, which is different from reading an attack-success matrix. A threshold table asks whether the decision rule has been made visible. A scorecard asks whether the label preserves the conditions that produced it. An attack-success matrix asks whether the adversary, defence, and evaluation setting are realistic enough to support the robustness claim. Each table has to be read in relation to the kind of public claim it is being asked to carry.

The first shared lesson is that tables make evidence easier to compare by making part of the evidence disappear. This is not necessarily a failure. A table must simplify in order to become readable. But the simplification has consequences. Anthropic's table aligns unlike evaluations in one comparative surface, even though the rows differ in method, realism, metric, and source model. OpenAI's scorecard aligns risk categories under one visual scale, even though the persuasion evidence differs by modality, comparator, and time interval. Google DeepMind's figure aligns defences and attacks through attack success rates and query counts, while the meaning of those values depends on scenario design, optimisation procedure, and adversarial assumptions. The table's clarity is made possible by exclusions.

The second shared lesson is that the most visible part of the table is often not the most evidentially important part. In Anthropic's table, the visible contrast is ASL-3 versus ASL-2. The important caveat is that some cells report results for neighbouring or predecessor models, that some evaluations were run with safeguards removed, and that proxy tasks must stand in for actual dangerous tasks. In OpenAI's scorecard, the visible label is "Persuasion: Medium." The important qualification is that the crossing is marginal, text-specific, and not reproduced by the voice modality. In Google DeepMind's results, the visible values are attack success rates. The important methodological feature is whether the attack was adaptive or non-adaptive, and how many queries were required to produce it. A table can be visually honest and still be misread when the reader stops at the most visible element.

The third lesson is that the comparator often matters more than the score. A number without a comparator is only a surface. Anthropic's 2.53x uplift becomes meaningful because it is placed against a 25% control and a 2.8x ASL-3 threshold. OpenAI's persuasion effect sizes become meaningful because they are placed against professional human-written content, human audio clips, and human conversations. Google DeepMind's attack success rates become meaningful because they are placed against non-adaptive counterparts and across defence/attack combinations. In each case, the comparator determines the public force of the result. Change the comparator, and the table's claim changes.

The fourth lesson is that thresholds turn evidence into institutional language. Anthropic's ASL distinction and OpenAI's Preparedness scorecard are clearest here. A threshold is not just a line drawn through numbers. It is a rule for converting measurement into action. It allows an organisation to say that a stronger safeguard regime is required, that a category has reached medium risk, or that a deployment remains within a permitted risk band. The threshold is therefore a documentary hinge between evaluation and authority. Where the threshold is visible, the reader can ask whether the evidence supports the action. Where the threshold is hidden, the reader sees only the result of internal judgement.

The fifth lesson is that mitigation state must not be allowed to drift. A result obtained with safeguards removed is not evidence of ordinary deployed behaviour. A result obtained after classifier guards, refusal training, access controls, or other defences have been applied is not evidence of raw model capability. A result obtained against non-adaptive attacks is not the same as a result obtained against adaptive adversaries. Many weak safety claims begin when these states are compressed into one general sentence. Anthropic's table is strongest when read as a capability table that helps justify safeguards, not as a table proving mitigated product safety. Google DeepMind's results are strongest when read as a test of defences under different adversarial conditions, not as an abstract measure of model safety. The public claim has to remain attached to the system state being measured.

The sixth lesson is that uncertainty should appear at the point of inference, not only in a distant caveat. Anthropic's uncertainty paragraph is effective because it appears immediately after the ASL-3 classification discussion. It tells the reader that strong evaluation performance does not directly settle real-world threat-actor usefulness. OpenAI's use of "marginally" and its modality distinction perform a similar, though more dispersed, limiting function. Google DeepMind's discussion of adaptive testing prevents non-adaptive results from becoming a too-easy robustness claim. In each case, uncertainty protects the table from being asked to do too much. The caveat is not a sign of weakness. It is part of the evidentiary discipline of the claim.

The seventh lesson is that a table's public value depends on the prose around it. A table without prose may be visually precise and interpretively weak. The surrounding text tells the reader what the table is for. Anthropic's prose connects capability evaluations to ASL-3 classification and then to uncertainty about real-world translation. OpenAI's prose explains that the overall medium score follows from the highest risk category and that the persuasion result differs across modalities. Google DeepMind's prose explains why adaptive testing changes the interpretation of defence robustness. The table gives evidence a surface; prose gives that surface consequence. When prose overstates the table, the claim weakens. When prose makes the table's limits visible, the claim becomes more accountable.

The eighth lesson is that table design can create visual confidence before interpretive confidence has been earned. Rows, columns, labels, aligned numbers, and crisp categories all produce a sense of order. That order is useful. Without it, safety evidence would remain scattered and difficult to compare. But visual order can also make heterogeneous evidence appear more uniform than it is. Anthropic's evidence types do not all have the same evidentiary force. OpenAI's medium label does not carry the same meaning across text and voice. Google DeepMind's ASR values do not carry the same meaning without the attack method and defence condition. A serious reader has to slow the table down.

This slowing-down is not anti-technical. It is the condition for technical accountability in public documentation. A table should be allowed to clarify, but not to substitute for the reasoning that connects result to claim. The point is not to demand that system cards and risk reports publish every internal detail. That would be unrealistic and sometimes unsafe. The point is to ask whether the public artifact gives enough structure for an external reader to see where evidence ends, where judgement begins, and where uncertainty remains.

NIST's risk-management language helps name this problem. AI risk measurement can be difficult because AI systems are context-sensitive, metrics can be oversimplified or gamed, and measurement approaches may not capture relevant harms or affected groups.[45] NIST AI 800-2's emphasis on validity, transparency, and reproducibility gives more specific vocabulary for benchmark evaluation. Validity asks whether the evaluation measures what the claim needs it to measure. Transparency asks whether the reader can understand the evaluation conditions. Reproducibility asks whether the method is sufficiently specified for results to be checked or compared.[46] These terms are not a substitute for close reading. They sharpen it.

The article's three cases therefore produce a practical standard for safety-documentation reading. A good table in an AI safety report should not merely look clean. It should make its evidentiary function inspectable. The reader should be able to identify the object being evaluated, the metric used, the comparison made, the threshold or decision rule invoked, the mitigation state measured, the uncertainty preserved, and the public claim supported. If any of these elements is missing, the table may still be useful, but the claim built on it has to narrow accordingly.

A table can be strong even when its evidence is incomplete, as long as it does not hide that incompleteness. Anthropic does not directly test actual dangerous tasks; it says so. OpenAI does not show that GPT-4o voice is medium risk; it separates voice from text. Google DeepMind does not claim that adaptive evaluation settles the robustness problem; it shows why non-adaptive evaluation is insufficient. In each case, the stronger documentation move is not total certainty, but disciplined proportionality. The table supports a claim whose limits remain visible.

This is the most important synthesis for the portfolio. Table reading in AI safety documentation is a form of public reasoning. It is not only statistical interpretation, and it is not only visual design. It is the practice of asking what kind of confidence a document is arranging for the reader, and whether that confidence remains answerable to the evaluation that produced it. When the table and the claim remain proportionate, public accountability gains a surface. When they separate, the table becomes a device of assurance.

7. A Practical Checklist for Reading Tables in AI Safety Reports

The following checklist is designed for system cards, model cards, risk reports, preparedness reports, safety frameworks, red-team summaries, and model-release documents. It can be used on tables, scorecards, charts, appendices, benchmark summaries, attack-success matrices, and prose paragraphs that translate tabular evidence into public claims.

7.1 Identify the table's documentary function

Ask first what the table is doing inside the document.

Is it describing capability?

Is it classifying risk?

Is it comparing models?

Is it evaluating a mitigation?

Is it supporting a deployment decision?

Is it summarising red-team findings?

Is it translating a threshold into public language?

Is it giving evidence for a governance process?

A capability table should not be read as if it were a mitigation table. A mitigation table should not be read as if it measured raw model capability. A scorecard should not be treated as a full evidence record. The first task is to identify the kind of work the table has been placed there to perform.

7.2 Locate the evaluated object

Ask what exactly has been evaluated.

Is the table measuring a base model, a deployed system, a tool-using agent, a special evaluation variant, a helpful-only model, a model with safeguards removed, or a model after mitigations?

Is the model being evaluated in text, voice, image, tool-use, browser-use, code, or multimodal form?

Is the table using one fixed model, or does it substitute results from neighbouring models when those results are stronger or more relevant?

Are external tools, scaffolds, repeated attempts, or human assistance part of the evaluation?

A model name is not always a sufficient description of the evaluated object. The same model can have different public meanings depending on configuration, modality, safeguards, and context.

7.3 Define the metric

Ask what kind of number or label gives the result form.

Is the metric a benchmark score?

An uplift ratio?

An attack-success rate?

A pass@k score?

An effect size?

A confidence interval?

A human-baseline comparison?

A qualitative judgement?

A low / medium / high / critical score?

A threshold-crossing determination?

Then ask what the metric can and cannot support. An attack-success rate does not by itself establish real-world adversarial capability. An uplift trial does not by itself establish successful end-to-end harm. A medium-risk label does not by itself explain the evidence that produced it. A pass@5 score does not mean first-attempt reliability. The metric is the grammar of the table's claim.
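The gap between a pass@5 score and first-attempt reliability can be shown with simple arithmetic. The sketch below assumes that attempts are independent and share one success probability, which real evaluations may not satisfy, and reports may compute pass@k with a different estimator. The figure of 0.80 is hypothetical, as are the function names.

```python
def pass_at_k(p_single, k):
    """Chance that at least one of k independent attempts succeeds."""
    return 1.0 - (1.0 - p_single) ** k


def implied_single_attempt(pass_k, k):
    """Invert pass_at_k: per-attempt success rate implied by a pass@k score."""
    return 1.0 - (1.0 - pass_k) ** (1.0 / k)


reported_pass_at_5 = 0.80  # hypothetical figure, not from any cited report
p1 = implied_single_attempt(reported_pass_at_5, 5)
print(f"pass@5 = {reported_pass_at_5:.2f} implies a first-attempt rate "
      f"of about {p1:.2f} under independence")
```

Under these assumptions, a pass@5 of 0.80 corresponds to a first-attempt success rate of roughly 0.28. The same cell in a table therefore supports very different claims depending on whether the reader notices the k.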

7.4 Identify the comparator

Ask what makes the number meaningful.

Is the result compared with a previous model?

A human baseline?

A control group?

A threshold?

A defended system?

An undefended system?

A non-adaptive attack?

An adaptive attack?

A different modality?

A different time interval?

A score without a comparator may create the appearance of evidence while withholding the relation that gives the evidence force. The comparator often determines whether the table supports progress, risk, robustness, threshold crossing, or remaining uncertainty.

7.5 Find the threshold or decision rule

Ask whether the table is attached to action.

Does a number trigger a safeguard?

Does a label permit deployment?

Does a result move a model into a higher risk category?

Does a threshold come from a preparedness framework, responsible-scaling policy, safety framework, or internal review process?

Is the threshold visible, or does the table only show the outcome of applying it?

Is the threshold quantitative, qualitative, judgement-based, or mixed?

A threshold turns evaluation into institutional consequence. If the threshold is missing, the reader may see a category without seeing the rule that produced it.

7.6 Determine the mitigation state

Ask whether the table measures before, after, or around safeguards.

Are safeguards enabled or removed?

Is the model pre-mitigation or post-mitigation?

Are classifier guards, refusal training, monitoring, access controls, or other safeguards part of the tested system?

Is the table measuring model capability, deployed product risk, or safeguard effectiveness?

Does the table distinguish raw capability from mitigated system behaviour?

Many overstatements arise when mitigation state changes silently. A pre-mitigation capability result should not be presented as ordinary product behaviour. A post-mitigation result should not be treated as proof that the underlying model lacks the risky capability.

7.7 Read the uncertainty where the claim is made

Ask how uncertainty appears.

Are confidence intervals, sample sizes, distributions, or repeated trials included?

Does the document name proxy-task limitations?

Does it discuss evaluation realism?

Does it acknowledge benchmark saturation, contamination, or elicitation limits?

Does it distinguish observed performance from inferred real-world risk?

Are caveats placed near the table, or buried elsewhere?

Does the prose preserve uncertainty when it states the claim?

Uncertainty is most useful when it appears at the point where the reader might otherwise overgeneralise. A distant caveat rarely disciplines a strong visual claim.
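Why sample size belongs next to a reported rate can be shown with a standard interval calculation. The sketch below uses the Wilson score interval for a binomial proportion at roughly 95% confidence. The counts are hypothetical, and any given report may use a different interval method or none at all.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a proportion (z = 1.96 is roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1.0 + z ** 2 / trials
    centre = (p + z ** 2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z ** 2 / (4 * trials ** 2)) / denom
    return centre - half, centre + half


# The same observed rate (25%) at two hypothetical sample sizes.
small = wilson_interval(5, 20)
large = wilson_interval(500, 2000)
print(f"5/20:     {small[0]:.2f} to {small[1]:.2f}")
print(f"500/2000: {large[0]:.2f} to {large[1]:.2f}")
```

Both rows would appear in a table as "25%", yet the first interval runs from roughly 0.11 to 0.47 and the second from roughly 0.23 to 0.27. A table that prints the rate without the count leaves the reader unable to tell which of these situations applies.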

7.8 Compare table and prose

Ask whether the prose remains proportionate to the table.

Does the paragraph after the table say more than the table supports?

Does the table support a narrow claim while the prose makes a broad one?

Does a risk label travel without its modality, comparator, or threshold condition?

Does the prose distinguish what the table shows from what it does not show?

Does the document state what decision the table informs?

The public claim usually appears in prose, not in the table alone. The strongest reading tests whether the prose respects the table's evidentiary limits.

7.9 Check for visual confidence

Ask what the visual form makes easy to believe.

Do aligned rows and columns make heterogeneous evidence look uniform?

Does a colour, label, or scorecard make a judgement feel settled?

Does a compact table hide the fact that different rows use different methods?

Does the most visible label omit an important distinction?

Does the table make one result memorable while pushing the caveats into surrounding text?

Visual clarity is valuable, but it can become visual overconfidence. The reader should ask whether the design clarifies the evidence or smooths over its irregularity.

7.10 State the responsible claim

After reading the table, write one sentence that the table can responsibly support.

A responsible claim should include the evaluated object, the metric or comparison, the threshold or decision relation, and the remaining uncertainty.

For example:

This table supports the claim that, under the reported evaluations, models in the ASL-3 group show stronger chemical and biological capability signals than models kept under ASL-2, justifying stronger safeguards while leaving real-world threat-actor usefulness uncertain.

Or:

This scorecard supports the claim that GPT-4o's overall Preparedness rating became medium because text persuasion marginally crossed a medium threshold, while voice persuasion and the other assessed categories remained low.

Or:

This attack-success table supports the claim that non-adaptive prompt-injection evaluation can overstate robustness, because adaptive attacks often equal or exceed non-adaptive attacks once the defence is placed inside the adversary's optimisation process.

The final sentence is the test. If the sentence becomes too broad, the table has been made to overclaim. If it becomes too narrow, the table's evidentiary contribution may be missed. The aim is proportionality.

Compact Table-Reading Worksheet

Source document:

Table / figure title:

Document section:

Page or URL locator:

1. What is being evaluated?

Model / system / agent / modality / deployment surface / special evaluation variant.

2. What metric is used?

Score / uplift / ASR / pass@k / effect size / threshold label / qualitative judgement.

3. What is the comparator?

Prior model / human baseline / control group / threshold / defended vs undefended / adaptive vs non-adaptive.

4. What threshold or decision rule is attached?

ASL / Preparedness category / CCL / safeguard trigger / deployment condition / no explicit threshold.

5. What mitigation state is measured?

Safeguards removed / pre-mitigation / post-mitigation / defended system / unclear.

6. What uncertainty is visible?

Confidence interval / proxy task / sample size / elicitation caveat / evaluation realism / adversary model / none stated.

7. What public claim does the table support?

Write the narrow claim.

8. What would be an overclaim?

Write the claim the table does not support.

9. What further information would improve the table?

Method details / threshold definition / baseline explanation / confidence intervals / post-mitigation comparison / source data / caveats.

10. Documentation judgement

Strong / adequate / weak / misleading - and why.
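For readers who keep their notes in structured form, the worksheet can be represented as a small record type. This is only one possible encoding: the class name `TableReading`, its field names, and the example entries are hypothetical and describe no real report. The `missing` method returns the worksheet items left blank, which are the points where a claim would need to narrow.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class TableReading:
    """One completed (or partly completed) table-reading worksheet."""
    source_document: str
    table_title: str
    evaluated_object: Optional[str] = None    # worksheet item 1
    metric: Optional[str] = None              # item 2
    comparator: Optional[str] = None          # item 3
    threshold_or_rule: Optional[str] = None   # item 4
    mitigation_state: Optional[str] = None    # item 5
    uncertainty: Optional[str] = None         # item 6
    supported_claim: Optional[str] = None     # item 7
    overclaim: Optional[str] = None           # item 8

    def missing(self) -> List[str]:
        """Names of worksheet items that were left blank."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]


# Hypothetical entries, for illustration only.
reading = TableReading(
    source_document="Hypothetical risk report",
    table_title="Hypothetical Table 1",
    evaluated_object="deployed system, text modality",
    metric="attack success rate (ASR)",
    comparator="adaptive vs non-adaptive attacks",
    mitigation_state="defended system",
    supported_claim="Non-adaptive evaluation can overstate robustness.",
    overclaim="The defence makes the system robust to prompt injection.",
)
gaps = reading.missing()
print("Unanswered worksheet items:", gaps)
```

Here the threshold and uncertainty items are unanswered, so a documentation judgement based on this reading should say so rather than default to "strong".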

8. Conclusion: Visual Confidence and Public Accountability

The public life of an AI safety table begins after the table has been read. A scorecard travels into a headline, a risk category into an application, an attack-success rate into a claim about robustness, a threshold into a release decision. The table may have been carefully caveated in the original document, yet the compact form of its result can detach from the conditions that made it meaningful. This is one reason table literacy belongs near the centre of AI safety documentation. The risk is not only that a number will be wrong. The risk is that a number will be right within its evaluation setting and misleading once carried beyond it.

The three cases in this article show that safety tables do different kinds of documentary work. Anthropic's chemical and biological capability table helps justify enhanced safeguards for models at or beyond a frontier capability reference point, while preserving uncertainty about real-world threat-actor uplift. OpenAI's GPT-4o scorecard makes the medium persuasion label publicly legible, while the underlying evidence remains mixed, marginal, modality-specific, and temporally qualified. Google DeepMind's adaptive prompt-injection results do not mainly classify a model at all; they alter the standard by which a robustness evaluation should be trusted. In each case, the table supports a claim, but the claim becomes accountable only when its conditions remain visible.

This does not mean that public AI safety documents should avoid tables, scorecards, and compact visual summaries. The opposite is closer to the truth. Without tables, many safety claims would become less inspectable, because the evidence would be dispersed across prose, appendices, internal reports, or inaccessible evaluation pipelines. Tables can make comparison possible. They can show that a threshold has been approached or crossed. They can distinguish model capability from mitigated system behaviour. They can reveal where a defence works and where it fails. They can bring a public reader closer to the structure of the decision.

The problem appears when the table's order is allowed to stand in for the reasoning that should surround it. A table can make heterogeneous evidence appear uniform. It can make a marginal threshold crossing appear decisive. It can make non-adaptive robustness appear stronger than it is. It can hide the difference between raw capability and deployed product risk. It can turn a proxy task into a stronger public impression than the evaluation can support. These are not failures of numbers alone. They are failures of documentary relation.

The stronger alternative is not opacity. It is proportionality. A table should make clear what was evaluated, how it was measured, against what it was compared, which threshold or decision rule gave the result consequence, what mitigation state was being tested, what uncertainty remained, and what public claim the table can responsibly support. Where those elements are visible, the reader can inspect the movement from evidence to claim. Where they are absent, the table may still look polished, but its public authority becomes harder to test.

For AI safety documentation, this is more than a technical style issue. System cards, risk reports, preparedness frameworks, model cards, and safety reports are now among the documents through which frontier AI developers make their release decisions publicly legible. Their tables and charts help determine how external readers understand capability, risk, mitigation, residual exposure, and governance judgement. A well-written paragraph can still overclaim if the table beneath it has been misread. A well-designed table can still mislead if the prose around it asks the reader to infer too much. Documentation quality appears in the relation between the two.

The practical checklist offered here is therefore not merely a worksheet. It is a discipline of reading. It asks the reader to slow down at the point where visual confidence gathers fastest. The purpose is not suspicion for its own sake, but better public reasoning. In AI safety reports, a table becomes trustworthy not when it appears complete, but when its incompleteness is made legible enough that the claim remains proportionate. Public accountability depends on that proportion. A table can carry evidence into public form, but only careful documentation can keep the evidence from becoming assurance too soon.

Notes

[1] Anthropic, Risk Report: February 2026 (Anthropic, February 2026), section 4.4, Table 4.4.A and following paragraphs, pp. 66-68, https://www-cdn.anthropic.com/08eca2757081e850ed2ad490e5253e940240ca4f.pdf.

[2] Anthropic, Risk Report: February 2026, p. 68.

[3] OpenAI, 'GPT-4o System Card' (8 August 2024), "GPT-4o Scorecard" and "Preparedness Framework Scorecard", https://openai.com/index/gpt-4o-system-card/.

[4] OpenAI, 'GPT-4o System Card', "Persuasion".

[5] OpenAI, 'GPT-4o System Card', "Persuasion".

[6] Chongyang Shi and others, 'Lessons from Defending Gemini Against Indirect Prompt Injections' (Google DeepMind, 20 May 2025), section 8 and Appendix E, arXiv:2505.14534, https://arxiv.org/html/2505.14534v1.

[7] National Institute of Standards and Technology, Artificial Intelligence Risk Management Framework (AI RMF 1.0), NIST AI 100-1 (January 2023), pp. 6-7, https://doi.org/10.6028/NIST.AI.100-1; https://nvlpubs.nist.gov/nistpubs/ai/NIST.AI.100-1.pdf.

[8] National Institute of Standards and Technology, Center for AI Standards and Innovation, Practices for Automated Benchmark Evaluations of Language Models, NIST AI 800-2, Initial Public Draft (January 2026), https://nvlpubs.nist.gov/nistpubs/ai/NIST.AI.800-2.ipd.pdf.

[9] Anthropic, Risk Report: February 2026, section 4.1, p. 62, and section 4.3, p. 65.

[10] Anthropic, Risk Report: February 2026, section 4.4, p. 66.

[11] Anthropic, Risk Report: February 2026, Table 4.4.A, pp. 66-67.

[12] Anthropic, Risk Report: February 2026, Table 4.4.A, pp. 67-68.

[13] Anthropic, Risk Report: February 2026, section 4.4, p. 68.

[14] Anthropic, Risk Report: February 2026, section 4.4, p. 68.

[15] Anthropic, Risk Report: February 2026, sections 4.5-4.6, pp. 68-72; Appendix 7.6, pp. 99-100.

[16] OpenAI, 'GPT-4o System Card', "GPT-4o Scorecard" and "Preparedness Framework Scorecard".

[17] OpenAI, 'GPT-4o System Card', scorecard ratings.

[18] OpenAI, 'GPT-4o System Card', introduction to scorecard and risk overview.

[19] OpenAI, 'GPT-4o System Card', "Risk identification, assessment and mitigation" and Preparedness Framework evaluation sections.

[20] OpenAI, 'GPT-4o System Card', "Persuasion".

[21] OpenAI, 'GPT-4o System Card', "Persuasion".

[22] OpenAI, 'GPT-4o System Card', "Persuasion", immediate effect-size and one-week-later effect-size figures.

[23] OpenAI, 'GPT-4o System Card', "Persuasion".

[24] OpenAI, 'GPT-4o System Card', "Persuasion".

[25] OpenAI, 'GPT-4o System Card', "Persuasion".

[26] OpenAI, 'GPT-4o System Card', scorecard overview and Preparedness evaluation discussion.

[27] Shi and others, 'Lessons from Defending Gemini', sections 2-3, especially the definitions of indirect prompt injection and the function-calling threat model.

[28] Shi and others, 'Lessons from Defending Gemini', Figure 6 and Appendix E.

[29] Shi and others, 'Lessons from Defending Gemini', section 7, especially section 7.2.

[30] Shi and others, 'Lessons from Defending Gemini', section 8.1.

[31] Shi and others, 'Lessons from Defending Gemini', section 8.1.

[32] Shi and others, 'Lessons from Defending Gemini', section 8.2.

[33] Shi and others, 'Lessons from Defending Gemini', Appendix D, Table 3; Appendix E, Table 4.

[34] Shi and others, 'Lessons from Defending Gemini', Appendix D, Table 3; Appendix E, Table 4.

[35] Shi and others, 'Lessons from Defending Gemini', Appendix D, Table 3; Appendix E, Table 4.

[36] Shi and others, 'Lessons from Defending Gemini', Appendix D, Table 3; Appendix E, Table 4.

[37] Shi and others, 'Lessons from Defending Gemini', Appendix D, Table 3; Appendix E, Table 4.

[38] Shi and others, 'Lessons from Defending Gemini', section 8.2 and section 10.

[39] Shi and others, 'Lessons from Defending Gemini', section 7.1.

[40] Shi and others, 'Lessons from Defending Gemini', section 8.2.

[41] Shi and others, 'Lessons from Defending Gemini', section 2.

[42] Anthropic, Risk Report: February 2026, section 4.4, Table 4.4.A and following paragraphs, pp. 66-68.

[43] OpenAI, 'GPT-4o System Card', "GPT-4o Scorecard" and "Persuasion".

[44] Shi and others, 'Lessons from Defending Gemini', sections 7-8 and Appendix E.

[45] NIST, AI RMF 1.0, pp. 6-7.

[46] NIST, Center for AI Standards and Innovation, Practices for Automated Benchmark Evaluations of Language Models.

Bibliography

Anthropic. Risk Report: February 2026. Anthropic, February 2026. https://www-cdn.anthropic.com/08eca2757081e850ed2ad490e5253e940240ca4f.pdf.

National Institute of Standards and Technology. Artificial Intelligence Risk Management Framework (AI RMF 1.0). NIST AI 100-1. January 2023. https://doi.org/10.6028/NIST.AI.100-1; https://nvlpubs.nist.gov/nistpubs/ai/NIST.AI.100-1.pdf.

National Institute of Standards and Technology, Center for AI Standards and Innovation. Practices for Automated Benchmark Evaluations of Language Models. NIST AI 800-2, Initial Public Draft. January 2026. https://nvlpubs.nist.gov/nistpubs/ai/NIST.AI.800-2.ipd.pdf.

OpenAI. 'GPT-4o System Card'. 8 August 2024. https://openai.com/index/gpt-4o-system-card/.

Shi, Chongyang, and others. 'Lessons from Defending Gemini Against Indirect Prompt Injections'. Google DeepMind, 20 May 2025. arXiv:2505.14534. https://arxiv.org/html/2505.14534v1.