How Frontier AI Labs Present Safety Claims

System Cards, Evaluation Evidence, and the Public Form of Model-Release Documentation

Saman Samadi, PhD (Cantab)


Article / PDF | 3 June 2026

Focus: Comparative system-card analysis; safety claims; mitigation language; deployment reasoning.


Abstract

Frontier AI laboratories increasingly release system cards, model cards, preparedness reports, and safety-framework documents as public accounts of model capability, risk, mitigation, and deployment reasoning. These documents no longer operate as minor supplements to technical publication. They have become part of the documentary order through which a frontier model is made publicly answerable, even when the most decision-critical evidence remains inside the organisation that produced it. This article examines eight public release documents and companion governance frameworks from OpenAI, Anthropic, Google DeepMind, and Meta, reading them as safety-claim documents whose claimed neutrality is part of the public form being examined. The argument proceeds from a claim-level question: how does evaluation evidence acquire the public form of a safety judgement? Across the corpus, stronger documentation appears where the relation between evaluated capability, threshold, mitigation, residual risk, and deployment decision can be reconstructed by an external reader. Weaker documentation appears where public readers encounter polished conclusions without enough access to the underlying decision rule, elicitation conditions, version specificity, or post-deployment monitoring pathway. The article also compares the corpus with relevant US federal materials, including NIST AI RMF 1.0, the NIST Generative AI Profile, NIST guidance on benchmark evaluation and post-deployment monitoring, the GAO AI Accountability Framework, OMB M-25-21, and the US/UK AISI report on OpenAI o1. These sources do not impose a single standard on frontier labs, yet they clarify the documentary expectation that risk claims should be traceable, qualified, monitored, and attached to accountable decision processes. The resulting analysis proposes that the future of system-card practice lies less in longer reports than in a more disciplined public chain between evidence and claim.

Keywords: AI safety documentation; system cards; model cards; frontier AI governance; evaluation evidence; safety claims; residual risk; public accountability

1. Introduction: Safety Claims as Public Release Documents

The model card began as a modest documentary proposal. In Mitchell et al.’s influential formulation, it was a short document accompanying a trained model, intended to report model details, intended use, evaluation data, ethical considerations, and other information needed for responsible deployment.[1] That genre has not disappeared. Yet at the frontier of AI development it has been placed under a heavier burden. System cards, preparedness reports, safety reports, and frontier-safety assessments now appear at the moment when a model release must become legible to publics, regulators, developers, researchers, journalists, and future users. The card or report is asked to make a technical system visible without disclosing the whole of its construction. It must name capabilities without turning capability into advertisement; it must describe risk without producing panic or false reassurance; it must translate internal testing into public language while preserving enough uncertainty for the claim to remain answerable.

This article reads frontier AI release documentation as a public form of safety-claim production. Its concern is not whether a given model is safe in an absolute sense. Absolute safety is the wrong scale for these documents. Their function lies elsewhere, in the way they carry evaluation evidence into a claim about deployment readiness, risk management, mitigation, and accountability. A system card becomes serious when an external reader can follow the movement from a measured behaviour to a safety claim, from the safety claim to a mitigation story, and from the mitigation story to some account of residual risk. When that movement breaks, the document may still be candid, detailed, and useful, yet the public reader is left with a summary of institutional judgement with only partial traceability to the decision record.

The corpus examined here spans OpenAI’s GPT-4, GPT-4o, and o1 system cards; Anthropic’s Claude 3.7 Sonnet and Claude 4 system cards; Google DeepMind’s Gemini 3 Pro model card and Gemini 3 Pro Frontier Safety Framework report; and Meta’s Muse Spark Safety & Preparedness Report.[2] Companion frameworks from the same organisations are used where they supply the governing vocabulary behind release claims, especially OpenAI’s Preparedness Framework, Anthropic’s Responsible Scaling Policy, Google DeepMind’s Frontier Safety Framework materials, and Meta’s Advanced AI Scaling Framework.[3] The comparison is deliberately asymmetrical. GPT-4 belongs to an earlier phase of public safety reporting, where vivid risk examples and mitigation narratives carry much of the document’s force. Claude 4 and Muse Spark belong to a more recent phase, in which thresholds, scorecards, external evaluations, and release-governance pathways are much more explicit. Google DeepMind’s Gemini 3 Pro documentation occupies a different form again, since its public claim is distributed across a model card, a frontier-safety report, and a methodology supplement. The differences matter because documentary form shapes what a reader can test.

The article’s thesis is that frontier AI labs now make safety claims through a recurring sequence of compression, anchoring, evidentiary staging, and governance invocation. Internal evaluation evidence is compressed into scorecards, threshold statements, charts, capability labels, or domain summaries. Those summaries are then anchored in a named internal policy such as a Preparedness Framework, Responsible Scaling Policy, Frontier Safety Framework, or Advanced AI Scaling Framework. Selected tables, figures, external assessments, and short caveats give the claim a surface. The final public document frames deployment as acceptable within a defined context, usually after mitigation, while leaving many elements of the underlying evidence chain inaccessible to outside readers. Stronger documents expose enough of this chain for the public claim to be interrogated. Weaker documents ask readers to accept the organisation’s evaluative interpretation at precisely the points where scrutiny should be most active.

The stakes are professional as well as conceptual. For AI safety documentation, external artifacts, responsible AI governance, and model-evaluation communication, the central task begins with clear writing about technical material and extends into preserving the pressure of evidence as it moves into public prose. A benchmark result does not speak for itself. A red-team finding does not automatically become a deployment conclusion. A mitigation does not close the risk question unless the remaining risk is named with enough precision. Public documentation should therefore be judged at the level of claim discipline. The question is whether the language of the release remains proportionate to what the evidence can bear.

2. Method and Corpus: Reading System Cards as Documentary Forms

The method used here is comparative documentary analysis. Each selected document is read for its public role, internal organisation, evidentiary vocabulary, treatment of safety claims, representation of uncertainty, and relation to a wider governance framework. This is not a model-performance comparison. The article does not rank GPT-4o against Claude 4 or Gemini 3 Pro according to capability. It asks how different organisations make model-release claims readable, and where that readability strengthens or weakens accountability.

The corpus begins with OpenAI because its GPT-4 System Card helped establish the contemporary system-card genre. The GPT-4 document opens by naming safety challenges arising from model limitations and capabilities, then gives a high-level overview of the safety processes used before deployment, including measurements, model-level changes, product and system interventions, and external expert engagement.[4] Its documentary force lies in its candour about brittleness. OpenAI states early that mitigations and processes alter GPT-4’s behaviour and prevent certain misuses, while remaining limited and brittle in some cases. The card also says the company balanced minimising deployment risk, enabling positive uses, and learning from deployment.[5] The document therefore makes risk visible, yet it does so before the later public vocabulary of preparedness thresholds, tracked categories, and safeguard sufficiency had acquired its more formal OpenAI shape.

GPT-4o and o1 change that register. GPT-4o introduces a more modular release form, with sections on risk identification, safety evaluations, preparedness scorecards, third-party assessments, and societal impacts.[6] Its most visible public claim is that GPT-4o’s overall risk was assessed as medium, because the highest domain score in OpenAI’s preparedness categories was medium for persuasion, while other tracked categories were scored lower.[7] The card also makes a comparatively narrow residual-risk statement about unauthorised voice generation after voice-selection controls. OpenAI’s o1 system card adds a different kind of evidentiary care, since reasoning models raise special questions around chain-of-thought visibility, deception monitoring, version mismatch, elicitation, and lower-bound evaluation. The o1 card explicitly cautions that preparedness evaluations should be understood as a lower bound on potential risk, since further prompting, fine-tuning, scaffolding, or longer rollouts could reveal more capability.[8]

Anthropic’s Claude 3.7 Sonnet and Claude 4 system cards present another trajectory. Claude 3.7 treats visible thinking as both a product feature and a safety surface. Its structure moves from harmlessness, child safety, bias, and computer use into harms and faithfulness in extended thinking, reward hacking, and Responsible Scaling Policy evaluations.[9] Its importance for this article lies in the tension between displaying reasoning traces and admitting limits to their faithfulness. Claude 4 then turns that tension toward a broader release-governance problem. At 124 pages, its system card is much closer to a preparedness dossier than a conventional card. It connects reward hacking, alignment stress tests, CBRN evaluations, autonomy, cyber, third-party pre-deployment assessment, and ongoing monitoring to Anthropic’s Responsible Scaling Policy.[10] Its defining public move is precautionary: Anthropic states that it has not determined that Claude Opus 4 definitively crosses the CBRN threshold requiring ASL-3 protections, yet it also states that it cannot rule out such risk and therefore deploys with ASL-3 protections.[11]

Google DeepMind’s Gemini 3 Pro documentation requires a different reading path. The model card is short, public-facing, and deliberately summary-like. It gives essential information on intended use, limitations, mitigation approaches, safety performance, and known issues, while stating that Gemini 3 Pro did not reach any Critical Capability Levels under Google’s Frontier Safety Framework.[12] The Frontier Safety Framework report carries the heavier argumentative load: it defines the relevant Critical Capability Levels, describes risk domains such as CBRN, cybersecurity, machine-learning research and development, and harmful manipulation, and states the basis on which Gemini 3 Pro was deemed acceptable for deployment.[13] The methodology supplement then adds valuable context around evaluation sources, including cases where non-Gemini comparison scores rely on provider-reported results outside Google’s direct testing.[14] Read together, these documents form a more serious public case than the model card alone can support.

Meta’s Muse Spark Safety & Preparedness Report provides the most expansive single-document case in the corpus. It is framed as the evidence that informed Meta’s launch decision and organised through the Advanced AI Scaling Framework.[15] Its importance lies in the visibility of pre-mitigation and post-mitigation reasoning. The report states that catastrophe-relevant evaluations identified elevated risks before safeguards, that chemistry and biology capability likely reached a high-risk category before mitigation, and that multi-layered safeguards reduced residual risk to a level considered acceptable for deployment in Meta AI.[16] The report’s appendices, scorecards, confidence intervals, behavioural evaluations, content-safety sections, and changelog create a dense documentary surface. They also create a reading problem: comprehensiveness can produce confidence before the reader has understood the limits of comparison, access conditions, model filtering, and evaluation awareness.

The final layer of the method uses US federal sources as external benchmarks for documentation quality. NIST AI RMF 1.0, the NIST Generative AI Profile, NIST AI 800-2 on benchmark-evaluation practice, NIST AI 800-4 on post-deployment monitoring, the GAO AI Accountability Framework, OMB M-25-21, and the US/UK AISI report on OpenAI o1 do not function here as legal standards for private frontier labs.[17] Their value is diagnostic. They clarify what traceability, qualified claims, evaluation reporting, governance visibility, and monitoring would look like if public AI safety documentation were measured against a more demanding accountability horizon.

3. From Model Card to Public Safety Case

The system card’s transformation from model label to release document changes the status of the public artifact. A conventional model card can describe model facts, training data, intended uses, limitations, and evaluation results. A frontier system card does more than this. It participates in the public justification of deployment. The distinction is visible in the way these documents arrange evidence around risk and decision. GPT-4 reports a model through a narrative of why a system with known risks was released under a particular mitigation regime. GPT-4o summarises multimodal capabilities by attaching those capabilities to preparedness categories and third-party assessments. Claude 4 documents Claude Opus 4 and Claude Sonnet 4 through a precautionary governance decision under conditions of unresolved CBRN uncertainty. Muse Spark presents capability scores as part of a case that deployment in Meta AI is acceptable after a defined set of mitigations.

This case-like function is often implicit. The documents rarely call themselves safety cases in the formal assurance sense. Yet the public reader encounters them as safety cases in practice, because each report must support a claim that the organisation has tested enough, mitigated enough, and reasoned enough to release the system under specified conditions. The burden becomes clearer when the document includes a threshold. A threshold gives the claim a public hinge. It allows the reader to ask what was tested, whether the tested behaviour maps to the threshold, whether the threshold maps to the harm being claimed, and whether mitigation has altered the risk pathway enough to justify deployment.

Anthropic’s Claude 4 card offers the most explicit version of this structure. Its rule-out, rule-in, and middle-zone logic gives readers a way to understand why uncertainty is not collapsed into a binary answer. A model may fall below a definitive rule-in threshold while remaining too close to dismiss. That middle zone, once made public, changes the ethical texture of the release document. Anthropic’s decision to apply ASL-3 protections to Claude Opus 4 does not rest on a claim of certainty. It rests on a public account of precaution in which the organisation reports that it cannot rule out a relevant capability threshold.[18] The label ASL-3 matters because it is made claimable through a public relation between threshold, evaluation, uncertainty, and safeguard.
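
The three-zone logic can be sketched schematically. The following Python fragment is an illustration under stated assumptions and should not be read as Anthropic’s procedure: the numeric score, both cut-offs, and the names Zone, classify, and stronger_protections_required are hypothetical, and the system card does not reduce its evaluations to a single number. The sketch shows only the structural point that a result which cannot be ruled out is treated, for deployment purposes, like one that has been ruled in.

```python
from enum import Enum


class Zone(Enum):
    RULED_OUT = "ruled out"
    MIDDLE = "cannot rule out"
    RULED_IN = "ruled in"


def classify(score: float, rule_out_below: float, rule_in_at_or_above: float) -> Zone:
    """Place a (hypothetical) evaluation score in one of three zones."""
    if rule_out_below > rule_in_at_or_above:
        raise ValueError("rule-out cut-off must not exceed rule-in cut-off")
    if score < rule_out_below:
        return Zone.RULED_OUT
    if score >= rule_in_at_or_above:
        return Zone.RULED_IN
    # Neither confidently below nor confidently above: the middle zone.
    return Zone.MIDDLE


def stronger_protections_required(zone: Zone) -> bool:
    """Precautionary rule: only a confident rule-out avoids stronger safeguards."""
    return zone is not Zone.RULED_OUT


# Illustrative values only; none of these numbers come from a system card.
zone = classify(score=0.5, rule_out_below=0.3, rule_in_at_or_above=0.8)
print(zone.value, stronger_protections_required(zone))
```

The design choice worth noticing is that the precautionary rule is asymmetric: uncertainty defaults toward protection, which is what makes the middle zone consequential in the public claim.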

Meta’s Muse Spark report builds a different case. It gives a pre-mitigation risk claim, a mitigation story, and a residual-risk conclusion. That sequence is unusually valuable because it lets the reader distinguish model capability from deployed system risk. Muse Spark’s unmitigated chemistry and biology performance is treated as potentially high risk under Meta’s framework, while the deployed system, after safeguard layers, is described as presenting acceptable residual risk for use in Meta AI.[19] The documentary gain is substantial: the safety claim is not attached to raw model capability alone. It is attached to a configured deployment context. The pressure point then shifts to whether the safeguards are sufficiently described, validated, and monitored.

OpenAI’s documents show a more gradual movement. GPT-4’s card belongs to a formative stage in which public safety reporting relies on examples, red-team narratives, and mitigation descriptions. Its early candour remains valuable. The document says that examples are illustrative and not enough to show the breadth of possible harms, and it states that the system card is not comprehensive.[20] Those caveats give the document a degree of humility that later scorecard-heavy formats sometimes compress. GPT-4o and o1, however, bring OpenAI’s public reporting closer to a thresholded architecture. GPT-4o’s preparedness scorecards create a quick public route from domain evaluation to risk label. o1 adds stronger caveats around lower bounds and model-version differences. The public reader can see more structure than in GPT-4, though crucial internal decision artifacts still remain largely behind the surface.

Google DeepMind’s Gemini 3 Pro documentation makes the multi-document nature of frontier reporting especially clear. The model card is useful as an entry point, but it cannot carry the whole decision record. Its statement that Gemini 3 Pro did not reach any Critical Capability Levels becomes meaningful only when read alongside the Frontier Safety Framework report, which explains how those levels are defined and how the model was assessed against them.[21] The public case therefore depends on a documentation stack. That stack is not inherently weaker than a single long report. It may even be clearer if readers are given a stable route through the documents. The weakness appears when the model card circulates as the visible public artifact while the more demanding evidence remains elsewhere.

The movement from model card to public safety case therefore does not require every release document to become longer. Length can create its own opacity. The point is structural legibility. Strong documentation makes visible the relation between the model being evaluated, the system being deployed, the threshold being used, the evidence being offered, the mitigation being claimed, and the uncertainty that remains. The release document becomes accountable when those relations can be followed without relying on institutional trust alone.

4. Evidence, Thresholds, and the Movement from Evaluation to Claim

Evaluation evidence becomes public safety language through a difficult translation. A benchmark result, red-team finding, expert assessment, or capability elicitation run begins as a measurement under particular conditions. Once placed in a system card, it may become part of a broader claim about risk, mitigation, or deployment readiness. The translation is fragile because each stage changes the scale of the claim. A model’s performance on a cyber benchmark may indicate a capability under elicitation conditions. That capability may matter for a threat model. The threat model may justify a threshold. The threshold may trigger a safeguard. The safeguard may support a release decision. Each movement carries uncertainty, and the document’s quality depends on whether that uncertainty remains visible.

NIST AI 800-2 is helpful here because it treats benchmark evaluation as a staged process. The draft organises good practice around defining the measurement target, implementing and running the evaluation, and analysing and reporting results.[22] It also asks that evaluation objectives be clear, that the intended use of measurements be documented, and that reported claims be qualified. This vocabulary clarifies why some frontier AI safety documents are more useful than others. A score alone is thin evidence. A score attached to a target, protocol, threshold, uncertainty statement, and interpretive limit is much stronger.

OpenAI’s o1 system card shows one strong version of this logic. Because o1 is a reasoning model, the document must handle the problem of elicited capability under different prompting and scaffolding conditions. The card explicitly warns that preparedness evaluations are lower bounds on risk. It also states that exact production performance may vary from the near-final model tested, because final parameters, system prompts, and other deployment features can affect behaviour.[23] These caveats are not signs of weakness. They are part of the document’s evidentiary discipline. They prevent the safety claim from exceeding its measurement base.

Claude 4’s CBRN reasoning offers another form of claim discipline. The document does more than state whether the model is safe or unsafe. It positions Claude Opus 4 within a threshold regime where the inability to rule out a more demanding risk level leads to stronger protections. This matters because many public AI documents present uncertainty as a qualification at the end of a section. Claude 4 makes uncertainty part of the release decision itself. The threshold alters how the model is deployed, how protections are framed, and how external assessment is integrated into the public account.[24]

Meta’s Muse Spark report makes the evidence-to-claim movement visible through mitigation staging. Its central public claim depends on the difference between pre-mitigation capability and post-mitigation residual risk. The report says that elevated risks were identified before safeguards, especially in chemical and biological domains, and that a multi-layered mitigation package reduced the remaining risk to a level considered acceptable for deployment in Meta AI.[25] This is a strong documentary structure because it avoids the common confusion between model capability and deployed-system risk. At the same time, it raises the standard for evidence about safeguards. Once residual risk becomes the decisive public term, the reader needs to know how safeguards were tested, how robust they remain under adversarial pressure, and what monitoring will detect after release.
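
The staging described above can be pictured as a small record that keeps pre-mitigation capability and post-mitigation residual risk apart, and that lists what a reader still needs to know once the claim leans on safeguards. This is a minimal sketch under stated assumptions: the ordered scale, the field names, and the three evidence keys are hypothetical conveniences drawn from the questions raised in the paragraph above, not from Meta’s framework.

```python
from dataclasses import dataclass, field

# Hypothetical ordered scale, lowest to highest; not Meta's own categories.
LEVELS = ("acceptable", "elevated", "high")

# The three questions a reader needs answered once residual risk is decisive.
REQUIRED_SAFEGUARD_EVIDENCE = (
    "testing",
    "adversarial_robustness",
    "post_release_monitoring",
)


@dataclass
class StagedAssessment:
    domain: str
    deployment_context: str
    pre_mitigation: str
    post_mitigation: str
    safeguard_evidence: dict = field(default_factory=dict)


def relies_on_safeguards(a: StagedAssessment) -> bool:
    """True when the residual level is lower than the unmitigated level."""
    return LEVELS.index(a.post_mitigation) < LEVELS.index(a.pre_mitigation)


def unanswered(a: StagedAssessment) -> list:
    """Evidence still missing for a claim that depends on safeguards."""
    if not relies_on_safeguards(a):
        return []
    return [k for k in REQUIRED_SAFEGUARD_EVIDENCE if not a.safeguard_evidence.get(k)]


# Placeholder values for illustration only.
example = StagedAssessment(
    domain="chemistry and biology",
    deployment_context="consumer assistant",
    pre_mitigation="high",
    post_mitigation="acceptable",
    safeguard_evidence={"testing": "described in the report"},
)
print(unanswered(example))
```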

GPT-4o demonstrates the power and danger of scorecard compression. A preparedness scorecard can give public readers an accessible domain summary, especially where the alternative would be scattered narrative. GPT-4o’s scorecard makes it possible to see quickly that persuasion drove the overall medium risk rating, while other domains were judged lower.[26] Yet that same compression can detach the risk label from the method that produced it. Readers may remember the label and forget the threshold, task design, sampling method, elicitation regime, or deployment configuration. Scorecards therefore require nearby interpretive support. Otherwise, they become polished surfaces that carry more confidence than the visible evidence can fully sustain.
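
The compression step can be made concrete. On the account given above, the overall label is the highest domain score, so a single line of arithmetic turns a scorecard into one word. The sketch below is illustrative only: the four-step scale and the placeholder domain names are assumptions, and only the persuasion entry reflects what this article reports about GPT-4o.

```python
from enum import IntEnum


class Risk(IntEnum):
    # Assumed ordered scale for illustration.
    LOW = 1
    MEDIUM = 2
    HIGH = 3
    CRITICAL = 4


def overall_rating(scorecard: dict) -> tuple:
    """Return (overall level, domain that drove it) for a domain scorecard.

    The overall level is the maximum domain level; on ties the first
    domain in insertion order is reported as the driver.
    """
    if not scorecard:
        raise ValueError("scorecard is empty")
    driver = max(scorecard, key=scorecard.get)
    return scorecard[driver], driver


# Placeholder domains are hypothetical; only "persuasion" follows the text.
card = {
    "persuasion": Risk.MEDIUM,
    "placeholder_domain_a": Risk.LOW,
    "placeholder_domain_b": Risk.LOW,
}
level, driver = overall_rating(card)
print(level.name, driver)  # MEDIUM persuasion
```

Everything the function discards (threshold definitions, task design, sampling, elicitation regime, deployment configuration) is exactly what the surrounding prose of a system card has to carry if the label is to remain interpretable.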

The same problem appears in tables and figures. Claude 3.7’s helpfulness/refusal analysis is valuable because it shows a real trade-off in safety tuning, where refusal behaviour and helpful answering must be balanced in a way that resists describing safety as costless progress.[27] Claude 4’s threshold-zone visualisation is strong because it resists a false binary between below-threshold and above-threshold safety. Gemini 3 Pro’s highly compact frontier-safety table gives readers a quick conclusion, but its supporting reasoning lives in the FSF report. Muse Spark’s scorecards and confidence intervals provide a rich surface for evaluation, yet the report itself warns that peer-model comparisons can be affected by access conditions, filtering, and provider-side differences.[28]

The strongest evidence-to-claim practice in the corpus can therefore be expressed as a chain. The document should specify which model version or checkpoint was tested; what capability or behaviour was being measured; what protocol, benchmark, red-team process, or expert evaluation produced the result; what threshold or decision criterion made the result safety-relevant; what mitigation followed; what residual risk remained; and what public or internal governance process approved the release. Few documents reveal all of this with equal clarity. The best ones reveal enough to let the reader locate the missing parts.
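
The chain can be expressed as a simple record with one field per link, which makes the idea of locating the missing parts operational. The following is a minimal sketch, assuming field names of my own choosing; it is a reading aid for the seven elements listed above, not a schema used by any of the laboratories discussed.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class EvidenceChain:
    # One field per link in the chain; None means "not publicly stated".
    model_version: Optional[str] = None
    measured_capability: Optional[str] = None
    evaluation_protocol: Optional[str] = None
    decision_threshold: Optional[str] = None
    mitigation: Optional[str] = None
    residual_risk: Optional[str] = None
    approval_process: Optional[str] = None


def missing_links(chain: EvidenceChain) -> list:
    """List the links a reader cannot locate, in chain order."""
    return [f.name for f in fields(chain) if not getattr(chain, f.name)]


# Hypothetical reading notes for an unnamed document.
notes = EvidenceChain(
    model_version="near-final checkpoint",
    measured_capability="cyber capability under elicitation",
    decision_threshold="named framework threshold",
)
print(missing_links(notes))
```

Used this way, the record does not score a document; it only shows which links the public text leaves for the reader to reconstruct.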

5. Four Organisational Styles of Safety Claim

The four organisations studied here produce more than different labels. They produce different documentary styles. OpenAI’s public style moves from narrative risk disclosure toward scorecard preparedness. Anthropic’s style foregrounds thresholded governance and precautionary deployment reasoning. Google DeepMind’s style distributes the safety case across a documentation stack. Meta’s style produces a large preparedness report that publicly stages pre-mitigation capability, mitigation, and residual-risk judgement. These differences should not be flattened into a single frontier-safety vocabulary. Comparative reading depends on preserving their local grammar.

OpenAI’s documentation is strongest when it makes the limits of evaluation visible. GPT-4’s early system card provides striking examples of risk, explains mitigation work, and acknowledges that the card is not comprehensive. It also reports that examples are selected to illustrate observed risks and that one example cannot show the breadth of possible manifestations.[29] This language is unusually careful for a launch-adjacent document. Its limitation lies in the absence of a fully public threshold architecture. By the time of GPT-4o and o1, OpenAI has more explicit preparedness categories and a clearer relation between evaluations, domain labels, and internal review. The o1 card is especially strong because it attaches caveats to method and model version, reducing the chance that evaluation results will be over-read as deployment guarantees.

Yet OpenAI’s strongest public documents still depend on internal materials that are only partially visible. The Preparedness Framework v2 refers to Capabilities Reports and Safeguards Reports, including assessment of residual risk and safeguard limitations.[30] Those reports, if fully public, would be central to evaluating the release decision. Their existence strengthens internal governance, but their absence from public view creates a traceability gap. The public reader sees the system card, scorecard, and framework. The fullest internal chain from test output to leadership approval remains largely unavailable.

Anthropic’s distinctive strength is the explicitness of its release reasoning. Claude 3.7 already ties system-card evidence to the Responsible Scaling Policy, while also introducing a difficult issue around visible reasoning. The document reports that chain-of-thought visibility may help users and developers inspect some model behaviour, while also acknowledging that such traces are not fully faithful and that the company may adjust its display decisions over time.[31] Claude 4 extends that discipline into a wider decision record. The public card names the relevant threshold logic, places uncertainty inside the release decision, and describes external assessment as part of the evidence environment. Anthropic’s RSP v3.1 also gives the broader policy architecture through which AI Safety Levels, risk categories, and reporting commitments are made durable.[32]

Anthropic’s documentation can still be difficult to read because the relevant evidence is distributed across long system cards, policy documents, external assessment statements, and promised or minimally redacted reports. The value lies in the fact that the public reader can see where the chain passes through governance. Claude 4’s public form does not ask the reader to treat the absence of certainty as a marginal caveat. It makes uncertainty consequential.

Google DeepMind’s Gemini 3 Pro documents show the benefits and hazards of a layered public record. The model card is accessible and compact. It states major limitations, intended uses, safety findings, and the conclusion that Gemini 3 Pro did not reach any Critical Capability Levels.[33] The FSF report gives that conclusion its proper documentary ground by naming the relevant capability domains, defining the framework’s risk logic, and explaining why the model was deemed acceptable for deployment.[34] The methodology supplement adds further value by marking where comparison scores rely on provider-reported results. That methodological admission is important because it prevents benchmark comparison from becoming a false common scale.

The risk in the Google case is that public circulation may privilege the shortest document. A hiring manager, developer, journalist, or policy reader may encounter the model card without following the FSF report and methodology supplement. The public case then appears thinner than it is. This is a documentation-design problem. A stacked architecture can work if the stack itself is made legible. It weakens when the interpretive path is left for readers to reconstruct.

Meta’s Muse Spark report is the richest single example of public evidence-to-claim presentation in this corpus. It gives the reader extensive detail on preparedness evaluations, behavioural safety, content safety, catastrophic risk domains, red teaming, robustness, confidence intervals, and later correction.[35] It also states one of the most important caveats in contemporary evaluation reporting: behavioural evaluations may fail to rule out a sufficiently strategic model that calibrates outputs to appear credible under test conditions.[36] This is a strong uncertainty statement because it addresses the epistemic limit of the method itself. The report is weaker where a casual reader may treat the richness of evaluation as equivalent to completeness. Its scale can create a sense of documentary abundance before the reader has worked through access differences, filtering effects, and the exact relation between Meta AI deployment and broader possible contexts.

Across these organisations, terminology drift is unavoidable. OpenAI’s preparedness categories do not map cleanly onto Anthropic’s AI Safety Levels, Google’s Critical Capability Levels, or Meta’s catastrophic-risk thresholds. OpenAI’s autonomy, Anthropic’s autonomy, Google’s machine-learning research and development, and Meta’s loss of control carve the risk space along different joints. A comparative article that treats these labels as equivalent units will manufacture convergence. The better comparison lies in the documentary work each term performs: what it measures, what it triggers, what evidence supports it, and what it allows the organisation to claim.

6. Uncertainty, Residual Risk, and the Missing Audit Trail

Uncertainty is one of the most reliable signs of serious safety documentation. The important question is how uncertainty is placed. If uncertainty appears only as a general disclaimer, it protects the organisation without informing the reader. If it is attached to an evaluation method, a model version, a threshold, or a mitigation, it becomes part of the claim’s structure. It tells the reader how far the evidence reaches.

GPT-4’s uncertainty is broad and often narrative. The card says that it is not comprehensive, that examples are illustrative, that mitigations have limitations, and that further work is needed. This is valuable, especially for an early document. GPT-4o narrows uncertainty into domain and modality issues, including the special residual-risk question around voice generation and the limits of third-party assessments. o1 goes further by giving uncertainty a methodological form: lower-bound language, near-final model caveats, and acknowledgement that additional elicitation could shift risk interpretation.[37]

Anthropic’s Claude 3.7 and Claude 4 cards make uncertainty operate at a different point. Claude 3.7’s uncertainty centres on the faithfulness of reasoning traces and the possibility that visible thinking may not provide reliable access to the model’s internal process.[38] Claude 4’s uncertainty sits at the threshold of release. The company reports that it cannot rule out the relevant CBRN threshold and therefore applies stronger safeguards.[39] This is a mature documentary move because uncertainty shapes the claim before the release decision settles into public form.

Google DeepMind’s uncertainty appears in the layered relation between model card, FSF report, and methodology supplement. The FSF report’s strongest contribution is its framework logic, while the methodology supplement clarifies how certain comparisons should be interpreted. The point is small but consequential: if scores from non-Gemini providers are self-reported or otherwise differently obtained, then the table is not a neutral cross-model measurement field. It is a comparative surface with methodological seams.[40] Those seams need to remain visible if the table is going to support a public claim.
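One way to keep such seams visible is to carry the provenance of each score alongside its value, so that a comparison across differently obtained scores is flagged rather than silently read as like-for-like. The sketch below is illustrative only: the class name, the provenance labels, the model and benchmark names, and the placeholder values are all assumptions introduced here, and none of them is taken from the Gemini 3 Pro documentation.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class ReportedScore:
    model: str        # hypothetical model label
    benchmark: str    # hypothetical benchmark name
    value: float      # placeholder value, not a real result
    provenance: str   # assumed labels, e.g. "internal_run" or "self_reported"


def seams(scores):
    """Return pairs of scores on the same benchmark whose provenance differs."""
    flagged = []
    for a, b in combinations(scores, 2):
        if a.benchmark == b.benchmark and a.provenance != b.provenance:
            flagged.append((a.model, b.model, a.benchmark))
    return flagged


# Placeholder rows for illustration only; none of these values come from any report.
table = [
    ReportedScore("model_a", "bench_x", 0.0, "internal_run"),
    ReportedScore("model_b", "bench_x", 0.0, "self_reported"),
    ReportedScore("model_c", "bench_x", 0.0, "internal_run"),
]

for left, right, bench in seams(table):
    print(f"{bench}: {left} vs {right} compares differently obtained scores")
```

The design choice is deliberately modest: the function does not correct or reweight anything. It only marks where a comparative table joins scores produced under different conditions, which is the information a reader needs before treating the table as evidence for a public claim.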

Meta’s Muse Spark report makes uncertainty unusually explicit in the domain of behavioural evaluation. Its discussion of strategic calibration and evaluation awareness recognises that behavioural tests can be gamed by a sufficiently capable or situationally aware system.[41] This caveat does not undermine the entire report. It strengthens the report by preventing behavioural assessment from being inflated into total assurance. The same is true of the report’s caveats around peer comparison. A rich table can mislead if the reader forgets that deployment context, access mode, and filtering shape the meaning of the numbers.

Residual risk is the companion concept to uncertainty. A safety document that describes mitigation without residual risk leaves the reader at a weak point in the argument. It says what was done, but not what remains. OpenAI’s Preparedness Framework explicitly incorporates residual risk in its internal safeguard reasoning.[42] Meta’s Muse Spark report is the clearest public example of residual risk being staged through pre- and post-mitigation assessment. Claude 4’s public reasoning is more threshold-centred, but its ASL-3 deployment logic similarly ties unresolved risk to stronger protections. Google DeepMind’s FSF report gives a framework-based acceptability judgement, though its public treatment of residual risk is less textured than Meta’s.

The missing audit trail appears around the same pressure points in all organisations. Prompting and scaffolding details often remain only partially visible. The exact system prompt, elicitation regime, internal dashboards, threshold calibration, and full post-deployment monitoring metrics are rarely available. OpenAI’s Capabilities Reports and Safeguards Reports are central in the Preparedness Framework but not generally released in full. Anthropic refers to minimally redacted capabilities reporting to external institutes, but the public reader does not necessarily receive that full evidentiary file. Google DeepMind’s model card depends on the FSF report and methodology supplement. Meta’s report gives much more detail, yet public readers still cannot reproduce the internal risk decision from the outside.

This gap should be named carefully. It is not proof that the labs are acting in bad faith. It is a structural feature of frontier AI safety documentation under commercial, security, and governance constraints. The public artifact is usually a disclosure layer, not the full internal case. That distinction matters because the reader’s confidence should be calibrated to the form of evidence available. A system card can make a safety claim publicly discussable. It rarely makes the complete safety case independently auditable.

Post-deployment monitoring is the largest remaining weakness in the public record. Most documents gesture toward monitoring, incident response, ongoing evaluation, or policy enforcement. Fewer publish concrete monitoring metrics, update triggers, incident rates, failure patterns, or change-control records after release. This matters because pre-deployment evaluation is only a partial account of model behaviour. Once a model moves into real use, distribution shift, adversarial adaptation, user behaviour, tool integration, and downstream context begin to alter the risk surface. A serious public release document should therefore point forward as well as backward. It should explain the justification for launch while also stating how the organisation will know if that justification weakens.

7. Federal Benchmarks and the Public Accountability Horizon

US federal sources provide a useful external horizon for this analysis because they sharpen the vocabulary of traceability, governance, evaluation reporting, and monitoring. They do not govern every frontier-lab release directly. Their value here is analytical: they help identify what a stronger public safety document would need to make visible.

NIST AI RMF 1.0 foregrounds governance, documented roles and responsibilities, risk management processes, and continuing mechanisms for identifying and tracking risks.[43] The NIST Generative AI Profile extends this into the specific domain of generative AI, stressing oversight, documentation, pre-deployment testing, evaluation, verification, and validation, while recognising that laboratory tests may fail to capture real-world context.[44] These materials clarify one of the central weaknesses in current frontier-lab reporting. Many documents are strong around pre-release evaluation; fewer make the continuing governance of released systems equally visible.

NIST AI 800-2 is particularly useful for reading benchmark claims. It asks evaluators to define the measurement target, implement and run the evaluation, and analyse and report results with qualification and attention to uncertainty.[45] In the frontier-lab corpus, the strongest public claims are those that satisfy this structure in compressed form. Claude 4 does so where threshold zones and uncertainty are attached to release reasoning. Meta does so where pre-mitigation and post-mitigation risk are connected to deployment context. o1 does so where lower-bound and version caveats qualify the evaluation result. GPT-4, by current standards, looks less exact because it often shows risk phenomena and mitigation without a stable public threshold or measurement target.

NIST AI 800-4 and the GAO AI Accountability Framework sharpen the monitoring gap. NIST AI 800-4 argues that deployed AI systems require measurement and monitoring after release because pre-deployment testing cannot capture the full range of real-world behaviour, especially in systems that are non-deterministic, adaptive, or context-sensitive.[46] GAO gives monitoring equal standing with governance, data, and performance, and asks for evidence from monitoring activity and corrective action.[47] Read against those sources, current system cards and safety reports remain heavily front-loaded. They explain how the model was assessed before launch much more fully than they explain how public accountability will continue after launch.

OMB M-25-21 provides a further public-sector standard for high-impact federal AI use. It requires documented pre-deployment testing, impact assessment, public reporting of determinations and waivers, ongoing monitoring, and appropriate mechanisms for human review and appeal.[48] Frontier AI labs are different institutions, and their release documents cannot be mapped directly onto agency use-case determinations. Still, M-25-21 shows what a complete public decision record might include: purpose, benefit, assessed harms, evidence, mitigation, approval, monitoring, and corrective procedure. That structure is useful for assessing the gap between release transparency and accountability.

The US/UK AISI pre-deployment evaluation report on OpenAI o1 provides perhaps the strongest model of cautious evaluation language in the source set. It states that the report should not be read as an indication that the system is safe or appropriate for release, that the findings are preliminary and partial, that methods are still developing, and that updated model versions or deployed-system observation could change conclusions.[49] This language is valuable because it separates evaluation reporting from deployment endorsement. It offers a public style in which evidence is made available without being inflated into broader assurance.

Taken together, these federal materials suggest an audit lens for frontier safety documentation. A strong public release document should identify the relevant governance pathway, specify the model version and deployment context, define the evaluation target, disclose the threshold or decision criterion, explain mitigation in relation to the risk pathway, describe residual risk, mark uncertainty in proximity to the claim it qualifies, and state how post-deployment monitoring will affect future action. These criteria do not require a single universal template. They require a public chain strong enough for the claim to bear scrutiny.

Measured against that lens, Anthropic’s Claude 4 and Meta’s Muse Spark are the strongest documents in the corpus, though they reach strength through different means. Claude 4 excels in threshold reasoning and precautionary release logic. Muse Spark excels in public staging of pre-mitigation and post-mitigation risk. OpenAI’s o1 card is strong in methodological caution. Google DeepMind’s Gemini 3 Pro documentation is strong when read as a stack. GPT-4 remains important as a historical baseline, and GPT-4o as an intermediate form, while both give public readers less traceable decision logic than the strongest later documents.

The federal benchmark comparison also prevents a complacent conclusion. The public documentation of frontier AI has improved substantially. Yet public traceability still thins around internal decision artifacts, monitoring data, and the exact conditions under which evaluation results are produced. The question for the next phase of system-card practice is therefore not whether labs can publish longer or more polished reports. The question is whether they can produce public documents in which safety claims remain continuously attached to evidence, uncertainty, and accountable procedure.

8. Conclusion: Toward Claim-Level Discipline in AI Safety Documentation

Frontier AI release documentation has entered a new phase. System cards and model-release reports are no longer peripheral explanations of a deployed system. They are among the main public sites where evaluation evidence, risk judgement, governance structure, mitigation language, and institutional accountability meet. Their importance follows from their placement. They appear when a model crosses from internal development into public use, developer access, platform integration, or large-scale social consequence. They are documents of translation, and translation is the place where overstatement most easily appears.

The eight-document corpus examined here shows real improvement in public safety reporting. GPT-4’s system card established an early mode of candid risk narration. GPT-4o introduced more explicit preparedness scorecards and third-party assessments. o1 added stronger method-level caution around reasoning models, version specificity, and lower-bound risk. Claude 3.7 made visible reasoning itself into a safety-documentation problem. Claude 4 made uncertainty consequential by connecting unresolved threshold risk to ASL-3 protections. Gemini 3 Pro showed the value and fragility of a stacked public documentation model. Muse Spark made the pre-mitigation to post-mitigation movement unusually explicit and incorporated significant caveats about behavioural evaluation.

The strongest pattern across the corpus is the emergence of traceable release reasoning. Safety claims become more credible when readers can reconstruct how the organisation moved from evaluation to threshold, from threshold to mitigation, and from mitigation to residual-risk judgement. The weakest pattern is the persistence of hidden or only partially visible evidence chains. Internal reports, prompt scaffolds, exact elicitation conditions, threshold calibration, monitoring dashboards, post-release incident data, and approval records often remain beyond public view. The resulting documents can be serious and still incomplete. Seriousness and completeness should not be confused.

A publishable system card should therefore be judged less by the sheer quantity of information it contains than by the discipline of its claim structure. Does the document identify the model version tested? Does it specify what the evaluation measured? Does it attach the result to a decision rule or threshold? Does it explain what changed after mitigation? Does it say what remains uncertain? Does it describe the residual risk in the deployed context? Does it give public readers some account of who approved release, under what framework, and how the organisation will respond if post-deployment evidence changes the risk picture?

This article has treated those questions as documentary questions. They also define a professional field. AI safety documentation is more than technical writing applied to AI. It is claim-level governance in public language. The strongest contribution of a documentation specialist is to notice where evidence has lost contact with wording, where a table gives a stronger impression than its method can support, where uncertainty has been smoothed into fluency, where terminology changes register across sections, and where a release claim depends on an internal artifact that the public cannot see. The task is not to make safety language sound reassuring. The task is to make it answerable.

The next step, following from this article, is a practical safety-claim audit framework. Such a framework should treat every claim as a relation between claim type, evidence source, evaluation method, threshold, uncertainty, mitigation, residual risk, audience, and accountability pathway. It should be usable on a paragraph, a table, a scorecard, or a release summary. The purpose would not be to produce a moral verdict on the model. It would be to test whether the public document allows its own safety claims to be read with sufficient pressure.
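As a rough indication of what such a framework might look like in practice, the relation described above can be sketched as a simple record with one field per element. This is a minimal sketch under stated assumptions, not a worked-out instrument: the class name, the field names, the helper function, and the example claim are all hypothetical and introduced here for illustration.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class SafetyClaim:
    """One public safety claim and the relations the article asks a reader to trace."""
    text: str                                     # the claim as worded in the public document
    claim_type: Optional[str] = None
    evidence_source: Optional[str] = None
    evaluation_method: Optional[str] = None
    threshold: Optional[str] = None
    uncertainty: Optional[str] = None
    mitigation: Optional[str] = None
    residual_risk: Optional[str] = None
    audience: Optional[str] = None
    accountability_pathway: Optional[str] = None


def open_relations(claim):
    """List the relations the public text leaves unstated for this claim."""
    return [f.name for f in fields(claim)
            if f.name != "text" and not getattr(claim, f.name)]


# Hypothetical claim, invented for illustration; it is not quoted from any system card.
example = SafetyClaim(
    text="The model did not reach the relevant capability threshold.",
    claim_type="threshold judgement",
    evidence_source="pre-deployment evaluation",
    threshold="named in the companion framework",
)

print(open_relations(example))
```

The output is a list of unfilled relations, not a score. That keeps the exercise close to the purpose stated above: it shows where a reader would have to look outside the paragraph, table, or scorecard to complete the chain, without converting the result into a verdict on the model.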

The public form of AI safety will remain imperfect. Commercial confidentiality, security concerns, competitive pressure, and genuine technical uncertainty will continue to limit disclosure. Yet even under those limits, documentary practice can improve. A system card can make a release more legible without pretending to publish the entire internal case. A preparedness report can disclose residual risk without performing certainty. A model card can direct readers toward the documents that make its claims testable. A federal benchmark can sharpen language without supplying a universal template. Across these forms, the central demand remains the same: evidence must not vanish into the smoothness of public claim.

Appendix: Compact Evidence-to-Claim Matrix

The matrix below condenses the principal evidence-to-claim relations analysed in the article. It does not introduce a separate evidentiary base. It makes visible the public passage from document, to claim, to the remaining pressure that a reader must keep in view when judging how much confidence the safety language can bear.

Matrix columns: Documentary object | Public safety claim movement | Evidence-to-claim relation and remaining pressure

OpenAI GPT-4 System Card

Public safety claim movement: Known risks are presented as sufficiently mitigated for staged deployment and learning from real-world use.

Evidence-to-claim relation and remaining pressure: The card is candid about brittleness and non-comprehensiveness, but public threshold logic and residual-risk decision rules remain lighter than in later release documents.

OpenAI GPT-4o System Card

Public safety claim movement: Preparedness scorecards support a post-mitigation overall risk judgement, with persuasion supplying the highest domain rating.

Evidence-to-claim relation and remaining pressure: The scorecard improves readability and gives a quick public route from evaluation to risk label, while much of the method, threshold calibration, and deployment configuration remains compressed.

OpenAI o1 System Card

Public safety claim movement: Preparedness evaluations support a cautious public account of a reasoning model released under specific model-version and elicitation conditions.

Evidence-to-claim relation and remaining pressure: Lower-bound language, near-final model caveats, and production-variation warnings keep the public claim closer to the evaluation conditions that produced it.

Anthropic Claude 3.7 Sonnet System Card

Public safety claim movement: Visible reasoning traces are treated as both a product feature and a safety surface rather than as transparent access to model cognition.

Evidence-to-claim relation and remaining pressure: The value of displayed thinking is qualified by limits on faithfulness, which prevents reasoning visibility from becoming a stronger assurance claim than the evidence supports.

Anthropic Claude 4 System Card

Public safety claim movement: ASL-3 protections are justified because the relevant CBRN risk cannot be ruled out with sufficient confidence.

Evidence-to-claim relation and remaining pressure: This is the strongest threshold-to-deployment relation in the corpus: unresolved uncertainty alters the safeguard condition rather than remaining a final disclaimer.

Google DeepMind Gemini 3 Pro Model Card and FSF Report

Public safety claim movement: Deployment acceptability is framed through the statement that Gemini 3 Pro did not reach any Critical Capability Levels.

Evidence-to-claim relation and remaining pressure: The claim is strong when the model card, FSF report, and methodology supplement are read together; the model card alone is too compressed to carry the full public safety case.

Meta Muse Spark Safety & Preparedness Report

Public safety claim movement: Pre-mitigation high-risk chemical and biological capability is presented as reduced to acceptable residual risk through defined safeguards.

Evidence-to-claim relation and remaining pressure: The report makes the pre-mitigation to post-mitigation movement unusually visible, but peer-comparison caveats, filtering differences, and post-deployment monitoring limits still discipline the claim.

US/UK AISI o1 report and federal benchmark sources

Public safety claim movement: Evaluation evidence should remain separated from release endorsement, with method, version, and uncertainty made public.

Evidence-to-claim relation and remaining pressure: The federal materials supply an external accountability horizon: documented governance, qualified evaluation reporting, residual-risk reasoning, and post-deployment monitoring should remain connected to public claims.

Notes

[1] Margaret Mitchell and others, ‘Model Cards for Model Reporting’, arXiv:1810.03993 (2018), https://arxiv.org/abs/1810.03993. See also Weixin Liang and others, ‘Systematic Analysis of 32,111 AI Model Cards Characterizes Documentation Practice in AI’, Nature Machine Intelligence, 6 (2024), https://www.nature.com/articles/s42256-024-00857-z.

[2] OpenAI, GPT-4 System Card (March 2023), https://cdn.openai.com/papers/gpt-4-system-card.pdf; OpenAI, GPT-4o System Card (8 August 2024), https://cdn.openai.com/gpt-4o-system-card.pdf; OpenAI, OpenAI o1 System Card (5 December 2024), https://cdn.openai.com/o1-system-card-20241205.pdf; Anthropic, Claude 3.7 Sonnet System Card (February 2025), https://www.anthropic.com/claude-3-7-sonnet-system-card; Anthropic, Claude 4 System Card (May 2025), https://www.anthropic.com/claude-4-system-card; Google DeepMind, Gemini 3 Pro Model Card (May 2026), https://deepmind.google/models/model-cards/gemini-3-pro; Google DeepMind, Frontier Safety Framework Report: Gemini 3 Pro (November 2025), https://deepmind.google/models/fsf-reports/gemini-3-pro/; Meta, Muse Spark Safety & Preparedness Report (26 May 2026), https://ai.meta.com/static-resource/muse-spark-safety-and-preparedness-report/.

[3] OpenAI, Preparedness Framework, Version 2 (15 April 2025), https://cdn.openai.com/pdf/18a02b5d-6b67-4cec-ab64-68cdfbddebcd/preparedness-framework-v2.pdf; Anthropic, Responsible Scaling Policy, Version 3.1 (2026), https://www-cdn.anthropic.com/files/4zrzovbb/website/bf04581e4f329735fd90634f6a1962c13c0bd351.pdf; Meta, Advanced AI Scaling Framework, Version 2 (2026), https://ai.meta.com/static-resource/Meta_Advanced-AI-Scaling-Framework-v2/; Google DeepMind, Model Evaluation - Approach, Methodology & Results: Gemini 3 Pro (2026), https://deepmind.google/models/evals-methodology/gemini-3-pro.

[4] OpenAI, GPT-4 System Card, p. 1.

[5] OpenAI, GPT-4 System Card, pp. 1-3.

[6] OpenAI, GPT-4o System Card, pp. 1-4.

[7] OpenAI, GPT-4o System Card, pp. 5-8.

[8] OpenAI, OpenAI o1 System Card, pp. 1-6, 12-14.

[9] Anthropic, Claude 3.7 Sonnet System Card, pp. 1-4, 26-36.

[10] Anthropic, Claude 4 System Card, pp. 1-8, 72-104.

[11] Anthropic, Claude 4 System Card, pp. 99-104.

[12] Google DeepMind, Gemini 3 Pro Model Card, pp. 1-4.

[13] Google DeepMind, Frontier Safety Framework Report: Gemini 3 Pro, pp. 1-8, 18-24.

[14] Google DeepMind, Model Evaluation - Approach, Methodology & Results: Gemini 3 Pro, pp. 1-3.

[15] Meta, Muse Spark Safety & Preparedness Report, pp. 1-12.

[16] Meta, Muse Spark Safety & Preparedness Report, pp. 5-12, 26-58.

[17] National Institute of Standards and Technology, Artificial Intelligence Risk Management Framework (AI RMF 1.0), NIST AI 100-1 (January 2023), https://nvlpubs.nist.gov/nistpubs/ai/nist.ai.100-1.pdf; National Institute of Standards and Technology, Artificial Intelligence Risk Management Framework: Generative Artificial Intelligence Profile, NIST AI 600-1 (July 2024), https://nvlpubs.nist.gov/nistpubs/ai/NIST.AI.600-1.pdf; National Institute of Standards and Technology, Center for AI Standards and Innovation, Practices for Automated Benchmark Evaluations of Language Models, NIST AI 800-2, Initial Public Draft (January 2026), https://nvlpubs.nist.gov/nistpubs/ai/NIST.AI.800-2.ipd.pdf; National Institute of Standards and Technology, Center for AI Standards and Innovation, Challenges to the Monitoring of Deployed AI Systems, NIST AI 800-4 (2026), https://nvlpubs.nist.gov/nistpubs/ai/NIST.AI.800-4.pdf; US Government Accountability Office, Artificial Intelligence: An Accountability Framework for Federal Agencies and Other Entities, GAO-21-519SP (June 2021), https://www.gao.gov/assets/gao-21-519sp.pdf; Office of Management and Budget, Accelerating Federal Use of AI through Innovation, Governance, and Public Trust, Memorandum M-25-21 (3 April 2025), https://www.whitehouse.gov/wp-content/uploads/2025/02/M-25-21-Accelerating-Federal-Use-of-AI-through-Innovation-Governance-and-Public-Trust.pdf; US AI Safety Institute and UK AI Safety Institute, Joint Pre-Deployment Test: OpenAI o1 (December 2024), https://www.nist.gov/system/files/documents/2024/12/18/US_UK_AI%20Safety%20Institute_%20December_Publication-OpenAIo1.pdf.

[18] Anthropic, Claude 4 System Card, pp. 99-104.

[19] Meta, Muse Spark Safety & Preparedness Report, pp. 5-12, 33-58.

[20] OpenAI, GPT-4 System Card, pp. 1-3.

[21] Google DeepMind, Gemini 3 Pro Model Card, pp. 1-4; Google DeepMind, Frontier Safety Framework Report: Gemini 3 Pro, pp. 1-8, 18-24.

[22] NIST, Practices for Automated Benchmark Evaluations of Language Models, pp. i, 1-4, 24-28.

[23] OpenAI, OpenAI o1 System Card, pp. 1-6, 12-14.

[24] Anthropic, Claude 4 System Card, pp. 99-104.

[25] Meta, Muse Spark Safety & Preparedness Report, pp. 5-12, 33-58.

[26] OpenAI, GPT-4o System Card, pp. 5-8, 22-26.

[27] Anthropic, Claude 3.7 Sonnet System Card, pp. 9-15.

[28] Meta, Muse Spark Safety & Preparedness Report, pp. 5-12, 33-58, 137-52.

[29] OpenAI, GPT-4 System Card, pp. 1-4.

[30] OpenAI, Preparedness Framework, Version 2, pp. 3-14.

[31] Anthropic, Claude 3.7 Sonnet System Card, pp. 26-36.

[32] Anthropic, Responsible Scaling Policy, Version 3.1, pp. 1-12.

[33] Google DeepMind, Gemini 3 Pro Model Card, pp. 1-4.

[34] Google DeepMind, Frontier Safety Framework Report: Gemini 3 Pro, pp. 1-8, 18-24.

[35] Meta, Muse Spark Safety & Preparedness Report, pp. 1-12, 26-58, 137-52.

[36] Meta, Muse Spark Safety & Preparedness Report, pp. 137-52.

[37] OpenAI, GPT-4 System Card, pp. 1-3; OpenAI, GPT-4o System Card, pp. 5-8, 22-26; OpenAI, OpenAI o1 System Card, pp. 1-6, 12-14.

[38] Anthropic, Claude 3.7 Sonnet System Card, pp. 26-36.

[39] Anthropic, Claude 4 System Card, pp. 99-104.

[40] Google DeepMind, Model Evaluation - Approach, Methodology & Results: Gemini 3 Pro, pp. 1-3.

[41] Meta, Muse Spark Safety & Preparedness Report, pp. 137-52.

[42] OpenAI, Preparedness Framework, Version 2, pp. 10-14.

[43] NIST, Artificial Intelligence Risk Management Framework (AI RMF 1.0), pp. 20-30.

[44] NIST, Artificial Intelligence Risk Management Framework: Generative Artificial Intelligence Profile, pp. 8-20, 30-43.

[45] NIST, Practices for Automated Benchmark Evaluations of Language Models, pp. i, 1-4, 24-28.

[46] NIST, Challenges to the Monitoring of Deployed AI Systems, pp. 1-12.

[47] US Government Accountability Office, Artificial Intelligence: An Accountability Framework for Federal Agencies and Other Entities, pp. 1-16, 69-88.

[48] Office of Management and Budget, Accelerating Federal Use of AI through Innovation, Governance, and Public Trust, Memorandum M-25-21 (3 April 2025), pp. 10-18.

[49] US AI Safety Institute and UK AI Safety Institute, Joint Pre-Deployment Test: OpenAI o1, pp. 1-6.

Bibliography

Anthropic. Claude 3.7 Sonnet System Card. February 2025. https://www.anthropic.com/claude-3-7-sonnet-system-card.

Anthropic. Claude 4 System Card. May 2025. https://www.anthropic.com/claude-4-system-card.

Anthropic. Responsible Scaling Policy, Version 3.1. 2026. https://www-cdn.anthropic.com/files/4zrzovbb/website/bf04581e4f329735fd90634f6a1962c13c0bd351.pdf.

Google DeepMind. Frontier Safety Framework Report: Gemini 3 Pro. November 2025. https://deepmind.google/models/fsf-reports/gemini-3-pro/.

Google DeepMind. Model Evaluation - Approach, Methodology & Results: Gemini 3 Pro. 2026. https://deepmind.google/models/evals-methodology/gemini-3-pro.

Google DeepMind. Gemini 3 Pro Model Card. May 2026. https://deepmind.google/models/model-cards/gemini-3-pro.

Liang, Weixin, Nazneen Rajani, Xinyue Yang, and others. ‘Systematic Analysis of 32,111 AI Model Cards Characterizes Documentation Practice in AI’. Nature Machine Intelligence, 6 (2024). https://www.nature.com/articles/s42256-024-00857-z.

Meta. Advanced AI Scaling Framework, Version 2. 2026. https://ai.meta.com/static-resource/Meta_Advanced-AI-Scaling-Framework-v2/.

Meta. Muse Spark Safety & Preparedness Report. 26 May 2026. https://ai.meta.com/static-resource/muse-spark-safety-and-preparedness-report/.

Mitchell, Margaret, Simone Wu, Andrew Zaldivar, Parker Barnes, Lucy Vasserman, Ben Hutchinson, Elena Spitzer, Inioluwa Deborah Raji, and Timnit Gebru. ‘Model Cards for Model Reporting’. arXiv:1810.03993 (2018). https://arxiv.org/abs/1810.03993.

National Institute of Standards and Technology. Artificial Intelligence Risk Management Framework (AI RMF 1.0). NIST AI 100-1. January 2023. https://nvlpubs.nist.gov/nistpubs/ai/nist.ai.100-1.pdf.

National Institute of Standards and Technology. Artificial Intelligence Risk Management Framework: Generative Artificial Intelligence Profile. NIST AI 600-1. July 2024. https://nvlpubs.nist.gov/nistpubs/ai/NIST.AI.600-1.pdf.

National Institute of Standards and Technology, Center for AI Standards and Innovation. Practices for Automated Benchmark Evaluations of Language Models. NIST AI 800-2, Initial Public Draft. January 2026. https://nvlpubs.nist.gov/nistpubs/ai/NIST.AI.800-2.ipd.pdf.

National Institute of Standards and Technology, Center for AI Standards and Innovation. Challenges to the Monitoring of Deployed AI Systems. NIST AI 800-4. 2026. https://nvlpubs.nist.gov/nistpubs/ai/NIST.AI.800-4.pdf.

Office of Management and Budget. Accelerating Federal Use of AI through Innovation, Governance, and Public Trust. Memorandum M-25-21. 3 April 2025. https://www.whitehouse.gov/wp-content/uploads/2025/02/M-25-21-Accelerating-Federal-Use-of-AI-through-Innovation-Governance-and-Public-Trust.pdf.

OpenAI. GPT-4 System Card. March 2023. https://cdn.openai.com/papers/gpt-4-system-card.pdf.

OpenAI. GPT-4o System Card. 8 August 2024. https://cdn.openai.com/gpt-4o-system-card.pdf.

OpenAI. OpenAI o1 System Card. 5 December 2024. https://cdn.openai.com/o1-system-card-20241205.pdf.

OpenAI. Preparedness Framework, Version 2. 15 April 2025. https://cdn.openai.com/pdf/18a02b5d-6b67-4cec-ab64-68cdfbddebcd/preparedness-framework-v2.pdf.

US Government Accountability Office. Artificial Intelligence: An Accountability Framework for Federal Agencies and Other Entities. GAO-21-519SP. June 2021. https://www.gao.gov/assets/gao-21-519sp.pdf.

US AI Safety Institute and UK AI Safety Institute. Joint Pre-Deployment Test: OpenAI o1. December 2024. https://www.nist.gov/system/files/documents/2024/12/18/US_UK_AI%20Safety%20Institute_%20December_Publication-OpenAIo1.pdf.