A Safety-Claim Audit Framework for Frontier AI Documentation

Auditing evidence, thresholds, safeguards, and public accountability in model-release documents

Saman Samadi, PhD (Cantab)

Framework / PDF / templates | 17 June 2026

Focus: Safety-claim review; audit methods; evidence traceability; terminology discipline.

Abstract

Frontier AI developers now publish system cards, model cards, preparedness frameworks, risk reports, safety reports, and frontier-safety policies as part of the public record surrounding increasingly capable models. These documents do not verify model safety by themselves. Their value lies in a narrower and more inspectable task: they make certain safety-relevant claims available to readers outside the organisation that produced the model. This article proposes a practical audit framework for such claims. It asks whether a public safety statement identifies its object, scope, threat model, threshold, evidentiary basis, safeguards, residual risk, governance process, and decision consequence with enough precision for the claim to be assessed rather than merely received. The framework draws on directly verified primary sources from OpenAI, Anthropic, Google DeepMind, Meta, Amazon, NIST, the U.S. Government Accountability Office, METR, and the frontier-AI safety-case literature. Its main outputs are an eight-category rubric, a compact scoring method, a reusable audit template, a glossary, and a worked demonstration on OpenAI’s Preparedness Framework. The article argues that documentation quality should be separated from evidence confidence. A document may be well structured while leaving the underlying evidence inaccessible; equally, a technically serious evaluation may become weak public evidence if its claim is compressed into language that hides uncertainty, redaction, mitigation state, or decision responsibility. The framework therefore audits public documentation, not the underlying model. It provides a method for judging whether a safety claim remains proportionate to the evidence that the public document allows readers to inspect.

Keywords: AI safety documentation; system cards; safety claims; model-release documentation; auditability; assurance; residual risk; preparedness frameworks; responsible scaling; frontier AI governance.

1. Introduction: from safety statement to auditable claim

Frontier AI documentation now performs work that older technical appendices were not designed to carry. A model release may still be accompanied by benchmark tables, user-facing notices, and conventional descriptions of capability, but the safety-relevant public record has become more varied. System cards describe model behaviour and mitigations. Preparedness frameworks define capability thresholds and decision procedures. Responsible-scaling policies connect increasing capability to new safeguards. Risk reports move beyond the single release event by placing deployed and internally used systems within continuing threat models, monitoring practices, and residual-risk judgements. Model cards still matter, though their role changes when the public question is not only what a model is, but what kind of risk argument can be made about its release.

This article proposes a framework for auditing safety claims in that documentary setting. A safety claim is any statement that asserts, implies, or licenses the judgement that a model, system, deployment, or governance process is acceptably safe relative to a specified risk and decision. The claim may be explicit, as when a document states that a model falls below a capability threshold. It may be procedural, as when a framework says that deployment will not proceed unless safeguards are judged sufficient. It may be implicit, as when a table, benchmark, or scorecard invites the reader to infer that risk has been brought under control. In each case, the audit question is not whether the model is actually safe in the world. That would require access to evidence that public documentation rarely provides. The question is whether the public document supports the safety-relevant conclusion it asks the reader to accept.

This distinction matters because public frontier-AI documents sit between internal safety work and external judgement. OpenAI’s Preparedness Framework, for example, defines tracked categories for biological and chemical capabilities, cybersecurity capabilities, and AI self-improvement, links them to threat models and capability thresholds, and states that models reaching High capability thresholds require safeguards before deployment, while Critical capabilities require safeguards during development as well.[1] Anthropic’s Responsible Scaling Policy, in its current version 3.3, describes the RSP as a voluntary framework for managing catastrophic risks, then places Risk Reports, Frontier Safety Roadmaps, external review, publication, redaction, and governance within the company’s public safety architecture.[2] Google DeepMind’s Frontier Safety Framework 3.1 defines Tracked Capability Levels and Critical Capability Levels across misuse, machine-learning R&D, and misalignment risks, and connects those levels to inherent-risk assessment, residual-risk assessment, safety cases, governance review, disclosure, and post-deployment update processes.[3] These are not simply statements of institutional intention. They define the route by which evidence is expected to become decision.

Yet a route is not the same as a completed public proof. The public reader may still be unable to inspect the underlying model evaluations, the internal deliberations of a safety advisory group, the details withheld for security reasons, the precise standard by which a safeguard was judged adequate, or the confidence interval around a risk judgement. That limitation cannot be eliminated by demanding total disclosure. Frontier-AI documentation contains legitimate redactions, and some technical information should not be published in ways that facilitate misuse. The audit problem is more exact. A public document should preserve enough of the chain between threat model, threshold, evidence, mitigation, residual risk, and decision consequence for the reader to see where judgement has occurred and what remains inaccessible.

The framework proposed here therefore treats public safety documentation as a claim environment. It does not ask a system card to become a full independent safety audit. It asks whether each safety-relevant statement has been given enough structure to be assessed. A strong claim names its object and scope; situates itself within a threat model; explains the evaluation or evidence on which it rests; identifies the threshold or decision rule that gives the evidence consequence; connects safeguards to the risk pathway; states residual uncertainty; and identifies the governance process that turned the evidence into a release or non-release judgement. When these elements are missing, the claim may still be made in good faith, but its public accountability is weakened.

2. Source base and direct verification

This framework was prepared from a source base that is deliberately mixed. Company documents supply the immediate documentary material, while government, standards, and assurance sources supply vocabulary for risk management, monitoring, accountability, and audit evidence. The main company sources verified for this version are OpenAI’s Preparedness Framework, Version 2, last updated 15 April 2025; Anthropic’s Responsible Scaling Policy, Version 3.3, effective 26 May 2026; Anthropic’s Risk Report: February 2026, updated on 26 May 2026; Google DeepMind’s Frontier Safety Framework, Version 3.1, published 17 April 2026; OpenAI’s GPT-4o System Card; Meta’s Llama 4 model card; and Amazon’s Frontier Model Safety Framework.[4]

Two corrections to the Deep Research dossier are important. The dossier treated OpenAI’s Preparedness Framework as updated on 18 September 2025, but the directly verified PDF states Version 2 and “Last updated: 15th April, 2025.” It also used Google DeepMind’s earlier Frontier Safety Framework 2.0 as a central anchor, while the directly verified current PDF is Version 3.1, published 17 April 2026.[5] This does not invalidate the dossier. It changes the source hierarchy. For a publishable artifact, the OpenAI framework should be cited as April 2025, and Google DeepMind’s Version 3.1 should replace Version 2.0 wherever the current framework is being discussed.

The government and standards-facing sources are not used as binding standards for private frontier laboratories. They function as diagnostic anchors. NIST’s AI RMF 1.0 makes AI risk management context-sensitive and organisationally embedded, warning that risk measurement can be oversimplified, gamed, misapplied, or weakened when context and affected groups are ignored.[6] NIST’s Generative AI Profile narrows the RMF toward governance, content provenance, pre-deployment testing, and incident disclosure, while also stressing that generative AI risks vary by lifecycle stage, model or system level, deployment context, and ecosystem effects.[7] GAO’s AI Accountability Framework gives the audit orientation: it emphasises documentation of data, requirements, testing methodology, performance results, and monitoring, while noting that implementation should allow third-party assessment and audit.[8] These sources do not tell a frontier lab exactly how to write a system card. They clarify why traceability, documentation, role assignment, monitoring, and evidence handling matter for public accountability.

The framework also draws on safety-case scholarship and frontier policy comparison. The safety-case template for a cyber inability argument defines a safety case as a structured, evidence-based argument that a safety-critical risk is acceptable, and it explicitly connects risk models, proxy tasks, evaluation settings, and evaluation results through a claims-arguments-evidence structure.[9] METR’s comparison of frontier AI safety policies identifies common elements across published policies, including capability thresholds, model-weight security, deployment mitigations, halting conditions, full capability elicitation, evaluation timing, accountability, and policy updating.[10] These sources provide the bridge between a broad risk-management vocabulary and the more specific task of claim-level auditing.

The source base remains version-sensitive. Anthropic’s RSP page records multiple 2026 updates, with Version 3.3 effective on 26 May 2026. Google DeepMind’s framework includes a version list showing Version 3.1 in April 2026, Version 3.0 in September 2025, Version 2.0 in February 2025, and Version 1.0 in May 2024.[11] Any future publication of this artifact should update the source dates and version numbers before release.

3. What counts as a safety claim?

A safety claim in frontier-AI documentation is not merely a sentence containing the word safety. It is a public proposition whose function is to support, imply, or stabilise a judgement about acceptable risk. The proposition may concern a model, a deployment surface, a set of safeguards, a risk classification, a governance procedure, or a disclosure practice. Its public force depends on what it asks the reader to infer.

The most basic test has six parts. A safety claim becomes audit-ready when the document allows the reader to identify the subject of the claim, the scope in which the claim holds, the threat model it addresses, the threshold or standard against which evidence is interpreted, the evidence used to support it, and the decision consequence attached to it. The decision consequence may be deployment, delayed deployment, further evaluation, stronger safeguards, external review, restricted access, model-weight protection, monitoring, or revision of a framework. Without such a consequence, a threshold becomes descriptive rather than operational.

Several claim types recur across current frontier documentation. The framework distinguishes them as follows.

Scope claim. The claim asks the reader to accept that a document applies to a defined model, system, deployment, or policy context. The audit must locate the version, deployment surface, exclusions, and update status.

Threat-model claim. The claim asserts that a harm pathway is relevant to the model or deployment. The audit must identify the actor, mechanism, harm scale, and route from capability to outcome.

Capability claim. The claim states that a model can or cannot perform a risk-relevant task. The audit must examine the evaluation method, elicitation conditions, comparator, and uncertainty.

Threshold claim. The claim says that a capability level has or has not been reached. The audit must ask how the threshold is defined, how it is measured, and what decision follows from crossing it.

Safeguard claim. The claim states that controls reduce risk to an acceptable level. The audit must map controls to risks, assess efficacy evidence, and look for limitations.

Residual-risk claim. The claim concerns what remains after mitigation. The audit must identify post-mitigation evidence, assumptions, uncertainty, and the responsible decision body.

Governance claim. The claim asserts that a process or body has reviewed, authorised, challenged, or overseen the safety judgement. The audit must locate roles, escalation paths, decision rights, dissent channels, and external review where relevant.

Transparency claim. The claim says that enough information will be disclosed for public or expert scrutiny. The audit must distinguish what is public, what is redacted, why it is redacted, and what remains assessable.

Monitoring claim. The claim promises continuing review after deployment. The audit must locate signals, incident pathways, update triggers, and responsibility for revision.

These claim types are not exclusive. A single paragraph can contain several at once. OpenAI’s public-disclosure section in the Preparedness Framework is a transparency claim, a governance claim, and a conditional safeguard-disclosure claim. Anthropic’s Risk Report structure contains threat-model claims, capability claims, mitigation claims, and overall residual-risk claims. Google DeepMind’s risk acceptance process contains threshold claims, mitigation claims, residual-risk claims, and governance claims. The audit method begins by separating those functions before judging them.

4. The eight audit categories

The framework uses eight categories. Each category receives a documentation-quality score from 0 to 4. The score does not measure the actual safety of the model. It measures how well the public document supports the claim under review.

The eight categories, with their weights, are as follows.

1. Claim object and scope, weight 10. The audit asks what is being assessed, and under which deployment conditions. Stronger documentation identifies the model, version, system surface, release condition, exclusions, and update status. Weaker documentation refers generally to “our models” or “the system” without enough scope.

2. Threat model and harm pathway, weight 12. The audit asks what harm could occur, by whom, and through what mechanism. Stronger documentation names the actor, capability, pathway, harm scale, and uncertainty. Weaker documentation relies on broad misuse language without a pathway.

3. Capability evidence and elicitation, weight 14. The audit asks how the relevant capability was tested and whether it may have been under-elicited. Stronger documentation states evaluation settings, scaffolding, comparators, limitations, and external input. Weaker documentation gives benchmark-only reporting or vague red-team references.

4. Threshold and decision rule, weight 12. The audit asks what happens if the threshold is reached. Stronger documentation links the threshold to deployment, development, security, or safeguard action. Weaker documentation uses threshold labels without consequence.

5. Safeguards and residual risk, weight 18. The audit asks which controls address which risk, how adequacy was judged, and what remains after those controls. Stronger documentation maps risks to controls, gives efficacy evidence, states limitations, and names residual risk. Weaker documentation lists mitigations as features without showing how they change the claim.

6. Governance and challenge, weight 12. The audit asks who judged the evidence and who can challenge the conclusion. Stronger documentation names bodies, review routes, escalation processes, external review, or third-party testing. Weaker documentation refers to leadership or internal review without process or access detail.

7. Transparency, redaction, and uncertainty, weight 10. The audit asks what is public, what is withheld, and what the document does not know. Stronger documentation states the disclosure boundary, redaction rationale, limitations, and uncertainty. Weaker documentation turns uncertainty into reassurance.

8. Monitoring, incidents, and revision, weight 12. The audit asks how the claim will be updated after release or after new evidence appears. Stronger documentation gives monitoring signals, incident channels, update triggers, and periodic review. Weaker documentation leaves no route for correction after deployment.

The weights are intentionally approximate. They prevent the framework from pretending that the whole audit can be reduced to a mathematically exact score. Safeguards and residual risk receive the highest weight because many public safety claims collapse at that point. A document may identify a threat model and threshold, yet still fail to show how a mitigation changes the risk after deployment. Capability evidence also receives a high weight because frontier evaluations can understate or misstate capability when elicitation, scaffolding, tool access, red teaming, or baseline comparison remain unclear. The remaining categories preserve the public accountability conditions around the claim.

The scoring scale is deliberately simple. A score of 0 means that the category is absent, contradicted, or purely rhetorical. A score of 1 means that it is asserted but not tied to evidence or decision consequence. A score of 2 means that it is partly specified, with some evidence but important gaps. A score of 3 means that it is substantial, decision-relevant, and transparent about limits. A score of 4 means that it is audit-ready: explicit, evidenced, challengeable, and updateable.

The weighted score is calculated by multiplying each category weight by the rating divided by four, then summing the results. Because the eight weights total 100, the result falls on a 0 to 100 scale. Documentation quality and evidential confidence should not be merged: a public document can be careful in structure while relying on redacted or internal evidence. The score should therefore be reported alongside one of three confidence labels, stated separately from it.

High confidence means that methods, thresholds, and update logic are public, with meaningful external challenge or replication. Medium confidence means that methods are public enough to inspect, while key evidence remains internal or only partly externally tested. Low confidence means that the claim depends mainly on redacted, future-tense, or unspecified evidence.

Three special cases need careful handling. “Not relevant” removes a category from the denominator only when the document’s stated scope truly excludes it. “Not reported” receives little or no credit, because the public document cannot be credited for work that may exist internally but is not visible. “Redacted” can receive partial credit when the document explains what is withheld, why it is withheld, and which public structure remains available for assessing the claim.
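The scoring rule and the "not relevant" case can be sketched in a few lines of Python. This is an illustrative sketch rather than part of the published templates: the category keys and the function name are hypothetical, and rescaling the remaining categories back to a 0 to 100 range is one assumed reading of "removes a category from the denominator". "Not reported" and "redacted" are left to the auditor's rating (typically 0 or 1, or partial credit) and are not automated here.

```python
from typing import Dict, Optional

# Category weights as listed in the rubric; they sum to 100.
# The key names are hypothetical shorthand for the eight categories.
WEIGHTS: Dict[str, int] = {
    "claim_object_and_scope": 10,
    "threat_model_and_harm_pathway": 12,
    "capability_evidence_and_elicitation": 14,
    "threshold_and_decision_rule": 12,
    "safeguards_and_residual_risk": 18,
    "governance_and_challenge": 12,
    "transparency_redaction_uncertainty": 10,
    "monitoring_incidents_revision": 12,
}


def documentation_score(ratings: Dict[str, Optional[int]]) -> float:
    """Return the weighted documentation-quality score on a 0 to 100 scale.

    `ratings` maps each category to an integer from 0 to 4, or to None
    when the document's stated scope truly excludes the category.
    """
    earned = 0.0
    available = 0
    for category, weight in WEIGHTS.items():
        rating = ratings[category]
        if rating is None:
            # "Not relevant": the category leaves the denominator.
            continue
        if not 0 <= rating <= 4:
            raise ValueError(f"rating for {category} must be between 0 and 4")
        earned += weight * rating / 4
        available += weight
    if available == 0:
        raise ValueError("no relevant categories to score")
    # Assumption: rescale so the remaining categories still span 0 to 100.
    return 100 * earned / available
```

With every category rated 2 the function returns 50.0, and it still returns 50.0 if one category is marked None, because that category's weight is removed from both the earned and the available totals.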

This method also prevents a common error in public AI writing: treating disclosure as equivalent to assurance. A detailed document can make uncertainty easier to inspect, but detail alone does not establish that the underlying risk has been reduced to an acceptable level. The audit asks whether the public claim remains attached to a traceable argument.

5. Compact and expanded audit templates

A practical audit needs to be usable on a paragraph, a table, a scorecard, a short section, or a whole document. The compact template is for quick claim review. The expanded template is for a fuller portfolio demonstration.

Compact template

The compact template records the document and version, organisation, passage or section, claim under review, claim type, object and scope, evidence cited, evaluation method, threshold or decision rule, safeguards and residual risk, governance process, transparency and redaction, monitoring or update route, documentation score, evidence confidence, and a one-sentence verdict.
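For readers who keep audit notes in structured form, the compact template can be represented as a simple record. The sketch below is one possible Python representation, not the template itself: the class name and field names are hypothetical, and the example values are short paraphrases of the worked demonstration in Section 6.

```python
from dataclasses import dataclass, fields

CONFIDENCE_LABELS = ("High", "Medium", "Low")


@dataclass
class CompactAudit:
    """One record per audited claim; field names are illustrative."""
    document_and_version: str
    organisation: str
    passage_or_section: str
    claim_under_review: str
    claim_type: str
    object_and_scope: str
    evidence_cited: str
    evaluation_method: str
    threshold_or_decision_rule: str
    safeguards_and_residual_risk: str
    governance_process: str
    transparency_and_redaction: str
    monitoring_or_update_route: str
    documentation_score: float  # weighted score, 0 to 100
    evidence_confidence: str    # one of CONFIDENCE_LABELS
    verdict: str                # one sentence

    def __post_init__(self) -> None:
        if not 0 <= self.documentation_score <= 100:
            raise ValueError("documentation_score must be between 0 and 100")
        if self.evidence_confidence not in CONFIDENCE_LABELS:
            raise ValueError("evidence_confidence must be High, Medium, or Low")


# Example values paraphrase the worked demonstration in Section 6.
example = CompactAudit(
    document_and_version="Preparedness Framework, Version 2",
    organisation="OpenAI",
    passage_or_section="Section 5.2, Transparency and external participation",
    claim_under_review="Results will be disclosed for major deployments",
    claim_type="Transparency; governance; conditional safeguard disclosure",
    object_and_scope="Major deployments and Tracked Categories",
    evidence_cited="None in the passage itself; disclosure is promised",
    evaluation_method="Not provided in the passage",
    threshold_or_decision_rule="High and Critical thresholds carry safeguard consequences",
    safeguards_and_residual_risk="Disclosure promised beyond High; adequacy test not stated",
    governance_process="SAG review, third-party evaluation, independent expert input",
    transparency_and_redaction="Disclosure items named; redaction standard open-ended",
    monitoring_or_update_route="No incident or update pathway in the passage",
    documentation_score=61.0,
    evidence_confidence="Medium",
    verdict="Strong as a transparency commitment, moderate as a public-auditability claim.",
)

print(len(fields(CompactAudit)))  # prints 16, one field per template item
```

Keeping the score and the confidence label as separate fields mirrors the framework's insistence that documentation quality and evidence confidence are reported side by side rather than merged.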

Expanded template

The expanded template adds the exact source identity, the claim text or paraphrase, the threat model, elicitation conditions, comparator or baseline, mitigation state, residual-risk judgement, governance body, external challenge, redaction and limits, monitoring signal, score and confidence, and a recommended revision. It is intended for portfolio demonstrations and source-checking notes rather than for every quick reading exercise.

The audit sequence is straightforward. Identify the claim. Fix its object and scope. Map it to a threat model. Locate the threshold or decision rule. Check evidence, elicitation, comparator, and mitigation state. Then examine safeguards, residual risk, governance, disclosure, and revision. Only after those steps should a score be assigned.

6. Worked demonstration: OpenAI’s public-disclosure commitment

The demonstration passage comes from OpenAI’s Preparedness Framework, Section 5.2, “Transparency and external participation.” The relevant passage states that OpenAI will release information about Preparedness Framework results for major deployments, including the scope of testing, capability evaluations for each Tracked Category, reasoning for the deployment decision, and decisive contextual information; if a model is beyond a High threshold, the disclosure will also include information about implemented safeguards, though results and safeguards may be redacted or summarised when necessary to protect intellectual property or safety.[12]

This is a good demonstration passage because it is not a direct claim that a given model is safe. It is a public-auditability claim. It says what kind of material OpenAI will make available, under which broad conditions, and with which redaction boundary. The claim sits inside a stronger framework that defines Tracked Categories, threat models, High and Critical capability thresholds, Safeguards Reports, SAG review, third-party evaluation possibilities, and independent expert input.[13] Yet the passage itself remains conditional and future-facing. It promises a disclosure practice rather than giving the underlying evidence for a particular release.

The audit gives the passage the following category scores. Claim object and scope receives 3/4, since the claim applies to major deployments and Tracked Categories, while the determination of what qualifies as major remains partly internal. Threat model and harm pathway receives 2/4, because threat models exist elsewhere in the framework but the audited passage itself concerns disclosure rather than a specific harm pathway. Capability evidence and elicitation receives 2/4, since the passage promises disclosure of capability evaluations but does not itself provide methods or elicitation detail. Threshold and decision rule receives 3/4, because the passage is embedded in a framework where High and Critical thresholds carry safeguard consequences. Safeguards and residual risk receives 2/4, since safeguard disclosure is promised for models beyond a High threshold but the passage does not state how adequacy will be publicly assessed. Governance and challenge receives 3/4, because the surrounding section includes SAG review, third-party evaluation, stress testing, and independent expert input. Transparency, redaction, and uncertainty receives 3/4, because the passage names what will be disclosed and acknowledges redaction, though the redaction standard remains open-ended. Monitoring, incidents, and revision receives 2/4, because the broader framework describes review and improvement, while the specific disclosure passage does not provide an incident or update pathway.

The indicative documentation score is 61/100. The evidence-confidence label is Medium. The audit verdict is that the passage is strong as a policy-level transparency commitment and moderate as a public-auditability claim.
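The arithmetic behind the indicative score can be checked directly. The short Python sketch below applies the scoring rule from Section 4 to the weights and ratings given above; the variable names are illustrative only.

```python
# (category, weight, rating out of 4), taken from the rubric and the audit above.
category_scores = [
    ("Claim object and scope", 10, 3),
    ("Threat model and harm pathway", 12, 2),
    ("Capability evidence and elicitation", 14, 2),
    ("Threshold and decision rule", 12, 3),
    ("Safeguards and residual risk", 18, 2),
    ("Governance and challenge", 12, 3),
    ("Transparency, redaction, and uncertainty", 10, 3),
    ("Monitoring, incidents, and revision", 12, 2),
]

# Each category contributes weight * rating / 4 points.
contributions = {name: weight * rating / 4 for name, weight, rating in category_scores}
total = sum(contributions.values())

print(total)  # prints 61.0
```

The individual contributions are 7.5, 6, 7, 9, 9, 9, 7.5, and 6 points, so no single category dominates the result; the score sits in the middle of the range because half of the categories were rated 2 and half were rated 3.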

The passage supports the claim that OpenAI has a public-disclosure architecture for major deployments under the Preparedness Framework. It does not support a claim that any specific model’s safeguards have been independently verified, nor that the public will receive enough information to reproduce the internal assessment. The most important audit distinction is between disclosure commitment and realised evidence. The first belongs to governance architecture. The second belongs to the evidence base of a particular release.

A revision that would strengthen the passage without requiring unsafe disclosure might specify how redacted disclosures will preserve auditability. For example, OpenAI could state that when details are summarised or withheld, the public disclosure will still identify the claim type, Tracked Category, threshold status, evaluation family, safeguard class, decision body, residual-risk judgement, and reason for redaction. Such a revision would not require publishing sensitive test items or exploit-relevant details, but it would make the public structure of the claim more stable.

The same framework can be applied to Anthropic’s Risk Report. That document is more audit-ready as a realised report because it gives a recurring structure for each threat model: relevant AI model, current state of model capabilities and behaviours, risk mitigations, overall assessment of risk, forward-looking monitoring, and connection to industry-wide recommendations.[14] It still contains redaction and self-assessment limits, but the document’s structure gives the reader more of the risk argument. Amazon’s Frontier Model Safety Framework offers a useful contrast. It states a clear non-deployment commitment for models exceeding specified risk thresholds without appropriate safeguards, and it lists critical capability evaluations and safety mitigations, yet the public document is much shorter on governance challenge, external review, and demonstrated safeguard adequacy.[15]

7. What stronger and weaker documentation look like

The audit framework is not meant to produce reputational rankings of companies. It distinguishes kinds of public support. A document can be excellent for one purpose and weak for another. Meta’s Llama 4 model card, for example, gives model information, training data, intended use, benchmarks, safeguards, system protections, and red teaming information; it is a substantial model card. It is less suitable as a catastrophic-risk release audit because it does not build its safety discussion around severe-risk thresholds, residual-risk determinations, or a governance process for deployment under frontier-risk conditions.[16]

The relative audit-readiness of the main examples can be stated without turning the comparison into a company ranking. Anthropic’s Risk Report: February 2026 is highly audit-ready for this purpose because it is a realised report with threat-model sections, mitigations, overall risk assessments, and forward-looking monitoring. Google DeepMind’s Frontier Safety Framework 3.1 is also highly audit-ready because it links capability levels, residual-risk assessment, safety cases, governance review, and updates. OpenAI’s Preparedness Framework offers strong policy architecture around tracked categories, thresholds, Capabilities Reports, Safeguards Reports, SAG review, and disclosure. OpenAI’s GPT-4o System Card is more mixed for this specific purpose: it is strong on red teaming, modality-specific evaluation, preparedness scores, and third-party assessment, while offering less of a full public safeguard-sufficiency case. Amazon’s Frontier Model Safety Framework gives a clear threshold-and-safeguard commitment, but the public evidence around external challenge and adequacy review is less developed. Meta’s Llama 4 model card is strong as model documentation, while being less structured as frontier-risk release governance.

The central test is whether the reader can reconstruct the route from evidence to decision. Google DeepMind’s framework is particularly useful here because it makes residual-risk assessment and safety cases part of its process for models reaching TCLs or CCLs, and states that external deployments or high-risk internal deployments occur only after a governance function determines that residual risk is acceptable.[17] Anthropic’s report is useful because its table of contents alone shows a document built around threat models, relevant models, capabilities and behaviours, risk mitigations, overall risk assessments, and looking-forward sections. OpenAI’s framework is useful because it defines Capabilities Reports and Safeguards Reports as distinct documentary steps, then places the SAG between those reports and deployment recommendations.

Weaker documentation may still contain useful information. Its weakness appears when the claim moves faster than its support. A model card that describes safety fine-tuning and red teaming may not show whether a severe-risk threshold was assessed. A framework that states a non-deployment principle may not explain how safeguard adequacy will be judged. A system card may report third-party testing without making clear whether that testing was advisory, independent, comprehensive, or limited to one threat model. The audit does not ask these documents to disclose everything. It asks them to mark the public limits of the evidence they make available.

8. Glossary for the audit framework

Safety claim. A public proposition that supports or implies a judgement of acceptable safety relative to a model, deployment, risk, or governance decision. The term should not be used for every reassuring sentence.

Safety case. A structured, evidence-based argument that risk is acceptable for a defined system and context. Current frontier-AI safety-case work often uses claims, arguments, evidence, assumptions, and defeaters to make risk reasoning explicit.

Threat model. A structured account of how a harm could occur, including the actor, mechanism, capability, target, and severity. A list of risk categories is not yet a threat model.

Capability threshold. A capability level that changes the required safety response. A threshold becomes useful when it is linked to a decision, such as stronger safeguards, delayed deployment, external review, or development restrictions.

Tracked Capability Level / Critical Capability Level. Google DeepMind’s framework uses TCLs for significant risks and CCLs for severe risks. These terms should not be treated as interchangeable with OpenAI’s High and Critical thresholds.

High capability. In OpenAI’s Preparedness Framework, High thresholds identify capabilities that significantly increase existing severe-risk vectors and require robust safeguards before deployment.

Critical capability. In OpenAI’s Preparedness Framework, Critical thresholds indicate qualitatively new threat vectors with no ready precedent and require safeguards during development as well as before deployment.

Capability elicitation. The process of trying to reveal what a model can do under strong prompting, scaffolding, tool use, repeated attempts, reduced refusals, or other conditions that approximate a capable user or adversary.

Comparator. The baseline against which a result becomes meaningful. It may be a prior model, a human actor, a threat-actor tier, a benchmark threshold, or a pre-mitigation system.

Safeguard. A control intended to reduce risk. It may include refusal behaviour, monitoring, access control, classifier systems, sandboxing, security controls, human oversight, or deployment restriction.

Residual risk. Risk remaining after mitigation. A document that names safeguards without stating the residual risk leaves the reader unable to judge whether mitigation changed the decision.

Redaction. Withholding sensitive information for safety, security, intellectual property, or misuse-prevention reasons. Redaction weakens auditability when it hides the structure of the claim rather than only dangerous details.

External review. Review by an actor outside the developer’s immediate production process. It is stronger when scope, access, independence, and response to review are specified.

Third-party evaluation. Testing by an external organisation. It should be credited only for the risk domain and method actually tested.

Auditability. The degree to which an external reader can inspect whether a claim follows from the evidence made public. Transparency is a condition of auditability, not a substitute for it.

Decision consequence. The practical effect of a claim. A threshold or evaluation result becomes governance-relevant when it changes deployment, development, access, security, monitoring, or review.

Update trigger. A condition that requires a document, framework, risk report, or safety case to be revised. It may be periodic, incident-based, evidence-based, or tied to a new model capability.

9. Portfolio value and use in applications

This framework is strongest as a practical portfolio artifact rather than as another interpretive essay. It shows that the preceding work on system cards, evaluation-to-claim movement, risk frameworks, and comparative documentation has become operational. The prior articles established the field. This piece supplies a method.

For external-artifacts roles, the framework demonstrates claim-level editorial judgement. It shows that a system-card or release-document writer can ask whether a sentence is proportionate to the evidence behind it. For responsible-AI governance roles, it demonstrates framework literacy and a capacity to connect voluntary policy, public disclosure, internal review, and residual risk. For evaluation-communication roles, it demonstrates the ability to interpret benchmark and red-team material without letting a result become a public claim too quickly. For research-communications roles close to safety, it shows that clarity is not simplification alone. It is the arrangement of evidence so that the public reader can see what a claim is permitted to bear.

The one-sentence portfolio description could read:

A practical framework for auditing whether public AI safety claims in system cards, model cards, risk reports, and preparedness frameworks remain proportionate to the evidence, uncertainty, safeguards, residual risk, and governance process that support them.

This is also the sentence that can later enter the role-specific CV. It avoids claiming machine-learning engineering expertise. It makes visible a narrower and more credible competence: source-grounded documentation analysis, claim discipline, governance vocabulary, risk-framework reading, and safety-reporting judgement.

10. Risks and limits of the framework

The framework has limits that should remain visible. It audits public documentation, not underlying model safety. A document can score well because it makes its reasoning clear while still describing a risky model. Conversely, a document can score modestly because much of the evidence is withheld, even if the internal work is serious. The score is a tool for public-document assessment, not a certificate.

Cross-company comparison also requires restraint. OpenAI’s High and Critical thresholds, Google DeepMind’s TCLs and CCLs, Anthropic’s RSP thresholds and AI Safety Level history, Amazon’s Critical Capability Thresholds, and Meta’s model-card categories do not share a single scale. The audit should not flatten them into one taxonomy. It should ask what each term does inside its own document.

Redaction creates another boundary. Some redaction is legitimate. A public claim does not become weak merely because dangerous procedural details are withheld. It becomes weak when redaction removes the reader’s ability to identify the claim type, evidence class, threat model, threshold, decision body, or residual-risk status. The audit should credit responsible withholding only when the public argument remains assessable.
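The redaction test above can be sketched as a small check. This is an illustrative sketch only: the function name, the set-based input, and the idea of recording the auditor's reading as a set of labels are assumptions made for the example, not part of the framework's published templates. The six element names are taken from the paragraph above.

```python
from __future__ import annotations

# The six elements the text says must stay identifiable after redaction.
ASSESSABILITY_ELEMENTS = (
    "claim type",
    "evidence class",
    "threat model",
    "threshold",
    "decision body",
    "residual-risk status",
)


def redaction_assessment(identifiable: set[str]) -> tuple[bool, list[str]]:
    """Hypothetical helper: report whether the public argument stays assessable.

    `identifiable` holds the elements the auditor could still identify in the
    redacted document. Returns (assessable, hidden_elements).
    """
    hidden = [e for e in ASSESSABILITY_ELEMENTS if e not in identifiable]
    return (not hidden, hidden)


# Example: procedural details are withheld, but only the decision body
# can no longer be identified from the public text.
assessable, hidden = redaction_assessment(
    {"claim type", "evidence class", "threat model", "threshold", "residual-risk status"}
)
```

The check records an auditor's judgement; it does not replace it. Deciding whether an element is still identifiable remains a reading task.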

The framework can also be gamed if turned into a checklist. A laboratory could include the names of all eight categories without producing stronger evidence. The scoring method therefore depends on relations rather than the mere presence of words. A threat model must connect to an evaluation. A threshold must connect to a decision rule. A safeguard must connect to residual risk. A governance body must connect to actual review authority. Documentation earns its public value through these relations.
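One way to keep the audit relational rather than word-based is to record connections explicitly. The sketch below is a minimal illustration under stated assumptions: the function name and the two input sets are hypothetical, and whether a connection exists is entered by the auditor after reading the document, not detected automatically. The four relations are those named in the paragraph above.

```python
from __future__ import annotations

# Relations named in the text: each first element earns credit only
# when the document connects it to the second.
RELATIONS = (
    ("threat model", "evaluation"),
    ("threshold", "decision rule"),
    ("safeguard", "residual risk"),
    ("governance body", "review authority"),
)


def missing_relations(
    named: set[str], connected: set[tuple[str, str]]
) -> list[str]:
    """Hypothetical helper: list elements that are named but left unconnected.

    `named` holds elements the document mentions; `connected` holds the
    (source, target) pairs the auditor judged to be actually linked.
    """
    gaps = []
    for source, target in RELATIONS:
        if source in named and (source, target) not in connected:
            gaps.append(f"{source} is named but not connected to {target}")
    return gaps


# Example: three elements appear in the document, but only one relation holds.
gaps = missing_relations(
    named={"threat model", "threshold", "safeguard"},
    connected={("threat model", "evaluation")},
)
```

In this example `gaps` reports the threshold and the safeguard as named without their required connection, which is the pattern a checklist-style document would produce.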

11. Conclusion: public documentation as a site of accountable judgement

Frontier-AI safety documentation has become one of the principal public surfaces through which model-release decisions are made intelligible. It cannot bear the whole burden of safety. It cannot replace independent audit, regulatory access, internal research, third-party evaluation, or post-deployment monitoring. Yet it increasingly determines what outside readers can question. A system card or risk report gives public shape to evidence, mitigation, uncertainty, and institutional decision.

The safety-claim audit framework proposed here gives that public shape a method of review. It asks a reader to slow down at the point where fluent language becomes assurance. What is the object of the claim? Which threat model gives it meaning? What threshold turns evaluation into consequence? What evidence was used, and under which elicitation conditions? Which safeguards were applied, and what risk remains? Who judged the evidence, who could challenge it, and what will happen when the evidence changes? These questions do not make public documentation sufficient. They make it more difficult for a safety claim to drift away from the evidence that should hold it in place.

In the strongest frontier-AI documents, evidence and judgement remain visibly connected. In weaker documents, the reader receives a conclusion without enough of the route. The difference is not merely stylistic. It is the difference between public language that can be tested and public language that asks to be trusted.

Appendix: compact audit worksheet

For practical use, the compact worksheet should record: document and version; organisation; passage or section; claim under review; claim type; object and scope; threat model; evidence and evaluation method; comparator or baseline; threshold or decision rule; mitigation state; safeguards and residual risk; governance body; external review or third-party testing; transparency and redaction; monitoring or update trigger; documentation score; evidence confidence; main concern; recommended revision; and final judgement.
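For readers who keep audits in structured form, the worksheet fields listed above could be held in a simple record. The following is a sketch under assumptions: the class name, the snake_case field names, and the helper are hypothetical, and the two score fields are stored as free text because this section does not fix a numeric scale.

```python
from __future__ import annotations

from dataclasses import dataclass, fields


@dataclass
class CompactAuditWorksheet:
    """Hypothetical record mirroring the worksheet fields listed in the text."""

    document_and_version: str = ""
    organisation: str = ""
    passage_or_section: str = ""
    claim_under_review: str = ""
    claim_type: str = ""
    object_and_scope: str = ""
    threat_model: str = ""
    evidence_and_evaluation_method: str = ""
    comparator_or_baseline: str = ""
    threshold_or_decision_rule: str = ""
    mitigation_state: str = ""
    safeguards_and_residual_risk: str = ""
    governance_body: str = ""
    external_review_or_third_party_testing: str = ""
    transparency_and_redaction: str = ""
    monitoring_or_update_trigger: str = ""
    documentation_score: str = ""   # kept separate from evidence confidence
    evidence_confidence: str = ""
    main_concern: str = ""
    recommended_revision: str = ""
    final_judgement: str = ""


def blank_fields(sheet: CompactAuditWorksheet) -> list[str]:
    """Return the names of fields the auditor has not yet filled in."""
    return [f.name for f in fields(sheet) if not getattr(sheet, f.name).strip()]


# Example: a worksheet started for the document used in the worked demonstration.
sheet = CompactAuditWorksheet(
    document_and_version="OpenAI, Preparedness Framework, Version 2",
    organisation="OpenAI",
)
remaining = blank_fields(sheet)
```

Keeping the documentation score and the evidence confidence as separate fields reflects the article's distinction between how well a document is structured and how much of its evidence the reader can inspect.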

Notes

[1] OpenAI, Preparedness Framework, Version 2 (last updated 15 April 2025), pp. 1, 4-5, https://cdn.openai.com/pdf/18a02b5d-6b67-4cec-ab64-68cdfbddebcd/preparedness-framework-v2.pdf.

[2] Anthropic, Anthropic’s Responsible Scaling Policy, Version 3.3 (effective 26 May 2026), pp. 1-4, 10-11, https://cdn.sanity.io/files/4zrzovbb/website/c11e84981d0a7281a1b229f3fa6af0da66eaf43f.pdf.

[3] Google DeepMind, Frontier Safety Framework, Version 3.1 (17 April 2026), pp. 1, 4-6, 13, 16-17, https://storage.googleapis.com/deepmind-media/DeepMind.com/Blog/strengthening-our-frontier-safety-framework/frontier-safety-framework_3-1.pdf.

[4] OpenAI, Preparedness Framework; Anthropic, Responsible Scaling Policy; Anthropic, Risk Report: February 2026 (updated 26 May 2026), https://anthropic.com/feb-2026-risk-report; Google DeepMind, Frontier Safety Framework; OpenAI, GPT-4o System Card (8 August 2024), https://cdn.openai.com/gpt-4o-system-card.pdf; Meta, Llama 4 Model Card (released 5 April 2025), https://github.com/meta-llama/llama-models/blob/main/models/llama4/MODEL_CARD.md; Amazon, Amazon’s Frontier Model Safety Framework (2025), https://cdn.amazon.science/a7/7c/8bdade5c4eda9168f3dee6434fff/pc-amazon-frontier-model-safety-framework-2-7-final-2-9.pdf.

[5] OpenAI, Preparedness Framework, p. 1; Google DeepMind, Frontier Safety Framework, pp. 1, 17.

[6] National Institute of Standards and Technology, Artificial Intelligence Risk Management Framework (AI RMF 1.0), NIST AI 100-1 (January 2023), pp. 1, 6, 22-23, https://doi.org/10.6028/NIST.AI.100-1; https://nvlpubs.nist.gov/nistpubs/ai/NIST.AI.100-1.pdf.

[7] National Institute of Standards and Technology, Artificial Intelligence Risk Management Framework: Generative Artificial Intelligence Profile, NIST AI 600-1 (July 2024), pp. 2-3, 11-12, https://doi.org/10.6028/NIST.AI.600-1; https://nvlpubs.nist.gov/nistpubs/ai/NIST.AI.600-1.pdf.

[8] U.S. Government Accountability Office, Artificial Intelligence: An Accountability Framework for Federal Agencies and Other Entities, GAO-21-519SP (June 2021), pp. 7-10, 24, https://www.gao.gov/assets/gao-21-519sp.pdf.

[9] Arthur Goemans and others, ‘Safety Case Template for Frontier AI: A Cyber Inability Argument’ (2024), arXiv:2411.08088, https://arxiv.org/abs/2411.08088.

[10] METR, ‘Common Elements of Frontier AI Safety Policies’ (16 December 2025), https://metr.org/common-elements.

[11] Anthropic, ‘Responsible Scaling Policy Updates’, Anthropic (last updated 26 May 2026), https://www.anthropic.com/responsible-scaling-policy; Google DeepMind, Frontier Safety Framework, p. 17.

[12] OpenAI, Preparedness Framework, pp. 12-13.

[13] OpenAI, Preparedness Framework, pp. 8-13.

[14] Anthropic, Risk Report: February 2026, pp. 6-8.

[15] Amazon, Amazon’s Frontier Model Safety Framework, pp. 1-3, 5-6.

[16] Meta, Llama 4 Model Card, sections ‘Model Information’, ‘Intended Use’, ‘Safeguards’, and ‘Evaluations’.

[17] Google DeepMind, Frontier Safety Framework, pp. 6, 13.

Bibliography

Amazon. Amazon’s Frontier Model Safety Framework. 2025. https://cdn.amazon.science/a7/7c/8bdade5c4eda9168f3dee6434fff/pc-amazon-frontier-model-safety-framework-2-7-final-2-9.pdf.

Anthropic. Anthropic’s Responsible Scaling Policy. Version 3.3. Effective 26 May 2026. https://cdn.sanity.io/files/4zrzovbb/website/c11e84981d0a7281a1b229f3fa6af0da66eaf43f.pdf.

Anthropic. Risk Report: February 2026. Updated 26 May 2026. https://anthropic.com/feb-2026-risk-report.

Goemans, Arthur, Marie Davidsen Buhl, Jonas Schuett, Tomek Korbak, Jessica Wang, Benjamin Hilton, and Geoffrey Irving. ‘Safety Case Template for Frontier AI: A Cyber Inability Argument’. arXiv, 12 November 2024. https://arxiv.org/abs/2411.08088.

Google DeepMind. Frontier Safety Framework. Version 3.1. Published 17 April 2026. https://storage.googleapis.com/deepmind-media/DeepMind.com/Blog/strengthening-our-frontier-safety-framework/frontier-safety-framework_3-1.pdf.

Meta. Llama 4 Model Card. GitHub, released 5 April 2025. https://github.com/meta-llama/llama-models/blob/main/models/llama4/MODEL_CARD.md.

METR. ‘Common Elements of Frontier AI Safety Policies’. 16 December 2025. https://metr.org/common-elements.

National Institute of Standards and Technology. Artificial Intelligence Risk Management Framework (AI RMF 1.0). NIST AI 100-1. January 2023. https://doi.org/10.6028/NIST.AI.100-1; https://nvlpubs.nist.gov/nistpubs/ai/NIST.AI.100-1.pdf.

National Institute of Standards and Technology. Artificial Intelligence Risk Management Framework: Generative Artificial Intelligence Profile. NIST AI 600-1. July 2024. https://doi.org/10.6028/NIST.AI.600-1; https://nvlpubs.nist.gov/nistpubs/ai/NIST.AI.600-1.pdf.

OpenAI. GPT-4o System Card. 8 August 2024. https://cdn.openai.com/gpt-4o-system-card.pdf.

OpenAI. Preparedness Framework. Version 2. Last updated 15 April 2025. https://cdn.openai.com/pdf/18a02b5d-6b67-4cec-ab64-68cdfbddebcd/preparedness-framework-v2.pdf.

U.S. Government Accountability Office. Artificial Intelligence: An Accountability Framework for Federal Agencies and Other Entities. GAO-21-519SP. June 2021. https://www.gao.gov/assets/gao-21-519sp.pdf.