System Cards and the Public Form of AI Accountability

Safety Claims, Evaluation Evidence, and Frontier-Model Release Documentation

Saman Samadi, PhD (Cantab)

Article / PDF | 26 February 2026

Focus: System cards; model-release documentation; evaluation evidence; public accountability.


Abstract

System cards and related model-release documents have become one of the principal public forms through which frontier AI developers explain the release of increasingly capable systems. They give internal evidence a public sequence. Evaluation results are made to bear claims about capability; safety language is attached to mitigations and remaining exposure; deployment reasoning is placed into a form that can be read, cited, compared, and questioned beyond the organisations that release the systems. Their authority depends on this movement from internal evidence to public language. A benchmark result begins to function as a claim about capability, a red-team finding acquires the status of a safety concern, a mitigation becomes one condition under which deployment is defended, and unresolved uncertainty, when properly written, remains part of the document’s evidentiary honesty. Yet the same documents remain controlled by the organisations whose systems they describe. Their accountability value therefore lies in a difficult intermediate position. They support scrutiny, while external audit, regulatory access, third-party evaluation, and enforceable governance continue to set the horizon against which that scrutiny acquires consequence.

This article examines system cards and neighbouring release artifacts as public accountability documents, taking OpenAI, Anthropic, Google DeepMind, Meta, and Cohere as the principal documentary sites, while situating them in relation to the literature on model cards, system cards, transparency artifacts, auditability, responsible disclosure, and frontier AI governance. It argues that the strongest system cards make the evidence-to-claim relation visible by allowing the reader to see how a test becomes a risk classification, how a mitigation is attached to a remaining uncertainty, and how deployment is justified through a chain of judgement that remains open to scrutiny. Their weaker forms make safety language too fluent, allowing public confidence to separate from the evidentiary conditions that should discipline it. The system card is best understood as an accountability-supporting artifact, a documentary site where public trust is either made more answerable to evidence or allowed to drift into institutional self-description.

Keywords: system cards; model cards; AI safety documentation; model-release documentation; public accountability; evaluation evidence; safety claims; residual risk; frontier AI governance.

1. Introduction: the release document as a public site of judgement

Frontier AI systems now arrive with documents that do more than accompany release. Technical reports give the model a research surface through training descriptions, benchmark tables, architectural summaries, and comparative scores. Product notices give users a public account of availability, access, and new capability. Governance-facing documents move closer to the conditions of release, where preparedness thresholds, responsible-scaling commitments, red-team findings, deployment restrictions, and post-mitigation risk are placed into language that can be cited and questioned. Across these neighbouring forms, the system card has begun to acquire public weight because it gathers model behaviour, safety evaluation, and deployment reasoning at the moment when a system becomes available under stated conditions.

The document is therefore already part of the release. It belongs to the system’s public release ecology, because the act of describing model capability, safety evaluation, and deployment reasoning also helps constitute the public conditions under which the system becomes intelligible. In a system card, evaluation is not left as a table detached from judgement. A result is placed under a heading, attached to a risk category, measured against a threshold, qualified by caveats, and then carried into a sentence about capability, mitigation, or deployment. The movement can appear administrative; its force is epistemic. A safety claim becomes answerable, or evasive, through the documentary relations that support it.

The field has a documentary lineage. Model cards were proposed in 2019 as short documents accompanying trained machine-learning models, intended to report intended use, evaluation procedures, ethical considerations, and performance characteristics across relevant conditions.[1] Datasheets and Data Cards developed related practices for dataset documentation, making the provenance, composition, uses, and limits of data more visible to model developers and downstream users.[2] Meta introduced public-facing AI System Cards in 2022 as a prototype resource for explaining how AI systems operate across a product environment, with an early example focused on Instagram feed ranking and with explicit recognition that a model card captures less than the whole system into which a model is placed.[3] Gursoy and Kakadiaris used the term “system cards” in a more audit-oriented register, proposing scorecards for AI-based decision-making systems in public policy, built around a system accountability benchmark.[4] More recent work on Hazard-Aware System Cards extends the idea toward end-to-end transparency and governance, proposing a structured card that can connect hazards, controls, and lifecycle accountability.[5]

This article enters that documentary field at the point where frontier-model release has made system-card writing a practical institutional act. Its object is the existing system card and neighbouring release document, especially where evaluation evidence is translated into public language about capability, safety, mitigation, remaining exposure, uncertainty, and deployment. OpenAI’s GPT-4, GPT-4o, o-series, and GPT-5 safety documents, Anthropic’s Claude system cards and Responsible Scaling Policy, Google DeepMind’s Gemini model cards, Meta’s system-card initiative, and Cohere’s Command A technical report each provide a different public surface through which safety reasoning becomes visible, though never fully available.[6]

Accountability, in this context, should be treated with discipline. A public document improves scrutiny when it preserves the relation between evidence and claim, names uncertainty at the point where confidence might otherwise harden, describes the limits of evaluation, distinguishes pre-mitigation from post-mitigation behaviour, and gives the governance process enough public shape for the release decision to be questioned. The same document can also redirect scrutiny. It can convert open questions into polished categories, describe mitigations while leaving remaining exposure indistinct, let tables appear more decisive than their methods allow, or allow the releasing organisation to define the conditions under which its claims will be judged. The system card then becomes a form of public authority whose value depends on its own vulnerability to evidence.

The central claim of this article is that system cards and model-release documents should be read as documents of evidentiary translation. They carry internal evaluation, safety judgement, and deployment reasoning into public form. Their strongest passages make that translation inspectable. Their weakest passages allow the reader to see a conclusion without seeing enough of the route by which the conclusion was reached. Public accountability gathers at this threshold, where technical evidence has to become language without losing the limits, uncertainty, and residual risk that give the evidence its measure.

2. Documentary precedents: model cards, system cards, data documentation, and transparency artifacts

The model card literature gives the system card one of its most important genealogical anchors. Mitchell and her co-authors proposed model cards as a standardised reporting format for trained models, emphasising intended use, performance evaluation, ethical considerations, and the reporting of results across factors that might reveal disparate performance.[7] The model card was designed to make a trained model more legible to its users and stakeholders, especially where model performance varies across contexts, populations, or tasks. Its form remains compact, and that compactness disciplines a set of questions: who should use this model, for what purpose, under what conditions, with what evaluation results, and with what limits.

Dataset documentation emerged under related pressure. Gebru and colleagues proposed datasheets for datasets in order to make dataset creation, composition, collection, recommended uses, and potential harms more explicit.[8] Pushkarna and colleagues later framed Data Cards as a structured, participatory approach to dataset transparency, with attention to the different informational needs of stakeholders who encounter datasets at different points in a machine-learning lifecycle.[9] These data-documentation practices offer partial accountability by making the conditions of use, reuse, and evaluation less opaque, giving later claims about model behaviour a documentary ground that can be examined.

The shift from model card to system card changes the scale of the object. Meta’s 2022 account of system cards describes them as a response to a limitation of model cards: an AI system may include many models, ranking procedures, product choices, user interfaces, policy decisions, and downstream effects that exceed the description of one trained model.[10] The system card, in this account, moves toward the operational arrangement in which models actually function. It has to describe architecture, purpose, and user-facing operation while remaining digestible to people outside the engineering team. Meta’s account also registers a difficulty that remains central for frontier-model release documents: simplified public language can lose technical precision, and a single badly chosen word may compromise the adequacy of the explanation.[11]

Gursoy and Kakadiaris push the term in a different direction, using system cards as scorecards for formal audits of AI-based decision-making systems in public policy.[12] Their approach is evaluative and structured, built around criteria that span data, model, code, and system, as well as development, assessment, mitigation, and assurance. In this form, the card is a reporting artifact generated by an audit process. It makes the status of an AI system visible against a benchmark of accountability. Recent Hazard-Aware System Card work similarly treats system cards as structured governance artifacts, designed to provide a single source of truth for safety posture across hazards, controls, assurance practices, and lifecycle governance.[13]

The frontier-model system card occupies a less settled position. It borrows from model cards, technical reports, risk disclosures, release notes, and governance frameworks, yet it is rarely reducible to any one of them. It is more public-facing than many internal safety reports; more safety-oriented than a standard product release note; more system-oriented than a traditional model card; often less detailed than a full technical report; and more closely tied to deployment than an abstract policy commitment. Its instability as a genre gives it both value and risk. It can gather heterogeneous evidence into one public document, while the same heterogeneity can make the relation between evidence, interpretation, and institutional judgement difficult to track.

Transparency research helps clarify the stakes of this instability. The Foundation Model Transparency Index frames transparency as a precondition for public accountability, while documenting substantial gaps across developers in relation to training data, labour, downstream impact, and evaluation disclosure.[14] Transparency here appears as a set of specific disclosures through which claims can be evaluated, compared, and challenged. NIST’s AI Risk Management Framework similarly treats documentation, measurement, uncertainty, and governance as practical elements of risk management, while warning that AI risk measurement remains difficult because systems are context-sensitive, metrics can be gamed or oversimplified, and transparency practices are often insufficient.[15] These observations define the conditions under which the system card should be read.

The system card therefore inherits several documentary pressures at once. Model cards teach it to describe the trained artefact; dataset documentation gives it a concern for provenance and conditions of use; system explanation pushes it toward the deployed arrangement; audit scorecards and transparency indexes give it an accountability horizon; governance frameworks ask it to show how risk has been classified and acted upon. Its public force appears when these lineages converge around release. The document has to explain the system, attach safety language to evaluation, mark uncertainty, and make deployment reasoning public enough to be questioned. The system card becomes a site where announcement gives way to documentary order, and where safety claims can acquire, or fail to acquire, public accountability.

3. Defining the system card: document type, audience, and evidentiary scale

A system card can be defined as a public or semi-public document, usually associated with the release or major update of an AI system, that places system capability, safety evaluation, risk classification, mitigation, limitation, and deployment reasoning into a single documentary form. The term remains unsettled across organisations, and that lack of settlement belongs to the genre itself. OpenAI has used “system card” for documents such as the GPT-4 System Card and GPT-4o System Card, where capability description, external red teaming, safety challenge, mitigation, and preparedness scorecard sit within one release-facing account.[16] Anthropic’s Claude 4 System Card describes safety testing, usage-policy evaluations, CBRN and cyber assessments, autonomous capability concerns, alignment evaluations, model welfare assessment, and deployment decisions under the company’s Responsible Scaling Policy.[17] Google DeepMind often uses “model card” for documents that, in practice, contain substantial system-safety and deployment information, including model identity, intended use, evaluation approach, content-safety testing, red teaming, and frontier-safety classifications.[18]

The system card differs from a model card by scale and release function. A model card traditionally describes a trained model, its intended use, performance, limitations, and relevant ethical considerations. A frontier-model system card tends to describe the model as part of a deployed system, where user interaction, safety layers, policy classifiers, usage restrictions, prepared safeguards, risk frameworks, and deployment thresholds all shape the public meaning of release. The model card asks how a model should be understood and used; the system card asks how a system has been assessed, constrained, and released.

The technical report occupies another register. It may include architecture, training details, benchmark performance, model capabilities, and research contributions. Its primary claim often concerns what has been built and how it performs. A safety report or preparedness report narrows the field toward risk, dangerous capability, evaluation, and mitigation. A release note announces availability, feature changes, access conditions, or product updates, sometimes with safety context, though usually without the evidentiary density expected of a system card. Preparedness frameworks and Responsible Scaling Policies operate at a higher governance level. They define categories, thresholds, processes, and commitments through which later release documents can be evaluated. The system card draws from each of these forms, while its distinctive pressure lies in proximity to public release, where safety claims acquire immediate institutional consequence.

Audience also changes the document’s form. A system card has to speak across readers who do not need the same thing from it. Users look for limits and conditions of use; policymakers and civil society researchers look for claims that can be assessed; AI safety researchers compare evaluation practices; journalists and analysts need a citable account of risk; enterprise customers look for assurance; internal stakeholders need the external artifact to remain coherent. This multiple audience structure creates a recurrent tension. The document must remain readable enough to circulate publicly, while the claims it carries depend on technical material that resists simplification. A strong system card lets this tension remain visible through careful definitions, scoped claims, methodological disclosure, and explicit uncertainty.

System cards also differ in the kind of evidence they make public. Some evidence takes numerical form, as with benchmark scores, refusal rates, jailbreak robustness metrics, persuasion outcomes, or hallucination measurements. Some remains qualitative, as in red-team findings, expert review, manual probing, or narrative examples of failure. Some is procedural, appearing when a document reports that a safety board reviewed a release, an external evaluator was involved, a model stayed below a threshold, or a mitigation package was activated. These evidentiary forms cannot support identical claims. A benchmark can justify a bounded statement about performance on a specified task. A red-team finding can disclose an observed vulnerability, while its generality remains limited by the scope of testing. A governance review can show that a process was followed, even where the quality of the process remains open to scrutiny.
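The distinction among evidentiary forms can be made concrete in a small sketch. The following Python fragment is illustrative only: the names EvidenceKind, SUPPORTED_CLAIM, EvidenceItem, and supported_claim are hypothetical and drawn from no developer's documentation, and the claim scopes simply paraphrase the paragraph above. It shows one way a reader might record, for each piece of evidence in a card, the widest claim that evidence can responsibly bear.

```python
from dataclasses import dataclass
from enum import Enum


class EvidenceKind(Enum):
    """Three evidentiary forms distinguished in the text (hypothetical names)."""
    NUMERICAL = "numerical"      # e.g. benchmark scores, refusal rates
    QUALITATIVE = "qualitative"  # e.g. red-team findings, expert review
    PROCEDURAL = "procedural"    # e.g. board review, threshold determination


# Assumption: one paraphrased claim scope per kind, taken from the prose above.
SUPPORTED_CLAIM = {
    EvidenceKind.NUMERICAL: "a bounded statement about performance on a specified task",
    EvidenceKind.QUALITATIVE: "an observed vulnerability, limited by the scope of testing",
    EvidenceKind.PROCEDURAL: "that a process was followed, with its quality open to scrutiny",
}


@dataclass(frozen=True)
class EvidenceItem:
    label: str          # short description of the evidence as reported
    kind: EvidenceKind  # which evidentiary form it takes


def supported_claim(item: EvidenceItem) -> str:
    """Return the widest claim this kind of evidence can bear."""
    return f"{item.label}: supports {SUPPORTED_CLAIM[item.kind]}"


# Usage with invented example labels
for item in (
    EvidenceItem("refusal-rate metric", EvidenceKind.NUMERICAL),
    EvidenceItem("red-team narrative", EvidenceKind.QUALITATIVE),
    EvidenceItem("safety-board review", EvidenceKind.PROCEDURAL),
):
    print(supported_claim(item))
```

The mapping is deliberately coarse. Its purpose is to keep the kind of evidence attached to the claim, so that a procedural fact is not read as a performance result.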

The system card’s documentary difficulty appears when these evidentiary types are compressed into fluent public prose. The reader may encounter a sentence that sounds like a safety conclusion, while the supporting evidence comes from a narrow benchmark, a non-exhaustive red-team process, an internal threshold, or an evaluation whose methodology remains only partially described. The difficulty centres on the public handling of such evidence. Perfect tests are unavailable for frontier systems. The firmer question lies in the way the document marks the relation between what was tested, what was inferred, what was mitigated, and what remains uncertain.

4. Frontier-model release documents: OpenAI, Anthropic, Google DeepMind, Meta, and Cohere

OpenAI’s GPT-4 System Card remains a central early example of the genre in relation to frontier general-purpose models. The document describes GPT-4’s limitations and safety challenges, including hallucinations, harmful content, biases, cybersecurity risk, autonomy-related concerns, and other risk areas. It also foregrounds process. OpenAI reports that more than fifty experts participated in adversarial testing and safety assessment, and it describes changes made through reinforcement learning from human feedback, rule-based reward models, classifiers, and policy mitigations.[19] The card also contains important limits on its own claims. It states that its examples are selective and that many evaluations are incomplete, which gives the reader a rare view of the boundary between demonstration and systematic evidence.[20]

The GPT-4 card is strong when it keeps this boundary visible. Its discussion of autonomous replication and resource acquisition, based on ARC’s evaluation, reports that GPT-4 probably lacked the ability to replicate autonomously and acquire resources in the tested setting, while also leaving the broader autonomy problem open.[21] Its hallucination section similarly combines quantitative comparison with residual limitation, reporting improvement over previous models while preserving the claim that hallucination remains a serious issue.[22] The card’s authority comes from this double movement: it offers evidence of mitigation while leaving enough uncertainty in view for the claim to remain answerable.

The GPT-4o System Card gives a later and more formally integrated example, shaped by OpenAI’s Preparedness Framework. It presents risk categories, including cybersecurity, biological threats, persuasion, and model autonomy, and reports a scorecard in which the model is assigned low or medium post-mitigation risk across tracked categories, with persuasion receiving the highest reported classification.[23] The document also describes red teaming, structured measurements, and mitigations across audio, image, and text modalities.[24] Its documentary value lies in the way evaluation results are connected to deployment thresholds. Under the Preparedness Framework, deployment is withheld from models that retain post-mitigation high risk in a tracked category; the GPT-4o card therefore frames deployment as a thresholded decision whose authority depends on a stated risk classification rather than on general confidence.[25]
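The logic of a thresholded decision can be sketched in a few lines. The Python fragment below is a minimal illustration under stated assumptions, not OpenAI's implementation: the names Risk, DEPLOY_CEILING, blocking_categories, and may_deploy are hypothetical, the three-level scale includes only the levels named in this section, and the scorecard values are placeholders chosen for illustration, not figures quoted from the card.

```python
from enum import IntEnum


class Risk(IntEnum):
    """Ordered risk levels; only the levels named in the text are modelled."""
    LOW = 1
    MEDIUM = 2
    HIGH = 3


# Assumption: deployment is withheld when any tracked category is above MEDIUM.
DEPLOY_CEILING = Risk.MEDIUM


def blocking_categories(scorecard: dict[str, Risk]) -> list[str]:
    """List tracked categories whose post-mitigation level exceeds the ceiling."""
    return sorted(name for name, level in scorecard.items() if level > DEPLOY_CEILING)


def may_deploy(scorecard: dict[str, Risk]) -> bool:
    """A release passes only if no tracked category blocks it."""
    return not blocking_categories(scorecard)


# Placeholder post-mitigation scorecard (illustrative values only)
scorecard = {
    "cybersecurity": Risk.LOW,
    "biological threats": Risk.LOW,
    "persuasion": Risk.MEDIUM,
    "model autonomy": Risk.LOW,
}
print(may_deploy(scorecard), blocking_categories(scorecard))
```

The sketch makes one documentary point visible: the decision is only as inspectable as the classification that feeds it. The rule itself is trivial; the judgement lies in how each category received its level.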

The o3 and o4-mini system card shows the genre continuing to evolve under OpenAI’s updated Preparedness Framework. The document situates the models within reasoning and tool-use capabilities, reports evaluation results across cyber, biological, autonomy, safety-behaviour, jailbreak, hallucination, and vision-related risks, and notes that the Safety Advisory Group determined that no high-risk threshold was reached in the tracked categories.[26] Its more detailed Deployment Safety Hub presentation includes methodology notes around refusal evaluation, jailbreak testing, hallucination evaluation, and vision red teaming.[27] This shift toward web-based safety hubs changes the material form of the system card. The document becomes an updateable public interface rather than a static PDF, which increases navigability while raising new questions about versioning, citation stability, and the preservation of earlier claims.

OpenAI’s GPT-5 System Card intensifies the governance connection. It describes GPT-5 as a unified system with different reasoning modes and explicitly states that GPT-5 thinking is treated as having high capability in the biological and chemical domain under the Preparedness Framework, thereby activating associated safeguards.[28] The card introduces or foregrounds “safe completions” as a safety-training approach designed to provide helpful, bounded answers where possible while maintaining safety constraints.[29] It also includes jailbreak, instruction-hierarchy, prompt-injection, hallucination, deception, and chain-of-thought monitorability evaluations across the GPT-5 documentation set.[30] The public form becomes more granular as the evaluation catalogue expands, and the reader needs stronger guidance about what each evaluation supports, how robust the metric is, and which conclusions can be drawn from post-mitigation performance.

Anthropic’s Claude 4 System Card is especially valuable because it binds safety evaluation to a named scaling policy. The document describes pre-deployment safety tests under the Responsible Scaling Policy, usage-policy evaluations, reward-hacking assessments, agentic safety evaluations, alignment assessments, and model welfare considerations.[31] It reports that Claude Opus 4 was deployed under ASL-3 standards, while Claude Sonnet 4 was deployed under ASL-2 standards.[32] The card also describes the decision process: Anthropic’s RSP requires comprehensive safety evaluations in areas such as CBRN, cybersecurity, and autonomous capabilities, with internal and external partners contributing to the ASL determination process.[33]

Anthropic’s document gains public force where it acknowledges reliance on safety layers beyond the model itself. The Claude 4 card notes jailbreak susceptibility and describes cases in which additional layers outside the model were relied on to satisfy core RSP commitments.[34] It also marks the difficulty of detecting deception or hidden goals, stating that such tendencies are hard to test and that the assessments produced no concern about systematic deception, while certain self-preservation contexts elicited seriously misaligned behaviour.[35] This is precisely the kind of passage through which public accountability can become more than a score. The system card allows a reader to see that risk classification depends on a larger architecture of safeguards, evaluations, interpretation, and residual uncertainty.

Anthropic’s Responsible Scaling Policy version 3.0 adds another layer to this documentary order. It describes the RSP as a voluntary framework for managing catastrophic risks from advanced AI and introduces components such as Frontier Safety Roadmaps, Risk Reports, governance commitments, capability thresholds, and mitigations.[36] Its Risk Reports are intended to provide detailed information about a model’s safety profile, capabilities, threat models, mitigations, and overall risk, with publication after sensitive information is removed and external review performed.[37] The system card and risk report therefore sit within a wider public documentation ecology. The card describes a particular system; the policy defines the procedures and thresholds through which future systems should be judged.

Google DeepMind’s Gemini model cards provide a different nomenclature while carrying many similar documentary functions. The Gemini 3.1 Pro Model Card states that model cards provide essential information about models, including limitations, mitigation approaches, and safety performance, and notes that cards may be updated as additional information becomes available.[38] The card describes model inputs and outputs, distribution channels, intended uses, evaluation approaches, benchmark results, content-safety evaluations, human red teaming, and frontier-safety findings across domains such as CBRN, cyber, harmful manipulation, machine-learning research and development, and misalignment.[39] It also marks the distinct status of automated safety evaluations by placing them alongside human and red-team evaluation.[40]

Meta’s AI System Cards belong to a different product and transparency context, while still clarifying the meaning of system-level documentation. Meta’s 2022 system-card article describes a prototype resource for understanding how AI systems work, beginning with Instagram feed ranking, and explicitly situates system cards as more holistic than model cards because deployed AI systems involve products, policies, interactions, and downstream effects.[41] Meta also highlights the difficulty of making complex technical information both accurate and understandable, as well as the tension between transparency and security.[42] These constraints are directly relevant for frontier-model system cards. Public documentation has to reveal enough to support scrutiny while avoiding disclosures that could enable misuse or compromise safety systems.

Cohere’s Command A technical report extends the pattern into enterprise language-model documentation. The report describes a model intended for enterprise agentic and multilingual tasks, includes model capabilities and limitations, and reports benchmark performance and safety-related information for the released model.[43] Its form is closer to a technical report than a full system card, yet it still participates in the release-documentation ecology through which model capability and safety are made legible to external users. Such documents show that the boundary among model card, technical report, and system card is organisational as well as conceptual. Some developers retain older or neighbouring labels even when a document begins to perform system-card work.

Taken together, these documents establish the system card as a genre held together by family resemblance, without a stable fixed template. OpenAI foregrounds system-card scorecards and preparedness categories; Anthropic binds system cards to an ASL-based scaling policy; Google DeepMind uses model cards that carry broader safety and frontier-risk material; Meta treats system cards as system-explanation artifacts for product AI; Cohere uses technical-report language in a release and enterprise context. The public problem remains consistent across this variation: evaluation evidence has to be carried into claims whose authority depends on how precisely the document preserves the route from test to conclusion.

5. The evidence-to-claim relation

The evidence-to-claim relation is the central discipline of system-card writing. In frontier-model documentation, evidence reaches the reader through different surfaces: a benchmark score, a red-team narrative, a refusal-rate metric, a biological troubleshooting task, a cybersecurity challenge, a persuasion experiment, a model-autonomy assessment, a prompt-injection test, a hallucination measure, or a safety-board decision. The document then has to say what kind of claim each surface can bear. A score may support a narrow statement about performance under specified conditions. A result may become interpretive when it is used to classify risk. A threshold may become procedural when it authorises release. A governance process may become institutional when deployment is justified under a stated policy.

The strength of a system card lies in making these transitions legible. The OpenAI GPT-4o card, for example, places cybersecurity evaluation, biological-risk testing, persuasion measurement, and autonomy assessment into separate evidentiary settings, and this separation allows each test to retain the kind of claim it can responsibly support.[44] Each setting carries a different force. A cyber result from a bounded capture-the-flag environment can inform claims about observed capability under that test setup, while leaving much of real-world cyber misuse outside its scope. A biological troubleshooting task can inform risk assessment around tacit knowledge and assistance, yet its force remains tied to task design, grading, and evaluator assumptions. A persuasion study can support a measured claim about relative persuasiveness under the conditions of that study, while remaining distant from a general account of social influence.

OpenAI’s later documentation sometimes makes this limitation explicit through methodology notes. The GPT-5.2 system-card update, for example, notes that internal production benchmarks are deliberately difficult and that error rates on those benchmarks require careful handling because they are unrepresentative of average production traffic.[45] It also states that certain StrongReject and prompt-injection evaluations overrepresent robustness because safeguards were applied in the benchmark samples.[46] Such statements form part of the evidentiary architecture of the document. They keep the public claim from exceeding the test.

Anthropic’s Claude 4 card offers similar lessons through its treatment of hard-to-test properties. The document discusses deception, hidden goals, and self-preservation contexts, while acknowledging the difficulty of evaluating these phenomena in a way that would justify confident general conclusions.[47] That difficulty changes the register of the claim. The card can report what assessments found, what contexts produced concerning behaviour, where evaluators found no supporting pattern, and where uncertainty remains. The value lies in the calibrated movement from result to judgement.

The same discipline applies to red teaming. Red-team findings are powerful because they expose behaviours that ordinary benchmark suites may miss. They are also selective, adversarial, and shaped by the skills and assumptions of the testers. The GPT-4 card explicitly notes that its examples are selected and that the system card gives an incomplete account of GPT-4’s capabilities and limitations.[48] This admission strengthens the document because it prevents anecdotal evidence from hardening into an unearned general claim. A red-team example can demonstrate the presence of a vulnerability; absence across a finite red-team process has a different evidentiary status from absence across the broader deployment environment.

Benchmarks create another risk because their numerical form carries visual confidence. Tables and scores can produce a stronger public impression than their methods support. NIST’s AI Risk Management Framework warns that AI risk measurement can be distorted by oversimplification, gaming, lack of robust verification, and the gap between laboratory measures and real-world behaviour.[49] A system card that presents benchmark results without enough methodological context may appear precise while leaving the real conditions of interpretation under-specified. Conversely, a system card that names the metric, task design, evaluation population, mitigation condition, and uncertainty gives the reader a way to measure the claim against the evidence.
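The interpretive context named above can be pictured as a record that travels with the score. What follows is a minimal sketch in Python, assuming a hypothetical `ReportedResult` structure whose field names are taken from the preceding sentence; no system card publishes such a schema, and the values are placeholders, not reported results.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class ReportedResult:
    """A benchmark figure kept together with its interpretive context.

    Hypothetical structure for illustration; not a published schema.
    """
    score: float
    metric: Optional[str] = None
    task_design: Optional[str] = None
    evaluation_population: Optional[str] = None
    mitigation_condition: Optional[str] = None
    uncertainty: Optional[str] = None

    def missing_context(self) -> List[str]:
        """Names of context fields left empty, in declaration order."""
        return [
            f.name
            for f in fields(self)
            if f.name != "score" and not getattr(self, f.name)
        ]


# Placeholder values only; they do not reproduce any published evaluation.
bare = ReportedResult(score=0.87)
scoped = ReportedResult(
    score=0.87,
    metric="task success rate",
    task_design="bounded capture-the-flag tasks",
    evaluation_population="fixed internal task set",
    mitigation_condition="safeguards enabled",
    uncertainty="single run, no confidence interval reported",
)

print(bare.missing_context())    # every context field is absent
print(scoped.missing_context())  # nothing is absent
```

The two records carry the same number. Only the second gives a reader the means to decide what that number can support.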

The evidence-to-claim relation is therefore an editorial problem as well as a technical one. It appears in the way sentences are written, tables introduced, limitations placed, categories named, thresholds defined, and residual risk carried into public prose. A safety claim becomes stronger when it remains close to the test that supports it. It becomes weaker when it detaches from the tested condition and travels as general assurance.

6. Mitigation, residual risk, and the documentary life of safeguards

Mitigation is one of the most difficult terms in AI safety documentation because it can name interventions at different layers of the system. It may refer to training, classifier architecture, refusal behaviour, content filtering, system prompts, monitoring, access limits, staged rollout, human review, usage policy, or governance procedure. When a system card treats these layers as one undifferentiated category, the public claim becomes too smooth. A stronger document identifies the mitigation layer being invoked, the risk it is meant to reduce, the evidence that it reduced that risk, and the exposure that remains after the intervention.

OpenAI’s GPT-4 and GPT-4o system cards both make mitigation part of the public account. The GPT-4 card describes safety interventions such as reinforcement learning from human feedback, rule-based reward models, policy classifiers, and deployment-time mitigations, while still acknowledging persisting issues such as hallucination, harmful content, and evaluation incompleteness.[50] The GPT-4o card uses the Preparedness Framework to present post-mitigation risk categories, with deployment permitted only where tracked risks remain at medium or below after mitigation.[51] This post-mitigation vocabulary is important because it shifts the public question from whether risk exists to whether risk has been reduced to a level the organisation’s governance process treats as deployable.

That shift creates a new demand for residual-risk language. A post-mitigation score can be useful; it can also make the remaining risk appear settled too quickly. Readers need to know what mitigation changed, what remained unchanged, how the mitigated system was evaluated, and whether the evaluation covered realistic misuse pathways. The GPT-4o card’s medium persuasion rating, for example, carries different public meaning from a general safety statement. It says that after mitigations and review under a specific framework, persuasion remained the most elevated tracked risk category.[52] The category invites scrutiny of the underlying test design, thresholds, and deployment safeguards.
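The post-mitigation rule described above can be sketched as a simple gate. The sketch assumes an ordered four-level scale (low, medium, high, critical) and takes the four category names from the evaluation areas mentioned earlier in this article. Only the medium persuasion rating is drawn from the text; the other ratings are illustrative placeholders, and the function names are hypothetical.

```python
from enum import IntEnum
from typing import Dict


class Risk(IntEnum):
    """Ordered risk levels; the ordering is an assumption of this sketch."""
    LOW = 1
    MEDIUM = 2
    HIGH = 3
    CRITICAL = 4


# Deployment ceiling as described in the text: medium or below after mitigation.
DEPLOY_CEILING = Risk.MEDIUM


def deployable(post_mitigation: Dict[str, Risk]) -> bool:
    """True only if every tracked category is at or below the ceiling."""
    return all(level <= DEPLOY_CEILING for level in post_mitigation.values())


def most_elevated(post_mitigation: Dict[str, Risk]) -> str:
    """Category with the highest post-mitigation rating."""
    return max(post_mitigation, key=post_mitigation.get)


# Illustrative scorecard: persuasion at medium follows the text above;
# the remaining values are placeholders, not reported results.
scorecard = {
    "cybersecurity": Risk.LOW,
    "biological": Risk.LOW,
    "persuasion": Risk.MEDIUM,
    "autonomy": Risk.LOW,
}

print(deployable(scorecard))
print(most_elevated(scorecard))
```

The gate returns a single yes or no, which is why the surrounding prose matters: the rule records that a threshold was met, while the test design, thresholds, and safeguards behind each rating remain outside it.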

Anthropic’s Claude 4 card gives an especially clear example of mitigation as layered architecture. It acknowledges jailbreak susceptibility and states that additional safety layers outside the model were used to satisfy core Responsible Scaling Policy commitments.[53] This is a valuable public admission because it prevents the model itself from being treated as the sole bearer of safety. The system becomes safe enough for deployment, under the company’s policy, through a combination of model behaviour, external safeguards, monitoring, restrictions, and governance judgement. Residual risk therefore belongs to the system as a whole, beyond the model considered in isolation.

The Responsible Scaling Policy v3.0 formalises this relation by giving mitigation a place within a longer documentary chain.[54] A threshold is anticipated through a Frontier Safety Roadmap; a system is evaluated against dangerous-capability criteria; mitigations are specified; a Risk Report or system card carries selected elements into public form; governance commitments define what should follow when a threshold is approached or crossed.[55] Mitigation then appears less as a claim of safety in isolation and more as one stage in the documented movement from capability assessment to release decision.

Google DeepMind’s Gemini 3.1 Pro Model Card similarly links mitigation to frontier-safety classifications and continuous testing. The card describes safety buffers, evaluations across Frontier Safety Framework domains, and results across CBRN, cyber, harmful manipulation, machine-learning research and development, and misalignment.[56] It also distinguishes automated evaluations from human and red-team review, which helps readers understand the status of the evidence supporting safety claims.[57] The mitigation claim becomes more accountable because the card marks differences among benchmark testing, human review, red teaming, and frontier-safety governance.

Public documentation around safeguards should therefore resist a simple closure. The strongest mitigation language describes action and remaining exposure in the same movement. It allows the reader to see how a risk was reduced, where the reduction was measured, which assumptions remain active, and which pathways still require monitoring. Such writing is stronger than confident assurance because it lets the claim remain attached to the conditions under which it can be tested.

7. Public accountability and the limits of voluntary disclosure

System cards support accountability by making selected evidence public and by giving external readers a shared document through which release reasoning can be cited, compared, and questioned. They make it possible to ask whether a safety claim corresponds to an evaluation, whether a mitigation is described in operational terms, whether residual risk survives the prose of reassurance, and whether governance thresholds have been defined clearly enough to be scrutinised. They also stabilise terminology, or expose its drift.

Their limits are equally structural. The releasing organisation controls what is disclosed, how findings are framed, which examples are selected, how much methodology is revealed, and which information is withheld for security, commercial, or misuse-prevention reasons. Some withholding is legitimate. Meta’s system-card discussion directly recognises the tension between transparency and security, since full disclosure of system details can expose vulnerabilities or enable adversarial manipulation.[58] Frontier-model cards face the same problem under more dangerous conditions: enough information must be public to support scrutiny, while some operational details may need protection.

Voluntary disclosure therefore produces an accountability-supporting document whose settlement remains elsewhere, in the fuller ecology of audit, oversight, evidence access, and enforceable governance. The Foundation Model Transparency Index demonstrates the scale of what remains unavailable across leading developers, including areas such as data, labour, downstream impact, and many aspects of evaluation and governance.[59] Its later version reports improved average scores, while continuing to document systemic opacity across major transparency categories.[60] These findings place system cards within a larger field of disclosure gaps. A strong card may improve the public surface of one release, while independent access to training data, internal incident logs, downstream deployment information, labour conditions, and complete evaluation records remains beyond its own documentary reach.

NIST’s AI Risk Management Framework gives a parallel caution from the risk-management side. It describes AI risk management as contextual, iterative, and uncertain, and it organises practice around functions such as Govern, Map, Measure, and Manage.[61] It also emphasises that risk measurement can be difficult because AI systems behave differently across contexts, measurement methods may be immature, and transparency practices are often insufficient.[62] A system card can contribute to these functions by documenting evaluation and mitigation, and its adequacy depends on how the organisation defines the context, what it measures, how it reports uncertainty, and whether external parties can contest the claims.

The public accountability value of a system card should therefore be measured by the scrutiny it enables. A reader should be able to identify the system being described, distinguish capability from safety and deployment claims, see the evaluation methods and their limits, follow how mitigation could be checked against future failure or incident, understand what residual risk remains, and connect the release decision to a governance policy, threshold, or review process. Version history matters here, since an updateable safety hub strengthens public documentation only when earlier claims remain reconstructable.

These questions ask less for total disclosure than for preservation of the relations through which the document’s own claims can be tested. A system card that gives limited information in a carefully scoped way may support accountability better than a much longer document whose confident prose conceals the fragility of its evidence. Public accountability begins to take form when the document makes its own conditions of judgement available.
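The reading questions set out in this section can be sketched as a short checklist. This is one possible encoding, assuming hypothetical key names that paraphrase the six conditions listed above; it is a reading aid, not a scoring instrument used by any of the organisations discussed.

```python
from typing import Dict, List

# Hypothetical keys paraphrasing the six conditions named in this section.
SCRUTINY_QUESTIONS = (
    "system_identified",
    "claim_types_distinguished",
    "methods_and_limits_shown",
    "mitigation_checkable",
    "residual_risk_stated",
    "governance_link_given",
)


def unanswered(card_review: Dict[str, bool]) -> List[str]:
    """Questions the reviewed card leaves open, in checklist order.

    A question missing from the review counts as unanswered.
    """
    return [q for q in SCRUTINY_QUESTIONS if not card_review.get(q, False)]


# Illustrative review of an imaginary card; not an assessment of a real one.
review = {
    "system_identified": True,
    "claim_types_distinguished": True,
    "methods_and_limits_shown": False,
    "residual_risk_stated": True,
}

print(unanswered(review))
```

Treating an absent answer as an open question reflects the argument of this section: what the document does not make available cannot be tested by its readers.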

8. Strong documentary practice

Strong system-card practice begins with a clear object. The reader should know whether the document describes a base model, a fine-tuned model, a deployed product system, a multimodal system, a reasoning mode, a family of models, or a release bundle. OpenAI’s GPT-5 card, for example, distinguishes model labels and reasoning modes within a unified system, which is essential because risk can vary across modes and access pathways.[63] Google DeepMind’s Gemini 3.1 Pro Model Card similarly identifies model inputs, outputs, distribution channels, and intended use conditions before moving into evaluation and safety performance.[64] Such framing may seem elementary; without it the public claim loses its object.

A second strong practice is the visible separation of capability claims from safety claims. Capability evaluation and safety evaluation may rely on overlapping methods while supporting different public conclusions. A model’s high performance on coding, reasoning, or tool-use tasks can intensify concern in cyber, autonomy, or misuse contexts. The Claude 4 card explicitly states that the models are advanced in reasoning, visual analysis, computer use, tool use, and coding, while focusing the card on safety-related testing.[65] This helps the reader see why capability assessment and safety assessment remain entangled in deployment reasoning, even as their claims must remain distinguishable in prose.

A third practice lies in methodological disclosure. System cards are strongest when they explain what an evaluation tested, who conducted it, what metric was used, and how the result should be interpreted. OpenAI’s GPT-4 card reports external expert involvement and adversarial testing, while marking the incompleteness of the evaluation set.[66] Anthropic’s Claude 4 card describes both internal and external partners in the ASL determination process.[67] Google DeepMind’s Gemini model card distinguishes automated content-safety evaluation from human review and red teaming.[68] These details give public readers a way to understand the authority and limits of the evidence.

A fourth practice is explicit post-mitigation reasoning. The public needs to know whether a risk score refers to the raw model, a mitigated deployment, a restricted access context, or a system surrounded by external safeguards. OpenAI’s GPT-4o scorecard is useful here because it reports risk ratings in a post-mitigation framework and connects deployment to category thresholds.[69] Anthropic’s admission that additional safety layers outside the model contribute to satisfying RSP commitments performs a similar documentary function.[70] In both cases, the model is placed within a system of controls, and the card invites readers to ask how those controls were evaluated.

A fifth practice is careful versioning and update visibility. Web-based deployment-safety hubs can improve navigation and allow updates, while they also create archival and citation challenges. A PDF system card gives a stable document; an updated web hub can reflect changed evaluations, additional mitigations, or new benchmark results. The public value of the web form depends on clear dates, version histories, and stable references. OpenAI’s newer Deployment Safety Hub pages and Google DeepMind’s model-card pages both move toward living documentation, which is promising where updates are transparent and problematic where earlier claims become difficult to reconstruct.[71]

A sixth practice is disciplined uncertainty language. The best system-card passages treat uncertainty as part of the public claim. GPT-4’s acknowledgement of non-comprehensive evaluation, Claude 4’s discussion of the difficulty of testing deception and hidden goals, Gemini’s distinction between automated and human safety assessments, and GPT-5.2’s notes about benchmark representativeness all preserve the pressure of unresolved evidence.[72] Such passages give the document a stronger public form because they resist the temptation to convert limited evaluation into general assurance.

9. Recurring weaknesses and documentary risks

The most common weakness in system-card writing is the overextension of evaluation evidence. A bounded test becomes a broad claim about safety; a red-team exercise becomes a general assurance; a post-mitigation score becomes a public impression of controlled risk. The weakness may appear without any false statement. It can arise from placement, emphasis, visual hierarchy, or the absence of a nearby caveat. A table may be accurate and still exert too much rhetorical weight where the document leaves the governing test conditions and interpretive limits unclear.

A second weakness is the compression of mitigation. When mitigation is described through broad terms such as safeguards, safety layers, monitoring, policy enforcement, or refusal behaviour, the reader may struggle to see which risk is being addressed and how effectiveness was measured. The problem is especially acute when mitigation operates outside the model, through classifiers, access controls, monitoring, or use policies. These mechanisms may be essential; their public value depends on documentary specificity. A mitigation claim lacking connection to an evaluation method remains difficult to scrutinise.

A third weakness is residual-risk thinning. Many documents describe safety improvements more fully than remaining exposure. This produces an asymmetry in the public record. The improvement is narrated; the remainder is named only briefly, or displaced into general language about ongoing monitoring. Residual risk should shape the claim from within, because a credible safety statement concerns a particular risk reduced under specified conditions and still subject to further monitoring, restriction, or review. When residual risk appears only as a final caveat after the document has already created confidence, the documentary order has begun to weaken the very relation it needs to preserve.

A fourth weakness is terminology drift. A term such as risk, capability, mitigation, safeguard, autonomy, robustness, refusal, preparedness, or deployment can change force as it moves across the document. It may be technical in an evaluation section, procedural in a governance section, and public-relational in an introduction. Multi-author documentation intensifies this risk because sections can remain locally accurate while the whole document loses conceptual consistency. Terminology control is therefore part of accountability. When a term changes register, the safety claim attached to it can begin to drift.

A fifth weakness is the promotional pressure of release itself. System cards are published near moments of product or model introduction, often by the same organisations that benefit from public confidence in the release. This gives their prose special consequence. A model-release document must explain capability and safety without turning into marketing. The line is crossed when capability is narrated expansively while risk is handled procedurally, or when limitations appear in sections that few readers will reach after a strong opening impression has been formed. Public accountability depends on the weight the document gives to each relation: capability must remain in contact with limitation, evaluation with method, mitigation with remaining exposure, and uncertainty with the confidence that public prose is allowed to carry.

A sixth weakness is comparison without commensurability. System cards from different organisations use different categories, thresholds, metrics, and disclosure practices. OpenAI’s Preparedness categories, Anthropic’s ASL levels, Google DeepMind’s Frontier Safety Framework domains, Meta’s product-system explanation cards, and Cohere’s enterprise technical reports resist comparison as entries in a single standardised table. Cross-lab comparison remains necessary and must proceed through careful translation. The reader has to ask what each category means, which evaluation supports it, whether the result is pre- or post-mitigation, and what governance consequence follows.

A final weakness lies in the displacement of accountability from public evidence to organisational trust. A system card may state that internal review occurred, that external partners were consulted, or that a safety board approved deployment. Such statements matter; their force depends on how much the public can know about the review criteria, the evidence considered, the dissenting views if any, and the conditions attached to approval. Process language can support accountability when it opens the review path to scrutiny. It can also replace evidence with institutional self-description. The document should give readers enough structure to distinguish between these two possibilities.

10. From public document to professional practice

System cards matter professionally because they reveal the work that high-stakes AI documentation actually requires. The document extends beyond explanation and becomes a structured negotiation among researchers, safety teams, policy leads, product teams, legal review, communications, and governance functions. Evaluation results have to be translated without losing their limits. Safety claims have to be supported without overclaiming. Mitigations have to be described without exposing dangerous operational details. Residual risk has to remain visible without turning the document into alarm or incoherence. The public language must be readable, accurate, internally consistent, and resistant to promotional drift.

For external-artifacts roles, the system card is a central proof of competence. It requires long-form documentation, multi-author integration, claim-level judgement, terminology discipline, and public-facing technical prose. A person working on such an artifact has to notice where a safety claim exceeds its evidence, where a table appears more decisive than its method warrants, where a mitigation has lost its object, where a risk category changes register, and where a public sentence has become too smooth for the uncertainty it carries.

For responsible AI governance and risk-documentation roles, system cards show how frameworks become public documents. A Responsible Scaling Policy or Preparedness Framework has limited public force until it is applied to a release, a model, a risk report, or a deployment decision. The system card becomes one of the places where policy language meets technical evaluation. Its categories, thresholds, and mitigation claims must therefore remain traceable to the governance commitments that give them authority.

For model-evaluation communication and safety-operations support, system cards show how technical results become public or operational claims. Benchmark scores, red-team findings, refusal metrics, cyber tasks, biological evaluations, persuasion studies, and autonomy assessments each require interpretation. The documentation specialist must know enough to ask what the evaluation tested, what it left outside the test, which conditions shaped the result, what mitigation changed, and how much confidence the prose is allowed to carry.

For research communications, system cards make visible the difference between clarity and simplification. The task is to keep difficult evidence accessible without letting it disappear into accessible language, so that non-specialist readers can understand the claim while specialist readers can still see the evidence, the method, and the uncertainty that govern it. That capacity becomes increasingly important as frontier AI systems move faster than public institutions can easily assess them.

The system card therefore has job-facing significance because it concentrates the practical skills this project aims to build. It trains the reader to follow public AI safety documents closely, trace claims back to evidence, detect terminology drift, interpret evaluation tables, preserve uncertainty, rewrite technical material without weakening it, and build documents that remain accountable under scrutiny. These skills belong to the public life of AI safety itself.

11. Conclusion: the document that must remain answerable

System cards and model-release documents have become necessary public forms in the governance of frontier AI, though they are not sufficient on their own. They give external readers a view of capability, safety evaluation, mitigation, remaining exposure, and deployment reasoning at the moment when a system enters wider use. They make claims citable, categories visible, and selected methods inspectable. They provide the documentary surface through which researchers, policymakers, journalists, civil society groups, users, and future auditors can begin to question the relation between what was tested and what was claimed.

Their authority remains conditional. A system card authored by the releasing organisation cannot substitute for independent audit, regulatory access, third-party evaluation, incident reporting, or enforceable governance. Its public value lies in the measure of scrutiny it enables. When the card preserves the route from evidence to claim, distinguishes capability from safety, describes mitigation together with remaining exposure, marks uncertainty without hiding behind it, and connects deployment to a visible governance process, the document gives public accountability a stronger form. When it allows confidence to outrun evidence, the same genre becomes an instrument of institutional self-presentation.

The system card should therefore be read as a document under pressure, because model behaviour, evaluation method, safety judgement, governance procedure, and public language meet within it and alter one another’s force. Its best form lets the claim remain close enough to the evidence that trust can be measured, questioned, and revised, without requiring readers to accept safety as an abstract assurance. In that closeness, the system card acquires its public consequence: it gives frontier AI release a documentary order through which safety language can be made answerable before the systems it describes become part of ordinary life.

Notes

[1] Margaret Mitchell and others, ‘Model Cards for Model Reporting’, in Proceedings of the Conference on Fairness, Accountability, and Transparency (New York: ACM, 2019), pp. 220–29, https://doi.org/10.1145/3287560.3287596.

[2] Timnit Gebru and others, ‘Datasheets for Datasets’ (2018), arXiv:1803.09010, https://arxiv.org/abs/1803.09010; Mahima Pushkarna, Andrew Zaldivar, and Oddur Kjartansson, ‘Data Cards: Purposeful and Transparent Dataset Documentation for Responsible AI’, in Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency (New York: ACM, 2022), pp. 1776–1826, https://doi.org/10.1145/3531146.3533231.

[3] Meta AI, ‘System Cards, a New Resource for Understanding How AI Systems Work’, Meta AI (23 February 2022), https://ai.meta.com/blog/system-cards-a-new-resource-for-understanding-how-ai-systems-work/.

[4] Furkan Gursoy and Ioannis A. Kakadiaris, ‘System Cards for AI-Based Decision-Making for Public Policy’ (2022), arXiv:2203.04754, https://arxiv.org/abs/2203.04754.

[5] Huzaifa Sidhpurwala and others, ‘Blueprints of Trust: AI System Cards for End-to-End Transparency and Governance’ (2025), arXiv:2509.20394, https://arxiv.org/abs/2509.20394.

[6] OpenAI, ‘GPT-4 System Card’ (2023), https://cdn.openai.com/papers/gpt-4-system-card.pdf; OpenAI, ‘GPT-4o System Card’, OpenAI (8 August 2024), https://openai.com/index/gpt-4o-system-card/; OpenAI, ‘GPT-5 System Card’, OpenAI Deployment Safety Hub (7 August 2025), https://deploymentsafety.openai.com/gpt-5; Anthropic, ‘Claude Opus 4 and Claude Sonnet 4 System Card’ (May 2025), https://www-cdn.anthropic.com/6be99a52cb68eb70eb9572b4cafad13df32ed995.pdf; Google DeepMind, ‘Gemini 3.1 Pro Model Card’, Google DeepMind (19 February 2026), https://deepmind.google/models/model-cards/gemini-3-1-pro; Meta AI, ‘System Cards’; Team Cohere, ‘Command A: An Enterprise-Ready Large Language Model’ (2025), arXiv:2504.00698, https://arxiv.org/abs/2504.00698.

[7] Mitchell and others, ‘Model Cards’.

[8] Gebru and others, ‘Datasheets for Datasets’.

[9] Pushkarna, Zaldivar, and Kjartansson, ‘Data Cards’.

[10] Meta AI, ‘System Cards’.

[11] Meta AI, ‘System Cards’.

[12] Gursoy and Kakadiaris, ‘System Cards’.

[13] Sidhpurwala and others, ‘Blueprints of Trust’.

[14] Rishi Bommasani and others, ‘The Foundation Model Transparency Index’ (2023), arXiv:2310.12941, https://arxiv.org/abs/2310.12941.

[15] National Institute of Standards and Technology, Artificial Intelligence Risk Management Framework (AI RMF 1.0), NIST AI 100-1 (Gaithersburg, MD: National Institute of Standards and Technology, 2023), https://doi.org/10.6028/NIST.AI.100-1.

[16] OpenAI, ‘GPT-4 System Card’; OpenAI, ‘GPT-4o System Card’.

[17] Anthropic, ‘Claude Opus 4 and Claude Sonnet 4 System Card’.

[18] Google DeepMind, ‘Gemini 3.1 Pro Model Card’.

[19] OpenAI, ‘GPT-4 System Card’.

[20] OpenAI, ‘GPT-4 System Card’.

[21] OpenAI, ‘GPT-4 System Card’.

[22] OpenAI, ‘GPT-4 System Card’.

[23] OpenAI, ‘GPT-4o System Card’.

[24] OpenAI, ‘GPT-4o System Card’.

[25] OpenAI, ‘GPT-4o System Card’.

[26] OpenAI, ‘OpenAI o3 and o4-mini System Card’, OpenAI (16 April 2025), https://openai.com/index/o3-o4-mini-system-card/.

[27] OpenAI, ‘OpenAI o3 and o4-mini System Card’, OpenAI Deployment Safety Hub, https://deploymentsafety.openai.com/o3.

[28] OpenAI, ‘GPT-5 System Card’.

[29] OpenAI, ‘GPT-5 System Card’.

[30] OpenAI, ‘GPT-5 System Card’; OpenAI, ‘Update to GPT-5 System Card: GPT-5.2’, OpenAI Deployment Safety Hub (11 December 2025), https://deploymentsafety.openai.com/gpt-5-2.

[31] Anthropic, ‘Claude Opus 4 and Claude Sonnet 4 System Card’.

[32] Anthropic, ‘Claude Opus 4 and Claude Sonnet 4 System Card’.

[33] Anthropic, ‘Claude Opus 4 and Claude Sonnet 4 System Card’.

[34] Anthropic, ‘Claude Opus 4 and Claude Sonnet 4 System Card’.

[35] Anthropic, ‘Claude Opus 4 and Claude Sonnet 4 System Card’.

[36] Anthropic, ‘Anthropic Responsible Scaling Policy, Version 3.0’, Anthropic (24 February 2026), https://www.anthropic.com/responsible-scaling-policy/rsp-v3-0.

[37] Anthropic, ‘Responsible Scaling Policy, Version 3.0’.

[38] Google DeepMind, ‘Gemini 3.1 Pro Model Card’.

[39] Google DeepMind, ‘Gemini 3.1 Pro Model Card’.

[40] Google DeepMind, ‘Gemini 3.1 Pro Model Card’.

[41] Meta AI, ‘System Cards’.

[42] Meta AI, ‘System Cards’.

[43] Team Cohere, ‘Command A’.

[44] OpenAI, ‘GPT-4o System Card’.

[45] OpenAI, ‘GPT-5.2’.

[46] OpenAI, ‘GPT-5.2’.

[47] Anthropic, ‘Claude Opus 4 and Claude Sonnet 4 System Card’.

[48] OpenAI, ‘GPT-4 System Card’.

[49] NIST, AI RMF 1.0.

[50] OpenAI, ‘GPT-4 System Card’.

[51] OpenAI, ‘GPT-4o System Card’.

[52] OpenAI, ‘GPT-4o System Card’.

[53] Anthropic, ‘Claude Opus 4 and Claude Sonnet 4 System Card’.

[54] Anthropic, ‘Responsible Scaling Policy, Version 3.0’.

[55] Anthropic, ‘Responsible Scaling Policy, Version 3.0’.

[56] Google DeepMind, ‘Gemini 3.1 Pro Model Card’.

[57] Google DeepMind, ‘Gemini 3.1 Pro Model Card’.

[58] Meta AI, ‘System Cards’.

[59] Bommasani and others, ‘Transparency Index’.

[60] Rishi Bommasani and others, ‘Foundation Model Transparency Index v1.1’ (2024), arXiv:2407.12929, https://arxiv.org/abs/2407.12929.

[61] NIST, AI RMF 1.0.

[62] NIST, AI RMF 1.0.

[63] OpenAI, ‘GPT-5 System Card’.

[64] Google DeepMind, ‘Gemini 3.1 Pro Model Card’.

[65] Anthropic, ‘Claude Opus 4 and Claude Sonnet 4 System Card’.

[66] OpenAI, ‘GPT-4 System Card’.

[67] Anthropic, ‘Claude Opus 4 and Claude Sonnet 4 System Card’.

[68] Google DeepMind, ‘Gemini 3.1 Pro Model Card’.

[69] OpenAI, ‘GPT-4o System Card’.

[70] Anthropic, ‘Claude Opus 4 and Claude Sonnet 4 System Card’.

[71] OpenAI, ‘GPT-5 System Card’; OpenAI, ‘GPT-5.2’; Google DeepMind, ‘Gemini 3.1 Pro Model Card’.

[72] OpenAI, ‘GPT-4 System Card’; Anthropic, ‘Claude Opus 4 and Claude Sonnet 4 System Card’; Google DeepMind, ‘Gemini 3.1 Pro Model Card’; OpenAI, ‘GPT-5.2’.

Bibliography

Anthropic, ‘Anthropic Responsible Scaling Policy, Version 3.0’, Anthropic (24 February 2026), https://www.anthropic.com/responsible-scaling-policy/rsp-v3-0

Anthropic, ‘Claude Opus 4 and Claude Sonnet 4 System Card’ (May 2025), https://www-cdn.anthropic.com/6be99a52cb68eb70eb9572b4cafad13df32ed995.pdf

Bommasani, Rishi, and others, ‘Foundation Model Transparency Index v1.1’ (2024), arXiv:2407.12929, https://arxiv.org/abs/2407.12929

Bommasani, Rishi, and others, ‘The Foundation Model Transparency Index’ (2023), arXiv:2310.12941, https://arxiv.org/abs/2310.12941

Team Cohere, ‘Command A: An Enterprise-Ready Large Language Model’ (2025), arXiv:2504.00698, https://arxiv.org/abs/2504.00698

Gebru, Timnit, and others, ‘Datasheets for Datasets’ (2018), arXiv:1803.09010, https://arxiv.org/abs/1803.09010

Google DeepMind, ‘Gemini 3.1 Pro Model Card’, Google DeepMind (19 February 2026), https://deepmind.google/models/model-cards/gemini-3-1-pro

Gursoy, Furkan, and Ioannis A. Kakadiaris, ‘System Cards for AI-Based Decision-Making for Public Policy’ (2022), arXiv:2203.04754, https://arxiv.org/abs/2203.04754

Meta AI, ‘System Cards, a New Resource for Understanding How AI Systems Work’, Meta AI (23 February 2022), https://ai.meta.com/blog/system-cards-a-new-resource-for-understanding-how-ai-systems-work/

Mitchell, Margaret, and others, ‘Model Cards for Model Reporting’, in Proceedings of the Conference on Fairness, Accountability, and Transparency (New York: ACM, 2019), pp. 220–29, https://doi.org/10.1145/3287560.3287596

National Institute of Standards and Technology, Artificial Intelligence Risk Management Framework (AI RMF 1.0), NIST AI 100-1 (Gaithersburg, MD: National Institute of Standards and Technology, 2023), https://doi.org/10.6028/NIST.AI.100-1

OpenAI, ‘GPT-4 System Card’ (2023), https://cdn.openai.com/papers/gpt-4-system-card.pdf

OpenAI, ‘GPT-4o System Card’, OpenAI (8 August 2024), https://openai.com/index/gpt-4o-system-card/

OpenAI, ‘GPT-5 System Card’, OpenAI Deployment Safety Hub (7 August 2025), https://deploymentsafety.openai.com/gpt-5

OpenAI, ‘Update to GPT-5 System Card: GPT-5.2’, OpenAI Deployment Safety Hub (11 December 2025), https://deploymentsafety.openai.com/gpt-5-2

OpenAI, ‘OpenAI o3 and o4-mini System Card’, OpenAI (16 April 2025), https://openai.com/index/o3-o4-mini-system-card/

OpenAI, ‘OpenAI o3 and o4-mini System Card’, OpenAI Deployment Safety Hub, https://deploymentsafety.openai.com/o3

Pushkarna, Mahima, Andrew Zaldivar, and Oddur Kjartansson, ‘Data Cards: Purposeful and Transparent Dataset Documentation for Responsible AI’, in Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency (New York: ACM, 2022), pp. 1776–1826, https://doi.org/10.1145/3531146.3533231

Sidhpurwala, Huzaifa, and others, ‘Blueprints of Trust: AI System Cards for End-to-End Transparency and Governance’ (2025), arXiv:2509.20394, https://arxiv.org/abs/2509.20394