Introduction to NIST AI Work: NIST's Role in AI Standards

The National Institute of Standards and Technology (NIST), a U.S. federal agency, has become one of the most influential voices in AI standardization and evaluation. NIST's mandate is to develop standards and best practices—historically for measurement, materials, and technology. In recent years, NIST has dramatically expanded into AI governance.

Key NIST AI initiatives:

  • The AI Risk Management Framework (AI RMF): a general structure for identifying risks, designing mitigations, and monitoring outcomes.
  • ARIA (Assessing Risks and Impacts of AI): independent evaluation of AI systems at scale.
  • CORIX (Core Requirements for AI Excellence): a maturity framework for organizations building AI.
  • The AI Safety Institute: a dedicated home for AI safety research, evaluation methods, and benchmark development.

Why NIST matters: NIST standards often become de facto industry standards. NIST guidance shapes regulatory interpretation. Organizations that align with NIST frameworks are seen as responsible and forward-thinking. Unlike legislation (which is slow), NIST can iterate and evolve standards relatively quickly.

NIST is Not a Regulator

Critical distinction: NIST sets standards and best practices but doesn't enforce them (unlike FDA, FTC, or SEC). However, regulatory bodies often cite NIST frameworks. Alignment with NIST creates a halo of legitimacy and can influence how regulators interpret your compliance obligations.

What Is ARIA? Assessing Risks and Impacts of AI

ARIA is NIST's program for independent evaluation of AI systems at scale. It addresses a fundamental problem: commercial AI systems are deployed with minimal independent scrutiny. Self-evaluation is standard; third-party evaluation is rare.

ARIA goals:

  • Evaluate AI systems independently of the labs that build them.
  • Publish methodology, datasets, and findings so results can be scrutinized and reproduced.
  • Build public trust through transparent, systematic evaluation.

ARIA pilot participants (2024-2025) include: Leading AI labs (participants vary by pilot round), safety-focused organizations, and government agencies. Participation is by invitation based on lab willingness and scope.

Key insight: ARIA is not trying to stop AI deployment. It's creating transparent evaluation processes that build public trust. It says: "Here's what we tested, here are the results, here are the limitations we found."

ARIA's Approach to Evaluation

Capability Evaluation: What can the system do? Measured across dimensions like reasoning, coding, biology knowledge, etc.

Safety Evaluation: What can go wrong? Tested for dangerous capabilities (CBRN knowledge, persuasion, autonomous action), alignment with stated values, robustness to adversarial inputs.

Bias and Fairness Evaluation: Does the system exhibit demographic bias? Tested across protected attributes and diverse contexts.

Robustness Testing: How does the system degrade under adverse conditions (jailbreaks, distribution shift, adversarial prompts)?

Transparency and Documentation: Can we understand how the system works? ARIA assesses documentation quality, explainability, and reproducibility.
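To make these dimensions concrete, here is a minimal sketch of how an evaluation team might record results across them. The names (EvalResult, EvalSuite) and the 0-1 scoring are our own illustration; ARIA publishes methodology, not a code-level schema.

```python
from dataclasses import dataclass, field

# Illustrative only: every name here is an assumption, not ARIA's schema.
DIMENSIONS = ("capability", "safety", "bias_fairness",
              "robustness", "transparency")

@dataclass
class EvalResult:
    dimension: str   # one of DIMENSIONS
    test_name: str   # e.g. "adversarial_prompt_suite" (hypothetical)
    score: float     # normalized to 0.0-1.0
    notes: str = ""  # caveats and known limitations

    def __post_init__(self):
        if self.dimension not in DIMENSIONS:
            raise ValueError(f"unknown dimension: {self.dimension}")

@dataclass
class EvalSuite:
    model_id: str
    results: list[EvalResult] = field(default_factory=list)

    def summary(self) -> dict[str, float]:
        """Mean score per dimension, for dimensions with any results."""
        buckets: dict[str, list[float]] = {d: [] for d in DIMENSIONS}
        for r in self.results:
            buckets[r.dimension].append(r.score)
        return {d: sum(v) / len(v) for d, v in buckets.items() if v}
```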

ARIA's Evaluation Philosophy: Independent, Transparent, Reproducible

ARIA is built on three principles:

1. Independence

Evaluators are independent of the lab being evaluated. This prevents conflicts of interest and builds public trust. Unlike self-evaluation (where the builder assesses their own system), independent evaluation provides external validation.

2. Transparency

ARIA publishes findings publicly (with necessary redactions for sensitive results). The methodology, datasets, and results are open. This allows scrutiny and reproducibility.

3. Reproducibility

ARIA focuses on evaluation approaches that can be replicated by others. This includes releasing datasets, sharing prompts, and documenting procedures in detail.

In practice, this means ARIA does NOT evaluate on proprietary data, use unreplicable procedures, or keep results secret. The goal is evaluation that the AI safety community can build upon.
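In practice, "replicable by others" means pinning down exactly which inputs were used. Below is a minimal sketch of a release manifest that hashes datasets and prompt files so a third party can verify they are re-running the same evaluation; the file layout is hypothetical, and ARIA's actual release format may differ.

```python
import hashlib
import json
from pathlib import Path

def manifest_entry(path: str) -> dict:
    """Record a file's hash so others can verify identical inputs."""
    data = Path(path).read_bytes()
    return {"path": path,
            "sha256": hashlib.sha256(data).hexdigest(),
            "bytes": len(data)}

def write_manifest(input_files: list[str],
                   out_path: str = "eval_manifest.json") -> None:
    # Hypothetical layout, not ARIA's actual release format.
    manifest = {"inputs": [manifest_entry(f) for f in input_files]}
    Path(out_path).write_text(json.dumps(manifest, indent=2))

# Usage (paths are illustrative):
# write_manifest(["data/probes.jsonl", "data/prompts.txt"])
```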

CORIX Framework Overview: Organizational AI Maturity

While ARIA evaluates AI systems, CORIX evaluates the organizations building AI systems. CORIX is a maturity framework assessing an organization's capability to develop safe, trustworthy AI.

CORIX stands for Core Requirements for AI Excellence and covers five dimensions:

  • Governance
  • Risk Management
  • Technical Robustness
  • Transparency
  • Human Oversight

CORIX maturity levels: the framework defines five levels, anchored by Level 1 (Initial: ad hoc, undocumented practices), Level 3 (Defined: documented, organization-wide processes), and Level 5 (Optimized: continuous measurement and improvement). The deep dive below spells out these reference levels for each dimension.

CORIX is similar in spirit to the CMM (Capability Maturity Model) used in software development. It answers: "Is this organization capable of managing AI responsibly?"

ARIA Pilot Programs: What We Learned

ARIA conducted pilot evaluations with leading labs in 2024-2025. Key learnings:

Evaluation Feasibility

Major finding: Independent evaluation of frontier models is operationally feasible. It requires resources and access, but it's not impossible. ARIA demonstrated that evaluators can be trained, protocols can be developed, and meaningful results can be generated.

Dangerous Capability Detection

ARIA specifically tested for dangerous capabilities (CBRN knowledge, autonomous capability, persuasion). Results: state-of-the-art models had measurable capabilities in some of these areas, though not uniformly. This supports the thesis that dangerous capabilities can be detected empirically.

Benchmark Limitations

ARIA confirmed that standard benchmarks (like public leaderboards on Hugging Face) are insufficient for safety evaluation. A model that performs well on benchmarks (or games them) is not necessarily safe; safety requires targeted, adversarially designed tests.

Documentation Gaps

Even for well-resourced labs, documentation was inconsistent and incomplete. Model cards, system prompts, and training data composition were often missing or vague. This suggests documentation is an underinvested area.

Reproducibility Challenges

Reproducing evaluation on non-public models is hard (you need API access and data). This suggests a future where ARIA or successor organizations need funding to conduct independent evaluations repeatedly over time.


Applying ARIA to Enterprise AI: How Organizations Use ARIA

ARIA is primarily designed for frontier model evaluation. But enterprise organizations can apply ARIA principles to their own evaluation programs:

Adopt ARIA Evaluation Dimensions

Use ARIA's framework (capability, safety, bias, robustness, transparency) for your internal models. Even if you're not a frontier lab, these dimensions matter.

Conduct Independent Evaluations

Where possible, bring in external evaluators. Even a small independent review of model behavior (done by someone not on the building team) provides valuable external perspective.

Develop Red-Teaming Programs

ARIA emphasizes adversarial testing. Develop red-teaming protocols similar to ARIA's: how would someone break or misuse this system? What are the failure modes?
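A red-team protocol can start very simply: a library of adversarial probes, a model call, and triage of anything that wasn't refused. A minimal sketch follows, where call_model stands in for your own API client and the refusal heuristic is deliberately crude:

```python
from typing import Callable

# Crude refusal heuristic; real triage needs human review.
REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "i'm not able")

def run_red_team(probes: list[str],
                 call_model: Callable[[str], str]) -> list[dict]:
    """Run each adversarial probe and flag non-refusals for review."""
    findings = []
    for probe in probes:
        reply = call_model(probe)
        refused = reply.strip().lower().startswith(REFUSAL_MARKERS)
        findings.append({"probe": probe,
                         "refused": refused,
                         "reply": reply})
    return findings

# Anything not refused goes to a human reviewer:
# flagged = [f for f in run_red_team(probes, call_model) if not f["refused"]]
```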

Publish Transparency Reports

ARIA publishes findings. Enterprise AI doesn't need to be fully public, but internal transparency reports (shared with leadership and compliance) follow the ARIA principle of documented, transparent evaluation.

Align with NIST AI RMF

The broader NIST AI Risk Management Framework (RMF) is applicable to enterprise AI. It provides a general structure for identifying risks, designing mitigations, and monitoring outcomes.
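The AI RMF organizes work into four functions: Govern, Map, Measure, and Manage. One lightweight way to operationalize it is a risk register whose entries are tagged by function. A sketch follows; the field names are our own convention, not prescribed by the RMF.

```python
from dataclasses import dataclass
from enum import Enum

class RMFFunction(Enum):
    # The four functions of NIST AI RMF 1.0.
    GOVERN = "govern"
    MAP = "map"
    MEASURE = "measure"
    MANAGE = "manage"

@dataclass
class RiskEntry:
    # Field names are our own convention, not prescribed by the RMF.
    risk_id: str
    description: str
    likelihood: int          # 1 (rare) to 5 (frequent)
    impact: int              # 1 (minor) to 5 (severe)
    mitigation: str
    rmf_function: RMFFunction
    owner: str               # accountable person or team

    @property
    def rating(self) -> int:
        """Simple likelihood x impact score for triage ordering."""
        return self.likelihood * self.impact
```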

CORIX Dimensions Deep Dive

Governance
  • Level 1 (Initial): No formal structure; decision-making is ad hoc.
  • Level 3 (Defined): A board or committee oversees AI; clear decision authority; policies documented.
  • Level 5 (Optimized): Governance integrated across the organization; continuous policy refinement based on outcomes.

Risk Management
  • Level 1 (Initial): Risks not systematically identified; hope and luck.
  • Level 3 (Defined): Risk assessment process defined; risks documented for major systems.
  • Level 5 (Optimized): Continuous risk monitoring; risk metrics tracked; proactive risk mitigation.

Technical Robustness
  • Level 1 (Initial): Minimal testing; "does it work?" is the success criterion.
  • Level 3 (Defined): Standard testing protocols; safety, fairness, and robustness evaluated.
  • Level 5 (Optimized): Advanced testing; continuous adversarial testing; failure analysis integrated.

Transparency
  • Level 1 (Initial): Black box; limited documentation; users don't understand the systems.
  • Level 3 (Defined): Model cards exist; documentation adequate; explainability considered.
  • Level 5 (Optimized): Transparency by default; automated documentation; user education programs.

Human Oversight
  • Level 1 (Initial): No systematic human review; systems autonomous by default.
  • Level 3 (Defined): High-stakes decisions have human review; an appeals process exists.
  • Level 5 (Optimized): Adaptive human oversight; risk-based review levels; feedback loops integrated.
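CORIX does not publish a scoring rule, but in practice teams often roll per-dimension levels into a single organizational profile. A minimal sketch follows, assuming a conservative "weakest link" aggregation (our assumption, not CORIX's):

```python
# Illustrative CORIX-style scoring. CORIX does not prescribe an
# aggregation rule; taking the minimum across dimensions is our
# (conservative) assumption.

DIMENSIONS = ["governance", "risk_management", "technical_robustness",
              "transparency", "human_oversight"]

def overall_maturity(levels: dict[str, int]) -> int:
    """Overall maturity is capped by the weakest dimension (levels 1-5)."""
    missing = [d for d in DIMENSIONS if d not in levels]
    if missing:
        raise ValueError(f"unassessed dimensions: {missing}")
    if not all(1 <= levels[d] <= 5 for d in DIMENSIONS):
        raise ValueError("levels must be integers from 1 to 5")
    return min(levels[d] for d in DIMENSIONS)

# Example: strong governance can't compensate for weak oversight.
print(overall_maturity({"governance": 4, "risk_management": 3,
                        "technical_robustness": 3, "transparency": 3,
                        "human_oversight": 2}))  # -> 2
```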

ARIA vs. EU AI Act vs. ISO 42001: How They Relate

ARIA is a program/methodology. It provides evaluation approaches and conducts audits of frontier models.

EU AI Act is regulation. It requires compliance from any AI system sold in the EU, creates legal obligations, and enforces via fines.

ISO 42001 is a published management system standard (ISO/IEC 42001:2023, released in December 2023). It provides a framework for AI management systems, similar to ISO 27001 for information security.

Overlap and Relationships

  • Evaluation methodologies: ARIA provides specific evaluation approaches; the EU AI Act requires evaluation but doesn't mandate methodology; ISO 42001 defines management system requirements rather than a methodology.
  • Risk frameworks: ARIA focuses on model capabilities and safety; the EU AI Act focuses on system risk categories (high-risk, limited risk, etc.); ISO 42001 provides a process framework.
  • Documentation: All three require documentation. ARIA publishes it; the EU AI Act requires it for compliance; ISO 42001 standardizes what to document.
  • Scope: ARIA covers frontier models only; the EU AI Act applies to any AI system sold in the EU; ISO 42001 applies to organizations building AI systems.

For organizations: You need all three perspectives. ARIA methodologies help you evaluate systems rigorously. The EU AI Act (if you sell in the EU) sets compliance requirements. ISO 42001 provides the management system framework.

Using ARIA/CORIX for Compliance Documentation

What Evidence to Collect

  • Evaluation results: Benchmarks, safety tests, bias audits, performance across demographic groups.
  • Red-teaming reports: Attempts to break/misuse the system, results, mitigations.
  • Model documentation: Architecture, training data, known limitations, intended use cases.
  • Risk assessments: Documented identification of potential harms, risk ratings, mitigation strategies.
  • Governance decisions: Who approved deployment? What conditions? What review happens periodically?
  • User communication: How users are informed of system limitations and capabilities.
  • Incident reports: When the system failed or performed unexpectedly, what happened? What was learned?
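A simple way to keep this evidence auditable is a machine-readable index recording what was collected and when. A sketch follows, with categories mirroring the list above and a hypothetical file layout:

```python
import json
import time
from pathlib import Path

# Categories mirror the evidence list above.
EVIDENCE_KINDS = ["evaluation_results", "red_team_reports", "model_docs",
                  "risk_assessments", "governance_decisions",
                  "user_communication", "incident_reports"]

def register_evidence(index_path: str, kind: str,
                      artifact: str, note: str = "") -> None:
    """Append one evidence record so reviews can trace artifacts over time."""
    if kind not in EVIDENCE_KINDS:
        raise ValueError(f"unknown evidence kind: {kind}")
    index = Path(index_path)
    records = json.loads(index.read_text()) if index.exists() else []
    records.append({"kind": kind, "artifact": artifact,
                    "note": note, "recorded_at": time.strftime("%Y-%m-%d")})
    index.write_text(json.dumps(records, indent=2))

# Usage (paths are illustrative):
# register_evidence("evidence_index.json", "red_team_reports",
#                   "reports/2025-q1-redteam.pdf", "pre-deployment review")
```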

How to Structure Documentation for Compliance

Model Card: High-level description of the model, intended use, evaluation results, limitations. (Following Hugging Face/Google model card format.)

System Card: Broader system documentation including deployment context, monitoring plan, user guidelines.

Risk Assessment: Formal documentation of identified risks, likelihood/impact ratings, and mitigation strategies.

Evaluation Report: Detailed results from safety testing, fairness testing, robustness testing with methodology explanation.

Governance Minutes: Documented decisions about system development and deployment, who participated, what concerns were raised.
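As a starting point, here is a minimal model card skeleton in the spirit of the Hugging Face/Google format mentioned above; the fields shown are a common subset, not an official schema, and all values are placeholders.

```python
# Minimal model card skeleton; fields are a common subset of the
# Hugging Face/Google format, not an official schema.
MODEL_CARD = {
    "model_name": "example-classifier-v1",   # hypothetical
    "intended_use": "Internal document triage; not for consumer-facing decisions.",
    "training_data": "Summary of sources, composition, and collection dates.",
    "evaluation": {
        "capability": "accuracy/F1 on a held-out test set",
        "safety": "refusal and misuse test results",
        "bias_fairness": "metrics broken down by demographic group",
        "robustness": "behavior under adversarial and shifted inputs",
    },
    "limitations": "Known failure modes and out-of-scope uses.",
    "contact": "owning-team@example.com",    # hypothetical
}
```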

These documents serve dual purposes: they help you understand and improve your systems (operational), and they provide evidence of responsible practices (compliance).

Documentation is Not Optional

In regulatory or liability disputes, documentation is your evidence of responsibility. "We tested it" without records means nothing. "We tested it, here are the results, here's what we did about the findings" is what regulators and courts want to see. Document as you go, not retrospectively.

The Future of NIST AI Evaluation: What's Coming

NIST AI Safety Institute

NIST has established a dedicated institute focused specifically on AI safety research, evaluation methods, and benchmark development. Think of it as NIST's long-term commitment to AI evaluation as a research and standards area.

Expanded ARIA Coverage

Future ARIA iterations will likely expand beyond frontier models to cover domain-specific models (healthcare, finance, etc.), open-source models, and broader evaluation of deployed systems.

AI Evaluation Standards

NIST is working on standardized evaluation protocols that can be reused, adapted, and compared across studies. This enables accumulation of knowledge about model behavior.

International Harmonization

NIST works with equivalent bodies worldwide (EU's AI Office, UK's AISI, etc.) to align on evaluation standards. The long-term goal: "If you meet NIST standards, you meet international expectations."

Enforcement Mechanisms

Current NIST frameworks are voluntary. Future may involve incentives (e.g., liability protection for systems that meet NIST standards) or enforcement mechanisms (if regulation incorporates NIST standards).

Key Takeaways: NIST ARIA and CORIX

  • ARIA is evaluation methodology. It provides frameworks, conducts independent audits of frontier models, and publishes findings. It's about transparency and building trust through systematic evaluation.
  • CORIX is organizational assessment. It assesses whether an organization is capable of developing safe, trustworthy AI. It's about evaluating the evaluator.
  • Both are voluntary now, but influential. NIST frameworks often become de facto standards. Organizations that align with NIST are seen as responsible and forward-thinking.
  • Apply ARIA principles to your evaluation. Even if you're not a frontier lab, ARIA's evaluation dimensions (capability, safety, bias, robustness, transparency) are valuable for any AI system.
  • Document everything. Compliance depends on documentation. Keep records of evaluation results, risk assessments, governance decisions, and incidents. These are your evidence of responsibility.
  • Expect NIST to expand. AI Safety Institute, broader coverage, enforcement mechanisms—NIST's role in AI governance will grow.