The Compliance Imperative for AI Eval

Regulators worldwide are mandating AI evaluation as a compliance requirement, not merely a best practice. The EU AI Act requires high-risk AI systems to undergo rigorous evaluation before deployment. The SEC expects public companies to evaluate AI systems for material risks. The Federal Reserve's SR 11-7 guidance requires banks to validate the models behind AI-driven decisions. The FTC increasingly scrutinizes AI product claims for accuracy and bias. Complying with these regimes is impossible without a robust evaluation framework.

For risk officers and compliance professionals, this creates an urgent imperative: you must develop evaluation expertise or delegate to teams that have it, but you cannot ignore evaluation governance. The financial penalties for non-compliance are enormous: the EU AI Act allows fines of up to 7% of global annual turnover or €35 million, whichever is higher, for the most serious violations. For a company with €1 billion in revenue, that's €70 million. Even lesser violations can draw fines of up to €15 million or 3% of turnover. These are not theoretical numbers.

This track teaches evaluation as a compliance and risk management discipline. You'll learn how to:

  • Map each major regulation to specific evaluation requirements
  • Implement model risk management frameworks that satisfy regulators
  • Document evaluation rigorously for audit purposes
  • Assess third-party AI vendors for compliance with your standards
  • Report AI risks to boards and executives in language they understand
  • Detect and respond to AI incidents using evaluation evidence

The organizations most prepared for regulatory scrutiny are not those with the best AI. They're the ones with the most rigorous evaluation programs that can prove quality and safety through documentation and evidence.

Regulatory Risk

Companies deploying AI without documented evaluation face enforcement action, fines, and reputational damage. Regulators are already investigating AI systems. Having evaluation evidence is a competitive advantage; lacking it is existential risk.

Compliance Track Curriculum

The Compliance and Risk Track contains eight modules specifically designed to address regulatory and governance requirements for AI evaluation. These modules focus on documentation, frameworks, and evidence portfolios that regulators, auditors, and boards require.

  • Module 1: Regulatory Mapping. Translating regulations into specific eval requirements and gap analysis. (EU AI Act, NIST AI RMF, ISO/IEC 42001, SEC guidance, Federal Reserve SR 11-7)
  • Module 2: Model Risk Management. SR 11-7 validation, tier classification, independent validation, back-testing. (Federal Reserve SR 11-7, bank model governance)
  • Module 3: Audit Documentation. Evidence portfolios, documentation standards, version control for audit purposes. (SOC 2, ISO 27001, audit standards)
  • Module 4: Risk-Based Eval Design. Tailoring evaluation intensity to risk level, resource allocation for compliance. (Risk management frameworks, proportionality principle)
  • Module 5: Third-Party AI Vendor Assessment. Due diligence questions, SLA requirements, ongoing vendor monitoring. (Vendor management, procurement standards)
  • Module 6: Board Risk Reporting. Translating eval results into board-level risk language, KRI definitions. (Governance best practices, board oversight)
  • Module 7: Incident Response. Using eval to detect, triage, and document AI incidents; post-incident reviews. (Incident reporting, post-mortems)
  • Module 8: Global Compliance. EU vs. US vs. UK vs. Singapore regulations, compliance for global deployments. (Multi-jurisdiction compliance strategy)

Total curriculum time: 14-18 weeks for full mastery. Many organizations take modules sequentially as they encounter specific regulatory requirements (often starting with Module 1: Regulatory Mapping).

Regulatory Mapping to Eval Requirements

The first step in compliance evaluation is understanding what each regulation actually requires. This sounds straightforward but is remarkably complex because regulations use different terminology and frame requirements differently.

The EU AI Act defines high-risk AI systems and requires them to undergo conformity assessment, including evaluation for accuracy, robustness, and cybersecurity (Article 15). Article 9 requires a risk management system. Article 10 requires data governance and quality. Article 11 requires technical documentation, including evaluation methodologies and results. Article 13 requires transparency about limitations. Article 17 requires a quality management system. Each article translates to specific evaluation activities.
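One concrete way to operationalize this mapping is to capture the articles in a lookup table and check them against the evaluation activities you can evidence. The article summaries, activity names, and the `gap_analysis` helper below are illustrative assumptions for demonstration, not legal interpretation:

```python
# Illustrative sketch: mapping EU AI Act articles to evaluation activities
# for gap analysis. Summaries are simplified assumptions, not legal advice.

EU_AI_ACT_EVAL_MAP = {
    "Article 9": "risk management system with iterative risk evaluation",
    "Article 10": "data governance checks and dataset quality evaluation",
    "Article 11": "technical documentation of eval methodology and results",
    "Article 13": "documented transparency about known limitations",
    "Article 17": "quality management system covering eval processes",
}

def gap_analysis(completed_activities):
    """Return the articles whose mapped eval activity is not yet evidenced."""
    return sorted(
        article
        for article, activity in EU_AI_ACT_EVAL_MAP.items()
        if activity not in completed_activities
    )
```

Feeding in the activities your evidence portfolio already covers yields the remaining compliance gaps, which is exactly the output a gap analysis report needs.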

The NIST AI Risk Management Framework provides a governance approach built on four functions: Govern (establish accountability and risk culture), Map (identify the AI system and its context), Measure (assess AI system performance and risks), and Manage (prioritize risks and implement controls). Within each function, evaluation requirements emerge: governing requires an evaluation policy, mapping requires hazard analysis, measuring requires performance benchmarking, and managing requires control validation and continuous monitoring over time.

ISO/IEC 42001 (AI Management Systems) requires organizations to establish processes for AI system governance, including evaluation. Unlike the prescriptive EU AI Act, ISO/IEC 42001 is principles-based: you must evaluate AI systems, but the specific evaluation methodology is your choice, provided you can justify it.

SEC guidance on AI risk disclosure expects public companies to disclose material AI risks, including model performance limitations, data quality issues, and potential for bias. This requires evaluation evidence to support disclosures.

SR 11-7, issued by the Federal Reserve (and adopted by the OCC as Bulletin 2011-12), specifically addresses bank model governance, including AI models. It requires independent validation of models before deployment, ongoing monitoring, documentation, and governance review. For AI systems, this translates to specific evaluation checkpoints and documentation requirements.

  • 47 specific eval requirements extracted from EU AI Act Articles 9-17
  • 6 major regulations requiring explicit eval documentation
  • 84% of extracted eval requirements overlap across multiple regulations
  • 16% of requirements are unique to specific regulations

Model Risk Management and Eval

Model risk management, pioneered in banking, is becoming the standard governance framework for AI systems. The key insight is that AI models are assets with quantifiable risk. You can measure, classify, and mitigate that risk through evaluation.

SR 11-7-style model risk management typically classifies models into three tiers based on complexity, criticality, and risk. Tier 1 models are critical (loan origination, fraud detection, market risk) and require the highest validation standards. Tier 2 models are important but less critical. Tier 3 models are lower risk. Each tier triggers different evaluation requirements:

  • Tier 1 (High Risk): Independent validation before deployment, quarterly performance monitoring, back-testing required, model change documentation, governance review, stress testing
  • Tier 2 (Medium Risk): Validation before deployment, semi-annual monitoring, annual governance review, documentation requirements
  • Tier 3 (Lower Risk): Validation before deployment, annual monitoring, documentation requirements

This framework applies to AI systems. A predictive AI system used for critical business decisions (credit approval, fraud detection, hiring) is Tier 1 and requires rigorous evaluation. An AI system used for non-critical decisions is lower tier.
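The tiering logic above can be sketched as a small classification function. The criticality labels, the `classify_tier` rule, and the requirement sets are illustrative assumptions; real tiering policy comes from your model risk management framework:

```python
# Hedged sketch of SR 11-7-style tier classification. Criteria and the
# requirement sets are illustrative, not a regulatory standard.

REQUIREMENTS = {
    1: {"independent validation", "quarterly monitoring", "back-testing",
        "stress testing", "change documentation", "governance review"},
    2: {"pre-deployment validation", "semi-annual monitoring",
        "annual governance review", "documentation"},
    3: {"pre-deployment validation", "annual monitoring", "documentation"},
}

def classify_tier(criticality: str, decision_impact: str) -> int:
    """Map a model's criticality and decision impact to a risk tier."""
    if criticality == "critical" or decision_impact == "customer-affecting":
        return 1
    if criticality == "important":
        return 2
    return 3

def required_evaluation(criticality: str, decision_impact: str) -> set:
    """Return the evaluation obligations triggered by the assigned tier."""
    return REQUIREMENTS[classify_tier(criticality, decision_impact)]
```

Encoding the policy this way makes classification decisions reproducible and easy to document for governance review.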

Back-testing is a specific evaluation technique central to banking model governance. You deploy a model, track its predictions vs. actual outcomes, and compare performance to historical benchmarks. If performance degrades, you escalate to governance for review. This is continuous evaluation post-deployment.
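A minimal back-testing check along these lines, assuming a simple accuracy metric and an illustrative degradation tolerance:

```python
# Minimal back-testing sketch: compare post-deployment accuracy against a
# historical benchmark and flag escalation when it degrades beyond a
# tolerance. The metric and the 5-point tolerance are illustrative.

def backtest(predictions, actuals, benchmark_accuracy, tolerance=0.05):
    """Return (accuracy, escalate) for one back-testing window."""
    if not predictions or len(predictions) != len(actuals):
        raise ValueError("predictions and actuals must be equal-length, non-empty")
    correct = sum(p == a for p, a in zip(predictions, actuals))
    accuracy = correct / len(predictions)
    escalate = accuracy < benchmark_accuracy - tolerance
    return accuracy, escalate
```

Run over each monitoring window, an `escalate=True` result is what triggers the governance review described above.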

Independent validation means evaluation conducted by someone who did not develop the model. Internal development teams are often optimistic about their models. Independent validators provide skeptical scrutiny, catching problems developers missed.

Audit-Ready Eval Documentation

Auditors (internal audit, external audit, regulatory examiners) will eventually ask about your evaluation processes. Preparing audit-ready documentation is not about creating bureaucracy—it's about proving you evaluated responsibly and made decisions with eyes open to risks.

What auditors want to see:

  • Evidence portfolio: Organized documentation of evaluation methodology, datasets used, evaluation results, confidence intervals, limitations, and how results informed decisions
  • Dataset documentation: What data was used for evaluation? How was it collected? What are known biases or limitations? How representative is it of production data?
  • Methodology documentation: How exactly did you evaluate the system? What metrics? What sample size? What statistical tests? Why these choices?
  • Results and limitations: What did evaluation show? What are confidence intervals? Where does the model fail? What are failure rates by segment?
  • Risk assessment: How did you assess risk based on evaluation results? What risks did you accept vs. mitigate?
  • Decision documentation: Who approved deployment and based on what evaluation evidence? What conditions trigger re-evaluation?
  • Version control: As models and evaluation methodologies evolve, documentation needs versioning. What changed? When? Why?
  • Ongoing monitoring: What metrics do you monitor post-deployment? What triggers escalation? What's your monitoring cadence?

Good audit documentation tells a clear story: "We evaluated the system rigorously, understood its limitations, made an informed decision to deploy, and we're monitoring its performance continuously. Here's the evidence."
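One way to keep that story assembled is a versioned evidence record covering the checklist above. The field names and the crude completeness check below are illustrative assumptions, not an audit standard:

```python
# Sketch of a versioned evidence-portfolio record. Field names are
# illustrative assumptions; align them with your audit framework.

from dataclasses import dataclass
from datetime import date

@dataclass
class EvalEvidence:
    system_name: str
    version: str
    methodology: str           # metrics, sample size, statistical tests
    dataset_description: str   # provenance, known biases, representativeness
    results_summary: str       # headline metrics with confidence intervals
    limitations: list          # known failure modes, failure rates by segment
    risks_accepted: list       # risks accepted vs. mitigated, with rationale
    approved_by: str           # who signed off on deployment
    approval_date: date
    reeval_triggers: list      # conditions that force re-evaluation
    monitoring_plan: str       # metrics, cadence, escalation thresholds

    def is_audit_ready(self) -> bool:
        """Crude completeness check: every field must be populated."""
        return all(bool(getattr(self, f)) for f in self.__dataclass_fields__)
```

Versioning these records alongside the model (one record per model version) gives auditors the change history the checklist asks for.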

Third-Party AI Vendor Assessment

Many organizations buy AI capabilities from vendors rather than building in-house. Vendor AI systems (APIs, SaaS products, purchased models) still require evaluation before deployment into critical workflows. Your evaluation standards for vendor systems cannot be lower than for internal systems; the risk exposure is the same.

Vendor evaluation includes:

  • Vendor evaluation documentation: What evaluation has the vendor conducted? On what data? With what results? Ask for evidence, not marketing claims.
  • Custom evaluation on your data: Vendor evaluation uses vendor data. You need to evaluate the system on data representative of your actual use case. Request a trial or pilot program for testing.
  • SLA requirements: What performance guarantees does the vendor provide? Accuracy SLAs? Availability SLAs? Latency SLAs? Get commitments in contracts with financial penalties for breach.
  • Transparency commitments: Will the vendor provide evaluation methodology? Results? Limitations? Some vendors are transparent; others refuse. Transparency should be a vendor selection criterion.
  • Audit rights: Ensure your contract allows you to audit vendor evaluation claims and potentially conduct independent evaluation.
  • Ongoing monitoring: The vendor's evaluation was done once. You need ongoing monitoring of vendor system performance post-deployment. This requires instrumentation and defined escalation procedures.
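The SLA and ongoing-monitoring points above can be operationalized as a periodic check of observed metrics against contracted commitments. The SLA values and metric names below are illustrative assumptions, not terms from any real contract:

```python
# Sketch of ongoing vendor SLA monitoring: compare observed metrics for a
# window against contracted commitments and list breaches. Values are
# illustrative placeholders for real contract terms.

VENDOR_SLA = {
    "accuracy": 0.95,        # contractual minimum
    "availability": 0.999,   # contractual minimum
    "p95_latency_ms": 500,   # contractual maximum
}

def sla_breaches(observed: dict) -> list:
    """Return the names of SLA terms breached in this monitoring window."""
    breaches = []
    if observed["accuracy"] < VENDOR_SLA["accuracy"]:
        breaches.append("accuracy")
    if observed["availability"] < VENDOR_SLA["availability"]:
        breaches.append("availability")
    if observed["p95_latency_ms"] > VENDOR_SLA["p95_latency_ms"]:
        breaches.append("p95_latency_ms")
    return breaches
```

A non-empty breach list is what feeds the contractual escalation and penalty clauses negotiated up front.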

Standard vendor assessment questions to ask before deploying third-party AI:

  1. What evaluation has the vendor conducted, and can they provide methodology and results?
  2. Can we conduct a pilot program on our actual use case to evaluate performance?
  3. Will the vendor commit to performance SLAs in our contract?
  4. How does the vendor detect and respond to performance degradation?
  5. What data governance safeguards does the vendor have?
  6. Can we audit vendor evaluation practices?

Board and Executive AI Risk Reporting

Boards and executives need to understand AI risks without getting lost in evaluation technical details. This requires translating evaluation results into risk language they understand: probability of harm, magnitude of potential impact, and management actions to mitigate risk.

Key Risk Indicators (KRIs) for AI systems typically include:

  • Accuracy degradation: Is model performance declining? Trigger for investigation and potential remediation.
  • Fairness metrics: Is the model making biased decisions across demographic groups? Trigger for fairness review.
  • Error incident rate: How often does the model make critical errors? Trigger for escalation if exceeding threshold.
  • Distribution shift detection: Is the input distribution changing, suggesting the model may be less effective? Trigger for re-evaluation.
  • Vendor performance: If using vendor AI, are they meeting SLA commitments?
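These KRIs can be checked against escalation thresholds each reporting period. The threshold values, breach directions, and metric names below are illustrative assumptions; real thresholds come from your risk appetite statement:

```python
# Illustrative KRI dashboard check: compare each indicator against its
# escalation threshold. Values and directions are assumptions.

KRI_THRESHOLDS = {
    # name: (threshold, direction) -- breach when value crosses that way
    "accuracy": (0.90, "below"),
    "fairness_gap": (0.05, "above"),        # max metric gap across groups
    "critical_error_rate": (0.01, "above"),
    "distribution_shift": (0.20, "above"),  # e.g. a PSI-style drift score
}

def breached_kris(snapshot: dict) -> list:
    """Return the KRIs whose current value breaches its escalation threshold."""
    breaches = []
    for name, value in snapshot.items():
        threshold, direction = KRI_THRESHOLDS[name]
        if (direction == "below" and value < threshold) or \
           (direction == "above" and value > threshold):
            breaches.append(name)
    return sorted(breaches)
```

The sorted breach list is exactly the exception summary that belongs in the quarterly board report.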

Boards expect quarterly or annual AI risk reports summarizing: (1) AI systems deployed and their risk classification, (2) key risk indicators and their trends, (3) any incidents or near-misses, (4) remediation actions underway, (5) budget allocated to AI governance.

Executive-level reporting should emphasize: "Here's what could go wrong with our AI systems. Here's how likely it is. Here's what we're doing about it. Here's what the board should know about our AI risk posture."

AI Incident Response Through Eval

When an AI system makes a critical error, your evaluation framework informs how you respond. A model that confidently makes wrong medical recommendations, or fails to detect fraud, or incorrectly denies credit: these incidents require investigation. Evaluation helps you understand what happened and how to prevent recurrence.

AI incident response process:

  • Detection: Someone identifies the error. Effective monitoring systems catch this early; weak systems don't catch it until customers complain.
  • Triage: How severe is the incident? Critical (business-impacting, customer-harming) vs. minor (edge case, no impact)? What's the scope (how many decisions were affected)?
  • Immediate response: For critical incidents, disable the model or implement safeguards immediately. For less critical incidents, document and plan response.
  • Root cause investigation: Using evaluation methodology, investigate why the error occurred. Is it a fundamental model limitation? Distribution shift? Data quality issue? Implementation bug?
  • Documentation: Detailed incident report including what happened, root cause, impact, and remediation actions. This report may be required by regulators.
  • Remediation: Fix the issue. Retrain the model? Change the evaluation dataset? Implement additional guardrails? Add human review for certain decisions?
  • Post-incident evaluation: Re-run comprehensive evaluation to ensure remediation was successful and the system is safe to re-deploy.
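The triage step above can be sketched as a severity rule mapping impact and scope to an immediate action. The severity categories, the 1,000-decision cutoff, and the action strings are illustrative assumptions:

```python
# Sketch of AI incident triage: severity from customer impact and scope,
# then the immediate action. Categories and cutoffs are assumptions.

def triage(customer_harming: bool, decisions_affected: int):
    """Return (severity, immediate_action) for a detected AI incident."""
    if customer_harming or decisions_affected >= 1000:
        return "critical", "disable model or add safeguards immediately"
    if decisions_affected > 0:
        return "moderate", "document, plan remediation, notify owners"
    return "minor", "log as edge case and fold into next evaluation cycle"
```

Recording the triage inputs and output in the incident report gives regulators the paper trail they expect.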

Incident documentation is critical because regulators expect organizations to report serious AI incidents. The EU AI Act requires reporting of serious incidents to regulators. Under HIPAA, healthcare organizations must report breaches of protected health information, which can include AI-driven errors that expose patient data. Under financial regulations, material errors must be reported. Evaluation-based incident investigation demonstrates that you took the issue seriously and responded rigorously.

Global Regulatory Patchwork

Global organizations face a fragmented regulatory landscape. The EU AI Act applies to EU operations. The UK has its own AI Framework. Singapore is developing AI governance standards. China has its own AI safety requirements. The US has sector-specific regulation (FTC, SEC, OCC) but no unified AI law (yet). Compliance requires navigating this patchwork.

For organizations operating across multiple jurisdictions, the practical approach is building compliance to the most stringent standard (usually EU AI Act) and then scaling back for less stringent jurisdictions. This "compliance upward" approach ensures you meet all requirements simultaneously rather than maintaining different processes for different regions.

However, some jurisdictions have unique requirements. China requires AI systems to undergo security review before deployment. Singapore requires transparency in automated decision systems affecting individuals. The UK's AI Framework emphasizes proportionality and risk-based governance. Understanding these differences is essential for truly global compliance.

Multi-Jurisdictional Strategy

Build evaluation and documentation practices that satisfy the most stringent requirement (EU AI Act), document multi-jurisdictional compliance in your evidence portfolio, and adjust for jurisdiction-specific requirements. This is more efficient than maintaining completely separate processes.

Compliance Track Assessment

The Compliance and Risk Track assessment focuses on applied governance and documentation skills:

Assessment 1: Regulatory Gap Analysis (30%) — You receive a company profile and a list of AI systems deployed. You map each system against regulatory requirements (EU AI Act, NIST AI RMF, your jurisdiction's standards). You identify compliance gaps: "This system meets Article 9 requirements but fails Article 17 (quality management system requirements)." You propose remediation: what evaluation or documentation is needed to achieve compliance.

Assessment 2: Model Risk Classification (30%) — You receive descriptions of AI systems being deployed. You classify each by risk tier (Tier 1/2/3) and specify required evaluation for that tier. You document justification for your classification. You specify back-testing requirements, monitoring cadence, and governance review frequency.

Assessment 3: Audit Documentation Portfolio (40%) — You create comprehensive audit documentation for a deployed AI system. This includes: evaluation methodology, datasets, results, limitations, confidence intervals, failure modes, risk assessment, deployment decision documentation, monitoring plan, and incident response procedures. Your documentation must be clear enough that an external auditor could review it and understand your evaluation practices without asking questions.

All assessments require demonstrating understanding that compliance is about governance, documentation, and defensibility—not just having technically rigorous evaluation.