AI Governance vs. AI Evaluation: What's the Difference?
Evaluation measures whether AI systems perform as intended. Governance is the institutional framework ensuring evaluation happens, its results are acted upon, and accountability is maintained. Think of it this way: evaluation is the measurement layer. Governance is the institutional layer that makes measurement coherent and effective.
Evaluation without governance: Teams measure things in isolation. Some teams rigorously evaluate; others don't. Results sit in reports unread. No one enforces standards. Measurement changes nothing.
Governance without evaluation: Bureaucracy that measures nothing. Committees meet, policies exist, but they're divorced from actual measurement. This is box-ticking governance — all structure, no substance.
Governance WITH evaluation: Policies require evaluation. Committees review eval results and make decisions. Standards ensure eval is rigorous and consistent. Accountability flows from measurement.
The AI Governance Framework Components
AI Inventory
Every deployed AI system documented: name, deployment context (internal/customer-facing/critical), domain, autonomy level. This is foundational. You can't govern what you don't track.
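An inventory entry can be as simple as a typed record. Here is a minimal sketch in Python; the field names, enum values, and the `owner` field are illustrative assumptions, not a prescribed schema.

```python
from dataclasses import dataclass
from enum import Enum

class DeploymentContext(Enum):
    INTERNAL = "internal"
    CUSTOMER_FACING = "customer-facing"
    CRITICAL = "critical"

@dataclass
class AISystemRecord:
    """One entry in the AI inventory (illustrative fields)."""
    name: str
    context: DeploymentContext
    domain: str           # e.g. "customer support", "healthcare"
    autonomy_level: int   # 0 = human decides everything, 3 = fully autonomous
    owner: str            # the named accountable person

# A hypothetical inventory with one system documented.
inventory = [
    AISystemRecord("support-chatbot", DeploymentContext.CUSTOMER_FACING,
                   "customer support", autonomy_level=1, owner="j.doe"),
]
```

Even this much structure makes "you can't govern what you don't track" actionable: the inventory becomes a queryable list rather than tribal knowledge.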
Risk Stratification
Classify systems by risk tier. High-risk systems get stringent evaluation and monitoring. Low-risk systems get lighter governance. Resource constraints are real; strategic risk stratification optimizes governance effort.
Risk tier criteria: autonomy level (how much can the AI decide without human intervention?), reversibility of decisions (how hard are they to undo?), population affected (how many users?), and regulated domain (healthcare, finance, etc.).
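The four criteria can be combined into a tier with a simple scoring rule. This is a sketch under assumed weights and thresholds; any real stratification scheme would calibrate these to the organization's risk appetite.

```python
def risk_tier(autonomy: int, irreversible: bool,
              users_affected: int, regulated_domain: bool) -> int:
    """Map the four criteria to a tier (1 = highest risk).

    Weights and cutoffs are illustrative assumptions, not prescriptions.
    """
    score = autonomy  # 0-3 scale, as in the inventory
    score += 2 if irreversible else 0
    if users_affected > 10_000:
        score += 2
    elif users_affected > 100:
        score += 1
    score += 3 if regulated_domain else 0
    if score >= 6:
        return 1  # stringent evaluation and monitoring
    if score >= 3:
        return 2
    return 3      # lighter governance
```

A fully autonomous, irreversible, regulated system with a large population lands in Tier 1; a small internal tool lands in Tier 3.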
Policy Documents
Clear policies covering: model selection (which models can be used?), data governance (how is training data sourced and maintained?), deployment authorization (who approves production deployment?), ongoing monitoring (what metrics must be tracked?), incident response (what constitutes an AI incident?)
Standards
Technical standards for eval methodology (CS-001 through CS-004 in the eval.qa framework). These ensure consistency across teams and domains.
Processes
Regular review cycles (when is eval done?), escalation paths (if eval uncovers problems, who decides remediation?), exception handling (when can policies be waived?)
Accountability
Named owners for each AI system. Clear escalation chains. Transparent decision-making. "Who approved this deployment?" must have a clear answer.
Audit Trails
Immutable logs of: evaluation results, deployment decisions, changes to models or data, incident reports. Critical for regulatory compliance and post-mortem analysis.
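One common way to make a log tamper-evident is hash chaining: each entry records a hash of the previous entry, so any later edit breaks the chain. The sketch below is a minimal in-memory illustration of that idea (a production trail would also need durable storage and access controls).

```python
import hashlib
import json
import time

class AuditTrail:
    """Tamper-evident append-only log: each entry hashes its predecessor."""

    def __init__(self):
        self.entries = []

    def append(self, event_type: str, payload: dict) -> dict:
        prev_hash = self.entries[-1]["hash"] if self.entries else "0" * 64
        body = {"type": event_type, "payload": payload,
                "ts": time.time(), "prev": prev_hash}
        body["hash"] = hashlib.sha256(
            json.dumps(body, sort_keys=True).encode()).hexdigest()
        self.entries.append(body)
        return body

    def verify(self) -> bool:
        """Recompute every hash; any mutation anywhere returns False."""
        prev = "0" * 64
        for entry in self.entries:
            body = {k: v for k, v in entry.items() if k != "hash"}
            expected = hashlib.sha256(
                json.dumps(body, sort_keys=True).encode()).hexdigest()
            if entry["prev"] != prev or entry["hash"] != expected:
                return False
            prev = entry["hash"]
        return True
```

This is exactly the property auditors and post-mortems rely on: results and decisions cannot be quietly rewritten after the fact.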
Risk Classification Frameworks
EU AI Act Risk Tiers
Unacceptable Risk: Banned. Example: AI systems for social scoring.
High Risk: Subject to strict requirements. Examples: hiring, credit decisions, medical diagnosis, law enforcement. Requires impact assessments, human oversight, clear documentation.
Limited Risk: Transparency obligations. Example: chatbots must disclose they're AI.
Minimal Risk: No requirements.
NIST AI RMF Risk Categories
The NIST AI RMF does not prescribe risk tiers; instead it categorizes risks (performance, security, resilience, privacy, fairness, accountability, transparency) and asks organizations to assess their risk tolerance for each category.
eval.qa Internal Classification
Tier 1 (Critical): Mission-critical, regulated domains, large user populations. Examples: core product recommendation, compliance systems.
Tier 2 (Operational): Customer-facing, operational impact. Examples: customer support chatbot.
Tier 3 (Low-Stake): Internal tools, limited impact. Examples: internal documentation search.
Policy Architecture for AI Governance
Model Governance Policy
Defines: which AI models can be used, approval process for new models, model update procedures, vendor management for third-party models. Example: "Only models with documented training data and third-party safety audit approval can be deployed to production."
Data Governance for AI
Training data lineage, PII handling, data retention, handling of biased or problematic data. Example: "All training data must be documented with source, date, and any known limitations. Biased data subsets must be documented and handled explicitly."
Evaluation Policy
Minimum eval requirements before deployment, evaluation cadence in production, when to halt updates. Example: "Tier 1 systems require 80+ hours of evaluation before deployment; Tier 2 systems require 30+ hours. Evals must cover core functionality, edge cases, and adversarial scenarios."
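A policy like this only bites if it is enforced as a gate. A minimal sketch of such a gate follows; the Tier 1 and Tier 2 hour minimums are the quoted policy, while the Tier 3 minimum and the coverage-category names are assumptions for illustration.

```python
# tier -> minimum evaluation hours; the Tier 3 value is an assumed example
MIN_EVAL_HOURS = {1: 80, 2: 30, 3: 8}

# assumed coverage categories from the example policy
REQUIRED_COVERAGE = {"core", "edge_cases", "adversarial"}

def may_deploy(tier: int, eval_hours: float, coverage: set[str]) -> bool:
    """Return True only if the evaluation meets policy for this tier."""
    enough_hours = eval_hours >= MIN_EVAL_HOURS[tier]
    full_coverage = REQUIRED_COVERAGE <= coverage  # subset check
    return enough_hours and full_coverage
```

Wiring this check into the deployment pipeline turns the evaluation policy from a document into a control.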
Incident Response Policy
What constitutes an AI incident (unintended behavior, security breach, performance degradation), escalation path, notification requirements, remediation timeline. Example: "AI errors affecting >100 customers = critical incident. Notify exec team within 1 hour. Remediate within 24 hours."
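The example policy above can be sketched as a triage function. The >100-customer threshold and the 1-hour/24-hour critical deadlines are the quoted policy; the deadlines for non-critical incidents are assumed values for illustration.

```python
from datetime import datetime, timedelta

def triage(customers_affected: int, detected_at: datetime) -> dict:
    """Classify an AI incident and compute notification/remediation deadlines.

    Non-critical deadlines (24h notify, 72h remediate) are assumptions.
    """
    critical = customers_affected > 100  # quoted policy threshold
    return {
        "severity": "critical" if critical else "standard",
        "notify_by": detected_at + timedelta(hours=1 if critical else 24),
        "remediate_by": detected_at + timedelta(hours=24 if critical else 72),
    }
```

Encoding the policy this way also makes it testable: the deadlines an on-call engineer sees are derived from the written policy, not from memory.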
Vendor Management Policy
For third-party AI systems or models: SLAs, audit rights, data handling requirements, exit procedures. Example: "All AI vendor contracts must include 30-day wind-down clause and require vendors to provide model weights and training data upon contract termination."
Human Override Policy
When must humans be in the loop? What authority do they have? Can humans override AI recommendations? Example: "High-risk decisions must have human review before execution. Humans may override AI with documented justification."
The AI Governance Committee
Charter
Formal charter defining: authority (can the committee block deployments?), scope (all AI systems or only certain domains?), membership, meeting cadence, decision-making process.
Recommended Composition
CTO or CAIO (chair), Legal, Risk, Data Privacy, AI Engineering lead, Product, External ethics advisor. This mix balances technical expertise, business perspective, and governance concerns.
Responsibilities
- Approve/reject AI deployments
- Review eval results for high-risk systems
- Approve policy exceptions
- Respond to AI incidents
- Set strategic direction for AI governance
Documentation
Meeting minutes (decisions made, dissenting opinions), decision rationale (why was this deployment approved?), action items. This creates accountability and feeds the audit trail.
Audit-Ready AI Governance
If regulators audit your AI program, what will they want to see?
What Regulators Look For
EU AI Act (Articles 9-17): Risk management (was the system identified as high-risk and subjected to assessments?), technical documentation (is there clear documentation of training data, model architecture, eval results?), human oversight (are humans involved in key decisions?), transparency (are users told when interacting with AI?).
FDA SaMD Guidance: AI/ML-based medical software must have: performance specifications (how accurate is it?), benefit/risk analysis, validation evidence (testing and eval), post-market surveillance plan.
The 12-Document Governance Evidence Pack
An organization facing an audit should have these 12 documents ready:
- AI System Inventory and Risk Stratification
- AI Governance Policy Framework
- AI Governance Committee Charter
- Evaluation Standards and Procedures (CS-001 through CS-004)
- Sample Deployment Clearance Reports (DCRs)
- Incident Response Logs (last 12 months)
- Model Training Data Documentation
- Third-Party Vendor Contracts
- Human Override Audit Logs
- Audit Committee Meeting Minutes (last 12 months)
- Post-Deployment Monitoring Dashboards
- Training Materials for AI Users and Developers
Governance Maturity Model
Level 1 — Ad Hoc: No formal AI governance. Decisions made informally. No documentation. High risk.
Level 2 — Developing: Basic AI inventory exists. Some policies written. Inconsistent enforcement. Governance Committee meets irregularly.
Level 3 — Defined: Formal policies for all AI systems. Committee structure in place. Regular reviews. Documented decisions.
Level 4 — Managed: Metrics-driven governance. Quantitative oversight of AI system health. Integrated risk management with other enterprise risk frameworks.
Level 5 — Optimizing: Continuous improvement of governance. Predictive risk management (flagging problems before they emerge). Industry thought leadership.
Most organizations are at Level 1-2. Moving to Level 3 (defined) is achievable in 12-18 months with dedicated effort.
The AI Governance Evaluation Stack: Policies to Metrics
Layer 1: Policies — "What do we believe about AI quality? What are our principles?"
Example: "We believe all AI systems must be fair. Gender disparity <2pp is acceptable."
Layer 2: Processes — "How do we implement policies?"
Example: "Fairness audits conducted quarterly. Gender disparity measured on all systems."
Layer 3: Controls — "What gates prevent bad systems from reaching production?"
Example: "Systems with >2pp gender disparity blocked from deploy. Escalate to governance committee."
Layer 4: Metrics — "How do we measure if controls are working?"
Example: "% of systems passing fairness gate. Median disparity of deployed systems. Time to remediation for failed audits."
Layer 5: Reporting — "Who knows about this? What actions result?"
Example: "Quarterly governance report to board. Annual external audit. Public AI fairness commitment."
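Layers 3 and 4 of the fairness example above can be sketched in a few lines. The 2pp threshold is the stated policy; the function names and the sample disparity values are illustrative assumptions.

```python
DISPARITY_THRESHOLD_PP = 2.0  # stated policy: gender disparity < 2pp

def fairness_gate(disparity_pp: float) -> str:
    """Layer 3 control: block deploys at or above the policy threshold."""
    return "deploy" if disparity_pp < DISPARITY_THRESHOLD_PP else "escalate"

def gate_pass_rate(disparities: list[float]) -> float:
    """Layer 4 metric: fraction of systems passing the fairness gate."""
    passed = sum(1 for d in disparities if fairness_gate(d) == "deploy")
    return passed / len(disparities)
```

The point of the stack is that the metric in Layer 4 measures the control in Layer 3, which enforces the policy in Layer 1; each layer is checkable against the one below it.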
AI Governance Maturity Model: 5-Level Framework
Level 1: Initial / Ad-Hoc
- Governance happens reactively (after problems found)
- No formal process or committee
- No metrics tracking
- Typical: Startups, early-stage companies
Level 2: Developing / Partially Defined
- Basic governance committee exists
- Some documented policies (informal)
- Evaluation conducted but not systematic
- No public reporting
- Typical: Growth-stage companies, some structure
Level 3: Defined / Structured
- Formal governance charter and committee
- Clear policies documented
- Systematic evaluation (quarterly or more)
- Internal dashboards monitoring governance metrics
- Some public reporting of governance activities
- Typical: Enterprise companies, regulated industries
Level 4: Optimized / Advanced
- Integrated governance across organization
- Continuous evaluation (not just quarterly)
- Proactive risk identification
- Documented incident response procedures
- Regular third-party audits
- Public transparency reports on AI governance
- Typical: Large tech companies, high regulatory scrutiny
Level 5: Leading / Exemplary
- AI governance deeply embedded in culture
- Real-time governance monitoring
- Leading industry practices; publishing governance research
- Zero governance incidents (or near-zero)
- Partnerships with regulators, academia
- Public commitments exceeded regularly
- Typical: Industry leaders setting standards
Evaluating AI Governance Programs: Are You Actually Governing?
The Problem: Organizations claim good governance but don't actually enforce it. They have policies but no accountability.
Audit Questions:
- Do you have a documented AI governance policy? (Can you show it?)
- Who is accountable for governance? (Named person/committee?)
- What happens if a system violates policy? (Consequences?) — If answer is "nothing" or "unclear," governance is performative
- Do you measure governance metrics? (Dashboards?) — If no metrics, no governance
- Has a system ever been blocked from deployment due to governance? (Yes? Then governance is real. No? Then it's not.)
- Can you point to recent incidents and how you resolved them? (Documented?) — If no incident documentation, governance is missing
- Do you conduct annual independent audits? (Third-party validation?)
If you answer "no" to more than 2 of these questions, your governance is weak or performative.
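The seven questions and the more-than-two-noes rule can be captured as a self-assessment scorer. The short question labels below are paraphrases introduced for the sketch.

```python
# Paraphrased labels for the seven audit questions above.
QUESTIONS = [
    "documented policy",
    "named accountable owner",
    "real consequences for violations",
    "governance metrics tracked",
    "ever blocked a deployment",
    "incidents documented and resolved",
    "annual independent audit",
]

def assess(answers: dict) -> str:
    """answers maps each question label to True ('yes') or False ('no').

    Missing answers count as 'no', on the principle that unevidenced
    governance is unproven governance.
    """
    noes = sum(1 for q in QUESTIONS if not answers.get(q, False))
    return "weak/performative" if noes > 2 else "credible"
```

Running this honestly once a quarter is a cheap way to catch governance drifting back into box-ticking.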
Incident Response Governance: When AI Fails in Production
Three-Phase Framework:
Phase 1: Detection & Containment (0-4 hours)
- Incident detected: Accuracy dropped, bias found, or customer complaint
- Immediate action: Notify governance committee; potentially roll back system
- Goal: Stop damage before it spreads
Phase 2: Investigation & Remediation (4-48 hours)
- Root cause analysis: What went wrong? Was it predicted by governance?
- Fix: Retrain, retune, or redesign
- Validation: Re-evaluate before redeploying
Phase 3: Learning & Prevention (1-4 weeks)
- Post-mortem: How could governance have prevented this?
- Policy update: What should we change to prevent recurrence?
- Monitoring: Add metrics to catch similar issues earlier next time
- Communication: Report to leadership, customers (if needed), public (if warranted)
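The three phases above form a strictly forward pipeline with one guard: the framework's "re-evaluate before redeploying" step. A minimal state-machine sketch (phase names and the `validated` flag are modeling choices, not part of the framework's wording):

```python
from enum import Enum

class Phase(Enum):
    DETECTION = "detection_containment"            # 0-4 hours
    INVESTIGATION = "investigation_remediation"    # 4-48 hours
    LEARNING = "learning_prevention"               # 1-4 weeks

# Legal transitions: strictly forward, never skipping a phase.
NEXT = {
    Phase.DETECTION: Phase.INVESTIGATION,
    Phase.INVESTIGATION: Phase.LEARNING,
    Phase.LEARNING: None,  # terminal
}

def advance(current: Phase, validated: bool) -> Phase:
    """Move to the next phase; a fix may not ship without re-evaluation."""
    if current is Phase.INVESTIGATION and not validated:
        return current  # stay until the remediation is re-evaluated
    nxt = NEXT[current]
    return nxt if nxt is not None else current
```

Encoding the phases this way makes the guard explicit: there is no path from remediation to learning that skips validation.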
AI Governance Audit Methodology: How Third Parties Evaluate Governance
Audit Scope: Documentation review, interviews, system testing, metrics analysis
Audit Questions:
- Are governance policies actually enforced? (Check recent decision logs)
- Are evaluation metrics accurate? (Validate against spot-checks)
- Are governance decisions documented? (See records)
- Is there a working escalation path? (Interview committee members)
- Are third parties involved appropriately? (Check for conflicts)
Audit Output: Report with findings, risks, recommendations. Remediation roadmap.
Building the AI Governance Committee
Ideal Composition (8-12 people):
- Chair: Executive sponsor (VP+ level; has authority)
- CTO/Chief ML Officer: Technical authority
- Head of Compliance/Legal: Regulatory knowledge
- Chief Data Officer: Data governance liaison
- Head of Ethics: Ethical review (if separate from compliance)
- VP Product: Business perspective (product roadmap tradeoffs)
- External advisor: Academic or industry expert (external perspective)
- Community representative: User/affected community perspective (if high-risk domain)
Committee Responsibilities:
- Quarterly governance reviews (any issues? any improvements?)
- System approval for deployment (Go/no-go decisions)
- Incident response decisions (Roll back? Fix forward?)
- Policy updates (Thresholds changing? New risks emerging?)
- Public reporting (Transparency on governance activities)
Operating Cadence: Monthly full committee meetings. Weekly chair + CTO syncs. Quarterly all-hands review.
