Introduction: Why Governance Matters
Without governance, evaluation becomes a Tower of Babel. Each team speaks a different language about quality. Each team makes different assumptions about what "good enough" means. Each team optimizes for different objectives. The result: chaos, inconsistency, and risk.
Governance is the institutional answer to the question: who decides what gets evaluated, when, and by whom, and what is done with the results?
This article describes a governance framework that has worked for organizations from 5 to 500+ AI systems. It's not a one-size-fits-all solution—you'll need to adapt it to your context. But the structure is replicable.
What Governance Is NOT
Before we start, let's clarify: governance is not bureaucracy. It's not a checklist that slows down your organization. It's not compliance theater that makes regulators happy but your engineers miserable.
Good governance accelerates deployment because it clarifies decisions. It reduces risk because it ensures the right people are thinking about the right problems. It creates alignment because everyone understands the decision-making process.
The best governance is the minimal governance that prevents catastrophic failure without blocking progress.
The Governance Pyramid
Effective governance has three layers, from top to bottom:
Layer 1: Strategic Direction (Board/Executive Level)
The top of the pyramid sets strategic direction: How many AI systems should we have? How much are we willing to invest in AI quality? What are our non-negotiable values?
This is the domain of the CTO, VP Engineering, Chief Data Officer, and/or board members who care about AI. They set the budget, the organizational structure, and the strategic priorities.
Decision frequency: Quarterly or semi-annually.
Layer 2: Institutional Policies (Advisory Board Level)
The middle layer operationalizes strategic direction into policies: Every chatbot must have inter-rater agreement of at least 0.7. Every high-risk system must be evaluated continuously. Every evaluation decision must be auditable.
This is the domain of a cross-functional eval advisory board (see section below). They set the standards, define the exceptions, and manage the day-to-day governance.
Decision frequency: Monthly or bi-weekly.
Layer 3: Operational Execution (Team Level)
The bottom layer executes the policies: This system is classified as high-risk, so we'll do continuous evaluation. Here's the test set. Here's the team. Here's the timeline.
This is the domain of individual teams—ML engineers, eval engineers, product managers. They operationalize the policies for their specific systems.
Decision frequency: Weekly or continuously.
Policies: Who Decides What Gets Evaluated
Policy Structure
Effective eval policies have this structure:
POLICY ID: [e.g., EG-001]
POLICY NAME: [e.g., "AI System Classification"]
EFFECTIVE DATE: [date]
LAST REVIEWED: [date]
STATUS: [Active / Draft / Deprecated]
POLICY STATEMENT:
[The actual policy, 1-2 sentences]
RATIONALE:
[Why this policy exists, what problem it solves]
SCOPE:
[What systems does this apply to? All? Only production? Only high-risk?]
ROLES & RESPONSIBILITIES:
[Who owns this policy? Who enforces it?]
EXCEPTIONS:
[Under what conditions can this be waived?]
ENFORCEMENT:
[What happens if you violate this policy?]
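Organizations that keep policies next to code often mirror this template as structured data, so policy records can be validated and diffed in review tooling. A minimal sketch in Python (the `Policy` dataclass, field names, and sample values are illustrative, not a prescribed schema):

```python
from dataclasses import dataclass, field
from datetime import date
from enum import Enum

class Status(Enum):
    ACTIVE = "Active"
    DRAFT = "Draft"
    DEPRECATED = "Deprecated"

@dataclass
class Policy:
    policy_id: str          # e.g., "EG-001"
    name: str
    effective_date: date
    last_reviewed: date
    status: Status
    statement: str          # the actual policy, 1-2 sentences
    rationale: str          # why this policy exists
    scope: str              # which systems it applies to
    owners: list[str] = field(default_factory=list)
    exceptions: str = ""    # conditions under which it can be waived
    enforcement: str = ""   # what happens on violation

# Example record (dates and wording are placeholders)
eg_001 = Policy(
    policy_id="EG-001",
    name="AI System Classification",
    effective_date=date(2024, 1, 1),
    last_reviewed=date(2024, 1, 1),
    status=Status.ACTIVE,
    statement="Every AI system must be classified into one of four risk tiers before deployment.",
    rationale="Classification determines evaluation requirements and governance oversight.",
    scope="All AI systems, pre-production and production.",
    owners=["Eval Advisory Board"],
)
```

Storing policies this way makes the "LAST REVIEWED" field easy to audit mechanically: a script can flag any `Policy` whose `last_reviewed` date is older than the review cadence.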
Core Policies (Starter Set)
Policy EG-001: AI System Classification
Every AI system must be classified into one of four risk tiers before deployment. Classification determines evaluation requirements, review cadence, and governance oversight.
- Tier 1 (Low Risk): Internal tools, low-stakes predictions, high error tolerance. Evaluation before deployment sufficient.
- Tier 2 (Medium Risk): Production systems with moderate accuracy requirements. Continuous baseline testing required.
- Tier 3 (High Risk): Systems that could cause material business harm or user harm if they fail. Real-time monitoring required.
- Tier 4 (Critical Risk): Systems in regulated industries (financial, healthcare, legal) or systems that could cause serious legal/compliance harm. Compliance review required.
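The tier-to-requirement mapping in EG-001 can be encoded directly so that tooling can look up a system's minimum obligations programmatically. A sketch (the enum and requirement strings are illustrative):

```python
from enum import IntEnum

class Tier(IntEnum):
    LOW = 1       # internal tools, high error tolerance
    MEDIUM = 2    # production systems, moderate accuracy requirements
    HIGH = 3      # potential for material business or user harm
    CRITICAL = 4  # regulated industries, serious legal/compliance exposure

# Minimum evaluation requirement per tier, per Policy EG-001
REQUIREMENTS = {
    Tier.LOW: "pre-deployment evaluation",
    Tier.MEDIUM: "continuous baseline testing",
    Tier.HIGH: "real-time monitoring",
    Tier.CRITICAL: "real-time monitoring + compliance review",
}

def required_evaluation(tier: Tier) -> str:
    """Return the minimum evaluation requirement for a system's tier."""
    return REQUIREMENTS[tier]
```

Using `IntEnum` keeps tiers orderable, so checks like "Tier 3 and above need monitoring" stay simple comparisons.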
Policy EG-002: Evaluation Requirements by Tier
Evaluation requirements are determined by system tier and update frequency, with the minimum per tier following the EG-001 definitions: pre-deployment evaluation (Tier 1), continuous baseline testing (Tier 2), real-time monitoring (Tier 3), and compliance review on top of monitoring (Tier 4).
Policy EG-003: Human Evaluation Standards
When human judgment is required, human evaluation must follow these standards: clear rubrics, calibration sessions, inter-rater agreement measurement, and documented bias audits.
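One standard way to measure the inter-rater agreement EG-003 calls for is Cohen's kappa for two raters. A self-contained sketch (the example labels are illustrative):

```python
from collections import Counter

def cohens_kappa(rater_a: list[str], rater_b: list[str]) -> float:
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    assert len(rater_a) == len(rater_b) and rater_a
    n = len(rater_a)
    # Observed agreement: fraction of items where the raters match
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement under independence, from marginal label frequencies
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    labels = set(rater_a) | set(rater_b)
    p_e = sum((freq_a[l] / n) * (freq_b[l] / n) for l in labels)
    if p_e == 1.0:
        return 1.0  # degenerate case: both raters always use one label
    return (p_o - p_e) / (1 - p_e)

a = ["pass", "pass", "fail", "pass", "fail", "pass"]
b = ["pass", "fail", "fail", "pass", "fail", "pass"]
print(round(cohens_kappa(a, b), 3))  # → 0.667
```

A kappa below the 0.7 threshold mentioned earlier usually signals that the rubric is ambiguous or raters need a calibration session, not that one rater is "wrong."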
Policy EG-004: Audit Trail and Auditability
Every evaluation decision must be auditable. This means: what was evaluated? who evaluated it? what were the results? what decision was made based on the results? A record must be maintained for at least 3 years (or longer in regulated industries).
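The audit trail EG-004 requires can start as simply as append-only JSON Lines records that answer the four questions explicitly. A sketch (the file path and field names are illustrative, not a mandated format):

```python
import json
from datetime import datetime, timezone

def append_audit_record(path, system, evaluator, results, decision):
    """Append one auditable record: what, who, results, and decision."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "system": system,        # what was evaluated
        "evaluator": evaluator,  # who evaluated it
        "results": results,      # what the results were
        "decision": decision,    # what was decided based on them
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
    return record

rec = append_audit_record(
    "eval_audit.jsonl", "support-chatbot-v3", "eval-team",
    {"f1": 0.86}, "approve with conditions",
)
```

Append-only storage matters here: the 3-year retention requirement is much easier to defend when records cannot be silently rewritten.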
Standards Framework: CS-001 through CS-004
Standards are the specific, measurable requirements that operationalize policies. Here's a framework for defining them:
CS-001: Metric Definition Standard
Every metric must be defined using this template:
- Metric Name: [e.g., "F1 Score (Binary Classification)"]
- Definition: [Mathematical definition or reference to standard]
- Applicable To: [What types of systems use this metric?]
- Calculation Method: [Step-by-step calculation]
- Data Requirements: [What data is needed to calculate this?]
- Acceptable Range: [What's "good," "acceptable," and "unacceptable"]
- Review Frequency: [How often should this metric be recalculated?]
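A CS-001 metric definition can be encoded so that "acceptable range" is checked mechanically rather than by convention. A sketch with illustrative thresholds (the 0.90/0.80 cutoffs are examples, not prescriptions):

```python
from dataclasses import dataclass

@dataclass
class MetricDefinition:
    name: str
    applicable_to: str
    good_at_least: float        # "good" threshold
    acceptable_at_least: float  # "acceptable" threshold; below is "unacceptable"
    review_frequency: str

    def grade(self, value: float) -> str:
        """Classify a metric value against the defined acceptable range."""
        if value >= self.good_at_least:
            return "good"
        if value >= self.acceptable_at_least:
            return "acceptable"
        return "unacceptable"

f1_binary = MetricDefinition(
    name="F1 Score (Binary Classification)",
    applicable_to="binary classifiers",
    good_at_least=0.90,        # illustrative thresholds
    acceptable_at_least=0.80,
    review_frequency="monthly",
)
print(f1_binary.grade(0.84))  # → acceptable
```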
CS-002: Test Set Standard
Every AI system must have documented test sets that are:
- Representative: Test sets should reflect the distribution of real-world data.
- Diverse: Test sets should include edge cases, minority populations, and adversarial examples.
- Versioned: Changes to test sets must be tracked and justified.
- Documented: The source, creation date, and any known limitations must be recorded.
CS-003: Evaluation Report Standard
Every evaluation must produce a report that includes:
- System name, version, and date of evaluation
- Evaluation methodology (automated, human, hybrid)
- Test set details (source, size, composition)
- Metric results with confidence intervals
- Any anomalies or concerns
- Recommendation (approve, approve with conditions, reject)
- Approver signature
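The "metric results with confidence intervals" line in CS-003 can be met with a simple percentile bootstrap over per-example scores. A sketch (resample count and example scores are illustrative):

```python
import random

def bootstrap_ci(scores, n_resamples=2000, alpha=0.05, seed=0):
    """Mean of per-example scores with a percentile bootstrap CI."""
    rng = random.Random(seed)
    n = len(scores)
    # Resample with replacement and collect the resampled means
    means = sorted(
        sum(rng.choice(scores) for _ in range(n)) / n
        for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(scores) / n, (lo, hi)

scores = [1, 1, 0, 1, 1, 1, 0, 1, 1, 1]  # per-example pass/fail
mean, (lo, hi) = bootstrap_ci(scores)
```

Reporting the interval alongside the point estimate keeps approvers honest about small test sets: a wide interval is itself a finding worth noting under "anomalies or concerns."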
CS-004: Decision Documentation Standard
Every deployment decision must be documented:
- What system was evaluated and when?
- What were the evaluation results?
- Who made the decision?
- What was the decision (approve, reject, approve with conditions)?
- What was the rationale?
- If approved despite concerns, what's the mitigation plan?
Advisory Boards: Composition & Cadence
The Eval Advisory Board
Every organization at Level 3+ should have an eval advisory board. This is a cross-functional group that meets regularly to make governance decisions.
Composition (Typical)
- Chief Eval Officer or Head of Evaluation (chair)
- VP Engineering or Head of ML (represents engineering perspective)
- VP Product or Head of Data Science (represents product perspective)
- General Counsel or Chief Compliance Officer (represents legal/compliance perspective)
- CFO or Finance Lead (represents budget perspective)
- Security/Trust Officer (represents security perspective)
- Customer Success or Support Lead (represents customer perspective, especially for customer-facing systems)
Total: 6-8 people. Large enough that no key perspective is missing, small enough to stay decisive.
Cadence and Meeting Structure
Monthly Governance Meetings (90 minutes)
- First 20 minutes: Review escalated decisions from last month. Any problems?
- Middle 50 minutes: Review policies and standards. Do they need updates?
- Last 20 minutes: Look ahead to next month. What's coming up that we need to plan for?
Weekly Sync (20 minutes, async or quick sync)
- Current escalations that need immediate attention?
- Any critical incidents related to eval or AI quality?
Quarterly Deep-Dive (full day or half-day offsite)
- Review the entire eval portfolio. Are we on track?
- Revisit strategic priorities. Do they still make sense?
- Plan for next quarter. What new initiatives are needed?
Decision-Making Process
The board should have clear decision-making authority:
- Routine decisions (e.g., re-approving a system that hasn't changed): Async approval via email/Slack. If any member objects, escalate to next meeting.
- Policy decisions (e.g., changing the F1 threshold for classifiers): Discussed in monthly meeting. Majority vote. Chair has tiebreaker.
- Strategic decisions (e.g., building a new platform vs. using a vendor): Full discussion in quarterly deep-dive. Consensus-building, not voting.
Review Cycles and Escalation
Standard Review Cycles
Every AI system should have a defined review cycle based on its tier: higher-risk tiers are reviewed more frequently, and the tier classification itself is re-assessed quarterly (see Template 1).
Escalation Protocol
When a metric is flagged, follow this protocol:
- Investigate (24 hours): Is the metric change real or a data artifact? What could have caused it?
- Escalate (24-48 hours): Based on your investigation, escalate to the appropriate stakeholders.
- Decide (48-72 hours): Is this a real problem? Does it require action?
- Act (24 hours): If action is needed, take it: hotfix, rollback, traffic shift, etc.
- Communicate: Inform relevant stakeholders about the incident and resolution.
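The timed steps above can be encoded as a checklist with deadlines computed from the moment the metric was flagged. A sketch (cumulative hours follow the protocol: investigate by +24h, escalate by +48h, decide by +72h, act by +96h; the function name is illustrative):

```python
from datetime import datetime, timedelta, timezone

# (step, cumulative hours from flag time), per the protocol above
PROTOCOL = [
    ("investigate", 24),
    ("escalate", 48),
    ("decide", 72),
    ("act", 96),
]

def escalation_deadlines(flagged_at: datetime) -> dict[str, datetime]:
    """Map each protocol step to its deadline, measured from flag time."""
    return {step: flagged_at + timedelta(hours=h) for step, h in PROTOCOL}

flagged = datetime(2024, 6, 1, 9, 0, tzinfo=timezone.utc)
deadlines = escalation_deadlines(flagged)
```

Deadlines computed this way can feed directly into automated reminders, so "Communicate" is never left to memory during an incident.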
Ethical Principles for Evaluation
Governance without ethics is just rule-following. These five principles should underpin your entire eval program:
Principle 1: Integrity
Evaluation must be honest and free from manipulation. You report what you find, even if it's bad news. You don't cherry-pick metrics to make a system look better than it is. You don't hide failures.
Principle 2: Stakeholder Focus
Evaluation is ultimately about serving end-users and stakeholders, not optimizing for system launch. If evaluation reveals a system will harm users, you speak up, even if it delays deployment.
Principle 3: Transparency
Evaluation methodology should be documented and defensible. You can explain why you chose certain test sets, how you calculated metrics, and what assumptions you made.
Principle 4: Bias Awareness
Evaluators must recognize their own biases and actively work to mitigate them. This means diverse evaluation teams, blind evaluation where possible, and regular bias audits.
Principle 5: Continuous Improvement
Evaluation methodology should improve over time. You learn from past failures. You incorporate new evaluation techniques. You regularly audit your own eval program.
Template Policy Documents
Here are three templates you can customize for your organization:
Template 1: AI System Classification Framework
TITLE: AI System Classification Framework
PURPOSE: Establish consistent criteria for classifying AI systems by risk tier
CLASSIFICATION CRITERIA:
Tier 1 (Low Risk):
- Direct impact on user experience/revenue: None or minimal
- Potential for causing harm: Low
- Regulatory exposure: None
- Examples: Internal tools, low-stakes recommendations
Tier 2 (Medium Risk):
- Direct impact on user experience/revenue: Moderate
- Potential for causing harm: Moderate
- Regulatory exposure: Minimal
- Examples: Production classifiers, content filters
Tier 3 (High Risk):
- Direct impact on user experience/revenue: High
- Potential for causing harm: High
- Regulatory exposure: Moderate
- Examples: Credit decisions, content moderation
Tier 4 (Critical Risk):
- Direct impact on user experience/revenue: Critical
- Potential for causing harm: Severe
- Regulatory exposure: High
- Examples: Healthcare diagnostics, financial decisions
REVIEW PROCESS:
- Product manager proposes tier
- Eval manager assesses tier
- Board approves (or proposes alternative)
- Quarterly re-assessment
Template 2: Governance Escalation Playbook
TITLE: Eval Governance Escalation Playbook
PURPOSE: Define what events require escalation and to whom
ESCALATION LEVELS:
Level 1 (Eval Manager):
- Single metric drop 5-10%
- Minor test set concerns
- Documentation gaps
Level 2 (Eval Lead + Responsible Team Lead):
- Metric drop 10-20%
- Systematic evaluation gap
- Unexpected evaluation result
Level 3 (Eval Lead + VP Eng + VP Product):
- Metric drop >20%
- System reclassification under consideration
- Evaluation methodology failure
- Production incident linked to eval gap
Level 4 (Board):
- Critical production incident from eval gap
- Regulatory inquiry
- Strategic eval program change needed
RESPONSE TIME TARGETS:
Level 1: 48 hours
Level 2: 24 hours
Level 3: 6 hours
Level 4: Immediate
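The metric-drop triggers in this playbook can be routed automatically. A sketch that maps a single metric's relative drop to an escalation level and its response-time target (only the metric-drop triggers are automatable this way; the other triggers above still need human judgment):

```python
def escalation_level(drop_pct: float) -> tuple[int, str]:
    """Map a single-metric relative drop (%) to (level, response-time target)."""
    if drop_pct > 20:
        return 3, "6 hours"
    if drop_pct > 10:
        return 2, "24 hours"
    if drop_pct >= 5:
        return 1, "48 hours"
    return 0, "no escalation required"

assert escalation_level(25) == (3, "6 hours")
assert escalation_level(12) == (2, "24 hours")
assert escalation_level(7) == (1, "48 hours")
```

Exact boundary handling (e.g., a drop of exactly 10%) is a judgment call; this sketch assigns boundaries to the lower level, so encode whatever your board decides explicitly.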
Template 3: Annual Eval Governance Audit
TITLE: Annual AI Evaluation Governance Audit
PURPOSE: Audit the eval program against policies and standards
AUDIT CHECKLIST:
Policy Compliance:
[ ] All AI systems classified by tier
[ ] All evaluation requirements met by tier
[ ] All evaluations documented and auditable
[ ] All escalations followed protocol
Standard Compliance:
[ ] All metrics defined per CS-001
[ ] All test sets documented per CS-002
[ ] All eval reports generated per CS-003
[ ] All decisions documented per CS-004
Board Effectiveness:
[ ] Board met required cadence
[ ] Board made timely decisions
[ ] Board decisions were executed
[ ] Board decisions improved outcomes
Program Quality:
[ ] Eval team capacity sufficient
[ ] Eval tooling working reliably
[ ] Eval methodology improving
[ ] Stakeholder satisfaction with eval function
Findings and Recommendations:
[Document any gaps or needed improvements]
Real-World Governance Failures
These are real (anonymized) stories of governance failures and what went wrong:
Case 1: The Missing Escalation
What happened: A mid-market SaaS company deployed a content moderation system (high-risk) without a formal governance review. The system had a silent failure mode: it was labeling 30% of valid content as spam. The issue went undetected for 3 months.
Why it happened: No AI system classification policy. The product manager didn't know they needed eval board approval. The eval team didn't know about the system.
The fix: Implement policy EG-001 (classification). Require all teams to notify the eval team when a new system is being built.
Case 2: The Conflicting Standards
What happened: A fintech company had two different teams evaluating similar systems. Team A used F1 score as the metric. Team B used precision. When comparing the systems, executives couldn't tell which was "better" because they were using different metrics.
Why it happened: No shared standards. Each team created their own evaluation methodology.
The fix: Implement CS-001 (metric definition standard). Establish a taxonomy of metrics. Require all teams to use the same metric for similar system types.
Case 3: The Governance Theater
What happened: A large enterprise created an "AI Governance Board" that met monthly. But the board had no decision-making authority; it reviewed decisions that had already been made. The board came to be seen as a compliance checkbox, not a decision-making body.
Why it happened: The board was created to satisfy a compliance requirement, not to genuinely govern.
The fix: Give the board real authority. Make it clear that certain decisions require board approval. Make the board responsible for strategic eval decisions, not just review.
Implementation Roadmap
Month 1: Foundation
- Draft policies EG-001 through EG-004
- Get leadership buy-in on governance model
- Identify board members and confirm commitment
Months 2-3: Board Establishment
- Conduct first monthly board meeting
- Finalize policies based on feedback
- Classify all existing AI systems by tier
Months 4-6: Standard Implementation
- Define all metrics per CS-001
- Audit all test sets per CS-002
- Generate eval reports per CS-003 for all recent evals
- Document all decisions per CS-004
Months 7-12: Operationalization
- Implement review cycles for all systems
- Train teams on escalation protocol
- Set up automated alerts for escalation triggers
- Conduct first annual audit
Key Takeaways
- Three Layers: Strategic direction → Institutional policies → Operational execution
- Core Policies: Classification, evaluation requirements, human eval standards, audit trail
- Clear Standards: Metrics, test sets, eval reports, decisions must all be documented consistently
- Active Board: 6-8 cross-functional members meeting monthly with real decision-making authority
- Ethical Foundation: Integrity, stakeholder focus, transparency, bias awareness, continuous improvement
- Escalation Protocol: Define what triggers escalation and to whom based on system tier
Ready to Build Your Governance Framework?
Learn how to design an institutional evaluation program with Level 4 exam modules on governance and organizational structure.