Level 1 — Eval Foundations EAF
Eval literacy for every worker. The gateway to eval-aware thinking. No prerequisites required.
Chapter 1 — The Eval Imperative
L1 · CH1
The Eval-Deployment Gap
Why AI benchmarks don't predict real-world performance. The gap between demo and production.
eval.qa/learn/eval-deployment-gap
L1 · CH1
The Cost of Unevaluated AI
Real-world failures: hallucinated legal citations, biased hiring tools, and wrong medical advice.
eval.qa/learn/cost-of-unevaluated-ai
L1 · CH1
Eval Is Everyone's Job
Why evaluation isn't just for ML engineers. How PMs, marketers, ops, and execs all play a role.
eval.qa/learn/eval-is-everyones-job
L1 · CH1
The 5 Eval Moves Framework
Define → Measure → Test → Interpret → Decide. The mental model that structures all evaluation.
eval.qa/learn/5-eval-moves
L1 · CH1
Human Judgment Is Not Optional
Why automation can't replace human evaluation — and what happens when orgs try.
eval.qa/learn/human-judgment-not-optional
Chapter 2 — Metrics That Matter
L1 · CH2
Vanity Metrics vs. Outcome Metrics
Adoption rate is vanity. Resolution rate is outcome. The single most important distinction.
eval.qa/learn/vanity-vs-outcome-metrics
L1 · CH2
Metrics by System Type
What to measure for chatbots, RAG pipelines, code assistants, agents, and content generators.
eval.qa/learn/metrics-by-system-type
L1 · CH2
The Faithfulness–Correctness Distinction
A RAG answer can be faithful to stale docs and still wrong. Why this distinction is critical.
eval.qa/learn/faithfulness-vs-correctness
L1 · CH2
Leading vs. Lagging Indicators
Why tracking only accuracy misses the picture. Build a measurement system, not a single number.
eval.qa/learn/leading-vs-lagging
L1 · CH2
The Metric Selection Checklist
A practical framework for choosing which metrics to track. Covers relevance, measurability, actionability, and cost.
eval.qa/learn/metric-selection-checklist
Chapter 3 — Evaluation Methods 101
L1 · CH3
The Four Evaluation Types
Automated metrics, human evaluation, LLM-as-judge, and hybrid. When to use each and why.
eval.qa/learn/four-evaluation-types
L1 · CH3
LLM-as-Judge: Power and Pitfalls
Verbosity bias, position bias, self-preference, authority bias. How LLM judges fail and how to calibrate.
eval.qa/learn/llm-as-judge-biases
L1 · CH3
When Humans Are Irreplaceable
Tasks where automated evals fail: nuance, safety, cultural sensitivity, open-ended generation.
eval.qa/learn/when-humans-irreplaceable
L1 · CH3
Tagging Framework: Machine → Human-Verified → Human-Enhanced
A 3-tier content labeling system showing how much human judgment backs each eval result.
eval.qa/learn/tagging-framework
Chapter 4 — Reading Results Without Self-Deception
L1 · CH4
The 5 Interpretation Traps
Blended Average, Aggregate Confidence, Infrastructure-as-Quality, Faithfulness-as-Correctness, Threshold-in-Isolation.
eval.qa/learn/5-interpretation-traps
L1 · CH4
How Dashboards Lie
Why a green dashboard doesn't mean your AI is working. Aggregation hides failure modes.
eval.qa/learn/how-dashboards-lie
L1 · CH4
Segment Before You Celebrate
Slice by user type, query type, time window. Aggregate scores hide who's failing.
eval.qa/learn/segment-before-celebrate
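A minimal sketch of the lesson's core move, using invented toy data: the aggregate pass rate looks healthy, but slicing by user segment exposes who is actually failing.

```python
from collections import defaultdict

# Toy eval results: (segment, passed). Illustrative data only.
results = [
    ("enterprise", True), ("enterprise", True), ("enterprise", True),
    ("enterprise", True), ("enterprise", True), ("enterprise", True),
    ("free_tier", False), ("free_tier", False), ("free_tier", True),
    ("free_tier", False),
]

def pass_rate(records):
    return sum(passed for _, passed in records) / len(records)

def pass_rate_by_segment(records):
    buckets = defaultdict(list)
    for segment, passed in records:
        buckets[segment].append((segment, passed))
    return {seg: pass_rate(recs) for seg, recs in buckets.items()}

print(f"aggregate: {pass_rate(results):.0%}")   # 70% — looks fine
for seg, rate in pass_rate_by_segment(results).items():
    print(f"{seg}: {rate:.0%}")                 # free_tier is at 25%
```

The same slicing applies to query type and time window; the point is that no aggregate should be celebrated before at least one segmented view has been checked.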
L1 · CH4
Asking the Right Questions About Results
The 6 questions to ask before trusting any eval number: sample size, distribution, recency, segmentation, baseline, trend.
eval.qa/learn/questions-about-results
Chapter 5 — AI ROI: Measuring What Matters
L1 · CH5
The 4 Measurement Maturity Stages
Activity → Output → Outcome → Portfolio. Where most orgs are stuck and how to advance.
eval.qa/learn/measurement-maturity
L1 · CH5
Activity vs. Value
"We deployed 12 AI tools" is activity. "Support resolution time dropped 40%" is value.
eval.qa/learn/activity-vs-value
L1 · CH5
Building the Business Case for Eval
How to justify eval investment to leadership. ROI framing, risk reduction, competitive positioning.
eval.qa/learn/business-case-for-eval
L1 · CH5
Cost-per-Eval: Understanding the Economics
Human eval at $60/task vs. LLM-as-judge at $0.02. When cheap eval costs more in the long run.
eval.qa/learn/cost-per-eval
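One way to see how cheap eval can cost more: compare eval spend plus the expected cost of failures the evaluator misses. The miss rates and failure costs below are hypothetical assumptions, not figures from the lesson; only the $60 and $0.02 per-eval prices come from the card above.

```python
def eval_program_cost(n_items, cost_per_eval, miss_rate, failure_rate,
                      cost_per_missed_failure):
    """Toy cost model: evaluation spend plus the expected cost of
    failures the evaluator fails to catch. All rates are assumptions."""
    eval_spend = n_items * cost_per_eval
    expected_missed = n_items * failure_rate * miss_rate
    return eval_spend + expected_missed * cost_per_missed_failure

# Hypothetical scenario: 1,000 items, 10% true failure rate, $5,000
# downstream cost per missed failure; humans miss 2%, the judge misses 30%.
human = eval_program_cost(1_000, 60.00, miss_rate=0.02, failure_rate=0.10,
                          cost_per_missed_failure=5_000)
judge = eval_program_cost(1_000, 0.02, miss_rate=0.30, failure_rate=0.10,
                          cost_per_missed_failure=5_000)
print(f"human: ${human:,.0f}   llm-judge: ${judge:,.0f}")
```

Under these assumptions the $0.02 judge ends up costing more than the $60 human, because missed failures dominate. Changing the assumed miss rates flips the answer, which is exactly the analysis the lesson asks for.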
Chapter 6 — Domain-Specific Eval Awareness
L1 · CH6
Healthcare AI Evaluation
Clinical accuracy, patient safety, regulatory requirements (FDA, HIPAA). Eval stakes are life-or-death.
eval.qa/learn/eval-healthcare
L1 · CH6
Financial Services AI Evaluation
Risk assessment accuracy, regulatory compliance, audit trails. Where one wrong number can move millions.
eval.qa/learn/eval-finance
L1 · CH6
Legal AI Evaluation
Citation accuracy, precedent matching, hallucination detection in legal research tools.
eval.qa/learn/eval-legal
L1 · CH6
Customer Support AI Evaluation
Resolution rate, customer satisfaction, escalation accuracy. Measuring the full support journey.
eval.qa/learn/eval-customer-support
L1 · CH6
Creative & Content AI Evaluation
Originality, brand alignment, tone consistency. How do you evaluate what's inherently subjective?
eval.qa/learn/eval-creative
L1 Eval Foundations Certification (EAF)
30 questions · 45 min · $9.99 · No prerequisites
Level 2 — Eval Practitioner EAP
Applied evaluation methods. From awareness to hands-on execution. Prereq: L1 EAF.
Chapter 1 — Evaluation Method Selection
L2 · CH1
The Method Selection Decision Tree
Choosing between automated, human, LLM-as-judge, and hybrid based on task characteristics.
eval.qa/learn/method-selection-tree
L2 · CH1
Code-Based Evaluation Pipelines
Unit testing for LLMs: DeepEval, pytest-style assertions, CI/CD integration, regression suites.
eval.qa/learn/code-based-eval-pipelines
L2 · CH1
Human Evaluation Design
Rubric construction, rater recruitment, calibration sessions, and compensation models.
eval.qa/learn/human-eval-design
L2 · CH1
LLM-as-Judge Calibration Techniques
Prompt engineering for judges, multi-judge panels, and bias mitigation strategies.
eval.qa/learn/llm-judge-calibration
L2 · CH1
Hybrid Evaluation Workflows
AI screens + human confirms. Confidence-based routing. When to escalate from auto to human.
eval.qa/learn/hybrid-eval-workflows
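The "AI screens + human confirms" pattern can be sketched in a few lines. The 0.80 threshold and the field names here are illustrative assumptions, not part of any specific tool.

```python
def route(item_id: str, judge_verdict: str, judge_confidence: float,
          threshold: float = 0.80) -> dict:
    """Confidence-based routing: accept the automated verdict when the
    judge is confident, otherwise escalate to a human reviewer.
    The 0.80 threshold is an illustrative assumption."""
    if judge_confidence >= threshold:
        return {"id": item_id, "verdict": judge_verdict, "reviewer": "auto"}
    return {"id": item_id, "verdict": "pending", "reviewer": "human"}

print(route("resp-001", "pass", 0.93))  # accepted automatically
print(route("resp-002", "pass", 0.55))  # escalated to a human
```

In practice the threshold is tuned against a sample where both the judge and humans scored the same items, so the escalation rate and the residual error rate can both be measured.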
Chapter 2 — Metric Design for AI Systems
L2 · CH2
RAG Evaluation Metrics Deep Dive
Faithfulness, answer relevancy, contextual precision, contextual recall. The RAGAS framework.
eval.qa/learn/rag-eval-metrics
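A simplified, set-based sketch of two of these metrics. Note the hedge: RAGAS's actual contextual precision is rank-weighted and its judgments are LLM-scored; the chunk names below are invented.

```python
def contextual_precision(retrieved: list[str], relevant: set[str]) -> float:
    """Fraction of retrieved chunks that are actually relevant.
    (RAGAS's real metric is rank-weighted and LLM-judged; this is a
    simplified set-based sketch.)"""
    if not retrieved:
        return 0.0
    return sum(c in relevant for c in retrieved) / len(retrieved)

def contextual_recall(retrieved: list[str], relevant: set[str]) -> float:
    """Fraction of the relevant chunks that retrieval actually found."""
    if not relevant:
        return 1.0
    return sum(c in relevant for c in retrieved) / len(relevant)

retrieved = ["chunk_a", "chunk_b", "chunk_c", "chunk_d"]
relevant = {"chunk_a", "chunk_b", "chunk_e"}
print(contextual_precision(retrieved, relevant))  # 0.5 — half the retrieval is noise
print(contextual_recall(retrieved, relevant))     # 2/3 — one relevant chunk missed
```

Even this toy version shows why the two are reported separately: high precision with low recall and low precision with high recall are different retrieval failures with different fixes.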
L2 · CH2
Agent Evaluation: Trajectory Analysis
Tool correctness, step efficiency, goal completion. Evaluating multi-step autonomous agents.
eval.qa/learn/agent-trajectory-eval
L2 · CH2
Chatbot Quality Metrics
Task completion, conversation flow, user satisfaction, escalation rate. End-to-end evaluation.
eval.qa/learn/chatbot-quality-metrics
L2 · CH2
Code Assistant Evaluation
Functional correctness, security, test coverage, code quality. Evaluating generated code beyond "does it run."
eval.qa/learn/code-assistant-eval
L2 · CH2
Custom Metric Design Workshop
Step-by-step process for creating domain-specific evaluation metrics. From definition to implementation.
eval.qa/learn/custom-metric-design
Chapter 3 — Inter-Rater Reliability
L2 · CH3
Why Agreement Matters
Low inter-rater consistency corrupts training data, misleads benchmarks, and renders results meaningless.
eval.qa/learn/why-agreement-matters
L2 · CH3
Cohen's Kappa and Weighted Kappa
Calculating κ and κw. Interpreting: poor (<0.20) to near-perfect (0.81+). Python implementation.
eval.qa/learn/cohens-kappa
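The Python implementation mentioned above can be done from scratch in a dozen lines: observed agreement, chance agreement from each rater's label marginals, then κ = (pₒ − pₑ)/(1 − pₑ). The rating lists are made-up example data.

```python
from collections import Counter

def cohens_kappa(rater_a: list, rater_b: list) -> float:
    """Cohen's kappa: two-rater agreement corrected for chance."""
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n   # observed
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    labels = set(rater_a) | set(rater_b)
    p_e = sum((counts_a[l] / n) * (counts_b[l] / n) for l in labels)  # chance
    return (p_o - p_e) / (1 - p_e)

a = ["pass", "pass", "fail", "pass", "fail", "pass", "pass", "fail"]
b = ["pass", "pass", "fail", "fail", "fail", "pass", "pass", "pass"]
print(round(cohens_kappa(a, b), 2))  # 0.47 — moderate, despite 75% raw agreement
```

The example is the lesson's core point in miniature: 6 of 8 raw agreements sounds high, but chance-corrected agreement lands only in the moderate band.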
L2 · CH3
ICC for Multi-Rater Settings
Intraclass Correlation Coefficient for 3+ raters. ICC(2,1) vs ICC(3,1). When to use which.
eval.qa/learn/icc-multi-rater
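A minimal sketch of ICC(2,1) and ICC(3,1) following the Shrout & Fleiss two-way formulation, with no missing-data handling; the input matrix is invented example data.

```python
def icc(scores: list[list[float]]) -> dict:
    """ICC(2,1) and ICC(3,1) per Shrout & Fleiss, from a two-way layout:
    scores[i][j] = rating of subject i by rater j. Minimal sketch."""
    n, k = len(scores), len(scores[0])
    grand = sum(map(sum, scores)) / (n * k)
    row = [sum(r) / k for r in scores]                        # subject means
    col = [sum(scores[i][j] for i in range(n)) / n for j in range(k)]
    msr = k * sum((m - grand) ** 2 for m in row) / (n - 1)    # between subjects
    msc = n * sum((m - grand) ** 2 for m in col) / (k - 1)    # between raters
    sse = sum((scores[i][j] - row[i] - col[j] + grand) ** 2
              for i in range(n) for j in range(k))
    mse = sse / ((n - 1) * (k - 1))                           # residual
    icc21 = (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)
    icc31 = (msr - mse) / (msr + (k - 1) * mse)
    return {"ICC(2,1)": icc21, "ICC(3,1)": icc31}

# Perfect agreement across 3 raters -> both ICCs equal 1.0
perfect = icc([[1, 1, 1], [2, 2, 2], [4, 4, 4]])
print(perfect)
```

The design difference shows up in the denominators: ICC(2,1) charges rater-to-rater mean differences against reliability (raters treated as a random sample), while ICC(3,1) does not (these specific raters are the only ones that matter).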
L2 · CH3
Calibration Sessions: Running Them Right
Structure, frequency, gold standard items, adjudication process. The operational playbook.
eval.qa/learn/calibration-sessions
L2 · CH3
AI-Human Agreement Validation
Using QWK to validate when an LLM judge agrees with human evaluators. Trust threshold: 0.70.
eval.qa/learn/ai-human-agreement
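QWK can also be computed directly: it is Cohen's kappa with disagreements penalized by squared distance, which suits ordinal scales like a 1–4 rubric. The score lists below are made-up; the 0.70 trust threshold is the one named in the card.

```python
def quadratic_weighted_kappa(human: list[int], judge: list[int],
                             labels: list[int]) -> float:
    """QWK: chance-corrected agreement with squared-distance penalties,
    for ordinal scales (e.g. 1-4 rubric scores)."""
    n, k = len(human), len(labels)
    idx = {l: i for i, l in enumerate(labels)}
    obs = [[0.0] * k for _ in range(k)]                 # confusion matrix
    for h, j in zip(human, judge):
        obs[idx[h]][idx[j]] += 1
    row = [sum(obs[i]) for i in range(k)]
    col = [sum(obs[i][j] for i in range(k)) for j in range(k)]
    exp = [[row[i] * col[j] / n for j in range(k)] for i in range(k)]
    w = [[(i - j) ** 2 / (k - 1) ** 2 for j in range(k)] for i in range(k)]
    num = sum(w[i][j] * obs[i][j] for i in range(k) for j in range(k))
    den = sum(w[i][j] * exp[i][j] for i in range(k) for j in range(k))
    return 1 - num / den

human = [1, 2, 3, 4, 2, 3, 1, 4, 3, 2]
judge = [1, 2, 3, 4, 2, 2, 1, 4, 3, 2]
qwk = quadratic_weighted_kappa(human, judge, labels=[1, 2, 3, 4])
print(f"QWK = {qwk:.2f}, trust judge: {qwk >= 0.70}")
```

Because the penalty is quadratic, a judge that is off by one rubric point is punished far less than one that is off by three, which matches how rubric disagreements are usually treated in practice.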
Chapter 4 — Evaluation Pipeline Design
L2 · CH4
Evaluation Dataset Construction
Golden datasets, synthetic generation, stratified sampling, coverage analysis.
eval.qa/learn/eval-dataset-construction
L2 · CH4
CI/CD Eval Gates
Automated quality thresholds in your deployment pipeline. Fail a PR if eval scores regress.
eval.qa/learn/cicd-eval-gates
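The gate itself can be a small script in the pipeline. The metric names, floors, and regression tolerance here are illustrative assumptions; in a real pipeline the scores would come from the eval run and the baseline from the main branch.

```python
# Hypothetical quality floors and noise tolerance for the gate.
THRESHOLDS = {"faithfulness": 0.85, "answer_relevancy": 0.80}
MAX_REGRESSION = 0.02  # allow small run-to-run noise vs. baseline

def eval_gate(scores: dict, baseline: dict) -> list[str]:
    """Return a list of failure messages; empty means the PR may merge."""
    failures = []
    for metric, floor in THRESHOLDS.items():
        if scores[metric] < floor:
            failures.append(f"{metric} {scores[metric]:.2f} below floor {floor}")
        if baseline[metric] - scores[metric] > MAX_REGRESSION:
            failures.append(f"{metric} regressed vs baseline {baseline[metric]:.2f}")
    return failures

failures = eval_gate({"faithfulness": 0.91, "answer_relevancy": 0.78},
                     {"faithfulness": 0.90, "answer_relevancy": 0.84})
for f in failures:
    print("FAIL:", f)
```

In CI, a non-empty failure list would translate into a nonzero exit code, which is what actually blocks the merge.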
L2 · CH4
Eval Toolchain Overview
DeepEval, RAGAS, TruLens, MLflow, LangSmith, Arize, Confident AI compared.
eval.qa/learn/eval-toolchain-overview
L2 · CH4
Production-to-Test Feedback Loops
Turning production failures into test cases automatically. The closed-loop eval system.
eval.qa/learn/production-feedback-loops
L2 · CH4
Versioning and Traceability
Linking every eval score to the exact prompt, model, and dataset version. Reproducibility as practice.
eval.qa/learn/versioning-traceability
Chapter 5 — Communicating Eval Results
L2 · CH5
The Deployment Clearance Report (DCR)
The artifact that translates eval results into a go/no-go deployment recommendation.
eval.qa/learn/deployment-clearance-report
L2 · CH5
Writing for Different Audiences
Executives want ROI. Engineers want methodology. PMs want feature decisions. Tailor the message.
eval.qa/learn/eval-for-audiences
L2 · CH5
Dashboard Design for Eval Data
What to show, what to hide. Segmented views, trend lines, confidence intervals.
eval.qa/learn/eval-dashboard-design
L2 · CH5
Defending Your Eval Results
How to handle "but our accuracy is 95%!" pushback. Methodology defense, limitation disclosure.
eval.qa/learn/defending-eval-results
L2 · CH5
From Results to Recommendations
The bridge between "here are the numbers" and "here's what we should do."
eval.qa/learn/results-to-recommendations
Chapter 6 — Hands-On Eval Scenarios
L2 · CH6
Scenario: Evaluate a RAG Knowledge Base
Define metrics, run evaluation, identify failure modes, and recommend improvements. Full walkthrough.
eval.qa/learn/scenario-rag-knowledge-base
L2 · CH6
Scenario: Evaluate a Customer Support Agent
Multi-step agent handling refunds and escalations. Trajectory analysis, tool use correctness.
eval.qa/learn/scenario-support-agent
L2 · CH6
Scenario: Evaluate a Code Review Assistant
Does it catch real bugs? False positive rate. Security review quality assessment.
eval.qa/learn/scenario-code-review
L2 · CH6
Scenario: Evaluate a Content Generation Tool
Brand voice consistency, factual accuracy, originality scoring. Evaluating creative output systematically.
eval.qa/learn/scenario-content-gen
L2 · CH6
Practice: Write Your First DCR
Draft a complete Deployment Clearance Report with methodology, findings, and recommendation.
eval.qa/learn/practice-write-dcr
L2 Eval Practitioner Certification (EAP)
50 questions · Applied scenarios · $49.99 · Prereq: L1
Level 3 — Eval Specialist EAS / CAEE
Deep technical evaluation. The 4-hour lab credential. Production-grade judgment. Prereq: L2 EAP.
Chapter 1 — Advanced RAG Evaluation
L3 · CH1
Retrieval Quality Decomposition
Separating retrieval failures from generation failures. Context precision, recall, noise sensitivity in isolation.
eval.qa/learn/retrieval-quality-decomposition
L3 · CH1
Generation Faithfulness Under Pressure
Testing generation when context is contradictory, partial, or stale. Robustness testing.
eval.qa/learn/generation-faithfulness-pressure
L3 · CH1
Multi-Hop RAG Evaluation
Evaluating systems that synthesize across multiple documents. Reasoning chain validation.
eval.qa/learn/multi-hop-rag-eval
L3 · CH1
RAG Failure Taxonomy
12 failure modes: missed retrieval, wrong chunk, stale context, hallucinated source, and more.
eval.qa/learn/rag-failure-taxonomy
L3 · CH1
Production RAG Monitoring
Real-time faithfulness tracking, drift detection, embedding degradation, alerting thresholds.
eval.qa/learn/production-rag-monitoring
Chapter 2 — Safety & Adversarial Evaluation
L3 · CH2
AI Red Teaming Methodology
Structured adversarial testing: threat models, attack surfaces, scenario design. NIST AI RMF aligned.
eval.qa/learn/red-teaming-methodology
L3 · CH2
Prompt Injection Testing
Direct injection, indirect injection, jailbreaks, system prompt extraction. Testing boundaries.
eval.qa/learn/prompt-injection-testing
L3 · CH2
Safety Benchmarks & Vulnerability Scanning
40+ vulnerability categories. DeepEval's red-teaming suite. Building automated safety gates.
eval.qa/learn/safety-benchmarks
L3 · CH2
Data Leakage & PII Detection
Testing whether your AI system leaks training data, user data, or personally identifiable information.
eval.qa/learn/data-leakage-detection
L3 · CH2
NIST AI RMF & EU AI Act Alignment
How evaluation maps to regulatory frameworks. ARIA, CoRIx, compliance evidence generation.
eval.qa/learn/nist-eu-ai-act-alignment
Chapter 3 — Psychometric Rigor
L3 · CH3
Rubric Design with Behavioral Anchors
6-criteria rubric with exact weights. Writing behavioral anchors on a 1–4 scale.
eval.qa/learn/rubric-behavioral-anchors
L3 · CH3
Borderline-Group Standard Setting
The method for determining cut-scores. Mean score of borderline-competent candidates.
eval.qa/learn/borderline-standard-setting
L3 · CH3
Reliability Reporting & Transparency
Publishing κw, ICC with 95% CI, score distributions. Methodology page as a trust signal.
eval.qa/learn/reliability-reporting
L3 · CH3
Item Analysis & Difficulty Calibration
Discrimination index, difficulty level, distractor analysis. Ensuring each question does its job.
eval.qa/learn/item-analysis
L3 · CH3
Fairness & Bias in Assessment
DIF analysis, accommodation policies, cultural sensitivity review. Test measures skill, not background.
eval.qa/learn/fairness-bias-assessment
Chapter 4 — The Eval Lab Experience
L3 · CH4
Lab Scenario: RAG Pipeline for MedDocs
The flagship CAEE scenario. DocAssist retrieval system with medical documentation. Full walkthrough.
eval.qa/learn/lab-meddocs-scenario
L3 · CH4
Hidden Challenges & Stress Tests
10 stress test scenarios embedded in the lab. Contradictory evidence, missing context, adversarial inputs.
eval.qa/learn/lab-hidden-challenges
L3 · CH4
Writing the DCR Under Pressure
Time management for the 4-hour lab. What to include when you have 90 minutes left.
eval.qa/learn/dcr-under-pressure
L3 · CH4
Grading: How Your Lab Gets Scored
Dual-grading process, AI-screened → human-confirmed pipeline, rater calibration, appeals.
eval.qa/learn/lab-grading-process
L3 · CH4
Common Mistakes That Fail Candidates
Skipping segmentation, confusing faithfulness with correctness, ignoring edge cases, weak recommendations.
eval.qa/learn/common-lab-mistakes
Chapter 5 — Rater Operations & Quality Assurance
L3 · CH5
The 3-Phase Grading Model
Phase 1: full human dual-grading ($120). Phase 2: AI-screened ($70). Phase 3: AI primary ($35).
eval.qa/learn/three-phase-grading
L3 · CH5
Rater Training Protocol
5-hour onboarding, gold standard scoring, monthly calibration, quarterly re-calibration.
eval.qa/learn/rater-training-protocol
L3 · CH5
Drift Detection & Performance Management
Monitoring rater agreement over time. Every 25th submission triggers dual-grading.
eval.qa/learn/rater-drift-detection
L3 · CH5
Annotation Quality at Scale
Gold standard datasets, multi-layered QA, Krippendorff's alpha. Lessons from Apple and Google.
eval.qa/learn/annotation-quality-scale
L3 · CH5
Compensation & Evaluator Economics
$60/submission at CAEE. Industry benchmarks. Balancing quality incentives with cost control.
eval.qa/learn/evaluator-economics
Chapter 6 — Production Monitoring & Continuous Eval
L3 · CH6
Observability for AI Systems
Traces, spans, and eval hooks. LangSmith, Arize, Langfuse. Monitoring AI in the wild.
eval.qa/learn/ai-observability
L3 · CH6
Model Drift & Data Drift Detection
When performance degrades silently. Embedding drift, distribution shift, concept drift.
eval.qa/learn/drift-detection
L3 · CH6
A/B Testing for AI Systems
Prompt variants, model swaps, retrieval strategies. Statistical significance in AI context.
eval.qa/learn/ab-testing-ai
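For a pass/fail-style outcome such as resolution rate, the workhorse significance check is a two-proportion z-test, sketched below with the standard library only. The counts are invented example data.

```python
import math

def two_proportion_z(success_a, n_a, success_b, n_b):
    """Two-proportion z-test, e.g. comparing resolution rates of two
    prompt variants. Returns (z, two-sided p-value via the normal tail)."""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    p = math.erfc(abs(z) / math.sqrt(2))  # 2 * (1 - Phi(|z|))
    return z, p

# Hypothetical A/B: variant A resolves 420/1000 sessions, variant B 465/1000
z, p = two_proportion_z(420, 1000, 465, 1000)
print(f"z = {z:.2f}, p = {p:.3f}")
```

The AI-specific caveat the lesson covers still applies: model outputs are noisy and sessions may not be independent, so significance on one metric is a starting point for judgment, not a verdict.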
L3 · CH6
User Feedback Integration
Thumbs up/down, explicit corrections, implicit signals. Turning user behavior into eval data.
eval.qa/learn/user-feedback-integration
L3 · CH6
Meta-Evaluation: Evaluating Your Evaluators
Are your metrics measuring what you think? Correlation analysis, counterfactual testing.
eval.qa/learn/meta-evaluation
L3 Eval Specialist / CAEE Certification
4-hour hands-on lab + DCR · $99.99 · Prereq: L2
Level 4 — Eval Architect EAA
Org-wide evaluation strategy. Design eval systems for teams and enterprises. Prereq: L3 EAS.
Chapter 1 — Eval Strategy & Governance
L4 · CH1
The Org-Wide Eval Maturity Assessment
Ad-hoc → Structured → Production → Continuous → Portfolio. Gap analysis and advancement roadmap.
eval.qa/learn/org-eval-maturity
L4 · CH1
Eval Governance Framework
Policies, standards (CS-001–CS-004), advisory boards, review cycles. Institutional backbone.
eval.qa/learn/eval-governance-framework
L4 · CH1
Building the Eval Team
Eval Engineer, Reliability Engineer, Quality Manager, ROI Analyst, Trust & Safety Lead. Hiring and structuring.
eval.qa/learn/building-eval-team
L4 · CH1
Eval Ethics & Code of Conduct
Evaluate with integrity, serve the stakeholder, disclose conflicts, protect data. The 8 principles.
eval.qa/learn/eval-ethics-code
L4 · CH1
Budgeting for Eval Programs
Cost modeling: human eval, compute, tooling, training. Making the financial case to leadership.
eval.qa/learn/budgeting-eval-programs
Chapter 2 — Multi-System Evaluation Architecture
L4 · CH2
Portfolio-Level Eval Strategy
When you have 20+ AI systems: prioritization, shared infrastructure, cross-system metrics, risk ranking.
eval.qa/learn/portfolio-eval-strategy
L4 · CH2
Eval Platform Architecture
Task routing, evaluator matching, result storage, API design. Build vs. buy decision framework.
eval.qa/learn/eval-platform-architecture
L4 · CH2
Cross-System Benchmarking
Comparing AI systems against each other and baselines. Normalization, fairness, leaderboard design.
eval.qa/learn/cross-system-benchmarking
L4 · CH2
Eval Orchestration at Scale
Smart routing, confidence-based escalation, parallel evaluation, queue management.
eval.qa/learn/eval-orchestration-scale
L4 · CH2
Vendor Evaluation & Tool Selection
Comparing DeepEval, RAGAS, Arize, Confident AI, Galileo, LangSmith against your needs.
eval.qa/learn/vendor-eval-tool-selection
Chapter 3 — Department-Specific Eval Tracks
L4 · CH3
Engineering Eval Track
ML/AI engineers: code eval, model benchmarking, pipeline testing, CI/CD integration.
eval.qa/learn/track-engineering
L4 · CH3
Product & Analytics Eval Track
PMs and analysts: feature impact, user experience evaluation, A/B testing for AI features.
eval.qa/learn/track-product
L4 · CH3
Go-to-Market Eval Tracks
Marketing, sales, customer support. Content quality, lead scoring, resolution rate tracking.
eval.qa/learn/track-go-to-market
L4 · CH3
Compliance & Risk Eval Tracks
Legal, finance, operations. Regulatory compliance evaluation, audit trail, risk scoring.
eval.qa/learn/track-compliance-risk
L4 · CH3
Executive Eval Track
CTOs, VPs, board members: portfolio ROI dashboards, risk heat maps, strategic recommendations.
eval.qa/learn/track-executive
Chapter 4 — Advanced Assessment Design
L4 · CH4
Designing Lab-Based Assessments
Scenario design, hidden challenges, time constraints, variant rotation. Measuring judgment.
eval.qa/learn/designing-lab-assessments
L4 · CH4
Cheating Defense Systems
Scenario variants, similarity detection, integrity statements, LLM-output detection.
eval.qa/learn/cheating-defense
L4 · CH4
Eval Gym: Training Evaluator Workforces
Skill trees, progression paths, qualification gates, performance analytics, leaderboards.
eval.qa/learn/eval-gym-training
L4 · CH4
NCCA & ISO 17024 Accreditation Path
What it takes to get your certification accredited. Standards, evidence, timelines, cost.
eval.qa/learn/ncca-iso-accreditation
L4 · CH4
Continuing Education & Eval Hours
The EHCS system: Eval Craft (40%), Eval Impact (25%), Eval Leadership (15%). Earning paths A–D.
eval.qa/learn/continuing-education-ehcs
Chapter 5 — Employer Recognition & Market Strategy
L4 · CH5
The Employer Recognition Program
Bronze → Silver → Gold → Platinum tiers. Discount structures, adoption incentives.
eval.qa/learn/employer-recognition
L4 · CH5
Letters of Intent & Demand Proof
Gate 1: 30+ deposits. Gate 2: 5 employer LOIs. Gate 3: κ ≥ 0.60 + target pass rate.
eval.qa/learn/demand-proof-protocol
L4 · CH5
Competitive Moat Design
Lab difficulty, published methodology, portfolio artifact, tool-agnostic positioning.
eval.qa/learn/competitive-moat
L4 · CH5
Enterprise Sales & Pricing Architecture
Individual vs. corporate vs. enterprise pricing. Volume discounts, team packages, subscriptions.
eval.qa/learn/enterprise-pricing
L4 · CH5
University & CE Credit Partnerships
Getting academic recognition. CE credits, university partnerships, academic advisory boards.
eval.qa/learn/university-partnerships
Chapter 6 — Case Studies & Portfolio Projects
L4 · CH6
Case Study: Enterprise RAG Eval at Scale
Fortune 500 evaluating 15 RAG systems. Architecture, tooling, governance, results.
eval.qa/learn/case-enterprise-rag
L4 · CH6
Case Study: Healthcare AI Safety Program
Building an eval program for clinical AI. FDA alignment, patient safety, continuous monitoring.
eval.qa/learn/case-healthcare-safety
L4 · CH6
Case Study: Financial Services Eval Governance
Audit-ready evaluation for banking AI. Regulatory compliance, explainability, bias testing.
eval.qa/learn/case-finserv-governance
L4 · CH6
Portfolio Project: Design an Eval Program
Capstone: given a fictional company with 8 AI systems, design a complete eval architecture.
eval.qa/learn/portfolio-eval-program
L4 · CH6
Peer Defense: Presenting Your Architecture
The L4 peer defense format. Presenting your portfolio to a panel, handling questions.
eval.qa/learn/peer-defense-guide
L4 Eval Architect Certification (EAA)
Portfolio + case study + peer defense · $149.99 · Prereq: L3
Level 5 — Eval Commander EAC
Strategic evaluation leadership. Shape industry standards, mentor the next generation. Prereq: L4 EAA.
Chapter 1 — Eval as Strategic Advantage
L5 · CH1
The Chief AI Evaluation Officer
The emerging C-suite role ($200K–$350K+). Scope, responsibilities, reporting, and the case.
eval.qa/learn/chief-eval-officer
L5 · CH1
Eval-Driven Business Strategy
How organizations that evaluate well outperform. Evaluation as decision infrastructure, not overhead.
eval.qa/learn/eval-driven-strategy
L5 · CH1
Board-Level AI Risk Reporting
Translating eval data into board decks. Risk heat maps, portfolio health, investment justification.
eval.qa/learn/board-risk-reporting
L5 · CH1
The Authority-First Philosophy
Revenue follows authority, not the reverse. Building a hard certification creates brand power.
eval.qa/learn/authority-first
L5 · CH1
Eval Market Landscape & Future Trends
Agent evaluation, multimodal eval, real-time eval, eval-as-a-service. Where the market is heading.
eval.qa/learn/eval-market-trends
Chapter 2 — Regulatory & Standards Leadership
L5 · CH2
NIST AI RMF Deep Dive
Govern, Map, Measure, Manage. How the risk management framework structures evaluation.
eval.qa/learn/nist-ai-rmf-deep-dive
L5 · CH2
EU AI Act Compliance Mapping
Articles 9–72: testing, evaluation, documentation. How eval certification provides compliance evidence.
eval.qa/learn/eu-ai-act-compliance
L5 · CH2
Building Standards Bodies
Creating and publishing evaluation standards. Governance, public comment, revision cycles.
eval.qa/learn/building-standards-bodies
L5 · CH2
Credential Reciprocity
Partnering with PMI, IAPP, IEEE. Cross-recognition frameworks and mutual benefit structures.
eval.qa/learn/credential-reciprocity
L5 · CH2
The ARIA & CoRIx Programs
NIST's Assessing Risks and Impacts of AI. Contextual Robustness Index. How to participate.
eval.qa/learn/nist-aria-corix
Chapter 3 — Building Eval Culture
L5 · CH3
The Eval Culture Maturity Model
From "nobody evaluates" to "evaluation is how we make decisions." 5 stages with diagnostic indicators.
eval.qa/learn/eval-culture-maturity
L5 · CH3
Change Management for Eval Adoption
Overcoming resistance: "our AI is fine" syndrome. Stakeholder mapping, quick wins, champions.
eval.qa/learn/change-management-eval
L5 · CH3
Mentoring the Next Generation
EAC mentorship model: pairing Commanders with Specialist candidates. Structured programs.
eval.qa/learn/mentoring-eval
L5 · CH3
Eval Challenges & Community Building
Monthly community challenges on real AI systems. Leaderboards, shared playbooks, frameworks.
eval.qa/learn/eval-challenges-community
L5 · CH3
Publishing & Thought Leadership
Writing for the eval community: case studies, methodology papers, open-source contributions.
eval.qa/learn/publishing-thought-leadership
Chapter 4 — Frontier Evaluation Challenges
L5 · CH4
Multimodal AI Evaluation
Text, images, audio, video. Cross-modal faithfulness, grounding, and coherence evaluation.
eval.qa/learn/multimodal-eval
L5 · CH4
Autonomous Agent Evaluation
Multi-step planning, tool use chains, error recovery, goal completion. Independent systems.
eval.qa/learn/autonomous-agent-eval
L5 · CH4
Real-Time Evaluation Systems
Evaluating AI responses as they happen. Streaming eval, latency constraints, online learning.
eval.qa/learn/real-time-eval
L5 · CH4
Evaluating AI That Evaluates AI
The recursive challenge. When your eval pipeline uses the same technology it's evaluating.
eval.qa/learn/evaluating-evaluators
L5 · CH4
Counterfactual & Causal Evaluation
"Would the outcome have been different without AI?" Causal inference applied to evaluation.
eval.qa/learn/counterfactual-causal-eval
Chapter 5 — Multi-Domain Eval Expansion
L5 · CH5
Domain 2: AI Safety Certification (CS-002)
Safety evaluation specialization: red teaming, vulnerability assessment, safety culture, incident response.
eval.qa/learn/domain-ai-safety
L5 · CH5
Domain 3: AI Governance Certification (CS-003)
Policy, standards, compliance, audit. The governance layer above technical evaluation.
eval.qa/learn/domain-ai-governance
L5 · CH5
Domain 4: Human Judgment Certification (CS-004)
Cognitive bias, calibration, structured analytical techniques, expert elicitation.
eval.qa/learn/domain-human-judgment
L5 · CH5
Stackable Credentials & Badge Architecture
Open Badges 3.0. Core + specialty badges. Stacking rules, display, and verification.
eval.qa/learn/stackable-credentials
L5 · CH5
Eval Marketplace Design
Community-contributed frameworks, evaluator placements, custom eval services. Platform economy.
eval.qa/learn/eval-marketplace
Chapter 6 — The Commander's Portfolio & Oral Defense
L5 · CH6
Portfolio Requirements
3 artifacts required: eval program design, published contribution, mentorship evidence.
eval.qa/learn/commander-portfolio-reqs
L5 · CH6
Industry Contribution Project
Publish methodology, open-source a framework, contribute to standards. Making eval better.
eval.qa/learn/industry-contribution
L5 · CH6
The Oral Defense Format
60-minute panel: 20 min presentation, 25 min Q&A, 15 min deliberation. Scoring criteria.
eval.qa/learn/oral-defense-format
L5 · CH6
Exemplar Commander Portfolios
Anonymized examples of outstanding L5 portfolios. What excellence looks like at the top.
eval.qa/learn/exemplar-portfolios
L5 · CH6
Life After L5: The Eval Advisory Path
Advisory board membership, industry consulting, standards committee leadership. Where Commanders go next.
eval.qa/learn/eval-advisory-path
L5 Eval Commander Certification (EAC)
Portfolio + oral defense + industry contribution · $199.99 · Prereq: L4
