Level 1 — Eval Foundations EAF
Eval literacy for every worker. The gateway to eval-aware thinking. No prerequisites required.
Chapter 1 — The Eval Imperative
L1 · CH1
The Eval-Deployment Gap
Why AI benchmarks don't predict real-world performance. The gap between demo and production.
eval.qa/learn/eval-deployment-gap
L1 · CH1
The Cost of Unevaluated AI
Real-world failures: hallucinated legal citations, biased hiring tools, and wrong medical advice.
eval.qa/learn/cost-of-unevaluated-ai
L1 · CH1
Eval Is Everyone's Job
Why evaluation isn't just for ML engineers. How PMs, marketers, ops, and execs all play a role.
eval.qa/learn/eval-is-everyones-job
L1 · CH1
The 5 Eval Moves Framework
Define → Measure → Test → Interpret → Decide. The mental model that structures all evaluation.
eval.qa/learn/5-eval-moves
L1 · CH1
Human Judgment Is Not Optional
Why automation can't replace human evaluation — and what happens when orgs try.
eval.qa/learn/human-judgment-not-optional
Chapter 2 — Metrics That Matter
L1 · CH2
Vanity Metrics vs. Outcome Metrics
Adoption rate is vanity. Resolution rate is outcome. The single most important distinction.
eval.qa/learn/vanity-vs-outcome-metrics
L1 · CH2
Metrics by System Type
What to measure for chatbots, RAG pipelines, code assistants, agents, and content generators.
eval.qa/learn/metrics-by-system-type
L1 · CH2
The Faithfulness–Correctness Distinction
A RAG answer can be faithful to stale docs and still wrong. Why this distinction is critical.
eval.qa/learn/faithfulness-vs-correctness
L1 · CH2
Leading vs. Lagging Indicators
Why tracking only accuracy misses the picture. Build a measurement system, not a single number.
eval.qa/learn/leading-vs-lagging
L1 · CH2
The Metric Selection Checklist
A practical framework for choosing which metrics to track. Covers relevance, measurability, actionability, and cost.
eval.qa/learn/metric-selection-checklist
Chapter 3 — Evaluation Methods 101
L1 · CH3
The Four Evaluation Types
Automated metrics, human evaluation, LLM-as-judge, and hybrid. When to use each and why.
eval.qa/learn/four-evaluation-types
L1 · CH3
LLM-as-Judge: Power and Pitfalls
Verbosity bias, position bias, self-preference, authority bias. How LLM judges fail and how to calibrate.
eval.qa/learn/llm-as-judge-biases
L1 · CH3
When Humans Are Irreplaceable
Tasks where automated evals fail: nuance, safety, cultural sensitivity, open-ended generation.
eval.qa/learn/when-humans-irreplaceable
L1 · CH3
Tagging Framework: Machine → Human-Verified → Human-Enhanced
A 3-tier content labeling system showing how much human judgment backs each eval result.
eval.qa/learn/tagging-framework
Chapter 4 — Reading Results Without Self-Deception
L1 · CH4
The 5 Interpretation Traps
Blended Average, Aggregate Confidence, Infrastructure-as-Quality, Faithfulness-as-Correctness, Threshold-in-Isolation.
eval.qa/learn/5-interpretation-traps
L1 · CH4
How Dashboards Lie
Why a green dashboard doesn't mean your AI is working. Aggregation hides failure modes.
eval.qa/learn/how-dashboards-lie
L1 · CH4
Segment Before You Celebrate
Slice by user type, query type, time window. Aggregate scores hide who's failing.
eval.qa/learn/segment-before-celebrate
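A minimal sketch of the lesson's core move, using invented toy data: the aggregate pass rate looks healthy, but slicing by user segment exposes who is actually failing.

```python
from collections import defaultdict

# Toy eval results: (segment, passed). Illustrative data only.
results = [
    ("enterprise", True), ("enterprise", True), ("enterprise", True),
    ("enterprise", True), ("enterprise", True), ("enterprise", True),
    ("free_tier", False), ("free_tier", False), ("free_tier", True),
    ("free_tier", False),
]

def pass_rate(records):
    return sum(passed for _, passed in records) / len(records)

def pass_rate_by_segment(records):
    buckets = defaultdict(list)
    for segment, passed in records:
        buckets[segment].append((segment, passed))
    return {seg: pass_rate(recs) for seg, recs in buckets.items()}

print(f"aggregate: {pass_rate(results):.0%}")   # 70% — looks fine
for seg, rate in pass_rate_by_segment(results).items():
    print(f"{seg}: {rate:.0%}")                 # free_tier is at 25%
```

The same slicing applies to query type and time window; the point is that no aggregate should be celebrated before at least one segmented view has been checked.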
L1 · CH4
Asking the Right Questions About Results
The 6 questions to ask before trusting any eval number: sample size, distribution, recency, segmentation, baseline, trend.
eval.qa/learn/questions-about-results
Chapter 5 — AI ROI: Measuring What Matters
L1 · CH5
The 4 Measurement Maturity Stages
Activity → Output → Outcome → Portfolio. Where most orgs are stuck and how to advance.
eval.qa/learn/measurement-maturity
L1 · CH5
Activity vs. Value
"We deployed 12 AI tools" is activity. "Support resolution time dropped 40%" is value.
eval.qa/learn/activity-vs-value
L1 · CH5
Building the Business Case for Eval
How to justify eval investment to leadership. ROI framing, risk reduction, competitive positioning.
eval.qa/learn/business-case-for-eval
L1 · CH5
Cost-per-Eval: Understanding the Economics
Human eval at $60/task vs. LLM-as-judge at $0.02. When cheap eval costs more in the long run.
eval.qa/learn/cost-per-eval
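One way to see how cheap eval can cost more: compare eval spend plus the expected cost of failures the evaluator misses. The miss rates and failure costs below are hypothetical assumptions, not figures from the lesson; only the $60 and $0.02 per-eval prices come from the card above.

```python
def eval_program_cost(n_items, cost_per_eval, miss_rate, failure_rate,
                      cost_per_missed_failure):
    """Toy cost model: evaluation spend plus the expected cost of
    failures the evaluator fails to catch. All rates are assumptions."""
    eval_spend = n_items * cost_per_eval
    expected_missed = n_items * failure_rate * miss_rate
    return eval_spend + expected_missed * cost_per_missed_failure

# Hypothetical scenario: 1,000 items, 10% true failure rate, $5,000
# downstream cost per missed failure; humans miss 2%, the judge misses 30%.
human = eval_program_cost(1_000, 60.00, miss_rate=0.02, failure_rate=0.10,
                          cost_per_missed_failure=5_000)
judge = eval_program_cost(1_000, 0.02, miss_rate=0.30, failure_rate=0.10,
                          cost_per_missed_failure=5_000)
print(f"human: ${human:,.0f}   llm-judge: ${judge:,.0f}")
```

Under these assumptions the $0.02 judge ends up costing more than the $60 human, because missed failures dominate. Changing the assumed miss rates flips the answer, which is exactly the analysis the lesson asks for.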
Chapter 6 — Domain-Specific Eval Awareness
L1 · CH6
Healthcare AI Evaluation
Clinical accuracy, patient safety, regulatory requirements (FDA, HIPAA). Eval stakes are life-or-death.
eval.qa/learn/eval-healthcare
L1 · CH6
Financial Services AI Evaluation
Risk assessment accuracy, regulatory compliance, audit trails. Where one wrong number can move millions.
eval.qa/learn/eval-finance
L1 · CH6
Legal AI Evaluation
Citation accuracy, precedent matching, hallucination detection in legal research tools.
eval.qa/learn/eval-legal
L1 · CH6
Customer Support AI Evaluation
Resolution rate, customer satisfaction, escalation accuracy. Measuring the full support journey.
eval.qa/learn/eval-customer-support
L1 · CH6
Creative & Content AI Evaluation
Originality, brand alignment, tone consistency. How do you evaluate what's inherently subjective?
eval.qa/learn/eval-creative
L1 Eval Foundations Certification (EAF)
30 questions · 45 min · $9.99 · No prerequisites
Level 2 — Eval Practitioner EAP
Applied evaluation methods. From awareness to hands-on execution. Prereq: L1 EAF.
Chapter 1 — Evaluation Method Selection
L2 · CH1
The Method Selection Decision Tree
Choosing between automated, human, LLM-as-judge, and hybrid based on task characteristics.
eval.qa/learn/method-selection-tree
L2 · CH1
Code-Based Evaluation Pipelines
Unit testing for LLMs: DeepEval, pytest-style assertions, CI/CD integration, regression suites.
eval.qa/learn/code-based-eval-pipelines
L2 · CH1
Human Evaluation Design
Rubric construction, rater recruitment, calibration sessions, and compensation models.
eval.qa/learn/human-eval-design
L2 · CH1
LLM-as-Judge Calibration Techniques
Prompt engineering for judges, multi-judge panels, and bias mitigation strategies.
eval.qa/learn/llm-judge-calibration
L2 · CH1
Hybrid Evaluation Workflows
AI screens + human confirms. Confidence-based routing. When to escalate from auto to human.
eval.qa/learn/hybrid-eval-workflows
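The "AI screens + human confirms" pattern can be sketched in a few lines. The 0.80 threshold and the field names here are illustrative assumptions, not part of any specific tool.

```python
def route(item_id: str, judge_verdict: str, judge_confidence: float,
          threshold: float = 0.80) -> dict:
    """Confidence-based routing: accept the automated verdict when the
    judge is confident, otherwise escalate to a human reviewer.
    The 0.80 threshold is an illustrative assumption."""
    if judge_confidence >= threshold:
        return {"id": item_id, "verdict": judge_verdict, "reviewer": "auto"}
    return {"id": item_id, "verdict": "pending", "reviewer": "human"}

print(route("resp-001", "pass", 0.93))  # accepted automatically
print(route("resp-002", "pass", 0.55))  # escalated to a human
```

In practice the threshold is tuned against a sample where both the judge and humans scored the same items, so the escalation rate and the residual error rate can both be measured.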
Chapter 2 — Metric Design for AI Systems
L2 · CH2
RAG Evaluation Metrics Deep Dive
Faithfulness, answer relevancy, contextual precision, contextual recall. The RAGAS framework.
eval.qa/learn/rag-eval-metrics
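A simplified, set-based sketch of two of these metrics. Note the hedge: RAGAS's actual contextual precision is rank-weighted and its judgments are LLM-scored; the chunk names below are invented.

```python
def contextual_precision(retrieved: list[str], relevant: set[str]) -> float:
    """Fraction of retrieved chunks that are actually relevant.
    (RAGAS's real metric is rank-weighted and LLM-judged; this is a
    simplified set-based sketch.)"""
    if not retrieved:
        return 0.0
    return sum(c in relevant for c in retrieved) / len(retrieved)

def contextual_recall(retrieved: list[str], relevant: set[str]) -> float:
    """Fraction of the relevant chunks that retrieval actually found."""
    if not relevant:
        return 1.0
    return sum(c in relevant for c in retrieved) / len(relevant)

retrieved = ["chunk_a", "chunk_b", "chunk_c", "chunk_d"]
relevant = {"chunk_a", "chunk_b", "chunk_e"}
print(contextual_precision(retrieved, relevant))  # 0.5 — half the retrieval is noise
print(contextual_recall(retrieved, relevant))     # 2/3 — one relevant chunk missed
```

Even this toy version shows why the two are reported separately: high precision with low recall and low precision with high recall are different retrieval failures with different fixes.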
L2 · CH2
Agent Evaluation: Trajectory Analysis
Tool correctness, step efficiency, goal completion. Evaluating multi-step autonomous agents.
eval.qa/learn/agent-trajectory-eval
L2 · CH2
Chatbot Quality Metrics
Task completion, conversation flow, user satisfaction, escalation rate. End-to-end evaluation.
eval.qa/learn/chatbot-quality-metrics
L2 · CH2
Code Assistant Evaluation
Functional correctness, security, test coverage, code quality. Evaluating generated code beyond "does it run."
eval.qa/learn/code-assistant-eval
L2 · CH2
Custom Metric Design Workshop
Step-by-step process for creating domain-specific evaluation metrics. From definition to implementation.
eval.qa/learn/custom-metric-design
Chapter 3 — Inter-Rater Reliability
L2 · CH3
Why Agreement Matters
Low inter-rater consistency corrupts training data, misleads benchmarks, and renders results meaningless.
eval.qa/learn/why-agreement-matters
L2 · CH3
Cohen's Kappa and Weighted Kappa
Calculating κ and κw. Interpreting: poor (<0.20) to near-perfect (0.81+). Python implementation.
eval.qa/learn/cohens-kappa
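The Python implementation mentioned above can be done from scratch in a dozen lines: observed agreement, chance agreement from each rater's label marginals, then κ = (pₒ − pₑ)/(1 − pₑ). The rating lists are made-up example data.

```python
from collections import Counter

def cohens_kappa(rater_a: list, rater_b: list) -> float:
    """Cohen's kappa: two-rater agreement corrected for chance."""
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n   # observed
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    labels = set(rater_a) | set(rater_b)
    p_e = sum((counts_a[l] / n) * (counts_b[l] / n) for l in labels)  # chance
    return (p_o - p_e) / (1 - p_e)

a = ["pass", "pass", "fail", "pass", "fail", "pass", "pass", "fail"]
b = ["pass", "pass", "fail", "fail", "fail", "pass", "pass", "pass"]
print(round(cohens_kappa(a, b), 2))  # 0.47 — moderate, despite 75% raw agreement
```

The example is the lesson's core point in miniature: 6 of 8 raw agreements sounds high, but chance-corrected agreement lands only in the moderate band.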
L2 · CH3
ICC for Multi-Rater Settings
Intraclass Correlation Coefficient for 3+ raters. ICC(2,1) vs ICC(3,1). When to use which.
eval.qa/learn/icc-multi-rater
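A minimal sketch of ICC(2,1) and ICC(3,1) following the Shrout & Fleiss two-way formulation, with no missing-data handling; the input matrix is invented example data.

```python
def icc(scores: list[list[float]]) -> dict:
    """ICC(2,1) and ICC(3,1) per Shrout & Fleiss, from a two-way layout:
    scores[i][j] = rating of subject i by rater j. Minimal sketch."""
    n, k = len(scores), len(scores[0])
    grand = sum(map(sum, scores)) / (n * k)
    row = [sum(r) / k for r in scores]                        # subject means
    col = [sum(scores[i][j] for i in range(n)) / n for j in range(k)]
    msr = k * sum((m - grand) ** 2 for m in row) / (n - 1)    # between subjects
    msc = n * sum((m - grand) ** 2 for m in col) / (k - 1)    # between raters
    sse = sum((scores[i][j] - row[i] - col[j] + grand) ** 2
              for i in range(n) for j in range(k))
    mse = sse / ((n - 1) * (k - 1))                           # residual
    icc21 = (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)
    icc31 = (msr - mse) / (msr + (k - 1) * mse)
    return {"ICC(2,1)": icc21, "ICC(3,1)": icc31}

# Perfect agreement across 3 raters -> both ICCs equal 1.0
perfect = icc([[1, 1, 1], [2, 2, 2], [4, 4, 4]])
print(perfect)
```

The design difference shows up in the denominators: ICC(2,1) charges rater-to-rater mean differences against reliability (raters treated as a random sample), while ICC(3,1) does not (these specific raters are the only ones that matter).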
L2 · CH3
Calibration Sessions: Running Them Right
Structure, frequency, gold standard items, adjudication process. The operational playbook.
eval.qa/learn/calibration-sessions
L2 · CH3
AI-Human Agreement Validation
Using QWK to validate when an LLM judge agrees with human evaluators. Trust threshold: 0.70.
eval.qa/learn/ai-human-agreement
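QWK can also be computed directly: it is Cohen's kappa with disagreements penalized by squared distance, which suits ordinal scales like a 1–4 rubric. The score lists below are made-up; the 0.70 trust threshold is the one named in the card.

```python
def quadratic_weighted_kappa(human: list[int], judge: list[int],
                             labels: list[int]) -> float:
    """QWK: chance-corrected agreement with squared-distance penalties,
    for ordinal scales (e.g. 1-4 rubric scores)."""
    n, k = len(human), len(labels)
    idx = {l: i for i, l in enumerate(labels)}
    obs = [[0.0] * k for _ in range(k)]                 # confusion matrix
    for h, j in zip(human, judge):
        obs[idx[h]][idx[j]] += 1
    row = [sum(obs[i]) for i in range(k)]
    col = [sum(obs[i][j] for i in range(k)) for j in range(k)]
    exp = [[row[i] * col[j] / n for j in range(k)] for i in range(k)]
    w = [[(i - j) ** 2 / (k - 1) ** 2 for j in range(k)] for i in range(k)]
    num = sum(w[i][j] * obs[i][j] for i in range(k) for j in range(k))
    den = sum(w[i][j] * exp[i][j] for i in range(k) for j in range(k))
    return 1 - num / den

human = [1, 2, 3, 4, 2, 3, 1, 4, 3, 2]
judge = [1, 2, 3, 4, 2, 2, 1, 4, 3, 2]
qwk = quadratic_weighted_kappa(human, judge, labels=[1, 2, 3, 4])
print(f"QWK = {qwk:.2f}, trust judge: {qwk >= 0.70}")
```

Because the penalty is quadratic, a judge that is off by one rubric point is punished far less than one that is off by three, which matches how rubric disagreements are usually treated in practice.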
Chapter 4 — Evaluation Pipeline Design
L2 · CH4
Evaluation Dataset Construction
Golden datasets, synthetic generation, stratified sampling, coverage analysis.
eval.qa/learn/eval-dataset-construction
L2 · CH4
CI/CD Eval Gates
Automated quality thresholds in your deployment pipeline. Fail a PR if eval scores regress.
eval.qa/learn/cicd-eval-gates
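The gate itself can be a small script in the pipeline. The metric names, floors, and regression tolerance here are illustrative assumptions; in a real pipeline the scores would come from the eval run and the baseline from the main branch.

```python
# Hypothetical quality floors and noise tolerance for the gate.
THRESHOLDS = {"faithfulness": 0.85, "answer_relevancy": 0.80}
MAX_REGRESSION = 0.02  # allow small run-to-run noise vs. baseline

def eval_gate(scores: dict, baseline: dict) -> list[str]:
    """Return a list of failure messages; empty means the PR may merge."""
    failures = []
    for metric, floor in THRESHOLDS.items():
        if scores[metric] < floor:
            failures.append(f"{metric} {scores[metric]:.2f} below floor {floor}")
        if baseline[metric] - scores[metric] > MAX_REGRESSION:
            failures.append(f"{metric} regressed vs baseline {baseline[metric]:.2f}")
    return failures

failures = eval_gate({"faithfulness": 0.91, "answer_relevancy": 0.78},
                     {"faithfulness": 0.90, "answer_relevancy": 0.84})
for f in failures:
    print("FAIL:", f)
```

In CI, a non-empty failure list would translate into a nonzero exit code, which is what actually blocks the merge.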
L2 · CH4
Eval Toolchain Overview
DeepEval, RAGAS, TruLens, MLflow, LangSmith, Arize, Confident AI compared.
eval.qa/learn/eval-toolchain-overview
L2 · CH4
Production-to-Test Feedback Loops
Turning production failures into test cases automatically. The closed-loop eval system.
eval.qa/learn/production-feedback-loops
L2 · CH4
Versioning and Traceability
Linking every eval score to the exact prompt, model, and dataset version. Reproducibility as practice.
eval.qa/learn/versioning-traceability
Chapter 5 — Communicating Eval Results
L2 · CH5
The Deployment Clearance Report (DCR)
The artifact that translates eval results into a go/no-go deployment recommendation.
eval.qa/learn/deployment-clearance-report
L2 · CH5
Writing for Different Audiences
Executives want ROI. Engineers want methodology. PMs want feature decisions. Tailor the message.
eval.qa/learn/eval-for-audiences
L2 · CH5
Dashboard Design for Eval Data
What to show, what to hide. Segmented views, trend lines, confidence intervals.
eval.qa/learn/eval-dashboard-design
L2 · CH5
Defending Your Eval Results
How to handle "but our accuracy is 95%!" pushback. Methodology defense, limitation disclosure.
eval.qa/learn/defending-eval-results
L2 · CH5
From Results to Recommendations
The bridge between "here are the numbers" and "here's what we should do."
eval.qa/learn/results-to-recommendations
Chapter 6 — Hands-On Eval Scenarios
L2 · CH6
Scenario: Evaluate a RAG Knowledge Base
Define metrics, run evaluation, identify failure modes, and recommend improvements. Full walkthrough.
eval.qa/learn/scenario-rag-knowledge-base
L2 · CH6
Scenario: Evaluate a Customer Support Agent
Multi-step agent handling refunds and escalations. Trajectory analysis, tool use correctness.
eval.qa/learn/scenario-support-agent
L2 · CH6
Scenario: Evaluate a Code Review Assistant
Does it catch real bugs? False positive rate. Security review quality assessment.
eval.qa/learn/scenario-code-review
L2 · CH6
Scenario: Evaluate a Content Generation Tool
Brand voice consistency, factual accuracy, originality scoring. Evaluating creative output systematically.
eval.qa/learn/scenario-content-gen
L2 · CH6
Practice: Write Your First DCR
Draft a complete Deployment Clearance Report with methodology, findings, and recommendation.
eval.qa/learn/practice-write-dcr
L2 Eval Practitioner Certification (EAP)
50 questions · Applied scenarios · $49.99 · Prereq: L1
Level 3 — Eval Specialist EAS / CAEE
Deep technical evaluation. The 4-hour lab credential. Production-grade judgment. Prereq: L2 EAP.
Chapter 1 — Advanced RAG Evaluation
L3 · CH1
Retrieval Quality Decomposition
Separating retrieval failures from generation failures. Context precision, recall, noise sensitivity in isolation.
eval.qa/learn/retrieval-quality-decomposition
L3 · CH1
Generation Faithfulness Under Pressure
Testing generation when context is contradictory, partial, or stale. Robustness testing.
eval.qa/learn/generation-faithfulness-pressure
L3 · CH1
Multi-Hop RAG Evaluation
Evaluating systems that synthesize across multiple documents. Reasoning chain validation.
eval.qa/learn/multi-hop-rag-eval
L3 · CH1
RAG Failure Taxonomy
12 failure modes: missed retrieval, wrong chunk, stale context, hallucinated source, and more.
eval.qa/learn/rag-failure-taxonomy
L3 · CH1
Production RAG Monitoring
Real-time faithfulness tracking, drift detection, embedding degradation, alerting thresholds.
eval.qa/learn/production-rag-monitoring
Chapter 2 — Safety & Adversarial Evaluation
L3 · CH2
AI Red Teaming Methodology
Structured adversarial testing: threat models, attack surfaces, scenario design. NIST AI RMF aligned.
eval.qa/learn/red-teaming-methodology
L3 · CH2
Prompt Injection Testing
Direct injection, indirect injection, jailbreaks, system prompt extraction. Testing boundaries.
eval.qa/learn/prompt-injection-testing
L3 · CH2
Safety Benchmarks & Vulnerability Scanning
40+ vulnerability categories. DeepEval's red-teaming suite. Building automated safety gates.
eval.qa/learn/safety-benchmarks
L3 · CH2
Data Leakage & PII Detection
Testing whether your AI system leaks training data, user data, or personally identifiable information.
eval.qa/learn/data-leakage-detection
L3 · CH2
NIST AI RMF & EU AI Act Alignment
How evaluation maps to regulatory frameworks. ARIA, CoRIx, compliance evidence generation.
eval.qa/learn/nist-eu-ai-act-alignment
Chapter 3 — Psychometric Rigor
L3 · CH3
Rubric Design with Behavioral Anchors
6-criteria rubric with exact weights. Writing behavioral anchors on a 1–4 scale.
eval.qa/learn/rubric-behavioral-anchors
L3 · CH3
Borderline-Group Standard Setting
The method for determining cut-scores. Mean score of borderline-competent candidates.
eval.qa/learn/borderline-standard-setting
L3 · CH3
Reliability Reporting & Transparency
Publishing κw, ICC with 95% CI, score distributions. Methodology page as a trust signal.
eval.qa/learn/reliability-reporting
L3 · CH3
Item Analysis & Difficulty Calibration
Discrimination index, difficulty level, distractor analysis. Ensuring each question does its job.
eval.qa/learn/item-analysis
L3 · CH3
Fairness & Bias in Assessment
DIF analysis, accommodation policies, cultural sensitivity review. Test measures skill, not background.
eval.qa/learn/fairness-bias-assessment
Chapter 4 — The Eval Lab Experience
L3 · CH4
Lab Scenario: RAG Pipeline for MedDocs
The flagship CAEE scenario. DocAssist retrieval system with medical documentation. Full walkthrough.
eval.qa/learn/lab-meddocs-scenario
L3 · CH4
Hidden Challenges & Stress Tests
10 stress test scenarios embedded in the lab. Contradictory evidence, missing context, adversarial inputs.
eval.qa/learn/lab-hidden-challenges
L3 · CH4
Writing the DCR Under Pressure
Time management for the 4-hour lab. What to include when you have 90 minutes left.
eval.qa/learn/dcr-under-pressure
L3 · CH4
Grading: How Your Lab Gets Scored
Dual-grading process, AI-screened → human-confirmed pipeline, rater calibration, appeals.
eval.qa/learn/lab-grading-process
L3 · CH4
Common Mistakes That Fail Candidates
Skipping segmentation, confusing faithfulness with correctness, ignoring edge cases, weak recommendations.
eval.qa/learn/common-lab-mistakes
Chapter 5 — Rater Operations & Quality Assurance
L3 · CH5
The 3-Phase Grading Model
Phase 1: full human dual-grading ($120). Phase 2: AI-screened ($70). Phase 3: AI primary ($35).
eval.qa/learn/three-phase-grading
L3 · CH5
Rater Training Protocol
5-hour onboarding, gold standard scoring, monthly calibration, quarterly re-calibration.
eval.qa/learn/rater-training-protocol
L3 · CH5
Drift Detection & Performance Management
Monitoring rater agreement over time. Every 25th submission triggers dual-grading.
eval.qa/learn/rater-drift-detection
L3 · CH5
Annotation Quality at Scale
Gold standard datasets, multi-layered QA, Krippendorff's alpha. Lessons from Apple and Google.
eval.qa/learn/annotation-quality-scale
L3 · CH5
Compensation & Evaluator Economics
$60/submission at CAEE. Industry benchmarks. Balancing quality incentives with cost control.
eval.qa/learn/evaluator-economics
Chapter 6 — Production Monitoring & Continuous Eval
L3 · CH6
Observability for AI Systems
Traces, spans, and eval hooks. LangSmith, Arize, Langfuse. Monitoring AI in the wild.
eval.qa/learn/ai-observability
L3 · CH6
Model Drift & Data Drift Detection
When performance degrades silently. Embedding drift, distribution shift, concept drift.
eval.qa/learn/drift-detection
L3 · CH6
A/B Testing for AI Systems
Prompt variants, model swaps, retrieval strategies. Statistical significance in AI context.
eval.qa/learn/ab-testing-ai
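For a pass/fail-style outcome such as resolution rate, the workhorse significance check is a two-proportion z-test, sketched below with the standard library only. The counts are invented example data.

```python
import math

def two_proportion_z(success_a, n_a, success_b, n_b):
    """Two-proportion z-test, e.g. comparing resolution rates of two
    prompt variants. Returns (z, two-sided p-value via the normal tail)."""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    p = math.erfc(abs(z) / math.sqrt(2))  # 2 * (1 - Phi(|z|))
    return z, p

# Hypothetical A/B: variant A resolves 420/1000 sessions, variant B 465/1000
z, p = two_proportion_z(420, 1000, 465, 1000)
print(f"z = {z:.2f}, p = {p:.3f}")
```

The AI-specific caveat the lesson covers still applies: model outputs are noisy and sessions may not be independent, so significance on one metric is a starting point for judgment, not a verdict.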
L3 · CH6
User Feedback Integration
Thumbs up/down, explicit corrections, implicit signals. Turning user behavior into eval data.
eval.qa/learn/user-feedback-integration
L3 · CH6
Meta-Evaluation: Evaluating Your Evaluators
Are your metrics measuring what you think? Correlation analysis, counterfactual testing.
eval.qa/learn/meta-evaluation
L3 Eval Specialist / CAEE Certification
4-hour hands-on lab + DCR · $99.99 · Prereq: L2
Level 4 — Eval Architect EAA
Org-wide evaluation strategy. Design eval systems for teams and enterprises. Prereq: L3 EAS.
Chapter 1 — Eval Strategy & Governance
L4 · CH1
The Org-Wide Eval Maturity Assessment
Ad-hoc → Structured → Production → Continuous → Portfolio. Gap analysis and advancement roadmap.
eval.qa/learn/org-eval-maturity
L4 · CH1
Eval Governance Framework
Policies, standards (CS-001–CS-004), advisory boards, review cycles. Institutional backbone.
eval.qa/learn/eval-governance-framework
L4 · CH1
Building the Eval Team
Eval Engineer, Reliability Engineer, Quality Manager, ROI Analyst, Trust & Safety Lead. Hiring and structuring.
eval.qa/learn/building-eval-team
L4 · CH1
Eval Ethics & Code of Conduct
Evaluate with integrity, serve the stakeholder, disclose conflicts, protect data. The 8 principles.
eval.qa/learn/eval-ethics-code
L4 · CH1
Budgeting for Eval Programs
Cost modeling: human eval, compute, tooling, training. Making the financial case to leadership.
eval.qa/learn/budgeting-eval-programs
Chapter 2 — Multi-System Evaluation Architecture
L4 · CH2
Portfolio-Level Eval Strategy
When you have 20+ AI systems: prioritization, shared infrastructure, cross-system metrics, risk ranking.
eval.qa/learn/portfolio-eval-strategy
L4 · CH2
Eval Platform Architecture
Task routing, evaluator matching, result storage, API design. Build vs. buy decision framework.
eval.qa/learn/eval-platform-architecture
L4 · CH2
Cross-System Benchmarking
Comparing AI systems against each other and baselines. Normalization, fairness, leaderboard design.
eval.qa/learn/cross-system-benchmarking
L4 · CH2
Eval Orchestration at Scale
Smart routing, confidence-based escalation, parallel evaluation, queue management.
eval.qa/learn/eval-orchestration-scale
L4 · CH2
Vendor Evaluation & Tool Selection
Comparing DeepEval, RAGAS, Arize, Confident AI, Galileo, LangSmith against your needs.
eval.qa/learn/vendor-eval-tool-selection
Chapter 3 — Department-Specific Eval Tracks
L4 · CH3
Engineering Eval Track
ML/AI engineers: code eval, model benchmarking, pipeline testing, CI/CD integration.
eval.qa/learn/track-engineering
L4 · CH3
Product & Analytics Eval Track
PMs and analysts: feature impact, user experience evaluation, A/B testing for AI features.
eval.qa/learn/track-product
L4 · CH3
Go-to-Market Eval Tracks
Marketing, sales, customer support. Content quality, lead scoring, resolution rate tracking.
eval.qa/learn/track-go-to-market
L4 · CH3
Compliance & Risk Eval Tracks
Legal, finance, operations. Regulatory compliance evaluation, audit trail, risk scoring.
eval.qa/learn/track-compliance-risk
L4 · CH3
Executive Eval Track
CTOs, VPs, board members: portfolio ROI dashboards, risk heat maps, strategic recommendations.
eval.qa/learn/track-executive
Chapter 4 — Advanced Assessment Design
L4 · CH4
Designing Lab-Based Assessments
Scenario design, hidden challenges, time constraints, variant rotation. Measuring judgment.
eval.qa/learn/designing-lab-assessments
L4 · CH4
Cheating Defense Systems
Scenario variants, similarity detection, integrity statements, LLM-output detection.
eval.qa/learn/cheating-defense
L4 · CH4
Eval Gym: Training Evaluator Workforces
Skill trees, progression paths, qualification gates, performance analytics, leaderboards.
eval.qa/learn/eval-gym-training
L4 · CH4
NCCA & ISO 17024 Accreditation Path
What it takes to get your certification accredited. Standards, evidence, timelines, cost.
eval.qa/learn/ncca-iso-accreditation
L4 · CH4
Continuing Education & Eval Hours
The EHCS system: Eval Craft (40%), Eval Impact (25%), Eval Leadership (15%). Earning paths A–D.
eval.qa/learn/continuing-education-ehcs
Chapter 5 — Employer Recognition & Market Strategy
L4 · CH5
The Employer Recognition Program
Bronze → Silver → Gold → Platinum tiers. Discount structures, adoption incentives.
eval.qa/learn/employer-recognition
L4 · CH5
Letters of Intent & Demand Proof
Gate 1: 30+ deposits. Gate 2: 5 employer LOIs. Gate 3: κ ≥ 0.60 + target pass rate.
eval.qa/learn/demand-proof-protocol
L4 · CH5
Competitive Moat Design
Lab difficulty, published methodology, portfolio artifact, tool-agnostic positioning.
eval.qa/learn/competitive-moat
L4 · CH5
Enterprise Sales & Pricing Architecture
Individual vs. corporate vs. enterprise pricing. Volume discounts, team packages, subscriptions.
eval.qa/learn/enterprise-pricing
L4 · CH5
University & CE Credit Partnerships
Getting academic recognition. CE credits, university partnerships, academic advisory boards.
eval.qa/learn/university-partnerships
Chapter 6 — Case Studies & Portfolio Projects
L4 · CH6
Case Study: Enterprise RAG Eval at Scale
Fortune 500 evaluating 15 RAG systems. Architecture, tooling, governance, results.
eval.qa/learn/case-enterprise-rag
L4 · CH6
Case Study: Healthcare AI Safety Program
Building an eval program for clinical AI. FDA alignment, patient safety, continuous monitoring.
eval.qa/learn/case-healthcare-safety
L4 · CH6
Case Study: Financial Services Eval Governance
Audit-ready evaluation for banking AI. Regulatory compliance, explainability, bias testing.
eval.qa/learn/case-finserv-governance
L4 · CH6
Portfolio Project: Design an Eval Program
Capstone: given a fictional company with 8 AI systems, design a complete eval architecture.
eval.qa/learn/portfolio-eval-program
L4 · CH6
Peer Defense: Presenting Your Architecture
The L4 peer defense format. Presenting your portfolio to a panel, handling questions.
eval.qa/learn/peer-defense-guide
L4 Eval Architect Certification (EAA)
Portfolio + case study + peer defense · $149.99 · Prereq: L3
Level 5 — Eval Commander EAC
Strategic evaluation leadership. Shape industry standards, mentor the next generation. Prereq: L4 EAA.
Chapter 1 — Eval as Strategic Advantage
L5 · CH1
The Chief AI Evaluation Officer
The emerging C-suite role ($200K–$350K+). Scope, responsibilities, reporting, and the case.
eval.qa/learn/chief-eval-officer
L5 · CH1
Eval-Driven Business Strategy
How organizations that evaluate well outperform. Evaluation as decision infrastructure, not overhead.
eval.qa/learn/eval-driven-strategy
L5 · CH1
Board-Level AI Risk Reporting
Translating eval data into board decks. Risk heat maps, portfolio health, investment justification.
eval.qa/learn/board-risk-reporting
L5 · CH1
The Authority-First Philosophy
Revenue follows authority, not the reverse. Building a hard certification creates brand power.
eval.qa/learn/authority-first
L5 · CH1
Eval Market Landscape & Future Trends
Agent evaluation, multimodal eval, real-time eval, eval-as-a-service. Where the market is heading.
eval.qa/learn/eval-market-trends
Chapter 2 — Regulatory & Standards Leadership
L5 · CH2
NIST AI RMF Deep Dive
Govern, Map, Measure, Manage. How the risk management framework structures evaluation.
eval.qa/learn/nist-ai-rmf-deep-dive
L5 · CH2
EU AI Act Compliance Mapping
Articles 9–72: testing, evaluation, documentation. How eval certification provides compliance evidence.
eval.qa/learn/eu-ai-act-compliance
L5 · CH2
Building Standards Bodies
Creating and publishing evaluation standards. Governance, public comment, revision cycles.
eval.qa/learn/building-standards-bodies
L5 · CH2
Credential Reciprocity
Partnering with PMI, IAPP, IEEE. Cross-recognition frameworks and mutual benefit structures.
eval.qa/learn/credential-reciprocity
L5 · CH2
The ARIA & CoRIx Programs
NIST's Assessing Risks and Impacts of AI. Contextual Robustness Index. How to participate.
eval.qa/learn/nist-aria-corix
Chapter 3 — Building Eval Culture
L5 · CH3
The Eval Culture Maturity Model
From "nobody evaluates" to "evaluation is how we make decisions." 5 stages with diagnostic indicators.
eval.qa/learn/eval-culture-maturity
L5 · CH3
Change Management for Eval Adoption
Overcoming resistance: "our AI is fine" syndrome. Stakeholder mapping, quick wins, champions.
eval.qa/learn/change-management-eval
L5 · CH3
Mentoring the Next Generation
EAC mentorship model: pairing Commanders with Specialist candidates. Structured programs.
eval.qa/learn/mentoring-eval
L5 · CH3
Eval Challenges & Community Building
Monthly community challenges on real AI systems. Leaderboards, shared playbooks, frameworks.
eval.qa/learn/eval-challenges-community
L5 · CH3
Publishing & Thought Leadership
Writing for the eval community: case studies, methodology papers, open-source contributions.
eval.qa/learn/publishing-thought-leadership
Chapter 4 — Frontier Evaluation Challenges
L5 · CH4
Multimodal AI Evaluation
Text, images, audio, video. Cross-modal faithfulness, grounding, and coherence evaluation.
eval.qa/learn/multimodal-eval
L5 · CH4
Autonomous Agent Evaluation
Multi-step planning, tool use chains, error recovery, goal completion. Independent systems.
eval.qa/learn/autonomous-agent-eval
L5 · CH4
Real-Time Evaluation Systems
Evaluating AI responses as they happen. Streaming eval, latency constraints, online learning.
eval.qa/learn/real-time-eval
L5 · CH4
Evaluating AI That Evaluates AI
The recursive challenge. When your eval pipeline uses the same technology it's evaluating.
eval.qa/learn/evaluating-evaluators
L5 · CH4
Counterfactual & Causal Evaluation
"Would the outcome have been different without AI?" Causal inference applied to evaluation.
eval.qa/learn/counterfactual-causal-eval
Chapter 5 — Multi-Domain Eval Expansion
L5 · CH5
Domain 2: AI Safety Certification (CS-002)
Safety evaluation specialization: red teaming, vulnerability assessment, safety culture, incident response.
eval.qa/learn/domain-ai-safety
L5 · CH5
Domain 3: AI Governance Certification (CS-003)
Policy, standards, compliance, audit. The governance layer above technical evaluation.
eval.qa/learn/domain-ai-governance
L5 · CH5
Domain 4: Human Judgment Certification (CS-004)
Cognitive bias, calibration, structured analytical techniques, expert elicitation.
eval.qa/learn/domain-human-judgment
L5 · CH5
Stackable Credentials & Badge Architecture
Open Badges 3.0. Core + specialty badges. Stacking rules, display, and verification.
eval.qa/learn/stackable-credentials
L5 · CH5
Eval Marketplace Design
Community-contributed frameworks, evaluator placements, custom eval services. Platform economy.
eval.qa/learn/eval-marketplace
Chapter 6 — The Commander's Portfolio & Oral Defense
L5 · CH6
Portfolio Requirements
3 artifacts required: eval program design, published contribution, mentorship evidence.
eval.qa/learn/commander-portfolio-reqs
L5 · CH6
Industry Contribution Project
Publish methodology, open-source a framework, contribute to standards. Making eval better.
eval.qa/learn/industry-contribution
L5 · CH6
The Oral Defense Format
60-minute panel: 20 min presentation, 25 min Q&A, 15 min deliberation. Scoring criteria.
eval.qa/learn/oral-defense-format
L5 · CH6
Exemplar Commander Portfolios
Anonymized examples of outstanding L5 portfolios. What excellence looks like at the top.
eval.qa/learn/exemplar-portfolios
L5 · CH6
Life After L5: The Eval Advisory Path
Advisory board membership, industry consulting, standards committee leadership. Where Commanders go next.
eval.qa/learn/eval-advisory-path
L5 Eval Commander Certification (EAC)
Portfolio + oral defense + industry contribution · $199.99 · Prereq: L4
