What the Portfolio Is and Why It Exists

The L5 Commander portfolio exists because passing the exam is not enough to demonstrate strategic leadership in AI evaluation. The exam tests knowledge. The portfolio tests judgment, execution, and impact.

You cannot reach Commander level through study alone. You must demonstrate judgment, execution, and impact in practice.

The portfolio is evaluated by a panel of 3-5 expert evaluators, not just the automated exam system. They look for evidence of strategic thinking, not just technical competence.

• 68% pass rate on the L5 exam (written knowledge)
• 42% pass rate on the portfolio (demonstrated impact)
• 1,200 hours average preparation time for L5 (exam + portfolio)

The portfolio is harder than the exam. This is intentional. The Commander credential should be rare and valuable.

The 3 Mandatory Artifacts

You must submit all three artifacts. You cannot skip one or "compensate" with a stronger version of another. The portfolio is holistic.

| Artifact | Scope | Format | Target Length | Weight |
|---|---|---|---|---|
| Eval Program Design | Design for a real evaluation initiative | Document (PDF or markdown) | 15-25 pages | 35% |
| Published Contribution | Public intellectual contribution | Blog, paper, talk, or tool | Varies | 35% |
| Mentorship Evidence | Documentation of mentoring 1-2 evaluators | Structured documentation | 10-15 pages + artifacts | 30% |
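If the three artifact weights combine into the overall score as a plain weighted average (an assumption; the exact aggregation formula isn't specified here), the arithmetic is straightforward:

```python
# Sketch of a weighted-average portfolio score, assuming the three
# artifact weights (35% / 35% / 30%) combine linearly. The actual
# aggregation used by the review panel may differ.
WEIGHTS = {
    "program_design": 0.35,         # Artifact 1: Eval Program Design
    "published_contribution": 0.35, # Artifact 2: Published Contribution
    "mentorship": 0.30,             # Artifact 3: Mentorship Evidence
}

def overall_score(scores: dict) -> float:
    """Weighted average of per-artifact scores on a 0-100 scale."""
    return sum(scores[name] * weight for name, weight in WEIGHTS.items())

scores = {"program_design": 82, "published_contribution": 78, "mentorship": 70}
total = overall_score(scores)
print(round(total, 1), total >= 75)  # 77.0 True (75+ is the overall pass bar)
```

Note how the 30% mentorship weight means a weak third artifact can sink an otherwise strong portfolio; you cannot "compensate" your way past it.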

Artifact 1: Eval Program Design (35% weight)

This is a comprehensive design document for an evaluation program. Not a proposal; a design. The program should be:

Required Sections

1. Problem Statement (2-3 pages)

What problem does this evaluation solve? Examples:

The problem statement should feel urgent. Why does this matter now? What's the risk of inaction?

2. Evaluation Architecture (3-4 pages)

High-level design of your evaluation approach:

3. Governance and Ownership (2 pages)

Who is responsible for what? Examples:

Include a RACI matrix (Responsible, Accountable, Consulted, Informed) for key decisions.
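To make the RACI idea concrete, here is a hypothetical sketch (the roles and decisions are invented for illustration) of how such a matrix might be captured as structured data alongside the design document:

```python
# Hypothetical RACI matrix for an eval program's key decisions.
# R = Responsible, A = Accountable, C = Consulted, I = Informed.
RACI = {
    "Approve new eval metrics": {
        "Eval Lead": "A", "ML Engineers": "R", "Product": "C", "Leadership": "I",
    },
    "Sign off on release gates": {
        "Product": "A", "Eval Lead": "R", "ML Engineers": "C", "Leadership": "I",
    },
}

def accountable(decision: str) -> str:
    """Return the single role accountable for a decision (exactly one 'A')."""
    roles = [role for role, code in RACI[decision].items() if code == "A"]
    assert len(roles) == 1, f"expected exactly one Accountable role for {decision!r}"
    return roles[0]

print(accountable("Sign off on release gates"))  # Product
```

A single Accountable role per decision is the defining constraint of RACI; encoding the matrix as data makes that constraint trivial to check.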

4. Rollout Timeline (2-3 pages)

Month-by-month or phase-by-phase plan. Example:

5. Budget and Resource Requirements (2 pages)

Be specific:

Frame it as an investment, with the expected ROI.

6. Success Criteria and Measurement (2 pages)

How will you know the evaluation program succeeded? Examples:

Quality Criteria for Artifact 1

Evaluation perspective
Reviewers ask: "If I handed this to a new team, could they implement it?" If yes, the design is well-written. If they would need to come back to you with questions, it needs more clarity.
| Criterion | Acceptable (70+) | Strong (85+) | Exceptional (95+) |
|---|---|---|---|
| Problem clarity | Problem is clear; urgency is implied | Problem is clear and urgent; quantified | Problem is quantified with stakeholder validation |
| Architecture feasibility | Design is reasonable; some gaps exist | Design is detailed and implementable | Design is implementable with evidence of piloting |
| Metric selection | Metrics are chosen; some lack justification | All metrics are justified and align with the problem | Metrics are justified, weighted, and validated |
| Governance clarity | Ownership is defined but vague | Clear ownership with decision rights | RACI matrix; escalation paths defined |
| Timeline realism | Timeline is provided but aggressive | Timeline is realistic with milestones | Timeline based on actual implementation experience |

Artifact 2: Published Contribution (35% weight)

You must make a public intellectual contribution to the field of AI evaluation. This is not internal documentation; it's external, for the community.

What Counts as Published Contribution

Quality Bar: What Doesn't Count

Examples of Strong Contributions

Blog Post Example: "Multi-Dimensional Evaluation of RAG Systems: Beyond BLEU Scores" — 4,500 words, published on eval.qa blog, 5,000+ reads, cited by others

Tool Example: "EvalMetrics" — Open-source Python library for domain-specific evaluation metrics, 800 GitHub stars, 50 organizations using it

Paper Example: "Causal Inference in AI Evaluation: Separating Correlation from Impact" — Published at FAccT 2025, 20+ citations

Talk Example: "Building Evaluation Culture: A 500-Person Case Study" — 35-minute technical talk at NeurIPS 2025, 200+ attendees

Minimum Thresholds

If you're borderline (e.g., blog with 2,000 words), you may be asked to strengthen it before final acceptance.

Artifact 3: Mentorship Evidence (30% weight)

Strategic leadership means developing others. This artifact documents your mentorship of 1-2 evaluators.

What Mentorship Means

Not giving a presentation or writing a tutorial. Actual 1-on-1 development of another evaluator, where they:

Required Documentation

For each mentee, provide:

Quality Examples

Strong mentorship evidence:

I mentored Sarah, a junior ML engineer with 2 years of experience. She had zero evaluation background. Over 6 months, I guided her through:
  • Month 1: Basics of eval, human evaluation design, annotation rubrics
  • Month 2: Designing eval for her team's recommendation system
  • Month 3: Running first evaluation, interpreting results
  • Month 4: Automation, building eval pipeline
  • Month 5-6: Leading eval for a new project, mentoring a junior intern herself
She now leads evaluation for 2 ML systems independently. She passed the L2 evaluation exam and is a core member of our eval center of excellence.

Weak mentorship evidence:

I mentored 5 people in evaluation. I gave them access to my notes and recommended they take the eval.qa courses. They were interested in learning about evaluation metrics.

The second example is not mentorship; it's resource sharing. Mentorship is intentional, structured, and demonstrates growth.

Portfolio Evaluation Rubric

Your portfolio is graded on a 0-100 scale across 5 dimensions. You need 75+ overall to pass.

| Dimension | Excellent (95-100) | Strong (85-94) | Adequate (75-84) | Weak (<75) |
|---|---|---|---|---|
| Strategic Thinking | Design solves a critical problem; shows sophisticated understanding of org context | Design solves a real problem; good understanding of constraints | Design solves a defined problem; some gaps in sophistication | Problem is vague; limited strategic insight |
| Execution Quality | Evidence of successful execution; metrics show documented impact | Design is detailed and implementable; some execution evidence | Design is clear; limited execution evidence | Design lacks clarity or feasibility |
| Field Contribution | Published work that's novel and widely adopted; thousands cite or use it | Published work that's solid; adopted by a meaningful audience | Published work that's competent; limited adoption | Work is not published, or published but barely adopted |
| Leadership & Mentorship | Mentees have progressed multiple levels and now mentor others | Mentees have progressed 1-2 levels; are independent evaluators | Mentees have gained skills; limited independence demonstrated | Mentorship is not structured or evidence is weak |
| Communication | All artifacts are exceptionally well-written; ideas are crystalline | Artifacts are well-written; ideas are clear | Artifacts are readable; some clarity issues | Artifacts are hard to follow; unclear writing |

Timeline and Submission Process

Submission Windows

Portfolios are reviewed in three windows per year:

Panels review and respond within 4 weeks of the deadline. You'll hear: Accepted, Revision Required, or Rejected.

Format Requirements

All files must be submitted via the eval.qa portal. Max file size: 100MB total.

Common Rejection Reasons (and How to Avoid Them)

• 28% rejected for a weak eval program design
• 22% rejected for an insufficient published contribution
• 31% rejected for weak mentorship evidence
• 19% rejected for poor writing/clarity

Top Rejection Reasons

1. "Problem Statement Is Too Vague" (12% of rejections)

Weak: "We need better evaluation of our AI models."

Strong: "Our recommendation engine serves 50M users across 3 geographies. We found a 40% performance gap between desktop and mobile users, but only after we deployed a new ranking algorithm. We need systematic evaluation to catch such gaps before deployment."

Fix: Be specific. Use numbers. Describe the failure mode.

2. "Design Is Proposed, Not Implemented" (14%)

Acceptable: "We designed and implemented this evaluation program. It's now running in production."

Weak: "We proposed this evaluation framework, but it was never implemented."

Fix: If it's not implemented, provide a detailed plan for implementation and pilot it yourself if possible. Or choose a different project that you've actually executed.

3. "Contribution Is Too Niche or Too Basic" (18%)

Too niche: "I wrote a blog post on evaluating small language models for 3-word summaries." (Audience: ~10 people)

Too basic: "I wrote a blog post explaining what BLEU scores are." (Competent but not novel)

Strong: "I developed a new methodology for counterfactual evaluation of recommendation systems, published at FAccT, now used by 15+ companies."

Fix: Choose a contribution that's both novel and relevant to the broader community. Aim for 1000+ potential readers/users.

4. "Mentorship Is Not Structured" (16%)

Weak: "I helped several people learn about evaluation. They found it helpful."

Strong: "I mentored Alice from L1 to L3 over 6 months. We had bi-weekly sessions. She now leads evaluation for 2 products and mentors others."

Fix: Focus on 1-2 mentees. Document everything. Get their written feedback. Show concrete outcomes.

5. "Unclear Writing; Hard to Follow" (11%)

Reviewers are expert evaluators, but they shouldn't need to hunt for meaning in your writing.

Fix: Have someone outside your field review it. Are ideas clear? Do claims have evidence? Is the structure logical?

Adequate vs. Exceptional Portfolios

Passing is 75+. But portfolios that score 90+ are memorable and set you apart.

An Adequate Portfolio (75-84)

You pass. You earn the credential. You can lead evaluation programs. But you're not making headlines.

An Exceptional Portfolio (90-100)

You pass with distinction. You're considered a thought leader. You get recruited for advisor roles, speaking invitations, consulting opportunities.

6-Week Preparation Roadmap

If you're starting from scratch, here's a realistic timeline:

Week 1: Artifact Selection

Weeks 2-3: Eval Program Design

Week 4: Published Contribution

Week 5: Mentorship Evidence

Week 6: Polish & Submit

Reality check: If you don't have published work or structured mentorship yet, add 3-6 months to this timeline. Don't rush creating artifacts; quality matters more than speed.

What "Industry-Level" Contribution Means

For the published contribution, the bar is "industry-level." What does this mean exactly?

Not Industry-Level

Industry-Level

The key test: Would someone outside your organization find this valuable? If yes, it's industry-level.

Anonymization and Confidentiality

Some portfolios contain proprietary information. You can anonymize while keeping substance.

What You Can Anonymize

What You Cannot Anonymize (or it weakens the submission)

Fill out the anonymization form, and the review panel will treat your materials as confidential. You won't be penalized for reasonable anonymization.

The Revision Process

If you receive "Revision Required," you're not rejected. You have a clear path to pass.

The panel will specify what needs strengthening:

You have 12 weeks to resubmit revised artifacts. The same panel reviews your revision.

Revision Success Rate

• 78% of revision-required submissions pass on the second attempt
• 45% of outright rejections that reapply eventually pass (typically with a much stronger second portfolio)

Revision is not a death sentence. Most who revise successfully pass.

FAQ: Common Questions

Q: Can I use internal work for my eval program design if it's been published externally (e.g., as a case study)?

A: Yes. As long as you have permission and anonymize proprietary information appropriately, you can use real work.

Q: Can I use the same published contribution for another credential (like a portfolio for an academic PhD or teaching role)?

A: Yes. A piece of work can serve multiple purposes. You don't need separate publications.

Q: What if I co-authored my published work? Does that disqualify me?

A: No. Co-authorship is fine. Specify your contribution (1st author, 2nd author, equal contribution, etc.). You should be able to speak to your specific role.

Q: Can I use a product I built that's not open-source as my published contribution?

A: It depends on impact. If it's closed-source, the bar is higher: documented case studies, user testimonials, clear evidence of adoption. Open-source projects get credit for transparency and community.

Q: If I mentor someone who fails their certification exam, does that hurt my portfolio?

A: No. Your job is to develop them; their job is to pass the exam. Mentee growth is measured by progression, not exam results.

Q: How do I find someone to mentor if I don't already have mentees?

A: You have options: (1) Mentor an existing junior evaluator at your org. (2) Volunteer as a mentor through eval.qa's mentorship program. (3) Find someone in your professional network who is interested in learning evaluation. Building a mentoring record takes months, so start looking early; don't wait.

Q: Can I mentor someone remotely?

A: Yes. Document your sessions (Zoom, Slack, email). Remote mentorship is just as valid as in-person.

Q: Is 6 weeks enough to mentor someone if I'm just starting?

A: No. You should have 6+ months of mentoring history by the time you submit. Plan for this when you're starting your eval journey.

Q: Can my portfolio evaluation program be the same as my day job?

A: Yes. If you designed and led an evaluation program at work, that's valid. You don't need side projects.

Q: If my eval program design was rejected by my organization for implementation, can I still use it in my portfolio?

A: You can, but frame it as a "design proposal" not "executed program." The panel will ask why it wasn't implemented. Be honest. If it was rejected for budget reasons, that's understandable. If it was rejected because the design was flawed, use something else.

Q: How much detail should I include if I'm anonymizing?

A: Enough detail that an expert evaluator can understand the sophistication of your approach. If anonymization makes it too vague, you've anonymized too much.

Q: Can I update my portfolio after submission (before the panel reviews it)?

A: You have 5 days after submission to make minor updates (clarifications, broken links, etc.). After that, it's locked for review.

Q: What if I think a reviewer was unfair in their assessment?

A: You can appeal. You'll get a second review by a different panel member. Appeals are rare but possible if you can point to specific errors in assessment.