Scenario Setup
Your company, BrandForce, a digital marketing agency, has deployed an AI content generation system called ContentAI. The system generates marketing copy for multiple channels: blog posts (800-2000 words), social media posts (up to 280 characters), email campaigns (subject lines + 150-word body), and landing page copy. The goal is to help your client teams generate 1000+ pieces of content per day without sacrificing quality or brand consistency.
ContentAI is powered by a fine-tuned GPT model, trained on your agency's past successful campaigns, client brand guidelines, and proven marketing copywriting patterns. The system can generate content in minutes rather than hours, dramatically accelerating campaign timelines.
Your challenge: Evaluate whether ContentAI produces content that's good enough for client delivery. You must answer:
- Does ContentAI maintain consistent brand voice across all outputs?
- Is the generated content factually accurate?
- How often is content original vs. plagiarized from training data?
- Is SEO quality sufficient (readability, keyword density, structure)?
- What percentage of generated content is deployment-ready vs. requires human revision?
- How do we quality-assure 1000 pieces/day without becoming a bottleneck?
Content Generation Eval Dimensions
Content generation quality has seven distinct dimensions: factual accuracy, brand voice alignment, originality, SEO quality, readability, persuasiveness, and format compliance. Unlike code review, these dimensions are partly objective and partly subjective.
Notice that some dimensions (factual accuracy, SEO quality, format compliance) are mostly objective and can be evaluated automatically. Others (brand voice, persuasiveness) are subjective and require human expert judgment. The challenge is combining both to produce reliable quality signals.
Building a Content Eval Dataset
A good content eval dataset has stratified representation across all dimensions that matter:
Step 1: Collect Reference Content
- Golden examples: 50-100 pieces of high-performing client content (measured by engagement, conversions, or editor approval). This is your quality baseline.
- Competitor content: Similar content from competitors. This helps evaluate originality and positioning.
- Poor examples: Content that performed badly or required heavy revision. Understand what to avoid.
Step 2: Define Content Distribution
Build eval dataset with these proportions:
- 40%: Blog posts (multiple industries, lengths, topics)
- 30%: Social media posts (Twitter, LinkedIn, Instagram-style)
- 20%: Email campaigns (promotional, educational, re-engagement)
- 10%: Landing page copy (high-stakes, conversion-focused)
Step 3: Generate Candidate Content
Run ContentAI on 500+ content generation requests spanning your typical workload. Cover multiple clients, industries, and topics. Save all outputs for evaluation.
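The Step 2 proportions can be turned into a batch of generation requests mechanically. This is a minimal sketch (the channel names and the 500-request batch size mirror the scenario; the shuffling seed is an arbitrary choice for reproducibility):

```python
import random

# Target content mix from Step 2 of the eval dataset design
DISTRIBUTION = {"blog": 0.40, "social": 0.30, "email": 0.20, "landing": 0.10}

def build_request_batch(total: int, seed: int = 0) -> list[str]:
    """Allocate generation requests across channels per the target mix."""
    rng = random.Random(seed)
    batch = []
    for channel, share in DISTRIBUTION.items():
        batch.extend(channel for _ in range(round(total * share)))
    rng.shuffle(batch)  # interleave channels so runs aren't clustered
    return batch

batch = build_request_batch(500)  # 200 blog, 150 social, 100 email, 50 landing
```

In practice each entry would carry a client, industry, and topic as well; the point is that the channel mix is fixed up front rather than left to whatever requests happen to arrive.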
Brand Voice Rubric Design
The trickiest dimension to evaluate is brand voice alignment. You need to operationalize vague concepts like "professional but approachable" or "edgy and disruptive" into scoreable criteria.
Step 1: Document Brand Voice Attributes
For each client, extract brand voice attributes from their guidelines:
CLIENT: TechStartup Inc. (Sample)
TONE: Casual, conversational, expert-friendly
- Uses "we" and "you" (inclusive)
- Avoids corporate jargon
- Explains technical concepts simply
- Injects humor where appropriate
VOCABULARY: Modern, startup-friendly
- Prefers "help" over "assist"
- Uses "product" not "offering"
- Mentions "users" not "customers"
- Technical terms explained on first use
STRUCTURE: Inverted pyramid for blog, punchy for social
- Blog: Lead with key insight, support with details
- Social: Hook in first line, one clear call-to-action
- Email: Subject line under 50 chars, single topic
VALUES REFLECTED: Innovation, transparency, customer focus
- Mentions user benefits not features
- Admits product limitations transparently
- Celebrates user success stories
Step 2: Build Scoreable Criteria
Convert attributes into concrete evaluation criteria:
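One concrete translation is to encode the vocabulary rules from the sample profile as a machine-checkable table. A minimal sketch, assuming the TechStartup Inc. preferences above (the `TERMINOLOGY` mapping and function name are illustrative, not part of any real tool):

```python
import re

# Hypothetical terminology rules from the TechStartup Inc. profile:
# preferred term -> discouraged alternatives named in the guidelines.
TERMINOLOGY = {
    "help": ["assist"],
    "product": ["offering"],
    "users": ["customers"],
}

def terminology_violations(text: str) -> list[str]:
    """Return revision notes for discouraged terms found in the copy."""
    lowered = text.lower()
    notes = []
    for preferred, banned_terms in TERMINOLOGY.items():
        for banned in banned_terms:
            # \b anchors avoid flagging substrings inside longer words
            if re.search(rf"\b{re.escape(banned)}\b", lowered):
                notes.append(f"use '{preferred}' instead of '{banned}'")
    return notes
```

Tone and structure attributes resist this treatment and stay with human reviewers, but every vocabulary rule you can encode this way is one less thing a reviewer has to catch.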
Automated Quality Checks
Before human review, run all generated content through automated quality checks. This filters out obvious failures and measures objective dimensions:
Automated Metrics
AUTOMATED QUALITY CHECKS FOR CONTENT GENERATION
================================================================================
1. FORMAT COMPLIANCE
✓ Length within target range (blog: 800-2000 words, etc.)
✓ Required sections present (H1, H2s for blog; CTA for email)
✓ No incomplete sentences or fragments
✓ Proper heading hierarchy
2. GRAMMAR & SPELLING
✓ Zero typos/spelling errors (Grammarly API)
✓ No grammatical errors
✓ Sentence complexity score (Flesch-Kincaid)
3. READABILITY
✓ Flesch Reading Ease score 60+ (plain English, roughly 8th-9th grade level)
✓ Average sentence length 15-20 words
✓ Paragraph length 3-5 sentences max
✓ No walls of text (visible formatting breaks)
4. SEO BASICS
✓ Keyword density 1-3% (target keyword)
✓ Meta description present and <160 chars
✓ Internal/external links present
✓ H1 contains primary keyword
5. PLAGIARISM & ORIGINALITY
✓ Copyscape/Turnitin check < 10% matching text
✓ No exact sentences from training data
✓ Unique angle vs. competitor content
6. BRAND TERMINOLOGY
✓ Uses correct product names (not variations)
✓ Company name capitalized correctly
✓ No contradictions to brand guidelines
Flag content that fails any check for human review before deployment.
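A few of these checks can be sketched directly; the rest (Grammarly, Copyscape) are API calls to external services. This is a simplified illustration, assuming markdown-style `# ` headings and treating keyword density as phrase occurrences per 100 words (real SEO tools compute it slightly differently):

```python
import re

WORD_RE = re.compile(r"[A-Za-z0-9']+")

def check_length(text: str, lo: int, hi: int) -> bool:
    """Format compliance: word count within the channel's target range."""
    return lo <= len(WORD_RE.findall(text)) <= hi

def keyword_density(text: str, keyword: str) -> float:
    """Target-keyword occurrences per 100 words (a simplification)."""
    words = WORD_RE.findall(text)
    hits = text.lower().count(keyword.lower())
    return 100.0 * hits / max(len(words), 1)

def run_checks(text: str, keyword: str, lo: int = 800, hi: int = 2000) -> dict:
    """Run the automatable subset of checks; False values get flagged."""
    density = keyword_density(text, keyword)
    return {
        "length": check_length(text, lo, hi),
        "seo_density": 1.0 <= density <= 3.0,
        "has_h1": text.lstrip().startswith("# "),
    }
```

Any `False` in the result dict routes the piece to human revision, matching the "flag on any failed check" rule above.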
Sample Automated Report
Generated Content: "10 Ways AI Transforms Marketing"
Blog Post for TechStartup Inc.
AUTOMATED QUALITY CHECK RESULTS:
✓ Length: 1,420 words (target: 800-2000) - PASS
✓ Grammar: 0 errors (Grammarly) - PASS
✓ Readability: Flesch 72 (target: 60+) - PASS
✗ Format: Missing H2 subheadings (required 4+, found 2) - FAIL
✓ SEO: Keyword density "AI marketing" 1.8% - PASS
⚠ Plagiarism: 8% matching text (competitor blog post) - PASS but flag for review
✓ Brand terminology: All correct - PASS
OVERALL: 5/6 checks pass. Content requires human review for: (1) Add missing H2 subheadings, (2) Verify originality vs. competitor content despite passing plagiarism threshold.
Human Review Protocol
Not all content can be auto-checked. Subjective dimensions require human expert reviewers.
Reviewer Selection
- Brand specialists: Hire 2-3 senior copywriters who know each client's brand deeply
- Subject matter experts: For technical content, SMEs validate factual accuracy
- Marketing analysts: Evaluate persuasiveness based on marketing best practices
Review Process
For each piece of content:
- Read full content with client brand guidelines open
- Score each dimension (factual accuracy, brand voice, readability, etc.) on 1-5 scale
- Provide revision notes if score is 1-3 (what needs to change)
- Mark deployment-ready if score 4-5 on all dimensions; deployable-with-minor-edits if 3-4 on most
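The scoring rules above reduce to a small decision function. A sketch, assuming the seven dimension names used throughout this scenario and the thresholds from the review process (deploy on all 4-5, minor edits if nothing below 3, reject on any 1-2):

```python
DIMENSIONS = [
    "factual_accuracy", "brand_voice", "originality", "seo",
    "readability", "persuasiveness", "format",
]

def review_verdict(scores: dict[str, int]) -> str:
    """Map per-dimension 1-5 scores to the protocol's three outcomes."""
    values = [scores[d] for d in DIMENSIONS]
    if any(v <= 2 for v in values):
        return "reject"            # any dimension at 1-2 fails the piece
    if all(v >= 4 for v in values):
        return "deployment-ready"  # 4-5 across the board
    return "deploy-with-minor-edits"
```

Encoding the thresholds once keeps reviewers arguing about scores, not about how scores map to outcomes.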
Evaluating at Scale (1000 pieces/day)
BrandForce generates 1000 content pieces daily. You can't manually review all of them. Scale requires a hybrid approach:
Sampling Strategy
- 100% automated checks: All 1000 pieces get automated quality screening
- Random sample for human review: 5% sample (50 pieces/day) reviewed by humans for subjective dimensions
- Risk-based sampling: 100% human review for: landing pages, high-stakes campaigns, new client work, content that failed automated checks
- Stratified sampling: Ensure human sample covers all content types, clients, and topics proportionally
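The three sampling rules combine into one selection pass: take everything high-risk, then draw the random sample within each content type so the strata stay proportional. A minimal sketch (the piece dict keys `type`, `new_client`, and `failed_auto_checks` are assumed field names for this scenario):

```python
import random

HIGH_RISK_TYPES = {"landing_page"}  # plus high-stakes campaigns, etc.

def select_for_human_review(pieces: list[dict], rate: float = 0.05,
                            seed: int = 0) -> list[dict]:
    """All high-risk pieces, plus a stratified random sample of the rest."""
    rng = random.Random(seed)
    reviewed = [p for p in pieces
                if p["type"] in HIGH_RISK_TYPES
                or p.get("new_client") or p.get("failed_auto_checks")]
    rest = [p for p in pieces if p not in reviewed]
    # Stratify: sample `rate` within each content type, at least one each
    by_type: dict[str, list[dict]] = {}
    for p in rest:
        by_type.setdefault(p["type"], []).append(p)
    for group in by_type.values():
        reviewed.extend(rng.sample(group, max(1, round(rate * len(group)))))
    return reviewed
```

On a day of 100 blog posts and 10 landing pages this yields 15 reviews: all 10 landing pages plus 5 sampled blog posts.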
Quality Control Workflow
CONTENT GENERATION PIPELINE WITH QUALITY GATES
================================================================================
1. ContentAI generates content for each request
↓
2. Automated quality checks (grammar, format, plagiarism, brand terminology)
├─ PASS? → goes to QA sampling queue
└─ FAIL? → flagged for human revision
↓
3. Human QA sample (5%) reviews subjective dimensions
├─ Score 4-5 on all dims? → DEPLOY
├─ Score 3-4 on most dims? → Deploy with minor edit suggestion
└─ Score 1-2 on any dim? → REJECT, request regeneration
↓
4. Non-sampled content (95%) gets released based on automated checks only
(risk: not catching subjective quality issues, but necessary for scale)
↓
5. Post-deployment monitoring
├─ Track engagement metrics (clicks, conversions)
├─ Collect client feedback
└─ If engagement drops, increase sampling percentage
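Step 5's feedback loop can be made explicit: the sampling rate is a dial that post-deployment metrics turn. A sketch, assuming a simple rule (the 10% drop threshold and 5-point step are illustrative parameters, not values from the scenario):

```python
def next_sampling_rate(rate: float, engagement: float, baseline: float,
                       drop_threshold: float = 0.10, step: float = 0.05,
                       max_rate: float = 1.0) -> float:
    """Raise the human-QA sampling rate when engagement falls more than
    `drop_threshold` below baseline; otherwise keep it unchanged."""
    if engagement < baseline * (1 - drop_threshold):
        return min(rate + step, max_rate)
    return rate
```

The same rule can run per content type or per client, so a quality regression on one channel raises scrutiny only where it occurred.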
Running the Scenario Step-by-Step
Week 1: Setup & Baseline
- Collect 50 high-performing client pieces as golden examples
- Document brand voice attributes for 5 major clients
- Build scoreable brand voice rubric for each client
- Configure automated quality checks (grammar, plagiarism, SEO tools)
Week 2: Dataset & Calibration
- Run ContentAI on 500 content generation requests
- Conduct rater calibration session (2 hours) with 3 expert reviewers
- Each reviewer scores 30 sample pieces independently
- Discuss disagreements; refine rubric until inter-rater agreement ≥70%
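A simple way to measure the ≥70% target is mean pairwise agreement across the three reviewers. A sketch (exact-match agreement by default; a `tolerance` of 1 counts adjacent scores as agreement, a common relaxation for 1-5 rubrics; more formal statistics like Cohen's kappa also correct for chance agreement):

```python
from itertools import combinations

def pairwise_agreement(scores_by_rater: list[list[int]],
                       tolerance: int = 0) -> float:
    """Mean pairwise agreement; scores_by_rater[r][i] is rater r's
    1-5 score on item i."""
    pairs = list(combinations(scores_by_rater, 2))
    n_items = len(scores_by_rater[0])
    agree = sum(
        abs(a[i] - b[i]) <= tolerance
        for a, b in pairs
        for i in range(n_items)
    )
    return agree / (len(pairs) * n_items)
```

If the number comes in under 0.70, the calibration session loops: discuss the items that split the raters, tighten the rubric wording, and re-score.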
Week 3: Full Evaluation
- 300 pieces (drawn from the 500 generated) undergo automated quality checks
- 150 pieces (50% sample) undergo human expert review on all dimensions
- Reviewers score each piece on 7 dimensions (factual accuracy, brand voice, originality, SEO, readability, persuasiveness, format)
- 2-3 raters per piece for inter-rater agreement validation
Week 4: Analysis & Reporting
- Aggregate results across all 300 pieces
- Break down by content type (blog, social, email, landing page)
- Break down by client
- Identify failure mode patterns
- Calculate what % of generated content is deployment-ready (4-5 on all dimensions) vs. needs revision (a 3 on some dimension) vs. should be rejected (1-2 on any dimension)
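The Week 4 breakdown is an aggregation over the per-piece scores. A sketch, assuming each result carries a content `type` and a dict of dimension `scores`, and using the worst dimension to bucket each piece per the workflow's thresholds:

```python
from collections import defaultdict

def readiness_breakdown(results: list[dict]) -> dict[str, dict[str, float]]:
    """Per content type: % deployment-ready, needs-revision, rejected."""
    buckets: dict = defaultdict(lambda: {"ready": 0, "revise": 0,
                                         "reject": 0, "n": 0})
    for r in results:
        low = min(r["scores"].values())  # worst dimension decides the bucket
        key = "ready" if low >= 4 else "revise" if low == 3 else "reject"
        b = buckets[r["type"]]
        b[key] += 1
        b["n"] += 1
    return {
        t: {k: round(100 * b[k] / b["n"], 1)
            for k in ("ready", "revise", "reject")}
        for t, b in buckets.items()
    }
```

The output maps directly onto the results-by-content-type table in the eval report, which is what makes per-channel deployment decisions possible.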
Content Failure Mode Taxonomy
When content fails evaluation, categorize the failures to understand where ContentAI struggles. Common failure modes include hallucinated statistics, brand voice misses, weak persuasiveness, readability issues, and format errors.
If hallucinated stats are 12% of failures, that's your biggest problem to fix. If brand voice misses are 20%, focus improvement effort there. This categorization drives product development priorities.
Writing the Content AI Eval Report
Package evaluation results in an executive report for marketing leadership:
Executive Summary
Can we use ContentAI for production content at scale?
"ContentAI is production-ready for 60-70% of requested content, particularly blog posts and social media. However, it struggles with brand voice consistency (15% miss rate) and occasionally hallucinates statistics (8% error rate). Recommendation: Deploy ContentAI for blog and social content with 5% human QA sampling. Continue manual writing for landing pages and high-stakes campaigns. Invest in prompt engineering to reduce brand voice misses and fact-checking to eliminate hallucinations."
Results by Content Type
| Content Type | Deployment-Ready (%) | Needs Minor Edits (%) | Reject & Regenerate (%) | Recommendation |
|---|---|---|---|---|
| Blog Posts (200) | 68% | 22% | 10% | DEPLOY with 5% QA sampling |
| Social Posts (150) | 72% | 18% | 10% | DEPLOY with 5% QA sampling |
| Email Campaigns (100) | 58% | 32% | 10% | DEPLOY with 10% QA sampling |
| Landing Pages (50) | 42% | 32% | 26% | MANUAL ONLY (too risky) |
Dimension Performance
How did ContentAI perform on each quality dimension?
- Factual Accuracy: 85% of content accurate. 8% hallucinated statistics. 7% needed fact-checking.
- Brand Voice Alignment: 72% on-brand. 15% tone misses, 13% terminology issues.
- Originality: 91% original. 7% flagged as derivative of competitor content. 2% plagiarism concerns.
- SEO Quality: 78% SEO optimized. Common misses: keyword density, internal linking.
- Readability: 84% readable. Flesch score average 71 (target 60+).
- Persuasiveness: 68% persuasive. Missing clear CTAs; weak value propositions.
- Format Compliance: 92% correct format. Rare errors but automatable.
Failure Mode Breakdown
- Brand voice misses (15%): Solution: Improve prompt with brand guidelines; fine-tune model on client examples
- Hallucinated statistics (8%): Solution: Add fact-checking requirement; provide source documents in prompt
- Weak persuasiveness (10%): Solution: A/B test CTA variations; optimize for conversion metrics
- Readability issues (6%): Solution: Add readability score constraint in prompt
Deployment Plan
- Phase 1 (Immediate): Deploy for blog and social content (70%+ deployment ready). Start with 100% human QA sampling to build confidence. After 2 weeks, reduce to 5% sampling.
- Phase 2 (Month 2): Add email campaigns to deployment (with 10% QA sampling due to lower deployment-ready rate).
- Phase 3 (Month 3): Continue manual-only for landing pages. Revisit after 6 weeks if hallucination rate drops below 5%.
- Phase 4 (Months 2-3): Improve model: fine-tune on brand voice examples, add fact-checking, optimize persuasiveness.
The biggest mistake with content AI eval is collapsing all dimensions into a single accuracy score. "ContentAI is 73% accurate" is useless. What matters is: 70% deployment-ready for blogs, 42% for landing pages, 8% hallucination rate, 15% brand voice misses. These specific metrics drive deployment decisions and improvement priorities. Always report dimension-level and content-type-level results.
Key Takeaways
- Content generation eval has seven dimensions: factual accuracy, brand voice, originality, SEO quality, readability, persuasiveness, format compliance
- Mix automated and human evaluation: Objective dimensions (format, plagiarism, grammar) → automated. Subjective dimensions (brand voice, persuasiveness) → human experts
- Brand voice requires calibrated rubric: Document client attributes, translate to scoreable criteria, validate inter-rater agreement before full evaluation
- Scale through sampling: 100% automated checks for all content, 5-10% human QA sampling, 100% review for high-risk content (landing pages, new clients)
- Categorize failures: Hallucinated stats, brand voice misses, weak persuasiveness, etc. Use failure patterns to drive product improvement priorities
- Report by segment: Break down results by content type, client, and quality dimension. Enable conditional deployment decisions (deploy blogs, hold landing pages)
- Deployment readiness depends on failure tolerance: Social media can tolerate 10% failures. Landing pages can't. Tailor sampling and rollout strategy to risk tolerance