Scenario Setup
Your company, BrandForce, a digital marketing agency, has deployed an AI content generation system called ContentAI. The system generates marketing copy for multiple channels: blog posts (800-2000 words), social media posts (up to 280 characters), email campaigns (subject lines + 150-word body), and landing page copy. The goal is to help your client teams generate 1000+ pieces of content per day without sacrificing quality or brand consistency.
ContentAI is powered by a fine-tuned GPT model, trained on your agency's past successful campaigns, client brand guidelines, and proven marketing copywriting patterns. The system can generate content in minutes rather than hours, dramatically accelerating campaign timelines.
Your challenge: Evaluate whether ContentAI produces content that's good enough for client delivery. You must answer:
- Does ContentAI maintain consistent brand voice across all outputs?
- Is the generated content factually accurate?
- How often is content original vs. plagiarized from training data?
- Is SEO quality sufficient (readability, keyword density, structure)?
- What percentage of generated content is deployment-ready vs. requires human revision?
- How do we quality-assure 1000 pieces/day without becoming a bottleneck?
Content Generation Eval Dimensions
Content generation quality has seven distinct dimensions: factual accuracy, brand voice alignment, originality, SEO quality, readability, persuasiveness, and format compliance. Unlike code review, these dimensions are partly objective and partly subjective.
Notice that some dimensions (factual accuracy, SEO quality, format compliance) are mostly objective and can be evaluated automatically. Others (brand voice, persuasiveness) are subjective and require human expert judgment. The challenge is combining both to produce reliable quality signals.
Building a Content Eval Dataset
A good content eval dataset has stratified representation across all dimensions that matter:
Step 1: Collect Reference Content
- Golden examples: 50-100 pieces of high-performing client content (measured by engagement, conversions, or editor approval). This is your quality baseline.
- Competitor content: Similar content from competitors. This helps evaluate originality and positioning.
- Poor examples: Content that performed badly or required heavy revision. Understand what to avoid.
Step 2: Define Content Distribution
Build eval dataset with these proportions:
- 40%: Blog posts (multiple industries, lengths, topics)
- 30%: Social media posts (Twitter, LinkedIn, Instagram-style)
- 20%: Email campaigns (promotional, educational, re-engagement)
- 10%: Landing page copy (high-stakes, conversion-focused)
Step 3: Generate Candidate Content
Run ContentAI on 500+ content generation requests spanning your typical workload. Cover multiple clients, industries, and topics. Save all outputs for evaluation.
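The Step 2 proportions can be turned into a batch of generation requests mechanically. This is a minimal sketch (the channel names and the 500-request batch size mirror the scenario; the shuffling seed is an arbitrary choice for reproducibility):

```python
import random

# Target content mix from Step 2 of the eval dataset design
DISTRIBUTION = {"blog": 0.40, "social": 0.30, "email": 0.20, "landing": 0.10}

def build_request_batch(total: int, seed: int = 0) -> list[str]:
    """Allocate generation requests across channels per the target mix."""
    rng = random.Random(seed)
    batch = []
    for channel, share in DISTRIBUTION.items():
        batch.extend(channel for _ in range(round(total * share)))
    rng.shuffle(batch)  # interleave channels so runs aren't clustered
    return batch

batch = build_request_batch(500)  # 200 blog, 150 social, 100 email, 50 landing
```

In practice each entry would carry a client, industry, and topic as well; the point is that the channel mix is fixed up front rather than left to whatever requests happen to arrive.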
Brand Voice Rubric Design
The trickiest dimension to evaluate is brand voice alignment. You need to operationalize vague concepts like "professional but approachable" or "edgy and disruptive" into scoreable criteria.
Step 1: Document Brand Voice Attributes
For each client, extract brand voice attributes from their guidelines:
CLIENT: TechStartup Inc. (Sample)
TONE: Casual, conversational, expert-friendly
- Uses "we" and "you" (inclusive)
- Avoids corporate jargon
- Explains technical concepts simply
- Injects humor where appropriate
VOCABULARY: Modern, startup-friendly
- Prefers "help" over "assist"
- Uses "product" not "offering"
- Mentions "users" not "customers"
- Technical terms explained on first use
STRUCTURE: Inverted pyramid for blog, punchy for social
- Blog: Lead with key insight, support with details
- Social: Hook in first line, one clear call-to-action
- Email: Subject line under 50 chars, single topic
VALUES REFLECTED: Innovation, transparency, customer focus
- Mentions user benefits not features
- Admits product limitations transparently
- Celebrates user success stories
Step 2: Build Scoreable Criteria
Convert attributes into concrete evaluation criteria:
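One concrete translation is to encode the vocabulary rules from the sample profile as a machine-checkable table. A minimal sketch, assuming the TechStartup Inc. preferences above (the `TERMINOLOGY` mapping and function name are illustrative, not part of any real tool):

```python
import re

# Hypothetical terminology rules from the TechStartup Inc. profile:
# preferred term -> discouraged alternatives named in the guidelines.
TERMINOLOGY = {
    "help": ["assist"],
    "product": ["offering"],
    "users": ["customers"],
}

def terminology_violations(text: str) -> list[str]:
    """Return revision notes for discouraged terms found in the copy."""
    lowered = text.lower()
    notes = []
    for preferred, banned_terms in TERMINOLOGY.items():
        for banned in banned_terms:
            # \b anchors avoid flagging substrings inside longer words
            if re.search(rf"\b{re.escape(banned)}\b", lowered):
                notes.append(f"use '{preferred}' instead of '{banned}'")
    return notes
```

Tone and structure attributes resist this treatment and stay with human reviewers, but every vocabulary rule you can encode this way is one less thing a reviewer has to catch.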
Automated Quality Checks
Before human review, run all generated content through automated quality checks. This filters out obvious failures and measures objective dimensions:
Automated Metrics
AUTOMATED QUALITY CHECKS FOR CONTENT GENERATION
================================================================================
1. FORMAT COMPLIANCE
✓ Length within target range (blog: 800-2000 words, etc.)
✓ Required sections present (H1, H2s for blog; CTA for email)
✓ No incomplete sentences or fragments
✓ Proper heading hierarchy
2. GRAMMAR & SPELLING
✓ Zero typos/spelling errors (Grammarly API)
✓ No grammatical errors
✓ Sentence complexity score (Flesch-Kincaid)
3. READABILITY
✓ Flesch Reading Ease score 60+ (plain English, roughly 8th-9th grade level)
✓ Average sentence length 15-20 words
✓ Paragraph length 3-5 sentences max
✓ No walls of text (visible formatting breaks)
4. SEO BASICS
✓ Keyword density 1-3% (target keyword)
✓ Meta description present and <160 chars
✓ Internal/external links present
✓ H1 contains primary keyword
5. PLAGIARISM & ORIGINALITY
✓ Copyscape/Turnitin check < 10% matching text
✓ No exact sentences from training data
✓ Unique angle vs. competitor content
6. BRAND TERMINOLOGY
✓ Uses correct product names (not variations)
✓ Company name capitalized correctly
✓ No contradictions to brand guidelines
Flag content that fails any check for human review before deployment.
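A few of these checks can be sketched directly; the rest (Grammarly, Copyscape) are API calls to external services. This is a simplified illustration, assuming markdown-style `# ` headings and treating keyword density as phrase occurrences per 100 words (real SEO tools compute it slightly differently):

```python
import re

WORD_RE = re.compile(r"[A-Za-z0-9']+")

def check_length(text: str, lo: int, hi: int) -> bool:
    """Format compliance: word count within the channel's target range."""
    return lo <= len(WORD_RE.findall(text)) <= hi

def keyword_density(text: str, keyword: str) -> float:
    """Target-keyword occurrences per 100 words (a simplification)."""
    words = WORD_RE.findall(text)
    hits = text.lower().count(keyword.lower())
    return 100.0 * hits / max(len(words), 1)

def run_checks(text: str, keyword: str, lo: int = 800, hi: int = 2000) -> dict:
    """Run the automatable subset of checks; False values get flagged."""
    density = keyword_density(text, keyword)
    return {
        "length": check_length(text, lo, hi),
        "seo_density": 1.0 <= density <= 3.0,
        "has_h1": text.lstrip().startswith("# "),
    }
```

Any `False` in the result dict routes the piece to human revision, matching the "flag on any failed check" rule above.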
Sample Automated Report
Generated Content: "10 Ways AI Transforms Marketing"
Blog Post for TechStartup Inc.
AUTOMATED QUALITY CHECK RESULTS:
✓ Length: 1,420 words (target: 800-2000) - PASS
✓ Grammar: 0 errors (Grammarly) - PASS
✓ Readability: Flesch 72 (target: 60+) - PASS
✗ Format: Missing H2 subheadings (required 4+, found 2) - FAIL
✓ SEO: Keyword density "AI marketing" 1.8% - PASS
⚠ Plagiarism: 8% matching text (competitor blog post) - PASS but flag for review
✓ Brand terminology: All correct - PASS
OVERALL: 5/6 checks pass. Content requires human review for: (1) Add missing H2 subheadings, (2) Verify originality vs. competitor content despite passing plagiarism threshold.
Human Review Protocol
Not all content can be auto-checked. Subjective dimensions require human expert reviewers.
Reviewer Selection
- Brand specialists: Hire 2-3 senior copywriters who know each client's brand deeply
- Subject matter experts: For technical content, SMEs validate factual accuracy
- Marketing analysts: Evaluate persuasiveness based on marketing best practices
Review Process
For each piece of content:
- Read full content with client brand guidelines open
- Score each dimension (factual accuracy, brand voice, readability, etc.) on 1-5 scale
- Provide revision notes if score is 1-3 (what needs to change)
- Mark deployment-ready if score 4-5 on all dimensions; deployable-with-minor-edits if 3-4 on most
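The scoring rules above reduce to a small decision function. A sketch, assuming the seven dimension names used throughout this scenario and the thresholds from the review process (deploy on all 4-5, minor edits if nothing below 3, reject on any 1-2):

```python
DIMENSIONS = [
    "factual_accuracy", "brand_voice", "originality", "seo",
    "readability", "persuasiveness", "format",
]

def review_verdict(scores: dict[str, int]) -> str:
    """Map per-dimension 1-5 scores to the protocol's three outcomes."""
    values = [scores[d] for d in DIMENSIONS]
    if any(v <= 2 for v in values):
        return "reject"            # any dimension at 1-2 fails the piece
    if all(v >= 4 for v in values):
        return "deployment-ready"  # 4-5 across the board
    return "deploy-with-minor-edits"
```

Encoding the thresholds once keeps reviewers arguing about scores, not about how scores map to outcomes.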
Evaluating at Scale (1000 pieces/day)
BrandForce generates 1000 content pieces daily. You can't manually review all of them. Scale requires a hybrid approach:
Sampling Strategy
- 100% automated checks: All 1000 pieces get automated quality screening
- Random sample for human review: 5% sample (50 pieces/day) reviewed by humans for subjective dimensions
- Risk-based sampling: 100% human review for: landing pages, high-stakes campaigns, new client work, content that failed automated checks
- Stratified sampling: Ensure human sample covers all content types, clients, and topics proportionally
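The three sampling rules combine into one selection pass: take everything high-risk, then draw the random sample within each content type so the strata stay proportional. A minimal sketch (the piece dict keys `type`, `new_client`, and `failed_auto_checks` are assumed field names for this scenario):

```python
import random

HIGH_RISK_TYPES = {"landing_page"}  # plus high-stakes campaigns, etc.

def select_for_human_review(pieces: list[dict], rate: float = 0.05,
                            seed: int = 0) -> list[dict]:
    """All high-risk pieces, plus a stratified random sample of the rest."""
    rng = random.Random(seed)
    reviewed = [p for p in pieces
                if p["type"] in HIGH_RISK_TYPES
                or p.get("new_client") or p.get("failed_auto_checks")]
    rest = [p for p in pieces if p not in reviewed]
    # Stratify: sample `rate` within each content type, at least one each
    by_type: dict[str, list[dict]] = {}
    for p in rest:
        by_type.setdefault(p["type"], []).append(p)
    for group in by_type.values():
        reviewed.extend(rng.sample(group, max(1, round(rate * len(group)))))
    return reviewed
```

On a day of 100 blog posts and 10 landing pages this yields 15 reviews: all 10 landing pages plus 5 sampled blog posts.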
Quality Control Workflow
CONTENT GENERATION PIPELINE WITH QUALITY GATES
================================================================================
1. ContentAI generates content for each request
↓
2. Automated quality checks (grammar, format, plagiarism, brand terminology)
├─ PASS? → goes to QA sampling queue
└─ FAIL? → flagged for human revision
↓
3. Human QA sample (5%) reviews subjective dimensions
├─ Score 4-5 on all dims? → DEPLOY
├─ Score 3-4 on most dims? → Deploy with minor edit suggestion
└─ Score 1-2 on any dim? → REJECT, request regeneration
↓
4. Non-sampled content (95%) gets released based on automated checks only
(risk: not catching subjective quality issues, but necessary for scale)
↓
5. Post-deployment monitoring
├─ Track engagement metrics (clicks, conversions)
├─ Collect client feedback
└─ If engagement drops, increase sampling percentage
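Step 5's feedback loop can be made explicit: the sampling rate is a dial that post-deployment metrics turn. A sketch, assuming a simple rule (the 10% drop threshold and 5-point step are illustrative parameters, not values from the scenario):

```python
def next_sampling_rate(rate: float, engagement: float, baseline: float,
                       drop_threshold: float = 0.10, step: float = 0.05,
                       max_rate: float = 1.0) -> float:
    """Raise the human-QA sampling rate when engagement falls more than
    `drop_threshold` below baseline; otherwise keep it unchanged."""
    if engagement < baseline * (1 - drop_threshold):
        return min(rate + step, max_rate)
    return rate
```

The same rule can run per content type or per client, so a quality regression on one channel raises scrutiny only where it occurred.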
Running the Scenario Step-by-Step
Week 1: Setup & Baseline
- Collect 50 high-performing client pieces as golden examples
- Document brand voice attributes for 5 major clients
- Build scoreable brand voice rubric for each client
- Configure automated quality checks (grammar, plagiarism, SEO tools)
Week 2: Dataset & Calibration
- Run ContentAI on 500 content generation requests
- Conduct rater calibration session (2 hours) with 3 expert reviewers
- Each reviewer scores 30 sample pieces independently
- Discuss disagreements; refine rubric until inter-rater agreement ≥70%
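A simple way to measure the ≥70% target is mean pairwise agreement across the three reviewers. A sketch (exact-match agreement by default; a `tolerance` of 1 counts adjacent scores as agreement, a common relaxation for 1-5 rubrics; more formal statistics like Cohen's kappa also correct for chance agreement):

```python
from itertools import combinations

def pairwise_agreement(scores_by_rater: list[list[int]],
                       tolerance: int = 0) -> float:
    """Mean pairwise agreement; scores_by_rater[r][i] is rater r's
    1-5 score on item i."""
    pairs = list(combinations(scores_by_rater, 2))
    n_items = len(scores_by_rater[0])
    agree = sum(
        abs(a[i] - b[i]) <= tolerance
        for a, b in pairs
        for i in range(n_items)
    )
    return agree / (len(pairs) * n_items)
```

If the number comes in under 0.70, the calibration session loops: discuss the items that split the raters, tighten the rubric wording, and re-score.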
Week 3: Full Evaluation
- 300 pieces (drawn from the 500 generated) undergo automated quality checks
- 150 pieces (50% sample) undergo human expert review on all dimensions
- Reviewers score each piece on 7 dimensions (factual accuracy, brand voice, originality, SEO, readability, persuasiveness, format)
- 2-3 raters per piece for inter-rater agreement validation
Week 4: Analysis & Reporting
- Aggregate results across all 300 pieces
- Break down by content type (blog, social, email, landing page)
- Break down by client
- Identify failure mode patterns
- Calculate what % of generated content is deployment-ready (4-5 on all dimensions) vs. needs revision (a 3 on some dimension) vs. should be rejected (1-2 on any dimension)
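The Week 4 breakdown is an aggregation over the per-piece scores. A sketch, assuming each result carries a content `type` and a dict of dimension `scores`, and using the worst dimension to bucket each piece per the workflow's thresholds:

```python
from collections import defaultdict

def readiness_breakdown(results: list[dict]) -> dict[str, dict[str, float]]:
    """Per content type: % deployment-ready, needs-revision, rejected."""
    buckets: dict = defaultdict(lambda: {"ready": 0, "revise": 0,
                                         "reject": 0, "n": 0})
    for r in results:
        low = min(r["scores"].values())  # worst dimension decides the bucket
        key = "ready" if low >= 4 else "revise" if low == 3 else "reject"
        b = buckets[r["type"]]
        b[key] += 1
        b["n"] += 1
    return {
        t: {k: round(100 * b[k] / b["n"], 1)
            for k in ("ready", "revise", "reject")}
        for t, b in buckets.items()
    }
```

The output maps directly onto the results-by-content-type table in the eval report, which is what makes per-channel deployment decisions possible.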
Content Failure Mode Taxonomy
When content fails evaluation, categorize the failures to understand where ContentAI struggles. Common failure modes include hallucinated statistics, brand voice misses, weak persuasiveness, readability issues, and format errors.
If hallucinated stats are 12% of failures, that's your biggest problem to fix. If brand voice misses are 20%, focus improvement effort there. This categorization drives product development priorities.
Writing the Content AI Eval Report
Package evaluation results in an executive report for marketing leadership:
Executive Summary
Can we use ContentAI for production content at scale?
"ContentAI is production-ready for 60-70% of requested content, particularly blog posts and social media. However, it struggles with brand voice consistency (15% miss rate) and occasionally hallucinates statistics (8% error rate). Recommendation: Deploy ContentAI for blog and social content with 5% human QA sampling. Continue manual writing for landing pages and high-stakes campaigns. Invest in prompt engineering to reduce brand voice misses and fact-checking to eliminate hallucinations."
Results by Content Type
| Content Type | Deployment-Ready (%) | Needs Minor Edits (%) | Reject & Regenerate (%) | Recommendation |
|---|---|---|---|---|
| Blog Posts (200) | 68% | 22% | 10% | DEPLOY with 5% QA sampling |
| Social Posts (150) | 72% | 18% | 10% | DEPLOY with 5% QA sampling |
| Email Campaigns (100) | 58% | 32% | 10% | DEPLOY with 10% QA sampling |
| Landing Pages (50) | 42% | 32% | 26% | MANUAL ONLY (too risky) |
Dimension Performance
How did ContentAI perform on each quality dimension?
- Factual Accuracy: 85% of content accurate. 8% hallucinated statistics. 7% needed fact-checking.
- Brand Voice Alignment: 72% on-brand. 15% tone misses, 13% terminology issues.
- Originality: 91% original. 7% flagged as derivative of competitor content. 2% plagiarism concerns.
- SEO Quality: 78% SEO optimized. Common misses: keyword density, internal linking.
- Readability: 84% readable. Flesch score average 71 (target 60+).
- Persuasiveness: 68% persuasive. Missing clear CTAs; weak value propositions.
- Format Compliance: 92% correct format. Rare errors but automatable.
Failure Mode Breakdown
- Brand voice misses (15%): Solution: Improve prompt with brand guidelines; fine-tune model on client examples
- Hallucinated statistics (8%): Solution: Add fact-checking requirement; provide source documents in prompt
- Weak persuasiveness (10%): Solution: A/B test CTA variations; optimize for conversion metrics
- Readability issues (6%): Solution: Add readability score constraint in prompt
Deployment Plan
- Phase 1 (Immediate): Deploy for blog and social content (70%+ deployment ready). Start with 100% human QA sampling to build confidence. After 2 weeks, reduce to 5% sampling.
- Phase 2 (Month 2): Add email campaigns to deployment (with 10% QA sampling due to lower deployment-ready rate).
- Phase 3 (Month 3): Continue manual-only for landing pages. Revisit after 6 weeks if hallucination rate drops below 5%.
- Phase 4 (Months 2-3): Improve model: fine-tune on brand voice examples, add fact-checking, optimize persuasiveness.
The biggest mistake with content AI eval is collapsing all dimensions into a single accuracy score. "ContentAI is 73% accurate" is useless. What matters is: 70% deployment-ready for blogs, 42% for landing pages, 8% hallucination rate, 15% brand voice misses. These specific metrics drive deployment decisions and improvement priorities. Always report dimension-level and content-type-level results.
Key Takeaways
- Content generation eval has seven dimensions: factual accuracy, brand voice, originality, SEO quality, readability, persuasiveness, format compliance
- Mix automated and human evaluation: Objective dimensions (format, plagiarism, grammar) → automated. Subjective dimensions (brand voice, persuasiveness) → human experts
- Brand voice requires calibrated rubric: Document client attributes, translate to scoreable criteria, validate inter-rater agreement before full evaluation
- Scale through sampling: 100% automated checks for all content, 5-10% human QA sampling, 100% review for high-risk content (landing pages, new clients)
- Categorize failures: Hallucinated stats, brand voice misses, weak persuasiveness, etc. Use failure patterns to drive product improvement priorities
- Report by segment: Break down results by content type, client, and quality dimension. Enable conditional deployment decisions (deploy blogs, hold landing pages)
- Deployment readiness depends on failure tolerance: Social media can tolerate 10% failures. Landing pages can't. Tailor sampling and rollout strategy to risk tolerance