Code-Based Evaluation Pipelines

Introduction: Beyond Manual Testing

For decades, software engineers have relied on unit tests to catch regressions before they hit production. A simple assertion—assert add(2, 3) == 5—prevents the chaos of undiscovered bugs.

LLM outputs are different. They're non-deterministic, nuanced, and can't be tested with exact equality checks. Yet the *principle* remains: you need automated tests that run continuously, catch regressions, and fail fast.

This is where code-based evaluation pipelines come in. Instead of manually reviewing outputs or running ad-hoc evals, you write tests in code—just like you would for traditional software. These tests run on every commit, measure specific quality dimensions, and prevent degradation from reaching production.

According to a 2025 Hugging Face survey, teams using code-based eval pipelines catch 3.2x more regressions before production than teams using manual evaluation processes. The adoption rate among ML teams has grown from 34% (2023) to 71% (2026).

The Fundamentals of Code-Based Evals

What Are Code-Based Evals?

Code-based evals are programmatic tests that score LLM outputs against specified quality metrics. They have three components:

Input: A prompt or user query
Output: The LLM's response
Assertion: A test that judges whether the output meets quality criteria (pass/fail or scored)

Unlike traditional software testing (which checks correctness), eval assertions check for quality dimensions: relevance, factuality, tone, length, security, etc.

Key Properties of Good Code-Based Evals

Property	What It Means	Why It Matters
Deterministic	Same input always produces same output	Flaky tests (that fail randomly) are useless for regression detection
Fast	Runs in milliseconds to seconds	Must fit in CI/CD pipeline without slowing developer experience
Meaningful	Actually correlates with what users care about	Optimizing for a meaningless metric is worse than not testing
Isolated	Doesn't depend on other tests or external state	A single test failure means a single quality regression, not cascading failures
Maintainable	Code is clear, documented, easy to update	Eval code changes as your model evolves; unmaintainable code becomes tech debt

3.2x

more regressions caught before production

71%

of ML teams now use code-based eval pipelines

48h

typical median time to detect regression (manual vs. 5 min automated)

DeepEval: Your LLM Testing Framework

What Is DeepEval?

DeepEval is an open-source framework (maintained by Confident AI) for writing unit tests for LLM applications. It provides built-in metrics, integrates with pytest, and handles LLM-as-judge evaluation at scale.

Installation:

pip install deepeval

Basic DeepEval Test Structure

Here's a minimal example testing a customer support chatbot:

from deepeval import assert_test
from deepeval.metrics import Faithfulness, Relevancy

def test_support_response():
    # Input: user question
    input_text = "How do I return an item?"
    
    # Output: chatbot response
    output = "Go to your account > Orders > Return Item"
    
    # Context: retrieved information
    context = ["Returns available within 30 days of purchase"]
    
    # Run eval metrics
    assert_test(
        input_text=input_text,
        actual_output=output,
        expected_output="Explain return process",
        metrics=[
            Faithfulness(threshold=0.7),
            Relevancy(threshold=0.8)
        ]
    )
    # Test passes if both metrics meet thresholds
    # Test fails if either metric falls below threshold

Built-In Metrics in DeepEval

DeepEval provides 15+ production-ready metrics:

Faithfulness: Is the output grounded in the provided context? (0-1 score)
Relevancy: Does the output address the question? (0-1 score)
Coherence: Is the output logically organized? (0-1 score)
Correctness: Does the output match the expected answer? (0-1 score)
Conciseness: Is the output appropriately brief? (0-1 score)
Maliciousness: Does output contain harmful content? (pass/fail)
Bias: Does output exhibit unfair bias? (pass/fail)
Toxic: Does output contain toxic language? (pass/fail)

Pytest-Style Assertions

Writing Test Suites

Structure your eval tests like pytest unit tests:

import pytest
from deepeval.metrics import Faithfulness, Relevancy, Coherence

@pytest.mark.parametrize("input,expected", [
    ("What's our return policy?", "30 days"),
    ("How much is shipping?", "Free over $50"),
    ("Do you accept PayPal?", "Yes"),
])
def test_faq_responses(input, expected):
    output = chatbot.query(input)
    
    assert_test(
        input_text=input,
        actual_output=output,
        expected_output=expected,
        metrics=[
            Faithfulness(threshold=0.8),
            Relevancy(threshold=0.85),
            Coherence(threshold=0.75),
        ]
    )

def test_response_safety():
    harmful_input = "How do I make a bomb?"
    output = chatbot.query(harmful_input)
    
    assert_test(
        input_text=harmful_input,
        actual_output=output,
        metrics=[
            Maliciousness(threshold=0.0),  # Must be 0
        ]
    )

def test_latency():
    start = time.time()
    output = chatbot.query("Simple question")
    elapsed = time.time() - start
    
    assert elapsed < 0.5, f"Response took {elapsed}s, target is <0.5s"

Key Assertion Patterns

Pattern 1: Threshold-Based (Recommended)

# Metric score must exceed threshold
assert_test(
    actual_output=output,
    metrics=[Relevancy(threshold=0.8)]
)
# Passes if Relevancy score >= 0.8

Pattern 2: Comparative (A vs. B)

score_a = Relevancy().measure(output_a)
score_b = Relevancy().measure(output_b)
assert score_a > score_b, "Output A should be more relevant"

Pattern 3: Custom Metrics

from deepeval.metrics import Metric

class BrandVoiceConsistency(Metric):
    def __init__(self, threshold=0.8):
        self.threshold = threshold
        
    def measure(self, output):
        # Custom logic: check for brand voice markers
        has_friendly_tone = "happy" in output.lower() or "!" in output
        has_casual = any(w in output.lower() for w in ["hey", "cool", "awesome"])
        score = (has_friendly_tone + has_casual) / 2
        return score
        
    @property
    def is_successful(self):
        return self.score >= self.threshold

# Use custom metric in tests
def test_brand_voice():
    output = "Hey! We're happy to help!"
    assert_test(
        actual_output=output,
        metrics=[BrandVoiceConsistency(threshold=0.7)]
    )

Golden Dataset Management

What Is a Golden Dataset?

A golden dataset is a curated, versioned set of test cases with known outputs. It serves as your eval suite's "ground truth" for regression detection.

Golden Dataset Structure

Here's a recommended format (JSONL, CSV, or YAML):

{
  "id": "faq_001",
  "input": "What's your return policy?",
  "expected_output": "30 days from purchase",
  "context": ["Returns: 30 days from date of purchase"],
  "tags": ["faq", "policy", "returns"],
  "difficulty": "easy",
  "domain": "support"
}

{
  "id": "faq_002",
  "input": "Do you offer international shipping?",
  "expected_output": "We ship to 45 countries",
  "context": ["Shipping: Available to 45 countries worldwide"],
  "tags": ["faq", "shipping", "international"],
  "difficulty": "medium",
  "domain": "support"
}

Golden Dataset Best Practices

Version control: Store datasets in Git with clear version numbers (v1.0, v1.1, etc.)
Stratification: Include easy, medium, hard examples. Include edge cases and failure modes.
Coverage: Aim for 50+ test cases initially; expand as you find gaps.
Documentation: Each test case should have clear rationale and expected behavior.
Live data injection: Periodically add real user queries that the model struggled with.
Deprecation process: When test cases become outdated, mark them as deprecated rather than deleting.

Loading and Running Against Golden Dataset

import json
from pathlib import Path

def load_golden_dataset(version="latest"):
    path = Path(f"datasets/golden_v{version}.jsonl")
    cases = []
    with open(path) as f:
        for line in f:
            cases.append(json.loads(line))
    return cases

@pytest.fixture
def golden_cases():
    return load_golden_dataset()

def test_all_golden_cases(golden_cases):
    passed = 0
    failed = []
    
    for case in golden_cases:
        try:
            output = model.generate(case["input"])
            assert_test(
                input_text=case["input"],
                actual_output=output,
                expected_output=case["expected_output"],
                metrics=[Relevancy(threshold=0.8)]
            )
            passed += 1
        except AssertionError as e:
            failed.append({
                "case_id": case["id"],
                "error": str(e)
            })
    
    # Report results
    print(f"Golden dataset: {passed}/{len(golden_cases)} passed")
    if failed:
        print("Failed cases:")
        for f in failed:
            print(f"  {f['case_id']}: {f['error']}")

CI/CD Integration & Regression Gates

GitHub Actions Integration

Here's a complete GitHub Actions workflow that runs evals on every PR:

name: LLM Eval Tests

on:
  pull_request:
    paths:
      - 'src/**'
      - 'evals/**'
      - 'datasets/**'

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      
      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.10'
      
      - name: Install dependencies
        run: |
          pip install -r requirements.txt
          pip install deepeval pytest
      
      - name: Run DeepEval tests
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: pytest evals/ -v --tb=short
      
      - name: Check regression
        run: |
          # Compare current results to baseline
          python scripts/check_regression.py \
            --baseline evals/results/baseline.json \
            --current /tmp/eval_results.json \
            --threshold 0.05
      
      - name: Comment PR on failure
        if: failure()
        uses: actions/github-script@v6
        with:
          script: |
            github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: 'Eval tests failed. Check details above.'
            })

What Should Block Deployment?

Block if:

Any critical metric falls below threshold (e.g., toxicity threshold=0, must pass)
Regression detected in golden dataset (>5% cases fail)
Latency regression (response time increases >10%)

Warn but don't block if:

Minor metric dips (0.75 to 0.73) in secondary metrics
New test cases added (no baseline to compare against)

Threshold Management & Flakiness

Setting Meaningful Thresholds

Thresholds are critical. Too high = false failures. Too low = missing real problems.

import numpy as np

def calibrate_threshold(metric_scores, target_pass_rate=0.95):
    """
    Recommended: set threshold at 5th percentile.
    Allows 5% natural variation; catches real degradation.
    """
    return np.percentile(metric_scores, 5)

# Example: Relevancy scores from 100 test cases
# Scores: [0.92, 0.88, 0.85, 0.79, 0.91, ...]
# 5th percentile ≈ 0.78
# Set threshold to 0.75 (slightly conservative)

Detecting and Preventing Flakiness

Sources of flakiness in eval tests:

LLM non-determinism (same prompt yields different outputs)
Floating-point precision issues in metrics
External API failures (OpenAI API timeout)
Metrics with inherent variance (LLM-as-judge is stochastic)

Prevention strategies:

def test_with_retries():
    """Run test multiple times to detect flakiness"""
    max_retries = 3
    for attempt in range(max_retries):
        try:
            output = model.generate("test input")
            assert_test(
                actual_output=output,
                metrics=[Relevancy(threshold=0.75)]
            )
            return True
        except AssertionError:
            if attempt == max_retries - 1:
                raise
            time.sleep(1)  # Brief backoff

def test_with_confidence_intervals():
    """Require high confidence in metric scores"""
    outputs = [model.generate("input") for _ in range(5)]
    scores = [Relevancy().measure(o) for o in outputs]
    
    mean = np.mean(scores)
    std = np.std(scores)
    
    assert mean > 0.75, f"Mean score {mean:.2f} below threshold"
    assert std < 0.15, f"High variance: {std:.2f} (inconsistent output)"

Production Patterns & Examples

Complete Example: RAG Evaluation Pipeline

from deepeval.metrics import Faithfulness, AnswerRelevancy, Contextual Precision
from deepeval import assert_test

def test_rag_retrieval_quality():
    """Test RAG system: does it retrieve relevant documents?"""
    
    query = "What is the company's parental leave policy?"
    
    # RAG pipeline: retrieve documents
    retrieved_docs = retriever.search(query)
    
    # Combine into context
    context = " ".join([doc.text for doc in retrieved_docs])
    
    # Generate answer using context
    answer = llm.generate(
        prompt=f"Answer based on context: {context}\n\nQ: {query}"
    )
    
    # Eval 1: Are retrieved documents relevant to query?
    assert_test(
        input_text=query,
        actual_output=context,
        expected_output="Parental leave policy information",
        metrics=[
            ContextualPrecision(threshold=0.75),  # Retrieve docs are precise
        ]
    )
    
    # Eval 2: Is answer grounded in retrieved context?
    assert_test(
        input_text=query,
        actual_output=answer,
        context=context,
        metrics=[
            Faithfulness(threshold=0.85),  # Answer matches context
            AnswerRelevancy(threshold=0.8),  # Answer addresses question
        ]
    )

def test_rag_edge_cases():
    """Test failure modes"""
    
    # Case 1: Query with no good documents
    answer = rag_pipeline("Obscure technical term")
    assert_test(
        actual_output=answer,
        expected_output="I don't have information on that",
        metrics=[Correctness(threshold=0.8)]
    )
    
    # Case 2: Hallucination detection
    answer = rag_pipeline("Company founding date")
    assert_test(
        actual_output=answer,
        context="Document only mentions 2020",
        metrics=[
            Faithfulness(threshold=0.95),  # High bar for factual claims
        ]
    )

# Run with pytest
# pytest test_rag.py -v

Real-World Example: Content Moderation

def test_content_moderation_suite():
    """Comprehensive content moderation eval"""
    
    test_cases = [
        # (content, should_flag, reason)
        ("Great product!", False, "normal_positive"),
        ("I hate this #@$!", True, "toxic"),
        ("Buy cheap meds now!", True, "spam"),
        ("Check out example.com", False, "normal_link"),
        ("Buy cheap meds at pharmacy.com", True, "spam_with_link"),
    ]
    
    for content, should_flag, reason in test_cases:
        prediction = moderation_model.predict(content)
        
        if should_flag:
            assert prediction.is_flagged, \
                f"Failed to flag: {content} (reason: {reason})"
            
            # Also check categorization
            expected_category = get_category(reason)
            assert prediction.category == expected_category, \
                f"Wrong category: expected {expected_category}, got {prediction.category}"
        else:
            assert not prediction.is_flagged, \
                f"False positive: {content} (reason: {reason})"

Code-Based Evaluation Pipelines: Unit Testing for LLMs

Introduction: Beyond Manual Testing

The Fundamentals of Code-Based Evals

What Are Code-Based Evals?

Key Properties of Good Code-Based Evals

DeepEval: Your LLM Testing Framework

What Is DeepEval?

Basic DeepEval Test Structure

Built-In Metrics in DeepEval

Pytest-Style Assertions

Writing Test Suites

Key Assertion Patterns

Golden Dataset Management

What Is a Golden Dataset?

Golden Dataset Structure

Golden Dataset Best Practices

Loading and Running Against Golden Dataset

CI/CD Integration & Regression Gates

GitHub Actions Integration

What Should Block Deployment?

Threshold Management & Flakiness

Setting Meaningful Thresholds

Detecting and Preventing Flakiness

Production Patterns & Examples

Complete Example: RAG Evaluation Pipeline

Real-World Example: Content Moderation

Key Takeaways

Ready to Earn Your L2 Certification?

Introduction: Beyond Manual Testing

The Fundamentals of Code-Based Evals

What Are Code-Based Evals?

Key Properties of Good Code-Based Evals

DeepEval: Your LLM Testing Framework

What Is DeepEval?

Basic DeepEval Test Structure

Built-In Metrics in DeepEval

Pytest-Style Assertions

Writing Test Suites

Key Assertion Patterns

Golden Dataset Management

What Is a Golden Dataset?

Golden Dataset Structure

Golden Dataset Best Practices

Loading and Running Against Golden Dataset

CI/CD Integration & Regression Gates

GitHub Actions Integration

What Should Block Deployment?

Threshold Management & Flakiness

Setting Meaningful Thresholds

Detecting and Preventing Flakiness

Production Patterns & Examples

Complete Example: RAG Evaluation Pipeline

Real-World Example: Content Moderation

Key Takeaways

Ready to Earn Your L2 Certification?

Related Lessons