Level 5 • Advanced

The Eval Engineering Track: Building the Infrastructure of AI Quality

Chapter 12: Career Tracks • 20 min read

Table of Contents

Eval Infrastructure Stack
Core Technical Skills
System Design Interview
Building Eval Tooling
Career Ladder & Promotions
Compensation & Equity

Eval Infrastructure Stack: Storage, Compute, APIs, Observability

The eval engineering track differs fundamentally from general ML engineering. While traditional ML engineers optimize models and training pipelines, eval engineers build the systems that measure quality. This requires deep expertise in data pipelines, distributed systems, statistical computing, and observability—plus unique domain knowledge about evaluation itself. You're not building the AI; you're building the quality assurance infrastructure for AI.

The core eval infrastructure stack consists of four pillars: storage (how eval results are persisted), compute (how eval jobs are scheduled and executed), APIs (how systems query and integrate eval data), and observability (how you monitor and debug eval systems). Each pillar has distinct design challenges.

Eval Results Database Design with Full Schema

Most organizations start with a flat file system (JSON lines, CSVs) for eval results. This works for 5-10 evals. At 100+ evals with millions of examples, it becomes impossible to manage. A proper eval database must support multiple access patterns and maintain data integrity at scale.

Your database must handle:

Hierarchical result storage: eval → run → task → example → metric, with full lineage and provenance tracking
Flexible schema: different evals produce different metrics, different metadata. Strict schema breaks down quickly.
Time-series data: eval results over time are the primary signal for quality trends. You need efficient range queries.
Association with deployments: which model version, which prompt, which dataset version produced these results
Metadata attachment: experiment IDs, commit hashes, human annotations, rater profiles, annotation timing data
Fast aggregation: "average F1 across all examples" should be instant, not a 5-minute scan
Filtering and slicing: "F1 score for language=EN and domain=legal" should work instantly

A reference schema structure for PostgreSQL:

CREATE TABLE evals (
  id UUID PRIMARY KEY,
  name VARCHAR NOT NULL UNIQUE,
  description TEXT,
  eval_type VARCHAR,
  created_at TIMESTAMP,
  updated_at TIMESTAMP,
  owner_team VARCHAR,
  config JSONB,
  status VARCHAR
);

CREATE TABLE eval_runs (
  id UUID PRIMARY KEY,
  eval_id UUID NOT NULL REFERENCES evals(id),
  deployment_id UUID,
  model_version VARCHAR,
  model_config JSONB,
  dataset_name VARCHAR,
  dataset_version VARCHAR,
  dataset_size INT,
  started_at TIMESTAMP,
  completed_at TIMESTAMP,
  duration_seconds INT,
  status VARCHAR,
  error_message TEXT,
  config JSONB,
  metadata JSONB
);

CREATE INDEX ON eval_runs(eval_id, started_at DESC);
CREATE INDEX ON eval_runs(model_version, started_at DESC);

CREATE TABLE eval_results (
  id UUID PRIMARY KEY,
  run_id UUID NOT NULL REFERENCES eval_runs(id),
  example_id VARCHAR NOT NULL,
  metric_name VARCHAR NOT NULL,
  metric_value FLOAT,
  metric_category VARCHAR,
  metadata JSONB,
  created_at TIMESTAMP
);

CREATE INDEX ON eval_results(run_id, metric_name);
CREATE INDEX ON eval_results(example_id);
CREATE INDEX ON eval_results(metric_name, metric_value);

This schema allows complex queries: "Show me all instances where BLEU score is below 0.7 for the past 30 days," or "Compare token-accuracy across model versions for language=Spanish," or "Find the 99th percentile of response times for the support agent eval." The indexes make these queries fast.

For analytical queries and dashboards, use a columnar database or data warehouse (DuckDB, Apache Iceberg, BigQuery, Snowflake) for efficient aggregations and joins. Sync data from PostgreSQL to your warehouse nightly or in real-time using a CDC tool like Kafka or Debezium.

Compute: Eval Job Scheduling Architecture

Eval jobs can take seconds (quick metrics like BLEU) to hours (human annotation studies) to days (large-scale manual evaluation campaigns). You need a job scheduler that handles backpressure, retries, resource allocation, and task dependencies. Most companies use one of: Apache Airflow, Prefect, Argo Workflows, Temporal, or Kubernetes CronJobs with custom controllers.

Key design considerations when choosing or building your eval scheduler:

Task parallelism: Can you run independent eval tasks in parallel? How fine-grained can parallelism be? (Ideally: per-example for fast metrics, per-batch for slower ones)
Fan-out patterns: One eval run typically produces many metrics. If you have 10,000 examples and 20 metrics per example, that's 200,000 result rows. Can the scheduler fan out to 200K tasks?
Dependency graphs: Some metrics depend on others. Your scheduler must resolve the DAG correctly and not start tasks until dependencies are satisfied.
Resource allocation: Some evals (like human evaluation) are expensive. You need cost controls and priority queues so high-priority evals run first.
Incremental evaluation: If data hasn't changed, can you skip re-evaluation? Caching is critical for efficiency. If you evaluated a model yesterday on dataset version 1.2, and dataset version hasn't changed, reuse those results.
Artifact management: Eval jobs produce large outputs (annotation spreadsheets, detailed error analyses, debug logs). Where do these live? S3? GCS? Your database?
Monitoring and alerting: If a job stalls, who gets notified? If it fails, can it retry intelligently?

A production eval scheduling system typically includes: (1) a task queue (Redis, RabbitMQ, or Kafka), (2) workers (containers or serverless functions that execute evals), (3) a state store (what's running, what's done, what failed—usually a database), and (4) a control plane (decides what to schedule next based on dependencies, resources, priority).

Example architecture: Use Airflow DAGs to define eval workflows. Each eval run triggers a DAG. Tasks fan out to compute metrics in parallel. Failed tasks retry with exponential backoff. Results are written to PostgreSQL. Large artifacts go to S3. Monitor task duration and alert if a task takes 10x longer than expected.

Eval Results API Design: Multi-Pattern Access

Once you have results stored, every downstream system needs access. Your API must support multiple distinct access patterns, and optimizing for one pattern means pessimizing for others. You need to understand your users and design accordingly.

Common access patterns:

Point queries: "What was the BLEU score on eval X, run Y?" Response: single number with confidence interval
Time series queries: "Show me BLEU trend over the last 30 days" Response: time series of numbers, typically daily or weekly aggregates
Slice queries: "Break down BLEU by language and domain" Response: multi-dimensional breakdown, e.g., {en: 0.81, es: 0.75, fr: 0.72}
Example-level queries: "Show me the 100 hardest examples for this eval" Response: detailed example-level data with predictions and metrics
Comparison queries: "Compare eval results between model A and model B" Response: delta analysis, statistical significance
Trend queries: "Is this metric improving or degrading?" Response: trend direction and confidence

A RESTful API design:

GET /api/evals/{eval_id}/runs/{run_id}/summary
  → { 
      metric_name: "bleu", 
      value: 0.742, 
      confidence_interval: [0.738, 0.746],
      sample_size: 1000
    }

GET /api/evals/{eval_id}/timeseries?metric=bleu&days=30
  → [
      { date: "2025-02-01", value: 0.725, n: 1000 },
      { date: "2025-02-02", value: 0.728, n: 980 },
      ...
    ]

GET /api/evals/{eval_id}/runs/{run_id}/breakdown?metric=bleu&dimension=language
  → { 
      en: { value: 0.81, n: 500 },
      es: { value: 0.75, n: 300 },
      fr: { value: 0.72, n: 200 }
    }

GET /api/evals/{eval_id}/runs/{run_id}/examples?metric=bleu&percentile=hardest&limit=10
  → [{ 
      example_id: "123", 
      input: "...", 
      prediction: "...", 
      reference: "...",
      metric_value: 0.42,
      metrics: { bleu: 0.42, rouge: 0.38 }
    }, ...]

POST /api/evals/compare
  → {
      runs: [run_1_id, run_2_id],
      metrics: ["bleu", "rouge"],
      results: [
        { metric: "bleu", run_1: 0.74, run_2: 0.79, p_value: 0.002 }
      ]
    }

Key API design principles: (1) Read-optimized: use materialized views or caches for complex queries, (2) Batch support: fetching 1000 metrics should not require 1000 HTTP calls, (3) Versioning: support /v1/, /v2/ endpoints to evolve the API without breaking clients, (4) Filtering: always support filtering by time range, model version, dataset, and other dimensions, (5) Pagination: large result sets should be paginated, (6) Caching: aggressive caching for time-series and summary queries, (7) Rate limiting: to prevent abuse and ensure fair access.

Observability: Logging, Tracing, and Alerting for Eval Systems

When an eval produces unexpected results, you need to understand why. This requires three layers of observability: structured logging, distributed tracing, and automated alerting.

Structured Logging: Don't just log "eval finished." Log: "processed 1000 examples in 45s, 2 examples timed out, 15 examples hit API rate limits, largest example had 500 tokens." Include: checkpoint times (data loading took 5s, metrics took 30s), example-level logs (example 123 failed with error X), error traces with context, resource usage (memory peaked at 4GB), network activity (made 1500 API calls).

logger.info("eval_finished", {
  "eval_id": "abc123",
  "run_id": "xyz789",
  "examples_processed": 1000,
  "duration_seconds": 45,
  "memory_peak_mb": 4096,
  "api_calls": 1500,
  "timeouts": 2,
  "rate_limit_hits": 15,
  "metrics": {"bleu": 0.742, "rouge": 0.691}
})

Distributed Tracing: Trace the flow of one example through the eval pipeline. If an example fails, you want to see exactly where: data loading, model inference, metric computation, result serialization? Include latency at each step. Use OpenTelemetry or Jaeger for this.

# One trace for one example
trace_id: "abc-123-xyz"
  span: data_load (duration: 10ms)
  span: model_inference (duration: 350ms, model=gpt4)
  span: metric_computation (duration: 45ms, metrics=[bleu, rouge])
  span: result_write (duration: 5ms)

Automated Alerting: Define alerts for known failure modes: When eval results regress unexpectedly (e.g., accuracy dropped 20%), when an annotation study stalls (no progress for 2 hours), when eval duration suddenly spikes (10x normal), when rater agreement drops below threshold, when data quality metrics (missing fields, malformed JSON) exceed a threshold. For each alert, define: who gets notified (Slack, PagerDuty), the runbook for fixing it, and the resolution time SLA.

Most teams use: Datadog, New Relic, or Prometheus for metrics; ELK stack (Elasticsearch, Logstash, Kibana) or Grafana Loki for logs; Jaeger, Zipkin, or Datadog APM for traces. The critical requirement is that eval engineers can quickly diagnose problems without having to instrument code manually.

Core Technical Skills for Eval Engineers

What separates a junior eval engineer from a staff-level one? Deep technical skills. You need to be able to implement complex evals from scratch, debug performance bottlenecks, and architect large-scale evaluation systems.

Python for Eval Pipelines

You need deep Python expertise. Not just syntax, but: async/await for concurrent eval execution, type hints for data validation and IDE support, testing (pytest, hypothesis for property testing), profiling (cProfile, memory_profiler) for optimization. Most eval code is Python: data loading, metric computation, annotation parsing, result aggregation.

Skills you should master:

Concurrency: asyncio for I/O-bound operations (API calls), multiprocessing for CPU-bound operations (metric computation), thread pools for mixed workloads. Know when to use which.
Type hints: Write code that mypy can verify. This catches bugs early and makes refactoring safer.
Testing: Pytest basics, fixtures for reusable test setup, parametrized tests for testing across multiple inputs, property-based testing with hypothesis.
Profiling: Identify bottlenecks with cProfile. Is your eval slow because of Python code or because you're waiting for API responses? Memory profile to find leaks.
Debugging: pdb, ipdb for interactive debugging. Use breakpoints strategically. Learn to read Python stack traces.

Common libraries and when to use them:

Pandas: Loading, transforming, and aggregating eval data. Every eval engineer needs Pandas mastery: groupby (group results by language, model, etc.), merge (join eval results with metadata), pivot tables (convert long format to wide), efficient string operations (vectorized, not loops).
NumPy: Numerical computing. Computing percentiles, correlations, transformations. For very large arrays, NumPy is much faster than Python loops.
SciPy: Statistical computing. scipy.stats for hypothesis tests, scipy.special for special functions.
Hugging Face Datasets: Loading benchmark datasets. Most public benchmarks (SQuAD, GLUE, MMLU, etc.) are available through this library. Good alternative to downloading CSVs manually.
Transformers: If you're doing LLM-as-judge evals, you'll load models from HuggingFace. Also useful for embedding-based metrics.
Tenacity/Backoff: Handling API rate limits and transient failures. Many evals call external APIs (LLM APIs, commercial metric services). Retry logic with exponential backoff is crucial.
Pydantic: Data validation. Define your eval output schema, and Pydantic will validate it automatically. Catches bugs early.

Statistical Testing Libraries and Deep Statistical Knowledge

You must understand statistical significance deeply. Not just "p < 0.05," but effect sizes, multiple comparisons correction, power analysis, assumptions of tests, and when hypothesis tests are appropriate.

Core libraries: SciPy (scipy.stats for t-tests, Mann-Whitney, chi-squared, Kolmogorov-Smirnov), statsmodels (linear regression, ANOVA, Bayesian models), pingouin (comprehensive stats, effect sizes), scikit-posthocs (post-hoc tests for pairwise comparisons).

Questions you should be able to answer:

"Is this 2% improvement real or noise?" (Effect size vs. statistical significance)
"Do I need to test more examples?" (Power analysis)
"Should I use a parametric or non-parametric test?" (Check assumptions: normality, equal variances)
"How do I correct for multiple comparisons?" (Bonferroni, FDR)
"When does a time-series comparison make sense?" (Autocorrelation, stationarity)
"How do I construct a confidence interval?" (Bootstrap, asymptotic, Bayesian)

Data Pipeline Tools: Airflow, Prefect, Argo

You need to orchestrate complex multi-step evals. Tools: Apache Airflow (industry standard, mature, complex), Prefect (more Pythonic, dynamic DAGs, growing adoption), Dagster (data-aware orchestration, great error handling), or Temporal (for long-running workflows with state machines).

Expertise here means: designing DAGs for your eval workflows, handling failures and retries intelligently (exponential backoff, max retries), managing backpressure (don't submit 10K tasks at once, use task pools), monitoring pipeline health (tasks stuck, tasks failing more than expected), debugging data quality issues (why are results missing?), and cost optimization (run expensive tasks in parallel, batch small tasks together).

ML Experiment Tracking: MLflow, Weights & Biases, Neptune

Track which dataset version, which model version, which eval config produced these results. When you change anything, you want to see exactly what changed and what impact it had.

For eval-specific use cases: tracking annotation studies (which raters participated, what was their agreement, inter-rater reliability), tracking eval iterations (how we refined the eval over time, how metrics changed as we fixed bugs), comparing baseline vs. improved versions of metrics, and correlating eval metrics with downstream business metrics.

The Broader Stack

You also need familiarity with: Docker (containerizing eval jobs for reproducibility and scaling), Kubernetes (if running eval infrastructure at scale), SQL (querying your eval database for analysis and debugging), Git (version control for code and configs, understanding CI/CD), and cloud infrastructure (AWS, GCP, or Azure SDKs for data access and compute resources).

Additionally: understanding of distributed systems (what happens when tasks fail? how do you coordinate?), networking (why are API calls slow?), and basic DevOps (monitoring, logging, alerting).

Eval System Design Interview: 5 Common Questions with Detailed Solutions

Eval engineering roles often include system design rounds. Here are 5 common questions and detailed solution sketches. The goal is to show clear thinking, understand tradeoffs, and ask good clarifying questions.

Question 1: Design an Eval Results Storage and Retrieval System

Problem: You need to store eval results from 100+ different evals, each producing different metrics, and support fast queries like "what was the BLEU score for model X on eval Y?" and "show me all examples where the model failed" and "break down accuracy by language and domain."

Clarifying questions: How many results per day? (Order of magnitude.) What's the latency requirement? Do you need real-time or is hourly OK? What's the scale of queries? Thousands per day or millions?

Solution sketch:

OLTP layer (operational database): PostgreSQL for storing results. Schema as described earlier: evals, eval_runs, eval_results tables with appropriate indexes. Use JSONB columns for flexible metadata that doesn't fit the relational schema. Partition results by time (monthly or yearly) to keep tables performant.

Caching layer: Redis for frequently accessed summaries. Cache: (eval_id, run_id) → summary stats, (eval_id) → recent runs, popular queries by language/domain.

OLAP layer (analytical database): For complex queries (breakdowns, correlations, comparisons), use a columnar database like DuckDB or a cloud data warehouse like BigQuery. Sync PostgreSQL → DuckDB nightly via SQL dumps. This gives you the best of both worlds: fast operational queries and fast analytical queries.

API layer: REST API abstracting storage details. Query returns: metric value, confidence interval, sample size, last updated timestamp. Cache API responses aggressively (1 hour for summaries, 1 day for historical).

Artifact storage: S3 or GCS for large artifacts (annotation files, detailed error analyses, visualizations). Database stores reference to S3 keys.

Tradeoffs: PostgreSQL is simpler to set up but slower for analytical queries. DuckDB requires nightly syncs (not real-time) but much faster analytics. Redis caching adds complexity but improves response time. The multi-layer approach balances simplicity, cost, and performance.

Question 2: Design an Evaluation Job Scheduler at Scale

Problem: You have 1000 examples to evaluate, 50 different metrics to compute on each, and metrics have dependencies (some depend on others). Some metrics are expensive (human evaluation costs $10/example, takes hours). Others are cheap (BLEU costs 1 cent, takes milliseconds). You need to schedule this efficiently, handle failures, manage costs, and provide progress visibility.

Clarifying questions: How often do evals run? (Daily, on-demand, when a new model is deployed?) What's acceptable latency? (Hours? Days?) Budget constraints?

Solution sketch:

Orchestration: Use Airflow DAGs. One DAG per eval type. Tasks fan out to compute metrics in parallel.

load_data → compute_metric_1 → compute_metric_2 → ... → aggregate → store_results

Task pools for cost control: Create a task pool called "human_eval_expensive" with max 5 concurrent tasks. Assign human eval tasks to this pool. This ensures you're spending money in a controlled way (at most 5 evaluations in parallel = $50/hour cost cap).

Caching for efficiency: Before running a metric, check if it was computed before with the same inputs (example, metric, model). If yes and nothing has changed, reuse the result. Save 95% of compute time on repeated evals.

Incremental re-runs: If a run partially fails (100 of 1000 examples failed), support restarting from the failure point without re-computing successful examples. Make this transparent to the user.

Error handling: Use task retries with exponential backoff for transient failures. For permanent failures, move to a dead-letter queue. Alert the user about failures but don't block the entire run.

Progress visibility: Provide a status dashboard: X% of tasks done, estimated time remaining, which tasks failed, which tasks are retrying. Emit events to Slack/email when major milestones complete.

Resource management: Track resource usage (CPU, memory, API calls) per task. Alert if a task is using abnormally high resources (might indicate a bug or infinite loop).

Question 3: Design a Multi-Dimensional Eval Comparison System

Problem: You want to compare eval results across multiple dimensions: model version, dataset, language, domain, etc. Users should be able to slice and dice results any way they want: "show me accuracy for model A vs. model B, language=EN, domain=legal." Different dimensions have different cardinalities (maybe 2 model versions but 50+ languages).

Solution sketch:

Schema design: Store all dimension information with each result. Schema: (eval_id, run_id, example_id, metric_name, metric_value, model_version, dataset_name, language, domain, ...).

Pre-computed aggregates: Pre-compute common aggregations and store in a fact table: (eval_id, model_version, language, domain, metric_name) → (mean, stddev, min, max, count). This makes slice queries instant. Update fact tables nightly.

OLAP database: Use Pinot, Druid, or DuckDB for multi-dimensional queries. These databases are optimized for slicing and dicing.

Hierarchical dimensions: Some dimensions have hierarchies (language → region, e.g., en_US → North America). Support roll-up queries: "accuracy by region" automatically aggregates across languages in that region.

API design: Expose simple filters. Users specify dimensions they care about, what aggregation (mean, median, percentile), and the API returns results.

GET /api/compare?model_versions=["v1","v2"]&languages=["en"]&domain=legal&metric=accuracy

Materialized views: For very common queries, create materialized views. Update them incrementally when new results arrive.

Question 4: Design a Quality Monitoring System for Evals

Problem: You want to catch when eval results are unexpected or suspicious. What metric values are outliers? When does eval quality regress (results change dramatically)? How do you detect rater drift (human raters becoming less reliable)? You have hundreds of metrics across dozens of evals.

Solution sketch:

Distributional monitoring: For each metric, maintain a historical distribution. When a new run completes, compare its distribution to the historical one. Use Kolmogorov-Smirnov test or Jensen-Shannon divergence to detect distributional shift. If p < 0.01, alert.

Outlier detection: For each metric, fit a normal distribution to historical values. For new results, flag examples where the metric is 3+ standard deviations from the mean. These are potential data quality issues or genuine hard cases.

Trend monitoring: Fit a time-series model to historical metric values. If today's value deviates significantly from the trend, alert.

Rater quality monitoring: If using human raters, track: inter-rater agreement (Fleiss' kappa, Krippendorff's alpha), rater accuracy (if you have gold labels), task completion time. Alert if agreement suddenly drops or completion time increases (sign of confusion or disengagement).

Alert routing: Route alerts based on severity and type. Critical alert (metric dropped 50%) → page on-call engineer. Warning (metric drifted 20%) → Slack notification. Info (new high-performing model variant) → email. Each alert includes: what changed, how much, what might have caused it, and what to check.

Runbooks: Each alert type has a runbook: "metric dropped 50%, check: (1) did data change? (2) did model change? (3) did annotation criteria change? (4) is there a bug?" Runbooks should be discoverable and executable.

Question 5: Design an Eval-as-a-Service Platform for Multiple Teams

Problem: Multiple product teams want to use a shared eval platform. Each team has different needs (different metrics, different evaluation criteria, different SLAs). How do you build a self-service platform that's flexible but maintains data quality? Teams should be able to submit evals without writing code.

Solution sketch:

Config-driven evals: Provide a declarative format (YAML or JSON) for defining evals. Example:

name: sentiment-analysis-eval-v2
eval_type: classification
dataset: twitter_sentiment_v1
model_id: model_abc123
metrics:
  - name: accuracy
  - name: precision
    per_class: true
  - name: f1
annotations:
  required: true
  sample_size: 200
  task: "Classify sentiment as positive, negative, or neutral"

UI for non-technical users: Build a web UI where teams can define evals without touching code. Form fields for: eval name, dataset selection, metric selection, annotation requirements, etc. Under the hood, this generates the YAML config.

Config validation: When a team submits an eval config, validate: do all referenced datasets exist and is the team authorized to use them? Are all metrics computable? Are annotation requirements realistic (do you have enough raters)? Provide helpful error messages.

Templates: Provide templates for common eval types: classification accuracy, LLM-as-judge, regression metrics, annotation study. Teams can use a template or customize.

Execution pipeline: When config is submitted: (1) validate, (2) create an Airflow DAG, (3) submit to the scheduler, (4) monitor progress, (5) email results when done. Teams get a nice report showing results, confidence intervals, breakdowns by dimension.

Access control: Teams can see their own evals. Admins see everything. Use RBAC: teams have reader/writer/admin roles. Only admins can modify eval configs after creation (audit trail).

Cost tracking: Track compute cost per eval. Allocate costs to teams (chargeback model). This incentivizes efficient evals and makes budgets visible.

Notifications: Teams get notified when evals complete, fail, or need action. Use Slack, email, or in-app notifications based on preference.

Building Eval Tooling for Your Team: Libraries, CLIs, Dashboards, and Open Source

Every company ends up building eval infrastructure. Instead of each team reimplementing, centralize in an internal library.

Internal Eval Libraries

Your library should provide:

Metric implementations: BLEU, ROUGE, BERTScore, F1, accuracy, etc. Tested, vectorized implementations. Document any differences from standard implementations (which BLEU variant? how do you handle edge cases?).
Dataset loaders: Functions to load your internal datasets plus standard benchmarks (SQuAD, GLUE, MMLU, etc.). Handle caching and versioning. Return a standard dataset object with train/val/test splits.
Annotation integrations: Connect to your annotation platform (Scale AI, Labelbox, Prodigy, Mechanical Turk). Submit annotation jobs, poll for completion, download and parse results automatically.
Result storage: Helper functions to persist results to your eval database. Handles schema mapping and validation.
Utilities: Example sampling (random, stratified), confidence interval computation, result comparison, statistical testing.
Config management: Load and validate eval configs from YAML/JSON.

Example API:

import eval_sdk

# Load data
dataset = eval_sdk.load_dataset("squad_v2", split="validation")

# Define metrics
metrics = [
    eval_sdk.Metric("exact_match"),
    eval_sdk.Metric("f1"),
    eval_sdk.Metric("bleu"),
]

# Run eval
results = eval_sdk.run_eval(
    name="gpt4-squad-v2",
    model=my_model,
    dataset=dataset,
    metrics=metrics,
)

# Store results
eval_sdk.save_results(results, database=db)

# Analyze
stats = eval_sdk.compute_stats(results)

CLI Tools for Common Tasks

Build command-line tools for common eval tasks:

eval run --config eval_config.yaml — submit an eval, get run ID back
eval status --run-id abc123 — check run status, see progress and any errors
eval cancel --run-id abc123 — cancel a running eval
eval logs --run-id abc123 — stream logs for a run
eval compare --run-1 abc --run-2 def — compare two runs, show diffs with statistical significance
eval export --run-id abc --format csv — export results in various formats
eval list — list all recent runs
eval config validate eval_config.yaml — validate a config file

Make these tools discoverable (eval --help, eval run --help) and well-documented. Include examples in help text.

Dashboards for Different Audiences

Build dashboards for different stakeholders:

Eval team dashboard: Overview of all runs in the system, failures, queue status, resource usage, SLA compliance (are evals finishing within SLA?). Alerts and actionable insights.
Product team dashboard: Historical trends for their models, comparisons between versions, drill-down into failure examples. What's improving? What's regressing? Where should we focus?
Executive dashboard: Overall quality trends, comparison to competitors (if benchmarks allow), progress toward quality goals. One-page summary suitable for C-suite.

Use a BI tool (Looker, Tableau, Grafana) or build custom dashboards with React + D3/Recharts. Key principle: make it easy to ask and answer questions about quality without asking engineers for custom queries.

Open Source Contributions

As you build eval infrastructure, contribute back to the community. This establishes expertise and helps the field. Areas ripe for contribution:

Metric implementations: Efficient implementations of new metrics, especially domain-specific ones. Share reference implementations.
Best practices guides: "How to annotate X reliably," "How to scale eval to 1M examples," "How to build calibrated LLM judges."
Tools for evaluation: A faster BLEU implementation, an eval result comparison library, a tool to detect annotation drift.
Benchmarks: Carefully curated, representative evaluation sets for your domain. These are valuable to the community.
Case studies: Blog posts on how you solved specific eval challenges. Help others learn from your experience.

Open source contributions improve your resume and establish technical credibility. Potential employers look for engineers with strong open source track records.

Eval Engineer Career Ladder: IC1 to IC5 and Promotions

Most tech companies use individual contributor (IC) levels. Here's what eval engineering looks like at each level and what promotions look like.

IC1 / Entry-Level Eval Engineer (0-1.5 years)

Scope: One eval at a time. Work is supervised. You're still learning the infrastructure.
Expectations: Implement metrics following established patterns. Load datasets. Write basic eval scripts. Follow coding standards. Write tests for your code. Learn the infrastructure. Come to design reviews prepared. Ask thoughtful questions.
Impact: You execute evals reliably. Others can use your code without issues.
Time to IC2: 12-18 months.
Differentiator: Shows initiative. Understands not just what to do but why. Takes ownership. Learns fast.

IC2 / Mid-Level Eval Engineer (1.5-3 years)

Scope: Multiple evals autonomously. Start influencing technical direction.
Expectations: Design evals end-to-end. Identify what metrics we're missing and propose new ones. Mentor IC1s on best practices. Contribute to infrastructure improvements. Propose and implement tooling enhancements. Communicate results clearly to stakeholders (engineers, PMs, executives). Participate in design reviews as a peer, not just a listener.
Impact: You unblock teams by evaluating their models. Infrastructure improvements you make benefit many teams. Your insights improve how we approach evaluation.
Time to IC3: 18-24 months.
Differentiator: Thinks deeply about what makes a good eval. Proposes novel evaluation approaches. Colleagues respect your technical judgment.

IC3 / Senior Eval Engineer (3-5 years)

Scope: Multiple evaluation domains. Drive architectural decisions. Mentor IC1 and IC2s.
Expectations: Design the eval strategy for a major product area (e.g., all LLM outputs, or all recommendation systems). Build and maintain critical eval infrastructure used by many teams. Conduct research on new evaluation techniques. Lead design reviews. Drive cross-team eval standards. Publish findings (blog posts, internal tech talks, conference talks). Represent the company in eval communities. Make decisions about tool choices, infrastructure design, best practices.
Impact: You move the company's eval maturity forward. Your decisions affect how hundreds of engineers evaluate AI systems. You identify quality issues before they become customer issues.
Time to IC4: 24-36 months (if pursuing staff track).
Differentiator: Deep domain expertise. Recognizes subtle eval problems others miss. Improves quality across the organization. Mentors become strong IC3s.

IC4 / Staff Eval Engineer (5-8 years)

Scope: Eval strategy across multiple products. Organization-wide influence.
Expectations: Set eval standards for the company. Design large-scale evaluation programs (e.g., continuous evaluation infrastructure serving 100+ teams). Identify emerging eval challenges (new model types, new domains, regulatory requirements) and propose solutions. Mentor IC3s who are growing toward IC4. Work with executives on quality strategy and trade-offs (speed vs. quality, cost vs. comprehensiveness). Lead major eval infrastructure projects. Represent company externally at conferences, standards bodies, open source projects.
Impact: You shape how the company approaches AI quality. Your infrastructure decisions affect every team. You help set company quality standards and ensure they're met. Executives consult you on quality decisions.
Time to IC5: 36+ months (if continuing to grow). Note: Not all IC4s promote to IC5. Some specialize deeper at IC4.
Differentiator: Thinks strategically about quality. Moves the field forward (publishes papers, speaks at major conferences). Solves hard problems others thought unsolvable.

IC5 / Distinguished Engineer / Principal of Evals (8+ years)

Scope: Company-wide eval strategy. Industry influence.
Expectations: Set long-term eval vision for the company. Shape how the company approaches AI quality for the next 3-5 years. Mentor IC4s and engineering managers. Publish significant research. Speak at major conferences. Participate in standards bodies (ISO, NIST, etc.), helping set industry direction. Make architectural decisions that affect the entire eval ecosystem. Directly influence product strategy through eval insights.
Impact: You are the company's authority on AI quality. Your insights shape strategy. The industry knows your name and your work.
Differentiator: Recognized expert globally. Changes how people think about AI evaluation. Mentees become leaders in their own right.

Typical Promotion Criteria

Beyond level expectations, promotions usually require:

Track record of impact: Concrete examples of how your work improved quality, unblocked teams, or accelerated projects. Quantify when possible: "Built inference optimization that reduced eval time by 50%, enabling 10x more evals per month."
Demonstrated technical depth: Can you solve hard problems in your domain? Can you explain your reasoning? Interviewers should leave convinced you know more than they do.
Growth trajectory: Each promotion cycle, you're taking on bigger scope, harder problems. You're not staying in the same lane.
Peer feedback: Do people enjoy working with you? Do you make them better? Would peers recommend you for promotion?
Business impact: Has your work helped the company achieve its goals? Better models, faster shipping, fewer quality issues?
Documentation and artifacts: Bring a portfolio: eval designs you built, infrastructure improvements, analyses you did, blog posts, code you're proud of. Promotion committees want to see concrete work.

Potential Career Tracks at Staff Level

Once you hit IC4, you have options:

Specialist track (IC4 → IC5 → IC6 Specialist): Go deeper in eval engineering. Become the world expert in LLM evaluation or annotation methodology or eval infrastructure.
Management track: Transition to managing an eval team (usually requires IC3-4 level first). Evaluate people instead of models.
Breadth track: Transition to adjacent areas (AI safety, responsible AI, AI policy). Eval skills are valuable everywhere in AI organizations.

Most engineers choose one path and stick with it. It's hard to do both specialist and management tracks simultaneously.

Compensation & Equity Benchmarks (2024-2025)

Compensation varies by company stage, geography, experience, and negotiating skill. Here are typical ranges for the US (Bay Area, New York) in early 2025. These are market rates, not guarantees. Your actual offer depends on negotiation and exact circumstances.

Startup Stage (Series A-C, $10M-$100M funding)

IC2: $160K-$220K salary + 0.05-0.15% equity + $20K-$50K signing bonus
IC3: $200K-$280K salary + 0.1-0.3% equity + $40K-$100K signing bonus
IC4+: $250K-$350K salary + 0.2-0.5% equity + $75K-$200K signing bonus

Startups offer higher equity percentages but lower salary. The equity is often worth less than FAANG equity (higher failure risk), but the upside is also much higher (potential 10-100x return if successful). Use secondary markets (Carta, CapTable) to see what shares are trading at to estimate current value.

Growth Stage (Series D-F, $100M-$5B valuation)

IC2: $190K-$260K salary + 0.02-0.08% equity + $30K-$70K signing bonus
IC3: $240K-$320K salary + 0.05-0.15% equity + $50K-$120K signing bonus
IC4+: $300K-$400K salary + 0.1-0.3% equity + $100K-$250K signing bonus

Better salary than startups, meaningful but typically smaller equity packages than early-stage. Less risk than startups but still meaningful upside. Many growth-stage companies have gone public or been acquired, so secondary markets exist for some companies.

FAANG & Late-Stage Public (>$500B market cap)

IC2 (Google L4/Meta E4): $210K-$270K salary + $100K-$250K stock/year + $30K-$75K signing bonus
IC3 (Google L5/Meta E5): $260K-$350K salary + $200K-$600K stock/year + $60K-$150K signing bonus
IC4 (Google L6/Meta E6): $330K-$450K salary + $500K-$2M+ stock/year + $150K-$350K signing bonus

FAANG offers the highest total compensation, especially at senior levels. Stock packages are large and vest over 4 years. Signing bonuses are meaningful. Benefits are excellent (healthcare, 401k match, gym membership, childcare support, relocation).

Comparison: IC2 at startup might get $180K salary + 0.1% equity (worth $1M if company hits $1B valuation), total potential = $180K + $1M = $1.18M. IC2 at FAANG gets $240K salary + $150K stock/year = $240K + $600K (4-year vesting) = $840K. Startups have higher upside but higher risk.

Geographic Variations

San Francisco / Silicon Valley: Add 20-30% to listed ranges. Seattle, Boston, NYC: Add 10-20%. Austin, Denver, other tech hubs: Add 5-10%. Remote (except major hubs): Subtract 10-20%. International: European salaries typically 20-30% lower; Asian tech hubs (Singapore, Hong Kong) similar to US but with higher tax burden.

Equity Considerations: How to Evaluate Offers

Evaluate equity offers carefully. Understanding equity is crucial because at early-stage companies, equity is often worth more than salary.

Key terms to understand:

Vesting schedule: Standard is 4-year vesting with 1-year cliff. Means: no equity until 1 year, then 25% vests, then 1/48th vests each month for 36 months. If you leave after 1 year, you get 25%. After 4 years, you get 100%.
Strike price: Price you pay to exercise options. Lower is better (more upside). Strike price is usually set at FMV (fair market value) at grant time. If you join when FMV is low, your strike price is low, and upside is huge.
Equity type: Options (you exercise later) vs. RSUs (restricted stock units, automatically vest, no exercise needed). RSUs are safer (you don't have to pay exercise costs), options have higher tax efficiency (long-term capital gains vs. ordinary income if structured right).
Dilution: Future fundraising will dilute your stake. If you get 0.1% at Series A, expect maybe 0.06-0.07% at Series B after dilution. Ask about dilution expectations.
Liquidity: When can you sell? Only after IPO or acquisition usually. Startups may have secondary markets (employees can sell early), ask about this.

Rough valuation: Early-stage equity (pre-Series A) is highly risky but has huge upside. Expected value calculation: 10% chance of 10x return, 30% chance of 3x, 40% chance of 1x (exit at same valuation), 20% chance of 0x (failure). Expected value = 0.1*10 + 0.3*3 + 0.4*1 + 0.2*0 = 2.3x. So 0.1% equity worth $100K might have expected value of $230K. But it might be worth $1M or $0. Very uncertain.

Later-stage equity is more predictable but smaller. Series E company might have 80% chance of good exit. Expected value of 0.05% equity might be 0.8 * $500K = $400K.

Use these as rough estimates. Get a Carta or Pulley valuation for more precision.

Bonus & Benefits

Beyond salary and equity, most tech companies offer:

Performance bonus: 10-20% of salary at FAANG, 20-50% at startups/growth (subject to company performance and your performance).
Health insurance: Medical, dental, vision. Cost: company typically pays 80-100%, you pay 0-20%.
401k matching: Typically 4-6% of salary.
Unlimited PTO: In practice, take 15-25 days per year. More is unusual.
Home office stipend: $500-$2K annually for equipment, internet, furniture.
Professional development budget: $1K-$5K annually for conferences, courses, books.
Relocation: If moving for the job, company often pays for moving costs.

Negotiate: signing bonus (usually 10-25% of first-year salary), relocation (if applicable), home office stipend, professional development budget, flexible work arrangement, stock refresh (if staying multiple years, negotiate new grants to offset dilution).

Negotiation Tactics That Work

Most engineers leave 10-20% of compensation on the table by not negotiating. Here's what works:

Get offers in writing. Verbal offers are easy to walk back. Always confirm in writing with all terms.
Understand all components. Base salary, bonus (target %), equity (number of shares, strike price, vesting schedule), signing bonus, relocation, benefits. Get the details.
Negotiate each lever independently. If the base salary is non-negotiable, ask for higher equity. If they won't budge on salary or equity, ask for higher signing bonus or relocation budget or professional development.
Get competing offers if possible. Competing offers are the strongest negotiating lever. Even if you don't want the competing job, showing the offer helps negotiate the one you want.
Have a walk-away number. Know the minimum total comp you'll accept. Don't negotiate below that. This gives you confidence and prevents underselling.
Be reasonable. If they offer $250K and you ask for $500K, they'll laugh. If you ask for $280K, they'll consider it. Most companies expect negotiation, especially at IC3+.
Explain your reasoning. "Based on market rates for IC3 eval engineers in SF, I expected $300K salary + equity package. My offer was $260K. Can we get to $290K?" This is more persuasive than "I want more."
Timing. Negotiate before signing. Once you sign, you're locked in for at least a year. After that, you can negotiate again during review cycles.

Remember: recruiters and hiring managers negotiate all the time. They expect it. A modest counter-offer shows you've done your homework and respect your own value. Extreme asks (2x their offer) signal you're not serious or you don't understand the market.

Key Takeaways

Infrastructure is critical: Investing in eval storage, compute, APIs, and observability multiplies team productivity by 10x.
Technical depth matters: Master Python, statistics, and data pipelines. These are table stakes for eval engineers.
System design is valued: The ability to design large-scale eval systems serving many teams is a key differentiator for senior roles.
Career progression is structured: Clear ladders from IC1 to IC5. Promotion depends on impact, technical depth, scope, and business results.
Compensation is competitive: Eval engineering is relatively new and in-demand. Total comp ranges from $200K at IC2 to $800K+ at IC4 at FAANG.
Negotiate thoughtfully: Most offers have room for negotiation. Prepare, understand market rates, and ask for what you deserve.

Ready to Build Better Evals?

Whether you're starting an eval engineering function or advancing your skills to IC4+, the field needs strong technical leaders. Join the community of engineers building AI quality infrastructure at scale.

Explore More Topics

Eval Infrastructure Stack: Storage, Compute, APIs, Observability

Eval Results Database Design with Full Schema

Compute: Eval Job Scheduling Architecture

Eval Results API Design: Multi-Pattern Access

Observability: Logging, Tracing, and Alerting for Eval Systems

Core Technical Skills for Eval Engineers

Python for Eval Pipelines

Statistical Testing Libraries and Deep Statistical Knowledge

Data Pipeline Tools: Airflow, Prefect, Argo

ML Experiment Tracking: MLflow, Weights & Biases, Neptune

The Broader Stack

Eval System Design Interview: 5 Common Questions with Detailed Solutions

Question 1: Design an Eval Results Storage and Retrieval System

Question 2: Design an Evaluation Job Scheduler at Scale

Question 3: Design a Multi-Dimensional Eval Comparison System

Question 4: Design a Quality Monitoring System for Evals

Question 5: Design an Eval-as-a-Service Platform for Multiple Teams

Building Eval Tooling for Your Team: Libraries, CLIs, Dashboards, and Open Source

Internal Eval Libraries

CLI Tools for Common Tasks

Dashboards for Different Audiences

Open Source Contributions

Eval Engineer Career Ladder: IC1 to IC5 and Promotions

IC1 / Entry-Level Eval Engineer (0-1.5 years)

IC2 / Mid-Level Eval Engineer (1.5-3 years)

IC3 / Senior Eval Engineer (3-5 years)

IC4 / Staff Eval Engineer (5-8 years)

IC5 / Distinguished Engineer / Principal of Evals (8+ years)

Typical Promotion Criteria

Potential Career Tracks at Staff Level

Compensation & Equity Benchmarks (2024-2025)

Startup Stage (Series A-C, $10M-$100M funding)

Growth Stage (Series D-F, $100M-$5B valuation)

FAANG & Late-Stage Public (>$500B market cap)

Geographic Variations

Equity Considerations: How to Evaluate Offers

Bonus & Benefits

Negotiation Tactics That Work

Key Takeaways

Ready to Build Better Evals?

Related Lessons