What Is an Eval Platform? Orchestration at Scale
An evaluation platform is a system that orchestrates the complete workflow of executing evaluations: accepting evaluation jobs, managing datasets, running evaluators (LLM judges, metric calculators, automated tests), storing results, and providing analysis tools. Where point solutions focus on one part of the pipeline (e.g., just managing annotations), a platform manages the entire flow with orchestration and coordination between stages.
The key distinction: a tool like Label Studio helps you annotate data. A tool like DeepEval orchestrates: it ingests your data, runs evaluations using multiple evaluator backends (LLM judges, local ML models, human review), manages the entire workflow, and provides result analysis. Platforms are more complex but solve the coordination problem that arises when you scale from tens of evaluations to millions.
Building an internal platform (vs. using a SaaS solution) makes sense when: (1) You have evaluation workflows unique to your domain that no commercial platform handles well, (2) You need tight integration with internal systems (model registry, feature store, data lake), (3) You have data sensitivity constraints that make SaaS difficult, (4) You have sufficient engineering resources to build and maintain the platform. For most organizations, using a commercial platform (Braintrust, DeepEval) and building thin customization on top is the right tradeoff. Building a platform from scratch is a 6-12 month engineering effort.
Scale Requirements: From 1K to 100M Evaluations Per Day
Architecture decisions are driven by scale. A platform evaluating 1,000 examples/day has different requirements than one evaluating 100M/day. Your choices around job queuing, storage, and result indexing change fundamentally.
1K evals/day (small): You can evaluate synchronously. User submits 100 examples for evaluation, they wait, you return results within minutes. Your evaluator is a single container running sequentially. Storage is a Postgres database. This architecture is simple enough for a team to maintain.
1M evals/day (medium): You need asynchronous evaluation. User submits jobs, they queue, you process them in batches. You need a job queue (Celery, Kafka) to decouple submission from processing. You can run multiple evaluators in parallel. Results are stored in a columnar database (BigQuery, ClickHouse) because you need to run analytical queries across millions of results. Response time to completion is hours, not minutes.
100M evals/day (large): You need streaming evaluation. Real-time ingestion of evaluation requests, continuous processing, immediate result availability for dashboards. You need state-of-the-art infrastructure: Kafka for event streaming, Flink for stream processing, columnar databases for analytical queries, distributed caching (Redis) for hot results. You also need multi-region replication for availability. This is the scale of major cloud providers' infrastructure.
Most teams operate in the 1M-10M evals/day range, which is the sweet spot for standard SaaS or custom-built platforms. Key decision points at this scale: (1) How fast must results be available? (batch: hours, streaming: minutes), (2) How many concurrent evaluation requests can you handle? (single queue: 100s, distributed: 1000s), (3) How much storage? (terabytes: columnar DBs, petabytes: data lake).
Core Platform Components: Building Blocks
Eval Job Queue: Accepts evaluation requests, prioritizes them, and distributes to workers. Handles: job submission API, retry logic, priority assignment (high-priority evals cut the queue), status tracking.
Dataset Registry: Stores evaluation datasets with versioning, metadata, and lineage. Tracks: which examples are in which dataset versions, where they came from, what changes were made to versions.
Evaluator Registry: Manages available evaluators (LLM judges, metric calculators, custom code). Tracks: evaluator versions, configurations, cost, performance characteristics (latency, accuracy if known).
Worker Pool: Executes evaluations. Can be heterogeneous: some workers run LLM judges (GPT-4, Claude), others run fast metrics (BLEU score, exact match), others handle human review. Workers report results and status back to the queue.
Result Store: Durable storage for evaluation results with analytical query support. Typically a columnar database (BigQuery, ClickHouse, Parquet on S3). Supports: writing results at high throughput, querying across millions of results efficiently.
Analysis Engine: Computes aggregate statistics, comparisons, and insights from results. Examples: "metric distribution by model version," "failed examples grouped by error type," "segment-level performance comparison."
Reporting Layer: Dashboards, APIs, and exports for consuming results. Makes results accessible to engineers, data scientists, and non-technical stakeholders.
The Eval Job Queue: Orchestrating at Scale
A job queue decouples job submission from execution. A user or script submits an evaluation job ("run this evaluator on this dataset using this model version"), the queue stores it durably, and workers pick up jobs from the queue when they have capacity. This prevents system overload and makes the system resilient (if a worker crashes, the job remains in the queue and another worker can pick it up).
Job Structure: A job typically includes: dataset_id (which examples to evaluate), model_id (which model to evaluate), evaluator_id (which evaluator to run), parameters (model temperature, evaluator thresholds, etc.), priority (urgent evals jump the queue), user_id (for access control), timestamp (when was this job submitted).
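The job fields above can be captured as a simple record type. This is an illustrative sketch, not a schema from any particular platform; the field names mirror the list above.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class EvalJob:
    # Hypothetical job record matching the fields described above.
    dataset_id: str        # which examples to evaluate
    model_id: str          # which model to evaluate
    evaluator_id: str      # which evaluator to run
    parameters: dict       # model temperature, evaluator thresholds, etc.
    priority: int = 2      # 0=critical, 1=high, 2=normal, 3=low
    user_id: str = ""      # for access control
    submitted_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

job = EvalJob("ds-v3", "model-2024-06", "llm-judge-v1",
              {"temperature": 0.0}, priority=1, user_id="team-a")
```

Serializing this record (e.g., to JSON) is what actually goes onto the queue.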
Queue Implementation Options: Celery (distributed task queue, integrates with Python/Django), Kafka (streaming platform, more operational overhead but highly scalable), AWS SQS (managed service, simple but less flexible), Google Cloud Tasks (managed, good integration with GCP stack).
Retry Logic: If an evaluator fails (LLM API timeout, model crash), you need retry logic. Standard approach: retry with exponential backoff (wait 1s, then 4s, then 16s) for transient failures. For persistent failures, move the job to a dead-letter queue for manual investigation. Distinguish transient failures (API rate limit, temporary outage) from permanent failures (invalid evaluator, unsupported model).
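A minimal sketch of that retry policy, assuming two hypothetical exception classes to separate transient from permanent failures. The backoff schedule (1s, 4s, 16s) matches the example above.

```python
import time

class TransientError(Exception):
    """API rate limit, temporary outage -- safe to retry."""

class PermanentError(Exception):
    """Invalid evaluator, unsupported model -- do not retry."""

def run_with_retries(fn, max_retries=3, base_delay=1.0, sleep=time.sleep):
    """Retry fn on transient failures with exponential backoff (1s, 4s, 16s)."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except PermanentError:
            raise  # goes straight to the dead-letter queue
        except TransientError:
            if attempt == max_retries:
                raise  # retries exhausted: dead-letter queue
            sleep(base_delay * 4 ** attempt)
```

The `sleep` parameter is injected so the backoff behavior can be tested without real waiting.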
Priority Queuing: High-priority evals (safety-critical evaluations, customer-facing requests) should cut the queue. Implement priority levels: critical (immediate processing), high (within 1 hour), normal (within 8 hours), low (background). Queue workers service higher-priority jobs first.
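An in-memory sketch of those priority levels using a heap. A real deployment would use the queue system's native priority support (e.g., Celery queues or SQS priority tiers); this just shows the ordering semantics, with a counter as a tiebreaker so equal-priority jobs stay FIFO.

```python
import heapq
import itertools

PRIORITY = {"critical": 0, "high": 1, "normal": 2, "low": 3}

class PriorityJobQueue:
    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # FIFO within a priority level

    def submit(self, job, priority="normal"):
        heapq.heappush(self._heap, (PRIORITY[priority], next(self._counter), job))

    def next_job(self):
        # Workers always pull the highest-priority (lowest-number) job first.
        return heapq.heappop(self._heap)[2]
```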
Dataset Registry Design: Versioning and Lineage
A dataset registry stores evaluation datasets with full versioning and lineage tracking. This solves a critical problem: "which dataset version was this evaluation run against?" Evaluation results are meaningless without knowing exactly what data was used.
Dataset Version Structure: Each dataset version includes: examples (input + expected output), metadata (source, created date, creator), lineage (which dataset versions contributed to this version). This creates an audit trail: you can trace how a dataset evolved over time and understand why certain examples were added/removed.
Example Metadata Schema: Each example should store: unique ID (for deduplication), text/embedding (the input), ground truth (if available), difficulty (easy/medium/hard, for stratification), source (where did this come from?), tags (use case, domain, failure mode type). This metadata enables intelligent sampling and segment-level analysis.
Versioning Strategy: Use semantic versioning or timestamp-based versions. When you modify a dataset (add examples, fix labels), create a new version. Never mutate existing versions—evaluation results from old evaluations should remain valid. This prevents the problem of "results from last week are now invalid because someone changed the dataset."
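One way to make "never mutate existing versions" concrete is content-addressed versioning: the version ID is a hash of the examples, and every new version records its parent. This is an illustrative sketch with an in-memory registry standing in for a real database.

```python
import hashlib
import json

def new_dataset_version(registry, parent_version, examples, note=""):
    """Register an immutable dataset version derived from parent_version."""
    payload = json.dumps(examples, sort_keys=True).encode()
    version_id = hashlib.sha256(payload).hexdigest()[:12]  # content-addressed ID
    registry[version_id] = {
        "examples": list(examples),
        "lineage": {"parent": parent_version, "note": note},
    }
    return version_id
```

Because the ID is derived from the content, "changing" a dataset always produces a new version, and old results remain tied to the exact data they were run against.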
Storage Format: Store datasets as Parquet files or in a columnar database (BigQuery table). Parquet is portable (you can version it in Git or S3), columnar (efficient for querying), and schema-enforced. Large datasets (>10 GB) should live in cloud storage (S3, GCS) with a pointer in the registry.
Evaluator Orchestration: Managing Diverse Evaluators
Different evaluation tasks need different evaluators. Some use LLM judges, others use metric calculators, others require human review. The platform must orchestrate this heterogeneity: accept evaluation jobs, determine which evaluators are appropriate, manage costs and rate limits, and cache results.
Evaluator Types: LLM judge (GPT-4, Claude running a prompt to grade outputs), metric evaluator (BLEU score, ROUGE, exact match—fast, cheap, no API calls), embedding similarity (cosine similarity in embedding space), test suite (parameterized tests that check specific conditions), human review (send to annotation platform for human labeling).
Evaluator Registry: For each evaluator, store: evaluator_id (unique identifier), type (LLM, metric, test), configuration (which model, which prompt, hyperparameters), cost (cost to run 1 eval—a GPT-4 judge might cost $0.01, while a BLEU score is effectively free), latency (GPT-4 takes 2s per example, BLEU takes 10ms), accuracy (if known from validation). This lets you make intelligent routing decisions: "for this task, the LLM judge is 5x more accurate than metrics but 100x more expensive, so use LLM only for uncertain cases."
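The "use the LLM only for uncertain cases" routing decision can be sketched as a threshold rule: trust the cheap metric when its score is confidently high or low, and escalate to the expensive judge only in the uncertain middle band. The band boundaries here are assumed values you would tune on validation data.

```python
def route_evaluation(example, cheap_eval, llm_judge, low=0.3, high=0.8):
    """Run the cheap metric first; escalate to the LLM judge only when
    the cheap score falls in the uncertain band [low, high]."""
    score = cheap_eval(example)
    if low <= score <= high:
        return llm_judge(example), "llm_judge"   # uncertain: pay for accuracy
    return score > high, "cheap_metric"          # confident: trust the metric
```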
Cost Tracking: Track evaluation cost by job, user, time period. If LLM judges are expensive, you need visibility into costs to optimize. Standard optimization: (1) Run cheap metrics first, (2) Only use expensive evaluators on uncertain examples (active learning), (3) Cache evaluator outputs (same input should have same output).
Rate Limiting: LLM APIs have rate limits (e.g., 100K tokens/minute for GPT-4). Your platform needs to respect these limits to avoid getting rate-limited. Typical approach: track tokens used per minute per API key, queue jobs when approaching limits, distribute load across multiple API keys if available.
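The per-minute token tracking described above can be sketched as a sliding-window limiter. This is a simplified, single-process version; a real platform would need a shared store (e.g., Redis) so all workers see the same usage. The clock is injected for testability.

```python
import time

class TokenRateLimiter:
    """Sliding-window limiter for an LLM API token budget (e.g. 100K tokens/min)."""

    def __init__(self, tokens_per_minute, clock=time.monotonic):
        self.limit = tokens_per_minute
        self.clock = clock
        self.events = []  # list of (timestamp, tokens_used)

    def try_acquire(self, tokens):
        now = self.clock()
        # Drop usage older than the 60-second window.
        self.events = [(t, n) for t, n in self.events if now - t < 60]
        used = sum(n for _, n in self.events)
        if used + tokens > self.limit:
            return False  # caller should queue the job, not call the API
        self.events.append((now, tokens))
        return True
```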
Fallback Strategies: What if your primary evaluator (GPT-4) is rate-limited or offline? Define fallbacks: try Claude, then Gemini, then a local open-source model. This ensures evaluations can complete even if a single evaluator becomes unavailable. Document fallback behavior so users understand which evaluators might be used.
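A fallback chain like "GPT-4, then Claude, then a local model" reduces to trying evaluators in order and recording which one actually ran, so users can see when a fallback was used. A sketch:

```python
def evaluate_with_fallbacks(example, evaluators):
    """Try each (name, evaluator) pair in order; return the first result
    along with the name of the evaluator that produced it."""
    errors = []
    for name, evaluate in evaluators:
        try:
            return evaluate(example), name
        except Exception as exc:  # rate limit, outage, etc.
            errors.append((name, repr(exc)))
    raise RuntimeError(f"all evaluators failed: {errors}")
```

Logging the returned evaluator name per result is what makes the documented fallback behavior auditable.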
Result Storage Architecture: Supporting Analytical Queries
Storing 1M evaluation results per day and supporting efficient analytical queries (aggregate metrics, segment analysis, failure clustering) requires specialized infrastructure. Row-oriented databases (Postgres, MySQL) optimize for single-record access; columnar databases (BigQuery, ClickHouse, DuckDB) optimize for analytical queries.
Result Schema: Each result includes: job_id, example_id, model_id, evaluator_id, metric_value (or label for LLM judges), execution_time, timestamp, error_message (if failed). Keep results immutable—never update old results. If you want to re-evaluate, create a new job and new result record.
Write Optimization: You're appending millions of result rows daily. Batching writes (writing 10K results at once) is roughly 100x more efficient than individual inserts. Use bulk APIs: BigQuery load jobs, ClickHouse batch inserts, or S3 Parquet files with periodic ingestion.
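The batching pattern is a small buffer that flushes to a bulk-insert callable once it fills. This sketch is storage-agnostic: `bulk_insert` stands in for whatever bulk API you use (a BigQuery load, a ClickHouse insert, a Parquet write).

```python
class BatchedResultWriter:
    """Buffer result rows and flush them in bulk instead of row-by-row."""

    def __init__(self, bulk_insert, batch_size=10_000):
        self.bulk_insert = bulk_insert  # e.g. a BigQuery load or ClickHouse insert
        self.batch_size = batch_size
        self.buffer = []

    def write(self, row):
        self.buffer.append(row)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        # Call flush() on shutdown too, or trailing rows are lost.
        if self.buffer:
            self.bulk_insert(self.buffer)
            self.buffer = []
```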
Query Patterns: Typical queries: "average metric by model version," "metric percentiles by user segment," "failure rate over time," "top 10 failure examples." These require scanning millions of rows efficiently. Columnar databases handle this naturally; row-oriented databases struggle. Add indices on commonly-filtered columns (model_id, timestamp, user_id).
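To make the first query pattern concrete, here is what "average metric by model version" looks like as logic, shown over in-memory rows for illustration; in production this would be a GROUP BY in your columnar database, not application code.

```python
from collections import defaultdict
from statistics import mean

def avg_metric_by_model(results):
    """Equivalent of: SELECT model_id, AVG(metric_value) ... GROUP BY model_id."""
    by_model = defaultdict(list)
    for row in results:
        by_model[row["model_id"]].append(row["metric_value"])
    return {model: mean(values) for model, values in by_model.items()}
```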
Retention Policy: Storage costs grow over time. Define retention: keep all results for 30 days, aggregate to daily summaries after 30 days, delete summaries after 1 year. This keeps recent data fully queryable while reducing storage costs for historical data. Ensure you're not deleting data you'll need for audits or compliance.
Real-Time vs. Batch Architecture Patterns
Batch Evaluation: Collect evaluation requests throughout the day, process them in batches at night. Pros: cost-efficient (batch processing is cheaper than streaming), simple to implement (standard data pipeline tools), works well for regression testing and periodic reviews. Cons: results take hours to become available, can't respond to urgent issues immediately.
Streaming Evaluation: Process evaluations as they arrive, results available within minutes. Pros: real-time dashboards, can respond immediately to quality issues, better for production monitoring. Cons: more operational complexity, requires streaming infrastructure (Kafka, Flink), higher cost.
Hybrid Approach (Recommended): Streaming for real-time signal (safety and anomaly detection), batch for bulk evaluation (regression testing, dataset improvements). Example: stream production queries through a fast evaluator (check for PII, profanity, basic hallucination). Batch daily regression tests against your evaluation dataset at night. This gives you both real-time responsiveness and cost-efficient bulk processing.
Streaming Implementation: Kafka ingests evaluation requests, Flink processes them (applies evaluators), results stream to result store and real-time dashboards. Simple streaming query: "count evaluations per minute by model version," alert if one model version has unexpectedly high failure rate.
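The alerting query at the end of that pipeline reduces to per-minute failure rates grouped by model version. A sketch of the aggregation logic, with `events` as (minute, model_id, failed) tuples standing in for the Kafka/Flink stream and an assumed 50% alert threshold:

```python
from collections import Counter, defaultdict

def failure_alerts(events, threshold=0.5):
    """Per-minute failure rate by model version; return (minute, model)
    keys whose failure rate exceeds the threshold."""
    stats = defaultdict(Counter)
    for minute, model, failed in events:
        stats[(minute, model)]["total"] += 1
        stats[(minute, model)]["failed"] += int(failed)
    return {
        key for key, c in stats.items()
        if c["failed"] / c["total"] > threshold
    }
```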
Multi-Tenant Platform Design: Isolating Data and Costs
If you're building a platform for multiple internal teams or multiple external customers, you need strong isolation: one team's data shouldn't be visible to others, and one team's runaway evaluation job shouldn't consume all system resources.
Data Isolation: Store data with tenant_id on every record. Enforce in queries: all queries must filter by tenant_id. In the application layer, validate that the current user belongs to the tenant_id they're querying. Use database row-level security if the database supports it (Postgres has RLS, BigQuery has row-level policies).
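The application-layer validation described above looks like this in sketch form: refuse the query outright if the user doesn't belong to the tenant, then filter by tenant_id. Database row-level security would enforce the same rule a layer lower as defense in depth.

```python
def tenant_scoped_query(results, tenant_id, user_tenants):
    """Application-layer guard: reject queries against tenants the user
    does not belong to, then filter rows by tenant_id."""
    if tenant_id not in user_tenants:
        raise PermissionError(f"user is not a member of tenant {tenant_id!r}")
    return [row for row in results if row["tenant_id"] == tenant_id]
```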
Resource Quotas: Limit each tenant: max evaluations per day, max cost per month, max storage used. Quotas prevent one tenant from monopolizing resources. When a tenant hits their quota, queue their evaluations but process them at lower priority.
Cost Attribution: Track cost per tenant: LLM API costs, storage, compute. Bill tenants or allocate costs internally. This incentivizes efficient evaluation practices (use cheap metrics first, cache aggressively).
API Keys and Access Control: Each tenant gets an API key for submitting jobs. Validate API keys, log which tenant submitted which job. This creates an audit trail and enables fine-grained access control (restrict certain evaluators to certain tenants, restrict certain datasets).
Reliability and Observability: Keeping the Platform Running
SLA Requirements: Define what the platform promises. Example SLA: "99.9% of submitted evaluations complete successfully, with median latency of 30 minutes." This commits the platform to reliability targets. Missing the SLA results in credits to customers, creating accountability.
Failure Mode Planning: What can go wrong? LLM API outage, database failure, evaluator crash, queue overload. For each failure mode, design a mitigation: LLM API outage → fallback to cheaper models, database failure → replicate to standby database, evaluator crash → circuit breaker stops sending jobs to that evaluator, queue overload → shed low-priority jobs. Document all mitigations.
Retry Strategies: Transient failures (API rate limit, temporary network issue) should retry automatically with backoff. Persistent failures (invalid evaluator, malformed input) should be logged and escalated. After N retries, mark the job as failed and alert the user.
Observability: Instrument everything: job latency, evaluator success rate, cost per evaluation, storage used, queue depth, API response times. Build dashboards showing platform health. Set up alerts: queue is growing (stuck jobs?), evaluator failure rate spike (evaluator broke?), cost per job increasing (something is inefficient?). These alerts should trigger PagerDuty or equivalent.
Tracing and Logging: Each job should have a trace ID. Log every step: job submitted, queued, started, evaluator selected, result stored, completed. If something fails, you can trace the path and understand where the failure occurred. Logs should be searchable by job_id, user_id, model_id for debugging.
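A minimal version of that trace logging: generate a trace ID per job and emit one structured line per step, with key=value fields so logs are searchable by job_id, user_id, or model_id. The line format here is an assumption for illustration; use your logging stack's structured format in practice.

```python
import logging
import uuid

def new_trace_id():
    return uuid.uuid4().hex

def log_step(trace_id, step, **fields):
    """Emit one searchable key=value log line per pipeline step."""
    extras = " ".join(f"{k}={v}" for k, v in sorted(fields.items()))
    line = f"trace_id={trace_id} step={step} {extras}".strip()
    logging.getLogger("eval-platform").info(line)
    return line
```

Every step (submitted, queued, started, evaluator selected, result stored, completed) logs with the same trace_id, so a failed job's full path can be reconstructed with one search.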
An evaluation platform is fundamentally an orchestration system managing: job queuing, dataset registry, diverse evaluators, result storage, and analysis tools. Scale and operational requirements drive architectural decisions. Start with batch processing if scale is moderate; add streaming as scale and latency requirements grow.
