What a DCR Is and Why It Exists
A Deployment Clearance Report is the authoritative, frozen document that states whether a system is ready to go to production and under what conditions. Unlike an evaluation report (which is exploratory), a DCR is a gate document. It answers a single question: Should we deploy this?
The DCR exists because:
- Accountability: Someone signs it. Someone is willing to stake their credibility on the decision.
- Traceability: In 18 months, when the model fails, you need to know exactly what was evaluated and what was deemed acceptable.
- Legal protection: In regulated industries (healthcare, finance, insurance), the DCR may be your evidence of due diligence.
- Risk quantification: A DCR forces a clear statement of known risks, not a vague "more testing needed."
- Speed: Once signed, you can deploy without endless relitigation. The gate decision is made; deployment can proceed.
A good DCR is a decision artifact, not a knowledge dump. It contains enough information to justify the decision but is ruthlessly edited to eliminate noise.
The 6-Section Structure
1. Executive Summary (250-400 words)
This is read by 80% of readers. It must stand alone. A busy VP should be able to read only this and understand the go/no-go decision and the top 3 risks.
Sub-components:
- One-line recommendation: "APPROVED for production deployment in healthcare verticals only, with real-time monitoring of hallucination rate."
- System being evaluated: Name, version, vendor (if applicable), deployment target, decision deadline.
- Scope and constraints: What use cases are covered? What are explicitly out of scope?
- Key metrics: Top 3-4 metrics that drove the decision. Example: "Hallucination rate: 2.1% vs. 5% threshold (PASS); Entity recognition accuracy: 94.2% vs. 90% threshold (PASS); Edge case coverage: 73% vs. 85% threshold (CONDITIONAL)."
- Critical risks: The 3 biggest remaining risks, with their mitigation status. Be specific. Not "performance may degrade" but "performance degrades 8% when query length > 500 tokens (0.3% of production queries, remediated by length truncation)."
- Revocation triggers: Under what conditions would you pull the plug? "If production hallucination rate exceeds 4% over any 7-day window, system is taken offline pending investigation."
Write the Executive Summary last, after all sections are complete, to ensure accuracy.
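A revocation trigger like the one above is only meaningful if it can be checked mechanically. A minimal sketch of such a check (the function name and daily-rate input format are illustrative, not part of any DCR template):

```python
from collections import deque

def should_revoke(daily_rates, threshold=0.04, window=7):
    """True if the mean hallucination rate over any `window` consecutive
    days exceeds `threshold` -- the revocation trigger in the example."""
    recent = deque(maxlen=window)  # sliding window of the last `window` days
    for rate in daily_rates:
        recent.append(rate)
        if len(recent) == window and sum(recent) / window > threshold:
            return True
    return False
```

Wiring a check like this into the monitoring pipeline makes the revocation condition verifiable rather than aspirational.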
2. Scope and Methodology (400-600 words)
This section prevents future arguments about whether the evaluation was relevant. Be painfully specific.
Sub-components:
- Evaluation mandate: Who asked for this evaluation? What was the original problem statement?
- System version: Git commit hash, model checkpoint ID, inference configuration. Include anything that affects behavior.
- Evaluation date: When was the evaluation run? Was it run multiple times?
- Test set composition: Size (n=), sources (real user logs, synthetic, public benchmark), stratification (by query length, topic, user tenure). Show coverage: "Test set includes 40% knowledge base queries, 30% clarification requests, 20% edge cases, 10% adversarial prompt injections."
- Human evaluation protocol: How many raters? Inter-rater agreement (Cohen's κ). Rater background (domain expert vs. customer service rep). Sampling methodology (all outputs evaluated or sample? if sample, what size?).
- Automated metric details: For RAGAS, ROUGE, or custom metrics, describe: input data format, preprocessing steps, any special parameters. "RAGAS faithfulness score uses OpenAI gpt-4-turbo as the LLM judge, temperature=0, run on a 1,000-example sample for compute efficiency."
- Evaluation environment: Hardware specs (important for latency claims). Was it run on the same hardware as production? Under what load?
- Exclusions: What was explicitly NOT tested? Why? Example: "Load testing beyond 1000 concurrent users not performed (not deployment blocker per stakeholder agreement); performance under DNS failure not tested (infrastructure decision, not model decision)."
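The inter-rater agreement statistic called for above (Cohen's κ) is simple to compute for two raters. A minimal self-contained sketch:

```python
def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labeling the same items (nominal labels)."""
    rater_a, rater_b = list(rater_a), list(rater_b)
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    # Observed agreement: fraction of items where both raters agree.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement by chance, from each rater's label frequencies.
    labels = set(rater_a) | set(rater_b)
    expected = sum((rater_a.count(l) / n) * (rater_b.count(l) / n) for l in labels)
    return 1.0 if expected == 1.0 else (observed - expected) / (1 - expected)
```

Report κ alongside the disagreement analysis; raw percent agreement alone overstates reliability when one label dominates.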
3. Findings (600-800 words)
This is the meat. Organize by capability, not by metric. A reader should understand what the system is good at and what it's bad at.
Sub-components:
- Primary findings (organized by use case or capability, not metric):
- Answering factual questions: Accuracy 96.3% (262/272), latency 1.2s p95. Best for specific facts (dates, names, numbers). Worst for synthesis across multiple documents (success rate 68%).
- Handling follow-ups: Re-ranking approach succeeds 85% of the time; 12% of follow-ups fail due to context loss and 3% due to lexical mismatch.
- Refusing out-of-scope: True negative rate 91% (correctly refused 212/233 out-of-scope queries); the remaining 21 out-of-scope queries were incorrectly attempted. Separately, 18 valid queries were incorrectly refused (false refusals), primarily on domain-specific terminology.
- Failure mode analysis (specific examples, not aggregates):
- Hallucinated facts: 23 instances found. Root cause 17/23 = retriever returning irrelevant documents. Remaining 6 = model generating plausible-sounding facts not in retrieved context. Example: "User asked about FDA approval timeline; model stated approval in Q3 2023 (not in docs, actually Q1 2024)."
- Latency regressions: 18 queries exceeded 5s threshold. Cause analysis: 14 were multi-hop queries on large knowledge base (remediable by tuning retrieval parameters); 4 were timeout artifacts (remediable by infrastructure change).
- Segmentation analysis: "Accuracy by query type: factual 96%, procedural 82%, clarification 71%. Accuracy by user tenure: first-time users 78%, long-term users 92%."
- Comparison to baseline: "Compared to previous model: accuracy +4 percentage points, hallucination rate -2 percentage points, latency -0.3s p95."
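Refusal numbers like the ones above reduce to confusion counts over labeled outcomes. A sketch of that computation (the record format and counts are illustrative):

```python
from collections import Counter

def refusal_metrics(records):
    """records: (in_scope, refused) pairs, where refusing is the
    correct action for out-of-scope queries."""
    c = Counter(records)
    correctly_refused = c[(False, True)]              # out-of-scope, refused
    out_of_scope = correctly_refused + c[(False, False)]
    return {
        "true_negative_rate": correctly_refused / out_of_scope,
        "false_refusals": c[(True, True)],            # valid queries refused
    }
```

Keeping the raw (in_scope, refused) labels in the appendix lets a reviewer recompute every rate in the Findings section.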
4. Risk Assessment (400-600 words)
Risks are not areas where performance was mediocre. Risks are scenarios where real harm could happen.
Sub-components:
- Severity: What is the harm if this risk materializes? Medical error, financial loss, user frustration, brand damage?
- Likelihood: How often will this actually happen? (Not "it could happen someday" but "it happens in X% of production traffic.")
- Detectability: Can we catch it in production? With what latency?
- Mitigation: What will we do to reduce the risk?
Example risk structure:
Risk: Model hallucination in medical recommendations. Severity: HIGH (patient safety). Likelihood: 2.1% (observed in eval set, extrapolated to production). Detectability: MEDIUM (requires human review of high-confidence medical recommendations; ~500 per day). Mitigation: (1) Real-time alert for any medical claims + recommendation (2) Requires second human review before showing to user (3) Monthly audit of alerts. Residual risk after mitigation: LOW.
List 5-8 distinct risks. Not every concern is a risk; some are just "areas we don't have data on."
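The severity/likelihood/detectability/mitigation structure above maps naturally onto a small data structure, which keeps the risk register sortable and auditable. A sketch (field names and the triage ordering are illustrative):

```python
from dataclasses import dataclass

@dataclass
class Risk:
    name: str
    severity: str       # LOW / MEDIUM / HIGH
    likelihood: float   # observed share of production traffic
    detectability: str
    mitigation: str
    residual: str       # residual risk after mitigation

SEVERITY_ORDER = {"HIGH": 0, "MEDIUM": 1, "LOW": 2}

def triage(risks):
    """Order risks by severity, then by likelihood (most frequent first)."""
    return sorted(risks, key=lambda r: (SEVERITY_ORDER[r.severity], -r.likelihood))
```

Forcing every risk through the same fields is what separates a risk register from a list of vague concerns.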
5. Recommendation (100-200 words)
The shortest section. Three options:
- APPROVED: Deploy as-is. Performance meets all criteria. No additional monitoring beyond standard practice.
- APPROVED with CONDITIONS: Deploy, but with specific conditions (monitoring, guardrails, user restrictions). System doesn't deploy if conditions aren't met.
- NOT APPROVED: Do not deploy. Performance gaps must be closed first. Recommend specific fixes before re-evaluation.
Write the recommendation defensively. Assume it will be read in a lawsuit.
Strong language: "APPROVED for production deployment in US healthcare settings. Prerequisite: Real-time hallucination detection system must be operational before launch (target deployment: [date])."
Weak language: "We think it's probably ready to go. Some monitoring would be good. Hallucinations are a concern but maybe they'll be okay."
6. Appendix (variable length)
Everything that supports the claims in sections 1-5:
- Detailed metrics table (all metrics, all segments)
- Sample error analysis (20-30 representative failures with root cause)
- Test set composition breakdown
- Rater agreement documentation (κ, disagreement analysis)
- Latency distribution (p50, p95, p99 by query type)
- Cost/benefit analysis
- Links to evaluation code/config (or attach archived version)
DCR Writing Style Guide
Be specific. "High accuracy" is not acceptable. "94.2% accuracy on factual queries, 68% on synthesizing queries" is acceptable.
Quantify uncertainty. "Approximately 92% accurate" is weak. "92.1% ± 3.4% (95% CI)" is strong. This signals you understand statistical rigor.
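The interval quoted above ("92.1% ± 3.4% (95% CI)") is the standard normal-approximation interval for a proportion. A minimal sketch (note this approximation degrades for small n or accuracy near 0% or 100%):

```python
import math

def accuracy_ci(correct, total, z=1.96):
    """Normal-approximation 95% CI for an accuracy estimate:
    p +/- z * sqrt(p * (1 - p) / n)."""
    p = correct / total
    half = z * math.sqrt(p * (1 - p) / total)
    return p, half
```

For example, `accuracy_ci(90, 100)` gives 0.90 with a half-width of about 0.059, i.e. "90% ± 5.9% (95% CI)".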
Use numbers as evidence, not justification. Not: "The accuracy is 92%, which is good." Better: "Accuracy is 92%, vs. 85% for baseline, vs. 95% for production requirement. This is 3 percentage points below requirement; Section 4 details how this risk is mitigated."
Avoid hedging language: Not "might," "may," "could," "probably." Use "will," "does," "observed."
Example hedging → strong:
- "The model might hallucinate facts." → "Model hallucinated facts in 2.1% of evaluation outputs (23/1100). Root cause analysis shows 17/23 attributable to retriever failure."
- "Performance could degrade with longer queries." → "Accuracy decreases 1.2 percentage points per 100 additional query tokens. At 500 tokens (0.3% of production queries), accuracy is 91.8% vs. 96.1% baseline. Cost of specialized truncation logic: estimated 2 engineer-days."
Own the decision. A DCR is not a hedge. Don't hide behind "more research needed." If you genuinely cannot decide, the recommendation is NOT APPROVED.
Writing the Go/No-Go Recommendation Defensibly
Your recommendation will be scrutinized. Write it as if you're testifying under oath.
Elements of a defensible APPROVED recommendation:
- Clear criteria that were pre-specified. "Deployment approved if: (a) accuracy ≥ 90% on all segments, (b) hallucination rate < 5%, (c) latency p95 < 3s, (d) inter-rater agreement κ ≥ 0.70. All criteria met."
- Acknowledgment of known unknowns. "We have high confidence in performance on factual queries. We have not evaluated performance on hypothetical/counterfactual queries. This is acceptable because [stated rationale]."
- Baseline comparison. If replacing an existing system: "Accuracy improved 4 percentage points. Hallucination rate decreased 60%. Latency increased 0.2s (acceptable tradeoff per stakeholder)."
- Specific monitoring and escalation. "Real-time monitoring of hallucination rate (daily dashboard, automated alert if > 4%). Monthly review of escalations. Automatic rollback if hallucination rate exceeds 5% for 2 consecutive days."
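The automatic-rollback rule above ("exceeds 5% for 2 consecutive days") is also mechanically checkable. A minimal sketch (names are illustrative):

```python
def should_rollback(daily_rates, threshold=0.05, consecutive=2):
    """True if the daily rate exceeds `threshold` for `consecutive`
    days in a row -- the escalation rule in the example."""
    streak = 0
    for rate in daily_rates:
        streak = streak + 1 if rate > threshold else 0
        if streak >= consecutive:
            return True
    return False
```

A condition a script can evaluate is a condition an auditor can verify; anything fuzzier belongs back in the Risk Assessment.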
Elements of a defensible NOT APPROVED recommendation:
- Specific gaps. Not: "Accuracy is too low." Better: "Accuracy is 87%, below the 90% threshold. This system serves ~1.4 billion customer-facing queries per quarter; the 3-point gap between 87% and 90% means ~42 million additional quarterly exposures to errors."
- Path to approval. "To reach approval: retraining on 500 additional edge cases (estimated 2-week cycle) or reducing scope to FAQ-only use case (reduces error exposure to 8 million quarterly). Recommend retraining path."
- Re-evaluation criteria. "Will re-evaluate when: (a) retraining complete, (b) new eval set built from recent edge cases, (c) accuracy ≥ 90% on validation, (d) κ ≥ 0.75 on inter-rater agreement."
How Different Audiences Read the DCR
The CTO (30 seconds): Reads Executive Summary only. Needs to know: go or no-go, and the one biggest risk.
The Product Manager (5 minutes): Reads Executive Summary + Findings. Needs to understand: what does it do well, what does it do badly, can we launch it.
The Engineer (45 minutes): Reads everything. Needs to understand: the exact evaluation methodology, every failure mode, what monitoring to set up, what to be ready to debug.
The Compliance Officer (60 minutes): Reads Scope/Methodology, Risk Assessment, and Appendix. Needs to verify: was evaluation rigorous, are risks documented, can we defend this decision in an audit.
The End User (they don't read it): A public-facing summary may be appropriate, but the full technical DCR is internal.
Structure your DCR to serve all four audiences simultaneously. The Executive Summary lets the CTO skim. The appendix lets the engineer deep-dive. Section 4 (Risk Assessment) is where the Compliance Officer lives.
The Conditional Deployment Option
Many systems are neither clearly "GO" nor "NO-GO." They're "GO with guardrails." This is not a cop-out if structured properly.
Example: "APPROVED for production in healthcare settings, with the following mandatory conditions:
- Real-time hallucination detection system operational (Condition Status: ready for prod, deploy May 15).
- All outputs with confidence > 0.85 on medical claims require human review before showing to user (Condition Status: UX designer allocated, estimate June 1).
- Monitoring dashboard live with alerts if hallucination rate exceeds 4% on any rolling 24h window (Condition Status: data eng allocated, estimate May 20).
- Weekly review of flagged outputs for first 4 weeks, then monthly thereafter (Condition Status: ops team allocated).
System will NOT launch until all four conditions are met. Estimated launch date: June 5. If any condition is delayed >2 weeks, decision will be revisited."
This is concrete, not vague. It's deployable. It's defensible.
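A conditional approval like this is only defensible if the conditions are tracked as a hard launch gate. A minimal sketch (condition names abbreviated from the example above):

```python
from dataclasses import dataclass

@dataclass
class Condition:
    name: str
    met: bool

def unmet(conditions):
    """Names of mandatory conditions still blocking launch."""
    return [c.name for c in conditions if not c.met]

def may_launch(conditions):
    """The system does not launch until every mandatory condition is met."""
    return not unmet(conditions)
```

The point of the gate is that it is binary: either `unmet()` is empty and launch may proceed, or the DCR's conditional approval has not yet been satisfied.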
Sample DCR Excerpt (RAG System)
Scenario: Evaluating a RAG system that answers questions about a company's internal policies for 5,000 employees.
EXECUTIVE SUMMARY
RECOMMENDATION: APPROVED for production deployment, with mandatory real-time monitoring.
System: PolicyBot v2.1 (Git hash: a7f3e2b), candidate for replacing PolicyBot v1.0 (currently 300 queries/day).
Evaluation Summary: PolicyBot v2.1 correctly answers 94.2% of questions (accuracy on ground truth), vs. 88.3% for v1.0. Hallucination rate: 1.8% (20/1100 queries generated facts not in policy documents), vs. 4.2% for v1.0. Latency: p95 1.4 seconds (acceptable for async deployment model). Human evaluation (3 raters, κ=0.78) confirms accuracy on policy interpretation questions.
Top 3 Risks:
1. Hallucination on edge cases (salary policies): 8/20 hallucinations involve salary amounts. Likelihood: 0.7% of production queries. Mitigation: Real-time alert for salary-related answers; human review before showing to employee. Residual risk: LOW.
2. Performance on complex multi-step procedures: Success rate 71% on procedures requiring > 3 steps. Likelihood: 5% of queries. Mitigation: UI guidance to decompose complex questions; escalation to HR for complex cases. Residual risk: MEDIUM (acceptable given UI mitigation).
3. Handling of policy updates: System knowledge cutoff is Jan 1, 2024. Likelihood of out-of-date answer: 2% (recent policy changes). Mitigation: Monthly retraining on new policies; quarterly evaluation refresh. Residual risk: MEDIUM (expected for RAG systems).
Conditions: (1) Real-time hallucination detection system must be live before production. (2) Weekly monitoring of hallucination rate for first month. (3) Automatic rollback if hallucination rate exceeds 5%.
---
FINDINGS
Accuracy by Question Type:
- Factual (benefits, eligibility): 98.1% (156/159)
- Multi-step procedures: 71.9% (41/57)
- Interpretation (does policy X apply to situation Y): 89.3% (67/75)
- Salary/compensation: 93.1% (27/29)
Overall: 94.2% (291/309)
Hallucination Analysis (20 instances):
- Salary amounts hallucinated: 8 instances
Example: "Q: What is the bonus structure for engineers? A: 10-15% base salary (INCORRECT: policy specifies 5-8%)"
Root cause: Retriever returned salary band document; LLM extrapolated to bonus structure.
- Policy dates hallucinated: 7 instances
Example: "Q: When did remote work policy start? A: Started in 2019 (policy actually started 2020)."
Root cause: Model training data included older company announcements.
- Procedure details hallucinated: 5 instances
Example: "Q: Who approves PTO over 30 days? A: Department head and Finance (actually just Department head)."
Root cause: LLM combining two related policies (PTO approval and budget approval).
Latency: p50=0.8s, p95=1.4s, p99=2.1s. All queries completed within 3s SLA.
Comparison to v1.0: Accuracy +5.9pp, hallucination -2.4pp, latency -0.2s. v1.0 failure modes (context cutoff, inability to handle multi-turn) now resolved.
---
RISK ASSESSMENT
Risk 1: Salary-related hallucinations (8/20 of all hallucinations).
Severity: HIGH (employee makes decisions based on incorrect compensation info).
Likelihood: 0.7% of production traffic (~2 queries/day).
Detectability: HIGH (real-time check if salary-related terms appear in query + answer; human review required).
Mitigation: (1) Automated salary query detector + escalation to HR. (2) If confidence < 0.80 on salary queries, refuse answer. (3) Weekly audit of flagged queries.
Residual Risk: LOW.
Risk 2: Complex procedure failures (16/57 multi-step procedures failed).
Severity: MEDIUM (employee follows incomplete procedure, wastes time, may need to restart).
Likelihood: 5% of queries are multi-step; at a 71% success rate, failures affect ~1.45% of overall traffic.
Detectability: MEDIUM (user feedback + monthly metric review).
Mitigation: (1) UI recommends breaking complex questions. (2) System refuses answers for procedures with > 4 steps. (3) Escalates to HR.
Residual Risk: MEDIUM (acceptable; user experience improvement justifies residual risk).
Risk 3: Policy updates lag (quarterly retraining).
Severity: MEDIUM (employee gets outdated policy).
Likelihood: 2% (estimate based on policy change velocity).
Detectability: MEDIUM (detected in monthly eval refresh; user reports).
Mitigation: (1) Monthly update of high-velocity policies (salary, benefits deadlines). (2) Quarterly full retraining. (3) Change log visible to users ("Last updated: [date]").
Residual Risk: MEDIUM (expected for any RAG system; business accepts this tradeoff).
Common DCR Mistakes
Mistake 1: Dumping data instead of making a decision. A DCR with 200 metrics and no clear recommendation is a bad DCR. Ruthlessly prioritize. Show the 4-5 metrics that actually matter for the decision.
Mistake 2: Hedging the recommendation. "We think it's probably ready, but more testing might be good" is not a recommendation. It's a non-decision that will cause the project to stall.
Mistake 3: Making conditions that are impossible to verify. Bad: "System should perform well across all user types." Good: "Accuracy ≥ 90% for users with tenure < 1 month, which we will sample weekly from production."
Mistake 4: Not explaining failure modes. If there are failures, explain why. "2% of queries failed" is weak. "2% of queries failed; 60% were out-of-scope prompts that should have been filtered before reaching the model; 40% were legitimate edge cases involving time-sensitive policies" is better.
Mistake 5: No baseline comparison. "Accuracy is 94%" means nothing without context. Compared to what? To v1.0 (88%)? To the production requirement (90%)? To human performance (96%)? Give all three.
Mistake 6: Ignored statistical rigor. If your eval set is small (n < 100), your confidence intervals are wide. Say so. If you didn't check inter-rater agreement, your accuracy numbers are questionable. Don't hide it.
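For small evaluation sets, the Wilson score interval quantifies exactly how wide "wide" is, and it behaves better than the normal approximation when n is small or accuracy is near 0% or 100%. A sketch:

```python
import math

def wilson_ci(correct, total, z=1.96):
    """Wilson score 95% interval for a proportion: better behaved than
    the normal approximation at small n or extreme accuracies."""
    p = correct / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total))
    return center - half, center + half
```

With n=100 and 90% accuracy the interval is roughly (0.83, 0.94); with n=10 it widens to roughly (0.60, 0.98). That second interval is the honest statement a small-n DCR owes its readers.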
DCR vs. Model Card vs. System Card
| Document | Purpose | Audience | Tone | Length | Decision Type |
|---|---|---|---|---|---|
| DCR | Gate-keeping: is this ready to deploy? | Internal (exec, engineer, compliance) | Defensive, precise, decision-focused | 5-8 pages | Go/No-Go/Conditional |
| Model Card | Model transparency: what is this model, what does it do? | External (researchers, users, regulators) | Transparent, comprehensive, educational | 4-6 pages | Informational (no gate) |
| System Card | System transparency: how do all components work together? | External + internal | Educational, system-level view | 8-12 pages | Informational (no gate) |
A DCR is action-forcing; a Model Card is informational. You could publish your Model Card without publishing the DCR (for privacy/liability reasons). You cannot deploy without a DCR.
Key Takeaways
- A DCR is a gate document: it authorizes or blocks deployment. It is not optional.
- The 6-section structure (Executive Summary, Scope, Findings, Risk Assessment, Recommendation, Appendix) ensures all stakeholders can extract what they need.
- Write defensibly. Assume your DCR will be scrutinized in a legal discovery process.
- Quantify everything. Avoid hedging. Specify conditions concretely, not vaguely.
- The recommendation (GO/NO-GO/CONDITIONAL) must be unambiguous. If you genuinely can't decide, the answer is NO-GO.
