What a DCR Is and Why It Exists
A Deployment Clearance Report is the authoritative, frozen document that states whether a system is ready to go to production and under what conditions. Unlike an evaluation report (which is exploratory), a DCR is a gate document. It answers a single question: Should we deploy this?
The DCR exists because:
- Accountability: Someone signs it. Someone is willing to stake their credibility on the decision.
- Traceability: In 18 months, when the model fails, you need to know exactly what was evaluated and what was deemed acceptable.
- Legal protection: In regulated industries (healthcare, finance, insurance), the DCR may be your evidence of due diligence.
- Risk quantification: A DCR forces a clear statement of known risks, not a vague "more testing needed."
- Speed: Once signed, you can deploy without endless relitigation. The gate decision is made; deployment can proceed.
A good DCR is a decision artifact, not a knowledge dump. It contains enough information to justify the decision but is ruthlessly edited to eliminate noise.
The 6-Section Structure
1. Executive Summary (250-400 words)
This is read by 80% of readers. It must stand alone. A busy VP should be able to read only this and understand the go/no-go decision and the top 3 risks.
Sub-components:
- One-line recommendation: "APPROVED for production deployment in healthcare verticals only, with real-time monitoring of hallucination rate."
- System being evaluated: Name, version, vendor (if applicable), deployment target, decision deadline.
- Scope and constraints: What use cases are covered? What are explicitly out of scope?
- Key metrics: Top 3-4 metrics that drove the decision. Example: "Hallucination rate: 2.1% vs. 5% threshold (PASS); Entity recognition accuracy: 94.2% vs. 90% threshold (PASS); Edge case coverage: 73% vs. 85% threshold (CONDITIONAL)."
- Critical risks: The 3 biggest remaining risks, with their mitigation status. Be specific. Not "performance may degrade" but "performance degrades 8% when query length > 500 tokens (0.3% of production queries, remediated by length truncation)."
- Revocation triggers: Under what conditions would you pull the plug? "If production hallucination rate exceeds 4% over any 7-day window, system is taken offline pending investigation."
Write the Executive Summary last, after all sections are complete, to ensure accuracy.
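A revocation trigger like the one above is only meaningful if it can be checked mechanically. A minimal sketch of such a check (the function name and daily-rate input format are illustrative, not part of any DCR template):

```python
from collections import deque

def should_revoke(daily_rates, threshold=0.04, window=7):
    """True if the mean hallucination rate over any `window` consecutive
    days exceeds `threshold` -- the revocation trigger in the example."""
    recent = deque(maxlen=window)  # sliding window of the last `window` days
    for rate in daily_rates:
        recent.append(rate)
        if len(recent) == window and sum(recent) / window > threshold:
            return True
    return False
```

Wiring a check like this into the monitoring pipeline makes the revocation condition verifiable rather than aspirational.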
2. Scope and Methodology (400-600 words)
This section prevents future arguments about whether the evaluation was relevant. Be painfully specific.
Sub-components:
- Evaluation mandate: Who asked for this evaluation? What was the original problem statement?
- System version: Git commit hash, model checkpoint ID, inference configuration. Include anything that affects behavior.
- Evaluation date: When was the evaluation run? Was it run multiple times?
- Test set composition: Size (n=), sources (real user logs, synthetic, public benchmark), stratification (by query length, topic, user tenure). Show coverage: "Test set includes 40% knowledge base queries, 30% clarification requests, 20% edge cases, 10% adversarial prompt injections."
- Human evaluation protocol: How many raters? Inter-rater agreement (Cohen's κ). Rater background (domain expert vs. customer service rep). Sampling methodology (all outputs evaluated or sample? if sample, what size?).
- Automated metric details: For RAGAS, ROUGE, or custom metrics, describe: input data format, preprocessing steps, any special parameters. "RAGAS faithfulness score uses OpenAI gpt-4-turbo as the LLM judge, temperature=0, run on a 1,000-example sample for compute efficiency."
- Evaluation environment: Hardware specs (important for latency claims). Was it run on the same hardware as production? Under what load?
- Exclusions: What was explicitly NOT tested? Why? Example: "Load testing beyond 1000 concurrent users not performed (not deployment blocker per stakeholder agreement); performance under DNS failure not tested (infrastructure decision, not model decision)."
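The inter-rater agreement statistic called for above (Cohen's κ) is simple to compute for two raters. A minimal self-contained sketch:

```python
def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labeling the same items (nominal labels)."""
    rater_a, rater_b = list(rater_a), list(rater_b)
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    # Observed agreement: fraction of items where both raters agree.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement by chance, from each rater's label frequencies.
    labels = set(rater_a) | set(rater_b)
    expected = sum((rater_a.count(l) / n) * (rater_b.count(l) / n) for l in labels)
    return 1.0 if expected == 1.0 else (observed - expected) / (1 - expected)
```

Report κ alongside the disagreement analysis; raw percent agreement alone overstates reliability when one label dominates.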
3. Findings (600-800 words)
This is the meat. Organize by capability, not by metric. A reader should understand what the system is good at and what it's bad at.
Sub-components:
- Primary findings (organized by use case or capability, not metric):
- Answering factual questions: Accuracy 96.3% (262/272), latency 1.2s p95. Best for specific facts (dates, names, numbers). Worst for synthesis across multiple documents (success rate 68%).
- Handling follow-ups: Re-ranking approach succeeds 85% of the time; 12% of follow-ups fail due to context loss and 3% due to lexical mismatch.
- Refusing out-of-scope: True negative rate 91% (correctly refused 212/233 out-of-scope queries); the remaining 21 out-of-scope queries were incorrectly attempted. Separately, 18 valid queries were incorrectly refused (false refusals), primarily on domain-specific terminology.
- Failure mode analysis (specific examples, not aggregates):
- Hallucinated facts: 23 instances found. Root cause 17/23 = retriever returning irrelevant documents. Remaining 6 = model generating plausible-sounding facts not in retrieved context. Example: "User asked about FDA approval timeline; model stated approval in Q3 2023 (not in docs, actually Q1 2024)."
- Latency regressions: 18 queries exceeded 5s threshold. Cause analysis: 14 were multi-hop queries on large knowledge base (remediable by tuning retrieval parameters); 4 were timeout artifacts (remediable by infrastructure change).
- Segmentation analysis: "Accuracy by query type: factual 96%, procedural 82%, clarification 71%. Accuracy by user tenure: first-time users 78%, long-term users 92%."
- Comparison to baseline: "Compared to previous model: accuracy +4 percentage points, hallucination rate -2 percentage points, latency -0.3s p95."
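Refusal numbers like the ones above reduce to confusion counts over labeled outcomes. A sketch of that computation (the record format and counts are illustrative):

```python
from collections import Counter

def refusal_metrics(records):
    """records: (in_scope, refused) pairs, where refusing is the
    correct action for out-of-scope queries."""
    c = Counter(records)
    correctly_refused = c[(False, True)]              # out-of-scope, refused
    out_of_scope = correctly_refused + c[(False, False)]
    return {
        "true_negative_rate": correctly_refused / out_of_scope,
        "false_refusals": c[(True, True)],            # valid queries refused
    }
```

Keeping the raw (in_scope, refused) labels in the appendix lets a reviewer recompute every rate in the Findings section.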
4. Risk Assessment (400-600 words)
Risks are not areas where performance was mediocre. Risks are scenarios where real harm could happen.
Sub-components:
- Severity: What is the harm if this risk materializes? Medical error, financial loss, user frustration, brand damage?
- Likelihood: How often will this actually happen? (Not "it could happen someday" but "it happens in X% of production traffic.")
- Detectability: Can we catch it in production? With what latency?
- Mitigation: What will we do to reduce the risk?
Example risk structure:
Risk: Model hallucination in medical recommendations. Severity: HIGH (patient safety). Likelihood: 2.1% (observed in eval set, extrapolated to production). Detectability: MEDIUM (requires human review of high-confidence medical recommendations; ~500 per day). Mitigation: (1) Real-time alert for any medical claims + recommendation (2) Requires second human review before showing to user (3) Monthly audit of alerts. Residual risk after mitigation: LOW.
List 5-8 distinct risks. Not every concern is a risk; some are just "areas we don't have data on."
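The severity/likelihood/detectability/mitigation structure above maps naturally onto a small data structure, which keeps the risk register sortable and auditable. A sketch (field names and the triage ordering are illustrative):

```python
from dataclasses import dataclass

@dataclass
class Risk:
    name: str
    severity: str       # LOW / MEDIUM / HIGH
    likelihood: float   # observed share of production traffic
    detectability: str
    mitigation: str
    residual: str       # residual risk after mitigation

SEVERITY_ORDER = {"HIGH": 0, "MEDIUM": 1, "LOW": 2}

def triage(risks):
    """Order risks by severity, then by likelihood (most frequent first)."""
    return sorted(risks, key=lambda r: (SEVERITY_ORDER[r.severity], -r.likelihood))
```

Forcing every risk through the same fields is what separates a risk register from a list of vague concerns.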
5. Recommendation (100-200 words)
The shortest section. Three options:
- APPROVED: Deploy as-is. Performance meets all criteria. No additional monitoring beyond standard practice.
- APPROVED with CONDITIONS: Deploy, but with specific conditions (monitoring, guardrails, user restrictions). System doesn't deploy if conditions aren't met.
- NOT APPROVED: Do not deploy. Performance gaps must be closed first. Recommend specific fixes before re-evaluation.
Write the recommendation defensively. Assume it will be read in a lawsuit.
Strong language: "APPROVED for production deployment in US healthcare settings. Prerequisite: Real-time hallucination detection system must be operational before launch (target deployment: [date])."
Weak language: "We think it's probably ready to go. Some monitoring would be good. Hallucinations are a concern but maybe they'll be okay."
6. Appendix (variable length)
Everything that supports the claims in sections 1-5:
- Detailed metrics table (all metrics, all segments)
- Sample error analysis (20-30 representative failures with root cause)
- Test set composition breakdown
- Rater agreement documentation (κ, disagreement analysis)
- Latency distribution (p50, p95, p99 by query type)
- Cost/benefit analysis
- Links to evaluation code/config (or attach archived version)
DCR Writing Style Guide
Be specific. "High accuracy" is not acceptable. "94.2% accuracy on factual queries, 68% on synthesizing queries" is acceptable.
Quantify uncertainty. "Approximately 92% accurate" is weak. "92.1% ± 3.4% (95% CI)" is strong. This signals you understand statistical rigor.
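The interval quoted above ("92.1% ± 3.4% (95% CI)") is the standard normal-approximation interval for a proportion. A minimal sketch (note this approximation degrades for small n or accuracy near 0% or 100%):

```python
import math

def accuracy_ci(correct, total, z=1.96):
    """Normal-approximation 95% CI for an accuracy estimate:
    p +/- z * sqrt(p * (1 - p) / n)."""
    p = correct / total
    half = z * math.sqrt(p * (1 - p) / total)
    return p, half
```

For example, `accuracy_ci(90, 100)` gives 0.90 with a half-width of about 0.059, i.e. "90% ± 5.9% (95% CI)".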
Use numbers as evidence, not justification. Not: "The accuracy is 92%, which is good." Better: "Accuracy is 92%, vs. 85% for baseline, vs. 95% for production requirement. This is 3 percentage points below requirement; Section 4 details how this risk is mitigated."
Avoid hedging language: Not "might," "may," "could," "probably." Use "will," "does," "observed."
Example hedging → strong:
- "The model might hallucinate facts." → "Model hallucinated facts in 2.1% of evaluation outputs (23/1100). Root cause analysis shows 17/23 attributable to retriever failure."
- "Performance could degrade with longer queries." → "Accuracy decreases 1.2 percentage points per 100 additional query tokens. At 500 tokens (0.3% of production queries), accuracy is 91.8% vs. 96.1% baseline. Cost of specialized truncation logic: estimated 2 engineer-days."
Own the decision. A DCR is not a hedge. Don't hide behind "more research needed." If you genuinely cannot decide, the recommendation is NOT APPROVED.
Writing the Go/No-Go Recommendation Defensibly
Your recommendation will be scrutinized. Write it as if you're testifying under oath.
Elements of a defensible APPROVED recommendation:
- Clear criteria that were pre-specified. "Deployment approved if: (a) accuracy ≥ 90% on all segments, (b) hallucination rate < 5%, (c) latency p95 < 3s, (d) inter-rater agreement κ ≥ 0.70. All criteria met."
- Acknowledgment of known unknowns. "We have high confidence in performance on factual queries. We have not evaluated performance on hypothetical/counterfactual queries. This is acceptable because [stated rationale]."
- Baseline comparison. If replacing an existing system: "Accuracy improved 4 percentage points. Hallucination rate decreased 60%. Latency increased 0.2s (acceptable tradeoff per stakeholder)."
- Specific monitoring and escalation. "Real-time monitoring of hallucination rate (daily dashboard, automated alert if > 4%). Monthly review of escalations. Automatic rollback if hallucination rate exceeds 5% for 2 consecutive days."
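The automatic-rollback rule above ("exceeds 5% for 2 consecutive days") is also mechanically checkable. A minimal sketch (names are illustrative):

```python
def should_rollback(daily_rates, threshold=0.05, consecutive=2):
    """True if the daily rate exceeds `threshold` for `consecutive`
    days in a row -- the escalation rule in the example."""
    streak = 0
    for rate in daily_rates:
        streak = streak + 1 if rate > threshold else 0
        if streak >= consecutive:
            return True
    return False
```

A condition a script can evaluate is a condition an auditor can verify; anything fuzzier belongs back in the Risk Assessment.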
Elements of a defensible NOT APPROVED recommendation:
- Specific gaps. Not: "Accuracy is too low." Better: "Accuracy is 87%, below the 90% threshold. This system serves ~1.4 billion customer-facing queries per quarter; the 3-point gap between 87% and 90% means ~42 million additional quarterly exposures to errors."
- Path to approval. "To reach approval: retraining on 500 additional edge cases (estimated 2-week cycle) or reducing scope to FAQ-only use case (reduces error exposure to 8 million quarterly). Recommend retraining path."
- Re-evaluation criteria. "Will re-evaluate when: (a) retraining complete, (b) new eval set built from recent edge cases, (c) accuracy ≥ 90% on validation, (d) κ ≥ 0.75 on inter-rater agreement."
How Different Audiences Read the DCR
The CTO (30 seconds): Reads Executive Summary only. Needs to know: go or no-go, and the one biggest risk.
The Product Manager (5 minutes): Reads Executive Summary + Findings. Needs to understand: what does it do well, what does it do badly, can we launch it.
The Engineer (45 minutes): Reads everything. Needs to understand: the exact evaluation methodology, every failure mode, what monitoring to set up, what to be ready to debug.
The Compliance Officer (60 minutes): Reads Scope/Methodology, Risk Assessment, and Appendix. Needs to verify: was evaluation rigorous, are risks documented, can we defend this decision in an audit.
The End User (they don't read it): A public-facing summary may be appropriate, but the full technical DCR is internal.
Structure your DCR to serve all four audiences simultaneously. The Executive Summary lets the CTO skim. The appendix lets the engineer deep-dive. Section 4 (Risk Assessment) is where the Compliance Officer lives.
The Conditional Deployment Option
Many systems are neither clearly "GO" nor "NO-GO." They're "GO with guardrails." This is not a cop-out if structured properly.
Example: "APPROVED for production in healthcare settings, with the following mandatory conditions:
- Real-time hallucination detection system operational (Condition Status: ready for prod, deploy May 15).
- All outputs with confidence > 0.85 on medical claims require human review before showing to user (Condition Status: UX designer allocated, estimate June 1).
- Monitoring dashboard live with alerts if hallucination rate exceeds 4% on any rolling 24h window (Condition Status: data eng allocated, estimate May 20).
- Weekly review of flagged outputs for first 4 weeks, then monthly thereafter (Condition Status: ops team allocated).
System will NOT launch until all four conditions are met. Estimated launch date: June 5. If any condition is delayed >2 weeks, decision will be revisited."
This is concrete, not vague. It's deployable. It's defensible.
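A conditional approval like this is only defensible if the conditions are tracked as a hard launch gate. A minimal sketch (condition names abbreviated from the example above):

```python
from dataclasses import dataclass

@dataclass
class Condition:
    name: str
    met: bool

def unmet(conditions):
    """Names of mandatory conditions still blocking launch."""
    return [c.name for c in conditions if not c.met]

def may_launch(conditions):
    """The system does not launch until every mandatory condition is met."""
    return not unmet(conditions)
```

The point of the gate is that it is binary: either `unmet()` is empty and launch may proceed, or the DCR's conditional approval has not yet been satisfied.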
Sample DCR Excerpt (RAG System)
Scenario: Evaluating a RAG system that answers questions about a company's internal policies for 5,000 employees.
EXECUTIVE SUMMARY
RECOMMENDATION: APPROVED for production deployment, with mandatory real-time monitoring.
System: PolicyBot v2.1 (Git hash: a7f3e2b), candidate for replacing PolicyBot v1.0 (currently 300 queries/day).
Evaluation Summary: PolicyBot v2.1 correctly answers 94.2% of questions (accuracy on ground truth), vs. 88.3% for v1.0. Hallucination rate: 1.8% (20/1100 queries generated facts not in policy documents), vs. 4.2% for v1.0. Latency: p95 1.4 seconds (acceptable for async deployment model). Human evaluation (3 raters, κ=0.78) confirms accuracy on policy interpretation questions.
Top 3 Risks:
1. Hallucination on edge cases (salary policies): 8/20 hallucinations involve salary amounts. Likelihood: 0.7% of production queries. Mitigation: Real-time alert for salary-related answers; human review before showing to employee. Residual risk: LOW.
2. Performance on complex multi-step procedures: Success rate 71% on procedures requiring > 3 steps. Likelihood: 5% of queries. Mitigation: UI guidance to decompose complex questions; escalation to HR for complex cases. Residual risk: MEDIUM (acceptable given UI mitigation).
3. Handling of policy updates: System knowledge cutoff is Jan 1, 2024. Likelihood of out-of-date answer: 2% (recent policy changes). Mitigation: Monthly retraining on new policies; quarterly evaluation refresh. Residual risk: MEDIUM (expected for RAG systems).
Conditions: (1) Real-time hallucination detection system must be live before production. (2) Weekly monitoring of hallucination rate for first month. (3) Automatic rollback if hallucination rate exceeds 5%.
---
FINDINGS
Accuracy by Question Type:
- Factual (benefits, eligibility): 98.1% (156/159)
- Multi-step procedures: 71.9% (41/57)
- Interpretation (does policy X apply to situation Y): 89.3% (67/75)
- Salary/compensation: 93.1% (27/29)
Overall: 94.2% (291/309)
Hallucination Analysis (20 instances):
- Salary amounts hallucinated: 8 instances
Example: "Q: What is the bonus structure for engineers? A: 10-15% base salary (INCORRECT: policy specifies 5-8%)"
Root cause: Retriever returned salary band document; LLM extrapolated to bonus structure.
- Policy dates hallucinated: 7 instances
Example: "Q: When did remote work policy start? A: Started in 2019 (policy actually started 2020)."
Root cause: Model training data included older company announcements.
- Procedure details hallucinated: 5 instances
Example: "Q: Who approves PTO over 30 days? A: Department head and Finance (actually just Department head)."
Root cause: LLM combining two related policies (PTO approval and budget approval).
Latency: p50=0.8s, p95=1.4s, p99=2.1s. All queries completed within 3s SLA.
Comparison to v1.0: Accuracy +5.9pp, hallucination -2.4pp, latency -0.2s. v1.0 failure modes (context cutoff, inability to handle multi-turn) now resolved.
---
RISK ASSESSMENT
Risk 1: Salary-related hallucinations (8/20 of all hallucinations).
Severity: HIGH (employee makes decisions based on incorrect compensation info).
Likelihood: 0.7% of production traffic (~2 queries/day).
Detectability: HIGH (real-time check if salary-related terms appear in query + answer; human review required).
Mitigation: (1) Automated salary query detector + escalation to HR. (2) If confidence < 0.80 on salary queries, refuse answer. (3) Weekly audit of flagged queries.
Residual Risk: LOW.
Risk 2: Complex procedure failures (16/57 multi-step procedures failed).
Severity: MEDIUM (employee follows incomplete procedure, wastes time, may need to restart).
Likelihood: 5% of queries are multi-step; at a 71% success rate, failures affect ~1.45% of overall traffic.
Detectability: MEDIUM (user feedback + monthly metric review).
Mitigation: (1) UI recommends breaking complex questions. (2) System refuses answers for procedures with > 4 steps. (3) Escalates to HR.
Residual Risk: MEDIUM (acceptable; user experience improvement justifies residual risk).
Risk 3: Policy updates lag (quarterly retraining).
Severity: MEDIUM (employee gets outdated policy).
Likelihood: 2% (estimate based on policy change velocity).
Detectability: MEDIUM (detected in monthly eval refresh; user reports).
Mitigation: (1) Monthly update of high-velocity policies (salary, benefits deadlines). (2) Quarterly full retraining. (3) Change log visible to users ("Last updated: [date]").
Residual Risk: MEDIUM (expected for any RAG system; business accepts this tradeoff).
Common DCR Mistakes
Mistake 1: Dumping data instead of making a decision. A DCR with 200 metrics and no clear recommendation is a bad DCR. Ruthlessly prioritize. Show the 4-5 metrics that actually matter for the decision.
Mistake 2: Hedging the recommendation. "We think it's probably ready, but more testing might be good" is not a recommendation. It's a non-decision that will cause the project to stall.
Mistake 3: Making conditions that are impossible to verify. Bad: "System should perform well across all user types." Good: "Accuracy ≥ 90% for users with tenure < 1 month, which we will sample weekly from production."
Mistake 4: Not explaining failure modes. If there are failures, explain why. "2% of queries failed" is weak. "2% of queries failed; 60% were out-of-scope prompts that should have been filtered before reaching the model; 40% were legitimate edge cases involving time-sensitive policies" is better.
Mistake 5: No baseline comparison. "Accuracy is 94%" means nothing without context. Compared to what? To v1.0 (88%)? To the production requirement (90%)? To human performance (96%)? Give all three.
Mistake 6: Ignored statistical rigor. If your eval set is small (n < 100), your confidence intervals are wide. Say so. If you didn't check inter-rater agreement, your accuracy numbers are questionable. Don't hide it.
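For small evaluation sets, the Wilson score interval quantifies exactly how wide "wide" is, and it behaves better than the normal approximation when n is small or accuracy is near 0% or 100%. A sketch:

```python
import math

def wilson_ci(correct, total, z=1.96):
    """Wilson score 95% interval for a proportion: better behaved than
    the normal approximation at small n or extreme accuracies."""
    p = correct / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total))
    return center - half, center + half
```

With n=100 and 90% accuracy the interval is roughly (0.83, 0.94); with n=10 it widens to roughly (0.60, 0.98). That second interval is the honest statement a small-n DCR owes its readers.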
DCR vs. Model Card vs. System Card
| Document | Purpose | Audience | Tone | Length | Decision Type |
|---|---|---|---|---|---|
| DCR | Gate-keeping: is this ready to deploy? | Internal (exec, engineer, compliance) | Defensive, precise, decision-focused | 5-8 pages | Go/No-Go/Conditional |
| Model Card | Model transparency: what is this model, what does it do? | External (researchers, users, regulators) | Transparent, comprehensive, educational | 4-6 pages | Informational (no gate) |
| System Card | System transparency: how do all components work together? | External + internal | Educational, system-level view | 8-12 pages | Informational (no gate) |
A DCR is action-forcing; a Model Card is informational. You could publish your Model Card without publishing the DCR (for privacy/liability reasons). You cannot deploy without a DCR.
Key Takeaways
- A DCR is a gate document: it authorizes or blocks deployment. It is not optional.
- The 6-section structure (Executive Summary, Scope, Findings, Risk Assessment, Recommendation, Appendix) ensures all stakeholders can extract what they need.
- Write defensibly. Assume your DCR will be scrutinized in a legal discovery process.
- Quantify everything. Avoid hedging. Specify conditions concretely, not vaguely.
- The recommendation (GO/NO-GO/CONDITIONAL) must be unambiguous. If you genuinely can't decide, the answer is NO-GO.
