Service Recovery Time Calculator

Calculate service recovery time from incident detection to resolution. Classify your DORA restoration tier and benchmark recovery speed.


About the Service Recovery Time Calculator

Service recovery time measures the elapsed time from when an incident is detected to when the service is fully restored. Also known as Time to Restore Service in the DORA framework, it is a critical reliability metric that directly impacts user experience and SLA compliance.

Elite teams can restore service in under one hour, while low performers may take over six months to recover from failures. The speed of recovery often matters more than the frequency of failures because users experience the outage duration, not the failure event itself.

This calculator computes recovery time from detection and resolution timestamps, classifies your DORA tier, and provides the breakdown in minutes, hours, and days. Tracking recovery time across incidents helps identify patterns and justify investments in observability, runbooks, and automated remediation.

When This Page Helps

Fast recovery limits the impact of every incident. Even when failures occur, rapid restoration reduces downtime costs, SLA violations, and customer churn. This calculator helps teams benchmark their recovery capability against DORA standards and track improvement over time.

How to Use the Inputs

  1. Record the time when the incident was detected (alert fired or user report).
  2. Record the time when the service was fully restored and verified.
  3. Enter both timestamps in minutes, measured from the same reference point (in the example below, minutes before the present moment).
  4. Review the recovery time in minutes, hours, and days.
  5. Check your DORA tier classification for time to restore service.
  6. Break down time into detection, diagnosis, fix, and verification phases for deeper insights.
Formula used
Recovery Time = Incident Resolved Timestamp − Incident Detected Timestamp. DORA tiers: Elite < 1 hour, High < 1 day, Medium < 1 week, Low < 1 month, Very Low > 1 month.
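
The formula and tier thresholds above can be sketched in Python as follows. This is a minimal illustration, not the calculator's actual implementation: the function names are hypothetical, one month is assumed to be 30 days, and a recovery time of exactly one month (which the stated ranges do not cover) is assigned to Very Low.

```python
from datetime import datetime, timedelta

# Tier boundaries as stated in the formula above. One month is taken as
# 30 days, and exactly one month falls into "Very Low"; both are assumptions.
TIER_LIMITS = [
    ("Elite", timedelta(hours=1)),
    ("High", timedelta(days=1)),
    ("Medium", timedelta(weeks=1)),
    ("Low", timedelta(days=30)),
]


def recovery_time(detected: datetime, resolved: datetime) -> timedelta:
    """Recovery Time = resolved timestamp - detected timestamp."""
    if resolved < detected:
        raise ValueError("resolved must not be earlier than detected")
    return resolved - detected


def classify_tier(duration: timedelta) -> str:
    """Return the first tier whose upper limit the duration stays under."""
    for name, limit in TIER_LIMITS:
        if duration < limit:
            return name
    return "Very Low"


# Hypothetical incident: alert at 09:00, verified restored at 10:55.
detected = datetime(2024, 1, 1, 9, 0)
resolved = datetime(2024, 1, 1, 10, 55)
elapsed = recovery_time(detected, resolved)
print(elapsed, classify_tier(elapsed))  # 1:55:00 High
```

Because every comparison is strict, a recovery of exactly 60 minutes lands in High rather than Elite.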

Example Calculation

Result: 115 minutes (1.92 hours), High tier

An incident detected 120 minutes ago and resolved 5 minutes ago has a recovery time of 120 − 5 = 115 minutes (1 hour 55 minutes). This falls in the High DORA tier. Because Elite requires a recovery time under 60 minutes, the team would need to cut more than 55 minutes to reach the Elite tier.
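
The arithmetic in this example can be reproduced with a few lines of Python. The variable names are illustrative only, and the 60-minute Elite limit is taken from the tier list in the formula section.

```python
# Inputs from the example: both values are "minutes ago" relative to now.
detected_minutes_ago = 120
resolved_minutes_ago = 5

recovery_minutes = detected_minutes_ago - resolved_minutes_ago
recovery_hours = recovery_minutes / 60
recovery_days = recovery_minutes / (60 * 24)

ELITE_LIMIT_MINUTES = 60  # Elite means strictly less than 1 hour
gap_to_elite = recovery_minutes - ELITE_LIMIT_MINUTES

print(recovery_minutes)          # 115
print(round(recovery_hours, 2))  # 1.92
print(round(recovery_days, 2))   # 0.08
print(gap_to_elite)              # 55
```

Since the Elite threshold is strict, trimming exactly 55 minutes leaves the incident at 60 minutes, which is still High; the reduction has to exceed the printed gap.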

Tips & Best Practices

  • Start the clock when the alert fires, not when a human acknowledges it.
  • Stop the clock only when the service is verified restored, not when the fix is deployed.
  • Track recovery time per severity level to set realistic improvement targets.
  • Runbooks reduce diagnosis time; ensure they cover your top 10 failure modes.
  • Automated rollback capabilities can reduce recovery time to minutes.
  • Practice incident response regularly through game days and tabletop exercises.
  • Post-incident reviews should identify the longest phase in recovery for targeted improvement.

Time to Restore Service in DORA

Time to restore service is one of the four DORA metrics that separate elite engineering organizations from the rest. It measures not whether you fail, but how quickly you recover when failures occur, which is a far more practical measure of operational excellence.

Recovery Phases

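Break recovery time into distinct phases: detection (from failure to first alert), triage (from alert to incident ownership), diagnosis (from ownership to root cause identification), remediation (from root cause to fix deployment), and verification (from fix to confirmed restoration). Each phase can be optimized independently. Note that the recovery time computed on this page starts at the first alert, so the detection phase happens before the clock starts and should be tracked alongside it rather than as part of it.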
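
A phase breakdown like the one described above could be computed from per-incident milestone timestamps. The sketch below is only an illustration: the milestone names, the sample timestamps, and the `phase_minutes` helper are all hypothetical and would need to match whatever your incident tooling records.

```python
from datetime import datetime

# Hypothetical milestone timestamps for a single incident.
milestones = {
    "failure": datetime(2024, 1, 1, 8, 50),
    "alert": datetime(2024, 1, 1, 9, 0),
    "owned": datetime(2024, 1, 1, 9, 10),
    "root_cause": datetime(2024, 1, 1, 10, 0),
    "fix_deployed": datetime(2024, 1, 1, 10, 40),
    "verified": datetime(2024, 1, 1, 10, 55),
}

# Each phase is the interval between two consecutive milestones.
PHASES = [
    ("detection", "failure", "alert"),
    ("triage", "alert", "owned"),
    ("diagnosis", "owned", "root_cause"),
    ("remediation", "root_cause", "fix_deployed"),
    ("verification", "fix_deployed", "verified"),
]


def phase_minutes(marks):
    """Return the duration of each phase in minutes."""
    return {
        name: (marks[end] - marks[start]).total_seconds() / 60
        for name, start, end in PHASES
    }


durations = phase_minutes(milestones)
longest = max(durations, key=durations.get)
print(durations)
print(longest)  # diagnosis
```

In this made-up incident the phases from alert to verification add up to 115 minutes, and diagnosis is the longest phase, which is where a post-incident review would look first.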

Building Recovery Muscle

Fast recovery is a skill that requires practice. Regular game days, chaos engineering experiments, and incident response drills build the muscle memory teams need to respond quickly under pressure. Teams that practice recover faster.

Measuring Recovery Trends

Track recovery time as a rolling median over 30 and 90 days. Monitor trends by severity level (SEV1 vs SEV2 vs SEV3) and by service. Look for improvements after investing in runbooks, automation, or observability tooling.
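
One way to compute these rolling medians is sketched below using only the Python standard library. The incident records are invented for illustration, and `rolling_median` is a hypothetical helper rather than part of any particular monitoring tool.

```python
from datetime import date, timedelta
from statistics import median

# Hypothetical incident log: (resolved date, severity, recovery minutes).
incidents = [
    (date(2024, 1, 5), "SEV1", 240),
    (date(2024, 2, 20), "SEV2", 90),
    (date(2024, 3, 10), "SEV1", 115),
    (date(2024, 3, 25), "SEV2", 45),
    (date(2024, 3, 28), "SEV1", 60),
]


def rolling_median(records, as_of, window_days, severity=None):
    """Median recovery minutes for incidents in the trailing window."""
    start = as_of - timedelta(days=window_days)
    values = [
        minutes
        for day, sev, minutes in records
        if start < day <= as_of and (severity is None or sev == severity)
    ]
    # Return None when the window holds no incidents.
    return median(values) if values else None


today = date(2024, 3, 31)
print(rolling_median(incidents, today, 30))          # 60
print(rolling_median(incidents, today, 90))          # 90
print(rolling_median(incidents, today, 90, "SEV1"))  # 115
```

The median is used instead of the mean so that a single very long outage does not dominate the trend; the same function can be filtered by service if the records carry a service field.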

Frequently Asked Questions

  • Service recovery time is the elapsed time from when an incident is detected (typically when an alert fires) to when the service is fully restored and verified operational. It covers the incident lifecycle from detection through verified resolution.