Incident Response Overhead: When Firefighting Becomes the Job

Monday: Incident. Spend the morning debugging, afternoon recovering. Tuesday: Another incident. Different system, same drill. Wednesday: Maybe get some feature work done? No, incident. By Friday, the team has shipped nothing. Every sprint ends the same way: points not completed because of unplanned incident work.

Incident response is necessary. Systems break. Problems happen. But when incident response becomes the majority of the job, something is wrong. Healthy systems don't have constant incidents. Teams in constant firefighting mode can't improve the systems that are constantly on fire.

The firefighting cycle is self-reinforcing. Too many incidents means no time for improvement. No improvement means problems persist. Persistent problems mean more incidents. Breaking this cycle requires both immediate incident response and systemic improvement.

Signs of Excessive Incident Load

Recognizing the problem is the first step.

Velocity Impact

Features don't get shipped:

Sprint planning:
  - Feature A: 5 points
  - Feature B: 3 points
  - Feature C: 3 points
  Total: 11 points

Sprint end:
  - Feature A: In progress
  - Feature B: Not started
  - Feature C: Not started
  Completed: 2 of 11 points
  Why: Incidents consumed about 80% of the team's time

Incident overhead destroys velocity.

On-Call Burnout

On-call rotation is unsustainable:

On-call experience:
  - Pages at night
  - Pages on weekends
  - No rest during rotation
  - Dread about next rotation

Excessive incidents burn out on-call engineers.

Recurring Problems

Same incidents happen repeatedly:

Incident log:
  Week 1: Database connection exhaustion
  Week 2: Database connection exhaustion
  Week 3: Database connection exhaustion
  Week 4: Database connection exhaustion

Recurring incidents are being responded to, not fixed.

Reactive Culture

Everything is urgent:

Team mode:
  - Always responding
  - Never preventing
  - Heroic effort normalized
  - "This is just how it is"

Reactive cultures accept constant firefighting.

Why Incidents Accumulate

Understanding why incidents accumulate shows where to prevent them.

Technical Debt

Unmaintained systems fail:

System health over time:
  Year 1: Stable, rare issues
  Year 2: Occasional issues
  Year 3: Regular issues
  Year 4: Constant issues

Debt accumulated, incidents resulted

Deferred maintenance creates incidents.

Insufficient Testing

Bugs reach production:

Testing gaps:
  - Edge cases not tested
  - Integration not tested
  - Performance not tested

Result: Bugs appear in production

Test gaps cause production incidents.

Monitoring Gaps

Problems aren't caught early:

Problem progression:
  Minor issue → Not detected →
  Growing issue → Not detected →
  Major incident → Finally noticed

Late detection increases severity.

No Root Cause Analysis

Problems aren't really fixed:

Incident response:
  - Restart service
  - Mark resolved
  - Move on

vs

Root cause analysis:
  - Why did it break?
  - How do we prevent recurrence?
  - What's the real fix?

Surface fixes don't prevent recurrence.

No Time for Prevention

Incidents crowd out prevention:

Team allocation:
  - 80% incidents
  - 20% features
  - 0% prevention

Prevention would reduce incidents
But there's no time for prevention
Because of incidents

The cycle reinforces itself.

Breaking the Cycle

Escaping requires deliberate action.

Allocate Prevention Time

Reserve time for improvement:

@devonair allocate prevention time:
  - Minimum 20% for reliability work
  - Protected from incident response
  - Used for root cause fixes

Protected time enables improvement.

Root Cause Analysis

Fix real problems:

@devonair require root cause analysis:
  - Every significant incident
  - Identify true root cause
  - Create prevention actions
  - Follow up on fixes

Root cause analysis prevents recurrence.

Prioritize Recurring Incidents

Fix what keeps breaking:

@devonair prioritize recurring incidents:
  - Track incident frequency by cause
  - Fix the most common causes
  - Measure reduction

Fixing frequent issues has the most impact.
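
Tracking frequency by cause can be as simple as counting. The following is a minimal sketch, assuming each incident is recorded as a dict with a "cause" field; the function name rank_causes and the record format are hypothetical, not part of any particular tool.

```python
from collections import Counter


def rank_causes(incidents):
    """Return (cause, count, cumulative_share) tuples, most frequent first."""
    counts = Counter(incident["cause"] for incident in incidents)
    total = sum(counts.values())
    ranked = []
    running = 0
    for cause, count in counts.most_common():
        running += count
        # cumulative_share shows how much of the load the top causes explain
        ranked.append((cause, count, running / total))
    return ranked


# Hypothetical incident log for illustration only
log = [
    {"cause": "db-connection-exhaustion"},
    {"cause": "db-connection-exhaustion"},
    {"cause": "db-connection-exhaustion"},
    {"cause": "disk-full"},
]
for cause, count, share in rank_causes(log):
    print(cause, count, f"{share:.0%}")
```

The cumulative share makes the priority visible: if one cause explains most of the incidents, that is where prevention time should go first.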

Improve Testing

Catch bugs earlier:

@devonair improve testing coverage:
  - Focus on areas that cause incidents
  - Add tests when fixing incidents
  - Prevent regression

Better testing prevents production incidents.

Improve Monitoring

Catch problems early:

@devonair improve monitoring:
  - Detect issues before impact
  - Alert on leading indicators
  - Enable faster response

Early detection reduces severity.
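
One way to alert on a leading indicator is to warn while a resource is filling up rather than when it is exhausted. The sketch below uses database connection pool utilization as the example; the function name pool_alert and the 80% threshold are assumptions chosen for illustration.

```python
def pool_alert(in_use, pool_size, warn_at=0.8):
    """Return a warning string when pool utilization reaches warn_at, else None."""
    if pool_size <= 0:
        raise ValueError("pool_size must be positive")
    utilization = in_use / pool_size
    if utilization >= warn_at:
        # Fires before exhaustion, while there is still time to react
        return f"WARN: connection pool at {utilization:.0%} ({in_use}/{pool_size})"
    return None


print(pool_alert(85, 100))
print(pool_alert(50, 100))
```

The threshold is a trade-off: too low and the alert becomes noise, too high and it fires only when the incident has already started.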

Measuring Incident Load

Track incident load so you can improve it.

Incident Frequency

How often incidents occur:

@devonair track incident frequency:
  - Incidents per week
  - By severity
  - Trend over time

Frequency shows overall health.
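
A minimal sketch of the frequency count, assuming incidents are available as (date, severity) pairs; the function name weekly_counts and the severity labels are hypothetical.

```python
from collections import Counter
from datetime import date


def weekly_counts(incidents):
    """Count incidents per (ISO year, ISO week, severity)."""
    counts = Counter()
    for day, severity in incidents:
        iso = day.isocalendar()  # (ISO year, ISO week number, ISO weekday)
        counts[(iso[0], iso[1], severity)] += 1
    return counts


# Hypothetical data for illustration only
incidents = [
    (date(2024, 1, 1), "sev1"),
    (date(2024, 1, 3), "sev1"),
    (date(2024, 1, 8), "sev2"),
]
for key, count in sorted(weekly_counts(incidents).items()):
    print(key, count)
```

Comparing these counts week over week gives the trend; a single week's number says little on its own.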

Time Spent

How much time incidents consume:

@devonair track incident time:
  - Time per incident
  - Total time per week
  - Percentage of capacity

Time spent shows overhead.
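
The percentage of capacity is a simple ratio. This sketch assumes incident time is logged in hours and that capacity is engineers times working hours per week; the function name incident_share and the 40-hour default are assumptions.

```python
def incident_share(incident_hours, engineers, hours_per_week=40):
    """Return the fraction of weekly team capacity spent on incidents."""
    capacity = engineers * hours_per_week
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    return sum(incident_hours) / capacity


# Hypothetical week: four engineers, three incidents
share = incident_share([60, 40, 28], engineers=4)
print(f"{share:.0%} of capacity spent on incidents")
```

A team of four that logs 128 incident hours in a week has spent 80% of its capacity on incidents, which matches the allocation described earlier.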

Recurrence Rate

How often problems repeat:

@devonair track recurrence:
  - Same root cause incidents
  - Percentage recurring
  - Time between recurrence

Recurrence shows whether fixes work.
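
A sketch of the recurrence metrics, assuming each incident has already been tagged with a root cause during analysis. Here "percentage recurring" is taken to mean the share of incidents whose root cause had been seen before; that definition, the function name recurrence_stats, and the data format are assumptions.

```python
from collections import defaultdict
from datetime import date


def recurrence_stats(incidents):
    """incidents: list of (date, root_cause) pairs.

    Returns (share_recurring, gaps), where gaps maps each repeated cause
    to the number of days between consecutive occurrences.
    """
    by_cause = defaultdict(list)
    for day, cause in incidents:
        by_cause[cause].append(day)

    # Every occurrence after the first one for a cause counts as a repeat
    repeats = sum(len(days) - 1 for days in by_cause.values())
    share = repeats / len(incidents) if incidents else 0.0

    gaps = {}
    for cause, days in by_cause.items():
        days.sort()
        if len(days) > 1:
            gaps[cause] = [(b - a).days for a, b in zip(days, days[1:])]
    return share, gaps


# Hypothetical log echoing the weekly connection exhaustion example
log = [
    (date(2024, 1, 1), "db-connection-exhaustion"),
    (date(2024, 1, 8), "db-connection-exhaustion"),
    (date(2024, 1, 10), "disk-full"),
    (date(2024, 1, 15), "db-connection-exhaustion"),
    (date(2024, 1, 22), "db-connection-exhaustion"),
]
print(recurrence_stats(log))
```

If the gaps for a cause stay constant after a fix was shipped, the fix did not address the root cause.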

Mean Time to Recovery

How long incidents take:

@devonair track MTTR:
  - Detection to resolution
  - By severity
  - Trend over time

MTTR shows response effectiveness.
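
MTTR as described here is the mean of detection-to-resolution durations. A minimal sketch, assuming incidents are available as (severity, detected_at, resolved_at) tuples; the function name mttr_by_severity and the tuple layout are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def mttr_by_severity(incidents):
    """Return {severity: mean detection-to-resolution time as a timedelta}."""
    durations = defaultdict(list)
    for severity, detected_at, resolved_at in incidents:
        durations[severity].append(resolved_at - detected_at)
    return {
        severity: sum(times, timedelta()) / len(times)
        for severity, times in durations.items()
    }


# Hypothetical data for illustration only
start = datetime(2024, 1, 1, 12, 0)
incidents = [
    ("sev1", start, start + timedelta(minutes=30)),
    ("sev1", start, start + timedelta(minutes=90)),
    ("sev2", start, start + timedelta(hours=4)),
]
for severity, mean in mttr_by_severity(incidents).items():
    print(severity, mean)
```

Because a mean hides outliers, it can help to look at the longest incidents separately when the trend moves.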

Reducing Incident Severity

While reducing frequency, also reduce impact.

Faster Detection

Find problems sooner:

@devonair improve detection:
  - Better monitoring
  - Synthetic probes
  - Anomaly detection

Faster detection means shorter incidents.

Faster Response

Respond more quickly:

@devonair improve response:
  - Clear runbooks
  - Automated diagnostics
  - Faster communication

Faster response reduces duration.

Smaller Blast Radius

Limit impact:

@devonair limit blast radius:
  - Graceful degradation
  - Circuit breakers
  - Isolation

Contained problems are less severe.
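
To make the circuit breaker idea concrete, here is a deliberately small sketch. It is an illustration of the pattern under simple assumptions (single-threaded use, consecutive-failure counting), not a production implementation; the class name, parameters, and defaults are hypothetical.

```python
import time


class CircuitBreaker:
    """Stop calling a failing dependency for a cool-down period."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after  # cool-down in seconds
        self.clock = clock              # injectable so tests can fake time
        self.failures = 0
        self.opened_at = None           # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Fail fast instead of piling load onto a broken dependency
                raise RuntimeError("circuit open: call skipped")
            # Cool-down elapsed: allow one trial call (half-open).
            # A single failure now re-opens the circuit immediately.
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        self.failures = 0  # any success closes the circuit fully
        return result
```

A caller would wrap each call to the dependency in breaker.call(...) and serve a degraded response when RuntimeError is raised, which is where graceful degradation and circuit breakers meet.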

Easy Rollback

Undo changes quickly:

@devonair enable rollback:
  - One-click rollback
  - Tested regularly
  - Always available

Fast rollback shortens incidents.

Sustainable On-Call

On-call shouldn't be punishing.

Reasonable Load

Manageable number of pages:

@devonair establish on-call standards:
  - Target: < 2 pages per week
  - No page should be a false positive
  - Every page should be actionable

Reasonable load is sustainable.
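
These standards can be checked mechanically at the end of each rotation. The sketch below assumes each page is recorded with an alert name and an "actionable" flag set by the responder; the function name check_oncall_week and the record format are hypothetical.

```python
def check_oncall_week(pages, max_pages=2):
    """Return a list of on-call standard violations for one week of pages."""
    violations = []
    # The target is strictly fewer than max_pages pages per week
    if len(pages) >= max_pages:
        violations.append(f"{len(pages)} pages (target: < {max_pages})")
    for page in pages:
        if not page["actionable"]:
            violations.append(f"non-actionable page: {page['alert']}")
    return violations


# Hypothetical week for illustration only
week = [
    {"alert": "cpu-spike", "actionable": False},
    {"alert": "db-pool-exhausted", "actionable": True},
]
for violation in check_oncall_week(week):
    print(violation)
```

Each violation is a candidate for follow-up: tune or delete the alert if it was not actionable, fix the cause if it was.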

Fair Rotation

Burden shared fairly:

@devonair ensure fair rotation:
  - Even distribution
  - Compensation for extra on-call load
  - No permanent on-call

Fair rotation prevents burnout.

Support Available

Responders have help:

@devonair provide on-call support:
  - Escalation path clear
  - Expert backup available
  - Permission to escalate

Support reduces individual stress.

Post-Incident Rest

Recovery time after incidents:

@devonair enable recovery:
  - Time off after severe incidents
  - No back-to-back hard rotations
  - Recovery is expected

Recovery prevents accumulation of stress.

Building Reliability Culture

A supportive culture keeps incident counts down.

Prevention Valued

Proactive work recognized:

Cultural signals:
  - Prevention work celebrated
  - Reliability improvements valued
  - Not just firefighting heroics

Valuing prevention enables it.

Blameless Analysis

Focus on systems, not people:

@devonair establish blameless culture:
  - Incidents are system failures
  - Focus on prevention, not blame
  - Learning over punishment

Blameless culture enables honest analysis.

Quality Investment

Reliability gets real investment:

Organization commitment:
  - Time for reliability work
  - Resources for improvement
  - Long-term thinking

Investment enables improvement.

Getting Started

Reduce incident overhead today.

Measure current state:

@devonair analyze incident load:
  - Incident frequency
  - Time consumed
  - Recurring issues

Protect prevention time:

@devonair allocate prevention capacity:
  - Minimum percentage protected
  - Focus on recurring issues
  - Track utilization

Fix recurring issues:

@devonair address top recurring incidents:
  - Root cause analysis
  - Prevention actions
  - Measure reduction

Improve detection:

@devonair improve early detection:
  - Better monitoring
  - Faster alerting
  - Proactive identification

Incident response overhead is reducible. With protected prevention time, root cause analysis, and systematic improvement, the firefighting cycle breaks. Your team spends more time building and less time responding. Incidents become rare events rather than daily occurrences.

FAQ

How do we get time for prevention when we're drowning in incidents?

Start small. Protect even 10% of capacity for prevention work and use it to fix the single most frequent recurring incident. Measure the reduction. The time saved frees up capacity for more prevention, so the protected share can grow toward the 20% minimum. Build momentum incrementally.

How do we justify prevention work to stakeholders who want features?

Track and communicate the cost of incidents: time spent, features delayed, user impact. Show how prevention investment reduces that cost. Frame prevention as enabling future feature delivery, not competing with it.

Should we have dedicated reliability engineers or should everyone do it?

Both. Everyone should care about reliability and contribute to incident prevention. Dedicated reliability engineers can focus on systemic improvements, tooling, and cross-cutting concerns. The combination is most effective.

How do we handle truly urgent incidents that prevent all other work?

Truly urgent incidents need immediate response. The problem is when everything becomes urgent. Prioritize ruthlessly: not every incident needs an all-hands response. Establish clear severity levels and response expectations.