Monday: Incident. Spend the morning debugging, afternoon recovering. Tuesday: Another incident. Different system, same drill. Wednesday: Maybe get some feature work done? No, incident. By Friday, the team has shipped nothing. Every sprint ends the same way: points not completed because of unplanned incident work.
Incident response is necessary. Systems break. Problems happen. But when incident response becomes the majority of the job, something is wrong. Healthy systems don't have constant incidents. Teams in constant firefighting mode can't improve the systems that are constantly on fire.
The firefighting cycle is self-reinforcing. Too many incidents means no time for improvement. No improvement means problems persist. Persistent problems mean more incidents. Breaking this cycle requires both immediate incident response and systemic improvement.
Signs of Excessive Incident Load
Recognizing the problem is the first step.
Velocity Impact
Features don't get shipped:
Sprint planning:
- Feature A: 5 points
- Feature B: 3 points
- Feature C: 3 points
Total: 11 points
Sprint end:
- Feature A: In progress
- Feature B: Not started
- Feature C: Not started
Completed: 2 of 11 points
Why: Incidents consumed 80% of the team's time
Incident overhead destroys velocity.
On-Call Burnout
On-call rotation is unsustainable:
On-call experience:
- Pages at night
- Pages on weekends
- No rest during rotation
- Dread about next rotation
Excessive incidents burn out on-call engineers.
Recurring Problems
Same incidents happen repeatedly:
Incident log:
Week 1: Database connection exhaustion
Week 2: Database connection exhaustion
Week 3: Database connection exhaustion
Week 4: Database connection exhaustion
Recurring incidents aren't fixed, just responded to.
Reactive Culture
Everything is urgent:
Team mode:
- Always responding
- Never preventing
- Heroic effort normalized
- "This is just how it is"
Reactive cultures accept constant firefighting.
Why Incidents Accumulate
Understanding why incidents accumulate reveals how to prevent them.
Technical Debt
Unmaintained systems fail:
System health over time:
Year 1: Stable, rare issues
Year 2: Occasional issues
Year 3: Regular issues
Year 4: Constant issues
Debt accumulated, incidents resulted
Deferred maintenance creates incidents.
Insufficient Testing
Bugs reach production:
Testing gaps:
- Edge cases not tested
- Integration not tested
- Performance not tested
Result: Bugs appear in production
Test gaps cause production incidents.
Monitoring Gaps
Problems aren't caught early:
Problem progression:
Minor issue → Not detected →
Growing issue → Not detected →
Major incident → Finally noticed
Late detection increases severity.
No Root Cause Analysis
Problems aren't really fixed:
Incident response:
- Restart service
- Mark resolved
- Move on
vs
Root cause analysis:
- Why did it break?
- How do we prevent recurrence?
- What's the real fix?
Surface fixes don't prevent recurrence.
No Time for Prevention
Incidents crowd out prevention:
Team allocation:
- 80% incidents
- 20% features
- 0% prevention
Prevention would reduce incidents
But there's no time for prevention
Because of incidents
The cycle reinforces itself.
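The feedback loop can be made concrete with a toy model. This is a sketch under invented assumptions, not a measurement: `growth` (how fast unaddressed problems add load each week) and `payoff` (how much each unit of prevention time reduces load) are hypothetical parameters chosen only to show the shape of the curve.

```python
def simulate_incident_load(weeks, prevention_share,
                           start_load=0.8, growth=0.02, payoff=0.5):
    """Toy model: incident load as a fraction of team capacity, week by week.

    growth and payoff are illustrative assumptions, not measured values.
    """
    load = start_load
    history = []
    for _ in range(weeks):
        history.append(round(load, 3))
        # Unfixed problems make load creep up; prevention work pushes it down.
        load = load + growth - payoff * prevention_share * load
        # Incidents cannot eat into the protected prevention share.
        load = min(max(load, 0.0), 1.0 - prevention_share)
    return history


no_prevention = simulate_incident_load(weeks=10, prevention_share=0.0)
with_prevention = simulate_incident_load(weeks=10, prevention_share=0.2)
print("0% prevention: ", no_prevention)
print("20% prevention:", with_prevention)
```

In this model, zero prevention time means load only ever rises, while a protected share bends the curve downward. Real systems will not follow these exact numbers.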
Breaking the Cycle
Escaping requires deliberate action.
Allocate Prevention Time
Reserve time for improvement:
@devonair allocate prevention time:
- Minimum 20% for reliability work
- Protected from incident response
- Used for root cause fixes
Protected time enables improvement.
Root Cause Analysis
Fix real problems:
@devonair require root cause analysis:
- Every significant incident
- Identify true root cause
- Create prevention actions
- Follow up on fixes
Root cause analysis prevents recurrence.
Prioritize Recurring Incidents
Fix what keeps breaking:
@devonair prioritize recurring incidents:
- Track incident frequency by cause
- Fix the most common causes
- Measure reduction
Fixing frequent issues has the most impact.
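Tracking frequency by cause can start as a simple tally. The sketch below assumes incidents are already labeled with a root cause. The first four rows mirror the recurring example earlier in this article; the other rows and the `rank_causes` helper are hypothetical.

```python
from collections import Counter

# Hypothetical incident log: (week, root cause).
incident_log = [
    (1, "database connection exhaustion"),
    (2, "database connection exhaustion"),
    (3, "database connection exhaustion"),
    (4, "database connection exhaustion"),
    (2, "expired TLS certificate"),
    (3, "disk full on log volume"),
    (4, "disk full on log volume"),
]


def rank_causes(log):
    """Return (cause, count) pairs, most frequent first."""
    return Counter(cause for _, cause in log).most_common()


for cause, count in rank_causes(incident_log):
    print(f"{count:>2}  {cause}")
```

The cause at the top of the list is the first candidate for a real fix.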
Improve Testing
Catch bugs earlier:
@devonair improve testing coverage:
- Focus on areas that cause incidents
- Add tests when fixing incidents
- Prevent regression
Better testing prevents production incidents.
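"Add tests when fixing incidents" might look like the following for a connection exhaustion incident. `BoundedPool` is a hypothetical stand-in for a real database pool, and the leak-on-error cause is assumed for illustration. The point is that the test reproduces the failure mode so the fix cannot silently regress.

```python
import contextlib
import queue


class BoundedPool:
    """Hypothetical stand-in for a database connection pool."""

    def __init__(self, size):
        self._free = queue.Queue()
        for i in range(size):
            self._free.put(f"conn-{i}")

    @contextlib.contextmanager
    def connection(self, timeout=0.1):
        # Raises queue.Empty if the pool is exhausted.
        conn = self._free.get(timeout=timeout)
        try:
            yield conn
        finally:
            # The fix under test: always return the connection, even on error.
            self._free.put(conn)

    def available(self):
        return self._free.qsize()


def test_connections_returned_after_failure():
    pool = BoundedPool(size=2)
    for _ in range(10):  # more failures than connections
        try:
            with pool.connection():
                raise RuntimeError("simulated query failure")
        except RuntimeError:
            pass
    assert pool.available() == 2


test_connections_returned_after_failure()
```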
Improve Monitoring
Catch problems early:
@devonair improve monitoring:
- Detect issues before impact
- Alert on leading indicators
- Enable faster response
Early detection reduces severity.
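A leading indicator is a signal that moves before users are affected. As a minimal sketch, assuming connection pool utilization is such a signal and that 80% is a sensible warning level (both are assumptions to tune per system), a check could look like this:

```python
def check_pool_utilization(in_use, pool_size, warn_at=0.8):
    """Return a warning string when utilization reaches warn_at, else None."""
    if pool_size <= 0:
        raise ValueError("pool_size must be positive")
    utilization = in_use / pool_size
    if utilization >= warn_at:
        return f"WARN: connection pool at {utilization:.0%} ({in_use}/{pool_size})"
    return None


print(check_pool_utilization(40, 100))
print(check_pool_utilization(85, 100))
```

Warning at 80% leaves time to act before the pool is exhausted and requests start failing.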
Measuring Incident Load
Track to improve.
Incident Frequency
How often incidents occur:
@devonair track incident frequency:
- Incidents per week
- By severity
- Trend over time
Frequency shows overall health.
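Counting incidents per week and severity needs little more than a date and a label per incident. The records and severity names below are hypothetical; the grouping uses ISO week numbers from the standard library.

```python
from collections import Counter
from datetime import date

# Hypothetical records: (date opened, severity).
opened_incidents = [
    (date(2024, 1, 1), "sev1"),
    (date(2024, 1, 3), "sev2"),
    (date(2024, 1, 4), "sev2"),
    (date(2024, 1, 9), "sev2"),
    (date(2024, 1, 17), "sev3"),
]


def weekly_counts(records):
    """Count incidents per (ISO year, ISO week, severity)."""
    counts = Counter()
    for opened, severity in records:
        iso_year, iso_week, _ = opened.isocalendar()
        counts[(iso_year, iso_week, severity)] += 1
    return counts


for key, count in sorted(weekly_counts(opened_incidents).items()):
    print(key, count)
```

Plotting these counts week over week gives the trend line.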
Time Spent
How much time incidents consume:
@devonair track incident time:
- Time per incident
- Total time per week
- Percentage of capacity
Time spent shows overhead.
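Percentage of capacity is total incident hours divided by available hours. A small sketch, assuming a 40-hour week and that time per incident is logged (the figures are invented to match the 80% example above):

```python
def incident_share(incident_hours, engineers, hours_per_week=40):
    """Fraction of one week's team capacity consumed by incidents."""
    capacity = engineers * hours_per_week
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    return sum(incident_hours) / capacity


# Hypothetical week: three incidents, four engineers.
hours = [30, 42, 56]
print(f"{incident_share(hours, engineers=4):.0%} of capacity went to incidents")
```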
Recurrence Rate
How often problems repeat:
@devonair track recurrence:
- Same root cause incidents
- Percentage recurring
- Time between recurrence
Recurrence shows whether fixes work.
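One way to compute these numbers, assuming each incident carries a date and a root cause label. The function name, the sample records, and the definition of "recurring" (any incident whose root cause has been seen before) are choices made for this sketch.

```python
from collections import defaultdict
from datetime import date


def recurrence_stats(records):
    """records: iterable of (date, root_cause).

    Returns (share of incidents that repeat an earlier root cause,
             {cause: [days between consecutive occurrences]}).
    """
    by_cause = defaultdict(list)
    for day, cause in records:
        by_cause[cause].append(day)

    total = sum(len(days) for days in by_cause.values())
    repeats = sum(len(days) - 1 for days in by_cause.values())

    gaps = {}
    for cause, days in by_cause.items():
        days.sort()
        if len(days) > 1:
            gaps[cause] = [(b - a).days for a, b in zip(days, days[1:])]

    share = repeats / total if total else 0.0
    return share, gaps


# Hypothetical month of incidents.
month = [
    (date(2024, 1, 2), "database connection exhaustion"),
    (date(2024, 1, 9), "database connection exhaustion"),
    (date(2024, 1, 10), "expired TLS certificate"),
    (date(2024, 1, 16), "database connection exhaustion"),
    (date(2024, 1, 23), "database connection exhaustion"),
]
share, gaps = recurrence_stats(month)
print(f"{share:.0%} recurring", gaps)
```

If a fix works, the recurring share drops and the gaps between occurrences grow.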
Mean Time to Recovery
How long incidents take:
@devonair track MTTR:
- Detection to resolution
- By severity
- Trend over time
MTTR shows response effectiveness.
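MTTR here is the mean of resolution time minus detection time, grouped by severity. A minimal sketch with hypothetical records and severity labels:

```python
from collections import defaultdict
from datetime import datetime, timedelta


def mttr_by_severity(records):
    """records: iterable of (severity, detected_at, resolved_at).

    Returns {severity: mean detection-to-resolution time as a timedelta}.
    """
    durations = defaultdict(list)
    for severity, detected, resolved in records:
        durations[severity].append(resolved - detected)
    return {
        sev: sum(times, timedelta()) / len(times)
        for sev, times in durations.items()
    }


# Hypothetical incidents.
resolved_incidents = [
    ("sev1", datetime(2024, 1, 8, 9, 0), datetime(2024, 1, 8, 10, 30)),
    ("sev1", datetime(2024, 1, 9, 14, 0), datetime(2024, 1, 9, 14, 30)),
    ("sev2", datetime(2024, 1, 10, 11, 0), datetime(2024, 1, 10, 11, 20)),
]
for sev, mttr in mttr_by_severity(resolved_incidents).items():
    print(sev, mttr)
```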
Reducing Incident Severity
While reducing frequency, also reduce impact.
Faster Detection
Find problems sooner:
@devonair improve detection:
- Better monitoring
- Synthetic probes
- Anomaly detection
Faster detection means shorter incidents.
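Anomaly detection does not have to start sophisticated. The sketch below flags a value that sits more than a few standard deviations from recent history. The z-score approach and the threshold of 3 are assumptions chosen for simplicity, not a recommendation for every metric.

```python
from statistics import mean, stdev


def is_anomalous(history, latest, threshold=3.0):
    """True if latest is more than `threshold` standard deviations
    away from the mean of history."""
    if len(history) < 2:
        return False  # not enough data to judge
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return latest != mu
    return abs(latest - mu) / sigma > threshold


# Hypothetical request latencies in milliseconds.
recent_latency_ms = [100, 102, 98, 101, 99]
print(is_anomalous(recent_latency_ms, 101))
print(is_anomalous(recent_latency_ms, 150))
```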
Faster Response
Respond more quickly:
@devonair improve response:
- Clear runbooks
- Automated diagnostics
- Faster communication
Faster response reduces duration.
Smaller Blast Radius
Limit impact:
@devonair limit blast radius:
- Graceful degradation
- Circuit breakers
- Isolation
Contained problems are less severe.
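A circuit breaker stops calling a failing dependency for a while so one broken service does not drag down its callers. This is a minimal, single-threaded sketch. The class, its parameters, and the defaults are illustrative; a production service would more likely use an established library.

```python
import time


class CircuitBreaker:
    """Opens after max_failures consecutive errors, then rejects calls
    until reset_after seconds have passed."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self._clock = clock
        self._failures = 0
        self._opened_at = None

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after:
                raise RuntimeError("circuit open: call rejected")
            # Cool-down elapsed: allow one trial call (half-open).
            self._opened_at = None
            self._failures = self.max_failures - 1
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._failures >= self.max_failures:
                self._opened_at = self._clock()
            raise
        self._failures = 0
        return result
```

Callers wrap the risky call, for example `breaker.call(fetch_profile, user_id)` where `fetch_profile` is whatever function talks to the dependency, and serve a degraded response when the breaker raises.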
Easy Rollback
Undo changes quickly:
@devonair enable rollback:
- One-click rollback
- Tested regularly
- Always available
Fast rollback shortens incidents.
Sustainable On-Call
On-call shouldn't be punishing.
Reasonable Load
Manageable number of pages:
@devonair establish on-call standards:
- Target: < 2 pages per week
- No page should be a false positive
- Every page should be actionable
Reasonable load is sustainable.
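These standards can be checked against a rotation's page history. The sketch assumes each page is recorded with the alert name and whether the responder found it actionable. The alert names and the report shape are hypothetical.

```python
from collections import Counter


def page_report(pages, weeks, target_per_week=2.0):
    """pages: list of (alert_name, was_actionable) over `weeks` weeks."""
    if weeks <= 0:
        raise ValueError("weeks must be positive")
    per_week = len(pages) / weeks
    noisy = Counter(name for name, actionable in pages if not actionable)
    return {
        "pages_per_week": per_week,
        "within_target": per_week < target_per_week,
        "non_actionable": sum(noisy.values()),
        "noisiest_alerts": noisy.most_common(3),
    }


# Hypothetical one-week rotation.
rotation_pages = [
    ("disk-usage", False),
    ("api-5xx", True),
    ("disk-usage", False),
    ("db-pool", True),
    ("cpu-high", False),
]
print(page_report(rotation_pages, weeks=1))
```

The noisiest non-actionable alerts are the first ones to tune or delete.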
Fair Rotation
Burden shared fairly:
@devonair ensure fair rotation:
- Even distribution
- Compensation for extra shifts
- No permanent on-call
Fair rotation prevents burnout.
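Even distribution is easy to generate and easy to verify. A small sketch, assuming one engineer per weekly shift and a simple round-robin order (the names are placeholders):

```python
from collections import Counter
from itertools import cycle, islice


def build_rotation(engineers, weeks):
    """Assign one engineer per week, round-robin."""
    if not engineers:
        raise ValueError("need at least one engineer")
    return list(islice(cycle(engineers), weeks))


def shift_spread(schedule, engineers):
    """Difference between the heaviest and lightest shift counts."""
    counts = Counter(schedule)
    loads = [counts[name] for name in engineers]
    return max(loads) - min(loads)


team = ["ana", "ben", "chi"]
schedule = build_rotation(team, weeks=7)
print(schedule, "spread:", shift_spread(schedule, team))
```

A spread above 1 means someone is carrying more than their share, which is where swaps or compensation come in.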
Support Available
Responders have help:
@devonair provide on-call support:
- Escalation path clear
- Expert backup available
- Permission to escalate
Support reduces individual stress.
Post-Incident Rest
Recovery time after incidents:
@devonair enable recovery:
- Time off after severe incidents
- No back-to-back hard rotations
- Recovery is expected
Recovery prevents accumulation of stress.
Building Reliability Culture
A supportive culture sustains incident reduction.
Prevention Valued
Proactive work recognized:
Cultural signals:
- Prevention work celebrated
- Reliability improvements valued
- Not just firefighting heroics
Valuing prevention enables it.
Blameless Analysis
Focus on systems, not people:
@devonair establish blameless culture:
- Incidents are system failures
- Focus on prevention, not blame
- Learning over punishment
Blameless culture enables honest analysis.
Quality Investment
The organization invests in reliability:
Organization commitment:
- Time for reliability work
- Resources for improvement
- Long-term thinking
Investment enables improvement.
Getting Started
Reduce incident overhead today.
Measure current state:
@devonair analyze incident load:
- Incident frequency
- Time consumed
- Recurring issues
Protect prevention time:
@devonair allocate prevention capacity:
- Minimum percentage protected
- Focus on recurring issues
- Track utilization
Fix recurring issues:
@devonair address top recurring incidents:
- Root cause analysis
- Prevention actions
- Measure reduction
Improve detection:
@devonair improve early detection:
- Better monitoring
- Faster alerting
- Proactive identification
Incident response overhead is reducible. With protected prevention time, root cause analysis, and systematic improvement, the firefighting cycle breaks. Your team spends more time building and less time responding. Incidents become rare events rather than daily occurrences.
FAQ
How do we get time for prevention when we're drowning in incidents?
Start small. Protect 10% of capacity for prevention work. Use it to fix the single most frequent recurring incident. Measure the reduction. The time saved creates more time for more prevention. Build momentum incrementally.
How do we justify prevention work to stakeholders who want features?
Track and communicate the cost of incidents: time spent, features delayed, user impact. Show how prevention investment reduces that cost. Frame prevention as enabling future feature delivery, not competing with it.
Should we have dedicated reliability engineers or should everyone do it?
Both. Everyone should care about reliability and contribute to incident prevention. Dedicated reliability engineers can focus on systemic improvements, tooling, and cross-cutting concerns. The combination is most effective.
How do we handle truly urgent incidents that prevent all other work?
Truly urgent incidents need immediate response. The problem is when everything becomes urgent. Prioritize ruthlessly: not every incident needs an all-hands response. Establish clear severity levels and response expectations.
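Severity levels work best when written down in a form people and tools can both read. The levels, descriptions, and responses below are hypothetical examples, not a standard; each team should define its own.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SeverityPolicy:
    description: str
    page_immediately: bool
    all_hands: bool


# Hypothetical policy table.
POLICIES = {
    "sev1": SeverityPolicy("Outage or data loss affecting most users", True, True),
    "sev2": SeverityPolicy("Major feature degraded, workaround exists", True, False),
    "sev3": SeverityPolicy("Minor issue with limited impact", False, False),
}


def response_for(severity):
    """Map a severity label to the expected response."""
    policy = POLICIES[severity]
    if policy.all_hands:
        return "page on-call and pull in the team"
    if policy.page_immediately:
        return "page on-call only"
    return "file a ticket for business hours"


for level in POLICIES:
    print(level, "->", response_for(level))
```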