The on-call phone buzzes. Another alert. The engineer glances at it - another false positive, probably. They dismiss it without investigation. An hour later, users report the site is down. The alert that was dismissed? It was the real one, buried in a sea of noise.
Alert fatigue is dangerous. When teams receive too many alerts, they stop paying attention. When alerts frequently turn out to be nothing, investigation becomes optional. When every alert is marked urgent, urgency loses meaning. The system designed to catch problems becomes the system that lets problems through.
This isn't individual failure - it's system failure. Alerting systems that generate too much noise train teams to ignore signals. The solution isn't disciplining people to pay more attention. The solution is AI-powered tools like Devonair that generate fewer, better alerts - signals that deserve attention and get it.
The Anatomy of Alert Fatigue
Alert fatigue develops through predictable stages.
Stage 1: Alerting Everything
Initial setup casts a wide net:
Alert configuration:
- CPU > 80% - Alert
- Memory > 70% - Alert
- Disk > 60% - Alert
- Response time > 100ms - Alert
- Any error - Alert
Comprehensive alerting seems safe.
Stage 2: Noise Accumulates
Everything triggers alerts:
Daily alert volume:
- Transient CPU spikes: 50 alerts
- Normal memory fluctuation: 30 alerts
- Disk approaching threshold: 20 alerts
- Expected errors: 40 alerts
- Actual problems: 2 alerts
Signal drowns in noise.
Stage 3: Triage Becomes Impossible
Volume overwhelms investigation:
On-call experience:
- 142 alerts per day
- Each requires investigation to understand
- Time available: Not enough
- Response: Skim and dismiss
Humans can't investigate this volume.
Stage 4: Trained Indifference
Teams learn to ignore alerts:
Alert received:
"Probably nothing"
"Same alert as always"
"I'll check if it fires again"
"Someone else probably looked"
Experience teaches that most alerts aren't actionable.
Stage 5: Real Problems Missed
Critical alerts don't get attention:
Real incident:
- Alert fires
- Dismissed as noise
- Problem escalates
- Users report issue
- Investigation reveals: Alert was correct
The system designed to catch problems fails.
Why Alerting Generates Noise
Understanding noise sources reveals how to eliminate them.
Low Thresholds
Thresholds set too sensitively:
Threshold: CPU > 70%
Normal operation: CPU regularly hits 75%
Result: Constant alerting for normal behavior
Thresholds should reflect abnormal, not normal.
Missing Context
Alerts without enough information:
Alert: "Error rate elevated"
Questions: How elevated? Where? Since when? Why?
Investigation: Start from scratch
Context-poor alerts require investigation.
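To make the contrast concrete, here is a minimal sketch of a context-rich alert. The structure and field names are illustrative, not Devonair's API; the point is that the alert itself answers "how elevated, where, since when":

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class Alert:
    """A hypothetical context-rich alert: answers how much, where, since when."""
    title: str
    service: str            # where the problem is
    current_value: float    # how elevated
    baseline_value: float   # what normal looks like
    since: datetime         # when the condition started
    details: dict = field(default_factory=dict)

    def summary(self) -> str:
        ratio = self.current_value / self.baseline_value if self.baseline_value else float("inf")
        return (f"{self.title} on {self.service}: {self.current_value:.2%} "
                f"(baseline {self.baseline_value:.2%}, {ratio:.1f}x) since {self.since:%H:%M UTC}")

alert = Alert(
    title="Error rate elevated",
    service="checkout-api",
    current_value=0.043,
    baseline_value=0.004,
    since=datetime(2024, 5, 1, 14, 20, tzinfo=timezone.utc),
)
print(alert.summary())
```

A responder can triage that in seconds; "Error rate elevated" alone forces an investigation just to find out what it means.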
Transient Conditions
Alerting on momentary states:
Situation: Brief CPU spike during deployment
Alert: "CPU critical"
Reality: Returns to normal in 2 minutes
Transient conditions don't need alerts.
Expected Events
Alerting on normal operations:
Deployment happens
Alerts fire during deployment
Every deployment generates alerts
Alerts during deployment are ignored
Expected events shouldn't alert.
Duplicate Alerts
Same problem, multiple alerts:
Database slow:
- Database alert fires
- Application timeout alert fires
- API latency alert fires
- Frontend error alert fires
One problem, four alerts
Duplication multiplies noise.
Unmaintained Alerts
Alerts from defunct systems:
Alert: "Legacy batch job failed"
Reality: Batch job decommissioned 6 months ago
Response: Silence alert (eventually)
Alert configuration drifts from reality.
The Cost of Alert Fatigue
Alert fatigue has real consequences.
Missed Incidents
Real problems ignored:
Impact of missed alerts:
- Extended downtime
- Larger blast radius
- More customer impact
Missed alerts extend incidents.
On-Call Burnout
Alert volume burns out responders:
On-call experience:
- Constant interruption
- Sleep disruption
- Investigation fatigue
- Eventual departure
Alert fatigue drives attrition.
Slower Response
Even investigated alerts take longer:
Alert received:
"Is this real this time?"
[Investigation to determine if real]
[Then actual problem solving]
vs
Alert received:
"This is actionable"
[Immediate problem solving]
Distrust slows response.
False Confidence
Teams think they're covered:
Perception: "We have alerting"
Reality: "We have alerting that people ignore"
Alert coverage doesn't mean problem coverage.
Building Better Alerts
Alerts should be worth attention.
Alert on Symptoms, Not Causes
Alert on user impact:
@devonair configure symptom-based alerts:
- Error rate affecting users
- Response time affecting users
- Availability affecting users
Symptom alerts indicate real problems.
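As a rough sketch of the idea (the function names and the 2% threshold are assumptions for illustration), a symptom-based check keys off what users experience rather than internal resource metrics:

```python
def user_facing_error_rate(requests_total: int, requests_failed: int) -> float:
    """Fraction of user requests that failed -- a symptom users actually feel."""
    return requests_failed / requests_total if requests_total else 0.0

def should_alert(requests_total: int, requests_failed: int,
                 threshold: float = 0.02) -> bool:
    """Alert on user impact (error rate), not on internal causes like CPU."""
    return user_facing_error_rate(requests_total, requests_failed) > threshold

# A CPU spike with no failed requests stays silent; real user errors do not.
print(should_alert(requests_total=10_000, requests_failed=30))   # False (0.3%)
print(should_alert(requests_total=10_000, requests_failed=450))  # True  (4.5%)
```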
Actionable Alerts Only
Every alert has a response:
@devonair ensure alert actionability:
- What action should be taken?
- If no action, why alert?
Alerting without action is noise.
Appropriate Thresholds
Thresholds reflect abnormality:
@devonair tune thresholds:
- Based on normal behavior
- Account for variance
- Update as behavior changes
Well-tuned thresholds reduce noise.
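One simple way to derive a threshold from observed behavior is mean plus a few standard deviations. The sketch below is illustrative only; real tuning might use percentiles or seasonal baselines instead:

```python
from statistics import mean, stdev

def suggest_threshold(samples: list[float], k: float = 3.0) -> float:
    """Derive a threshold from observed behavior: mean + k standard deviations.

    A simple illustration of 'based on normal behavior, account for variance'.
    """
    return mean(samples) + k * stdev(samples)

# A week of CPU readings that routinely reach the mid-70s:
cpu_samples = [62, 68, 71, 75, 66, 73, 77, 70, 74, 69]
print(f"Suggested CPU alert threshold: {suggest_threshold(cpu_samples):.0f}%")
# Alerting at 70% would fire constantly; the derived threshold would not.
```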
Duration Requirements
Alert on sustained conditions:
@devonair configure duration:
- Condition must persist X minutes
- Filters transient spikes
- Reduces false positives
Duration requirements filter transients.
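A minimal sketch of a duration requirement, assuming checks arrive on a fixed interval (the class name and window size are illustrative):

```python
from collections import deque

class SustainedCondition:
    """Fire only if a condition holds for every one of the last `window` checks."""
    def __init__(self, window: int):
        self.recent = deque(maxlen=window)

    def observe(self, condition_met: bool) -> bool:
        self.recent.append(condition_met)
        return len(self.recent) == self.recent.maxlen and all(self.recent)

# Require CPU > 90% on five consecutive one-minute checks before alerting.
check = SustainedCondition(window=5)
readings = [95, 96, 60, 92, 93, 94, 95, 97]  # brief spike, then sustained load
for cpu in readings:
    if check.observe(cpu > 90):
        print(f"Alert: CPU sustained above 90% (latest {cpu}%)")
```

The brief spike at the start never alerts; only the sustained run at the end does.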
Suppression During Expected Events
Silence expected noise:
@devonair configure suppression:
- During deployments
- During maintenance windows
- During expected spikes
AI-powered suppression handles expected events automatically.
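A sketch of the underlying idea, with hypothetical suppression windows: non-critical alerts stay quiet during expected events, while critical user-facing alerts still page:

```python
from datetime import datetime, timezone

# Hypothetical suppression windows (start, end) -- e.g. deployments, maintenance.
SUPPRESSION_WINDOWS = [
    (datetime(2024, 5, 1, 14, 0, tzinfo=timezone.utc),
     datetime(2024, 5, 1, 14, 30, tzinfo=timezone.utc)),
]

def should_notify(severity: str, fired_at: datetime) -> bool:
    """Suppress non-critical alerts during expected events; critical ones always page."""
    if severity == "critical":
        return True
    in_window = any(start <= fired_at <= end for start, end in SUPPRESSION_WINDOWS)
    return not in_window

during_deploy = datetime(2024, 5, 1, 14, 10, tzinfo=timezone.utc)
print(should_notify("warning", during_deploy))   # False: expected deployment noise
print(should_notify("critical", during_deploy))  # True: user-facing impact still pages
```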
Alert Hygiene
Alerts need ongoing maintenance.
Regular Review
AI helps review alert effectiveness:
@devonair schedule alert review:
- Which alerts fire most?
- Which alerts are actionable?
- Which alerts are ignored?
Review identifies problem alerts.
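A review can start from something as simple as fire counts and action rates. The data shape below is hypothetical:

```python
from collections import Counter

# Hypothetical alert history: (alert_name, was_acted_on)
history = [
    ("cpu-high", False), ("cpu-high", False), ("cpu-high", False),
    ("disk-near-full", True), ("error-rate-checkout", True),
    ("legacy-batch-failed", False), ("legacy-batch-failed", False),
]

fired = Counter(name for name, _ in history)
acted = Counter(name for name, acted_on in history if acted_on)

print(f"{'alert':<22}{'fired':>6}{'acted on':>10}")
for name, count in fired.most_common():
    rate = acted[name] / count
    print(f"{name:<22}{count:>6}{rate:>10.0%}")
# Alerts that fire often and are never acted on are the first tuning targets.
```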
Remove Unmaintained Alerts
Delete alerts that don't apply:
@devonair identify stale alerts:
- Alerts for decommissioned systems
- Alerts never acted on
- Alerts universally ignored
Removing stale alerts reduces noise.
Post-Incident Analysis
Learn from incidents:
@devonair analyze post-incident:
- Did alerts fire?
- Were they noticed?
- Were they acted on?
Incidents reveal alerting gaps.
Tuning Feedback Loop
Continuously improve:
@devonair implement alert feedback:
- Track alert outcomes
- Tune based on results
- Improve over time
Feedback drives improvement.
Alert Organization
Structure helps manage alerts.
Severity Levels
Clear severity definitions:
@devonair define severity levels:
- Critical: User-facing impact now
- High: User impact imminent
- Warning: Needs attention soon
- Info: For awareness only
Severity guides response urgency.
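For illustration, the same definitions can be encoded so paging decisions follow directly from severity (the names and the paging rule are assumptions, not Devonair's model):

```python
from enum import Enum

class Severity(Enum):
    """Severity definitions mirroring the list above."""
    CRITICAL = "User-facing impact now"
    HIGH = "User impact imminent"
    WARNING = "Needs attention soon"
    INFO = "For awareness only"

def pages_someone(severity: Severity) -> bool:
    """Only critical and high wake a human; the rest go to a queue or dashboard."""
    return severity in (Severity.CRITICAL, Severity.HIGH)

print(pages_someone(Severity.WARNING))   # False
print(pages_someone(Severity.CRITICAL))  # True
```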
Routing
Right alerts to right people:
@devonair route alerts:
- Database alerts to DBA on-call
- Application alerts to app on-call
- Infrastructure alerts to infra on-call
Routing ensures relevant expertise.
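A sketch of the idea with a hypothetical routing table mapping alert sources to on-call rotations:

```python
# Hypothetical routing table: alert source -> on-call rotation.
ROUTES = {
    "database": "dba-oncall",
    "application": "app-oncall",
    "infrastructure": "infra-oncall",
}

def route(alert_source: str, default: str = "platform-oncall") -> str:
    """Send each alert to the rotation with the relevant expertise."""
    return ROUTES.get(alert_source, default)

print(route("database"))   # dba-oncall
print(route("payments"))   # platform-oncall (fallback for unmapped sources)
```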
Aggregation
Group related alerts:
@devonair aggregate alerts:
- Related alerts grouped
- Single notification for group
- Detail available when needed
Aggregation reduces volume.
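A minimal sketch of grouping, assuming each alert carries a correlation key (for example, the shared upstream service):

```python
from collections import defaultdict

# Hypothetical related alerts from one underlying problem (slow database).
alerts = [
    {"name": "db-latency-high",      "correlation_key": "orders-db"},
    {"name": "app-timeouts",         "correlation_key": "orders-db"},
    {"name": "api-latency-high",     "correlation_key": "orders-db"},
    {"name": "frontend-error-spike", "correlation_key": "orders-db"},
    {"name": "cert-expiring",        "correlation_key": "edge-proxy"},
]

groups = defaultdict(list)
for alert in alerts:
    groups[alert["correlation_key"]].append(alert["name"])

for key, names in groups.items():
    # One notification per group; detail stays available for investigation.
    print(f"[{key}] {len(names)} related alerts: {', '.join(names)}")
```

The four alerts from the earlier "database slow" example collapse into a single notification.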
Runbooks
Link to response procedures:
@devonair attach runbooks:
- Every alert links to runbook
- Runbook explains response
- Reduces investigation time
Runbooks enable faster response.
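A sketch of the principle: the runbook link lives in the alert definition itself, so it arrives with the notification (the URL is a placeholder):

```python
from dataclasses import dataclass

@dataclass
class AlertDefinition:
    """Every alert definition carries a runbook link."""
    name: str
    runbook_url: str

    def notification(self) -> str:
        return f"{self.name} fired. Runbook: {self.runbook_url}"

alert = AlertDefinition(
    name="error-rate-checkout",
    runbook_url="https://wiki.example.com/runbooks/checkout-error-rate",
)
print(alert.notification())
```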
Measuring Alert Health
Track alerting effectiveness.
Volume Metrics
How many alerts fire:
@devonair track alert volume:
- Alerts per day
- Alerts per on-call rotation
- Volume trends
High volume indicates problems.
Actionability Metrics
How many alerts are acted on:
@devonair track actionability:
- Percentage of alerts acted on
- Percentage immediately dismissed
- Percentage requiring investigation
Low actionability indicates noise.
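A sketch of how actionability might be computed from recorded outcomes (the outcome labels and counts are illustrative):

```python
# Hypothetical outcomes for one day of alerts: acted on, investigated, or dismissed.
outcomes = ["dismissed"] * 90 + ["investigated"] * 30 + ["acted"] * 22

total = len(outcomes)
for outcome in ("acted", "investigated", "dismissed"):
    share = outcomes.count(outcome) / total
    print(f"{outcome:<13}{share:.0%}")
# A low 'acted' share with a high 'dismissed' share is the signature of noise.
```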
Response Metrics
How fast alerts get attention:
@devonair track response:
- Time to acknowledge
- Time to resolve
- Resolution outcomes
Response metrics show effectiveness.
Incident Correlation
Whether alerts actually catch problems:
@devonair correlate with incidents:
- Did alert fire before incident detected?
- Did alert fire and get ignored?
- Was incident found without alert?
Correlation shows alert value.
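A sketch of the correlation check, using hypothetical incident records:

```python
from datetime import datetime

# Hypothetical records: when each incident was detected, and when a related alert fired.
incidents = [
    {"name": "checkout outage", "detected_at": datetime(2024, 5, 1, 15, 10),
     "alert_fired_at": datetime(2024, 5, 1, 14, 25)},
    {"name": "search degradation", "detected_at": datetime(2024, 5, 3, 9, 40),
     "alert_fired_at": None},  # found by a user report, no alert fired
]

for incident in incidents:
    fired = incident["alert_fired_at"]
    if fired is None:
        print(f"{incident['name']}: no alert fired -- coverage gap")
    else:
        lead = incident["detected_at"] - fired
        print(f"{incident['name']}: alert fired {lead} before detection -- "
              "did anyone act on it?")
```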
Getting Started
Fix alert fatigue today.
Audit current state:
@devonair audit alerts:
- Current volume
- Actionability rate
- Most frequent alerts
Fix the worst offenders:
@devonair address problem alerts:
- Tune thresholds
- Add duration requirements
- Remove stale alerts
Establish hygiene:
@devonair establish alert maintenance:
- Regular review process
- Post-incident analysis
- Continuous tuning
Track improvement:
@devonair track alert health:
- Volume trends
- Actionability trends
- Incident correlation
Alert fatigue is fixable. When alerts are meaningful, actionable, and well-tuned, teams pay attention. When every alert is worth investigating, investigation happens. Your alerting system can actually catch problems before users notice - but only if alerts deserve the attention they demand.
FAQ
How do we fix alerts without risking missing real problems?
Improve signal-to-noise ratio rather than reducing coverage. Better thresholds, duration requirements, and symptom-based alerting catch real problems with less noise. Monitor incident correlation to ensure you're catching what matters.
Who should own alert configuration?
Teams that respond to alerts should own their configuration. They have the context to tune effectively. Central guidance on standards and best practices helps, but local ownership enables appropriate tuning.
How do we handle alerts during deployments?
Suppress or reduce sensitivity during deployments when elevated metrics are expected. But ensure critical user-facing alerts remain active. Consider separate deployment-specific alerts that expect different baselines.
What's a good target for alert volume?
There's no universal number, but every alert should be actionable. If on-call engineers can't reasonably investigate each alert, volume is too high. Start by tracking current volume and actionability, then set improvement targets.