Flaky tests are the silent productivity killer in your CI pipeline. They pass sometimes, fail sometimes, and nobody knows why. Teams develop coping mechanisms: re-run the pipeline, ignore certain failures, quarantine unreliable tests. None of these solve the problem.
The traditional approach to flaky tests is to either ignore them (bad) or spend hours debugging race conditions and timing issues (expensive). AI offers a third path: systematic identification and repair of the patterns that cause test flakiness.
This guide covers practical techniques for using AI to fix flaky tests—not as a magic solution, but as a tool that handles the tedious diagnosis and repair work.
Understanding Why Tests Become Flaky
Before fixing flaky tests, it helps to understand what makes them flaky. AI is effective at identifying and fixing these patterns because they're often systematic rather than random.
Timing and Race Conditions
The most common cause of flakiness: tests that depend on things happening in a certain order, but don't enforce that order.
// Flaky: assumes data loads before assertion
test('displays user name', async () => {
  render(<UserProfile id="123" />);
  expect(screen.getByText('John Doe')).toBeInTheDocument();
});

// Stable: waits for data to load
test('displays user name', async () => {
  render(<UserProfile id="123" />);
  await waitFor(() => {
    expect(screen.getByText('John Doe')).toBeInTheDocument();
  });
});
AI can identify tests missing proper async handling and add appropriate wait conditions.
Shared State Between Tests
Tests that modify shared state create order-dependent failures. Test A passes alone but fails after Test B runs.
// Flaky: modifies global state
let userCount = 0;

test('creates user', () => {
  createUser();
  userCount++;
  expect(userCount).toBe(1); // Fails if another test modified userCount
});

// Stable: isolated state
test('creates user', () => {
  const state = { userCount: 0 };
  createUser(state);
  expect(state.userCount).toBe(1);
});
AI can trace state dependencies and identify tests that need isolation.
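The example above keeps state inside the test, but shared state often lives at module scope instead: a singleton store, a cached client, a module-level counter. In that case, one option is to reset Jest's module registry between tests so each test gets a fresh copy. A minimal sketch, assuming a hypothetical CommonJS ./userStore module that keeps an internal user list:

// Sketch only: './userStore' is a hypothetical module with internal state.
beforeEach(() => {
  // Clears Jest's module registry so the next require() returns a fresh module.
  jest.resetModules();
});

test('creates user', () => {
  const userStore = require('./userStore'); // fresh copy, no leftover users
  userStore.createUser({ name: 'Alice' });
  expect(userStore.count()).toBe(1);
});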
External Dependencies
Tests that hit real APIs, databases, or file systems are inherently flaky. Network hiccups, slow responses, or concurrent test runs cause unpredictable failures.
// Flaky: depends on external API
test('fetches weather', async () => {
  const weather = await fetchWeather('NYC');
  expect(weather.temp).toBeDefined();
});

// Stable: mocked external dependency
test('fetches weather', async () => {
  mockFetch({ temp: 72, conditions: 'sunny' });
  const weather = await fetchWeather('NYC');
  expect(weather.temp).toBe(72);
});
AI can identify unmocked external calls and generate appropriate mocks.
Time-Dependent Tests
Tests that rely on specific times or dates break on different days or in different timezones.
// Flaky: depends on current time
test('shows greeting', () => {
  const greeting = getGreeting();
  expect(greeting).toBe('Good morning'); // Fails after noon
});

// Stable: controlled time
test('shows greeting', () => {
  jest.useFakeTimers().setSystemTime(new Date('2025-01-01T09:00:00'));
  const greeting = getGreeting();
  expect(greeting).toBe('Good morning');
});
AI can identify time-sensitive code and add appropriate time mocking.
Resource Exhaustion
Tests that create resources without cleanup eventually exhaust available resources—ports, file handles, memory—causing failures in subsequent tests.
// Flaky: doesn't close server
test('handles request', async () => {
  const server = createServer();
  await server.listen(3000);
  // test logic
  // server never closed - port 3000 now unavailable
});

// Stable: proper cleanup
test('handles request', async () => {
  const server = createServer();
  await server.listen(3000);
  try {
    // test logic
  } finally {
    await server.close();
  }
});
AI can identify resource allocation without corresponding cleanup.
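A related caveat on the example above: even with cleanup, hardcoding port 3000 can cause collisions when tests run in parallel. A sketch of one way around that, using Node's built-in http module (rather than the hypothetical createServer helper above) and assuming Node 18+ for the global fetch: port 0 asks the OS for a free ephemeral port.

const http = require('http');

test('handles request', async () => {
  const server = http.createServer((req, res) => res.end('ok'));
  // Port 0 = any free port, so parallel workers can't fight over port 3000.
  await new Promise((resolve) => server.listen(0, resolve));
  const { port } = server.address();
  try {
    const response = await fetch(`http://127.0.0.1:${port}/`);
    expect(response.status).toBe(200);
  } finally {
    // Always release the port, even if the assertion fails.
    await new Promise((resolve) => server.close(resolve));
  }
});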
AI-Powered Flaky Test Diagnosis
The first step in fixing flaky tests is identifying them. AI can help with both detection and diagnosis.
Identifying Flaky Tests
Not all test failures indicate flakiness. A flaky test is one that produces different results without code changes. AI can analyze test history to identify:
- Tests that fail intermittently (passed 90% of runs, failed 10%)
- Tests that fail only in CI but pass locally
- Tests that fail more often at certain times (suggesting time dependencies)
- Tests that fail after other specific tests (suggesting shared state)
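To make the first of those signals concrete, here is a minimal sketch of the underlying analysis. It assumes you can export per-run results from CI as an array of { testName, passed } records; the exact export mechanism depends on your CI provider.

// Flags tests that fail intermittently: not always-green, not always-red.
function findFlakyTests(runs, minFailRate = 0.02, maxFailRate = 0.5) {
  const stats = new Map();
  for (const { testName, passed } of runs) {
    const s = stats.get(testName) ?? { passes: 0, failures: 0 };
    passed ? s.passes++ : s.failures++;
    stats.set(testName, s);
  }
  return [...stats.entries()]
    .map(([name, { passes, failures }]) => ({
      name,
      failRate: failures / (passes + failures),
    }))
    // Consistently failing tests are broken, not flaky; filter them out.
    .filter(({ failRate }) => failRate >= minFailRate && failRate <= maxFailRate)
    .sort((a, b) => b.failRate - a.failRate);
}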
With Devonair, you can surface flaky tests automatically:
@devonair analyze test suite for flakiness patterns
This identifies which tests are unreliable and categorizes the likely cause.
Diagnosing Root Causes
Once you know which tests are flaky, AI can analyze the test code to determine why:
@devonair diagnose flaky test: UserProfile.test.js
The AI examines:
- Async operations without proper awaiting
- Shared state across tests
- External dependencies without mocks
- Time-sensitive assertions
- Resource management patterns
The output isn't just "this test is flaky" but "this test is flaky because it doesn't await the API call on line 23."
Practical AI Repair Techniques
Here's how to use AI to actually fix flaky tests, category by category.
Fixing Async/Timing Issues
Timing issues are the most common cause of flakiness and often the easiest to fix with AI.
Before:
test('loads and displays data', () => {
  render(<DataTable />);
  expect(screen.getByRole('table')).toBeInTheDocument();
  expect(screen.getAllByRole('row')).toHaveLength(10);
});
AI prompt:
@devonair fix timing issues in DataTable.test.js
After:
test('loads and displays data', async () => {
  render(<DataTable />);
  await waitFor(() => {
    expect(screen.getByRole('table')).toBeInTheDocument();
  });
  await waitFor(() => {
    expect(screen.getAllByRole('row')).toHaveLength(10);
  });
});
AI identifies assertions that need async handling and adds appropriate wait patterns.
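If the suite uses React Testing Library, as these examples suggest, the same fix can often be written more compactly with the findBy* queries, which wrap waitFor internally (same imports as above assumed):

test('loads and displays data', async () => {
  render(<DataTable />);
  // findByRole retries until the element appears or the timeout expires.
  expect(await screen.findByRole('table')).toBeInTheDocument();
  await waitFor(() => {
    expect(screen.getAllByRole('row')).toHaveLength(10);
  });
});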
Fixing Shared State Issues
Fixing shared state requires understanding which state is shared between tests and how to isolate it.
Before:
describe('UserService', () => {
  const service = new UserService();

  test('creates user', () => {
    service.create({ name: 'Alice' });
    expect(service.count()).toBe(1);
  });

  test('deletes user', () => {
    service.delete('Alice');
    expect(service.count()).toBe(0); // Fails if create test didn't run first
  });
});
AI prompt:
@devonair isolate shared state in UserService.test.js
After:
describe('UserService', () => {
  let service;

  beforeEach(() => {
    service = new UserService();
  });

  afterEach(() => {
    service.reset();
  });

  test('creates user', () => {
    service.create({ name: 'Alice' });
    expect(service.count()).toBe(1);
  });

  test('deletes user', () => {
    service.create({ name: 'Bob' });
    service.delete('Bob');
    expect(service.count()).toBe(0);
  });
});
AI moves shared instances to setup/teardown and ensures each test is self-contained.
Mocking External Dependencies
Tests that touch external dependencies need mocks before they can be reliable. AI can identify the external calls and generate appropriate mocks.
Before:
test('sends notification', async () => {
  const result = await sendPushNotification({
    userId: '123',
    message: 'Hello'
  });
  expect(result.delivered).toBe(true);
});
AI prompt:
@devonair add mocks for external dependencies in notification.test.js
After:
import { mockPushService } from '../services/__mocks__/pushService';

jest.mock('../services/pushService');

test('sends notification', async () => {
  mockPushService.send.mockResolvedValue({ delivered: true, id: 'notif-123' });
  const result = await sendPushNotification({
    userId: '123',
    message: 'Hello'
  });
  expect(result.delivered).toBe(true);
  expect(mockPushService.send).toHaveBeenCalledWith({
    userId: '123',
    message: 'Hello'
  });
});
AI identifies the external call, creates an appropriate mock, and updates the test to use it.
Fixing Time-Dependent Tests
Tests that depend on dates or times need controlled time environments.
Before:
test('shows expiration warning', () => {
  const subscription = { expiresAt: '2025-01-15' };
  const warning = getExpirationWarning(subscription);
  expect(warning).toBe('Expires in 5 days'); // Only passes on Jan 10
});
AI prompt:
@devonair fix time-dependent tests in subscription.test.js
After:
test('shows expiration warning', () => {
  jest.useFakeTimers();
  jest.setSystemTime(new Date('2025-01-10T12:00:00Z'));

  const subscription = { expiresAt: '2025-01-15' };
  const warning = getExpirationWarning(subscription);
  expect(warning).toBe('Expires in 5 days');

  jest.useRealTimers();
});
AI adds time mocking and sets a specific time that makes the assertion deterministic.
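One caveat with this shape: jest.useRealTimers() at the bottom of the test never runs if an assertion above it throws, so fake timers can leak into later tests, which is itself a shared-state problem. Moving the timer setup into lifecycle hooks avoids that; a small variant of the same fix:

beforeEach(() => {
  jest.useFakeTimers();
  jest.setSystemTime(new Date('2025-01-10T12:00:00Z'));
});

afterEach(() => {
  // Runs even when a test fails, so fake timers never leak into other tests.
  jest.useRealTimers();
});

test('shows expiration warning', () => {
  const subscription = { expiresAt: '2025-01-15' };
  const warning = getExpirationWarning(subscription);
  expect(warning).toBe('Expires in 5 days');
});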
Adding Resource Cleanup
Resource leaks cause flakiness that gets worse over time. AI can identify missing cleanup.
Before:
test('handles database query', async () => {
  const db = await connectDatabase();
  const results = await db.query('SELECT * FROM users');
  expect(results).toHaveLength(5);
});
AI prompt:
@devonair add resource cleanup to database.test.js
After:
describe('database queries', () => {
  let db;

  beforeAll(async () => {
    db = await connectDatabase();
  });

  afterAll(async () => {
    await db.close();
  });

  test('handles database query', async () => {
    const results = await db.query('SELECT * FROM users');
    expect(results).toHaveLength(5);
  });
});
AI identifies resources that need cleanup and adds appropriate lifecycle hooks.
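One trade-off to be aware of in this fix: beforeAll/afterAll shares a single connection across every test in the file, which is fast but can reintroduce shared state if tests write to the database. When tests mutate data, per-test setup with beforeEach/afterEach, or an explicit data reset between tests, is the slower but safer choice.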
Workflow Integration
Fixing flaky tests shouldn't be a periodic cleanup project. It should be integrated into normal development.
Continuous Monitoring
Set up AI to continuously monitor for flakiness:
@devonair monitor test suite for flakiness
- Alert when test failure rate exceeds 5%
- Track flakiness trends over time
- Identify newly flaky tests immediately
This catches flakiness early, before it becomes normalized.
Automated Repair PRs
When AI identifies a flaky test, it can automatically create a fix PR:
@devonair auto-fix flaky tests in PR
- Diagnose root cause
- Generate fix
- Run tests to verify fix
- Create PR with explanation
The PR explains what was flaky, why, and how the fix addresses it.
Pre-Merge Flakiness Check
Before tests become part of the main suite, check for potential flakiness:
@devonair check new tests for flakiness patterns
This catches common mistakes before they become problems.
What AI Can't Fix
AI is effective for pattern-based flakiness, but some issues require human investigation.
Complex Race Conditions
Some race conditions involve subtle timing between multiple async operations. AI can identify that a race condition exists but may not determine the correct fix without deeper understanding of the intended behavior.
Business Logic Errors
If a test is flaky because the business logic it's testing is wrong, AI will try to make the test match the broken behavior. Humans need to determine whether the test or the code is correct.
Infrastructure Issues
Flakiness caused by CI infrastructure—resource limits, network configuration, parallel execution conflicts—isn't visible in the test code. AI can identify that a test fails only in CI but can't fix infrastructure problems.
Intentionally Non-Deterministic Behavior
Some code is supposed to be non-deterministic (random selection, load balancing). Tests for this code need special handling that requires understanding the intent.
Measuring Progress
Track these metrics to measure flaky test improvement:
Flaky Test Rate
Percentage of test runs with at least one flaky failure. Should decrease over time.
Mean Time to Detect Flakiness
How long after a test becomes flaky until it's identified. AI monitoring should reduce this to hours, not weeks.
Mean Time to Fix Flakiness
How long from identification to fix. AI-assisted fixes should take minutes for common patterns.
Re-Run Rate
How often developers re-run CI hoping for a different result. Should approach zero as flakiness is eliminated.
Quarantine Size
Number of tests quarantined as unreliable. This should shrink, not grow.
Getting Started
Start with the highest-impact flaky tests—the ones that fail most often and block the most merges.
Step 1: Identify Your Worst Offenders
Review CI history for the past month. Which tests failed most often without associated code changes? These are your flaky tests.
Step 2: Categorize the Causes
For each flaky test, determine the likely cause:
- Timing/async issues
- Shared state
- External dependencies
- Time sensitivity
- Resource leaks
Step 3: Fix One Category at a Time
Start with the most common category. Use AI to fix all tests in that category, then move to the next.
Step 4: Establish Ongoing Monitoring
Once existing flakiness is addressed, set up continuous monitoring to catch new flakiness immediately.
Step 5: Gate New Tests
Add flakiness pattern detection to your PR process so new tests don't introduce the same problems.
Conclusion
Flaky tests don't have to be a fact of life. They're caused by identifiable patterns: timing issues, shared state, external dependencies, time sensitivity, and resource leaks. AI can diagnose these patterns and generate fixes for most cases.
The goal isn't a magic button that fixes all flakiness. It's reducing the tedious diagnosis and repair work so developers can focus on the tests that require genuine investigation.
Stop quarantining flaky tests. Start fixing them systematically.
FAQ
How accurate is AI at fixing flaky tests?
For common patterns (timing issues, missing mocks, shared state), AI fixes work correctly most of the time. Complex race conditions or logic errors may need human intervention. Every fix goes through PR review, so incorrect fixes get caught.
Will AI-generated test fixes break my tests?
All fixes are verified by running the test suite. If the fix breaks something, the PR will show failing tests. You review before merging.
How long does it take to fix a flaky test with AI?
For common patterns, AI generates a fix in seconds to minutes. Review and merge takes a few more minutes. Compare to hours of manual debugging for race conditions.
Should I fix flaky tests or delete them?
Fix them if they test valuable functionality. Delete them if they test trivial behavior that's covered elsewhere or if the code they test is being removed. AI can help identify tests that provide little value.
What if AI keeps flagging tests as flaky that aren't?
Adjust flakiness detection thresholds. A test that fails once in 1000 runs might not be worth fixing. Focus on tests with meaningful failure rates.