
How to Use AI to Fix Flaky Tests: A Practical Guide

Learn practical techniques for using AI to identify, diagnose, and fix flaky tests. Stop quarantining unreliable tests and start making them stable.

Flaky tests are the silent productivity killer in your CI pipeline. They pass sometimes, fail sometimes, and nobody knows why. Teams develop coping mechanisms: re-run the pipeline, ignore certain failures, quarantine unreliable tests. None of these solve the problem.

The traditional approach to flaky tests is to either ignore them (bad) or spend hours debugging race conditions and timing issues (expensive). AI offers a third path: systematic identification and repair of the patterns that cause test flakiness.

This guide covers practical techniques for using AI to fix flaky tests—not as a magic solution, but as a tool that handles the tedious diagnosis and repair work.

Understanding Why Tests Become Flaky

Before fixing flaky tests, it helps to understand what makes them flaky. AI is effective at identifying and fixing these patterns because they're often systematic rather than random.

Timing and Race Conditions

The most common cause of flakiness: tests that depend on things happening in a certain order, but don't enforce that order.

// Flaky: assumes data loads before assertion
test('displays user name', async () => {
  render(<UserProfile id="123" />);
  expect(screen.getByText('John Doe')).toBeInTheDocument();
});

// Stable: waits for data to load
test('displays user name', async () => {
  render(<UserProfile id="123" />);
  await waitFor(() => {
    expect(screen.getByText('John Doe')).toBeInTheDocument();
  });
});

AI can identify tests missing proper async handling and add appropriate wait conditions.

Shared State Between Tests

Tests that modify shared state create order-dependent failures. Test A passes alone but fails after Test B runs.

// Flaky: modifies global state
let userCount = 0;

test('creates user', () => {
  createUser();
  userCount++;
  expect(userCount).toBe(1); // Fails if another test modified userCount
});

// Stable: isolated state
test('creates user', () => {
  const state = { userCount: 0 };
  createUser(state);
  expect(state.userCount).toBe(1);
});

AI can trace state dependencies and identify tests that need isolation.

External Dependencies

Tests that hit real APIs, databases, or file systems are inherently flaky. Network hiccups, slow responses, or concurrent test runs cause unpredictable failures.

// Flaky: depends on external API
test('fetches weather', async () => {
  const weather = await fetchWeather('NYC');
  expect(weather.temp).toBeDefined();
});

// Stable: mocked external dependency
test('fetches weather', async () => {
  mockFetch({ temp: 72, conditions: 'sunny' });
  const weather = await fetchWeather('NYC');
  expect(weather.temp).toBe(72);
});

AI can identify unmocked external calls and generate appropriate mocks.

Time-Dependent Tests

Tests that rely on specific times or dates break on different days or in different timezones.

// Flaky: depends on current time
test('shows greeting', () => {
  const greeting = getGreeting();
  expect(greeting).toBe('Good morning'); // Fails after noon
});

// Stable: controlled time
test('shows greeting', () => {
  jest.useFakeTimers().setSystemTime(new Date('2025-01-01T09:00:00'));
  const greeting = getGreeting();
  expect(greeting).toBe('Good morning');
});

AI can identify time-sensitive code and add appropriate time mocking.

Resource Exhaustion

Tests that create resources without cleanup eventually exhaust available resources—ports, file handles, memory—causing failures in subsequent tests.

// Flaky: doesn't close server
test('handles request', async () => {
  const server = createServer();
  await server.listen(3000);
  // test logic
  // server never closed - port 3000 now unavailable
});

// Stable: proper cleanup
test('handles request', async () => {
  const server = createServer();
  await server.listen(3000);
  try {
    // test logic
  } finally {
    await server.close();
  }
});

AI can identify resource allocation without corresponding cleanup.

AI-Powered Flaky Test Diagnosis

The first step in fixing flaky tests is identifying them. AI can help with both detection and diagnosis.

Identifying Flaky Tests

Not all test failures indicate flakiness. A flaky test is one that produces different results without code changes. AI can analyze test history to identify:

  • Tests that fail intermittently (e.g., pass 90% of runs and fail the other 10%)
  • Tests that fail only in CI but pass locally
  • Tests that fail more often at certain times (suggesting time dependencies)
  • Tests that fail after other specific tests (suggesting shared state)

With Devonair, you can surface flaky tests automatically:

@devonair analyze test suite for flakiness patterns

This identifies which tests are unreliable and categorizes the likely cause.
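
As an illustration of what that history analysis involves, here is a minimal sketch in plain Node.js. It assumes a hypothetical export from your CI system of { testName, commitSha, passed } records and flags any test that both passed and failed on the same commit, the defining signature of flakiness:

// flaky-detect.js: flag tests whose result changed with no code change,
// i.e. tests that both passed and failed on the same commit.
// `history` is assumed to be an array of { testName, commitSha, passed }
// records exported from your CI system (this shape is hypothetical).
function findFlakyTests(history) {
  const outcomesByTestAndCommit = new Map();

  for (const { testName, commitSha, passed } of history) {
    const key = `${testName}@${commitSha}`;
    if (!outcomesByTestAndCommit.has(key)) {
      outcomesByTestAndCommit.set(key, new Set());
    }
    outcomesByTestAndCommit.get(key).add(passed);
  }

  const flaky = new Set();
  for (const [key, outcomes] of outcomesByTestAndCommit) {
    if (outcomes.size > 1) {
      flaky.add(key.slice(0, key.lastIndexOf('@'))); // saw both pass and fail
    }
  }
  return [...flaky];
}

module.exports = { findFlakyTests };

A fuller analysis also weighs failure frequency and correlates failures with time of day and test order, which is where the cause categorization comes from.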

Diagnosing Root Causes

Once you know which tests are flaky, AI can analyze the test code to determine why:

@devonair diagnose flaky test: UserProfile.test.js

The AI examines:

  • Async operations without proper awaiting
  • Shared state across tests
  • External dependencies without mocks
  • Time-sensitive assertions
  • Resource management patterns

The output isn't just "this test is flaky" but "this test is flaky because it doesn't await the API call on line 23."
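
Pointers like that come from static analysis of the test source. As a rough, regex-based illustration of the kinds of checks involved (a real diagnosis works on the parsed AST; the patterns below are simplified assumptions, not Devonair's actual rules):

// flaky-heuristics.js: simplified illustration only; real tools parse the AST.
const CHECKS = [
  { cause: 'assertion may run before data loads', pattern: /expect\(screen\.getBy/ },
  { cause: 'time-sensitive code', pattern: /\bDate\.now\(\)|\bnew Date\(\)/ },
  { cause: 'unmocked network call', pattern: /\bfetch\(|\baxios\./ },
  { cause: 'non-deterministic value', pattern: /\bMath\.random\(/ },
];

function scanTestFile(source) {
  const findings = [];
  source.split('\n').forEach((line, index) => {
    for (const { cause, pattern } of CHECKS) {
      if (pattern.test(line)) {
        findings.push({ line: index + 1, cause, snippet: line.trim() });
      }
    }
  });
  return findings;
}

module.exports = { scanTestFile };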

Practical AI Repair Techniques

Here's how to use AI to actually fix flaky tests, category by category.

Fixing Async/Timing Issues

Timing issues are the most common cause of flakiness and often the easiest to fix with AI.

Before:

test('loads and displays data', () => {
  render(<DataTable />);
  expect(screen.getByRole('table')).toBeInTheDocument();
  expect(screen.getAllByRole('row')).toHaveLength(10);
});

AI prompt:

@devonair fix timing issues in DataTable.test.js

After:

test('loads and displays data', async () => {
  render(<DataTable />);

  await waitFor(() => {
    expect(screen.getByRole('table')).toBeInTheDocument();
  });

  await waitFor(() => {
    expect(screen.getAllByRole('row')).toHaveLength(10);
  });
});

AI identifies assertions that need async handling and adds appropriate wait patterns.

Fixing Shared State Issues

State isolation requires understanding what state is being shared and how to isolate it.

Before:

describe('UserService', () => {
  const service = new UserService();

  test('creates user', () => {
    service.create({ name: 'Alice' });
    expect(service.count()).toBe(1);
  });

  test('deletes user', () => {
    service.delete('Alice');
    expect(service.count()).toBe(0); // Fails if create test didn't run first
  });
});

AI prompt:

@devonair isolate shared state in UserService.test.js

After:

describe('UserService', () => {
  let service;

  beforeEach(() => {
    service = new UserService();
  });

  afterEach(() => {
    service.reset();
  });

  test('creates user', () => {
    service.create({ name: 'Alice' });
    expect(service.count()).toBe(1);
  });

  test('deletes user', () => {
    service.create({ name: 'Bob' });
    service.delete('Bob');
    expect(service.count()).toBe(0);
  });
});

AI moves shared instances to setup/teardown and ensures each test is self-contained.

Mocking External Dependencies

External dependencies need to be mocked before a test can be reliable. AI can identify external calls and generate appropriate mocks.

Before:

test('sends notification', async () => {
  const result = await sendPushNotification({
    userId: '123',
    message: 'Hello'
  });
  expect(result.delivered).toBe(true);
});

AI prompt:

@devonair add mocks for external dependencies in notification.test.js

After:

import { mockPushService } from '../__mocks__/pushService';

jest.mock('../services/pushService');

test('sends notification', async () => {
  mockPushService.send.mockResolvedValue({ delivered: true, id: 'notif-123' });

  const result = await sendPushNotification({
    userId: '123',
    message: 'Hello'
  });

  expect(result.delivered).toBe(true);
  expect(mockPushService.send).toHaveBeenCalledWith({
    userId: '123',
    message: 'Hello'
  });
});

AI identifies the external call, creates an appropriate mock, and updates the test to use it.

Fixing Time-Dependent Tests

Tests that depend on dates or times need controlled time environments.

Before:

test('shows expiration warning', () => {
  const subscription = { expiresAt: '2025-01-15' };
  const warning = getExpirationWarning(subscription);
  expect(warning).toBe('Expires in 5 days'); // Only passes on Jan 10
});

AI prompt:

@devonair fix time-dependent tests in subscription.test.js

After:

test('shows expiration warning', () => {
  jest.useFakeTimers();
  jest.setSystemTime(new Date('2025-01-10T12:00:00Z'));

  const subscription = { expiresAt: '2025-01-15' };
  const warning = getExpirationWarning(subscription);

  expect(warning).toBe('Expires in 5 days');

  jest.useRealTimers();
});

AI adds time mocking and sets a specific time that makes the assertion deterministic.

Adding Resource Cleanup

Resource leaks cause flakiness that gets worse over time. AI can identify missing cleanup.

Before:

test('handles database query', async () => {
  const db = await connectDatabase();
  const results = await db.query('SELECT * FROM users');
  expect(results).toHaveLength(5);
});

AI prompt:

@devonair add resource cleanup to database.test.js

After:

describe('database queries', () => {
  let db;

  beforeAll(async () => {
    db = await connectDatabase();
  });

  afterAll(async () => {
    await db.close();
  });

  test('handles database query', async () => {
    const results = await db.query('SELECT * FROM users');
    expect(results).toHaveLength(5);
  });
});

AI identifies resources that need cleanup and adds appropriate lifecycle hooks.

Workflow Integration

Fixing flaky tests shouldn't be a periodic cleanup project. It should be integrated into normal development.

Continuous Monitoring

Set up AI to continuously monitor for flakiness:

@devonair monitor test suite for flakiness
- Alert when test failure rate exceeds 5%
- Track flakiness trends over time
- Identify newly flaky tests immediately

This catches flakiness early, before it becomes normalized.
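
The raw material for this kind of monitoring is per-test outcome history. If you want to collect it yourself, a custom Jest reporter can append every result to a log that a separate job aggregates. A minimal sketch (the output file, record shape, and GIT_COMMIT variable are assumptions, not requirements):

// flakiness-reporter.js: append one JSON line per test result so a separate
// job can compute failure rates and spot newly flaky tests.
// Register in jest.config.js: reporters: ['default', '<rootDir>/flakiness-reporter.js']
const fs = require('fs');

class FlakinessReporter {
  onTestResult(test, testResult) {
    for (const result of testResult.testResults) {
      const record = {
        name: result.fullName,
        status: result.status,                      // 'passed', 'failed', 'pending', ...
        file: test.path,
        commit: process.env.GIT_COMMIT || 'unknown', // whatever your CI exposes
        timestamp: Date.now(),
      };
      fs.appendFileSync('test-history.ndjson', JSON.stringify(record) + '\n');
    }
  }
}

module.exports = FlakinessReporter;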

Automated Repair PRs

When AI identifies a flaky test, it can automatically create a fix PR:

@devonair auto-fix flaky tests in PR
- Diagnose root cause
- Generate fix
- Run tests to verify fix
- Create PR with explanation

The PR explains what was flaky, why, and how the fix addresses it.

Pre-Merge Flakiness Check

Before tests become part of the main suite, check for potential flakiness:

@devonair check new tests for flakiness patterns

This catches common mistakes before they become problems.
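
A simple non-AI complement to this gate is repetition: run new or changed test files several times in CI and fail the check if the results disagree. A rough sketch using the Jest CLI (the run count and invocation are illustrative):

// check-flakiness.js: rerun the given test files several times; if the runs
// disagree, the tests are nondeterministic and the check fails.
// Usage: node check-flakiness.js src/NewFeature.test.js
const { spawnSync } = require('child_process');

const files = process.argv.slice(2);
const RUNS = 5;
const outcomes = [];

for (let i = 0; i < RUNS; i++) {
  const run = spawnSync('npx', ['jest', ...files, '--silent'], { encoding: 'utf8' });
  outcomes.push(run.status === 0);
}

if (new Set(outcomes).size > 1) {
  console.error(`Flaky: results disagreed across ${RUNS} runs (${outcomes.join(', ')})`);
  process.exit(1);
}

console.log(`Consistent across ${RUNS} runs`);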

What AI Can't Fix

AI is effective for pattern-based flakiness, but some issues require human investigation.

Complex Race Conditions

Some race conditions involve subtle timing between multiple async operations. AI can identify that a race condition exists but may not determine the correct fix without deeper understanding of the intended behavior.

Business Logic Errors

If a test is flaky because the business logic it's testing is wrong, AI will try to make the test match the broken behavior. Humans need to determine whether the test or the code is correct.

Infrastructure Issues

Flakiness caused by CI infrastructure—resource limits, network configuration, parallel execution conflicts—isn't visible in the test code. AI can identify that a test fails only in CI but can't fix infrastructure problems.

Intentionally Non-Deterministic Behavior

Some code is supposed to be non-deterministic (random selection, load balancing). Tests for this code need special handling that requires understanding the intent.

Measuring Progress

Track these metrics to measure flaky test improvement:

Flaky Test Rate

Percentage of test runs with at least one flaky failure. Should decrease over time.
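
If your CI retries failures within a run, this is straightforward to compute: a run counts as flaky when something failed and then passed on retry with no code change. A minimal sketch, assuming a hypothetical record shape:

// Each run record is assumed to look like:
// { runId, initialFailures: ['test name', ...], passedOnRetry: ['test name', ...] }
function flakyTestRate(runs) {
  if (runs.length === 0) return 0;
  const flakyRuns = runs.filter((run) => run.passedOnRetry.length > 0);
  return (flakyRuns.length / runs.length) * 100;
}

// Example: 2 of 4 runs had a failure rescued by a retry -> 50%
console.log(flakyTestRate([
  { runId: 1, initialFailures: [], passedOnRetry: [] },
  { runId: 2, initialFailures: ['creates user'], passedOnRetry: ['creates user'] },
  { runId: 3, initialFailures: ['deletes user'], passedOnRetry: [] }, // genuine failure
  { runId: 4, initialFailures: ['creates user'], passedOnRetry: ['creates user'] },
]));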

Mean Time to Detect Flakiness

How long from when a test becomes flaky to when it's identified. AI monitoring should reduce this to hours, not weeks.

Mean Time to Fix Flakiness

How long from identification to fix. AI-assisted fixes should take minutes for common patterns.

Re-Run Rate

How often developers re-run CI hoping for a different result. Should approach zero as flakiness is eliminated.

Quarantine Size

Number of tests quarantined as unreliable. This should shrink, not grow.

Getting Started

Start with the highest-impact flaky tests—the ones that fail most often and block the most merges.

Step 1: Identify Your Worst Offenders

Review CI history for the past month. Which tests failed most often without associated code changes? These are your flaky tests.

Step 2: Categorize the Causes

For each flaky test, determine the likely cause:

  • Timing/async issues
  • Shared state
  • External dependencies
  • Time sensitivity
  • Resource leaks

Step 3: Fix One Category at a Time

Start with the most common category. Use AI to fix all tests in that category, then move to the next.

Step 4: Establish Ongoing Monitoring

Once existing flakiness is addressed, set up continuous monitoring to catch new flakiness immediately.

Step 5: Gate New Tests

Add flakiness pattern detection to your PR process so new tests don't introduce the same problems.

Conclusion

Flaky tests don't have to be a fact of life. They're caused by identifiable patterns: timing issues, shared state, external dependencies, time sensitivity, and resource leaks. AI can diagnose these patterns and generate fixes for most cases.

The goal isn't a magic button that fixes all flakiness. It's reducing the tedious diagnosis and repair work so developers can focus on the tests that require genuine investigation.

Stop quarantining flaky tests. Start fixing them systematically.


FAQ

How accurate is AI at fixing flaky tests?

For common patterns (timing issues, missing mocks, shared state), AI fixes work correctly most of the time. Complex race conditions or logic errors may need human intervention. Every fix goes through PR review, so incorrect fixes get caught.

Will AI-generated test fixes break my tests?

All fixes are verified by running the test suite. If the fix breaks something, the PR will show failing tests. You review before merging.

How long does it take to fix a flaky test with AI?

For common patterns, AI generates a fix in seconds to minutes, and review and merge take a few more minutes. Compare that to hours of manual debugging for a race condition.

Should I fix flaky tests or delete them?

Fix them if they test valuable functionality. Delete them if they test trivial behavior that's covered elsewhere or if the code they test is being removed. AI can help identify tests that provide little value.

What if AI keeps flagging tests as flaky that aren't?

Adjust flakiness detection thresholds. A test that fails once in 1000 runs might not be worth fixing. Focus on tests with meaningful failure rates.