
[AI in QA] AI in QA is unreliable - unless you do this

When teams start using AI in QA, the first results often feel impressive. You ask for a Playwright test, the model generates something reasonable, and it looks like you’ve just accelerated your entire testing process overnight. That feeling rarely lasts.

After a few iterations, cracks begin to appear: failures are “fixed” in ways that don’t actually work, flaky tests start to multiply instead of disappearing, and confidence in the system slowly begins to erode.

At that point, most teams assume the issue lies in the model or the prompt. They try a different provider, a larger model, or tweak the prompt again and again. In reality, the model is rarely the root cause. The problem is the architecture around it.


AI in QA is not a chatbot, it's a system

One of the most important mindset shifts you can make is to stop thinking about AI as a prompt-based test generator and start thinking about it as a system.

This distinction explains why two teams using the exact same model can see completely different results. The Confucius Code Agent demonstrated this clearly by achieving significantly higher performance on SWE-Bench without changing the model at all. The improvement came entirely from better handling of context, memory, and tool orchestration. The difference between an AI that occasionally generates useful tests and one that consistently improves your pipeline lies in how you design the system around it.




Memory: agents that don’t learn

One of the most common architectural gaps is the absence of memory.

Most AI agents behave as if every test run is the first time they have ever seen your codebase. They do not remember that a selector was flaky yesterday. They do not recall that a login test failed because of a race condition. They simply regenerate solutions based on the current prompt, often repeating the same mistakes.

To understand how limiting this is, imagine a QA engineer who forgets everything after each test run. They would continuously rediscover the same issues, propose the same incorrect fixes, and never improve over time. This is exactly how most AI-driven QA systems operate today.

A simple but powerful improvement is to introduce structured memory that captures not just errors, but their causes and resolutions. For example, after a failing test, instead of storing raw logs, the system can persist a structured insight such as:

{
  "problem": "login test flaky",
  "cause": "race condition",
  "fix": "wait for API response",
  "confidence": 0.9
}

When a similar issue appears again, the system does not start from scratch. It retrieves relevant past knowledge and injects it into the context before generating a solution. This approach, known as Retrieval-Augmented Generation (RAG), allows the agent to behave less like a stateless generator and more like an engineer who learns from experience.

However, this only works if memory is treated carefully. Storing everything leads to noise, and noise leads to worse decisions. The key is to store only high-quality insights and retrieve only the most relevant ones, typically limiting context injection to a small number of entries that directly relate to the current problem.
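The memory described above can be sketched in a few lines. This is a hypothetical illustration, not a library: the `Insight` shape mirrors the JSON example earlier, and the naive keyword-overlap retrieval stands in for a real vector search. Note how both rules from this section show up in code: a confidence threshold keeps low-quality insights out of the store, and retrieval is capped at a small number of entries.

```typescript
// Hypothetical structured memory for a QA agent (names are illustrative).
interface Insight {
  problem: string;
  cause: string;
  fix: string;
  confidence: number;
}

class QaMemory {
  private insights: Insight[] = [];

  // Persist only high-confidence insights to keep the store low-noise.
  record(insight: Insight, minConfidence = 0.7): void {
    if (insight.confidence >= minConfidence) this.insights.push(insight);
  }

  // Retrieve the few entries most relevant to the current failure,
  // scored here by simple keyword overlap with the problem description.
  retrieve(problem: string, limit = 3): Insight[] {
    const words = new Set(problem.toLowerCase().split(/\s+/));
    return this.insights
      .map((insight) => ({
        insight,
        score: insight.problem
          .toLowerCase()
          .split(/\s+/)
          .filter((w) => words.has(w)).length,
      }))
      .filter((scored) => scored.score > 0)
      .sort((a, b) => b.score - a.score)
      .slice(0, limit)
      .map((scored) => scored.insight);
  }
}

const memory = new QaMemory();
memory.record({
  problem: "login test flaky",
  cause: "race condition",
  fix: "wait for API response",
  confidence: 0.9,
});
memory.record({
  problem: "checkout test timeout",
  cause: "slow third-party script",
  fix: "mock payment provider",
  confidence: 0.4, // dropped: below the confidence threshold
});

// Before generating a fix, inject only the relevant past knowledge.
const relevant = memory.retrieve("login test failing intermittently");
```

In a production setup the retrieval step would use embeddings rather than keyword overlap, but the contract stays the same: a small, relevant slice of past insights enters the context, nothing else.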



Why mixing context layers breaks your AI

Another subtle but critical issue appears in how teams handle context. In many implementations, everything is thrown into the prompt: CI logs, stack traces, debug messages, and human-readable notes.

At first glance, this seems helpful. More information should lead to better decisions, right? In practice, it has the opposite effect.

Different types of information serve different purposes. Logs are written for humans and often contain noise, repetition, and irrelevant details. AI models, on the other hand, perform best when the input is structured, concise, and semantically meaningful.

Consider the difference between passing a raw error log and a structured interpretation of that error. A log might include hundreds of lines of stack traces, while a structured input might simply state that a specific test failed due to a timeout likely caused by a missing await on an API call. Both contain the same underlying information, but only one is usable in a predictable way.

Separating what the human sees from what the agent processes is one of the most effective ways to improve output quality. It reduces noise, clarifies intent, and allows the agent to focus on reasoning instead of parsing.
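The separation can be made concrete with a small translation layer between the log and the agent. This is a sketch under assumptions: the regex-based parsing is a naive stand-in (a real pipeline would read the reporter’s JSON output directly), and the field names are illustrative. The point is the shape of the handoff, not the parser.

```typescript
// Hypothetical layer that condenses a raw CI log (written for humans)
// into the compact, structured form an agent actually needs.
interface FailureSummary {
  test: string;
  errorType: string;
  hint: string;
}

function summarizeLog(rawLog: string): FailureSummary | null {
  // Naive extraction; a real implementation would parse reporter JSON.
  const testMatch = rawLog.match(/FAIL(?:ED)?[:\s]+([\w .-]+)/i);
  if (!testMatch) return null;
  const isTimeout = /timeout/i.test(rawLog);
  return {
    test: testMatch[1].trim(),
    errorType: isTimeout ? "timeout" : "unknown",
    hint: isTimeout
      ? "likely a missing await on an API call"
      : "inspect stack trace manually",
  };
}

// Hundreds of lines of stack trace collapse into three meaningful fields.
const rawLog = [
  "FAILED: login spec",
  "Error: Timeout 30000ms exceeded.",
  "    at /app/tests/login.spec.ts:12:5",
  "    ... 200 more lines of stack trace ...",
].join("\n");

const summary = summarizeLog(rawLog);
```

The human still gets the full log for auditing; the agent only ever sees the summary.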


Define roles, enforce boundaries

As systems grow more complex, many teams attempt to solve problems by adding more agents. Unfortunately, without proper structure, this often makes things worse. Multiple agents operating without clear boundaries tend to interfere with each other. One agent may generate a test while another modifies it, leading to unpredictable results. Execution order becomes unclear, and implicit assumptions creep into the system. The root cause is a misunderstanding of how multi-agent systems behave.

Each agent should be treated as a component with a clearly defined responsibility, explicit inputs and outputs, and strict limitations on what it can and cannot do. For example, a test-generation agent should focus solely on creating tests, while a debugging agent should handle failures without introducing new tests. If both can modify the same files, you introduce race conditions that are difficult to detect and even harder to debug. By defining roles and enforcing boundaries, you transform a chaotic set of interactions into a predictable and scalable system.
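One way to enforce these boundaries is to give each agent a narrow, typed contract and route all file access through a single orchestrator. The sketch below is hypothetical (the interfaces and the in-memory file map are illustrative), but it shows the rule in action: the generator may only create files, the debugger may only modify existing ones, and neither touches storage directly.

```typescript
// Illustrative role contracts (names are assumptions, not a real API).
interface GeneratedTest { path: string; source: string }
interface TestFailure { path: string; message: string }
interface ProposedFix { path: string; patchedSource: string }

// Only responsibility: produce new tests. Never modifies existing ones.
interface TestGenerationAgent {
  generate(requirement: string): GeneratedTest;
}

// Only responsibility: propose fixes for failures. Never creates tests.
interface DebuggingAgent {
  diagnose(failure: TestFailure, currentSource: string): ProposedFix;
}

// The orchestrator alone owns the files and the execution order,
// so the two agents can never race on the same artifact.
class Orchestrator {
  private files = new Map<string, string>();

  constructor(
    private generator: TestGenerationAgent,
    private fixerAgent: DebuggingAgent,
  ) {}

  addTest(requirement: string): string {
    const test = this.generator.generate(requirement);
    if (this.files.has(test.path)) {
      throw new Error(`generator may not overwrite ${test.path}`);
    }
    this.files.set(test.path, test.source);
    return test.path;
  }

  applyFix(failure: TestFailure): void {
    const source = this.files.get(failure.path);
    if (source === undefined) {
      throw new Error("debugger may only touch existing tests");
    }
    const fix = this.fixerAgent.diagnose(failure, source);
    this.files.set(fix.path, fix.patchedSource);
  }

  read(path: string): string | undefined {
    return this.files.get(path);
  }
}

// Toy stub agents to show the flow end to end.
const orchestrator = new Orchestrator(
  { generate: (req) => ({ path: "login.spec.ts", source: `// test: ${req}` }) },
  {
    diagnose: (failure, src) => ({
      path: failure.path,
      patchedSource: src + "\n// fix: await the login API response",
    }),
  },
);

const testPath = orchestrator.addTest("user can log in");
orchestrator.applyFix({ path: testPath, message: "timeout" });
```

The guard clauses are the boundary: a misbehaving agent fails loudly instead of silently clobbering another agent’s work.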


The missing loop

Perhaps the most critical difference between a fragile AI setup and a robust one is the presence of an execution loop. Many implementations stop at generation. The agent produces a test, and a human runs it later. This breaks the feedback cycle and prevents the system from learning. A proper system closes the loop. It generates a test, runs it immediately using a command like:

npx playwright test --reporter=json

Then it parses the results, identifies failures, and attempts to fix them before running the test again.
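The loop itself is small. In the sketch below, `runTests` would normally shell out to `npx playwright test --reporter=json` and parse the report; here both the runner and the fixer are injected as functions (with toy stand-ins) so the control flow, not the plumbing, is the focus. Every name in this block is an illustrative assumption.

```typescript
// Hypothetical generate → run → diagnose → fix loop.
interface RunResult { passed: boolean; failures: string[] }

type Runner = (testSource: string) => RunResult;
type Fixer = (testSource: string, failures: string[]) => string;

// Run, and on failure let the agent propose a fix grounded in the
// actual test output, until the test passes or the budget runs out.
function closeTheLoop(
  initialTest: string,
  runTests: Runner,
  proposeFix: Fixer,
  maxAttempts = 3,
): { source: string; passed: boolean; attempts: number } {
  let source = initialTest;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const result = runTests(source);
    if (result.passed) return { source, passed: true, attempts: attempt };
    source = proposeFix(source, result.failures);
  }
  return { source, passed: false, attempts: maxAttempts };
}

// Toy stand-ins: the "runner" fails until a wait is added,
// and the "fixer" adds it after seeing the timeout failure.
const runner: Runner = (src) =>
  src.includes("waitForResponse")
    ? { passed: true, failures: [] }
    : { passed: false, failures: ["login.spec.ts: Timeout 30000ms exceeded"] };

const fixer: Fixer = (src) =>
  src + "\nawait page.waitForResponse('**/api/login');";

const outcome = closeTheLoop("await page.click('#login');", runner, fixer);
```

The attempt budget matters: without it, a fix that never converges would burn tokens indefinitely, so the loop stops and escalates to a human instead.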

This continuous cycle transforms the agent from a passive generator into an active participant in the QA process. It grounds decisions in real outcomes rather than assumptions and enables iterative improvement.


Building this in practice

This is exactly the gap I help teams close at scale-qa.com. Instead of focusing on prompts or isolated tools, I work on designing end-to-end systems that integrate AI into Playwright workflows in a way that is reliable, scalable, and safe for production environments. The goal is not to experiment with AI, but to make it a dependable part of your QA strategy.

