Why Most AI Projects Fail (And It's Not the Models)
The model is almost never the thing that's broken. After a decade of shipping production software and two years of shipping AI agents on top of it, the pattern is hard to miss.
Every few weeks someone emails us to say their AI thing is broken. Could we please swap in Claude Opus 4.6, or GPT-5, or whatever just dropped? They're pretty sure that will fix it.
We almost never touch the model. Nine times out of ten, the model is fine. The problem is somewhere else, and swapping the model just hides the real issue for another quarter. Then the bill gets bigger and the problem comes back.
Here's what's actually going wrong when AI projects stall, ranked by how often we see it.
1. The data is a mess and nobody wants to say it out loud
This is the big one. A client last year had a customer support RAG system that kept giving weird, confident, wrong answers. They wanted us to try a different embedding model. We asked to see the knowledge base first.
The knowledge base was eight years of exported PDFs, old Confluence pages, two conflicting versions of the refund policy, and a folder called "DO NOT USE - OLD" that was somehow still being indexed. The AI was doing exactly what it was told. It was finding the closest match in a pile of contradictory documents and answering confidently. That's the job.
We didn't change the model. We spent two weeks cleaning the source documents, adding a freshness filter, and marking which pages were canonical. Accuracy went from something like 60% to 94% on their eval set. Same model.
If your AI system is wrong a lot, look at what it's reading before you look at how it's thinking.
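The "freshness filter plus canonical flag" fix above can be sketched in a few lines. This is a hypothetical illustration, not the client's actual pipeline: the field names (`canonical`, `updated`) and the two-year cutoff are assumptions.

```python
from datetime import datetime, timedelta

# Hypothetical document records; field names are illustrative.
docs = [
    {"title": "Refund policy v2", "canonical": True,
     "updated": datetime(2024, 11, 3)},
    {"title": "Refund policy (old)", "canonical": False,
     "updated": datetime(2019, 6, 1)},
]

MAX_AGE = timedelta(days=365 * 2)  # assumed cutoff: drop anything older than two years

def indexable(doc, now):
    """Keep only canonical documents that are fresh enough to trust."""
    return doc["canonical"] and (now - doc["updated"]) <= MAX_AGE

clean = [d for d in docs if indexable(d, now=datetime(2025, 1, 1))]
# Only the current canonical policy survives; the stale copy never reaches the index.
```

The point is that the filter runs before indexing, so the retriever never sees the contradictory documents in the first place.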
2. There's no feedback loop, so nobody knows if it's working
This one is quieter and more dangerous. A team ships an AI feature, it demos well, the CEO is happy, and then it just sits there for six months. Nobody is logging which responses users accepted, which they edited, which they rejected. No one has written a single evaluation. The whole thing is running on vibes.
The minimum viable feedback loop is boring and unglamorous:
- Log the input, the output, and what the user did next.
- Pick 50 real examples and label them by hand. That's your eval set.
- Run the eval every time you change anything. Anything at all.
- When a user flags a bad response, add it to the eval set.
That's it. That's the whole loop. It takes maybe a day to set up and it turns "we think it's working" into "it's passing 47 of 50 cases, and we know exactly which three are broken." Without it, you're going to keep swapping models and crossing your fingers.
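The whole loop fits in a page of code. Here is a minimal sketch of the two pieces: an interaction logger and an eval runner. The file paths, record fields, and the toy agent are all illustrative assumptions, not a prescribed schema.

```python
import json
from pathlib import Path

LOG_PATH = Path("interactions.jsonl")  # illustrative path

def log_interaction(user_input, model_output, user_action):
    """Append one record: what went in, what came out, what the user did next."""
    record = {"input": user_input, "output": model_output, "action": user_action}
    with LOG_PATH.open("a") as f:
        f.write(json.dumps(record) + "\n")

def run_evals(agent, eval_cases):
    """Run every labeled case through the agent; return pass count and failures."""
    failures = [c for c in eval_cases if agent(c["input"]) != c["expected"]]
    return len(eval_cases) - len(failures), failures

# Toy agent and hand-labeled cases, just to show the shape of the loop.
eval_cases = [
    {"input": "2+2", "expected": "4"},
    {"input": "capital of France", "expected": "Paris"},
]
toy_agent = lambda q: {"2+2": "4", "capital of France": "Paris"}.get(q, "?")
passed, failed = run_evals(toy_agent, eval_cases)
```

Run `run_evals` in CI on every prompt or data change, and append each user-flagged failure to the eval file. That is the entire mechanism behind "47 of 50 cases."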
3. Nobody owns it
The agent is an org chart problem disguised as a technical one. Who decides what the agent can and can't do? Who approves the prompt when someone wants to change it? Who gets paged when it starts hallucinating on a Saturday? If the answer to any of those is "I don't know" or "we're still figuring that out," the project is going to drift.
The clients who get real value from AI systems all have one person whose job includes "this agent works." Sometimes it's a product manager, sometimes it's an ops lead, sometimes it's the founder. It's almost never "everyone."
4. The success metric is fake
A lot of AI projects get measured on things like "time saved" or "tickets handled" without anyone checking whether the work that got done was actually good. That's how you end up with an auto-responder that answers 10,000 emails a month and quietly tells 400 of them the wrong return window.
Pick a metric that only moves when the thing is actually working. For a support agent, that looks like "tickets resolved without human escalation, with no follow-up from the customer within 48 hours." For a document extraction agent, "fields extracted correctly, sampled against human review." The metric should make you nervous. If a metric doesn't make you nervous, it's not measuring anything.
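The support-agent metric above is concrete enough to compute directly. A minimal sketch, assuming hypothetical ticket fields (`escalated_to_human`, `resolved_at`, `customer_messages`):

```python
from datetime import datetime, timedelta

FOLLOW_UP_WINDOW = timedelta(hours=48)

def truly_resolved(ticket):
    """Count a ticket only if no human stepped in AND the customer stayed quiet."""
    if ticket["escalated_to_human"]:
        return False
    follow_ups = [t for t in ticket["customer_messages"]
                  if ticket["resolved_at"] < t <= ticket["resolved_at"] + FOLLOW_UP_WINDOW]
    return not follow_ups

# Illustrative data: one clean resolution, one bounce-back, one escalation.
tickets = [
    {"escalated_to_human": False, "resolved_at": datetime(2025, 1, 1, 9),
     "customer_messages": []},
    {"escalated_to_human": False, "resolved_at": datetime(2025, 1, 1, 9),
     "customer_messages": [datetime(2025, 1, 2, 9)]},
    {"escalated_to_human": True, "resolved_at": datetime(2025, 1, 1, 9),
     "customer_messages": []},
]
rate = sum(truly_resolved(t) for t in tickets) / len(tickets)
```

Note what the naive "tickets handled" metric would report here: three out of three. The honest metric reports one out of three, and that gap is exactly the 400 wrong return windows.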
5. Somebody promised 'it's just a prompt'
Prompting is a real skill. Prompting alone is not a product. The gap between "we wrote a great prompt in the playground" and "this runs 10,000 times a day against real user input and handles edge cases gracefully" is enormous. It's the gap between a demo and a system. Most of the engineering is in that gap.
You need retries, fallbacks, timeouts, rate limit handling, input validation, output validation, logging, auditing, caching, and a way to roll back a bad prompt change. Every one of those has bitten us in production. None of them are in the playground.
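To make the retries-and-fallbacks point concrete, here is a bare-bones sketch of one item from that list: a retry wrapper with exponential backoff and a fallback. The function names are made up for illustration; a production version would also distinguish retryable errors from fatal ones and enforce timeouts.

```python
import time

def call_with_retries(call, fallback, attempts=3, base_delay=1.0):
    """Retry the primary call with exponential backoff; fall back if it keeps failing."""
    for attempt in range(attempts):
        try:
            return call()
        except Exception:
            if attempt < attempts - 1:
                time.sleep(base_delay * (2 ** attempt))
    return fallback()

# Toy primary call that fails twice (e.g. rate limited), then succeeds.
state = {"calls": 0}
def flaky_model():
    state["calls"] += 1
    if state["calls"] < 3:
        raise RuntimeError("rate limited")
    return "ok"

result = call_with_retries(flaky_model, lambda: "fallback answer", base_delay=0)
```

That is one of ten items on the list, and it is the simplest one. None of it shows up in the playground, all of it shows up in production.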
What actually helps
If you're staring at a stuck AI project, here's the order we'd work the problem in:
- Spend a day looking at real outputs by hand. Not metrics. Actual inputs and outputs, read one at a time. Patterns will jump out that no dashboard will show you.
- Look at the data the system is reading. Is it clean? Is it current? Is it canonical? Most of the time you'll find the fix right here.
- Build an eval set of 30-50 examples that you can run automatically. Label them. Keep them updated.
- Name a single owner. Make sure they have authority to change the prompt and the data.
- Only after all of that, and only if the evals are still bad, do you start thinking about a different model.
The models will keep getting better. That's a given. But a better model on broken data with no feedback loop and no owner is still going to fail. It's just going to fail faster and more expensively.
If your AI project is stuck, we can usually tell you in 30 minutes whether it's the model or something else. It's almost always something else.