How to Scope an AI Agent Project Without Getting Burned
The biggest predictor of whether an AI agent project succeeds isn't the tech. It's whether the scope document is honest. Here's what a good one looks like.
I've seen maybe sixty AI agent scope documents in the last two years. The ones where the project went well and the ones where it went sideways look almost nothing alike, and you can usually tell which is which in about ten minutes of reading.
The good ones are short, specific, and slightly boring. The bad ones are ambitious, full of adjectives, and impossible to verify. Here's what to put in a scope document that actually holds up, and the red flags I've learned to watch for.
Start with one task, not a job
The single most common failure mode is scoping an agent to do a "job" instead of a task. "Handle customer support" is a job. "Classify incoming support tickets into the right category and suggest a draft reply based on our help center" is a task.
The task version is scopeable. You can build it, test it, measure it, and ship it. The job version cannot be built because it isn't one thing: it's twelve things pretending to be one, and each of the twelve has edge cases the others don't.
If you cannot describe what the agent does in one sentence without using the word "and," break it up. One scope per task.
Write down the inputs and outputs like a function
This sounds obvious and people still don't do it. Before anyone writes a line of code, your scope should describe:
- What triggers the agent (a webhook? a cron? a user action? an email?)
- What data it receives (exact fields, source systems, schemas)
- What data it produces (exact fields, destination, format)
- What happens if any of the above is missing or malformed
- What it is absolutely not allowed to do
Think of it like you're writing the signature of a function. If you can't agree on the signature, you're not ready to start. And if your scope doesn't have a "what it cannot do" section, you're going to discover that section the hard way.
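To make the "signature" idea concrete, here's a minimal sketch of what the ticket-classification task from earlier might look like as one. Every field name, category, and the stand-in classification rule are illustrative, not a real schema:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TicketInput:
    ticket_id: str                       # from the helpdesk system
    subject: str
    body: str
    customer_tier: Optional[str] = None  # may be missing; the scope says what then

@dataclass
class TicketOutput:
    ticket_id: str
    category: str      # one of a fixed, agreed-upon set
    draft_reply: str   # suggested only; the agent never sends it
    confidence: float  # 0.0 to 1.0, drives the handoff logic

def classify_ticket(ticket: TicketInput) -> TicketOutput:
    """The signature everyone agrees on before any code exists.

    Not allowed to: send replies, close tickets, or modify customer data.
    """
    if not ticket.body.strip():
        # Malformed input is part of the contract, not an afterthought
        return TicketOutput(ticket.ticket_id, "other", "", 0.0)
    # Stand-in for the model call; a real agent would run inference here
    category = "billing" if "invoice" in ticket.body.lower() else "other"
    return TicketOutput(ticket.ticket_id, category, f"Re: {ticket.subject}", 0.75)
```

The point isn't the body, which will change; it's that the types, the defaults, and the "not allowed" line are all written down before anyone argues about models.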
Define success before you start building
Every scope should have a number. Not vibes, a number. Two examples that work:
- "The agent correctly categorizes 90% of tickets on a sample of 200 real tickets that the ops lead has pre-labeled. We will not ship until that threshold is hit."
- "The agent produces compliant summaries in under 30 seconds for 95% of the test corpus. Latency over 30 seconds counts as a failure."
The word "accurate" without a number is a lie. The word "fast" without a number is a lie. "Soon" is a lie. Numbers force honest conversations and they give you a ship criterion. Without them you end up in endless rounds of "is it good enough yet?"
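A number like "90% on 200 pre-labeled tickets" is also small enough to be a script, not a debate. A toy sketch of a ship gate, assuming you have predictions and the ops lead's labels side by side:

```python
def accuracy(predictions: list[str], labels: list[str]) -> float:
    """Fraction of predictions matching the pre-labeled sample."""
    assert len(predictions) == len(labels), "eval set mismatch"
    correct = sum(p == l for p, l in zip(predictions, labels))
    return correct / len(labels)

SHIP_THRESHOLD = 0.90  # the number from the scope, not a vibe

def ship_gate(predictions: list[str], labels: list[str]) -> bool:
    """True means shippable; anything else means keep working."""
    score = accuracy(predictions, labels)
    print(f"accuracy: {score:.1%} (threshold {SHIP_THRESHOLD:.0%})")
    return score >= SHIP_THRESHOLD
```

Ten lines, and "is it good enough yet?" becomes a boolean.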
List every system the agent has to touch
Integration is where AI agent timelines go to die. Every system the agent reads from or writes to is a potential two-week surprise. Write them all down, and for each one, answer:
- Does a real API exist? (Not "they said they have an API" — a documented one you can read.)
- Who owns the credentials? Who can provision them?
- What's the rate limit? What happens when we hit it?
- Is there a sandbox environment or do we have to test in production?
- If this system goes down, what should the agent do?
I have watched a four-week project turn into a four-month project because the phrase "we'll just connect it to our ERP" was in the scope without any of these questions answered. The ERP team was not expecting us, the sandbox didn't exist, and the only person with credentials was on sabbatical. That was the whole story.
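The last question on that list, "if this system goes down, what should the agent do?", is worth answering in code, not just in prose. A hedged sketch of one common answer (retry a few times, then degrade deliberately); the retry count, delay, and fallback behavior are exactly the things the scope should pin down:

```python
import time

def call_with_fallback(fetch, fallback, retries: int = 3, delay: float = 1.0):
    """Try the external system; on repeated failure, degrade on purpose."""
    for attempt in range(retries):
        try:
            return fetch()
        except ConnectionError:
            if attempt < retries - 1:
                time.sleep(delay)  # simple fixed backoff; tune per system
    # All retries exhausted: do the agreed-upon safe thing,
    # e.g. queue the task for later instead of guessing
    return fallback()
```

Whether the fallback is "queue it", "page a human", or "fail the whole run" is a scoping decision, and it's much cheaper to make it in the document than in an incident channel.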
Decide the handoff before you build
What happens when the agent isn't sure? What happens when it fails? What happens when the output is weird and somebody wants to override it? These aren't edge cases. They're the thing your users will actually experience, and if the scope doesn't cover them, the engineers will have to guess.
Good scopes include the handoff plan: below some confidence threshold, the task routes to a human. Above some confidence threshold, it proceeds automatically. There's a queue, there's a way to see what the agent did, and there's a way to mark an output as wrong so it feeds into the eval set. That's the whole loop. If it's not in the scope, it's going to get bolted on later for triple the cost.
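The whole loop fits in a few lines. A toy sketch, with an in-memory list standing in for a real queue and a hypothetical 0.85 threshold (the scope picks the actual number):

```python
CONFIDENCE_THRESHOLD = 0.85     # illustrative; the scope sets the real value

audit_log: list[dict] = []      # there is always a way to see what it did
review_queue: list[dict] = []   # stands in for a real queue with a UI
eval_set: list[dict] = []       # corrections become labeled examples

def route(ticket_id: str, category: str, confidence: float) -> str:
    """Above the threshold it proceeds; below, a human decides."""
    record = {"ticket_id": ticket_id, "category": category,
              "confidence": confidence}
    audit_log.append(record)
    if confidence >= CONFIDENCE_THRESHOLD:
        return "auto"
    review_queue.append(record)
    return "human"

def mark_wrong(record: dict, correct_category: str) -> None:
    # The override path: a wrong output becomes a labeled eval example
    eval_set.append({**record, "label": correct_category})
```

Everything here gets replaced by real infrastructure later; what matters is that the threshold, the queue, and the correction path are named in the scope.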
Red flags in a scope document
A handful of phrases that, in my experience, predict trouble:
- "Like ChatGPT but for [thing]." This is a vibe, not a scope.
- "Fully autonomous" as a headline feature. Nothing is fully autonomous. The question is what happens at the boundaries.
- "Learns over time." Unless someone has actually planned out the feedback loop and the retraining cadence, this is a wish.
- No mention of how failures will be detected.
- No mention of who owns the agent after launch.
- No mention of the data source, or the data source is "we'll figure it out."
- "MVP in 2 weeks" for something that integrates with 5 systems. The math doesn't math.
A scoping worksheet you can actually use
If you're about to send a scope to a dev team (ours or anyone else's), here's a one-page checklist that separates the well-scoped from the wishful:
- One sentence, no "and": what task does this agent do?
- What triggers it? (webhook, schedule, user action, etc.)
- What inputs does it need? List the exact fields and sources.
- What outputs does it produce? List the exact fields and destinations.
- What is it explicitly not allowed to do?
- What's the success metric? (With a number.)
- What's the test set you'll measure against? (Who owns it?)
- Every external system it touches, with API status, credentials owner, rate limits, and sandbox availability.
- What happens when it's uncertain? (Human escalation path.)
- What happens when it fails? (Retry logic, alerting, fallback.)
- Who owns it after launch? (Name. Singular. Not "the team.")
- What's the rollback plan if it turns out to be worse than the manual process?
If you can answer all twelve, you're in good shape and any competent dev shop should be able to give you a real estimate. If you can't answer more than half of them, you're not ready to scope yet. You're ready for a discovery phase, and the honest thing to do is call it that.
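If it helps, the twelve questions travel well as a literal template. A toy sketch, where each field name is shorthand for one question above:

```python
REQUIRED_FIELDS = [
    "task_one_sentence", "trigger", "inputs", "outputs",
    "not_allowed", "success_metric", "test_set",
    "external_systems", "uncertainty_handoff", "failure_plan",
    "owner", "rollback_plan",
]

def scope_readiness(scope: dict) -> tuple[int, list[str]]:
    """How many of the twelve are answered, and which are missing."""
    missing = [field for field in REQUIRED_FIELDS if not scope.get(field)]
    return len(REQUIRED_FIELDS) - len(missing), missing
```

Twelve out of twelve means ask for an estimate; six or fewer means ask for a discovery phase.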
The shortest possible summary
One task. Typed inputs and outputs. A number for success. A list of every system it touches. A plan for failure. A named owner. That's a good scope. Everything else is decoration.
If you want us to pressure-test a scope you're working on, send it over on a strategy call. No obligation, no pitch deck. We'll tell you what's missing and whether we think the thing is even the right project to be doing.