    June 11, 2026

    How we build an AI customer support agent: the 2-week pilot, end to end

    If you're running a small business and drowning in repetitive support questions, you've probably looked at AI support agents and felt one of two things: this is magic, or this is going to embarrass me in front of customers. Both reactions are reasonable. What follows is exactly how I build a working customer support agent in two weeks, what gets shipped, and where the guardrails go so the thing doesn't make stuff up.

    This is the process behind the $2,997 Pilot Agent package. No mystery, no jargon — just the actual workflow.

    Why two weeks, and why a pilot

    A pilot is not a prototype you throw away. It's a working agent answering real questions in one channel, with logging and a clear path to either expand it or shut it off. Two weeks is enough time to do discovery properly, build something that handles your top questions well, and prove it works against a real evaluation set. It's not enough time to build a sprawling agent that touches six systems and handles every edge case — and that's the point. Most small businesses don't need that. They need the top 20 questions handled accurately so their inbox stops being a treadmill.

    If the pilot works, we extend it. If it doesn't, you've spent two weeks and a fixed fee instead of two quarters and a salary.

    Week 1, Day 1–2: Discovery kickoff

    The first conversation is not about AI. It's about your support reality.

    I ask for:

    • The last 30–90 days of support tickets, emails, or chat transcripts (whatever you have)
    • Any FAQ pages, help docs, internal SOPs, onboarding emails, or PDFs your team uses
    • A list of the questions you're tired of answering
    • Who currently handles support, and where the bottleneck is

    From that, I extract the top 20 questions by volume. Not the most interesting ones, the most repeated ones. In most small businesses, the top 20 questions typically cover roughly 70–80% of inbound support volume. Handling those is the agent's job.
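
    To make the "top 20 by volume" step concrete, here is a minimal sketch of the counting. It assumes each ticket has already been given a short question label during review; the labels, the sample data, and the function name top_questions are all hypothetical.

```python
from collections import Counter

def top_questions(labels, n=20):
    """Return the n most frequent question labels and the share of volume they cover."""
    counts = Counter(labels)
    top = counts.most_common(n)
    covered = sum(count for _, count in top)
    coverage = covered / len(labels) if labels else 0.0
    return top, coverage

# Hypothetical sample: one label per historical ticket.
tickets = (
    ["order status"] * 5
    + ["reset password"] * 3
    + ["refund"] * 1
    + ["bulk pricing"] * 1
)

top, coverage = top_questions(tickets, n=2)
print(top)                 # [('order status', 5), ('reset password', 3)]
print(f"{coverage:.0%}")   # 80%
```

    On real data the same two numbers tell you whether 20 questions is the right cutoff for your inbox, or whether the volume is spread thinner than expected.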

    I also identify the high-risk questions — the ones where a wrong answer costs you money, a customer, or a legal headache. Refund policy, billing disputes, anything involving promises about delivery dates, anything medical, financial, or contractual. These get treated differently (more on that below).

    Week 1, Day 3–5: Knowledge sources and escalation rules

    Now we figure out what the agent is allowed to know and what it's allowed to do.

    Knowledge sources are the documents the agent can pull answers from. For most small businesses, this is a mix of:

    • Public help docs or FAQ pages
    • Internal SOPs (cleaned up — I'll flag anything that contradicts itself)
    • Product or service descriptions
    • A "canned responses" document if your team has one

    I do a pass on these documents before anything goes into the system. If your refund policy says one thing on the website and something different in an internal doc, the agent will happily contradict itself in front of a customer. Cleaning the source material is half the work.

    Escalation rules define when the agent stops answering and hands off to a human:

    • Customer explicitly asks for a human
    • Agent's confidence in its answer (for example, how closely the retrieved documents match the question) drops below a set threshold
    • Question touches a high-risk topic (refunds over a certain amount, legal, complaints)
    • Customer is frustrated (detected by tone and language patterns)
    • Conversation has gone more than N turns without resolution

    These rules get written down before any code is written. They become part of the runbook you keep.
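
    Once the rules are written down, they translate into code fairly directly. The sketch below is one possible shape, not the production logic: the keyword lists, the 0.6 confidence threshold, the 6-turn limit, and the Turn structure are illustrative assumptions, and simple keyword matching is only a crude stand-in for real tone detection.

```python
from dataclasses import dataclass
from typing import Optional

# Illustrative placeholders; the real lists come out of discovery.
HUMAN_REQUEST_TERMS = ("human", "real person", "representative")
HIGH_RISK_TERMS = ("refund", "legal", "complaint")
FRUSTRATION_TERMS = ("ridiculous", "unacceptable", "third time")

@dataclass
class Turn:
    message: str
    confidence: float   # assumed score from the answering step, 0.0 to 1.0
    turn_number: int    # how many turns the conversation has run so far

def escalation_reason(turn: Turn, min_confidence: float = 0.6,
                      max_turns: int = 6) -> Optional[str]:
    """Return the first escalation rule that fires, or None to keep answering."""
    text = turn.message.lower()
    if any(term in text for term in HUMAN_REQUEST_TERMS):
        return "customer_requested_human"
    if any(term in text for term in HIGH_RISK_TERMS):
        return "high_risk_topic"
    if any(term in text for term in FRUSTRATION_TERMS):
        return "customer_frustrated"
    if turn.confidence < min_confidence:
        return "low_confidence"
    if turn.turn_number > max_turns:
        return "too_many_turns"
    return None

print(escalation_reason(Turn("Can I talk to a real person?", 0.9, 1)))
print(escalation_reason(Turn("What are your hours?", 0.9, 1)))
```

    Returning a named reason rather than a bare yes/no means the same value can be logged and counted later.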

    The success metric is the last piece of discovery. What does "this worked" look like? Usually it's one of:

    • Deflection rate (% of conversations resolved without a human)
    • Time-to-first-response
    • Customer satisfaction on resolved conversations
    • Hours of staff time freed per week

    You pick one primary metric. Everything else is secondary.
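
    Whichever metric you pick, it helps to pin down the arithmetic up front. As a rough sketch, deflection rate and staff hours freed could be computed like this. The record fields (resolved, escalated) and the 8-minutes-per-ticket figure are assumptions for illustration, not measured values.

```python
def deflection_rate(conversations):
    """Share of all conversations that were resolved without a human."""
    if not conversations:
        return 0.0
    deflected = sum(1 for c in conversations if c["resolved"] and not c["escalated"])
    return deflected / len(conversations)

def hours_freed(deflected_count, minutes_per_ticket):
    """Staff time saved, assuming each deflected ticket would have taken a fixed time."""
    return deflected_count * minutes_per_ticket / 60

# Hypothetical week of conversations.
week = [
    {"resolved": True, "escalated": False},
    {"resolved": True, "escalated": False},
    {"resolved": True, "escalated": False},
    {"resolved": True, "escalated": True},   # resolved, but a human stepped in
]

print(deflection_rate(week))   # 0.75
print(hours_freed(3, 8))
```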

    Week 2, Day 1–3: Building the agent

    This is where the actual engineering happens. Three things get built in parallel.

    1. Retrieval over your documents. The agent uses a technique called retrieval-augmented generation (RAG). In plain English: when a customer asks a question, the system first searches your documents for relevant chunks, then asks the language model to answer using only those chunks. This is the main guard against the agent making things up about your business. It is instructed to speak only from the source material you provided, and to say so when that material doesn't cover the question.

    2. Hardcoded routes for high-risk questions. Not every question should be answered by a language model. Some questions get a deterministic response — a fixed answer, or an immediate handoff to a human, every single time. Examples:

    • "I want a refund" → hardcoded response with your refund policy + handoff option
    • "This is urgent / emergency" → immediate human handoff
    • "Cancel my account" → hardcoded flow that captures the request and notifies you
    • Anything matching keywords like "lawyer," "lawsuit," "BBB," "chargeback" → immediate escalation

    The rule of thumb: if a wrong answer would cost you more than the entire pilot, hardcode it.

    3. Human handoff path. When the agent escalates, what actually happens? Options I commonly set up:

    • Email notification to your support inbox with the full conversation transcript
    • Slack message to a designated channel
    • Ticket created in your existing help desk
    • "We're connecting you with a human, expect a reply within X hours" message to the customer

    The customer should never feel abandoned at the handoff point. The agent says something like "I'm going to bring in a teammate who can help with this — they'll respond by [time]." Then it actually happens.
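
    To show how retrieval and the handoff path fit together, here is a toy sketch. It ranks chunks by shared words, which is a simplified stand-in for the embedding search a real system would more likely use, and it stops at building the prompt rather than calling a model. The function names, the sample documents, and the min_overlap threshold are all hypothetical.

```python
import re

def tokenize(text):
    """Lowercase a string and split it into a set of words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(question, chunks, k=2, min_overlap=2):
    """Rank chunks by how many words they share with the question."""
    q = tokenize(question)
    scored = [(len(q & tokenize(chunk)), chunk) for chunk in chunks]
    scored = [(score, chunk) for score, chunk in scored if score >= min_overlap]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:k]]

def build_prompt(question, chunks):
    """Build a prompt that tells the model to answer only from the chunks."""
    context = "\n".join(f"- {chunk}" for chunk in chunks)
    return (
        "Answer using only the context below. "
        "If the context does not contain the answer, say you don't know.\n"
        f"Context:\n{context}\n"
        f"Question: {question}"
    )

def respond(question, chunks, reply_by="end of day"):
    """Answer from the documents when possible, otherwise hand off to a human."""
    found = retrieve(question, chunks)
    if not found:
        message = (
            "I'm going to bring in a teammate who can help with this. "
            f"They'll respond by {reply_by}."
        )
        return {"action": "handoff", "message": message}
    return {"action": "answer_with_llm", "prompt": build_prompt(question, found)}

DOCS = [
    "Standard shipping takes 3 to 5 business days.",
    "Our store is open Monday to Friday, 9am to 5pm.",
]

print(respond("How many days does standard shipping take?", DOCS)["action"])
print(respond("Do you sell gift cards?", DOCS)["action"])
```

    The important design choice is the empty-result branch: when nothing relevant is found, the agent hands off instead of letting the model guess.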

    Week 2, Day 4: The eval set

    Before the agent goes live, I build an evaluation set: 30–50 real questions pulled from your historical tickets, paired with the correct answer. I run the agent against this set and grade the results.

    The grades are simple:

    • Correct and complete — agent answered accurately
    • Correct but incomplete — agent got it right but missed nuance
    • Incorrect — agent got it wrong
    • Refused / escalated appropriately — agent didn't try to answer something it shouldn't
    • Refused / escalated when it shouldn't have — agent punted on something it could handle

    A pilot ships when the agent is hitting 85%+ "correct and complete" on the eval set for the in-scope questions, with zero incorrect answers on high-risk topics. If we don't hit that, I keep tuning the retrieval, prompts, and hardcoded routes until we do.
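
    The ship gate is simple enough to write as a check. This is a sketch under assumptions: each graded result is a small record with a grade, an in_scope flag, and a high_risk flag assigned by the reviewer, and the grade names are made-up identifiers for the five grades above.

```python
GRADES = {
    "correct_complete",
    "correct_incomplete",
    "incorrect",
    "escalated_ok",
    "escalated_unnecessarily",
}

def ready_to_ship(results, threshold=0.85):
    """True when in-scope accuracy meets the bar and no high-risk answer is wrong."""
    in_scope = [r for r in results if r["in_scope"]]
    if not in_scope:
        return False
    correct = sum(1 for r in in_scope if r["grade"] == "correct_complete")
    high_risk_errors = sum(
        1 for r in results if r["high_risk"] and r["grade"] == "incorrect"
    )
    return correct / len(in_scope) >= threshold and high_risk_errors == 0

def make(grade, in_scope=True, high_risk=False):
    """Helper for building sample results."""
    assert grade in GRADES
    return {"grade": grade, "in_scope": in_scope, "high_risk": high_risk}

# Hypothetical run: 18 of 20 in-scope questions fully correct.
run = [make("correct_complete")] * 18 + [make("correct_incomplete")] * 2
print(ready_to_ship(run))   # True

# One wrong answer on a high-risk topic blocks the launch regardless of the rate.
run_with_risk = run + [make("incorrect", in_scope=False, high_risk=True)]
print(ready_to_ship(run_with_risk))   # False
```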

    Week 2, Day 5: What ships

    At the end of week two, you get three things.

    A working agent in your channel of choice. Most small businesses pick one of:

    • A web widget embedded on your site
    • An email auto-responder that handles common questions and escalates the rest
    • A Slack bot for internal support
    • A WhatsApp or SMS responder

    You pick one channel for the pilot. Adding channels happens after the pilot proves out.

    A dashboard. A simple view that shows:

    • Conversations per day
    • Deflection rate (resolved without human)
    • Escalation reasons
    • Average response time
    • Recent conversations you can click into and read

    Nothing fancy. You need to be able to see what's happening without logging into five different tools.
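
    Most of those dashboard numbers are plain aggregations over the conversation log. As an illustration only, assuming each logged conversation carries a start time, an escalation reason (or None), and a response time in seconds, the summary could be computed like this. The field names and sample data are hypothetical.

```python
from collections import Counter
from datetime import datetime

def summarize(conversations):
    """Aggregate a conversation log into the figures a simple dashboard shows."""
    per_day = Counter(c["started_at"].date().isoformat() for c in conversations)
    reasons = Counter(
        c["escalation_reason"] for c in conversations if c["escalation_reason"]
    )
    if conversations:
        avg = sum(c["response_seconds"] for c in conversations) / len(conversations)
    else:
        avg = 0.0
    return {
        "per_day": dict(per_day),
        "escalation_reasons": dict(reasons),
        "avg_response_seconds": avg,
    }

# Hypothetical log entries.
log = [
    {"started_at": datetime(2026, 6, 1, 9, 0), "escalation_reason": None, "response_seconds": 2},
    {"started_at": datetime(2026, 6, 1, 14, 30), "escalation_reason": "high_risk_topic", "response_seconds": 4},
    {"started_at": datetime(2026, 6, 2, 10, 15), "escalation_reason": None, "response_seconds": 6},
]

print(summarize(log))
```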

    A runbook. A short document that covers:

    • What the agent can and can't do
    • The escalation rules
    • How to update the knowledge base when your policies change
    • How to turn the agent off if something's wrong
    • Who to call (me) if it breaks

    Hardcoded vs LLM-driven: where the line is

    This is the question most small business owners don't think to ask, and it's the most important one. Here's how I draw the line:

    Hardcoded (rules-based):

    • Anything involving money commitments
    • Anything legal or compliance-related
    • Account changes (cancel, downgrade, change billing)
    • Emergencies and urgency signals
    • Routing logic (which department, which agent)
    • Explicit human-handoff requests

    LLM-driven (model judgment):

    • "How do I…" questions answered from your docs
    • Product or service explanations
    • Hours, location, contact info
    • Status updates that pull from public sources
    • Friendly small talk and conversation flow
    • Rephrasing customer questions to understand what they actually need

    The model handles the conversation and the answering. The rules handle the consequences. That split is what makes the agent both useful and safe.
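
    In code, that split usually means the rules run first and the model only sees what the rules let through. A minimal sketch, assuming regular-expression matching on the incoming message; the patterns and action names are illustrative, and a real route list would come out of discovery.

```python
import re

# Checked in order, so the most serious routes go first.
ROUTES = [
    (re.compile(r"\b(lawyer|lawsuit|bbb|chargeback)\b", re.I), "escalate_now"),
    (re.compile(r"\b(urgent|emergency)\b", re.I), "escalate_now"),
    (re.compile(r"\brefund\b", re.I), "refund_policy_reply"),
    (re.compile(r"\bcancel\b.*\baccount\b", re.I), "cancel_flow"),
]

def route(message):
    """Return the hardcoded action for a message, or 'llm_answer' if none applies."""
    for pattern, action in ROUTES:
        if pattern.search(message):
            return action
    return "llm_answer"

print(route("I want a refund"))                         # refund_policy_reply
print(route("I'll file a chargeback for this refund"))  # escalate_now
print(route("How do I reset my password?"))             # llm_answer
```

    Because the list is ordered, a message that mentions both a refund and a chargeback takes the escalation route rather than the refund reply.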

    Monitoring and what happens after launch

    The pilot doesn't end the day the agent goes live. The first two weeks of production are where you find out what discovery missed.

    Logged conversations. Every conversation is logged in full. I review the first week's logs with you and flag:

    • Questions the agent handled well
    • Questions the agent should have escalated but didn't
    • Questions the agent escalated unnecessarily
    • Gaps in your knowledge base that show up as repeated "I don't know" responses

    Eval cadence. The evaluation set gets re-run monthly to catch drift. Language models change, your documents change, your customers ask new things. The eval set grows over time — every weird question that comes in gets added.

    Drift checks. I watch for two specific failure modes:

    1. The agent starts giving confident but outdated answers because your knowledge base has gone stale (you changed a policy and didn't update the source doc)
    2. The agent starts over-escalating because customers are asking new types of questions the eval set never covered

    Both are fixable. Neither is invisible if you're watching.
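
    Both checks can be partly automated. The sketch below is one possible approach rather than a description of the actual tooling: it flags source documents that haven't been reviewed recently and compares the escalation rate between two eval runs. The 90-day window, the 5-point tolerance, and the document names are assumptions.

```python
from datetime import date

def stale_documents(docs, today, max_age_days=90):
    """Return names of documents whose last review is older than max_age_days."""
    return sorted(
        name for name, reviewed in docs.items()
        if (today - reviewed).days > max_age_days
    )

def over_escalation_drift(previous_rate, current_rate, tolerance=0.05):
    """True when the escalation rate rose by more than the tolerance."""
    return current_rate - previous_rate > tolerance

# Hypothetical knowledge base: document name -> date it was last reviewed.
docs = {
    "refund-policy": date(2026, 1, 10),
    "shipping-faq": date(2026, 5, 20),
}

print(stale_documents(docs, today=date(2026, 6, 11)))   # ['refund-policy']
print(over_escalation_drift(0.10, 0.20))                # True
```

    A review date is only a proxy. A document can be recently reviewed and still wrong, which is why the logged conversations still get read by a person.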

    What this costs and what you get

    The $2,997 Pilot Agent package covers everything above: discovery, build, eval, deployment in one channel, dashboard, runbook, and the first two weeks of monitoring. Fixed price, two weeks, working agent at the end.

    If it works, you can extend into a production package (more channels, deeper integrations, ongoing tuning) or move to a retainer for maintenance. If it doesn't work — meaning it doesn't hit the success metric we agreed on in week one — we figure out together whether to iterate or stop.

    What you don't get: a black box, a six-month timeline, a salesperson, or a tool you have to figure out how to use yourself.

    Ready to see what your top 20 questions look like as an agent?

    If you've got a support inbox that's eating your team's time and a stack of documents that already contain the answers, the pilot is built for you. Two weeks, fixed price, working agent at the end — and an honest answer about whether AI is the right fix for your situation.

    Take a look at the AI Pilot Agent package or get in touch and we'll start with a 20-minute conversation about your top questions.
