
    June 15, 2026

    Building your first AI ops agent: a step-by-step walkthrough on a real workflow

    Most "AI agent" content stops at the demo. You see a chatbot answer a question and the writer calls it a day. That's not an agent — that's a wrapper around a chat model. A real ops agent does work: it pulls data from somewhere, decides something, writes to a system of record, and tells a human what happened. This post walks through exactly how to build one for a workflow almost every small business has: turning inbound contact form submissions into qualified, routed, logged leads in your CRM.

    By the end, you'll know the architecture, see the actual prompts, understand the guardrails that keep it from doing something stupid, and have a real number for what it costs to run.

    The workflow we're automating

    Here's the manual version, the one most small businesses run today:

    1. Someone fills out the contact form on your site.
    2. You get an email. Maybe it goes to a shared inbox.
    3. Someone (you, an assistant, a sales rep) reads it during business hours.
    4. They Google the company to figure out if it's a real lead or a tire-kicker.
    5. If they decide it's worth a call, they copy the info into your CRM and either reply or assign it to someone.
    6. Some leads sit for hours. Some sit for days. Some get forgotten.

    The agent version:

    1. Someone fills out the contact form.
    2. Within ~30 seconds: the agent enriches the submission with public company data, scores it against your ideal customer profile, writes a clean record to your CRM with notes and a priority tag, and pings the right person in Slack with a one-paragraph summary and a recommended next action.
    3. Low-confidence or weird submissions land in a human review queue instead of getting auto-actioned.

    Same outcome, faster, consistent, logged. Nobody forgets a lead because they were at lunch.

    The architecture, in plain English

    There are five moving parts. Don't let the names intimidate you — each one has a single job.

    1. Trigger. Something has to kick off the agent. In this case, it's a webhook from your contact form (Typeform, Formspree, a custom Next.js form, HubSpot forms — they all do this). The form vendor POSTs the submission to a URL you control.

    2. Orchestrator. This is the brain of the operation. It receives the webhook, decides which tools to call, calls them in order, and handles errors. I usually build this as a single serverless function (Vercel, Cloudflare Workers, or AWS Lambda) or a small worker on something like Railway. It's a few hundred lines of TypeScript or Python.

    3. Retrieval tools. These pull context the model needs to make a good decision. For lead triage that means: a company lookup (Clearbit, Apollo, or just a clever Google + scrape), a check against your existing CRM ("have we talked to this person before?"), and optionally a check against a blocklist (competitors, known spammers).

    4. Action tools. Once the model decides what to do, these execute. CRM write (HubSpot, Pipedrive, Close, Attio — all have decent APIs), Slack message, email reply, calendar booking link.

    5. Eval log. Every run gets logged: input, retrieval results, model output, action taken, confidence score, latency, token cost. Without this, you're flying blind. You can't improve what you can't measure, and you can't trust an agent you can't audit.

    That's it. Trigger, orchestrator, retrieval, action, log. Every production agent I build follows this same skeleton — only the tools change.
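
    To make the skeleton concrete, here is a minimal sketch of the five parts as TypeScript types plus a tiny orchestrator. Every name in it (Submission, AgentParts, runAgent, and so on) is made up for illustration and does not come from any framework; a real build would swap in the actual tools.

```typescript
// Hypothetical shapes for the five parts; every name here is illustrative.
type Submission = { email: string; message: string };
type RetrievedContext = { company: string | null; priorContacts: number };
type Decision = { action: "route" | "review"; summary: string };
type LogEntry = {
  input: Submission;
  context: RetrievedContext;
  decision: Decision;
  latencyMs: number;
};

interface AgentParts {
  retrieve: (s: Submission) => Promise<RetrievedContext>;            // retrieval tools
  decide: (s: Submission, c: RetrievedContext) => Promise<Decision>; // model call
  act: (d: Decision) => Promise<void>;                               // action tools
  log: (e: LogEntry) => Promise<void>;                               // eval log
}

// Orchestrator: the trigger (webhook handler) calls this once per submission.
async function runAgent(parts: AgentParts, input: Submission): Promise<LogEntry> {
  const started = Date.now();
  const context = await parts.retrieve(input);
  const decision = await parts.decide(input, context);
  await parts.act(decision);
  const entry: LogEntry = { input, context, decision, latencyMs: Date.now() - started };
  await parts.log(entry);
  return entry;
}
```

    Passing the parts in as plain functions is one way to keep the skeleton fixed while the tools change, and it makes the orchestrator easy to test with stubs.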

    The classifier prompt

    This is where most people go wrong. They write a vague prompt like "you are a helpful sales assistant, qualify this lead." The model does something, but you can't predict what, and you definitely can't trust the output enough to auto-write it to your CRM.

    Here's a stripped-down version of the classifier prompt I'd use for this workflow. Real prompts get longer with company-specific context, but this is the structure:

    You are a lead qualification agent for [COMPANY NAME], a [BUSINESS 
    DESCRIPTION] serving [TARGET CUSTOMER].
    
    Your job: given a contact form submission and enrichment data, output 
    a JSON object that classifies the lead and recommends an action.
    
    IDEAL CUSTOMER PROFILE:
    - Company size: [X-Y] employees
    - Industry: [list]
    - Indicators of fit: [list, e.g. "has a marketing team", "runs 
      WordPress", "has ecommerce"]
    - Disqualifiers: [list, e.g. "agency reselling", "students", 
      "competitor domains"]
    
    INPUTS:
    - form_submission: the raw form data
    - enrichment: company data from Clearbit lookup (may be null)
    - crm_history: prior interactions if any (may be empty)
    
    OUTPUT SCHEMA (you MUST return valid JSON matching this exactly):
    {
      "intent": "buying" | "researching" | "support" | "spam" | "unclear",
      "fit_score": 0-100,
      "fit_reasoning": "1-2 sentences citing specific evidence",
      "confidence": 0.0-1.0,
      "recommended_action": "auto_route_sales" | "auto_route_support" 
        | "human_review" | "auto_reject_spam",
      "assigned_to": "sales" | "support" | "founder" | null,
      "priority": "high" | "medium" | "low",
      "summary_for_slack": "2-3 sentence summary written for a human 
        teammate, including the key ask and any red flags"
    }
    
    RULES:
    1. If enrichment is null AND the email is a free provider (gmail, 
       yahoo, etc.), confidence cannot exceed 0.6.
    2. If the message contains fewer than 15 words, intent defaults to 
       "unclear" unless context strongly suggests otherwise.
    3. Never recommend auto_route_sales with confidence below 0.75.
    4. If you see disqualifier signals, recommended_action must be 
       "human_review" or "auto_reject_spam" — never auto_route.
    5. fit_reasoning must cite the actual data, not generalities. Bad: 
       "seems like a good fit". Good: "50-employee SaaS in healthcare 
       matches ICP; message mentions specific pain point (HIPAA compliance)".
    

    Two things to notice. First, the output is a strict JSON schema. We're not parsing free text — the model returns structured data we can validate. Second, the rules section is where the guardrails live. The model is told, in plain terms, what it's not allowed to do.
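
    One way to hand the three inputs to the model is to serialize them into a single JSON user message under the same keys the prompt names. This is a sketch under assumptions: the field shapes and the buildUserMessage helper are hypothetical, and real form or enrichment payloads will look different.

```typescript
// Hypothetical input shapes; real form and enrichment payloads will differ.
type FormSubmission = { name: string; email: string; message: string };
type CompanyEnrichment = { companyName: string; employees: number; industry: string };
type CrmInteraction = { date: string; note: string };

// Serialize the inputs under the keys the prompt's INPUTS section names.
// JSON.stringify keeps null and [] explicit, so missing data stays visible.
function buildUserMessage(
  form: FormSubmission,
  enrichment: CompanyEnrichment | null,
  crmHistory: CrmInteraction[],
): string {
  return JSON.stringify(
    { form_submission: form, enrichment, crm_history: crmHistory },
    null,
    2,
  );
}

const exampleMessage = buildUserMessage(
  { name: "Dana", email: "dana@gmail.com", message: "Do you support HIPAA workflows?" },
  null, // enrichment lookup found nothing
  [],   // no prior CRM contact
);
console.log(exampleMessage);
```

    Keeping the null and the empty array in the message matters for rule 1: the model can only cap its confidence on missing enrichment if it can see that the enrichment is missing.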

    Guardrails: the boring part that keeps you out of trouble

    This is the section most tutorials skip. It's also the difference between an agent you can leave running unsupervised and one that, on a bad day, emails 200 prospects calling them by the wrong company name.

    Schema validation on every tool output. When the classifier returns JSON, you validate it against the schema with something like Zod (TypeScript) or Pydantic (Python). If the model returns malformed JSON or a field outside the allowed enum, you don't trust it — you retry once, and if it fails again, the lead goes to human review with an error note. Don't ever let an unvalidated model output reach an action tool.

    import { z } from "zod";
    
    const ClassificationSchema = z.object({
      intent: z.enum(["buying", "researching", "support", "spam", "unclear"]),
      fit_score: z.number().min(0).max(100),
      fit_reasoning: z.string().min(20).max(500),
      confidence: z.number().min(0).max(1),
      recommended_action: z.enum([
        "auto_route_sales",
        "auto_route_support",
        "human_review",
        "auto_reject_spam",
      ]),
      assigned_to: z.enum(["sales", "support", "founder"]).nullable(),
      priority: z.enum(["high", "medium", "low"]),
      summary_for_slack: z.string().min(40).max(600),
    });
    
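    // Runs inside the orchestrator's async handler. modelOutput is the model
    // response after JSON.parse (an object, not a raw string). This is the final
    // failure path; the retry-once step described above is omitted for brevity.
    const parsed = ClassificationSchema.safeParse(modelOutput);
    if (!parsed.success) {
      await sendToReviewQueue(submission, "schema_validation_failed");
      return;
    }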
    

    Confidence floor before auto-action. Even if the model says auto_route_sales, the orchestrator double-checks the confidence score. Below 0.75? Route to human review anyway. The model's stated action is a suggestion, not a command.
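
    In code, that double-check can be a few lines. A minimal sketch, assuming the 0.75 floor from rule 3 of the prompt; the applyConfidenceFloor name and the trimmed-down ClassificationResult type are mine, not part of any library.

```typescript
type RecommendedAction =
  | "auto_route_sales"
  | "auto_route_support"
  | "human_review"
  | "auto_reject_spam";

// Only the fields this check needs; the full schema has more.
interface ClassificationResult {
  recommended_action: RecommendedAction;
  confidence: number;
}

const AUTO_ROUTE_SALES_FLOOR = 0.75; // mirrors rule 3 in the prompt

// Treat the model's action as a suggestion: downgrade to human review
// when it asks for auto_route_sales below the floor.
function applyConfidenceFloor(c: ClassificationResult): RecommendedAction {
  if (c.recommended_action === "auto_route_sales" && c.confidence < AUTO_ROUTE_SALES_FLOOR) {
    return "human_review";
  }
  return c.recommended_action;
}
```

    The same rule lives in the prompt and in the orchestrator on purpose. The prompt copy shapes the model's behavior; the code copy is the one you can actually rely on.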

    Human review queue. This can be a Slack channel, a Linear project, or a simple internal page: anywhere humans can see the queued submissions, the agent's reasoning, and approve or reject with one click. Every flagged lead lands here. In the first month of running a new agent, you should expect 20–40% of leads to hit this queue. As you refine the prompt and ICP rules, that number drops. You'll never get it to zero, and you shouldn't want to.

    Domain blocklists and rate limits. Before the model even runs, the orchestrator checks: is this email from a known competitor domain? Have we already received 5 submissions from this IP in the last hour (spam bot)? Cheap filtering at the edge saves you API tokens and prevents weird outputs.
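
    Both pre-checks fit in a few lines. The sketch below assumes an in-memory store, which is fine for illustration but won't survive across serverless invocations; in production you'd likely back it with a shared key-value store. The domain in the blocklist and the function names are placeholders.

```typescript
// Placeholder blocklist; fill with real competitor and spam domains.
const BLOCKED_DOMAINS = new Set(["competitor.example"]);

function isBlockedDomain(email: string): boolean {
  const domain = email.split("@").pop()?.trim().toLowerCase() ?? "";
  return BLOCKED_DOMAINS.has(domain);
}

const WINDOW_MS = 60 * 60 * 1000; // one hour
const MAX_PER_WINDOW = 5;
// ip -> timestamps of accepted submissions (in-memory for the sketch only)
const recentByIp = new Map<string, number[]>();

// Returns true and records the submission if the IP is under the limit;
// returns false once the IP already has 5 submissions in the last hour.
function allowSubmission(ip: string, now: number = Date.now()): boolean {
  const recent = (recentByIp.get(ip) ?? []).filter((t) => now - t < WINDOW_MS);
  if (recent.length >= MAX_PER_WINDOW) {
    recentByIp.set(ip, recent);
    return false;
  }
  recent.push(now);
  recentByIp.set(ip, recent);
  return true;
}
```

    With these numbers, the first five submissions from an IP in any one-hour window go through and the sixth is rejected before any tokens are spent.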

    No "creative" actions allowed. The action tools are a fixed set: write to CRM, post to Slack, send templated email reply. The agent cannot send custom emails, cannot call APIs you haven't whitelisted, cannot edit existing CRM records (only create new ones). This is the single most important architectural decision: limit the action surface area. An agent that can only do five things will fail in at most five ways.
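
    One way to enforce a fixed action set is to model it as a closed union and reject anything else at runtime. This is a sketch, not a prescribed design: the action names and fields are hypothetical, and the execute function only returns a description instead of calling real APIs.

```typescript
// The complete, fixed set of things the agent may do. Field names are illustrative.
type AgentAction =
  | { kind: "create_crm_record"; email: string; notes: string }
  | { kind: "post_slack_summary"; channel: string; text: string }
  | { kind: "send_templated_reply"; to: string; templateId: string };

const ALLOWED_KINDS: readonly string[] = [
  "create_crm_record",
  "post_slack_summary",
  "send_templated_reply",
];

// Runtime check for anything derived from model output.
function isAllowedKind(kind: string): kind is AgentAction["kind"] {
  return ALLOWED_KINDS.indexOf(kind) !== -1;
}

// Dispatcher stub: returns a description instead of calling real APIs.
function execute(action: AgentAction): string {
  switch (action.kind) {
    case "create_crm_record":
      return `CRM: create contact ${action.email}`;
    case "post_slack_summary":
      return `Slack: post to ${action.channel}`;
    case "send_templated_reply":
      return `Email: send template ${action.templateId} to ${action.to}`;
    default: {
      // Exhaustiveness check: a new kind without a case fails to type-check.
      const unhandled: never = action;
      throw new Error(`Unhandled action: ${JSON.stringify(unhandled)}`);
    }
  }
}
```

    There is deliberately no "update record" or "send custom email" variant, so those actions can't be expressed at all, let alone executed.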

    Idempotency. If the webhook fires twice (it will, sometimes), you don't want two CRM records and two Slack pings. Hash the submission and check if you've seen it in the last 24 hours before processing.
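
    A minimal sketch of that check, using Node's built-in crypto module. The helper names are hypothetical, the store is an in-memory Map for illustration (a real deployment would want a shared store with a 24-hour expiry), and trimming whitespace before hashing is my assumption about what should count as "the same submission".

```typescript
import { createHash } from "node:crypto";

const DEDUPE_WINDOW_MS = 24 * 60 * 60 * 1000;
// hash -> time first seen. In-memory for the sketch only.
const firstSeenAt = new Map<string, number>();

function submissionHash(fields: Record<string, string>): string {
  // Sort by key so field order in the payload does not change the hash.
  const canonical = Object.entries(fields)
    .sort(([a], [b]) => (a < b ? -1 : a > b ? 1 : 0))
    .map(([key, value]) => [key, value.trim()]);
  return createHash("sha256").update(JSON.stringify(canonical)).digest("hex");
}

// Returns true if an identical submission was seen in the last 24 hours;
// otherwise records this one and returns false.
function isDuplicate(fields: Record<string, string>, now: number = Date.now()): boolean {
  const hash = submissionHash(fields);
  const seen = firstSeenAt.get(hash);
  if (seen !== undefined && now - seen < DEDUPE_WINDOW_MS) {
    return true;
  }
  firstSeenAt.set(hash, now);
  return false;
}
```

    Hashing the content rather than relying on a vendor-supplied delivery ID also catches the case where a person double-clicks the submit button.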

    The full run, step by step

    Here's what happens when a real submission comes in:

    1. Webhook arrives. Form vendor POSTs JSON to /api/agent/lead-triage. The orchestrator validates the request signature so randos can't hit your endpoint with fake leads.

    2. Pre-checks. Hash the submission for idempotency. Check the email domain against the blocklist. Check the IP rate limit. If anything fails, log and exit.

    3. Enrichment (parallel). Fire two API calls at the same time: company lookup on the email's domain, and CRM search for prior contacts. Total wait time: ~1.5 seconds for the slower of the two.

    4. Classifier call. Build the prompt with the form data, enrichment results, and CRM history. Call the model (I default to GPT-4o-mini or Claude Haiku for this; more on cost below). Use the provider's structured output / tool-use feature so the response comes back as JSON rather than free text; the next step still validates it.

    5. Validate. Parse the response through the Zod schema. If it fails, retry once with a "your previous response was invalid JSON, here's the schema again" message. If it fails twice, push to review queue.

    6. Apply guardrails. Check confidence floor. Check that recommended_action is consistent with intent. If the lead is flagged spam but fit_score is 80, something's weird — push to review.

    7. Execute action. Write to CRM (HubSpot's /crm/v3/objects/contacts endpoint, for example). Post to the appropriate Slack channel with the summary, priority tag, and a "Take this lead" button that assigns it. Maybe send a templated acknowledgment email to the prospect.

    8. Log everything. Write to your eval log: timestamp, input hash, enrichment results, model output, validation pass/fail, action taken, latency, input tokens, output tokens, cost. I use a simple Postgres table for this, but a logging service like Axiom or Datadog works too.
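
    For step 1, a signature check might look like the sketch below. It assumes the form vendor signs the raw request body with HMAC-SHA256 and sends the hex digest in a header. That scheme is an assumption: the algorithm, encoding, and header name vary by vendor, so check their docs before copying this.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Assumes: signature = hex(HMAC-SHA256(secret, rawBody)). Vendors differ.
function isValidSignature(rawBody: string, signatureHex: string, secret: string): boolean {
  const expected = createHmac("sha256", secret).update(rawBody).digest();
  const received = Buffer.from(signatureHex, "hex");
  // timingSafeEqual throws on length mismatch, so compare lengths first.
  return received.length === expected.length && timingSafeEqual(received, expected);
}
```

    Compute the HMAC over the raw body bytes exactly as received, before any JSON parsing. Re-serializing the parsed body can change whitespace or key order and break the comparison.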

    The whole thing runs in roughly 4–8 seconds from form submit to Slack notification.
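
    Steps 3 and 5 of the run can be sketched as two small helpers. Both are illustrative: the function names are hypothetical, the lookups and the model call are passed in as plain functions, and turning a failed lookup into null is my assumption about how to keep one flaky API from killing the whole run.

```typescript
// Step 3: start both lookups at once. A failed lookup resolves to null
// instead of aborting the run, so the classifier can still proceed.
async function enrichInParallel<C, H>(
  lookupCompany: () => Promise<C>,
  searchCrm: () => Promise<H>,
): Promise<{ company: C | null; crmHistory: H | null }> {
  const [company, crmHistory] = await Promise.all([
    lookupCompany().catch(() => null),
    searchCrm().catch(() => null),
  ]);
  return { company, crmHistory };
}

// Step 5: validate, retry once with a corrective note, then give up.
// A null result means "push to the review queue".
async function classifyWithRetry<T>(
  callModel: (retryNote?: string) => Promise<unknown>,
  validate: (raw: unknown) => T | null,
): Promise<T | null> {
  const first = validate(await callModel());
  if (first !== null) {
    return first;
  }
  const retryNote = "Your previous response was invalid JSON. Here is the schema again.";
  return validate(await callModel(retryNote));
}
```

    Because both lookups start before either is awaited, the wait is set by the slower call rather than the sum of the two.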

    What it actually costs to run

    Real numbers. I'll use mid-2024 pricing for GPT-4o-mini as the baseline.

    Per-lead model cost:

    • Classifier prompt + context: ~2,500 input tokens
    • Structured JSON output: ~400 output tokens
    • GPT-4o-mini pricing: $0.15 per 1M input, $0.60 per 1M output
    • Cost per classification: ~$0.0006 (roughly six hundredths of a cent)

    Per-lead enrichment cost:

    • Clearbit Enrichment API: roughly $0.10 per lookup on lower-tier plans, less in volume. Apollo and similar tools have their own pricing. You can also use cheaper or free alternatives that scrape LinkedIn-adjacent data, with corresponding tradeoffs in accuracy.
    • Realistic blended cost: $0.05–$0.15 per lead.

    Per-lead infrastructure cost:

    • Serverless function execution: fractions of a cent on Vercel/Cloudflare.
    • Database writes for the eval log: also negligible at small business volume.
    • Slack and CRM API calls: free up to generous limits.

    Total cost per lead: roughly $0.06–$0.16, dominated by the enrichment API, not the model.

    Fixed monthly costs:

    • Hosting (Vercel/Railway/Fly): $20–$50/month
    • Monitoring (optional but recommended): $0–$30/month at small business volume
    • CRM and Slack: you're already paying for these

    So if you get 300 leads a month, you're looking at maybe $18–$48 in variable costs plus $20–$80 in fixed costs. Call it $40–$130/month, all-in, to triage 300 leads with sub-30-second response time, full logging, and a human review safety net.
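
    If you want to plug in your own volume or pricing, the arithmetic is short enough to script. This sketch uses only the figures quoted above; the variable and function names are mine.

```typescript
// Figures from this section; swap in your own.
const INPUT_TOKENS = 2500;
const OUTPUT_TOKENS = 400;
const INPUT_PRICE_PER_MILLION = 0.15;  // USD
const OUTPUT_PRICE_PER_MILLION = 0.60; // USD

// Model cost for one classification, in USD.
const modelCostPerLead =
  (INPUT_TOKENS * INPUT_PRICE_PER_MILLION + OUTPUT_TOKENS * OUTPUT_PRICE_PER_MILLION) / 1e6;

// Low and high monthly totals from per-lead and fixed cost ranges.
function monthlyCostRange(
  leads: number,
  perLead: [number, number],
  fixed: [number, number],
): [number, number] {
  return [leads * perLead[0] + fixed[0], leads * perLead[1] + fixed[1]];
}

const [low, high] = monthlyCostRange(300, [0.06, 0.16], [20, 80]);

console.log(`model cost per lead: $${modelCostPerLead.toFixed(6)}`);
console.log(`monthly total: $${Math.round(low)} to $${Math.round(high)}`);
```

    For 300 leads this works out to about $38 to $128 a month, which is where the rounded $40–$130 figure comes from. The model line is so small that even a tenfold increase in token usage would barely move the total.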

    Compare that to the cost of one hour a day of someone's time doing the same work manually, and the math gets obvious fast.

    What to expect in month one

    I want to be honest about what this is and isn't. The first two weeks of any new ops agent are not "set it and forget it." They're tuning weeks. You're going to look at the eval log and find leads that got misclassified. You're going to tighten the ICP rules. You're going to add a new disqualifier when a spammer figures out a new angle. You're going to lower or raise the confidence floor based on how the review queue feels.

    By week three or four, the agent settles into a rhythm. Auto-action rate climbs to 60–80% of leads. Review queue volume drops to a handful per day. You stop worrying about it and start noticing that leads are getting contacted faster, no one's dropping the ball when someone's on vacation, and your CRM data is suddenly clean and consistent.

    The agent doesn't replace your judgment. It does the rote work fast and reliably, surfaces the weird stuff for you to handle, and keeps a paper trail of every decision so you can improve over time.

    A note on the boring stuff that matters most

    The architecture above is the easy part. The hard parts, in my experience:

    • Writing a good ICP definition. Most small businesses can't articulate their ideal customer crisply enough to feed it to an agent. The exercise of building the agent often forces this clarity.
    • Cleaning up CRM field hygiene so the agent has somewhere consistent to write.
    • Getting the team to actually use the Slack notifications instead of falling back to email habits.
    • Resisting the urge to make the agent "smarter" by giving it more tools. Fewer tools, used well, beats more tools used erratically.

    If any of those sound like the actual bottleneck — not the AI part, but the operational part around it — that's normal. That's where most of the value lives.

    Wrapping up

    A real ops agent is five parts: trigger, orchestrator, retrieval, action, log. The model is one component, not the whole thing. The guardrails (schema validation, confidence floors, human review queues, limited action surface) are what make it production-ready instead of a demo. And at small business scale, the whole thing costs roughly $40–$130 a month to run at 300 leads.

    If you've got a workflow that looks like the one in this post — repetitive, rule-based at the core, but with enough variation that a simple Zapier flow doesn't cut it — there's a good chance an agent is the right tool. Inbound lead triage is the most common starting point because the ROI is immediate and the failure modes are recoverable (a misrouted lead is fixable; a misfiled invoice is not).

    If you want help building this on your stack, the AI Production Agent package covers exactly this kind of build — scoped to your workflow, your tools, your ICP — with the eval log, guardrails, and review queue wired up from day one. Or if you're not sure whether your workflow is a good fit, start a conversation and I'll tell you straight whether an agent makes sense or whether a simpler automation would do the job.
