June 13, 2026

Hiring a freelance AI consultant: what to ask, what to avoid

You've decided your business needs an AI agent — maybe to handle customer support tickets, qualify leads, or take some load off your ops team. Now you're staring at a dozen LinkedIn profiles and Upwork listings, all promising the same thing, and you have no idea how to tell the real ones from the demo-jockeys. This post gives you eight questions that separate operators from talkers, plus four red flags that should end the call.

Why this matters more than picking a web developer

A bad website is embarrassing. A bad AI agent is dangerous. It can quote wrong prices to customers, leak data into a public LLM, hallucinate refund policies, or quietly degrade for three weeks before anyone notices revenue dropping. The cost of a broken agent isn't the build fee — it's the customers it mishandled while you thought it was working.

That asymmetry is why the vetting questions below focus less on "can you build it" and more on "can you tell when it's broken, and what happens when it is."

The 8 questions to ask before you sign anything

1. "Show me a production agent you've shipped."

Not a demo. Not a screenshot from a hackathon. A live system handling real traffic for a real business. Ask for the use case, the volume it handles, and how long it's been running.

If they can't show you one, that's not automatically disqualifying — everyone starts somewhere — but the price should reflect it. A first-timer charging senior rates is a problem. A first-timer being honest about it and pricing accordingly is fine.

2. "What's your eval process?"

This is the question that filters out 80% of pretenders. A real answer sounds like: "I build an eval set of 50–200 test cases drawn from your actual data, including edge cases and adversarial inputs. I run the agent against the set after every prompt change and track pass rate over time."

A fake answer sounds like: "I test it thoroughly" or "I do prompt engineering iteratively."

Evals are how you know the agent works. Without them, every change is a coin flip.

3. "What's your handoff path to a human?"

Every agent needs one. When the agent doesn't know the answer, what happens? Does it create a ticket? Slack your support lead? Send an email? Just say "I'm not sure, please contact us"?

The right answer depends on your business, but the consultant should have a clear opinion and a clear implementation. "It'll figure it out" is not an answer.

4. "What monitoring ships with this?"

After the agent goes live, how will you know if it's working? You need at minimum:

Log of every conversation (with PII handling)
Latency and error rate tracking
Cost per conversation (token usage)
Some signal on quality — thumbs up/down, escalation rate, or sentiment

If the consultant's plan is "I'll check on it occasionally," walk away. You're buying a system that runs 24/7. It needs telemetry that runs 24/7.

5. "What's your guardrails approach?"

Guardrails are the rules that keep the agent from doing harmful or off-topic things. Examples: refusing to discuss competitors, never quoting a price not in the catalog, never promising a refund without human approval, not engaging on legal or medical advice.

Ask for specifics. "I use system prompts" is not enough — system prompts can be bypassed. Look for a layered answer: system prompt + input filtering + output validation + tool-level constraints (the agent literally can't issue a refund because it doesn't have that API permission).

6. "Who owns the prompts after handoff?"

This catches a lot of small businesses off guard. Six months in, you want to tweak how the agent greets customers. Can you do it yourself? Do you need to call the consultant? Are the prompts even in a place you can see them?

The right answer: you own everything — prompts, eval sets, configuration, code. It lives in your accounts (your OpenAI/Anthropic key, your hosting, your repo). The consultant gives you documentation on how to make safe changes.

The wrong answer: anything that sounds like vendor lock-in.

7. "What's the kill switch?"

If the agent starts misbehaving at 2 a.m. on a Saturday, how do you turn it off without calling the consultant? There should be a single toggle — a feature flag, an environment variable, a button in an admin panel — that takes the agent offline and either routes traffic to a human queue or shows a "we're unavailable right now" message.

If there's no kill switch, you're not in control of your own business.

8. "What's your refund or redo policy?"

What happens if the agent doesn't hit the success criteria you agreed on? A serious consultant will have a defined answer — usually a fix-it-or-refund-the-final-milestone clause. If the answer is hand-wavy ("we'll figure it out, I'm sure you'll be happy"), get it in writing before you pay.

The 4 red flags that should end the conversation

Red flag 1: No production references

If they've never shipped a paying agent into production, they're learning on your dime. That can be okay at junior rates with junior expectations. It's not okay when you're paying senior rates and betting your customer experience on it.

Red flag 2: Can't explain a failure mode

Ask: "Tell me about a time an agent you built failed in production. What happened and how did you fix it?"

Anyone who's actually shipped will have a story. Hallucinated answers, runaway token costs, a tool call that hit the wrong API, a prompt injection that leaked a system message — every operator has scars. If they tell you they've never had a failure, they either haven't shipped or they're lying. Both are disqualifying.

Red flag 3: "Agentic" with no specifics

The word "agentic" is doing enormous work in AI sales decks right now. When someone says they build "agentic systems," ask what that actually means in their implementation. You want to hear about specific tools the agent can call, specific decision points, specific frameworks (LangGraph, OpenAI Assistants, custom orchestration, whatever).

If you get back another round of buzzwords — "autonomous reasoning," "self-improving loops," "next-gen workflows" — you're talking to a marketer, not a builder.

Red flag 4: All hype, no monitoring story

If the pitch is 90% about what the agent will do and 0% about how you'll know if it's still doing it next month, you're looking at a launch-and-leave operation. Real agents need real observability. If that's an afterthought in the sales process, it'll be an afterthought in the build.

What "good" actually looks like

When you're talking to someone who's done this before, the conversation feels different. They'll volunteer war stories without prompting. They'll walk you through an eval set on their screen — actual rows of test inputs and expected outputs. They'll show you a monitoring dashboard from a live client (anonymized) with latency graphs and escalation rates.

They'll also push back on your scope. If you ask for an agent to do something it shouldn't, they'll tell you. If you want full autonomy on actions that could lose you money, they'll insist on human-in-the-loop for those specific steps. That pushback is a feature, not friction — it's how you know they've been burned and learned.

The cost-vs-value math nobody runs

Here's the frame most small businesses miss. A $5,000 agent that breaks silently is more expensive than a $15,000 agent with monitoring. The first one looks cheaper on the invoice. The second one is cheaper after you count the customers the first one mishandled for three weeks before anyone noticed.

Monitoring, evals, guardrails, and kill switches aren't extras. They're the difference between "we have an AI agent" and "we have an AI agent we can actually trust." If a consultant's quote doesn't include those line items, you're not getting a deal — you're getting an incomplete product.

The same logic applies to handoff and documentation. Paying a little more upfront for clean ownership and a runbook saves you from paying a lot more later to a different consultant who has to reverse-engineer the first one's work.

A short hiring checklist

Before you send the deposit, you should have written answers to:

What use case is this agent solving, and what's the measurable success criteria?
What does the eval set look like, and what's the target pass rate?
Who gets paged when the agent breaks?
Where do prompts, code, and configs live, and who has access?
What's the monthly run cost (API usage + hosting + any retainer)?
What does month two look like after the build is done?

If those questions get clear, specific answers, you're probably in good hands. If they get vague answers or pivots back to capabilities, keep looking.

Wrapping up

Hiring an AI consultant isn't fundamentally different from hiring any other specialist — you're looking for evidence of real work, honesty about limits, and a structure that leaves you in control. The questions above are designed to surface all three. Use them, and you'll skip the painful 90 days most small businesses spend learning these lessons the expensive way.

If you'd like to run those questions past someone who'll answer them straight — including the failure stories — head to thewizrdz.io and use the contact form. Happy to walk through your use case and tell you honestly whether an agent is the right call, and what it should actually cost.