MLOps

Putting a cost ceiling on your AI before the bill puts one on you.

Per-tenant cost dashboards, token budgets, model fallbacks, and the small infra tweaks that knocked our average client's inference bill down 47%.

Bilal T.

Engineer · ProCoders

Jan 22, 2026 · 8 min read


We get asked about this almost every week, so here is the actual playbook: the same one we ran on the engagement behind the headline above. Read it as a how-to, not a marketing piece. The interesting bits are in the details.

1. Start with the smallest possible problem.

The mistake most teams make on day one is choosing too broad a target. Don't try to automate "support." Pick the three ticket categories that cover 70% of volume, and build something exceptional for those. Expand later — once the system has earned the right.

On the engagement that inspired this post, those three categories were password reset, plan downgrade, and refund inquiry. Boring? Yes. High-volume? Also yes. Between them, they accounted for 6,200 tickets a month before we touched a thing.

Rule of thumb

If a category appears fewer than 100 times a month in your ticket archive, don't bother automating it on day one. The eval cost is higher than the savings.
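To make the selection concrete, here is a minimal sketch of the rule above. It assumes your ticket archive can be reduced to one category label per ticket for a single month; the function name, the label strings, and the cap of three categories are illustrative, not part of any particular helpdesk API.

```python
from collections import Counter


def pick_launch_categories(ticket_categories, coverage_target=0.70,
                           min_monthly_count=100, max_categories=3):
    """Pick the highest-volume categories until the coverage target is met.

    ticket_categories: iterable of category labels, one per ticket,
    covering one month of the archive (an assumption of this sketch).
    Returns (chosen_categories, fraction_of_volume_covered).
    """
    counts = Counter(ticket_categories)
    total = sum(counts.values())
    chosen, covered = [], 0
    for category, count in counts.most_common():
        # Stop at the first category below the volume floor; everything
        # after it in most_common() order is at least as rare.
        if count < min_monthly_count or len(chosen) >= max_categories:
            break
        chosen.append(category)
        covered += count
        if covered / total >= coverage_target:
            break
    return chosen, (covered / total if total else 0.0)


# Hypothetical month of tickets, for illustration only.
tickets = (["password_reset"] * 300 + ["plan_downgrade"] * 250
           + ["refund_inquiry"] * 200 + ["billing_question"] * 150
           + ["sso_setup"] * 60 + ["data_export"] * 40)
chosen, coverage = pick_launch_categories(tickets)
```

If the returned coverage lands well under the target, that is a signal the ticket taxonomy is too fragmented to pick a clean day-one scope, not a reason to add a fourth category.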

2. Build the eval harness before the bot.

Every system we ship gets a golden dataset before it gets a single production prompt. The harness measures four things:

Resolution. Did the bot actually answer the question?
Safety. Did it avoid hallucinating policy or leaking PII?
Tone. Did it sound like your brand, not like GPT?
Escalation. When it was unsure, did it hand off cleanly?

Every prompt change, model swap, and tool revision runs through the suite. Regressions block merges. By the end of week two the system beats the human baseline on the resolution metric — and it stays that way because the harness keeps us honest.
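A merge gate along these lines can be sketched in a few lines. This assumes each golden-dataset case has already been graded pass/fail on the four metrics (by a human or an automated grader); the `GradedCase` type, the function names, and the zero-tolerance default are assumptions for illustration.

```python
from dataclasses import dataclass

METRICS = ("resolution", "safety", "tone", "escalation")


@dataclass(frozen=True)
class GradedCase:
    """One golden-dataset case with a pass/fail grade per metric."""
    case_id: str
    resolution: bool
    safety: bool
    tone: bool
    escalation: bool


def pass_rates(cases):
    """Fraction of cases passing each metric."""
    if not cases:
        raise ValueError("golden dataset is empty")
    return {m: sum(getattr(c, m) for c in cases) / len(cases) for m in METRICS}


def find_regressions(baseline, candidate, tolerance=0.0):
    """Metrics where the candidate scores below baseline minus tolerance.

    Returns {metric: (baseline_rate, candidate_rate)}; an empty dict
    means the change is safe to merge under this gate.
    """
    return {
        m: (baseline[m], candidate[m])
        for m in METRICS
        if candidate[m] < baseline[m] - tolerance
    }
```

In CI, the job would compute `pass_rates` for the candidate prompt or model, compare it against the stored baseline, and fail the build whenever `find_regressions` returns anything.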

3. Soak before scale.

Before a single customer sees an automated reply, the system spends two weeks reading every incoming ticket and drafting a response that only your support leads can see. You diff the drafts against the humans and grade each one. The disagreements drive your last eval iterations.

By the end of soak, you'll know exactly which categories the bot owns, which still need humans, and where the gray area is.
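One way to turn the graded drafts into that three-way split is sketched below. It assumes each soak draft ends up as a (category, accepted) pair after a support lead grades it; the 95% and 80% cutoffs are placeholders we made up for the example, so pick thresholds that match your own risk tolerance.

```python
from collections import defaultdict


def classify_categories(graded_drafts, own_at=0.95, human_below=0.80):
    """Bucket each category by how often graders accepted the bot's draft.

    graded_drafts: iterable of (category, accepted) pairs, where accepted
    is True when the support lead judged the draft good enough to send.
    Returns {category: (bucket, acceptance_rate)} with bucket being
    "bot", "gray", or "human".
    """
    totals = defaultdict(lambda: [0, 0])  # category -> [accepted, graded]
    for category, accepted in graded_drafts:
        totals[category][0] += int(accepted)
        totals[category][1] += 1

    result = {}
    for category, (ok, graded) in totals.items():
        rate = ok / graded
        if rate >= own_at:
            bucket = "bot"
        elif rate < human_below:
            bucket = "human"
        else:
            bucket = "gray"
        result[category] = (bucket, rate)
    return result
```

The gray bucket is the useful output: those disagreements are the cases worth feeding back into the golden dataset before launch.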

Production AI is just software. The thing that makes it hard is not the model — it's the part that ships, monitors, and earns the right to keep running.

4. Ship, then keep evaluating.

The system goes live on a canary at 5%. Within a week we're at 50%. Within two weeks, full traffic. Auto-resolution starts at around 24% on day one and climbs every week as the prompts and category routing tighten. By day 30 it's hitting 80% — and customer satisfaction has moved up, not down.
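For the canary itself, a deterministic hash bucket is a common way to split traffic, and it is what this sketch assumes; the post does not say how the split was implemented on the original engagement. The function name and the choice of ticket ID as the routing key are hypothetical.

```python
import hashlib


def in_canary(ticket_id: str, rollout_percent: int) -> bool:
    """Decide whether a ticket is handled by the automated path.

    Hashing the ID gives each ticket a stable bucket from 0 to 99, so the
    same ticket always gets the same answer, and raising the percentage
    only ever adds tickets to the canary, never removes them.
    """
    if not 0 <= rollout_percent <= 100:
        raise ValueError("rollout_percent must be between 0 and 100")
    digest = hashlib.sha256(ticket_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < rollout_percent


# Rollout stages from the text: 5%, then 50%, then full traffic.
for percent in (5, 50, 100):
    handled = in_canary("ticket-12345", percent)
```

Keying on a customer ID instead of a ticket ID would keep each customer on one side of the split for the whole rollout, which can make satisfaction comparisons cleaner.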

That last point matters. Done right, automation doesn't degrade the experience. It removes the friction that was already there.

What we'd do differently

Three things, in hindsight: start the eval harness from day one (we waited a week), recruit two support leads to grade soak outputs in parallel (we used one and bottlenecked), and bake escalation routing into the UX, not just the model. We fixed all three on the next build.

If you're thinking about running this playbook on your own stack — we're happy to look at it for free. Send us your ticket archive and we'll come back within 24 hours with a take.

About the author

Bilal T.

Engineer at ProCoders. Spends most of the day shipping production AI systems for clients across SaaS, FinTech, and consumer. Writes here when something is worth a writeup.
