AI features that pay rent - not demos.
LLMs, RAG, and agents shipped into real products, with guardrails, evals, and cost control from day one.
You’ll recognise it if -
- You want LLMs, RAG, or agents inside a product - without the hallucination drama.
- The prototype looks great in demos, but nobody actually uses it in daily work.
- Answers are plausible, but the team can't tell when they're wrong.
- Cost per query is eating the margin, or quality has drifted since launch and nobody noticed.
- Your team is spending more time babysitting prompts than shipping features.
By the time we leave -
- An AI feature people reach for without being told to - adoption, not just pilot usage.
- An evaluation suite you can run on every change, so quality never silently drifts.
- Citations that link back to the exact source paragraph, so users can check the work.
- Cost, quality, and drift on a single dashboard someone on your side reads every week.
- Guardrails and fallback logic so the model can say "I don't know" without embarrassing anyone.
5 steps, in this order.
Every engagement follows roughly the same shape. The details change with your situation; the order rarely does.
- 01
Set the bar
We build an evaluation set from real user queries with ground truth answers. Before we change anything else, we know how good "good" looks - and what the current baseline actually is.
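The baseline step above can be sketched in a few lines. This is a minimal harness, assuming a simple substring-match grader; real graders are usually rubric- or LLM-based, and the queries, answers, and `run_evals` name here are all illustrative.

```python
# Minimal eval-harness sketch: real queries, known-good answers, one pass rate.
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalCase:
    query: str
    ground_truth: str  # a correct response must contain this

def run_evals(cases: list[EvalCase], answer: Callable[[str], str]) -> float:
    """Return the pass rate: fraction of cases whose answer contains the ground truth."""
    passed = sum(case.ground_truth.lower() in answer(case.query).lower() for case in cases)
    return passed / len(cases)

# Hypothetical cases and a stand-in for the current system, pre-changes.
cases = [
    EvalCase("What is our refund window?", "30 days"),
    EvalCase("Who approves expenses over $5k?", "finance lead"),
]
baseline = run_evals(cases, lambda q: "Refunds are accepted within 30 days.")
print(f"baseline pass rate: {baseline:.0%}")
```

The point is the shape, not the grader: a fixed case set, a single number, and a recorded baseline that every later change is measured against.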
- 02
Retrieval
Most RAG fails at retrieval, not generation. We tune chunking, hybrid search, reranking, and metadata filters against the eval set until retrieval is reliably good. Model swaps come later.
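One common way to combine the keyword and vector sides of hybrid search is reciprocal rank fusion. A sketch, with toy document IDs and rankings standing in for real search results:

```python
# Reciprocal rank fusion (RRF): merge several ranked lists of doc IDs into one.
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Docs ranked high in multiple lists accumulate the most score."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# Toy rankings: one from keyword search, one from vector search.
keyword_hits = ["doc_refunds", "doc_pricing"]
vector_hits = ["doc_refunds", "doc_security"]
fused = rrf([keyword_hits, vector_hits])
```

A document that both retrievers agree on ("doc_refunds" here) rises to the top even when neither score is comparable to the other's, which is exactly why RRF is a popular default before reaching for a reranker.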
- 03
Generation & guardrails
Careful prompting, structured outputs, tool use, and the ability to say "I don't know." Citations back to source paragraphs so users can verify, and refusal behaviour you can point to.
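The guardrail layer can be sketched as strict validation of the model's structured output: anything that fails the schema, or makes a claim without a verifiable citation, degrades to an explicit refusal. The JSON schema and source IDs below are illustrative, not a fixed format.

```python
# Guardrail sketch: accept model output only if it parses and cites known sources.
import json

REFUSAL = {"answer": "I don't know.", "citations": []}

def validate(raw: str, known_sources: set[str]) -> dict:
    """Return the model's answer, or an explicit refusal if it can't be verified."""
    try:
        out = json.loads(raw)
        answer = out["answer"]
        citations = out["citations"]
    except (json.JSONDecodeError, KeyError, TypeError):
        return REFUSAL
    if not citations or not all(c in known_sources for c in citations):
        return REFUSAL
    return {"answer": answer, "citations": citations}

sources = {"handbook.md#refunds"}
ok = validate('{"answer": "30 days.", "citations": ["handbook.md#refunds"]}', sources)
bad = validate('{"answer": "Probably 90 days.", "citations": []}', sources)
```

An uncited "probably 90 days" never reaches the user; it becomes "I don't know", which is the refusal behaviour you can point to.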
- 04
Cost & latency
Prompt caching, response caching, model routing (small model first, fall through to larger only when needed), batching where the workload allows. Usually cuts cost per query by 3–10×.
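The small-model-first routing mentioned above looks roughly like this. `call_small`, `call_large`, and the confidence score are stand-ins for real model clients; how you estimate confidence (logprobs, a verifier model, heuristics) is the hard part in practice.

```python
# Routing sketch: try the cheap model, escalate only when it isn't confident.
from typing import Callable

def route(query: str,
          call_small: Callable[[str], tuple[str, float]],
          call_large: Callable[[str], str],
          threshold: float = 0.8) -> tuple[str, str]:
    """Return (answer, model_used)."""
    answer, confidence = call_small(query)
    if confidence >= threshold:
        return answer, "small"
    return call_large(query), "large"

# Toy stand-ins: the small model is only confident on short queries.
small = lambda q: ("short answer", 0.9) if len(q) < 40 else ("unsure", 0.3)
large = lambda q: "careful long answer"
```

Because most traffic is easy, most queries never touch the expensive model - which is where the bulk of the 3-10x cost reduction tends to come from.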
- 05
Measure in prod
A weekly quality digest the right person on your side actually reads. Regression tests on every deploy. Drift alerts. Cost dashboard per feature. The bar we set in step one, defended forever after.
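The "defended forever" part is a regression gate in CI: a sketch, assuming the eval pass rate from step one is recorded as the baseline, with a small tolerance for noise. Numbers are illustrative.

```python
# CI regression gate: block the deploy if quality drops below baseline.
def regression_gate(current: float, baseline: float, tolerance: float = 0.02) -> bool:
    """True if the change may ship; False blocks the deploy."""
    return current >= baseline - tolerance

assert regression_gate(current=0.91, baseline=0.90)      # small wobble, ships
assert not regression_gate(current=0.84, baseline=0.90)  # real regression, blocked
```

Drift alerting is the same comparison run on a schedule against production samples instead of on each deploy.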
From the inside.
A typical mid-engagement week. Details shift with the project; the rhythm does not.
- Monday
Review last week's eval results. Pick the biggest quality regression and decide what changes this week. No change goes in without a hypothesis.
- Tuesday
Experimentation day. Retrieval tweaks, prompt changes, tool-use refactors. Every change is measured against the eval set before merging.
- Wednesday
Cost and latency review. Prompt caching, response caching, model routing. Usually the single highest-leverage work for AI in production after quality itself.
- Thursday
Ship a change behind a feature flag. 10% of traffic at first. Watch the quality metrics, the cost metrics, and the user feedback signal together.
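The 10%-of-traffic rollout is typically deterministic, so the same user always sees the same variant. A hash-bucketing sketch; the flag name is hypothetical:

```python
# Deterministic percentage rollout: hash (flag, user) into a 0-99 bucket.
import hashlib

def in_rollout(user_id: str, flag: str, percent: int) -> bool:
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).digest()
    bucket = int.from_bytes(digest[:2], "big") % 100
    return bucket < percent

# Roughly 10% of users land in the rollout, and membership is stable.
exposed = sum(in_rollout(f"user-{i}", "rag-v2", 10) for i in range(10_000))
```

Stable membership matters for the metrics watch: quality and cost deltas are only interpretable if the exposed population doesn't churn between reads.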
- Friday
Write the weekly quality digest for whoever on your side needs to care - usually the product lead. Ship the summary. Queue next week's experiment.
Every engagement starts with a week of discovery where we build the eval set and establish the baseline. You get that eval set regardless.
A small senior team, embedded with yours.
No pyramid, no rotating juniors, no partnership-level engineers who never open the editor. Everyone on the engagement ships code, writes the runbook, and answers the 3am page.
We size the team to the work, not the other way around. If you need fewer people, you get fewer. If the scope grows, we tell you - and you decide.
$35k – $150k per feature shipped
Applied-AI engagements tend to cluster in a narrower range than product or platform work because the scope is usually one feature. A focused RAG retrofit lands near the lower end; an agent system with tool use, guardrails, and a full evaluation programme lands higher. Ongoing retainer for quality monitoring is separate and optional.
- 01 Whether we're retrofitting an existing feature or building one from zero
- 02 Data complexity - single corpus vs multi-source retrieval
- 03 Tool use and agent depth - static prompts vs multi-step agents
- 04 How rigorous your eval programme needs to be (internal quality vs regulated environment)
- 05 Inference provider - hosted API vs self-hosted open-weight models
Things we’ll say no to.
We’d rather lose the sale on Monday than lose the trust by month three.
- No.01
You want a flashy demo for a pitch deck. We build things that work in production. If you need a proof of concept, a short contract with an AI prototyper is cheaper and honest.
- No.02
You want us to fine-tune a base model as the first move. 95% of applied-AI problems are retrieval, prompting, and evaluation problems. We will exhaust those before fine-tuning, and we will tell you when fine-tuning is the wrong choice.
- No.03
You want the model to never hallucinate. The model will sometimes be wrong. We build systems so your users can catch it when it is, not so it never happens.
- No.04
You want to replace humans with an agent, wholesale, on day one. Human-in-the-loop designs ship better outcomes for 12–24 months. We will argue this with you before we build.
Questions & answers.
Do we have to train our own model?
Almost never. 95% of applied AI problems are retrieval, prompting, evaluation, and product design problems - not model-training problems. Fine-tuning is sometimes the right call, but we will have exhausted everything else first.
How do you handle hallucinations?
Three layers: (1) retrieval that reliably returns the right context, (2) structured outputs and refusal behaviour baked into the prompt, (3) citations back to source paragraphs so users can verify. Hallucinations are rarely zero - the goal is that they're rare enough that trust compounds.
What about our data and privacy?
We default to keeping data inside your cloud. OpenAI and Anthropic's zero-retention / business tiers are fine for most workloads. For regulated or sensitive data, we run open-weight models on your infrastructure. We write down the data path before we build anything.
How do you know the AI feature is getting better, not worse?
Evals are the first thing we build, not the last. Every change - prompt, model, retrieval - runs against the same eval set, and the results are visible in CI. Quality is a line on a graph, not a vibe.
Other engagements we run.
Tell us what you’re shipping.
30 minutes, no pitch deck. We’ll ask what you’re building, what hurts, and whether we’re the right fit. If we’re not, we usually know someone who is.