Bring one workflow. An agent will run it.
Tier-one support replies. Lead research. Document processing. Internal Q&A. The same workflow your ops lead spends fifteen hours a week on. Four-week pilot. Eval harness on day one. Full IP transfer at SOW.
Anatomy of every agent we ship
Trigger
Email · webhook · cron
Classify
Intent + priority
Retrieve
Vector + SQL + tools
Reason
Plan · ReAct · CoT
Act
API · DB · CRM
Approve
Human gate · audit
4 wk
Pilot to production
90 d
Post-launch monitoring
Owned
IP at SOW signing
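The six stages above can be sketched as a single pipeline. This is an illustrative toy, not our production code: every function name and rule here is a stand-in for an LLM call, a retrieval layer, or a write to your systems of record.

```python
# Minimal sketch of the six-stage agent loop. All names and rules are
# illustrative stand-ins for real model calls and integrations.
from dataclasses import dataclass, field

@dataclass
class Ticket:
    body: str
    intent: str = ""
    priority: str = "normal"
    context: list = field(default_factory=list)
    plan: list = field(default_factory=list)
    actions: list = field(default_factory=list)
    approved: bool = False

def classify(t: Ticket) -> Ticket:
    # Intent + priority. Stand-in for an LLM classification call.
    t.intent = "refund" if "refund" in t.body.lower() else "other"
    t.priority = "high" if "urgent" in t.body.lower() else "normal"
    return t

def retrieve(t: Ticket) -> Ticket:
    # Vector + SQL + tools. Stand-in for real retrieval.
    t.context.append(f"policy-doc:{t.intent}")
    return t

def reason(t: Ticket) -> Ticket:
    # Plan before touching any external system.
    t.plan = [f"draft-reply:{t.intent}"]
    return t

def act(t: Ticket) -> Ticket:
    # API / DB / CRM writes would happen here, gated below.
    t.actions = list(t.plan)
    return t

def approve(t: Ticket) -> Ticket:
    # Human gate: high-priority work waits for sign-off.
    t.approved = t.priority != "high"
    return t

def run_agent(body: str) -> Ticket:
    t = Ticket(body=body)  # Trigger: email / webhook / cron
    for stage in (classify, retrieve, reason, act, approve):
        t = stage(t)
    return t
```

The point of the shape, not the toy rules: every agent passes through the same ordered stages, and the human gate sits after action planning, so nothing executes unapproved.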
Six agent patterns · we have shipped each many times
Pick one workflow. We'll quote it on the call.
Support Triage
Auto-classifies tickets, drafts replies for tier-one questions, escalates the rest. Plug into Zendesk, Freshdesk, Intercom, or your own queue.
Lead Research
Pulls intent signals from LinkedIn, the company blog, the founder's Twitter, and your CRM. Drafts personalised outreach for human review.
Document Processor
Invoices, vendor contracts, claims, KYC documents. PDF in, structured data out, sitting in your CRM or ERP. Audit trail for every extraction.
Internal Q&A
Slack or Teams bot answering from your wiki, Drive, Notion, Confluence. The tribal-knowledge person stops being a bottleneck.
MCP Server
Model Context Protocol server exposing your SaaS data to Claude, Cursor, and other MCP clients. Useful for internal AI tooling and AI-native products.
Workflow Agent
Multi-step ops automation with retries, tracing, human-in-the-loop. The kind that survives a year in production, not a hackathon demo.
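The "retries, tracing, human-in-the-loop" combination behind the Workflow Agent pattern reduces to a small control loop. A hypothetical sketch, with illustrative names: each step retries on transient failure, and a hard failure parks the work in a human queue instead of failing silently.

```python
# Illustrative retry-and-escalate wrapper for a multi-step workflow.
# Step functions, queue shape, and backoff values are hypothetical.
import time

def with_retries(step, max_attempts=3, backoff_s=0.0):
    """Run `step`, retrying on exception; re-raise after max_attempts."""
    def wrapped(payload):
        for attempt in range(1, max_attempts + 1):
            try:
                return step(payload)
            except Exception:
                if attempt == max_attempts:
                    raise
                time.sleep(backoff_s * attempt)
    return wrapped

def run_workflow(steps, payload, human_queue):
    """Run (name, step) pairs in order, tracing each outcome.

    On a hard failure the remaining steps are skipped and the payload
    is parked on `human_queue` for a person to pick up.
    """
    trace = []
    for name, step in steps:
        try:
            payload = with_retries(step)(payload)
            trace.append((name, "ok"))
        except Exception as exc:
            trace.append((name, f"escalated: {exc}"))
            human_queue.append((name, payload))
            break
    return payload, trace
```

The trace list is what makes the agent debuggable a year in: every step leaves a record of whether it succeeded or escalated, and escalation is a normal path, not an exception swallowed in a log.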
No framework religion
We pick frameworks for the job, not the trend.
Most agencies sell you whichever framework they know. We pick after the workflow audit. LangChain for ninety percent of it, plain Python when LangChain is too magical, Bedrock when compliance demands it, self-hosted Llama when your data cannot leave the building. Below is the actual matrix we apply.
LangChain
Default · most projects
CrewAI
Multi-agent orchestration
AutoGen
Rare · research-leaning
Pure Python
When LangChain is too magical
DSPy
Prompt eval / optimisation
Anthropic Claude
Reasoning-heavy work
OpenAI GPT
Bulk · cheap inference
AWS Bedrock
Compliance-bound clients
Llama (self-hosted)
Sensitive data on-prem
LangSmith / Langfuse
Tracing in production
The four gates we ship every agent through
Eval harness on day one. Demos do not earn production.
Every agent we ship passes the same four gates. We have rolled back at the rollout gate twice in 2025, and we tell you about both: that is what the harness is for. The gate target is locked in writing in the SOW, not invented later.
- 1
Day 1 — eval harness
Real labelled examples from your data. Accuracy target locked in writing before a single prompt is tuned.
- 2
Day 14 — guardrail tests
What the agent will refuse, what it will escalate, what it must never invent. Adversarial prompts in CI.
- 3
Day 21 — shadow mode
Agent runs against live traffic, output saved, no actions executed. Humans grade the diff.
- 4
Day 28 — rollout gate
Five percent of traffic for 72 hours. Ramp to 25%, 50%, then 100% only if KPIs hold. We have rolled back twice. We tell you about both.
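Gates one and two reduce to two small checks. A minimal sketch, assuming a toy agent and illustrative field names, of what "accuracy target locked in writing" and "adversarial prompts in CI" mean in code:

```python
# Day-1 eval harness and day-14 guardrail check, in miniature.
# The target, refusal token, and agent interface are illustrative.

def evaluate(agent, labelled_examples, target=0.90):
    """Return (accuracy, passed) for an agent over labelled pairs."""
    correct = sum(
        1 for inp, expected in labelled_examples if agent(inp) == expected
    )
    accuracy = correct / len(labelled_examples)
    return accuracy, accuracy >= target

def guardrail_pass(agent, adversarial_prompts, refusal="ESCALATE"):
    """Every adversarial prompt must produce a refusal, not an answer."""
    return all(agent(p) == refusal for p in adversarial_prompts)
```

Both checks run in CI from day one, so a prompt tweak that quietly drops accuracy below the SOW target, or makes the agent answer something it must refuse, fails the build instead of reaching traffic.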
A returns triage that survived a year
D2C beauty brand. Three weeks to ship. Twelve months without breaking.
- Tier-one tickets auto-handled
- 92% steady state
- Time saved per agent / day
- ~5 hrs
- Model swaps without rebuild
- 2 (Claude → Sonnet → Haiku)
- Drift incidents
- 1 · caught before customer
Their ops lead was spending two-and-a-half hours a day on returns email — same six categories, slightly different SKUs. We sat with her for a morning, took screen recordings, and the agent specification fell out by lunch.
The agent shipped in three weeks. We ran shadow mode for two more before letting it write a single email. Eight months later, when Anthropic deprecated the original model, we swapped first to Sonnet and later to Haiku without a rebuild; the abstraction layer paid for itself in one afternoon. The agent has been running for a year. Ninety-two percent of returns clear without a human, and the genuinely tricky remainder reaches the team faster because the queue is empty.
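The abstraction layer behind those model swaps is not exotic. A hypothetical sketch: agent code calls one `complete()` function, and the concrete model sits behind a registry keyed by config, so a deprecation is a one-line change. Model names and provider stubs here are illustrative, not our client's configuration.

```python
# Illustrative model-abstraction layer: swap providers via config,
# not a rebuild. Names and the active-model default are hypothetical.
from typing import Callable, Dict, Optional

_PROVIDERS: Dict[str, Callable[[str], str]] = {}

def register(name: str, fn: Callable[[str], str]) -> None:
    """Register a provider callable under a model name."""
    _PROVIDERS[name] = fn

ACTIVE_MODEL = "model-a"  # changed in config when a model is retired

def complete(prompt: str, model: Optional[str] = None) -> str:
    """Route a completion to whichever provider is configured."""
    return _PROVIDERS[model or ACTIVE_MODEL](prompt)
```

Agent code never names a vendor, so when the eval harness confirms a replacement model holds the accuracy target, the swap is a config edit plus a re-run of the gates.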
What you should be asking us
The questions we get on every agent call. Same answers we give there.
Ready when you are
Bring one workflow to a thirty-minute call. We'll tell you yes, no, or stretch.
We have turned down agent projects before. We would rather decline than ship something that fails at month four. Half an hour, your workflow, our honest read.
- Production agents in three industries
- Andrew Ng-credentialed founder
- Eval harness on day one
- Full IP transfer at SOW signing
