AI Summary - 20-sec read - Reviewed by experts
- A managed API charges you per token and nothing when idle. A self-hosted model charges you for a GPU by the hour whether it is busy or not, so the whole question is utilization, not the sticker price per token.
- Self-hosting only gets cheaper once your sustained token volume keeps the GPU genuinely busy. Below that threshold you are paying full price for hardware that sits idle most of the day.
- The three numbers that decide it: your monthly token volume, the throughput one GPU gives you for your chosen model, and how close to 24x7 you can keep that GPU loaded.
- The hidden costs that move the break-even point: engineering time, on-call, redundancy for uptime, and idle capacity overnight and on weekends. Most teams leave these out, so their maths flatters self-hosting.
- Short on time? We will run your real numbers and tell you honestly whether to self-host or stay on an API. Book a free call.
Someone on your team has done the back-of-envelope sum: an open model is free, a GPU is a few hundred pounds a month, and your API bill is climbing, so self-hosting must be cheaper. Sometimes it is. Often it is not, and the gap is not small. The per-token price you see on a pricing page is not what self-hosting costs you. What self-hosting costs is a GPU running around the clock, busy or idle, and the real question is how much of that time you can actually keep it working.
Why "self-hosting is cheaper" is only half true
A managed inference API has one property that is easy to undervalue: it bills you for work done. Send a thousand tokens, pay for a thousand tokens. Send nothing overnight, pay nothing overnight. The provider absorbs the idle time, the spare capacity, and the cost of keeping a fleet warm. You rent usage, not hardware.
Self-hosting flips that. You rent a GPU by the hour, and it costs the same whether it is saturated with requests or sitting at two percent load at 3am. So the comparison is never "free model versus paid API". It is "a fixed hourly hardware bill versus a variable per-token bill". Fixed beats variable only when you have enough steady volume to spread that fixed cost across a lot of tokens. That is the entire game, and it is why two teams running the same model can reach opposite conclusions. We lay out the full trade-off in our self-hosted versus API LLM cost and performance comparison; this piece is the narrower question of where the line actually sits.
Not sure whether your volume justifies your own GPU?
We will take your real token volume and traffic pattern and model both paths - per-token API versus a self-hosted GPU at honest utilization - so the decision is arithmetic, not a hunch. No pitch, reply in 2 hrs, no card needed, NDA on request.
The three numbers that decide it
You do not need a spreadsheet with forty inputs. Three numbers carry almost all of the answer.
- Your monthly token volume. Input plus output tokens across all your traffic. This is what an API bills against, so it is also what you must beat with owned hardware.
- Throughput per GPU for your model. How many tokens a second one GPU sustains for the specific model and quantization you intend to run. A small 7-8B model on a single mid-range GPU is a very different number from a 70B model that needs multiple cards.
- Realistic utilization. What fraction of the day that GPU is actually serving requests. A B2B tool used in working hours might see 20-35 percent. A consumer app with global traffic might hold 60 percent or more. This single number moves the break-even point more than anything else.
Hold those three in mind and the logic falls out. Owned hardware has a fixed monthly cost. Divide that cost by the tokens you actually push through it, and you get your real cost per token. If that number is below the API price, self-host. If it is above, the API is cheaper. Idle time is not free; it just raises your cost per token by shrinking the denominator.
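The division described above can be written down in a few lines. This is a minimal sketch, not a pricing tool: the function names, the 30-day month, and the idea of deriving tokens served from throughput times utilization are assumptions made for illustration. If you already know your monthly token volume, you can use it directly as the denominator instead.

```python
SECONDS_PER_MONTH = 30 * 24 * 3600  # assumption: a 30-day month


def self_hosted_cost_per_million(monthly_gpu_cost, tokens_per_second, utilization):
    """Real cost per million tokens for owned hardware.

    utilization is the fraction of the month the GPU is actually serving
    requests (0 < utilization <= 1). Idle time shrinks the denominator.
    """
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    tokens_served = tokens_per_second * SECONDS_PER_MONTH * utilization
    return monthly_gpu_cost / tokens_served * 1_000_000


def cheaper_option(monthly_gpu_cost, tokens_per_second, utilization, api_price_per_million):
    """Return 'self-host' if owned hardware beats the API price, else 'api'."""
    own = self_hosted_cost_per_million(monthly_gpu_cost, tokens_per_second, utilization)
    return "self-host" if own < api_price_per_million else "api"
```

Note that this compares hardware cost only. Engineering time and redundancy are not in the numerator yet, so the result is a best case for self-hosting.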
A worked break-even example
Use round, illustrative numbers to see the shape. Say a single GPU instance suitable for a small open model costs roughly 600 to 900 a month on demand, less if you reserve it. Say a managed API for a comparable model charges somewhere around 0.30 to 0.60 per million tokens, in the same currency. And say one GPU, well batched, sustains about 1,000 tokens a second, which is roughly 2.6 billion tokens over a 30-day month if you keep it busy around the clock.
Now the two scenarios diverge entirely on utilization. Keep that GPU near saturation and you spread the 600-900 across about 2.6 billion tokens, landing at roughly 0.23 to 0.35 per million tokens - at or below the API rate, so self-hosting wins on hardware cost. Run the same GPU at 20 percent because your traffic is bursty and office-hours only, and you are paying the full monthly bill to push about 520 million tokens, so your real cost is roughly 1.15 to 1.75 per million tokens, two to six times the API price. Same hardware, same model, opposite answer. The lesson is blunt: low, spiky volume favours the API; high, steady volume favours owning the GPU. Most teams sit closer to the first case than they assume, which is why monitoring real usage matters - the same discipline we describe in monitoring AI agent costs in real time.
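To see how quickly the answer flips, it helps to sweep utilization while holding everything else fixed. The sketch below uses placeholder inputs: 750 a month and 0.45 per million tokens are midpoints of the illustrative ranges above, and the full-load capacity of 2.5 billion tokens a month is a hypothetical figure you should replace with a measured one for your model and GPU.

```python
def cost_per_million(monthly_cost, full_load_tokens, utilization):
    """Cost per million tokens when the GPU serves only a fraction of its capacity."""
    return monthly_cost / (full_load_tokens * utilization) * 1_000_000


def utilization_sweep(monthly_cost, full_load_tokens, api_price, levels):
    """Return (utilization, own cost per million, ratio to API price) for each level."""
    rows = []
    for u in levels:
        own = cost_per_million(monthly_cost, full_load_tokens, u)
        rows.append((u, own, own / api_price))
    return rows


# Placeholder inputs, not measurements.
for u, own, ratio in utilization_sweep(750.0, 2.5e9, 0.45, [1.0, 0.6, 0.35, 0.2]):
    verdict = "self-host" if ratio < 1 else "api"
    print(f"{u:>4.0%}  {own:5.2f} per M tokens  {ratio:4.1f}x API price  -> {verdict}")
```

With these placeholder inputs the cost per million tokens rises from 0.30 at full load to 1.50 at 20 percent utilization. The hardware bill never changes; only the number of tokens it is divided by does.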
Takeaways
- The comparison is a fixed hourly GPU bill versus a variable per-token bill - never "free model versus paid API".
- Self-hosting wins only when steady volume keeps the GPU busy; idle time quietly raises your real cost per token.
- Three numbers decide it: monthly token volume, throughput per GPU for your model, and realistic utilization.
- Add engineering, on-call, and redundancy before you compare - they push the break-even higher than the raw hardware bill suggests.
The costs people forget
The hardware bill is the part everyone counts. The parts that quietly change the answer are the ones left off the napkin.
First, engineering time. Someone has to stand up the serving stack, tune batching, handle model updates, and debug it at 2am when throughput collapses. That is real salary, and on a small team it is the scarcest resource you have. Second, redundancy. One GPU is a single point of failure; the moment this is customer-facing you want at least two, which roughly doubles the fixed cost and pushes your break-even volume up with it. Third, idle capacity for peaks. To handle your busiest hour without queueing, you size for the peak and then pay for that headroom through all the quiet hours too.
None of this means self-hosting is wrong. It means the honest fixed cost is higher than the instance price, so you need more volume than the simple sum suggests before owning the hardware pays off. When we plan inference infrastructure, we cost the operating reality, not just the compute, the same way we approach any build on AI workloads on AWS.
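One way to make the "honest fixed cost" concrete is to fold redundancy and engineering time into the monthly bill and then ask how many tokens you need to push before it beats the API. This is a rough sketch under stated assumptions: the class name, its fields, and the idea of expressing engineering time as a single monthly figure are all hypothetical simplifications, and peak headroom is modelled simply as extra replicas.

```python
from dataclasses import dataclass


@dataclass
class SelfHostPlan:
    instance_cost: float            # monthly price of one GPU instance
    replicas: int = 1               # 2 or more once the service is customer-facing
    engineering_cost: float = 0.0   # hypothetical monthly share of salary and on-call

    def fixed_monthly_cost(self) -> float:
        """Everything you pay per month regardless of traffic."""
        return self.instance_cost * self.replicas + self.engineering_cost

    def break_even_tokens(self, api_price_per_million: float) -> float:
        """Monthly token volume at which owning costs the same as the API."""
        return self.fixed_monthly_cost() / api_price_per_million * 1_000_000


napkin = SelfHostPlan(instance_cost=750)
honest = SelfHostPlan(instance_cost=750, replicas=2, engineering_cost=1500)
print(napkin.break_even_tokens(0.5), honest.break_even_tokens(0.5))
```

The napkin plan and the honest plan use the same GPU, but the second needs several times the monthly volume before it pays off, which is the shift the paragraphs above describe.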
When self-hosting actually wins
Self-hosting earns its place in a few clear cases. You have high, steady volume that keeps GPUs genuinely busy, so the fixed cost spreads thin. You have data or compliance constraints that make sending text to a third party a non-starter, in which case cost is not even the deciding factor. You need a specific fine-tuned or open model the APIs do not offer. Or you have hit a scale where your API bill is large enough that a dedicated team managing your own fleet is plainly cheaper than the per-token rate.
The API wins in the opposite cases, which describe most teams most of the time: early-stage or unpredictable volume, bursty office-hours traffic, a small team without spare engineering capacity, or a need to move fast without owning infrastructure. The right answer often changes as you grow - start on an API to ship, watch your volume and utilization, and revisit self-hosting only when the numbers, not the instinct, say it is time. If you want help choosing or sizing the move, that is exactly what our AI development services team does.
Frequently asked questions
Is an open model really free?
The model weights can be free to download and use, but running them is not. Inference costs you GPU time, and GPU time is billed by the hour regardless of load. "Free model" and "free to operate" are different claims, and the second one is never true at production scale.
What utilization do I need before self-hosting pays off?
There is no single figure, because it depends on your model, GPU throughput, and the API price you are comparing against. The principle holds though: the higher and steadier your utilization, the better self-hosting looks. Bursty, low utilization almost always favours the API because you pay for idle hardware.
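Although there is no universal figure, you can solve for your own. A minimal sketch, assuming you have measured how many tokens your hardware serves in a month at full load (the function and parameter names are illustrative):

```python
def break_even_utilization(monthly_fixed_cost, full_load_tokens_per_month, api_price_per_million):
    """Fraction of full load at which self-hosting costs the same as the API.

    A result above 1.0 means the hardware cannot beat the API price
    even when saturated.
    """
    api_bill_at_full_load = full_load_tokens_per_month / 1_000_000 * api_price_per_million
    return monthly_fixed_cost / api_bill_at_full_load


print(break_even_utilization(750, 2.5e9, 0.45))   # placeholder inputs
print(break_even_utilization(1500, 2.5e9, 0.45))  # same GPU with a redundant replica
```

If your measured utilization sits comfortably above the returned value, self-hosting is worth a closer look. If the returned value is close to or above 1.0, no realistic traffic pattern will make that setup cheaper than the API.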
Does fine-tuning change the maths?
It can. A smaller fine-tuned open model that matches a larger general model on your task may run on cheaper hardware at higher throughput, which lowers your self-hosted cost per token. But you still have to keep that GPU busy enough to clear the break-even, so utilization remains the deciding variable.
Can I mix both?
Yes, and many teams should. Route steady, high-volume, low-complexity traffic to a self-hosted model and send spiky or harder requests to an API. That keeps your owned GPU well utilized while the API absorbs the peaks you would otherwise overprovision for.
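A simple way to compare the mixed setup against the alternatives is to take a typical day's traffic by hour, fill the owned GPU first, and price only the overflow at the API rate. The sketch below is a simplified cost model, not a router: the function names, the hourly granularity, and the example traffic profile are assumptions, and it ignores the engineering cost of running two paths.

```python
import math


def api_only_monthly_cost(hourly_tokens, api_price_per_million, days=30):
    """All traffic goes to the API."""
    return sum(hourly_tokens) * days / 1_000_000 * api_price_per_million


def hybrid_monthly_cost(hourly_tokens, gpu_capacity_per_hour, gpu_monthly_cost,
                        api_price_per_million, days=30):
    """One owned GPU takes traffic up to its capacity; the rest overflows to the API."""
    overflow_per_day = sum(max(0, t - gpu_capacity_per_hour) for t in hourly_tokens)
    return gpu_monthly_cost + overflow_per_day * days / 1_000_000 * api_price_per_million


def peak_sized_monthly_cost(hourly_tokens, gpu_capacity_per_hour, gpu_monthly_cost):
    """Self-host everything, with enough GPUs to cover the busiest hour."""
    gpus = math.ceil(max(hourly_tokens) / gpu_capacity_per_hour)
    return gpus * gpu_monthly_cost


# Hypothetical day: 20 quiet hours and a 4-hour peak.
profile = [1_000_000] * 20 + [5_000_000] * 4
print(api_only_monthly_cost(profile, 0.5))
print(hybrid_monthly_cost(profile, 2_000_000, 600, 0.5))
print(peak_sized_monthly_cost(profile, 2_000_000, 600))
```

Running all three on your own hourly profile shows whether the owned GPU's base load is large enough to justify it, or whether the traffic is still small enough that the API alone is cheapest.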
The short version: ignore the sticker price per token and look at utilization. Self-hosting is a fixed bet that pays off only when you keep the hardware busy. Run your three numbers honestly, add the costs that never make the napkin, and let the arithmetic decide - not the feeling that owning it must be cheaper.
Founder and CEO of Braincuber. Has scoped and shipped 500+ Odoo, AI, and cloud projects for US mid-market and global brands. Takes every founder call personally — no SDR layer between buyers and the people building the system.
