It feels like every few days one of the big AI services is down, degraded, or just slow. I don't mean this as a complaint. I'm just genuinely curious. These are well-funded companies with smart people. What is it about running these models that makes reliability so elusive? Is it just demand nobody predicted, or is there something fundamentally different about serving AI vs. a normal web app?
arc_light | 1 day ago | parent | on: 47753684
A piece that often gets overlooked: unlike a web app where you can cache aggressively and serve millions of users from relatively few servers, LLM inference resists caching — each request is expensive and effectively unique, so the work can't be amortized the way traditional workloads can. A spike in traffic doesn't just slow you down linearly, it creates queuing effects that cascade fast.
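To make the queueing point concrete, here's a toy M/M/1 sketch (the service rate and load levels are made-up numbers, not anything from a real provider):

```python
# Toy M/M/1 queue: mean latency ~ 1/(mu - lambda), which blows up
# nonlinearly as the arrival rate approaches capacity.

def avg_latency(arrival_rate, service_rate):
    """Mean time in system for an M/M/1 queue (stable only while arrival < service)."""
    assert arrival_rate < service_rate, "queue is unstable at or above capacity"
    return 1.0 / (service_rate - arrival_rate)

service_rate = 10.0  # hypothetical: requests/sec one instance can serve
for load in (0.5, 0.9, 0.99):
    lat = avg_latency(load * service_rate, service_rate)
    print(f"utilization {load:.0%}: avg latency {lat:.2f}s")
```

Going from 50% to 99% utilization — a 2x traffic increase — makes average latency roughly 50x worse, which is the "cascade" in a nutshell.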

There's also the memory side. A large model has to live entirely in GPU VRAM to run efficiently. You can't just "add more RAM" on the fly the way you can with CPU workloads. Scaling means acquiring, provisioning, and loading entirely new physical machines — which takes minutes to hours, not seconds.
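Rough numbers to illustrate the memory point (the 70B parameter count and 80 GB per GPU are hypothetical examples, and this ignores KV cache and activation memory, which add more on top):

```python
# Back-of-envelope VRAM estimate for serving a large model.
import math

def weights_gb(params_billions, bytes_per_param=2):
    """Weight memory in GB; 2 bytes/param corresponds to fp16/bf16."""
    return params_billions * 1e9 * bytes_per_param / 1e9

weights = weights_gb(70)     # hypothetical 70B-parameter model in fp16
per_gpu = 80                 # e.g. one 80 GB accelerator
gpus_needed = math.ceil(weights / per_gpu)
print(f"{weights:.0f} GB of weights -> at least {gpus_needed} GPUs, before KV cache")
```

The point: an instance isn't "a server", it's a coordinated group of accelerators that all have to be provisioned and loaded together before it can serve a single request.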

So you end up with a system that's simultaneously very expensive per-request, very hard to scale horizontally in real time, and very sensitive to traffic spikes. That's a reliability engineer's nightmare even before you factor in the supply constraints the sibling comment mentioned.

andyjohnson0 | 1 day ago | parent | on: 47753999
Not sure why this was apparently flagged to death. Vouched.
roywiggins | 20 hours ago | parent | on: 47755948
It's the em-dashes from a green account.
andyjohnson0 | 10 hours ago | parent | on: 47758397
Account's comment history didn't look particularly AI generated. And, as an organic human who uses em dashes myself, I kind of hope we can get past this simplistic take that they are a signifier of ai content.

Besides that, I thought the comment had something useful to say — whether ai-generated or not.

moomoo11 | 4 hours ago | parent | on: 47755948
Ai slop

If you can’t tell then damn idk man.

angarrido | 1 day ago | parent | on: 47753684
Most people think it’s just GPU cost. In practice it’s coordination: model latency variance + queueing + retries under load. You don’t scale linearly, you get cascading slowdowns.
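A toy sketch of the retries part (the request rate, failure rate, and retry policy are invented for illustration): when a backend degrades and clients start retrying, the offered load grows exactly when capacity is scarcest.

```python
# Retry amplification: each failed attempt generates another attempt,
# so the load hitting the backend grows as the failure rate rises.

def offered_load(base_rps, failure_rate, max_retries):
    """Total attempts/sec hitting the backend, assuming every failure is retried."""
    total = 0.0
    attempt_rps = base_rps
    for _ in range(max_retries + 1):
        total += attempt_rps
        attempt_rps *= failure_rate  # fraction that fails and comes back as a retry
    return total

print(offered_load(1000, 0.0, 3))  # healthy: 1000 attempts/sec
print(offered_load(1000, 0.5, 3))  # degraded: 1875 attempts/sec, making things worse
```

That feedback loop (slowdown → timeouts → retries → more load → more slowdown) is one reason a mild incident can turn into a full outage.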
zipy124 | 1 day ago | parent | on: 47753684
Likely one large contributor is that for a normal service, if it's down, it's as simple as re-routing to another instance, and there's an effectively unlimited supply of CPU servers around the world to spin up on demand. GPU servers are much harder to spin up on demand, as supply is so constrained.

Another factor is just that it's a new field and "move fast and break things" is still the default, since competition is fierce and the monetary stakes are enormous.

A pessimistic, but perhaps true, theory is that vibe-coding/slop is also reducing their reliability.

A counterpoint is that regular services like GitHub seem to go down almost as frequently.

blemis | 20 hours ago | parent | on: 47753684
honestly it's mostly gpu supply. scaling up to handle load means spinning up new nodes, and that takes minutes not seconds because the models are huge and need multiple coordinated gpus per instance.

also worth saying, even when things are "up" you often get different answers to the same question. that's the reliability problem nobody talks about. fine for a chatbot, not fine if you're building anything that needs to be repeatable and deterministic... i moved more to the ML route, but i guess it depends on what you are trying to do.
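The repeatability point goes surprisingly deep: even with sampling turned off, floating-point reductions performed in a different order (different batch sizes, different kernel schedules) can change logits in the last few bits, occasionally flipping a near-tie between tokens. Plain Python shows the underlying non-associativity:

```python
# Floating-point addition is not associative: the same numbers summed
# in a different order can give (slightly) different results.
a = (0.1 + 0.2) + 0.3
b = 0.1 + (0.2 + 0.3)
print(a == b)   # False
print(a, b)     # 0.6000000000000001 vs 0.6
```

A tiny discrepancy like this is harmless in isolation, but inside a model it can tip one token choice, after which the two generations diverge completely.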