A piece that often gets overlooked: unlike a web app where you can cache aggressively and serve millions of users from relatively few servers, LLM inference is stateless in a weird way — each request is expensive and can't really be batched the way traditional workloads can. A spike in traffic doesn't just slow you down linearly, it creates queuing effects that cascade fast.
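As a rough illustration (my own sketch, not from the comment), the textbook M/M/1 queueing formula W = 1/(μ − λ) captures why the slowdown isn't linear: mean latency stays modest at 50% utilization but explodes as arrivals approach capacity. The service rate below is an assumed number for illustration.

```python
def mean_latency(arrival_rate: float, service_rate: float) -> float:
    """Mean time in system for an M/M/1 queue, W = 1 / (mu - lambda).

    Rates are in requests/sec. A hypothetical model of one inference server.
    """
    if arrival_rate >= service_rate:
        raise ValueError("queue is unstable: arrivals exceed capacity")
    return 1.0 / (service_rate - arrival_rate)


service_rate = 10.0  # assume the server completes 10 requests/sec
for load in (0.5, 0.9, 0.99):
    # latency climbs from ~0.2s at 50% load to ~10s at 99% load
    print(f"utilization {load:.0%}: {mean_latency(load * service_rate, service_rate):.2f}s")
```

Going from 90% to 99% utilization multiplies latency tenfold, which is the cascading queueing effect the comment describes.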

There's also the memory side. A large model has to live entirely in GPU VRAM to run efficiently. You can't just "add more RAM" on the fly the way you can with CPU workloads. Scaling means acquiring, provisioning, and loading entirely new physical machines — which takes minutes to hours, not seconds.
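A back-of-envelope calculation (my own sketch, using illustrative numbers) makes the point concrete: weights alone for a 70B-parameter model in fp16 need about 140 GB, which already exceeds a single 80 GB GPU before counting KV cache or activations.

```python
def weight_vram_gb(n_params_billions: float, bytes_per_param: int = 2) -> float:
    """GPU memory needed for model weights alone, in GB.

    Ignores KV cache, activations, and framework overhead, so this is a
    strict lower bound. fp16/bf16 is 2 bytes per parameter.
    """
    return n_params_billions * 1e9 * bytes_per_param / 1e9


print(weight_vram_gb(70))     # 70B params in fp16 -> 140.0 GB
print(weight_vram_gb(70, 1))  # same model at 8-bit -> 70.0 GB
```

Either way the model has to be sharded across multiple GPUs that must be provisioned and loaded up front, which is why capacity can't stretch on demand.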

So you end up with a system that's simultaneously very expensive per-request, very hard to scale horizontally in real time, and very sensitive to traffic spikes. That's a reliability engineer's nightmare even before you factor in the supply constraints the sibling comment mentioned.

andyjohnson0 | 1 day ago
Not sure why this was apparently flagged to death. Vouched.
roywiggins | 23 hours ago
It's the em-dashes from a green account.
andyjohnson0 | 12 hours ago
The account's comment history didn't look particularly AI-generated. And, as an organic human who uses em dashes myself, I kind of hope we can get past this simplistic take that they are a signifier of AI content.

Besides that, I thought the comment had something useful to say — whether AI-generated or not.

moomoo11 | 6 hours ago
AI slop.

If you can’t tell then damn idk man.