honestly it's mostly gpu supply. scaling up to handle load means spinning up new nodes, and that takes minutes, not seconds, because the models are huge and each instance needs multiple coordinated gpus.
also worth saying: even when things are "up" you often get different answers to the same prompt. that's the reliability problem nobody talks about. fine for a chatbot, not fine if you're building anything that needs repeatable, deterministic output... i moved more toward the ML route myself, but it depends on what you're trying to do.
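if you do stay on the api side you can at least measure and narrow the variance. here's a minimal sketch of what i mean, assuming the openai python client (the model name and prompt are just placeholders, not recommendations): pin temperature to 0, pass a fixed seed, repeat the same call a few times, and count how many distinct answers come back. it reduces the spread, but the seed is only best-effort, so don't expect byte-identical output.

```python
# sketch: probing how repeatable an LLM endpoint actually is
# assumes the openai python client; model name and prompt are placeholders
from collections import Counter

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = "List three prime numbers between 10 and 30."

def ask_once() -> str:
    # temperature=0 plus a fixed seed reduces randomness, but the api only
    # promises best-effort determinism, not identical bytes every call
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": PROMPT}],
        temperature=0,
        seed=42,
    )
    return resp.choices[0].message.content

# ask the same question several times and count the distinct answers
answers = Counter(ask_once() for _ in range(5))
for text, n in answers.items():
    print(f"{n}x: {text!r}")
print(f"{len(answers)} distinct answer(s) across 5 identical calls")
```

if that count comes back above 1 on a prompt you need to be stable, that's usually the signal to move the logic into something you control (rules, a classical model, etc.) rather than fight the endpoint.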