I'm running Open WebUI + Ollama with a 7B model in a Proxmox LXC container. It uses less than 2GB of RAM and about 50% CPU, and the GPU only has 4GB of VRAM. It's very usable, sometimes faster than the online services to start giving you an answer, and 100% offline.
If I replaced the GPU with a faster one, I'd have no need for the online services at all.
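For anyone wanting to reproduce this, here's a minimal sketch of querying that kind of setup from Python over Ollama's REST API. The model tag (`qwen2.5:7b`) and the default port are assumptions; substitute whatever 7B model you've actually pulled:

```python
import requests

OLLAMA_URL = "http://localhost:11434"  # Ollama's default port; adjust if the LXC maps it elsewhere

def ask_local(prompt: str, model: str = "qwen2.5:7b") -> str:
    """Send a single non-streaming generation request to a local Ollama server."""
    resp = requests.post(
        f"{OLLAMA_URL}/api/generate",
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=300,  # generous timeout: first call may need to load the model into VRAM
    )
    resp.raise_for_status()
    return resp.json()["response"]

if __name__ == "__main__":
    print(ask_local("Explain LXC containers in one sentence."))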
My last experiment in January was trying to run a Qwen model locally (RTX 4080; 128GB RAM; 9950X3D). I must have been doing it extremely wrong because the models that I tried either hallucinated severely or got stuck in a loop. The funniest one was stuck in a "but wait, ..." loop.
Fortunately, I had already started experimenting with Claude, so I opted to pay Anthropic more money for tokens (work already covers the bill; this was for personal use).
That whole experience, plus a noisy GPU, put me off the idea of running/building local agents.
Nice to score one of those.
I've used several agent frameworks, and they all support many different providers, from cloud to local; the provider and the agent logic are orthogonal responsibilities. I'm using Vertex AI for cloud and Ollama on a Minisforum box with ROCm locally, and there's a dropdown to switch between them.
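To illustrate why these are orthogonal: the agent code talks to one client interface, and the "dropdown" is just a lookup into a provider table. A sketch using Ollama's documented OpenAI-compatible endpoint for the local side; the cloud entry here is a placeholder (real Vertex AI auth and endpoints differ), and all model names are assumptions:

```python
from openai import OpenAI

# Provider registry: the agent logic never changes, only this table.
# "local" uses Ollama's OpenAI-compatible /v1 endpoint; "cloud" is a
# hypothetical placeholder, not a real Vertex AI configuration.
PROVIDERS = {
    "local": {"base_url": "http://localhost:11434/v1", "api_key": "ollama", "model": "qwen2.5:7b"},
    "cloud": {"base_url": "https://example-cloud-endpoint/v1", "api_key": "YOUR_KEY", "model": "some-hosted-model"},
}

def get_client(provider: str) -> tuple[OpenAI, str]:
    """Build a client for the chosen provider; the caller's code is provider-agnostic."""
    cfg = PROVIDERS[provider]
    return OpenAI(base_url=cfg["base_url"], api_key=cfg["api_key"]), cfg["model"]

client, model = get_client("local")  # the "dropdown": swap the string to change providers
reply = client.chat.completions.create(
    model=model,
    messages=[{"role": "user", "content": "Hello from the agent"}],
)
print(reply.choices[0].message.content)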
All you need is a used GPU slapped onto any disused DDR4 mobo. The new 5060s, the 16GB models, can do basically everything now.
If you mean that you can't just run the largest unquantized models, then that's indeed true.
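To make that concrete: a quantized mid-size model fits comfortably in 16GB of VRAM. A sketch with the ollama Python client; the exact tag is an assumption, so check the Ollama library for current ones:

```python
import ollama

# q4_K_M quantization brings an 8B model down to roughly 5GB,
# well within a 16GB card. Tag is an assumption; verify it exists.
MODEL = "llama3.1:8b-instruct-q4_K_M"

ollama.pull(MODEL)  # downloads the quantized weights if not already present
resp = ollama.chat(
    model=MODEL,
    messages=[{"role": "user", "content": "Say hi in one sentence."}],
)
print(resp["message"]["content"])
```

The largest frontier-scale models at full precision remain out of reach on consumer hardware, but that's a different claim from "local models are unusable."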