I'm running Open WebUI + Ollama with a 7B model in a Proxmox LXC container. It uses less than 2GB of RAM and about 50% CPU, and the GPU only has 4GB of VRAM. It's very usable, sometimes faster than the online services to start giving you an answer, and 100% offline.
If I replaced the GPU with a faster one, I'd have no need for the online services at all.
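For anyone wanting to reproduce this, here's a minimal sketch of querying that kind of setup from Python over Ollama's REST API. The model tag (`qwen2.5:7b`) and the default port are assumptions; substitute whatever 7B model you've actually pulled:

```python
import requests

OLLAMA_URL = "http://localhost:11434"  # Ollama's default port; adjust if the LXC maps it elsewhere

def ask_local(prompt: str, model: str = "qwen2.5:7b") -> str:
    """Send a single non-streaming generation request to a local Ollama server."""
    resp = requests.post(
        f"{OLLAMA_URL}/api/generate",
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=300,  # generous timeout: first call may need to load the model into VRAM
    )
    resp.raise_for_status()
    return resp.json()["response"]

if __name__ == "__main__":
    print(ask_local("Explain LXC containers in one sentence."))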
My last experiment in January was trying to run a Qwen model locally (RTX 4080; 128GB RAM; 9950X3D). I must have been doing it extremely wrong because the models that I tried either hallucinated severely or got stuck in a loop. The funniest one was stuck in a "but wait, ..." loop.
Fortunately, I had already started experimenting with Claude, so I opted to pay Anthropic more money for tokens (work already covers the bill; this was for personal use).
That whole experience, plus a noisy GPU, put me off the idea of running/building local agents.
Nice to score one of those.
I've used several agent frameworks, and they all support many different providers, from cloud to local; the provider and the agent logic are orthogonal responsibilities. I'm using Vertex AI for cloud and Ollama on a Minisforum box with ROCm locally, and there's a dropdown to switch between them.
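To illustrate why these are orthogonal: the agent code talks to one client interface, and the "dropdown" is just a lookup into a provider table. A sketch using Ollama's documented OpenAI-compatible endpoint for the local side; the cloud entry here is a placeholder (real Vertex AI auth and endpoints differ), and all model names are assumptions:

```python
from openai import OpenAI

# Provider registry: the agent logic never changes, only this table.
# "local" uses Ollama's OpenAI-compatible /v1 endpoint; "cloud" is a
# hypothetical placeholder, not a real Vertex AI configuration.
PROVIDERS = {
    "local": {"base_url": "http://localhost:11434/v1", "api_key": "ollama", "model": "qwen2.5:7b"},
    "cloud": {"base_url": "https://example-cloud-endpoint/v1", "api_key": "YOUR_KEY", "model": "some-hosted-model"},
}

def get_client(provider: str) -> tuple[OpenAI, str]:
    """Build a client for the chosen provider; the caller's code is provider-agnostic."""
    cfg = PROVIDERS[provider]
    return OpenAI(base_url=cfg["base_url"], api_key=cfg["api_key"]), cfg["model"]

client, model = get_client("local")  # the "dropdown": swap the string to change providers
reply = client.chat.completions.create(
    model=model,
    messages=[{"role": "user", "content": "Hello from the agent"}],
)
print(reply.choices[0].message.content)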
All you need is a used GPU slapped onto any disused DDR4 mobo. The new 5060s, the 16GB models, can do basically everything now.
If you mean that you can't just run the largest unquantized models, then that's indeed true.
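To make that concrete: a quantized mid-size model fits comfortably in 16GB of VRAM. A sketch with the ollama Python client; the exact tag is an assumption, so check the Ollama library for current ones:

```python
import ollama

# q4_K_M quantization brings an 8B model down to roughly 5GB,
# well within a 16GB card. Tag is an assumption; verify it exists.
MODEL = "llama3.1:8b-instruct-q4_K_M"

ollama.pull(MODEL)  # downloads the quantized weights if not already present
resp = ollama.chat(
    model=MODEL,
    messages=[{"role": "user", "content": "Say hi in one sentence."}],
)
print(resp["message"]["content"])
```

The largest frontier-scale models at full precision remain out of reach on consumer hardware, but that's a different claim from "local models are unusable."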