Curious to hear more. My experience is limited to llama.cpp on Apple silicon so far, but I've been eyeing the AMD ecosystem from afar.
craftkiller 21 hours ago | parent | on: 47758633
FWIW I run llama.cpp on AMD hardware using Vulkan. I've got no complaints but also nothing else to compare against.
nevi-me 22 hours ago | parent | on: 47758633
Perhaps not a good example: I tried running local models a few times, to much disappointment (it actually made me skeptical of LLMs in general for a while).

My last experiment in January was trying to run a Qwen model locally (RTX 4080; 128GB RAM; 9950X3D). I must have been doing it extremely wrong because the models that I tried either hallucinated severely or got stuck in a loop. The funniest one was stuck in a "but wait, ..." loop.

I had fortunately started experimenting with Claude, so I opted to pay Anthropic more money for tokens (work already covers the bill; this was for personal use).

That whole experience, plus a noisy GPU, put me off the idea of running/building local agents.

buryat 22 hours ago | parent | on: 47758717
I have a Mac Studio with 512GB RAM and have run models of different sizes to test how capable local agents are. I agree that local models aren't there yet, though that depends on how much knowledge you need to answer your question, and I think it should be possible to distill or train a smaller model that works on a subset of knowledge tailored toward local execution. My main interest is reducing latency, and local agents running at high speed feel like the answer, but it's not something anyone is trying to solve yet. If I could get a smaller model running at incredible speed locally, that could unlock some interesting auto-researching.
robwwilliams 20 hours ago | parent | on: 47759168
Also running gemma-4 on an Apple M5 Max. As fast as or faster than Opus 4.6 extended, though of course not the same competence. Great tunability with llama.cpp, however, and no issues related to IP leakage.
musicale 17 hours ago | parent | on: 47759168
> Mac Studio with 512GB RAM

Nice to score one of those.

verdverm 22 hours ago | parent | on: 47759168
I've been running Gemma4; my initial experiments put it around gemini-3-flash level (vibe evals).
lostmsu 20 hours ago | parent | on: 47758717
I hope you are not running models quantized below Q8, and ideally Q8 quants directly from the vendor.
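Quantization level largely determines whether a model fits in memory at all, which is why people end up below Q8 in the first place. A rough sketch of the trade-off, using approximate bits-per-weight figures for common llama.cpp formats (block-quant formats carry per-block scale overhead, so these numbers are ballpark assumptions, not exact):

```python
# Rough estimate of weight size (GB) at common llama.cpp quantization
# levels. Bits-per-weight values are approximations; KV cache and
# activations are not included.
BITS_PER_WEIGHT = {
    "F16": 16.0,    # unquantized half precision
    "Q8_0": 8.5,    # ~8 bits plus block-scale overhead
    "Q4_K_M": 4.8,  # ~4-bit k-quant, popular for tight VRAM budgets
}

def weight_gb(params_billions: float, quant: str) -> float:
    """Approximate size in GB of the model weights alone."""
    bits = BITS_PER_WEIGHT[quant]
    return params_billions * 1e9 * bits / 8 / 1e9

for quant in BITS_PER_WEIGHT:
    print(f"32B model at {quant}: ~{weight_gb(32, quant):.1f} GB")
```

So a 32B model is roughly 64 GB of weights at F16 but around 34 GB at Q8_0, which is the difference between fitting on a 48 GB setup or not; aggressive 4-bit quants shrink it further at a quality cost, which may explain some of the looping and hallucination reported above.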
verdverm 22 hours ago | parent | on: 47758633
The main thing to consider is that how you run the models does not need to be coupled to what you send the models (or how you orchestrate agents).

I've used several agent frameworks, and they all support many different providers, from cloud to local. These are orthogonal responsibilities. I'm using VertexAI for cloud and ollama on a Minisforum with ROCm locally; there is a dropdown to switch between them.
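The decoupling described above can be sketched in a few lines: since most local servers (ollama's default port is 11434) expose an OpenAI-style chat endpoint, switching between cloud and local is just a change of base URL and model name. The URLs and model names below are illustrative placeholders, not a specific framework's API:

```python
# Minimal sketch of separating "how models run" from "what you send them":
# every backend is addressed through the same chat-request shape, selected
# by a single provider key (like the dropdown described above).
from dataclasses import dataclass

@dataclass(frozen=True)
class Provider:
    base_url: str
    model: str

# Hypothetical registry; only the local port (ollama's default) is real.
PROVIDERS = {
    "cloud": Provider("https://example-cloud-endpoint/v1", "big-model"),
    "local": Provider("http://localhost:11434/v1", "gemma"),
}

def chat_request(provider_name: str, prompt: str) -> dict:
    """Build a provider-agnostic chat payload; only the endpoint differs."""
    p = PROVIDERS[provider_name]
    return {
        "url": f"{p.base_url}/chat/completions",
        "json": {
            "model": p.model,
            "messages": [{"role": "user", "content": prompt}],
        },
    }

# Swapping backends is a one-key change; the orchestration code is untouched.
req = chat_request("local", "hello")
```

The design point is that the agent loop only ever sees `chat_request`, so moving a workload from a cloud provider to a local GPU (or back) never touches the orchestration logic.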