When I started looking into running large language models locally, I hit the same wall everyone does: VRAM. Consumer GPUs top out at 24GB, and that’s on an RTX 4090 that costs more than some used cars. I wanted to run 70B parameter models—not cloud-hosted, not quantized into oblivion—actually run them on hardware I own.

That’s when I discovered the AMD Instinct MI60.

## Datacenter Castoff, Homelab Treasure

The MI60 is a 2018 server GPU that AMD built for datacenters. It has 32GB of HBM2 memory—the same high-bandwidth memory you find in modern AI accelerators—and you can pick one up for around $500 on eBay. Two of them give you 64GB of VRAM, enough to run Llama 3.3 70B with room to spare.

There’s a catch, of course. These are passive-cooled cards designed for server chassis with serious airflow. Plug one into a regular PC case and it’ll thermal throttle within minutes. I ended up 3D printing a duct and running a push-pull configuration—a 120mm fan inside blowing air across the heatsinks, and a 92mm fan on the rear pulling hot air out. A custom fan controller script keeps the fans in sync with GPU utilization, maintaining junction temps around 80°C instead of the 97°C I saw before I figured out cooling.
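
The fan controller itself isn't anything fancy. Here's a minimal sketch of the idea in Python, keyed off GPU temperature rather than utilization for simplicity; it assumes `rocm-smi` is on the PATH and that the case fans sit on a motherboard header exposed as a hwmon PWM node. The sysfs path below is a placeholder—find yours with `ls /sys/class/hwmon/*/pwm*`.

```python
#!/usr/bin/env python3
"""Crude fan curve: map MI60 temperature to case-fan PWM duty cycle."""
import re
import subprocess
import time

PWM_PATH = "/sys/class/hwmon/hwmon2/pwm1"   # placeholder; varies per motherboard
POLL_SECONDS = 5

def hottest_gpu_temp_c() -> float:
    """Return the hottest temperature reported by rocm-smi, in Celsius."""
    out = subprocess.run(["rocm-smi", "--showtemp"],
                         capture_output=True, text=True, check=True).stdout
    temps = [float(t) for t in re.findall(r"\d+\.\d+", out)]
    return max(temps) if temps else 0.0

def temp_to_pwm(temp_c: float) -> int:
    """Linear curve: ~30% duty below 50C, 100% at 85C and above."""
    low_t, high_t = 50.0, 85.0
    low_pwm, high_pwm = 77, 255          # duty cycle on a 0-255 scale
    if temp_c <= low_t:
        return low_pwm
    if temp_c >= high_t:
        return high_pwm
    frac = (temp_c - low_t) / (high_t - low_t)
    return int(low_pwm + frac * (high_pwm - low_pwm))

def main() -> None:
    while True:
        pwm = temp_to_pwm(hottest_gpu_temp_c())
        # Needs root, and the header must be in manual mode (pwm1_enable = 1).
        with open(PWM_PATH, "w") as f:
            f.write(str(pwm))
        time.sleep(POLL_SECONDS)

if __name__ == "__main__":
    main()
```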

## Why Not Just Use NVIDIA?

Fair question. NVIDIA has better software support, more documentation, and CUDA is everywhere. But the MI60’s 32GB of HBM2 changes the math. An RTX 3090 has 24GB of GDDR6X and costs significantly more on the secondary market. The MI60 gives me more memory for less money, and for inference workloads, that memory matters more than raw compute throughput.
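
The memory arithmetic fits on a napkin. Here's a rough Python back-of-envelope—the 4.5 bits-per-weight figure is my own assumption to account for AWQ's scale and zero-point overhead, and real footprints vary with quantization settings and KV cache size:

```python
# Rough VRAM math for why 2x32GB matters more than raw FLOPS.
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight memory in GB."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

llama70b_awq = weights_gb(70, 4.5)   # ~4 bits plus quantization overhead
print(f"Llama 3.3 70B @ 4-bit AWQ: ~{llama70b_awq:.0f} GB of weights")
# ~39 GB: too big for any single 24GB or 32GB card, but comfortable
# across 2x32GB with room left over for KV cache and activations.
```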

The software story isn’t as bad as you might expect. ROCm (AMD’s answer to CUDA) works, though you need to stick with version 5.6—AMD dropped MI60 support in newer releases. Most of the popular inference frameworks have ROCm support now. I’ve been running Ollama for smaller models and vLLM when I need to run the big ones across both cards.
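
Once the drivers are happy, sanity-checking the stack is as easy as poking Ollama's local REST API. A minimal sketch, assuming Ollama is serving on its default port and you've already pulled a model that fits on one card (the model tag here is just an example):

```python
import json
import urllib.request

# Ollama listens on localhost:11434 by default; substitute whatever
# model you've actually pulled with `ollama pull`.
payload = {
    "model": "qwen3:8b",
    "prompt": "In one sentence, what is HBM2?",
    "stream": False,
}
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["response"])
```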

## What Can It Actually Do?

Here are some real numbers from my setup, running vLLM with AWQ-quantized models:

| Model | Tokens/sec | GPUs |
|---|---|---|
| Qwen3 8B | ~90 | 1 |
| Qwen3 32B | ~31 | 1 |
| Llama 3.3 70B | ~26 | 2 (tensor parallel) |

These are respectable numbers—the 8B and 32B models feel snappy for interactive use, and even the 70B is responsive enough for real work.

The 70B result is the interesting one. No single consumer GPU can run this model. You could buy two 4090s, but that’s $3,000+ for 48GB of combined VRAM, and since the 4090 dropped NVLink, all the tensor-parallel traffic between the cards has to squeeze through PCIe anyway. Two MI60s cost around $1,000 total, give you 64GB of HBM2, and actually run the model at usable speeds. You’ll need to solve the cooling problem, but that’s a one-time engineering exercise.
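
Nothing exotic is needed on the software side to use both cards, either. Here's a minimal sketch using vLLM's offline Python API, assuming a ROCm build of vLLM and a local AWQ checkpoint—the model path is a placeholder for whatever quantized 70B you have on disk:

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="/models/llama-3.3-70b-awq",   # placeholder path to an AWQ checkpoint
    quantization="awq",
    tensor_parallel_size=2,              # shard the weights across both MI60s
    gpu_memory_utilization=0.90,
)

params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Explain tensor parallelism in two sentences."], params)
print(outputs[0].outputs[0].text)
```

The `tensor_parallel_size=2` is the whole trick: vLLM splits the weights across the two cards and handles the cross-GPU communication for you.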

## This Is Just the Beginning

I’ll be writing more about this setup: the cooling solution I built, the software stack, how I switch between different model configurations, and the various dead ends I hit along the way. Spoiler: Stable Diffusion still locks up the GPU, and I haven’t gotten Whisper working yet.

If you’re looking to build affordable local AI infrastructure, old datacenter GPUs are worth considering. The MI60 isn’t the only option—there are MI50s, MI100s, and various NVIDIA Tesla cards floating around the secondary market. The key is finding the right balance of memory, compute, and software support for your use case.

For me, the MI60 hit the sweet spot.