When I started looking into running large language models locally, I hit the same wall everyone does: VRAM. Consumer GPUs top out at 24GB, and that’s on an RTX 4090 that costs more than some used cars. I wanted to run 70B parameter models—not cloud-hosted, not quantized into oblivion—actually run them on hardware I own.

That’s when I discovered the AMD Instinct MI60.

## Datacenter Castoff, Homelab Treasure

The MI60 is a 2018 server GPU that AMD built for datacenters. It has 32GB of HBM2 memory—the same high-bandwidth memory you find in modern AI accelerators—and you can pick one up for around $500 on eBay. Two of them give you 64GB of VRAM, enough to run Llama 3.3 70B with room to spare.

There’s a catch, of course. These are passive-cooled cards designed for server chassis with serious airflow. Plug one into a regular PC case and it’ll thermal throttle within minutes. I ended up 3D printing a duct and running a push-pull configuration—a 120mm fan inside blowing air across the heatsinks, and a 92mm fan on the rear pulling hot air out. A custom fan controller script keeps the fans in sync with GPU utilization, maintaining junction temps around 80°C instead of the 97°C I saw before I figured out cooling.
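
The fan controller itself isn't anything fancy. Here's a minimal sketch of the idea in Python, keyed off GPU temperature rather than utilization for simplicity; it assumes `rocm-smi` is on the PATH and that the case fans sit on a motherboard header exposed as a hwmon PWM node. The sysfs path below is a placeholder—find yours with `ls /sys/class/hwmon/*/pwm*`.

```python
#!/usr/bin/env python3
"""Crude fan curve: map MI60 temperature to case-fan PWM duty cycle."""
import re
import subprocess
import time

PWM_PATH = "/sys/class/hwmon/hwmon2/pwm1"   # placeholder; varies per motherboard
POLL_SECONDS = 5

def hottest_gpu_temp_c() -> float:
    """Return the hottest temperature reported by rocm-smi, in Celsius."""
    out = subprocess.run(["rocm-smi", "--showtemp"],
                         capture_output=True, text=True, check=True).stdout
    temps = [float(t) for t in re.findall(r"\d+\.\d+", out)]
    return max(temps) if temps else 0.0

def temp_to_pwm(temp_c: float) -> int:
    """Linear curve: ~30% duty below 50C, 100% at 85C and above."""
    low_t, high_t = 50.0, 85.0
    low_pwm, high_pwm = 77, 255          # duty cycle on a 0-255 scale
    if temp_c <= low_t:
        return low_pwm
    if temp_c >= high_t:
        return high_pwm
    frac = (temp_c - low_t) / (high_t - low_t)
    return int(low_pwm + frac * (high_pwm - low_pwm))

def main() -> None:
    while True:
        pwm = temp_to_pwm(hottest_gpu_temp_c())
        # Needs root, and the header must be in manual mode (pwm1_enable = 1).
        with open(PWM_PATH, "w") as f:
            f.write(str(pwm))
        time.sleep(POLL_SECONDS)

if __name__ == "__main__":
    main()
```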

## Why Not Just Use NVIDIA?

Fair question. NVIDIA has better software support, more documentation, and CUDA is everywhere. But the MI60’s 32GB of HBM2 changes the math. An RTX 3090 has 24GB of GDDR6X and costs significantly more on the secondary market. The MI60 gives me more memory for less money, and for inference workloads, that memory matters more than raw compute throughput.
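
The memory arithmetic fits on a napkin. Here's a rough Python back-of-envelope—the 4.5 bits-per-weight figure is my own assumption to account for AWQ's scale and zero-point overhead, and real footprints vary with quantization settings and KV cache size:

```python
# Rough VRAM math for why 2x32GB matters more than raw FLOPS.
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight memory in GB."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

llama70b_awq = weights_gb(70, 4.5)   # ~4 bits plus quantization overhead
print(f"Llama 3.3 70B @ 4-bit AWQ: ~{llama70b_awq:.0f} GB of weights")
# ~39 GB: too big for any single 24GB or 32GB card, but comfortable
# across 2x32GB with room left over for KV cache and activations.
```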

The software story isn’t as bad as you might expect. ROCm (AMD’s answer to CUDA) works, though you need to stick with version 5.6—AMD dropped MI60 support in newer releases. Most of the popular inference frameworks have ROCm support now. I’ve been running Ollama for smaller models and vLLM when I need to run the big ones across both cards.
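
Once the drivers are happy, sanity-checking the stack is as easy as poking Ollama's local REST API. A minimal sketch, assuming Ollama is serving on its default port and you've already pulled a model that fits on one card (the model tag here is just an example):

```python
import json
import urllib.request

# Ollama listens on localhost:11434 by default; substitute whatever
# model you've actually pulled with `ollama pull`.
payload = {
    "model": "qwen3:8b",
    "prompt": "In one sentence, what is HBM2?",
    "stream": False,
}
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["response"])
```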

## What Can It Actually Do?

Here are some real numbers from my setup, running vLLM with AWQ-quantized models:

| Model | Tokens/sec | GPUs |
|---|---|---|
| Qwen3 8B | ~90 | 1 |
| Qwen3 32B | ~31 | 1 |
| Llama 3.3 70B | ~26 | 2 (tensor parallel) |

These are respectable numbers—the 8B and 32B models feel snappy for interactive use, and even the 70B is responsive enough for real work.

The 70B result is the interesting one. No single consumer GPU can run this model. You could buy two 4090s, but that’s $3,000+ for 48GB of combined VRAM, and since the 4090 dropped NVLink, all the tensor-parallel traffic between the cards has to squeeze through PCIe anyway. Two MI60s cost around $1,000 total, give you 64GB of HBM2, and actually run the model at usable speeds. You’ll need to solve the cooling problem, but that’s a one-time engineering exercise.
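
Nothing exotic is needed on the software side to use both cards, either. Here's a minimal sketch using vLLM's offline Python API, assuming a ROCm build of vLLM and a local AWQ checkpoint—the model path is a placeholder for whatever quantized 70B you have on disk:

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="/models/llama-3.3-70b-awq",   # placeholder path to an AWQ checkpoint
    quantization="awq",
    tensor_parallel_size=2,              # shard the weights across both MI60s
    gpu_memory_utilization=0.90,
)

params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Explain tensor parallelism in two sentences."], params)
print(outputs[0].outputs[0].text)
```

The `tensor_parallel_size=2` is the whole trick: vLLM splits the weights across the two cards and handles the cross-GPU communication for you.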

## This Is Just the Beginning

I’ll be writing more about this setup: the cooling solution I built, the software stack, how I switch between different model configurations, and the various dead ends I hit along the way. Spoiler: Stable Diffusion still locks up the GPU, and I haven’t gotten Whisper working yet.

If you’re looking to build affordable local AI infrastructure, old datacenter GPUs are worth considering. The MI60 isn’t the only option—there are MI50s, MI100s, and various NVIDIA Tesla cards floating around the secondary market. The key is finding the right balance of memory, compute, and software support for your use case.

For me, the MI60 hit the sweet spot.