How Much VRAM Do You Need to Run an LLM Locally?


By Lefi Abdelmonem

Author · AI Local Check

The short answer

The main thing that decides whether you can run an AI model on your computer is how much memory your graphics card has (its VRAM). Bigger models need more memory, and compressing a model (quantization) lets it fit in less, at some cost to quality. If a model is still too large, it can usually run anyway by spilling part of it onto your system memory, just more slowly. The table below shows what each model size typically requires.

Rule of thumb: at Q4_K_M quantization, a model needs roughly 0.7 GB of VRAM per billion parameters for its weights, plus extra for the KV cache and system overhead (about 1 GB in the 7B example further down).

| Model size | Example model | VRAM (Q4_K_M) | Min GPU |
| --- | --- | --- | --- |
| 1B | Llama-3.2-1B-Instruct | 1.68 GB | 4 GB |
| 3B | Llama-3.2-3B-Instruct | 3.12 GB | 4 GB |
| 8B | Meta-Llama-3.1-8B-Instruct | 5.88 GB | 6 GB |
| 14B | Qwen2.5-14B-Instruct | 9.92 GB | 10 GB |
| 32B | Qwen2.5-32B-Instruct | 20.29 GB | 24 GB |
| 70B | Llama-3.3-70B-Instruct | 41.65 GB | 48 GB |

Worked example: Meta-Llama-3.1-8B-Instruct at Q4_K_M needs about 5.88 GB, which fits a 6 GB GPU with little room to spare. Llama-3.3-70B-Instruct needs about 41.65 GB, which calls for a 48 GB GPU.
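The rule of thumb can be sketched as a few lines of Python. This is only an illustration: the function name and constant are hypothetical, and the 0.7 GB figure is the article's approximation for Q4_K_M weights, not a measured value.

```python
# Rule-of-thumb figure from the article: about 0.7 GB per billion parameters
# at Q4_K_M. This is an approximation, not a measurement.
GB_PER_BILLION_PARAMS_Q4_K_M = 0.7


def estimate_weight_vram_gb(params_billion: float) -> float:
    """Rough weight footprint at Q4_K_M; excludes KV cache and overhead."""
    if params_billion <= 0:
        raise ValueError("params_billion must be positive")
    return params_billion * GB_PER_BILLION_PARAMS_Q4_K_M


# Print the estimate for the model sizes listed in the table.
for size in (1, 3, 8, 14, 32, 70):
    print(f"{size}B: ~{estimate_weight_vram_gb(size):.1f} GB of weights")
```

The estimate is coarse. For a 70B model it gives about 49 GB, while the table lists 41.65 GB, so prefer the real file size of the model you plan to download whenever it is available.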

How VRAM use is calculated

Total memory use comes from three parts: the model's weights, the KV cache that grows as the prompt or conversation gets longer, and a fixed overhead for the runtime. The weights dominate, but the KV cache matters a lot for long contexts.

| Component | Example (Qwen2.5-7B, Q4_K_M, 4K ctx) |
| --- | --- |
| Model weights (real GGUF size) | 4.36 GB |
| KV cache (grows with context) | 0.22 GB |
| System / compute overhead | 0.8 GB |
| Total VRAM needed | 5.38 GB |
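The three-part breakdown is plain addition, which a short sketch makes explicit. The class and field names are hypothetical; the numbers are the Qwen2.5-7B figures from this section.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VramBreakdown:
    """The three components of VRAM use described above, in GB."""
    weights_gb: float
    kv_cache_gb: float
    overhead_gb: float

    @property
    def total_gb(self) -> float:
        return self.weights_gb + self.kv_cache_gb + self.overhead_gb


# Qwen2.5-7B at Q4_K_M with a 4K context, using the figures from this section.
qwen_7b = VramBreakdown(weights_gb=4.36, kv_cache_gb=0.22, overhead_gb=0.8)
print(f"Total: {qwen_7b.total_gb:.2f} GB")  # prints "Total: 5.38 GB"
```

Keeping the components separate makes it easy to swap in a larger KV cache figure for a longer context without touching the weights.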

The role of quantization

Quantization shrinks a model by storing its weights at lower precision. This cuts memory use dramatically, letting large models run on modest hardware. The trade-off is quality: the more aggressive the compression, the more answers can degrade, so the goal is the highest quality that still fits your memory.

| Quantization | VRAM needed | Quality |
| --- | --- | --- |
| Q2_K | 3.83 GB | Low |
| Q3_K_M | 4.57 GB | Fair |
| Q4_K_M | 5.38 GB | Good |
| Q5_K_M | 6.09 GB | Very good |
| Q6_K | 6.84 GB | Excellent |
| Q8_0 | 8.56 GB | Excellent |
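Picking "the highest quality that still fits" can be sketched as a simple lookup over the figures in this section. The function name and the optional headroom parameter are hypothetical, and the VRAM values apply only to the model these figures were measured for.

```python
from typing import Optional

# (quantization, VRAM needed in GB) from this section,
# ordered from lowest to highest precision.
QUANT_VRAM_GB = [
    ("Q2_K", 3.83),
    ("Q3_K_M", 4.57),
    ("Q4_K_M", 5.38),
    ("Q5_K_M", 6.09),
    ("Q6_K", 6.84),
    ("Q8_0", 8.56),
]


def best_quant_for(vram_gb: float, headroom_gb: float = 0.0) -> Optional[str]:
    """Return the highest-precision quantization that fits, or None."""
    budget = vram_gb - headroom_gb
    fitting = [name for name, need in QUANT_VRAM_GB if need <= budget]
    return fitting[-1] if fitting else None


print(best_quant_for(6))                   # Q4_K_M
print(best_quant_for(8))                   # Q6_K
print(best_quant_for(8, headroom_gb=1.5))  # Q5_K_M
```

The headroom argument is a way to reserve memory for a longer context or for other programs that use the GPU.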

What you can run on each GPU

The memory on your graphics card sets a hard ceiling on the size of model you can run fully on the GPU. Choosing hardware is really about deciding which models you want to run, then picking a card with enough memory for them.

| GPU VRAM | Largest model (Q4) | Comfortable sizes |
| --- | --- | --- |
| 8 GB | 8B | 1B, 3B, 8B |
| 12 GB | 14B | 1B, 3B, 8B, 14B |
| 16 GB | 14B | 1B, 3B, 8B, 14B |
| 24 GB | 32B | 1B, 3B, 8B, 14B, 32B |
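The GPU figures above follow from comparing each card's memory with the Q4_K_M requirements listed in the short answer. A minimal sketch of that comparison, with a hypothetical function name and no extra headroom assumed:

```python
# Q4_K_M VRAM requirement in GB per model size, from the first table.
MODEL_VRAM_GB_Q4 = {
    "1B": 1.68,
    "3B": 3.12,
    "8B": 5.88,
    "14B": 9.92,
    "32B": 20.29,
    "70B": 41.65,
}


def models_that_fit(gpu_vram_gb: float) -> list[str]:
    """Model sizes whose Q4_K_M requirement fits fully in the given VRAM."""
    return [size for size, need in MODEL_VRAM_GB_Q4.items() if need <= gpu_vram_gb]


for vram in (8, 12, 16, 24):
    print(f"{vram} GB: {', '.join(models_that_fit(vram))}")
```

The last entry of each printed list is the largest model for that card, for example 8B on an 8 GB GPU and 32B on a 24 GB GPU.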


Context length and the KV cache

The KV cache is the most variable part of memory use: the more text the model handles at once, the larger it grows. If you plan to work with long documents or conversations, leave extra memory headroom beyond the weights alone.
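To see how the cache scales with context, here is a sketch of the usual size estimate for a transformer with grouped-query attention: two tensors (keys and values) per layer, each holding one vector per KV head per token. The architecture numbers used for Qwen2.5-7B (28 layers, 4 KV heads, head dimension 128) and the 16-bit cache are assumptions for illustration; check the model's own configuration before relying on them.

```python
def kv_cache_gib(
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    context_tokens: int,
    bytes_per_value: int = 2,  # assumes a 16-bit (fp16) cache
) -> float:
    """Estimated KV cache size in binary gigabytes (1024**3 bytes)."""
    # Factor of 2: one tensor for keys and one for values in every layer.
    total_bytes = (
        2 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per_value
    )
    return total_bytes / 1024**3


# Assumed Qwen2.5-7B shape: 28 layers, 4 KV heads, head dimension 128.
for ctx in (4096, 32768):
    print(f"{ctx} tokens: {kv_cache_gib(28, 4, 128, ctx):.2f} GiB")
```

Under these assumptions a 4096-token context comes to about 0.22, in line with the figure quoted for Qwen2.5-7B, and the cache grows linearly from there: eight times the context needs eight times the memory.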

When VRAM isn't enough: CPU offloading

When a model doesn't fit in your graphics card's memory, part of it can be offloaded to your system's main memory. This lets you run models that would otherwise be too big, but the parts running on the CPU are much slower, so it is a trade-off between capability and speed.
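Offloading is typically done layer by layer; llama.cpp, for instance, exposes this as its `--n-gpu-layers` option. A rough way to guess how many layers can stay on the GPU is sketched below. It assumes every layer is the same size and reserves a fixed amount for the KV cache and overhead. The function name, the reserve value, and the 40 GB / 80-layer example are hypothetical round numbers, not measurements.

```python
import math


def gpu_layers_that_fit(
    weights_gb: float,
    n_layers: int,
    vram_gb: float,
    reserve_gb: float = 1.0,  # assumed allowance for KV cache and overhead
) -> int:
    """Estimate how many layers fit on the GPU, assuming equal-sized layers."""
    per_layer_gb = weights_gb / n_layers
    usable_gb = max(vram_gb - reserve_gb, 0.0)
    return min(n_layers, math.floor(usable_gb / per_layer_gb))


# Hypothetical example: 40 GB of weights in 80 layers on a 24 GB card,
# keeping 2 GB free for the KV cache and overhead.
print(gpu_layers_that_fit(40.0, 80, 24.0, reserve_gb=2.0))  # 44
```

The remaining layers run on the CPU, which is where the slowdown comes from. Real layers are not all the same size, so treat the result as a starting point and adjust it if the runtime reports that it is out of memory.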

Frequently asked questions

How much VRAM do I need for a 7B model?

About 5.38 GB for Qwen2.5-7B at Q4_K_M with a 4096-token context: 4.36 GB of weights, 0.22 GB of KV cache and 0.8 GB of overhead.

How much VRAM for a 70B model?

Around 41.65 GB at Q4_K_M, which means a 48 GB GPU.

Can an 8 GB GPU run an 8B model?

Yes. Meta-Llama-3.1-8B-Instruct at Q4_K_M needs about 5.88 GB, which fits in 8 GB.

Does a higher-bit quantization need more VRAM?

Yes. More bits per weight means better quality but more memory; fewer bits per weight save memory at some quality cost. In the quantization table above, Q8_0 needs 8.56 GB where Q4_K_M needs 5.38 GB.