How Much VRAM Do You Need to Run an LLM Locally?

Author · AI Local Check

The short answer

The main thing that decides whether you can run an AI model on your computer is how much memory your graphics card has. Bigger models need more memory, and compressing a model lets it fit in less, at some cost to quality. If a model is still too large, it can usually run anyway by spilling part of it onto your system memory, just more slowly. The table below shows what each model size typically requires.

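Rule of thumb: at Q4_K_M quantization, a model needs roughly 0.7 GB of VRAM per billion parameters for its weights, plus a little extra for the KV cache and system overhead. The estimate is conservative for larger models: the 70B total in the table below works out to about 0.6 GB per billion parameters, overhead included.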
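As a quick illustration, the rule of thumb can be written as a one-line estimate. This is a minimal sketch: the function name and the default of 0.7 GB per billion parameters are only the figure quoted above, not a measured value, and the totals in the table remain the better guide.

```python
def rule_of_thumb_weights_gb(params_billion: float, gb_per_billion: float = 0.7) -> float:
    """Rough Q4_K_M weight size in GB, given the parameter count in billions.

    The 0.7 GB-per-billion default is the article's rule of thumb (an approximation).
    KV cache and runtime overhead are NOT included here.
    """
    return params_billion * gb_per_billion


if __name__ == "__main__":
    for size in (1, 3, 8, 14, 32, 70):
        estimate = rule_of_thumb_weights_gb(size)
        print(f"{size}B: about {estimate:.1f} GB of weights")
```

Comparing these estimates with the table totals shows where the approximation drifts, which is why it is best used only for a first pass.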

Model size	Example model	VRAM needed (Q4_K_M)	Minimum GPU VRAM
1B	Llama-3.2-1B-Instruct	1.68 GB	4 GB
3B	Llama-3.2-3B-Instruct	3.12 GB	4 GB
8B	Meta-Llama-3.1-8B-Instruct	5.88 GB	6 GB
14B	Qwen2.5-14B-Instruct	9.92 GB	10 GB
32B	Qwen2.5-32B-Instruct	20.29 GB	24 GB
70B	Llama-3.3-70B-Instruct	41.65 GB	48 GB

Worked example: Meta-Llama-3.1-8B-Instruct at Q4_K_M needs about 5.88 GB, so it just fits on a 6 GB GPU. Llama-3.3-70B-Instruct needs about 41.65 GB, which calls for a 48 GB GPU.

How VRAM use is calculated

Total memory use comes from three parts: the model's weights, the KV cache that grows as the prompt or conversation gets longer, and a fixed overhead for the runtime. The weights dominate, but the KV cache matters a lot for long contexts.

Component	Example (Qwen2.5-7B, Q4_K_M, 4K ctx)
Model weights (real GGUF size)	4.36 GB
KV cache (grows with context)	0.22 GB
System / compute overhead	0.8 GB
Total VRAM needed	5.38 GB
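The arithmetic in the table above is a plain sum of the three components. A minimal sketch follows; the function name is illustrative, and the 0.8 GB default overhead is simply the value from this example, not a universal constant.

```python
def total_vram_gb(weights_gb: float, kv_cache_gb: float, overhead_gb: float = 0.8) -> float:
    """Total VRAM estimate: weights + KV cache + runtime overhead (all in GB).

    overhead_gb defaults to the 0.8 GB used in the Qwen2.5-7B example;
    real overhead depends on the runtime and GPU driver.
    """
    return weights_gb + kv_cache_gb + overhead_gb


if __name__ == "__main__":
    # Qwen2.5-7B, Q4_K_M, 4K context (values from the table above)
    total = total_vram_gb(weights_gb=4.36, kv_cache_gb=0.22)
    print(f"Total VRAM needed: {total:.2f} GB")
```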

The role of quantization

Quantization shrinks a model by storing its weights at lower precision. This cuts memory use dramatically, letting large models run on modest hardware. The trade-off is quality: the more aggressive the compression, the more answers can degrade, so the goal is the highest quality that still fits your memory.

Quantization	VRAM needed (Qwen2.5-7B, 4K context)	Quality
Q2_K	3.83 GB	Low
Q3_K_M	4.57 GB	Fair
Q4_K_M	5.38 GB	Good
Q5_K_M	6.09 GB	Very good
Q6_K	6.84 GB	Excellent
Q8_0	8.56 GB	Excellent
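Choosing "the highest quality that still fits" can be sketched as a lookup over the table above. The function name and the optional headroom parameter are hypothetical, and the figures are the table's values, which match the 5.38 GB Qwen2.5-7B example at Q4_K_M; other models will need their own numbers.

```python
from typing import Optional

# (quantization, VRAM needed in GB), copied from the table above,
# ordered from lowest to highest quality.
QUANT_VRAM_GB = [
    ("Q2_K", 3.83),
    ("Q3_K_M", 4.57),
    ("Q4_K_M", 5.38),
    ("Q5_K_M", 6.09),
    ("Q6_K", 6.84),
    ("Q8_0", 8.56),
]


def best_quant_that_fits(vram_gb: float, headroom_gb: float = 0.0) -> Optional[str]:
    """Return the highest-quality quantization whose VRAM need fits the budget.

    headroom_gb reserves memory for other uses (for example a longer context).
    Returns None when even the smallest quantization does not fit.
    """
    budget = vram_gb - headroom_gb
    best = None
    for name, needed in QUANT_VRAM_GB:
        if needed <= budget:
            best = name  # list is ordered by quality, so later matches win
    return best


if __name__ == "__main__":
    for gpu in (4, 6, 8, 12):
        print(f"{gpu} GB GPU: {best_quant_that_fits(gpu)}")
```

Passing a non-zero headroom is a simple way to leave room for a larger KV cache than the 4K context assumed in the table.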

What you can run on each GPU

The memory on your graphics card sets a hard ceiling on the size of model you can run fully on the GPU. Choosing hardware is really about deciding which models you want to run, then picking a card with enough memory for them.

GPU VRAM	Largest model (Q4_K_M)	Comfortable sizes
8 GB	8B	1B, 3B, 8B
12 GB	14B	1B, 3B, 8B, 14B
16 GB	14B	1B, 3B, 8B, 14B
24 GB	32B	1B, 3B, 8B, 14B, 32B
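The GPU table can be reproduced from the per-model VRAM figures given earlier in this article. The sketch below is illustrative only: the names are made up for the example, and it treats a model as fitting whenever its Q4_K_M total is at or below the card's memory, with no extra headroom.

```python
from typing import List

# Q4_K_M VRAM totals in GB, copied from the model-size table earlier in the article.
MODEL_VRAM_GB = {
    "1B": 1.68,
    "3B": 3.12,
    "8B": 5.88,
    "14B": 9.92,
    "32B": 20.29,
    "70B": 41.65,
}


def sizes_that_fit(gpu_vram_gb: float) -> List[str]:
    """List the model sizes whose Q4_K_M total fits entirely in GPU memory."""
    return [size for size, needed in MODEL_VRAM_GB.items() if needed <= gpu_vram_gb]


if __name__ == "__main__":
    for gpu in (8, 12, 16, 24):
        fits = sizes_that_fit(gpu)
        largest = fits[-1] if fits else "none"
        print(f"{gpu} GB: largest {largest}, fits {', '.join(fits)}")
```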

Context length and the KV cache

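The KV cache is the most variable part of memory use: it grows in proportion to the number of tokens the model holds in context. In the Qwen2.5-7B example above it takes 0.22 GB at a 4K context, and a longer context takes proportionally more. If you plan to work with long documents or conversations, leave extra memory headroom beyond the weights alone.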
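For a standard transformer, the KV cache size can be estimated from the model's architecture. The sketch below assumes a 16-bit cache (2 bytes per value) and uses architecture numbers for Qwen2.5-7B (28 layers, 4 KV heads, head dimension 128) that are assumptions not stated in this article; check the model's config before relying on them. The result is in binary gigabytes, which appears to be the unit behind the 0.22 GB figure.

```python
def kv_cache_gib(
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    context_tokens: int,
    bytes_per_value: int = 2,
) -> float:
    """Estimate KV cache size in GiB.

    Each layer stores one key and one value vector per token per KV head,
    hence the leading factor of 2. bytes_per_value=2 assumes a 16-bit cache.
    """
    total_bytes = 2 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per_value
    return total_bytes / 1024**3


if __name__ == "__main__":
    # Assumed Qwen2.5-7B architecture: 28 layers, 4 KV heads, head_dim 128.
    for ctx in (4096, 8192, 32768):
        print(f"{ctx} tokens: {kv_cache_gib(28, 4, 128, ctx):.2f} GiB")
```

Because the size is linear in `context_tokens`, doubling the context doubles the cache, so a context eight times longer than 4K needs roughly eight times the 0.22 GB shown in the example.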

When VRAM isn't enough: CPU offloading

When a model doesn't fit in your graphics card, part of it can be offloaded to your system's main memory. This lets you run models that would otherwise be too big, but the parts running on the CPU are much slower, so it is a trade-off between capability and speed.
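Runtimes that offload by layer (llama.cpp, for example, exposes this as `--n-gpu-layers`) need to know how many layers to keep on the GPU. The sketch below is a rough estimate under simplifying assumptions: every layer is the same size, the weights are split evenly across layers, and a fixed amount of VRAM is reserved for the KV cache and overhead. The function name and the 28-layer count for Qwen2.5-7B are assumptions for illustration.

```python
import math


def gpu_layers_that_fit(
    weights_gb: float,
    n_layers: int,
    gpu_vram_gb: float,
    reserved_gb: float,
) -> int:
    """Estimate how many layers fit on the GPU; the rest would run from system memory.

    Assumes equal-sized layers and a fixed reservation (KV cache + overhead).
    Real runtimes differ, so treat the result as a starting point to tune.
    """
    per_layer_gb = weights_gb / n_layers
    budget_gb = gpu_vram_gb - reserved_gb
    if budget_gb <= 0:
        return 0
    return min(n_layers, math.floor(budget_gb / per_layer_gb))


if __name__ == "__main__":
    # Qwen2.5-7B at Q4_K_M: 4.36 GB of weights; reserve 0.22 GB KV cache + 0.8 GB overhead.
    layers = gpu_layers_that_fit(weights_gb=4.36, n_layers=28, gpu_vram_gb=4.0, reserved_gb=1.02)
    print(f"Layers on a 4 GB GPU: {layers} of 28")
```

The fewer layers that fit, the more of each token's computation runs on the CPU, which is where the slowdown described above comes from.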

Frequently asked questions

How much VRAM do I need for a 7B model?

About 5.38 GB at Q4_K_M and a 4096-token context: 4.36 GB of weights, 0.22 GB of KV cache and 0.8 GB of overhead.

How much VRAM for a 70B model?

Around 41.65 GB at Q4_K_M, which means a 48 GB GPU.

Can an 8 GB GPU run an 8B model?

Yes. Meta-Llama-3.1-8B-Instruct at Q4_K_M needs about 5.88 GB, which fits in 8 GB.

Does a higher quantization need more VRAM?

Yes, if "higher" means more bits per weight (for example Q8_0 rather than Q4_K_M). More bits per weight means better quality but more memory; fewer bits saves memory at some cost to quality.
