How Much VRAM Do You Need to Run an LLM Locally?
Author · AI Local Check
The short answer
The main thing that decides whether you can run an AI model on your computer is how much memory your graphics card has (its VRAM). Bigger models need more memory, and compressing a model lets it fit in less, at some cost to quality. If a model is still too large, it can usually run anyway by spilling part of it into system RAM, just more slowly. The table below shows what each model size typically requires.
Rule of thumb: at Q4_K_M quantization, a model needs roughly 0.7 GB of VRAM per billion parameters for its weights, plus a little extra for the KV cache and system overhead.
| Model size | Example model | VRAM (Q4_K_M) | Min GPU |
|---|---|---|---|
| 1B | Llama-3.2-1B-Instruct | 1.68 GB | 4 GB |
| 3B | Llama-3.2-3B-Instruct | 3.12 GB | 4 GB |
| 8B | Meta-Llama-3.1-8B-Instruct | 5.88 GB | 6 GB |
| 14B | Qwen2.5-14B-Instruct | 9.92 GB | 10 GB |
| 32B | Qwen2.5-32B-Instruct | 20.29 GB | 24 GB |
| 70B | Llama-3.3-70B-Instruct | 41.65 GB | 48 GB |
Worked example: Meta-Llama-3.1-8B-Instruct at Q4_K_M needs about 5.88 GB, so the smallest GPU that fits it is a 6 GB card. Llama-3.3-70B-Instruct needs about 41.65 GB, which calls for a 48 GB GPU.
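The rule of thumb and the "Min GPU" column can be sketched in a few lines of Python. This is a minimal illustration, not a precise calculator: the 0.7 GB figure is the rule of thumb quoted above, while the list of GPU memory sizes and both function names are assumptions made for this example.

```python
from typing import Optional

# Rule-of-thumb figure quoted in the text (weights only, Q4_K_M).
GB_PER_BILLION_PARAMS_Q4 = 0.7

# Assumed list of common GPU memory sizes in GB, for illustration only.
COMMON_GPU_SIZES_GB = [4, 6, 8, 10, 12, 16, 24, 48, 80]


def estimate_q4_weights_gb(params_billions: float) -> float:
    """Rough weight size at Q4_K_M; excludes KV cache and overhead."""
    return params_billions * GB_PER_BILLION_PARAMS_Q4


def smallest_gpu_for(required_gb: float) -> Optional[int]:
    """Return the smallest listed GPU size that holds required_gb, or None."""
    for size_gb in COMMON_GPU_SIZES_GB:
        if size_gb >= required_gb:
            return size_gb
    return None


# Using the measured totals from the table rather than the rough estimate:
print(smallest_gpu_for(5.88))   # 6
print(smallest_gpu_for(41.65))  # 48
```

The estimate covers weights only, so for real planning the measured totals in the table are the safer input to `smallest_gpu_for`.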
How VRAM use is calculated
Total memory use comes from three parts: the model's weights, the KV cache that grows as the prompt or conversation gets longer, and a fixed overhead for the runtime. The weights dominate, but the KV cache matters a lot for long contexts.
| Component | Example (Qwen2.5-7B, Q4_K_M, 4K ctx) |
|---|---|
| Model weights (real GGUF size) | 4.36 GB |
| KV cache (grows with context) | 0.22 GB |
| System / compute overhead | 0.8 GB |
| Total VRAM needed | 5.38 GB |
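The breakdown above is a plain sum of three terms. As a small sketch, using the figures from the table (the class and field names are hypothetical, chosen for this example):

```python
from dataclasses import dataclass


@dataclass
class VramBreakdown:
    """Three-part memory estimate, all values in GB."""
    weights_gb: float
    kv_cache_gb: float
    overhead_gb: float

    @property
    def total_gb(self) -> float:
        # Total VRAM is the sum of the three components.
        return self.weights_gb + self.kv_cache_gb + self.overhead_gb


# Figures from the table: Qwen2.5-7B, Q4_K_M, 4K context.
qwen_7b = VramBreakdown(weights_gb=4.36, kv_cache_gb=0.22, overhead_gb=0.8)
print(f"{qwen_7b.total_gb:.2f} GB")  # 5.38 GB
```

Only the KV cache term changes with context length; the weights and overhead stay fixed for a given model and quantization.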
The role of quantization
Quantization shrinks a model by storing its weights at lower precision. This cuts memory use dramatically, letting large models run on modest hardware. The trade-off is quality: the more aggressive the compression, the more answers can degrade, so the goal is the highest quality that still fits your memory.
| Quantization | VRAM needed (Qwen2.5-7B, 4K ctx) | Quality |
|---|---|---|
| Q2_K | 3.83 GB | Low |
| Q3_K_M | 4.57 GB | Fair |
| Q4_K_M | 5.38 GB | Good |
| Q5_K_M | 6.09 GB | Very good |
| Q6_K | 6.84 GB | Excellent |
| Q8_0 | 8.56 GB | Excellent |
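Picking "the highest quality that still fits" can be expressed as a simple lookup over the table. The sketch below assumes the table's figures and treats its row order as a quality ranking from lowest to highest; the function name and the optional `headroom_gb` parameter are hypothetical.

```python
from typing import Optional

# (quantization, VRAM needed in GB), ordered from lowest to highest quality,
# copied from the table above.
QUANT_VRAM_GB = [
    ("Q2_K", 3.83),
    ("Q3_K_M", 4.57),
    ("Q4_K_M", 5.38),
    ("Q5_K_M", 6.09),
    ("Q6_K", 6.84),
    ("Q8_0", 8.56),
]


def best_quant_that_fits(vram_gb: float, headroom_gb: float = 0.0) -> Optional[str]:
    """Return the highest-quality quantization that fits, or None if none do."""
    budget_gb = vram_gb - headroom_gb
    best = None
    for name, needed_gb in QUANT_VRAM_GB:
        if needed_gb <= budget_gb:
            best = name  # later rows are higher quality, so keep overwriting
    return best


print(best_quant_that_fits(6))   # Q4_K_M
print(best_quant_that_fits(8))   # Q6_K
print(best_quant_that_fits(12))  # Q8_0
```

Setting `headroom_gb` reserves memory for other applications or a longer context before the choice is made.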
What you can run on each GPU
The memory on your graphics card sets a hard ceiling on the size of model you can run fully on the GPU. Choosing hardware is really about deciding which models you want to run, then picking a card with enough memory for them.
| GPU VRAM | Largest model (Q4) | Comfortable sizes |
|---|---|---|
| 8 GB | 8B | 1B, 3B, 8B |
| 12 GB | 14B | 1B, 3B, 8B, 14B |
| 16 GB | 14B | 1B, 3B, 8B, 14B |
| 24 GB | 32B | 1B, 3B, 8B, 14B, 32B |
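The rows above follow from comparing each card's memory with the Q4_K_M requirements listed in the first table. A minimal sketch of that comparison, with a hypothetical function name and the table's figures as the only data:

```python
# Q4_K_M VRAM requirements in GB, copied from the first table.
Q4_VRAM_GB = {
    "1B": 1.68,
    "3B": 3.12,
    "8B": 5.88,
    "14B": 9.92,
    "32B": 20.29,
    "70B": 41.65,
}


def sizes_that_fit(gpu_vram_gb: float) -> list:
    """List the model sizes whose Q4_K_M requirement fits in gpu_vram_gb."""
    return [size for size, needed_gb in Q4_VRAM_GB.items() if needed_gb <= gpu_vram_gb]


for gpu_gb in (8, 12, 16, 24):
    print(gpu_gb, sizes_that_fit(gpu_gb))
```

This is a strict fit check with no headroom, so a model that only just fits may still struggle with long contexts.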
Context length and the KV cache
The KV cache is the most variable part of memory use: the more text the model handles at once, the larger it grows. If you plan to work with long documents or conversations, leave extra memory headroom beyond the weights alone.
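To see why the cache grows with context, here is a sketch of the usual estimate for a transformer: one key and one value vector per layer, per KV head, per token. The architecture numbers used for Qwen2.5-7B (28 layers, 4 KV heads, head dimension 128) and the 16-bit cache precision are assumptions for this example, not figures from this article; check the model's own configuration before relying on them.

```python
GIB = 1024 ** 3


def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_tokens: int, bytes_per_value: int = 2) -> int:
    """Estimated KV cache size in bytes (factor 2 = one key + one value)."""
    return 2 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per_value


# Assumed Qwen2.5-7B shape: 28 layers, 4 KV heads, head_dim 128, 16-bit cache.
size_4k = kv_cache_bytes(n_layers=28, n_kv_heads=4, head_dim=128, context_tokens=4096)
print(f"{size_4k / GIB:.2f} GiB")  # 0.22 GiB

# The cache scales linearly with context: 8x the tokens, 8x the memory.
size_32k = kv_cache_bytes(n_layers=28, n_kv_heads=4, head_dim=128, context_tokens=32768)
print(f"{size_32k / GIB:.2f} GiB")  # 1.75 GiB
```

Under these assumptions the 4K result lines up with the 0.22 GB KV cache figure quoted earlier, and a 32K context would need roughly 1.5 GB more than that.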
When VRAM isn't enough: CPU offloading
When a model doesn't fit in your graphics card, part of it can be offloaded to your system's main memory. This lets you run models that would otherwise be too big, but the parts running on the CPU are much slower, so it is a trade-off between capability and speed.
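Offloading is usually done per layer: some layers stay on the GPU and the rest run from system RAM. Runtimes such as llama.cpp expose this as a layer count (its `--n-gpu-layers` option). The sketch below estimates how many layers fit, under the simplifying assumption that the weights are spread evenly across layers; the function name is hypothetical and the 28-layer count for Qwen2.5-7B is an assumption.

```python
import math


def gpu_layers_that_fit(weights_gb: float, n_layers: int,
                        vram_gb: float, reserved_gb: float) -> int:
    """Estimate how many layers fit on the GPU.

    reserved_gb is memory set aside for the KV cache and runtime overhead.
    Assumes every layer is the same size, which is only approximately true.
    """
    budget_gb = vram_gb - reserved_gb
    if budget_gb <= 0:
        return 0
    per_layer_gb = weights_gb / n_layers
    return min(n_layers, math.floor(budget_gb / per_layer_gb))


# Qwen2.5-7B at Q4_K_M (4.36 GB of weights, assumed 28 layers) on a 4 GB GPU,
# reserving 0.22 GB for the KV cache and 0.8 GB for overhead.
print(gpu_layers_that_fit(4.36, 28, vram_gb=4, reserved_gb=1.02))  # 19
```

The remaining layers would run on the CPU, which is where the slowdown comes from.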
Frequently asked questions
How much VRAM do I need for a 7B model?
About 5.38 GB for Qwen2.5-7B at Q4_K_M with a 4096-token context: 4.36 GB of weights, 0.22 GB of KV cache, and 0.8 GB of overhead.
How much VRAM for a 70B model?
Around 41.65 GB for Llama-3.3-70B-Instruct at Q4_K_M, which calls for a 48 GB GPU.
Can an 8 GB GPU run an 8B model?
Yes. Meta-Llama-3.1-8B-Instruct at Q4_K_M needs about 5.88 GB, which fits in 8 GB.
Does a higher quantization need more VRAM?
Yes, if "higher" means more bits per weight (for example Q8_0 rather than Q4_K_M). More bits per weight gives better quality but uses more memory; fewer bits save memory at some cost to quality.