Run bartowski/Llama-3.2-3B-Instruct-GGUF locally
Llama-3.2-3B-Instruct is a 3.21-billion-parameter model from the Llama 3.2 family, tuned for instruction following. It is released under the llama3.2 license and has a knowledge cutoff of December 2023. It is aimed at applications that need structured responses and contextual understanding.
To run bartowski/Llama-3.2-3B-Instruct-GGUF locally at a 4,096-token context, the available GGUF files need between 2.73 GB (IQ3_M, lowest quality) and 7.23 GB (F16, highest quality) of memory. These figures include the weights, the KV cache, and a system margin.
The highest-quality file, F16, needs about 7.23 GB, so it runs fully on the GPU with 8 GB of VRAM or more. On a 6 GB GPU, Q8_0 (about 4.42 GB) is the largest option that fits entirely in VRAM.
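The totals quoted above can be approximated from three terms: weights, KV cache, and margin. The sketch below is illustrative only. It assumes the page's "GB" figures behave like GiB (2^30 bytes) and that the system margin is a flat 0.8 GB; both assumptions are inferred from the table's numbers rather than stated by the source, and the function names are hypothetical.

```python
GIB = 2**30  # assumption: the page's "GB" values are really GiB

def weights_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the stored weights in GiB."""
    return n_params * bits_per_weight / 8 / GIB

def total_gib(weights: float, kv_cache: float, margin: float = 0.8) -> float:
    """Weights + KV cache + a flat system margin (0.8 is inferred, not documented)."""
    return weights + kv_cache + margin

N_PARAMS = 3.21e9  # parameter count quoted in the text

# Bits-per-weight values are copied from the table; 0.44 is its KV cache column.
for name, bits in [("IQ3_M", 3.98), ("Q4_K_M", 5.03), ("F16", 16.02)]:
    w = weights_gib(N_PARAMS, bits)
    print(f"{name}: weights {w:.2f} GB, total {total_gib(w, 0.44):.2f} GB")
```

Because the parameter count is rounded to 3.21 billion, results can differ from the table by about 0.01 GB.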
All quantizations
| Quantization | Bits per weight | Quality | Weights | KV cache | Total | Speed (est.) | Verdict |
|---|---|---|---|---|---|---|---|
| IQ3_M | 3.98 | Fair | 1.49 GB | 0.44 GB | 2.73 GB | 268.5 t/s | Fits in VRAM |
| Q3_K_L | 4.52 | Good | 1.69 GB | 0.44 GB | 2.93 GB | 236.6 t/s | Fits in VRAM |
| IQ4_XS | 4.55 | Good | 1.7 GB | 0.44 GB | 2.94 GB | 234.8 t/s | Fits in VRAM |
| Q3_K_XL | 4.76 | Good | 1.78 GB | 0.44 GB | 3.02 GB | 224.8 t/s | Fits in VRAM |
| Q4_0_4_4 | 4.77 | Good | 1.79 GB | 0.44 GB | 3.02 GB | 224.0 t/s | Fits in VRAM |
| Q4_0_4_8 | 4.77 | Good | 1.79 GB | 0.44 GB | 3.02 GB | 224.0 t/s | Fits in VRAM |
| Q4_0_8_8 | 4.77 | Good | 1.79 GB | 0.44 GB | 3.02 GB | 224.0 t/s | Fits in VRAM |
| Q4_0 | 4.79 | Good | 1.79 GB | 0.44 GB | 3.03 GB | 223.5 t/s | Fits in VRAM |
| Q4_K_S | 4.8 | Good | 1.8 GB | 0.44 GB | 3.03 GB | 222.7 t/s | Fits in VRAM |
| Q4_K_M | 5.03 | Very good | 1.88 GB | 0.44 GB | 3.12 GB | 212.7 t/s | Fits in VRAM |
| Q4_K_L | 5.27 | Very good | 1.97 GB | 0.44 GB | 3.21 GB | 203.1 t/s | Fits in VRAM |
| Q5_K_S | 5.65 | Very good | 2.11 GB | 0.44 GB | 3.35 GB | 189.2 t/s | Fits in VRAM |
| Q5_K_M | 5.78 | Very good | 2.16 GB | 0.44 GB | 3.4 GB | 185.0 t/s | Fits in VRAM |
| Q5_K_L | 6.02 | Very good | 2.25 GB | 0.44 GB | 3.49 GB | 177.7 t/s | Fits in VRAM |
| Q6_K | 6.58 | Excellent | 2.46 GB | 0.44 GB | 3.7 GB | 162.5 t/s | Fits in VRAM |
| Q6_K_L | 6.82 | Excellent | 2.55 GB | 0.44 GB | 3.79 GB | 156.8 t/s | Fits in VRAM |
| Q8_0 | 8.52 | Excellent | 3.19 GB | 0.44 GB | 4.42 GB | 125.5 t/s | Fits in VRAM |
| F16 | 16.02 | Excellent | 5.99 GB | 0.44 GB | 7.23 GB | 66.8 t/s | Fits in VRAM |
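The KV cache is computed from the model's exact architecture for a 4,096-token context. Total adds a system margin (about 0.8 GB in every row) to the weights and KV cache. Speed is a rough estimate bounded by memory bandwidth, not a measured benchmark.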
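As a rough illustration of how such figures can be derived, the sketch below computes a KV cache size and a bandwidth-bound speed. The architecture values (28 layers, 8 key/value heads, head dimension 128, 16-bit cache entries) are assumptions based on the published Llama 3.2 3B configuration, not values given on this page. The 400 GB/s bandwidth is likewise an assumption, inferred from the table, where speed times weight size is close to 400 in every row. Function names are hypothetical.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   n_ctx: int, bytes_per_elem: int = 2) -> int:
    """Bytes needed to cache keys and values for n_ctx tokens."""
    # The leading 2 accounts for storing both a key and a value per position.
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem

def tokens_per_second(weights_gb: float, bandwidth_gb_s: float) -> float:
    """Upper-bound estimate: every generated token reads all weights once."""
    return bandwidth_gb_s / weights_gb

# Assumed Llama 3.2 3B architecture at the page's 4,096-token context.
kv = kv_cache_bytes(n_layers=28, n_kv_heads=8, head_dim=128, n_ctx=4096)
print(f"KV cache: {kv / 2**30:.2f} GB")

# Assumed 400 GB/s memory bandwidth; weight sizes are copied from the table.
for name, weights in [("Q4_K_M", 1.88), ("Q8_0", 3.19), ("F16", 5.99)]:
    print(f"{name}: ~{tokens_per_second(weights, 400.0):.1f} t/s")
```

Real throughput is usually lower than this bound, since it ignores compute time, KV cache reads, and framework overhead.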
Frequently asked questions
How much VRAM do you need to run bartowski/Llama-3.2-3B-Instruct-GGUF?
You need about 4.42 GB of VRAM to run bartowski/Llama-3.2-3B-Instruct-GGUF entirely on the GPU using the Q8_0 quantization (at a 4,096-token context). Smaller quantizations lower the requirement at the cost of quality, down to about 2.73 GB with IQ3_M.
Can I run bartowski/Llama-3.2-3B-Instruct-GGUF on an 8 GB GPU?
Yes. With 8 GB of VRAM you can run bartowski/Llama-3.2-3B-Instruct-GGUF fully on the GPU using F16 (about 7.23 GB).
Can I run bartowski/Llama-3.2-3B-Instruct-GGUF on a 16 GB GPU?
Yes. With 16 GB of VRAM you can run bartowski/Llama-3.2-3B-Instruct-GGUF fully on the GPU using F16 (about 7.23 GB).
Can I run bartowski/Llama-3.2-3B-Instruct-GGUF on a 24 GB GPU?
Yes. With 24 GB of VRAM you can run bartowski/Llama-3.2-3B-Instruct-GGUF fully on the GPU using F16 (about 7.23 GB).
What is the best quantization for bartowski/Llama-3.2-3B-Instruct-GGUF?
If memory allows, more bits per weight means better quality. A common sweet spot is Q4_K_M or Q5_K_M, which keeps most of the quality while cutting the weights by roughly 30 to 40 percent versus Q8_0 (1.88 GB and 2.16 GB against 3.19 GB). Pick the highest-bit quantization that still fits in your VRAM.
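The "highest quantization that fits" rule can be sketched as a small lookup. The entries below are a subset of rows copied from the table (totals at a 4,096-token context); the `best_fit` helper is a hypothetical name, and the sketch assumes the whole VRAM budget is available to the model.

```python
# (name, bits per weight, total GB at a 4,096-token context), from the table
QUANTS = [
    ("IQ3_M", 3.98, 2.73),
    ("Q4_K_M", 5.03, 3.12),
    ("Q5_K_M", 5.78, 3.40),
    ("Q6_K", 6.58, 3.70),
    ("Q8_0", 8.52, 4.42),
    ("F16", 16.02, 7.23),
]

def best_fit(vram_gb: float):
    """Return the highest-bit quantization whose total fits, or None."""
    fitting = [q for q in QUANTS if q[2] <= vram_gb]
    if not fitting:
        return None
    return max(fitting, key=lambda q: q[1])[0]

for vram in (2, 4, 6, 8):
    print(f"{vram} GB -> {best_fit(vram)}")
```

A longer context enlarges the KV cache, so the totals would need recomputing before applying the same rule at a different context length.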