Run bartowski/Llama-3.2-3B-Instruct-GGUF locally

License: llama3.2
Downloads: 222,373
Likes: 219
Parameters: 3.21B
Context: 131,072 tokens

Llama-3.2-3B-Instruct is a 3.21-billion-parameter model from the Llama 3.2 family, tuned for instruction-following tasks. It is released under the llama3.2 license and has a knowledge cutoff of December 2023. The model is optimized for deployment in applications requiring structured responses and contextual understanding.

To run bartowski/Llama-3.2-3B-Instruct-GGUF locally at a 4,096-token context, you need between 2.73 GB (IQ3_M, lowest quality) and 7.23 GB (F16, highest quality) of memory. These totals include the weights, the KV cache, and a system margin.
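The totals above can be approximated by hand. The sketch below is not the page's own calculator: it assumes the Llama 3.2 3B architecture values (28 layers, 8 KV heads, head dimension 128), a 16-bit KV cache, and a fixed margin of about 0.8 GB, which is inferred from the table as total minus weights minus KV. Under those assumptions the table's "GB" figures line up with binary gibibytes.

```python
GIB = 1024 ** 3

# Assumed architecture values for Llama 3.2 3B (not listed on this page).
N_LAYERS = 28
N_KV_HEADS = 8
HEAD_DIM = 128
PARAMS = 3.21e9  # parameter count from the page header


def weights_gib(bits_per_weight, params=PARAMS):
    """Size of the quantized weights: parameters * bits / 8 bytes."""
    return params * bits_per_weight / 8 / GIB


def kv_cache_gib(context_tokens, bytes_per_value=2):
    """KV cache size: one K and one V vector per layer, KV head, and token.

    bytes_per_value=2 assumes a 16-bit cache (an assumption, not from the page).
    """
    values = 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * context_tokens
    return values * bytes_per_value / GIB


def total_gib(bits_per_weight, context_tokens=4096, margin_gib=0.8):
    """Weights + KV cache + system margin (margin inferred from the table)."""
    return weights_gib(bits_per_weight) + kv_cache_gib(context_tokens) + margin_gib


if __name__ == "__main__":
    for name, bits in [("IQ3_M", 3.98), ("Q6_K_L", 6.82), ("F16", 16.02)]:
        print(f"{name}: about {total_gib(bits):.2f} GiB")
```

With these assumptions the results land within about 0.01 of the listed totals. The same formula suggests the KV cache grows linearly with context, so the full 131,072-token context would need roughly 14 GiB for the cache alone.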

For most users the best balance is Q6_K_L, which needs about 3.79 GB. That means bartowski/Llama-3.2-3B-Instruct-GGUF runs fully on a GPU with 6 GB of VRAM or more.


All quantizations

Quant.    Bits   Quality    Weights  KV       Total    Speed (approx.)  Verdict
IQ3_M     3.98   Fair       1.49 GB  0.44 GB  2.73 GB  268.5 t/s        Fits in VRAM
Q3_K_L    4.52   Good       1.69 GB  0.44 GB  2.93 GB  236.6 t/s        Fits in VRAM
IQ4_XS    4.55   Good       1.7 GB   0.44 GB  2.94 GB  234.8 t/s        Fits in VRAM
Q3_K_XL   4.76   Good       1.78 GB  0.44 GB  3.02 GB  224.8 t/s        Fits in VRAM
Q4_0_4_4  4.77   Good       1.79 GB  0.44 GB  3.02 GB  224.0 t/s        Fits in VRAM
Q4_0_4_8  4.77   Good       1.79 GB  0.44 GB  3.02 GB  224.0 t/s        Fits in VRAM
Q4_0_8_8  4.77   Good       1.79 GB  0.44 GB  3.02 GB  224.0 t/s        Fits in VRAM
Q4_0      4.79   Good       1.79 GB  0.44 GB  3.03 GB  223.5 t/s        Fits in VRAM
Q4_K_S    4.8    Good       1.8 GB   0.44 GB  3.03 GB  222.7 t/s        Fits in VRAM
Q4_K_M    5.03   Very good  1.88 GB  0.44 GB  3.12 GB  212.7 t/s        Fits in VRAM
Q4_K_L    5.27   Very good  1.97 GB  0.44 GB  3.21 GB  203.1 t/s        Fits in VRAM
Q5_K_S    5.65   Very good  2.11 GB  0.44 GB  3.35 GB  189.2 t/s        Fits in VRAM
Q5_K_M    5.78   Very good  2.16 GB  0.44 GB  3.4 GB   185.0 t/s        Fits in VRAM
Q5_K_L    6.02   Very good  2.25 GB  0.44 GB  3.49 GB  177.7 t/s        Fits in VRAM
Q6_K      6.58   Excellent  2.46 GB  0.44 GB  3.7 GB   162.5 t/s        Fits in VRAM
Q6_K_L    6.82   Excellent  2.55 GB  0.44 GB  3.79 GB  156.8 t/s        Fits in VRAM
Q8_0      8.52   Excellent  3.19 GB  0.44 GB  4.42 GB  15.7 t/s         Offload
F16       16.02  Excellent  5.99 GB  0.44 GB  7.23 GB  8.3 t/s          Offload

KV cache computed from the model's exact architecture. Speed is a rough estimate bounded by memory bandwidth.
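A bandwidth-bound speed estimate can be sketched as follows, assuming each generated token requires reading all the weights once, so tokens per second is at most memory bandwidth divided by weight size. The page does not state the bandwidth it used. The figures in the comments are back-computed from the table rows, and the function names are illustrative.

```python
def tokens_per_second(bandwidth_per_s, weights_size):
    """Upper bound on decode speed when every token reads all weights once.

    Both arguments must use the same unit (e.g. GB/s and GB).
    """
    return bandwidth_per_s / weights_size


def implied_bandwidth(speed_tps, weights_size):
    """Invert the estimate: bandwidth that would produce the listed speed."""
    return speed_tps * weights_size


# (weights in GB, listed speed in t/s), copied from the table above.
ROWS = {
    "IQ3_M": (1.49, 268.5),
    "Q6_K_L": (2.55, 156.8),
    "Q8_0": (3.19, 15.7),
    "F16": (5.99, 8.3),
}

if __name__ == "__main__":
    for name, (weights, speed) in ROWS.items():
        # "Fits in VRAM" rows come out near 400, "Offload" rows near 50.
        print(f"{name}: implied bandwidth about {implied_bandwidth(speed, weights):.0f} GB/s")
```

Read this way, the rows marked "Fits in VRAM" are consistent with roughly 400 GB/s of memory bandwidth, and the "Offload" rows with roughly 50 GB/s, which would explain the sharp drop in speed at Q8_0.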

Frequently asked questions

How much VRAM do you need to run bartowski/Llama-3.2-3B-Instruct-GGUF?

You need about 4.42 GB of VRAM to run bartowski/Llama-3.2-3B-Instruct-GGUF entirely on the GPU using the Q8_0 quantization (at a 4,096-token context). Smaller quantizations lower the requirement at the cost of quality.

Can I run bartowski/Llama-3.2-3B-Instruct-GGUF on an 8 GB GPU?

Yes. With 8 GB of VRAM you can run bartowski/Llama-3.2-3B-Instruct-GGUF fully on the GPU using F16 (about 7.23 GB).

Can I run bartowski/Llama-3.2-3B-Instruct-GGUF on a 16 GB GPU?

Yes. With 16 GB of VRAM you can run bartowski/Llama-3.2-3B-Instruct-GGUF fully on the GPU using F16 (about 7.23 GB).

Can I run bartowski/Llama-3.2-3B-Instruct-GGUF on a 24 GB GPU?

Yes. With 24 GB of VRAM you can run bartowski/Llama-3.2-3B-Instruct-GGUF fully on the GPU using F16 (about 7.23 GB).

What is the best quantization for bartowski/Llama-3.2-3B-Instruct-GGUF?

If memory allows, more bits per weight means better quality. A common sweet spot is Q4_K_M or Q5_K_M, which keeps most of the quality while cutting the weight size by roughly 30 to 40% versus Q8_0 (1.88 GB or 2.16 GB instead of 3.19 GB). Pick the highest-bit quantization that still fits in your VRAM.
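The "highest quantization that still fits" rule can be expressed as a small lookup over the table. This is a minimal sketch: the totals are copied from the table at a 4,096-token context, and the helper name best_quant is hypothetical.

```python
# (quantization, bits per weight, total memory in GB at 4,096 tokens),
# copied from the table above.
QUANTS = [
    ("IQ3_M", 3.98, 2.73), ("Q3_K_L", 4.52, 2.93), ("IQ4_XS", 4.55, 2.94),
    ("Q3_K_XL", 4.76, 3.02), ("Q4_0_4_4", 4.77, 3.02), ("Q4_0_4_8", 4.77, 3.02),
    ("Q4_0_8_8", 4.77, 3.02), ("Q4_0", 4.79, 3.03), ("Q4_K_S", 4.80, 3.03),
    ("Q4_K_M", 5.03, 3.12), ("Q4_K_L", 5.27, 3.21), ("Q5_K_S", 5.65, 3.35),
    ("Q5_K_M", 5.78, 3.40), ("Q5_K_L", 6.02, 3.49), ("Q6_K", 6.58, 3.70),
    ("Q6_K_L", 6.82, 3.79), ("Q8_0", 8.52, 4.42), ("F16", 16.02, 7.23),
]


def best_quant(vram_gb, quants=QUANTS):
    """Return the highest bits-per-weight quantization whose total fits.

    Returns None when even the smallest quantization does not fit.
    """
    fitting = [q for q in quants if q[2] <= vram_gb]
    if not fitting:
        return None
    return max(fitting, key=lambda q: q[1])[0]


if __name__ == "__main__":
    for vram in (2, 4, 6, 8):
        print(f"{vram} GB -> {best_quant(vram)}")
```

The listed totals already include a system margin, so the function compares them directly against the card's VRAM. A longer context increases the KV cache and would shift the result toward smaller quantizations.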