Run unsloth/Qwen3.5-4B-GGUF locally
unsloth/Qwen3.5-4B-GGUF is a mid-size language model with 4.21 billion parameters, built on the qwen35 architecture. It is released under the apache-2.0 license and has been downloaded 998,890 times.
To run unsloth/Qwen3.5-4B-GGUF locally at a 4,096-token context, its quantized versions need between 1.8 GB (the 1.28-bit entry, listed as F16 in the table below; lowest quality) and 9.65 GB (BF16, highest quality) of memory. Each figure includes the weights, the KV cache, and a system margin.
For most users the best balance is Q8_K_XL, needing about 6.72 GB. That means unsloth/Qwen3.5-4B-GGUF fits entirely in the VRAM of an 8 GB GPU or larger, running fully on the GPU. On a 6 GB GPU, Q8_0 (about 5.35 GB) is the largest quantization that still fits.
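The totals quoted above can be approximated with simple arithmetic. The sketch below is not the page's actual calculator. It assumes the weights are roughly parameters times bits per weight, that the table's "GB" values are binary gigabytes, and that the system margin is about 0.8 GB (the difference between Total and Weights plus KV in the table rows). The function names are hypothetical.

```python
GIB = 1024 ** 3  # assumption: the table's "GB" figures are binary gigabytes


def weights_gib(params: float, bits_per_weight: float) -> float:
    """Approximate weight size: parameters x bits per weight, in GiB."""
    return params * bits_per_weight / 8 / GIB


def total_gib(weights: float, kv_cache: float = 0.38, margin: float = 0.8) -> float:
    """Weights + KV cache + system margin.

    kv_cache defaults to the 4,096-token figure from the table;
    margin = 0.8 is inferred from the table, not a documented constant.
    """
    return weights + kv_cache + margin


PARAMS = 4.21e9  # parameter count quoted for the model (rounded)

q8_0 = weights_gib(PARAMS, 8.53)
print(f"Q8_0 weights ~ {q8_0:.2f} GiB, total ~ {total_gib(q8_0):.2f} GiB")
```

Because the parameter count is rounded, the results can differ from the table by about 0.01 GB.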
All quantizations
| Quant. | Bits/weight | Quality | Weights | KV cache | Total | Speed (est.) | Verdict |
|---|---|---|---|---|---|---|---|
| F16 | 1.28 | Very low | 0.63 GB | 0.38 GB | 1.8 GB | 638.7 t/s | Fits in VRAM |
| F32 | 2.54 | Very low | 1.24 GB | 0.38 GB | 2.42 GB | 321.9 t/s | Fits in VRAM |
| IQ2_XXS | 2.89 | Low | 1.42 GB | 0.38 GB | 2.59 GB | 282.5 t/s | Fits in VRAM |
| IQ2_M | 3.35 | Fair | 1.64 GB | 0.38 GB | 2.81 GB | 244.0 t/s | Fits in VRAM |
| Q2_K_XL | 3.69 | Fair | 1.81 GB | 0.38 GB | 2.98 GB | 221.3 t/s | Fits in VRAM |
| IQ3_XXS | 3.71 | Fair | 1.82 GB | 0.38 GB | 2.99 GB | 220.4 t/s | Fits in VRAM |
| Q3_K_S | 4.01 | Fair | 1.96 GB | 0.38 GB | 3.14 GB | 204.0 t/s | Fits in VRAM |
| Q3_K_M | 4.36 | Good | 2.14 GB | 0.38 GB | 3.31 GB | 187.3 t/s | Fits in VRAM |
| Q3_K_XL | 4.63 | Good | 2.27 GB | 0.38 GB | 3.44 GB | 176.3 t/s | Fits in VRAM |
| IQ4_XS | 4.71 | Good | 2.31 GB | 0.38 GB | 3.48 GB | 173.4 t/s | Fits in VRAM |
| IQ4_NL | 4.91 | Good | 2.4 GB | 0.38 GB | 3.58 GB | 166.5 t/s | Fits in VRAM |
| Q4_0 | 4.91 | Good | 2.41 GB | 0.38 GB | 3.58 GB | 166.3 t/s | Fits in VRAM |
| Q4_K_S | 4.93 | Good | 2.41 GB | 0.38 GB | 3.59 GB | 165.8 t/s | Fits in VRAM |
| Q4_K_M | 5.21 | Very good | 2.55 GB | 0.38 GB | 3.73 GB | 156.7 t/s | Fits in VRAM |
| Q4_1 | 5.3 | Very good | 2.59 GB | 0.38 GB | 3.77 GB | 154.3 t/s | Fits in VRAM |
| Q4_K_XL | 5.54 | Very good | 2.71 GB | 0.38 GB | 3.89 GB | 147.5 t/s | Fits in VRAM |
| Q5_K_S | 5.75 | Very good | 2.82 GB | 0.38 GB | 3.99 GB | 142.0 t/s | Fits in VRAM |
| Q5_K_M | 5.98 | Very good | 2.93 GB | 0.38 GB | 4.1 GB | 136.6 t/s | Fits in VRAM |
| Q5_K_XL | 6.18 | Very good | 3.03 GB | 0.38 GB | 4.2 GB | 132.1 t/s | Fits in VRAM |
| Q6_K | 6.71 | Excellent | 3.28 GB | 0.38 GB | 4.46 GB | 121.8 t/s | Fits in VRAM |
| Q6_K_XL | 7.89 | Excellent | 3.86 GB | 0.38 GB | 5.04 GB | 103.6 t/s | Fits in VRAM |
| Q8_0 | 8.53 | Excellent | 4.17 GB | 0.38 GB | 5.35 GB | 95.8 t/s | Fits in VRAM |
| Q8_K_XL | 11.32 | Excellent | 5.54 GB | 0.38 GB | 6.72 GB | 72.2 t/s | Fits in VRAM |
| BF16 | 17.31 | Excellent | 8.48 GB | 0.38 GB | 9.65 GB | 5.9 t/s | Offload |
The KV cache is computed from the model's exact architecture at a 4,096-token context. Speed is a rough estimate bounded by memory bandwidth.
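A bandwidth-bound speed estimate can be sketched as follows, assuming each generated token requires reading all the weights once, so that tokens per second is at most bandwidth divided by weight size. The 400 GB/s and 50 GB/s figures are assumptions inferred from the table's Speed column (in-VRAM rows and the Offload row respectively), not published hardware specifications. The function name is hypothetical.

```python
def tokens_per_second(weights_gb: float, bandwidth_gb_s: float) -> float:
    """Upper-bound decode speed when generation is limited by memory bandwidth."""
    if weights_gb <= 0:
        raise ValueError("weights_gb must be positive")
    return bandwidth_gb_s / weights_gb


VRAM_BANDWIDTH = 400.0  # GB/s, assumed; inferred from the in-VRAM table rows
RAM_BANDWIDTH = 50.0    # GB/s, assumed; inferred from the Offload row

print(f"Q8_K_XL in VRAM: ~{tokens_per_second(5.54, VRAM_BANDWIDTH):.1f} t/s")
print(f"BF16 offloaded:  ~{tokens_per_second(8.48, RAM_BANDWIDTH):.1f} t/s")
```

Real throughput is usually lower than this bound, since it ignores compute time, KV cache reads, and runtime overhead.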
Frequently asked questions
How much VRAM do you need to run unsloth/Qwen3.5-4B-GGUF?
You need about 5.35 GB of VRAM to run unsloth/Qwen3.5-4B-GGUF entirely on the GPU using the Q8_0 quantization (at a 4,096-token context). Smaller quantizations lower the requirement at the cost of quality.
Can I run unsloth/Qwen3.5-4B-GGUF on an 8 GB GPU?
Yes. With 8 GB of VRAM you can run unsloth/Qwen3.5-4B-GGUF fully on the GPU using Q8_K_XL (about 6.72 GB).
Can I run unsloth/Qwen3.5-4B-GGUF on a 16 GB GPU?
Yes. With 16 GB of VRAM you can run unsloth/Qwen3.5-4B-GGUF fully on the GPU using BF16 (about 9.65 GB).
Can I run unsloth/Qwen3.5-4B-GGUF on a 24 GB GPU?
Yes. With 24 GB of VRAM you can run unsloth/Qwen3.5-4B-GGUF fully on the GPU using BF16 (about 9.65 GB).
What is the best quantization for unsloth/Qwen3.5-4B-GGUF?
If memory allows, more bits per weight means better quality. A common sweet spot is Q4_K_M or Q5_K_M, which keeps most of the quality while cutting weight memory by roughly 30 to 40 percent versus Q8_0 (2.55 GB and 2.93 GB versus 4.17 GB). Otherwise, pick the highest quantization that still fits in your VRAM.
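The "highest quantization that fits" rule can be sketched as a simple lookup. The example uses a subset of the Total column from the table (4,096-token context); the `pick_quant` helper is hypothetical and ignores any VRAM already used by the desktop or other applications.

```python
from typing import Optional

# (quantization, bits per weight, total memory in GB) from the table above
QUANTS = [
    ("Q4_K_M", 5.21, 3.73),
    ("Q5_K_M", 5.98, 4.10),
    ("Q6_K", 6.71, 4.46),
    ("Q8_0", 8.53, 5.35),
    ("Q8_K_XL", 11.32, 6.72),
    ("BF16", 17.31, 9.65),
]


def pick_quant(vram_gb: float) -> Optional[str]:
    """Return the highest bits-per-weight quantization whose total fits in vram_gb."""
    fitting = [q for q in QUANTS if q[2] <= vram_gb]
    if not fitting:
        return None
    return max(fitting, key=lambda q: q[1])[0]


for vram in (6, 8, 16):
    print(f"{vram} GB GPU -> {pick_quant(vram)}")
```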