Run unsloth/Qwen3-VL-2B-Instruct-GGUF locally

License: apache-2.0 | Downloads: 851,190 | Likes: 34

Parameters: 1.72B

Context: 262,144 tokens

unsloth/Qwen3-VL-2B-Instruct-GGUF is a compact instruction-tuned vision-language chat model with 1.72 billion parameters, built on the qwen3vl architecture. It is released under the apache-2.0 license and has been downloaded 851,190 times.

To run unsloth/Qwen3-VL-2B-Instruct-GGUF locally at a 4,096-token context, its quantized versions need between 1.74 GB (IQ1_S, lowest quality) and 5.21 GB (BF16, highest quality) of memory. These totals include the weights, the KV cache, and a system margin.
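The memory totals can be reproduced with a small calculation. The sketch below is a reading of the table on this page, not the site's actual method: it assumes the "GB" figures are binary gigabytes (2^30 bytes), that weight size equals bits per weight times 1.72 billion parameters, and that the system margin is a fixed 0.8 GB (the value left over when Weights and KV are subtracted from Total in each row). The names `weights_gib` and `total_gib` are hypothetical.

```python
GIB = 2**30  # assumption: the table's "GB" values are binary gigabytes

N_PARAMS = 1.72e9   # parameter count quoted on this page
KV_GIB = 0.44       # KV cache at 4,096 tokens, taken from the table
MARGIN_GIB = 0.80   # inferred: Total - Weights - KV is about 0.8 in every row


def weights_gib(bits_per_weight: float, n_params: float = N_PARAMS) -> float:
    """Size of the weights in GiB for a given effective bits per weight."""
    return bits_per_weight * n_params / 8 / GIB


def total_gib(bits_per_weight: float) -> float:
    """Weights plus KV cache plus the assumed fixed system margin."""
    return weights_gib(bits_per_weight) + KV_GIB + MARGIN_GIB


if __name__ == "__main__":
    # Bits-per-weight values copied from the table for three sample rows.
    for name, bits in [("IQ1_S", 2.5), ("Q4_K_M", 5.15), ("BF16", 19.85)]:
        print(f"{name}: {weights_gib(bits):.2f} weights, {total_gib(bits):.2f} total")
```

With these assumptions the computed values land within about 0.01 GB of the table's Weights and Total columns for the sampled rows; the small gaps come from rounding in the published figures.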

Even the largest option, BF16, needs only about 5.21 GB. That means every listed version of unsloth/Qwen3-VL-2B-Instruct-GGUF fits entirely in the VRAM of a 6 GB or larger GPU and runs fully on the GPU.

All quantizations

Quantization	Bits per weight	Quality	Weights	KV cache	Total	Speed (est.)	Verdict
IQ1_S	2.5	Very low	0.5 GB	0.44 GB	1.74 GB	798.6 t/s	Fits in VRAM
IQ1_M	2.61	Low	0.52 GB	0.44 GB	1.76 GB	764.3 t/s	Fits in VRAM
IQ2_XXS	2.82	Low	0.56 GB	0.44 GB	1.8 GB	709.0 t/s	Fits in VRAM
IQ2_M	3.3	Low	0.66 GB	0.44 GB	1.9 GB	606.0 t/s	Fits in VRAM
IQ3_XXS	3.56	Fair	0.71 GB	0.44 GB	1.95 GB	561.3 t/s	Fits in VRAM
Q2_K	3.62	Fair	0.72 GB	0.44 GB	1.96 GB	552.2 t/s	Fits in VRAM
Q2_K_L	3.62	Fair	0.72 GB	0.44 GB	1.96 GB	552.2 t/s	Fits in VRAM
Q2_K_XL	3.71	Fair	0.74 GB	0.44 GB	1.98 GB	538.3 t/s	Fits in VRAM
F16	3.81	Fair	0.76 GB	0.44 GB	2.0 GB	524.2 t/s	Fits in VRAM
Q3_K_S	4.03	Fair	0.81 GB	0.44 GB	2.05 GB	495.2 t/s	Fits in VRAM
Q3_K_M	4.37	Good	0.88 GB	0.44 GB	2.11 GB	457.1 t/s	Fits in VRAM
Q3_K_XL	4.51	Good	0.9 GB	0.44 GB	2.14 GB	443.3 t/s	Fits in VRAM
IQ4_XS	4.7	Good	0.94 GB	0.44 GB	2.18 GB	425.1 t/s	Fits in VRAM
IQ4_NL	4.9	Good	0.98 GB	0.44 GB	2.22 GB	407.3 t/s	Fits in VRAM
Q4_0	4.91	Good	0.98 GB	0.44 GB	2.22 GB	406.4 t/s	Fits in VRAM
Q4_K_S	4.93	Good	0.99 GB	0.44 GB	2.22 GB	405.1 t/s	Fits in VRAM
Q4_K_M	5.15	Very good	1.03 GB	0.44 GB	2.27 GB	387.8 t/s	Fits in VRAM
Q4_K_XL	5.25	Very good	1.05 GB	0.44 GB	2.29 GB	380.2 t/s	Fits in VRAM
Q4_1	5.31	Very good	1.06 GB	0.44 GB	2.3 GB	375.9 t/s	Fits in VRAM
Q5_K_S	5.72	Very good	1.15 GB	0.44 GB	2.38 GB	349.0 t/s	Fits in VRAM
Q5_K_M	5.85	Very good	1.17 GB	0.44 GB	2.41 GB	341.4 t/s	Fits in VRAM
Q5_K_XL	5.86	Very good	1.17 GB	0.44 GB	2.41 GB	340.5 t/s	Fits in VRAM
Q6_K	6.59	Excellent	1.32 GB	0.44 GB	2.56 GB	302.9 t/s	Fits in VRAM
Q6_K_XL	7.49	Excellent	1.5 GB	0.44 GB	2.74 GB	266.6 t/s	Fits in VRAM
F32	7.57	Excellent	1.52 GB	0.44 GB	2.75 GB	263.8 t/s	Fits in VRAM
Q8_0	8.53	Excellent	1.71 GB	0.44 GB	2.95 GB	234.1 t/s	Fits in VRAM
Q8_K_XL	10.85	Excellent	2.17 GB	0.44 GB	3.41 GB	184.1 t/s	Fits in VRAM
BF16	19.85	Excellent	3.98 GB	0.44 GB	5.21 GB	100.6 t/s	Fits in VRAM

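The KV cache is computed from the model's exact architecture at a 4,096-token context. Speed is a rough estimate bounded by memory bandwidth. The Bits column tracks the Weights column (Weights ≈ Bits × 1.72B / 8) rather than each format's nominal width, so the F16, F32, and BF16 rows show 3.81, 7.57, and 19.85 instead of 16, 32, and 16; treat those three rows with caution.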
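Both estimates can be sketched in a few lines. The architecture values below (28 layers, 8 key/value heads, head dimension 128, 16-bit cache entries) are assumptions not stated on this page; they are used because they reproduce the table's 0.44 GB at 4,096 tokens. The 400 GB/s bandwidth is likewise inferred from the table, where Speed times Weights is close to 400 in every row. Function names are hypothetical.

```python
GIB = 2**30  # assumption: the table's "GB" values are binary gigabytes


def kv_cache_bytes(n_tokens: int,
                   n_layers: int = 28,      # assumed, not stated on this page
                   n_kv_heads: int = 8,     # assumed
                   head_dim: int = 128,     # assumed
                   bytes_per_elem: int = 2  # assumed 16-bit cache entries
                   ) -> int:
    """Bytes needed to cache keys and values for n_tokens of context."""
    # Factor 2: one key vector and one value vector per layer, head, and token.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * n_tokens


def tokens_per_second(weights_gib: float, bandwidth_gib_s: float = 400.0) -> float:
    """Bandwidth-bound decode speed: every weight is read once per token."""
    return bandwidth_gib_s / weights_gib


if __name__ == "__main__":
    print(kv_cache_bytes(4_096) / GIB)     # context used in the table
    print(kv_cache_bytes(262_144) / GIB)   # full advertised context
    print(tokens_per_second(3.98))         # BF16 row
```

Under these assumptions the cache grows linearly with context, from 0.4375 GB at 4,096 tokens to 28 GB at the full 262,144-token window, so the table's totals apply only to the short context it states.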

Frequently asked questions

How much VRAM do you need to run unsloth/Qwen3-VL-2B-Instruct-GGUF?

You need about 5.21 GB of VRAM to run unsloth/Qwen3-VL-2B-Instruct-GGUF entirely on the GPU using the BF16 version (at a 4,096-token context). Smaller quantizations lower the requirement at the cost of quality, down to about 1.74 GB for IQ1_S.

Can I run unsloth/Qwen3-VL-2B-Instruct-GGUF on an 8 GB GPU?

Yes. With 8 GB of VRAM you can run unsloth/Qwen3-VL-2B-Instruct-GGUF fully on the GPU using BF16 (about 5.21 GB).

Can I run unsloth/Qwen3-VL-2B-Instruct-GGUF on a 16 GB GPU?

Yes. With 16 GB of VRAM you can run unsloth/Qwen3-VL-2B-Instruct-GGUF fully on the GPU using BF16 (about 5.21 GB).

Can I run unsloth/Qwen3-VL-2B-Instruct-GGUF on a 24 GB GPU?

Yes. With 24 GB of VRAM you can run unsloth/Qwen3-VL-2B-Instruct-GGUF fully on the GPU using BF16 (about 5.21 GB).

What is the best quantization for unsloth/Qwen3-VL-2B-Instruct-GGUF?

If memory allows, higher bits per weight means better quality. A common sweet spot is a Q4_K_M or Q5_K_M quantization, which keeps most of the quality while cutting weight memory by roughly a third or more versus 8-bit (1.03 GB for Q4_K_M and 1.17 GB for Q5_K_M, against 1.71 GB for Q8_0). Pick the highest-bit quantization that still fits in your VRAM.
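The "highest quantization that fits" rule is easy to express as a lookup. The sketch below uses a hand-copied subset of the table on this page (name, bits per weight, total memory at a 4,096-token context); the `best_fit` helper is hypothetical and simply returns the highest-bit entry whose total fits the given budget.

```python
from typing import Optional

# (name, bits per weight, total GB at 4,096 tokens), copied from the table.
QUANTS = [
    ("IQ1_S", 2.5, 1.74),
    ("Q4_K_M", 5.15, 2.27),
    ("Q5_K_M", 5.85, 2.41),
    ("Q6_K", 6.59, 2.56),
    ("Q8_0", 8.53, 2.95),
    ("BF16", 19.85, 5.21),
]


def best_fit(vram_gb: float, quants=QUANTS) -> Optional[str]:
    """Return the highest-bit quantization whose total fits in vram_gb."""
    fitting = [q for q in quants if q[2] <= vram_gb]
    if not fitting:
        return None  # nothing fits, even the smallest option
    return max(fitting, key=lambda q: q[1])[0]


if __name__ == "__main__":
    for budget in (1.0, 2.5, 3.0, 8.0):
        print(budget, best_fit(budget))
```

In practice it is worth leaving some headroom below the card's nominal VRAM, since the desktop and other applications also use GPU memory, and a longer context enlarges the KV cache.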