Run unsloth/gemma-4-E4B-it-GGUF locally

License: apache-2.0 | Downloads: 687,378 | Likes: 523

Parameters: 7.52B

Context: 131,072 tokens

unsloth/gemma-4-E4B-it-GGUF is a mid-size instruction-tuned chat model with 7.52 billion parameters, built on the gemma4 architecture. It is released under the apache-2.0 license and has been downloaded 687,378 times.

To run unsloth/gemma-4-E4B-it-GGUF locally at a 4,096-token context, the listed files need between 0.9 GB (GGUF, lowest quality) and 15.91 GB (BF16, highest quality) of memory, including weights, KV cache, and a system margin. Among the entries rated Fair or better, the range starts at 4.11 GB (IQ2_M).
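The memory figures can be approximated from the parameter count and the bits per weight. The sketch below is an assumption about how the numbers are derived, not the site's actual method: it treats "GB" as binary gigabytes and uses a margin of about 0.8 GB, both inferred from the table on this page. The function name `estimate_memory_gb` is hypothetical.

```python
GIB = 1024 ** 3  # assumption: the page's "GB" values match binary gigabytes


def estimate_memory_gb(params: float, bits_per_weight: float,
                       kv_cache_gb: float = 0.01,
                       margin_gb: float = 0.8) -> dict:
    """Approximate weight and total memory for one quantization.

    kv_cache_gb and margin_gb default to values inferred from the table
    (4,096-token context); they are not official constants.
    """
    # Each parameter takes bits_per_weight / 8 bytes on average.
    weights_gb = params * bits_per_weight / 8 / GIB
    total_gb = weights_gb + kv_cache_gb + margin_gb
    return {"weights_gb": round(weights_gb, 2), "total_gb": round(total_gb, 2)}


PARAMS = 7.52e9  # parameter count stated on this page

q4 = estimate_memory_gb(PARAMS, 5.3)   # Q4_K_M row: 5.3 bits per weight
q6 = estimate_memory_gb(PARAMS, 7.94)  # Q6_K_XL row: 7.94 bits per weight
print(q4, q6)
```

With these assumptions the results land within a few hundredths of a GB of the Q4_K_M and Q6_K_XL rows in the table.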

For most users the best balance is Q6_K_XL, needing about 7.75 GB. At that size unsloth/gemma-4-E4B-it-GGUF fits entirely in the VRAM of an 8 GB GPU or larger and runs fully on the GPU. On a 6 GB GPU, Q5_K_M (about 5.91 GB) is the largest quantization that fits.


All quantizations

Quant.	Bits/weight	Quality	Weights	KV cache	Total	Speed (est.)	Verdict
GGUF	0.1	Very low	0.09 GB	0.01 GB	0.9 GB	4353.6 t/s	Fits in VRAM
F16	1.24	Very low	1.08 GB	0.01 GB	1.89 GB	369.6 t/s	Fits in VRAM
F32	2.04	Very low	1.78 GB	0.01 GB	2.59 GB	224.6 t/s	Fits in VRAM
IQ2_M	3.77	Fair	3.3 GB	0.01 GB	4.11 GB	121.2 t/s	Fits in VRAM
IQ3_XXS	3.96	Fair	3.46 GB	0.01 GB	4.27 GB	115.5 t/s	Fits in VRAM
Q2_K_XL	4.0	Fair	3.5 GB	0.01 GB	4.31 GB	114.3 t/s	Fits in VRAM
Q3_K_S	4.11	Fair	3.6 GB	0.01 GB	4.4 GB	111.2 t/s	Fits in VRAM
Q3_K_M	4.32	Good	3.78 GB	0.01 GB	4.59 GB	105.8 t/s	Fits in VRAM
Q3_K_XL	4.88	Good	4.27 GB	0.01 GB	5.08 GB	93.6 t/s	Fits in VRAM
IQ4_XS	5.02	Very good	4.39 GB	0.01 GB	5.2 GB	91.1 t/s	Fits in VRAM
IQ4_NL	5.15	Very good	4.5 GB	0.01 GB	5.31 GB	88.8 t/s	Fits in VRAM
Q4_0	5.15	Very good	4.5 GB	0.01 GB	5.31 GB	88.8 t/s	Fits in VRAM
Q4_K_S	5.16	Very good	4.51 GB	0.01 GB	5.32 GB	88.7 t/s	Fits in VRAM
Q4_K_M	5.3	Very good	4.64 GB	0.01 GB	5.44 GB	86.3 t/s	Fits in VRAM
Q4_1	5.4	Very good	4.73 GB	0.01 GB	5.53 GB	84.6 t/s	Fits in VRAM
Q4_K_XL	5.45	Very good	4.77 GB	0.01 GB	5.58 GB	83.8 t/s	Fits in VRAM
Q5_K_S	5.75	Very good	5.03 GB	0.01 GB	5.84 GB	79.5 t/s	Fits in VRAM
Q5_K_M	5.83	Very good	5.11 GB	0.01 GB	5.91 GB	78.3 t/s	Fits in VRAM
Q5_K_XL	7.08	Excellent	6.2 GB	0.01 GB	7.01 GB	64.5 t/s	Fits in VRAM
Q6_K	7.53	Excellent	6.59 GB	0.01 GB	7.4 GB	60.7 t/s	Fits in VRAM
Q6_K_XL	7.94	Excellent	6.95 GB	0.01 GB	7.75 GB	57.6 t/s	Fits in VRAM
Q8_0	8.82	Excellent	7.72 GB	0.01 GB	8.53 GB	6.5 t/s	Offload
Q8_K_XL	9.27	Excellent	8.11 GB	0.01 GB	8.92 GB	6.2 t/s	Offload
BF16	17.26	Excellent	15.1 GB	0.01 GB	15.91 GB	3.3 t/s	Offload

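The KV cache is computed from the model's exact architecture at a 4,096-token context. Total is weights plus KV cache plus a system margin of roughly 0.8 GB. Speed is a rough estimate bounded by memory bandwidth. The Verdict column appears to assume a GPU with about 8 GB of VRAM: totals up to 7.75 GB are marked as fitting, and larger ones as offloaded. The F16 and F32 rows report bits-per-weight far below the 16 and 32 bits those names normally imply, and "GGUF" is a file format rather than a quantization level, so these three rows probably do not describe full-weight quantizations of the model and should be read with caution.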
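A bandwidth-bound speed estimate can be sketched as memory bandwidth divided by the size of the weights, since each generated token reads roughly all weights once. The two bandwidth constants below are assumptions inferred from the table (about 400 GB/s for rows that fit in VRAM, about 50 GB/s for offloaded rows), not measured values, and the function name is hypothetical.

```python
def estimate_tokens_per_second(weights_gb: float, bandwidth_gb_s: float) -> float:
    """Upper-bound decode speed when generation is limited by memory bandwidth."""
    if weights_gb <= 0:
        raise ValueError("weights_gb must be positive")
    # Every token requires streaming (approximately) all weights from memory.
    return bandwidth_gb_s / weights_gb


# Assumed bandwidths, inferred from the table rather than stated by the page.
GPU_BANDWIDTH_GB_S = 400.0  # rows marked "Fits in VRAM"
RAM_BANDWIDTH_GB_S = 50.0   # rows marked "Offload"

on_gpu = estimate_tokens_per_second(6.95, GPU_BANDWIDTH_GB_S)     # Q6_K_XL weights
offloaded = estimate_tokens_per_second(7.72, RAM_BANDWIDTH_GB_S)  # Q8_0 weights
print(round(on_gpu, 1), round(offloaded, 1))
```

Real throughput depends on the hardware, the runtime, and the context length, so these values are best read as ceilings rather than predictions.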

Frequently asked questions

How much VRAM do you need to run unsloth/gemma-4-E4B-it-GGUF?

You need about 5.91 GB of VRAM to run unsloth/gemma-4-E4B-it-GGUF entirely on the GPU using the Q5_K_M quantization (at a 4,096-token context). Smaller quantizations lower the requirement at the cost of quality.

Can I run unsloth/gemma-4-E4B-it-GGUF on an 8 GB GPU?

Yes. With 8 GB of VRAM you can run unsloth/gemma-4-E4B-it-GGUF fully on the GPU using Q6_K_XL (about 7.75 GB).

Can I run unsloth/gemma-4-E4B-it-GGUF on a 16 GB GPU?

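Yes. With 16 GB of VRAM you can run unsloth/gemma-4-E4B-it-GGUF fully on the GPU using BF16 (about 15.91 GB). That leaves under 0.1 GB of headroom, so Q8_K_XL (about 8.92 GB) is the safer choice if anything else is using the GPU.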

Can I run unsloth/gemma-4-E4B-it-GGUF on a 24 GB GPU?

Yes. With 24 GB of VRAM you can run unsloth/gemma-4-E4B-it-GGUF fully on the GPU using BF16 (about 15.91 GB).

What is the best quantization for unsloth/gemma-4-E4B-it-GGUF?

If memory allows, more bits per weight means better quality. A common sweet spot is Q4_K_M or Q5_K_M, which keeps most of the quality while cutting weight memory by roughly 35 to 40 percent versus Q8_0 for this model (4.64 to 5.11 GB against 7.72 GB). Pick the highest-bit quantization that still fits in your VRAM.
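The "largest quantization that fits" rule can be sketched as a lookup over the Total column of the table on this page. The helper name `best_quant_for` is hypothetical, and the GGUF, F16, and F32 rows are left out here because their bits-per-weight figures look inconsistent with their names.

```python
# (quantization, total memory in GB at a 4,096-token context), from the table.
QUANT_TOTALS_GB = [
    ("IQ2_M", 4.11), ("IQ3_XXS", 4.27), ("Q2_K_XL", 4.31), ("Q3_K_S", 4.4),
    ("Q3_K_M", 4.59), ("Q3_K_XL", 5.08), ("IQ4_XS", 5.2), ("IQ4_NL", 5.31),
    ("Q4_0", 5.31), ("Q4_K_S", 5.32), ("Q4_K_M", 5.44), ("Q4_1", 5.53),
    ("Q4_K_XL", 5.58), ("Q5_K_S", 5.84), ("Q5_K_M", 5.91), ("Q5_K_XL", 7.01),
    ("Q6_K", 7.4), ("Q6_K_XL", 7.75), ("Q8_0", 8.53), ("Q8_K_XL", 8.92),
    ("BF16", 15.91),
]


def best_quant_for(vram_gb: float):
    """Return the largest listed quantization whose total fits in vram_gb.

    Total size is used as a proxy for quality, since bits per weight rise
    with total size in the table. Returns None if nothing fits.
    """
    fitting = [(name, total) for name, total in QUANT_TOTALS_GB if total <= vram_gb]
    if not fitting:
        return None
    return max(fitting, key=lambda pair: pair[1])[0]


for vram in (6, 8, 16, 24):
    print(vram, "GB ->", best_quant_for(vram))
```

This treats the full VRAM as available to the model; if a desktop or other applications also use the GPU, passing a smaller `vram_gb` gives a more conservative pick.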