Run unsloth/gemma-4-31B-it-GGUF locally

License: apache-2.0 | Downloads: 591,899 | Likes: 505

Parameters: 30.7B

Context: 262,144 tokens

unsloth/gemma-4-31B-it-GGUF is a very large instruction-tuned chat model with 30.7 billion parameters, built on the gemma4 architecture. It is released under the apache-2.0 license and has been downloaded 591,899 times.

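To run unsloth/gemma-4-31B-it-GGUF locally at a 4,096-token context, its quantized versions need between 11.42 GB (IQ2_XXS, lowest quality) and 62.68 GB (BF16, highest quality) of memory, with weights, KV cache and a system margin included. The three smallest rows in the table below (GGUF, F16, F32) are left out of this range because their sizes are far too small to be full-model weights.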
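The totals on this page can be approximated from the parameter count and the bits per weight. The sketch below is a minimal illustration, not the page's actual calculation: the function names are hypothetical, the 0.8 GB system margin is inferred from the table (Total minus Weights minus KV), and the table's "GB" values are assumed to be binary gigabytes because that reproduces its weight sizes.

```python
GIB = 2**30  # assumption: the table's "GB" figures are binary gigabytes


def weights_gib(params: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in GiB."""
    return params * bits_per_weight / 8 / GIB


def total_gib(params: float, bits_per_weight: float,
              kv_gib: float = 2.67, margin_gib: float = 0.8) -> float:
    """Weights plus KV cache plus a fixed system margin.

    kv_gib is the page's estimate for a 4,096-token context;
    margin_gib is inferred from the table, not documented by the page.
    """
    return weights_gib(params, bits_per_weight) + kv_gib + margin_gib


if __name__ == "__main__":
    # Q4_K_M is listed at 4.78 bits per weight for a 30.7B-parameter model.
    print(f"Q4_K_M total: {total_gib(30.7e9, 4.78):.2f} GiB")
```

For Q4_K_M this lands within a few hundredths of a gigabyte of the 20.54 GB shown in the table; the small gap comes from the rounded parameter count.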

For most users the best balance is Q4_K_M, needing about 20.54 GB. That means unsloth/gemma-4-31B-it-GGUF fits entirely in the VRAM of a 24 GB GPU or larger, running fully on the GPU. On smaller cards, part of the model has to be offloaded to system RAM.


All quantizations

Quant.	Bits/weight	Quality	Weights	KV cache	Total	Speed (est.)	Verdict
GGUF	0.13	Very low	0.48 GB	2.67 GB	3.95 GB	834.5 t/s	Fits in VRAM
F16	0.56	Very low	2.01 GB	2.67 GB	5.48 GB	199.4 t/s	Fits in VRAM
F32	0.6	Very low	2.14 GB	2.67 GB	5.62 GB	186.5 t/s	Fits in VRAM
IQ2_XXS	2.22	Very low	7.95 GB	2.67 GB	11.42 GB	6.3 t/s	Offload
IQ2_M	2.8	Low	10.01 GB	2.67 GB	13.49 GB	5.0 t/s	Offload
Q2_K_XL	3.07	Low	10.97 GB	2.67 GB	14.44 GB	4.6 t/s	Offload
IQ3_XXS	3.09	Low	11.02 GB	2.67 GB	14.5 GB	4.5 t/s	Offload
Q3_K_S	3.44	Fair	12.3 GB	2.67 GB	15.78 GB	4.1 t/s	Offload
Q3_K_M	3.84	Fair	13.72 GB	2.67 GB	17.2 GB	3.6 t/s	Offload
Q3_K_XL	4.01	Fair	14.32 GB	2.67 GB	17.8 GB	3.5 t/s	Offload
IQ4_XS	4.27	Good	15.25 GB	2.67 GB	18.72 GB	3.3 t/s	Offload
IQ4_NL	4.51	Good	16.1 GB	2.67 GB	19.57 GB	3.1 t/s	Offload
Q4_0	4.52	Good	16.15 GB	2.67 GB	19.62 GB	3.1 t/s	Offload
Q4_K_S	4.53	Good	16.2 GB	2.67 GB	19.68 GB	3.1 t/s	Offload
Q4_K_M	4.78	Good	17.07 GB	2.67 GB	20.54 GB	2.9 t/s	Offload
Q4_K_XL	4.91	Good	17.53 GB	2.67 GB	21.0 GB	2.9 t/s	Offload
Q4_1	4.98	Good	17.81 GB	2.67 GB	21.28 GB	2.8 t/s	Offload
Q5_K_S	5.51	Very good	19.67 GB	2.67 GB	23.15 GB	2.5 t/s	Offload
Q5_K_M	5.64	Very good	20.17 GB	2.67 GB	23.64 GB	2.5 t/s	Offload
Q5_K_XL	5.7	Very good	20.39 GB	2.67 GB	23.86 GB	2.5 t/s	Offload
Q6_K	6.57	Excellent	23.47 GB	2.67 GB	26.94 GB	—	Insufficient
Q6_K_XL	7.17	Excellent	25.63 GB	2.67 GB	29.1 GB	—	Insufficient
Q8_0	8.64	Excellent	30.87 GB	2.67 GB	34.35 GB	—	Insufficient
Q8_K_XL	9.13	Excellent	32.61 GB	2.67 GB	36.09 GB	—	Insufficient
BF16	16.57	Excellent	59.2 GB	2.67 GB	62.68 GB	—	Insufficient

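The KV cache figure is an estimate, since the model's architecture details were unavailable. Speed is a rough estimate bounded by memory bandwidth. The Verdict column is relative to a reference hardware configuration that this page does not specify. The GGUF, F16 and F32 rows should not be read as full-model quantizations: F32 stores 32 bits per weight, which for 30.7 billion parameters is well over 100 GB, not 2.14 GB, so these rows most likely describe smaller auxiliary files in the repository.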
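A bandwidth-bound speed estimate assumes that generating each token requires reading all of the weights once, so tokens per second is roughly memory bandwidth divided by weight size. The sketch below is a simplified illustration with a hypothetical function name. The 50 GB/s figure is an assumption inferred from the table's Offload rows (speed times weights is about 50 in each), not a value stated by the page.

```python
def tokens_per_second(weights_gb: float, bandwidth_gb_s: float) -> float:
    """Upper-bound decode speed when every token reads all weights once."""
    if weights_gb <= 0:
        raise ValueError("weights_gb must be positive")
    return bandwidth_gb_s / weights_gb


if __name__ == "__main__":
    # Assumed effective bandwidth when layers are offloaded to system RAM.
    OFFLOAD_BANDWIDTH_GB_S = 50.0
    for name, weights in [("IQ2_XXS", 7.95), ("Q4_K_M", 17.07), ("Q5_K_M", 20.17)]:
        speed = tokens_per_second(weights, OFFLOAD_BANDWIDTH_GB_S)
        print(f"{name}: {speed:.1f} t/s")
```

Real throughput is usually lower than this bound, because compute, KV cache reads and transfer overhead are ignored here.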

Frequently asked questions

How much VRAM do you need to run unsloth/gemma-4-31B-it-GGUF?

You need about 20.54 GB of VRAM to run unsloth/gemma-4-31B-it-GGUF entirely on the GPU using the Q4_K_M quantization (at a 4,096-token context). Smaller quantizations lower the requirement at the cost of quality, down to about 11.42 GB for IQ2_XXS.

Can I run unsloth/gemma-4-31B-it-GGUF on an 8 GB GPU?

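Not fully on the GPU. The smallest full-model quantization, IQ2_XXS, needs about 11.42 GB at a 4,096-token context, so with 8 GB of VRAM part of the model has to be offloaded to system RAM, which reduces speed. The 5.62 GB F32 entry in the table is too small to be the full 30.7B-parameter model.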

Can I run unsloth/gemma-4-31B-it-GGUF on a 16 GB GPU?

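Yes, but with little headroom. With 16 GB of VRAM you can run unsloth/gemma-4-31B-it-GGUF fully on the GPU using Q3_K_S (about 15.78 GB at a 4,096-token context). A longer context or other applications using the GPU may push it over the limit.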

Can I run unsloth/gemma-4-31B-it-GGUF on a 24 GB GPU?

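Yes. With 24 GB of VRAM you can run unsloth/gemma-4-31B-it-GGUF fully on the GPU using Q5_K_XL (about 23.86 GB at a 4,096-token context), though that leaves almost no headroom. Q4_K_M (about 20.54 GB) leaves more room for a longer context.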

What is the best quantization for unsloth/gemma-4-31B-it-GGUF?

If memory allows, higher bits per weight means better quality. A common sweet spot is Q4_K_M or Q5_K_M, which keeps most of the quality while cutting memory substantially versus 8-bit: for this model the Q4_K_M weights (17.07 GB) are about 55% the size of Q8_0 (30.87 GB), and Q5_K_M (20.17 GB) about 65%. Pick the highest-bit quantization whose total still fits in your VRAM.
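The "highest quantization that fits" rule can be sketched as a lookup over the table's totals. This is an illustrative helper with hypothetical names. The data is copied from the table on this page (4,096-token context), and the GGUF, F16 and F32 rows are deliberately omitted on the assumption that they are not full-model quantizations.

```python
from typing import Optional

# (name, bits per weight, total memory in GB) from the table on this page.
QUANTS = [
    ("IQ2_XXS", 2.22, 11.42), ("IQ2_M", 2.80, 13.49), ("Q2_K_XL", 3.07, 14.44),
    ("IQ3_XXS", 3.09, 14.50), ("Q3_K_S", 3.44, 15.78), ("Q3_K_M", 3.84, 17.20),
    ("Q3_K_XL", 4.01, 17.80), ("IQ4_XS", 4.27, 18.72), ("IQ4_NL", 4.51, 19.57),
    ("Q4_0", 4.52, 19.62), ("Q4_K_S", 4.53, 19.68), ("Q4_K_M", 4.78, 20.54),
    ("Q4_K_XL", 4.91, 21.00), ("Q4_1", 4.98, 21.28), ("Q5_K_S", 5.51, 23.15),
    ("Q5_K_M", 5.64, 23.64), ("Q5_K_XL", 5.70, 23.86), ("Q6_K", 6.57, 26.94),
    ("Q6_K_XL", 7.17, 29.10), ("Q8_0", 8.64, 34.35), ("Q8_K_XL", 9.13, 36.09),
    ("BF16", 16.57, 62.68),
]


def best_quant(vram_gb: float) -> Optional[str]:
    """Return the highest-bit quantization whose total fits, or None."""
    fitting = [(bits, name) for name, bits, total in QUANTS if total <= vram_gb]
    if not fitting:
        return None  # nothing fits fully in VRAM; offloading would be needed
    return max(fitting)[1]


if __name__ == "__main__":
    for vram in (8, 16, 24, 48):
        print(f"{vram} GB -> {best_quant(vram)}")
```

The helper only checks whether the total fits. In practice it is worth leaving some headroom for longer contexts and for whatever else is using the GPU.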