Run unsloth/gpt-oss-120b-GGUF locally
unsloth/gpt-oss-120b-GGUF is a very large language model with 116.83 billion parameters, built on the gpt-oss architecture. It is released under the Apache 2.0 license and has been downloaded 241,119 times.
To run unsloth/gpt-oss-120b-GGUF locally at a 4,096-token context, its quantized versions need between 59.35 GB (Q3_K_S, the smallest) and 61.96 GB (F16, the largest) of memory. These totals include the weights, the KV cache, and a system margin.
The gap between the smallest and largest version is only 2.61 GB, so for most users the best balance is F16, needing about 61.96 GB. No version fits entirely in the VRAM of a 24 GB consumer GPU, so part of the model has to be offloaded to system RAM, which works but is slower.
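The totals quoted above can be reproduced with a short sketch. The 0.80 GB system margin is not stated on this page; it is inferred here from the table rows (for example 59.35 - 58.27 - 0.28), and the function name is illustrative only.

```python
def total_memory_gb(weights_gb: float, kv_cache_gb: float, margin_gb: float = 0.80) -> float:
    """Estimate total memory as weights + KV cache + a fixed system margin.

    margin_gb = 0.80 is an assumption inferred from the table, not a
    documented constant.
    """
    return round(weights_gb + kv_cache_gb + margin_gb, 2)


if __name__ == "__main__":
    # Q3_K_S and F16 rows from the table
    print(total_memory_gb(58.27, 0.28))
    print(total_memory_gb(60.88, 0.28))
```

Some rows in the table differ from this sum by 0.01 GB, which is consistent with the published figures being rounded independently.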
All quantizations
| Quantization | Bits per weight | Quality | Weights | KV cache | Total | Speed (est.) | Verdict |
|---|---|---|---|---|---|---|---|
| Q3_K_S | 4.28 | Good | 58.27 GB | 0.28 GB | 59.35 GB | 0.9 t/s | Offload |
| Q2_K | 4.28 | Good | 58.27 GB | 0.28 GB | 59.35 GB | 0.9 t/s | Offload |
| Q4_0 | 4.29 | Good | 58.32 GB | 0.28 GB | 59.40 GB | 0.9 t/s | Offload |
| Q3_K_M | 4.29 | Good | 58.33 GB | 0.28 GB | 59.41 GB | 0.9 t/s | Offload |
| Q4_1 | 4.29 | Good | 58.41 GB | 0.28 GB | 59.49 GB | 0.9 t/s | Offload |
| Q4_K_S | 4.30 | Good | 58.45 GB | 0.28 GB | 59.53 GB | 0.9 t/s | Offload |
| Q4_K_M | 4.30 | Good | 58.46 GB | 0.28 GB | 59.54 GB | 0.9 t/s | Offload |
| Q2_K_L | 4.30 | Good | 58.54 GB | 0.28 GB | 59.62 GB | 0.9 t/s | Offload |
| Q5_K_S | 4.31 | Good | 58.56 GB | 0.28 GB | 59.64 GB | 0.9 t/s | Offload |
| Q5_K_M | 4.31 | Good | 58.57 GB | 0.28 GB | 59.65 GB | 0.9 t/s | Offload |
| Q4_K_XL | 4.32 | Good | 58.69 GB | 0.28 GB | 59.77 GB | 0.9 t/s | Offload |
| Q6_K | 4.33 | Good | 58.94 GB | 0.28 GB | 60.02 GB | 0.8 t/s | Offload |
| Q6_K_XL | 4.33 | Good | 58.94 GB | 0.28 GB | 60.02 GB | 0.8 t/s | Offload |
| Q8_0 | 4.34 | Good | 59.03 GB | 0.28 GB | 60.12 GB | 0.8 t/s | Offload |
| Q8_K_XL | 4.41 | Good | 60.05 GB | 0.28 GB | 61.13 GB | 0.8 t/s | Offload |
| F16 | 4.48 | Good | 60.88 GB | 0.28 GB | 61.96 GB | 0.8 t/s | Offload |
The KV cache size is computed from the model's exact architecture. Speed is a rough estimate bounded by memory bandwidth.
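As a rough illustration of how the bits and speed columns relate to the weight size, the sketch below recomputes both. It rests on two assumptions that this page does not state: that the "GB" figures are binary gigabytes (2^30 bytes), which matches the bits column, and that the speed column uses a memory bandwidth of about 50 GB/s with every weight read once per token. Both values are inferred from the table, and the function names are hypothetical.

```python
GIB = 2 ** 30           # assumption: the page's "GB" are binary gigabytes
N_PARAMS = 116.83e9     # parameter count quoted on this page


def bits_per_weight(weights_gb: float, n_params: float = N_PARAMS) -> float:
    """Average number of bits stored per parameter."""
    return weights_gb * GIB * 8 / n_params


def tokens_per_second(weights_gb: float, bandwidth_gb_s: float = 50.0) -> float:
    """Bandwidth-bound speed estimate: one full pass over the weights per token.

    bandwidth_gb_s = 50.0 is an assumed figure, chosen because it
    reproduces the table; real hardware will differ.
    """
    return bandwidth_gb_s / weights_gb


if __name__ == "__main__":
    for name, weights in [("Q3_K_S", 58.27), ("Q6_K", 58.94), ("F16", 60.88)]:
        print(name, round(bits_per_weight(weights), 2), round(tokens_per_second(weights), 1))
```

Because the estimate depends only on bandwidth and weight size, treat it as a rough guide rather than a measured speed.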
Frequently asked questions
Can I run unsloth/gpt-oss-120b-GGUF on an 8 GB GPU?
No. unsloth/gpt-oss-120b-GGUF does not fit on an 8 GB GPU, even with the smallest quantization and system RAM offloading.
Can I run unsloth/gpt-oss-120b-GGUF on a 16 GB GPU?
No. unsloth/gpt-oss-120b-GGUF does not fit on a 16 GB GPU, even with the smallest quantization and system RAM offloading.
Can I run unsloth/gpt-oss-120b-GGUF on a 24 GB GPU?
Partially. unsloth/gpt-oss-120b-GGUF runs on a 24 GB GPU, even at F16, but only by offloading part of the model to system RAM, which is slower than running entirely in VRAM.
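The three answers above follow a simple pattern that can be sketched as a function. This page does not say how much system RAM it assumes for each GPU, so `system_ram_gb` is a hypothetical input, and the verdict labels other than "Offload" are illustrative.

```python
def verdict(total_gb: float, vram_gb: float, system_ram_gb: float) -> str:
    """Classify whether a model of total_gb fits on a given machine.

    system_ram_gb is a hypothetical parameter: the page does not state
    the RAM budget behind its answers.
    """
    if total_gb <= vram_gb:
        return "Fits"        # everything stays in VRAM
    if total_gb <= vram_gb + system_ram_gb:
        return "Offload"     # part of the model spills to system RAM
    return "No fit"          # not enough memory even with offloading


if __name__ == "__main__":
    print(verdict(61.96, vram_gb=24, system_ram_gb=64))  # F16 on a 24 GB GPU
    print(verdict(59.35, vram_gb=8, system_ram_gb=32))   # Q3_K_S on an 8 GB GPU
```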
What is the best quantization for unsloth/gpt-oss-120b-GGUF?
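If memory allows, more bits per weight generally means better quality. For many models a common sweet spot is a Q4_K_M or Q5_K_M quantization, which keeps most of the quality while roughly halving the memory versus 8-bit. For this model the table above shows almost no such saving: Q4_K_M needs 59.54 GB and Q8_0 needs 60.12 GB. Pick the highest quantization that still fits in your available memory (VRAM plus system RAM when offloading).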
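The "pick the highest that fits" rule can be sketched as a lookup over the totals from the table. Only a subset of rows is included, the function name is hypothetical, and total size is used as a stand-in for "highest", which is a simplification.

```python
from typing import Optional

# Total memory (GB) at a 4,096-token context, copied from a subset of the table.
TOTALS_GB = {
    "Q3_K_S": 59.35,
    "Q4_K_M": 59.54,
    "Q5_K_M": 59.65,
    "Q8_0": 60.12,
    "F16": 61.96,
}


def best_quant(budget_gb: float, totals: dict = TOTALS_GB) -> Optional[str]:
    """Return the largest quantization whose total fits the budget, or None."""
    fitting = {name: total for name, total in totals.items() if total <= budget_gb}
    if not fitting:
        return None
    # Assumption: a larger total implies a higher-precision quantization.
    return max(fitting, key=fitting.get)


if __name__ == "__main__":
    print(best_quant(64))  # budget above every row
    print(best_quant(60))  # budget between Q5_K_M and Q8_0
    print(best_quant(48))  # budget below every row
```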