Run bartowski/Llama-3.3-70B-Instruct-GGUF locally
Llama-3.3-70B-Instruct is a 70.55-billion-parameter language model from the Llama family, designed for instruction-following tasks and licensed under the llama3.3 terms. It is optimized for general-purpose reasoning and interaction, with a knowledge cutoff of December 2023. The model is compatible with tools like LM Studio and is available on Hugging Face.
To run bartowski/Llama-3.3-70B-Instruct-GGUF locally at a 4,096-token context, its quantized versions need between 17.65 GB (IQ1_M, lowest quality) and 133.48 GB (F16, highest quality) of memory. These totals include the weights, the KV cache, and a system margin.
The largest quantization that fits a single 24 GB GPU is IQ2_S, needing about 22.76 GB. That means bartowski/Llama-3.3-70B-Instruct-GGUF fits entirely in the VRAM of a 24 GB GPU or larger and can run fully on the GPU, although IQ2_S sits in the lowest quality tier of the table below.
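The totals quoted above can be approximated from the bits-per-weight figure alone. The sketch below is a rough reconstruction, not the page's actual calculation: it assumes weights take parameters × bits / 8 bytes, that the "GB" figures are binary gigabytes (2^30 bytes), and that the system margin is 0.80 GB, which is the gap between Total and Weights + KV in the quantization table. The function names are illustrative.

```python
GIB = 2**30  # assumption: the "GB" figures are binary gigabytes

PARAMS = 70.55e9     # parameter count quoted in the introduction
KV_CACHE_GB = 1.25   # KV cache at a 4,096-token context
MARGIN_GB = 0.80     # inferred: Total minus Weights minus KV in the table


def weights_gb(bits_per_weight: float, params: float = PARAMS) -> float:
    """Approximate weight size: parameters x bits per weight, in GiB."""
    return params * bits_per_weight / 8 / GIB


def total_gb(bits_per_weight: float) -> float:
    """Weights plus KV cache plus system margin."""
    return weights_gb(bits_per_weight) + KV_CACHE_GB + MARGIN_GB


if __name__ == "__main__":
    for name, bits in [("IQ1_M", 1.9), ("IQ2_S", 2.52), ("Q4_K_M", 4.82), ("F16", 16.0)]:
        print(f"{name}: ~{total_gb(bits):.2f} GB")
```

For these four quantizations the estimate lands within a few hundredths of a GB of the published totals; the small differences come from the rounded parameter count and bits-per-weight values.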
All quantizations
| Quantization | Bits per weight | Quality | Weights | KV cache | Total | Speed (approx.) | Verdict |
|---|---|---|---|---|---|---|---|
| IQ1_M | 1.9 | Very low | 15.6 GB | 1.25 GB | 17.65 GB | 3.2 t/s | Offload |
| IQ2_XXS | 2.17 | Very low | 17.79 GB | 1.25 GB | 19.84 GB | 2.8 t/s | Offload |
| IQ2_XS | 2.4 | Very low | 19.69 GB | 1.25 GB | 21.74 GB | 2.5 t/s | Offload |
| IQ2_S | 2.52 | Very low | 20.71 GB | 1.25 GB | 22.76 GB | 2.4 t/s | Offload |
| IQ2_M | 2.73 | Low | 22.46 GB | 1.25 GB | 24.51 GB | — | Insufficient |
| Q2_K | 2.99 | Low | 24.56 GB | 1.25 GB | 26.61 GB | — | Insufficient |
| Q2_K_L | 3.11 | Low | 25.52 GB | 1.25 GB | 27.57 GB | — | Insufficient |
| IQ3_XXS | 3.11 | Low | 25.58 GB | 1.25 GB | 27.63 GB | — | Insufficient |
| IQ3_XS | 3.32 | Fair | 27.29 GB | 1.25 GB | 29.34 GB | — | Insufficient |
| Q3_K_S | 3.51 | Fair | 28.79 GB | 1.25 GB | 30.84 GB | — | Insufficient |
| IQ3_M | 3.62 | Fair | 29.74 GB | 1.25 GB | 31.79 GB | — | Insufficient |
| Q3_K_M | 3.89 | Fair | 31.91 GB | 1.25 GB | 33.96 GB | — | Insufficient |
| Q3_K_L | 4.21 | Good | 34.59 GB | 1.25 GB | 36.64 GB | — | Insufficient |
| IQ4_XS | 4.3 | Good | 35.3 GB | 1.25 GB | 37.35 GB | — | Insufficient |
| Q3_K_XL | 4.32 | Good | 35.45 GB | 1.25 GB | 37.5 GB | — | Insufficient |
| Q4_0_4_4 | 4.53 | Good | 37.22 GB | 1.25 GB | 39.27 GB | — | Insufficient |
| Q4_0_4_8 | 4.53 | Good | 37.22 GB | 1.25 GB | 39.27 GB | — | Insufficient |
| Q4_0_8_8 | 4.53 | Good | 37.22 GB | 1.25 GB | 39.27 GB | — | Insufficient |
| IQ4_NL | 4.54 | Good | 37.3 GB | 1.25 GB | 39.35 GB | — | Insufficient |
| Q4_0 | 4.55 | Good | 37.36 GB | 1.25 GB | 39.41 GB | — | Insufficient |
| Q4_K_S | 4.57 | Good | 37.58 GB | 1.25 GB | 39.63 GB | — | Insufficient |
| Q4_K_M | 4.82 | Good | 39.6 GB | 1.25 GB | 41.65 GB | — | Insufficient |
| Q4_K_L | 4.91 | Good | 40.33 GB | 1.25 GB | 42.38 GB | — | Insufficient |
| Q5_K_S | 5.52 | Very good | 45.32 GB | 1.25 GB | 47.37 GB | — | Insufficient |
| Q5_K_M | 5.66 | Very good | 46.52 GB | 1.25 GB | 48.57 GB | — | Insufficient |
| Q5_K_L | 5.74 | Very good | 47.12 GB | 1.25 GB | 49.17 GB | — | Insufficient |
| Q6_K | 6.56 | Excellent | 53.91 GB | 1.25 GB | 55.96 GB | — | Insufficient |
| Q6_K_L | 6.62 | Excellent | 54.39 GB | 1.25 GB | 56.44 GB | — | Insufficient |
| Q8_0 | 8.5 | Excellent | 69.83 GB | 1.25 GB | 71.88 GB | — | Insufficient |
| F16 | 16.0 | Excellent | 131.43 GB | 1.25 GB | 133.48 GB | — | Insufficient |
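The KV cache is computed from the model's exact architecture at a 4,096-token context. Total is the weights plus the KV cache plus a system margin of about 0.8 GB. Speed is a rough estimate bounded by memory bandwidth. The Speed and Verdict columns depend on the hardware being evaluated, which this table does not state.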
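Both notes can be made concrete with a short sketch. The architecture constants below (80 layers, 8 key/value heads, head dimension 128, 16-bit cache entries) are assumed values for a Llama 3 70B-class model and are not listed on this page. The 50 GB/s bandwidth is likewise an assumption, chosen because it is consistent with the four speeds shown in the table.

```python
GIB = 2**30

# Assumed Llama 3 70B-class architecture (not stated on this page)
N_LAYERS = 80
N_KV_HEADS = 8        # grouped-query attention
HEAD_DIM = 128
BYTES_PER_VALUE = 2   # 16-bit cache entries


def kv_cache_gib(context_tokens: int) -> float:
    """KV cache size: one key and one value vector per layer, KV head and token."""
    per_token_bytes = 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * BYTES_PER_VALUE
    return per_token_bytes * context_tokens / GIB


def tokens_per_second(weights_gb: float, bandwidth_gb_s: float) -> float:
    """Bandwidth-bound estimate: each generated token reads all weights once."""
    return bandwidth_gb_s / weights_gb


if __name__ == "__main__":
    print(f"KV cache at 4,096 tokens: {kv_cache_gib(4096):.2f} GiB")
    # 50 GB/s is an assumed effective bandwidth, not a figure from the source
    print(f"IQ1_M speed estimate: {tokens_per_second(15.6, 50.0):.1f} t/s")
```

Because the cache grows linearly with context length, doubling the context to 8,192 tokens doubles the KV figure under these assumptions.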
Frequently asked questions
How much VRAM do you need to run bartowski/Llama-3.3-70B-Instruct-GGUF?
You need about 22.76 GB of VRAM to run bartowski/Llama-3.3-70B-Instruct-GGUF entirely on the GPU using the IQ2_S quantization (at a 4,096-token context). Smaller quantizations lower the requirement, down to 17.65 GB with IQ1_M, at the cost of quality.
Can I run bartowski/Llama-3.3-70B-Instruct-GGUF on an 8 GB GPU?
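Partially. No quantization of bartowski/Llama-3.3-70B-Instruct-GGUF fits fully in 8 GB of VRAM. It runs only with part of the model offloaded to system RAM (with IQ2_S, about 22.76 GB across VRAM and system RAM combined), which is slower than running fully on the GPU.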
Can I run bartowski/Llama-3.3-70B-Instruct-GGUF on a 16 GB GPU?
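Partially. No quantization of bartowski/Llama-3.3-70B-Instruct-GGUF fits fully in 16 GB of VRAM. It runs only with part of the model offloaded to system RAM (with Q5_K_S, about 47.37 GB across VRAM and system RAM combined), which is slower than running fully on the GPU.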
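How much of the model stays on the GPU when offloading can be estimated per layer. This is a simplified sketch under stated assumptions: the model has 80 transformer layers (assumed for a Llama 3 70B-class model), the weights are spread evenly across them, and 2.05 GB of VRAM is reserved for the KV cache and system margin. The function name is hypothetical, and real runtimes will differ because embeddings, output layers and buffers are not evenly sized.

```python
import math

N_LAYERS = 80  # assumed layer count for a Llama 3 70B-class model


def gpu_layers(vram_gb: float, weights_gb: float,
               reserved_gb: float = 2.05, n_layers: int = N_LAYERS) -> int:
    """Estimate how many layers fit in VRAM; the rest stay in system RAM.

    reserved_gb defaults to the KV cache (1.25) plus the margin (0.80).
    """
    per_layer_gb = weights_gb / n_layers          # even split, an approximation
    usable_gb = max(vram_gb - reserved_gb, 0.0)   # VRAM left for weights
    return min(n_layers, math.floor(usable_gb / per_layer_gb))


if __name__ == "__main__":
    print("8 GB GPU, IQ2_S:", gpu_layers(8, 20.71), "of", N_LAYERS, "layers on GPU")
    print("16 GB GPU, Q5_K_S:", gpu_layers(16, 45.32), "of", N_LAYERS, "layers on GPU")
    print("24 GB GPU, IQ2_S:", gpu_layers(24, 20.71), "of", N_LAYERS, "layers on GPU")
```

A count like this is the kind of value that llama.cpp-based tools accept as their GPU layer setting; treat it as a starting point and adjust downward if the runtime reports out-of-memory errors.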
Can I run bartowski/Llama-3.3-70B-Instruct-GGUF on a 24 GB GPU?
Yes. With 24 GB of VRAM you can run bartowski/Llama-3.3-70B-Instruct-GGUF fully on the GPU using IQ2_S (about 22.76 GB).
What is the best quantization for bartowski/Llama-3.3-70B-Instruct-GGUF?
If memory allows, higher bits-per-weight means better quality. A common sweet spot is a Q4_K_M or Q5_K_M quantization, which keeps most of the quality while cutting memory by roughly 30 to 40 percent versus 8-bit (41.65 GB and 48.57 GB, against 71.88 GB for Q8_0). Pick the highest-bit quantization that still fits in your available memory.
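The selection rule can be written as a small lookup. The sketch below copies a subset of rows from the quantization table (name, bits per weight, total memory at a 4,096-token context) and returns the highest-bit entry that fits a given budget. The function name and the choice of rows are illustrative, and a fuller list would give finer-grained answers.

```python
# (name, bits per weight, total GB at 4,096 tokens): a subset of the table
QUANTS = [
    ("IQ1_M", 1.9, 17.65),
    ("IQ2_S", 2.52, 22.76),
    ("IQ2_M", 2.73, 24.51),
    ("Q3_K_M", 3.89, 33.96),
    ("Q4_K_M", 4.82, 41.65),
    ("Q5_K_M", 5.66, 48.57),
    ("Q6_K", 6.56, 55.96),
    ("Q8_0", 8.5, 71.88),
    ("F16", 16.0, 133.48),
]


def best_quant(memory_gb: float):
    """Return the highest-bit quantization whose total fits, or None."""
    fitting = [q for q in QUANTS if q[2] <= memory_gb]
    if not fitting:
        return None
    return max(fitting, key=lambda q: q[1])[0]


if __name__ == "__main__":
    for budget in (16, 24, 48, 80):
        print(f"{budget} GB budget -> {best_quant(budget)}")
```

With a 24 GB budget this returns IQ2_S, matching the answer given for a 24 GB GPU above; with 16 GB it returns None because even IQ1_M needs 17.65 GB.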