Summary of Quantization from the Ground Up by Sam Rose
By Avik Mukherjee | Apr 2, 2026 · 12 min read · Updated Apr 2, 2026
I had heard the usual line for months: "just quantize to 4-bit and you'll be fine." I repeated that advice too, but I did not really understand why it worked, where it failed, or what the quality cliff actually looked like.
Then I read Sam Rose's original article, which is long (about 6.6k words) and deeply interactive. It finally connected the math, systems behavior, and benchmark outcomes for me.
Sam Rose is a Senior Developer Educator at ngrok. This summary is my practical distillation of his post: the key ideas I want to remember when deploying local models.
What Makes LLMs So Large?
LLMs are mostly parameters (weights). Every edge between connected nodes has a learned value, and modern models contain billions to trillions of those values.
- Qwen-3-Coder-Next (80B params) is about 159.4 GB.
- Rumored frontier models with 1T+ params would need TB-scale memory.
The major point: model size is not just "disk size." It directly affects whether your hardware can load and serve the model at all.
How Numbers Are Stored (Why Precision Costs Memory)
Computers store numbers in a fixed number of bits. Integers are discrete and exact within their representable range; floats trade precision for a much wider range.
A float is split into three fields:
- sign bit
- exponent bits (range)
- significand bits (precision)
That split determines what numbers are representable and how much rounding error you accept.
Useful rule of thumb from the source article:
| Format | Bits | Typical Precision | Approx Range |
|---|---|---|---|
| float32 | 32 | ~7 significant figures | ±3.4e38 |
| float16 | 16 | ~3 significant figures | ±65504 |
| bfloat16 | 16 | ~2 significant figures | ±3.4e38 |
| float8 | 8 | ~1 significant figure | much smaller |
The important intuition: LLM weights cluster near zero, so we often do not need float32-level precision during inference.
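To make the precision column concrete, here is a small stdlib-only sketch (my own illustration, not from the source) that rounds a value through a 32-bit float and measures the error that introduces:

```python
import struct

def round_to_float32(x: float) -> float:
    # Pack a Python float (64-bit) into 32 bits and back, keeping the rounding.
    return struct.unpack("f", struct.pack("f", x))[0]

x = 0.1
err = abs(round_to_float32(x) - x)
print(err)  # ~1.5e-9: float32 keeps roughly 7 significant digits
```

The same round-trip through 16 or 8 bits would show proportionally larger error, which is exactly the precision budget quantization spends.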
What Quantization Actually Is
Quantization maps values from a larger range into a smaller representable set. It is lossy compression.
Naive round-to-nearest can fail badly at low bit-widths because the available buckets are too coarse or poorly aligned with actual weight distributions.
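A minimal sketch of that failure, assuming the simplest possible scheme (one global scale, round to nearest); this is my own toy code, not the source's implementation:

```python
def quantize_rtn(xs, bits=4):
    """Naive symmetric round-to-nearest with a single global scale."""
    qmax = 2 ** (bits - 1) - 1                  # 7 for signed 4-bit
    scale = max(abs(x) for x in xs) / qmax
    codes = [round(x / scale) for x in xs]      # integer bucket per value
    return [c * scale for c in codes]

# One large value stretches the scale; every small weight collapses to 0
weights = [0.01, 0.02, -0.015, 0.03, 5.0]
print(quantize_rtn(weights))  # small weights all map to 0.0
```

With only 15 signed buckets spread across [-5, 5], everything near zero lands in the same bucket, which is the "poorly aligned with actual weight distributions" problem in miniature.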
Symmetric vs Asymmetric Quantization
Symmetric quantization
Maps [-max|x|, +max|x|] onto the integer range with a single scale. Simple and stable, but it wastes representable range when the data is not centered on zero.
Asymmetric quantization
Adds a zero-point offset so the integer buckets cover the data's actual [min, max] range.
In the source examples, asymmetric 4-bit gave clearly lower average error than symmetric 4-bit on skewed toy values.
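The effect is easy to reproduce. A toy sketch of both schemes (my own code with made-up skewed values, not the source's exact example):

```python
def quant_symmetric(xs, bits=4):
    """Single scale; the integer range is centered on zero."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(x) for x in xs) / qmax
    return [round(x / scale) * scale for x in xs]

def quant_asymmetric(xs, bits=4):
    """Scale plus integer zero-point, fitted to [min, max] of the data."""
    levels = 2 ** bits - 1                      # 15 codes for 4-bit
    lo, hi = min(xs), max(xs)
    scale = (hi - lo) / levels
    zero_point = round(-lo / scale)             # integer offset so lo maps near code 0
    def roundtrip(x):
        code = max(0, min(levels, round(x / scale) + zero_point))
        return (code - zero_point) * scale
    return [roundtrip(x) for x in xs]

def mean_abs_err(xs, ys):
    return sum(abs(x - y) for x, y in zip(xs, ys)) / len(xs)

# Skewed toy data: everything sits in [0.51, 1.0], nowhere near -1
xs = [0.51, 0.63, 0.72, 0.84, 0.97, 1.0]
print(mean_abs_err(xs, quant_symmetric(xs)))   # half the codes are wasted on negatives
print(mean_abs_err(xs, quant_asymmetric(xs)))  # noticeably smaller error
```

The symmetric version spends half its 16 codes on negative values that never occur; the asymmetric version packs all of them into the occupied range.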
Why Outliers Matter (Block Quantization in Practice)
The article highlights a critical failure mode: a few outlier weights can stretch the global scale so much that most values collapse into a handful of poor buckets.
That is why practical pipelines quantize in blocks (commonly 32-256 parameters per block), each with local scale and often local zero-point.
Trade-off:
- smaller blocks: better local fit, higher metadata overhead
- larger blocks: lower overhead, higher quantization error risk
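A toy sketch of the blocked version (my own code; real pipelines use larger blocks and packed integer storage):

```python
def quantize_blocked(xs, bits=4, block_size=4):
    """Symmetric round-to-nearest, but with one scale per block."""
    qmax = 2 ** (bits - 1) - 1
    out = []
    for i in range(0, len(xs), block_size):
        chunk = xs[i:i + block_size]
        scale = max(abs(x) for x in chunk) / qmax
        if scale == 0:
            out.extend(chunk)                   # all-zero block: pass through
            continue
        out.extend(round(x / scale) * scale for x in chunk)
    return out

# The outlier lives in its own block, so the tiny weights keep a tiny scale
weights = [0.01, 0.02, -0.015, 0.03,  5.0, 4.0, -3.0, 2.0]
print(quantize_blocked(weights))
```

Each block now pays one extra float of metadata (the scale) to keep its local resolution, which is exactly the overhead-versus-error trade-off in the list above.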
How Quality Loss Is Measured
The source uses three complementary lenses:
- Perplexity: how confidently the model assigns probability to the correct next tokens (lower is better)
- KL divergence: how far the quantized model's full next-token distributions drift from the baseline's
- Task benchmarks: end-to-end behavior on real evaluations
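The first two lenses are just a few lines of math. A stdlib-only sketch with made-up toy numbers (my illustration of the definitions, not the source's evaluation code):

```python
import math

def perplexity(correct_token_logprobs):
    """exp of the mean negative log-probability of the correct tokens."""
    n = len(correct_token_logprobs)
    return math.exp(-sum(correct_token_logprobs) / n)

def kl_divergence(p, q):
    """KL(p || q) between two next-token probability distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Toy numbers: the quantized model is slightly less confident everywhere
baseline_lp  = [-0.5, -1.2, -0.3]
quantized_lp = [-0.6, -1.4, -0.4]
print(perplexity(baseline_lp))     # lower
print(perplexity(quantized_lp))    # higher: quality degraded

p = [0.7, 0.2, 0.1]                # baseline next-token distribution
q = [0.6, 0.25, 0.15]              # quantized model's distribution
print(kl_divergence(p, q))         # > 0: the distributions have shifted
```

Perplexity only looks at the probability of the one correct token; KL compares the entire distribution, which is why the source treats them as complementary.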
Perplexity from source results (Qwen3.5 9B, wikitext-2):
| Format | Perplexity |
|---|---|
| bfloat16 | 8.186 |
| 8-bit symmetric | 8.193 |
| 4-bit asymmetric | 8.563 |
| 4-bit symmetric | 8.71 |
| 2-bit asymmetric | 66.1 |
Mean KL divergence from source results:
| Format | Mean KL |
|---|---|
| 8-bit symmetric | 0.0008 |
| 4-bit asymmetric | 0.0593 |
| 4-bit symmetric | 0.0675 |
| 2-bit asymmetric | 2.1447 |
GPQA Diamond outcomes in the source also show the same broad shape: 8-bit and 4-bit remain usable, 2-bit falls off hard.
Why Quantized Models Are Often Faster Too
I expected dequantization overhead to cancel most gains, but the article's measurements align with what people report in practice: memory bandwidth is often the bottleneck.
Smaller weights mean fewer bytes moved through memory for every generated token, and since that data movement often dominates, lower precision directly improves throughput.
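A rough back-of-the-envelope check (my own arithmetic, not from the source): for a dense model, each decoded token reads every weight once, so bandwidth alone puts a ceiling on tokens/s. Assuming a dense 9B-parameter model and a ballpark 400 GB/s of bandwidth for an M1 Max, and ignoring compute and activation traffic:

```python
def bandwidth_bound_tokens_per_sec(n_params, bytes_per_weight, gb_per_sec):
    """Upper bound on decode speed if every weight is read once per token."""
    bytes_per_token = n_params * bytes_per_weight
    return gb_per_sec * 1e9 / bytes_per_token

N = 9e9       # assumed dense 9B-parameter model
BW = 400      # GB/s, rough ballpark I'm assuming for an M1 Max
print(bandwidth_bound_tokens_per_sec(N, 2.0, BW))   # bfloat16: ~22 tok/s
print(bandwidth_bound_tokens_per_sec(N, 0.5, BW))   # 4-bit:    ~89 tok/s
```

Those ceilings sit close to the measured bfloat16 and 4-bit numbers below, which is consistent with the memory-bandwidth story.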
Source benchmark snapshot (tokens/s):
| Format | M1 Max | H100 |
|---|---|---|
| bfloat16 | 19.45 | 106.85 |
| 8-bit symmetric | 32.36 | 141.61 |
| 4-bit asymmetric | 43.32 | 175.70 |
| 4-bit symmetric | 46.05 | 177.06 |
| 2-bit asymmetric | 40.25 | 166.90 |
Interesting caveat from the source: 2-bit was not always faster than 4-bit in these runs, showing that implementation details and quantization schemes can dominate expectations.
Practical Deployment Takeaways
What changed in my workflow after reading and testing this:
- Start from 8-bit as the safest compression baseline.
- Evaluate 4-bit with workload-specific prompts and acceptance tests.
- Check perplexity/KL plus task behavior, not one metric only.
- Treat 2-bit as experimental unless proven stable for your use-case.
If you have not benchmarked on your own tasks, you do not know your actual quantization budget.
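One way I think about the acceptance-test bullet, as a hypothetical sketch (exact-match comparison is simplistic; in practice you would swap in a task-specific check):

```python
def agreement_rate(baseline_outputs, quantized_outputs):
    """Fraction of workload prompts where the quantized model still matches
    the full-precision baseline (exact match; swap in your own comparison)."""
    pairs = list(zip(baseline_outputs, quantized_outputs))
    return sum(a == b for a, b in pairs) / len(pairs)

# Hypothetical gate over your own prompt set
baseline  = ["42", "def add(a, b):", "Paris", "True"]
quantized = ["42", "def add(a, b):", "Paris", "False"]
rate = agreement_rate(baseline, quantized)
print(rate)  # 0.75: below a 0.95 gate, so this quant level fails the check
```

The point is not the metric itself but that the gate runs on your prompts, against your baseline, before a quantized model ships.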
Let's Keep Talking
This write-up helped me finally connect the dots between model quality, memory bandwidth, and why 8-bit/4-bit choices feel so different in practice.
If you're tuning quantized models and want to compare notes on what actually holds up in production, I'd love to hear from you.
Share this, or find me on GitHub.
Feedback welcome. Call out mistakes. I'd rather be corrected than stay wrong.
Reach me on GitHub, X, Peerlist, or LinkedIn.