Summary of Quantization from the Ground Up by Sam Rose

By Avik Mukherjee  |  Apr 2, 2026 · 12 min read · Updated Apr 2, 2026

I had heard the usual line for months: "just quantize to 4-bit and you'll be fine." I repeated that advice too, but I did not really understand why it worked, where it failed, or what the quality cliff actually looked like.

Then I read Sam Rose's original article, which is long (about 6.6k words) and deeply interactive. It finally connected the math, systems behavior, and benchmark outcomes for me.

Sam Rose is a Senior Developer Educator at ngrok, and this summary is my practical version of his post with the key ideas I want to remember when deploying local models.

What Makes LLMs So Large?

LLMs are mostly parameters (weights). Every edge between connected nodes has a learned value, and modern models contain billions to trillions of those values.

  • Qwen-3-Coder-Next (80B params) is about 159.4 GB.
  • Rumored frontier models with 1T+ params would need TB-scale memory.

The key point: model size is not just disk footprint. It determines whether your hardware can load and serve the model at all.
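The arithmetic behind those numbers is easy to sanity-check. A minimal sketch, assuming 2 bytes per bfloat16 parameter and ignoring the small overhead real checkpoints add for metadata:

```python
# Back-of-envelope: parameter count x bytes per parameter = weight memory.
# Assumes a uniform precision for every parameter; real checkpoints add
# some metadata overhead on top.

def weight_memory_gb(num_params: float, bytes_per_param: float) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9

# An 80B-parameter model in bfloat16 (2 bytes per parameter):
print(weight_memory_gb(80e9, 2))    # 160.0 -- same ballpark as the 159.4 GB above
# The same model at 4 bits per parameter (0.5 bytes):
print(weight_memory_gb(80e9, 0.5))  # 40.0
```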

How Numbers Are Stored (Why Precision Costs Memory)

Computers store numbers as bits. Integers are exact within their representable range; floats trade off range against precision.

A float is split into three fields:

  • sign bit
  • exponent bits (range)
  • significand bits (precision)

That split determines what numbers are representable and how much rounding error you accept.
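You can inspect the split directly with Python's standard `struct` module, whose `'e'` format packs a value as IEEE 754 half precision (float16's 1/5/10 field layout):

```python
import struct

# Split a float16 into its three fields: 1 sign bit, 5 exponent bits,
# 10 significand bits. struct's 'e' format is IEEE 754 half precision.

def float16_fields(x: float) -> tuple[int, int, int]:
    (bits,) = struct.unpack('<H', struct.pack('<e', x))
    sign = bits >> 15                 # 1 bit: positive or negative
    exponent = (bits >> 10) & 0x1F    # 5 bits: range (bias 15)
    significand = bits & 0x3FF        # 10 bits: precision
    return sign, exponent, significand

print(float16_fields(1.0))   # (0, 15, 0): 1.0 = +1.0 x 2^(15-15)
print(float16_fields(-2.5))  # (1, 16, 256): -2.5 = -1.25 x 2^(16-15)
```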

*Float formats (interactive widget in the source): a float16 bit layout, 16 bits total, 2.0x smaller than float32.*

Useful rule of thumb from the source article:

| Format | Bits | Typical precision | Approx range |
| --- | --- | --- | --- |
| float32 | 32 | ~7 significant figures | ±3.4e38 |
| float16 | 16 | ~3 significant figures | ±65504 |
| bfloat16 | 16 | ~2 significant figures | ±3.4e38 |
| float8 | 8 | ~1 significant figure | much smaller |

The important intuition: LLM weights cluster near zero, so we often do not need float32-level precision during inference.

What Quantization Actually Is

Quantization maps values from a larger range into a smaller representable set. It is lossy compression.

Naive round-to-nearest can fail badly at low bit-widths because the available buckets are too coarse or poorly aligned with actual weight distributions.
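A toy illustration of that failure (the 2-bit grid of four evenly spaced levels on [-1, 1] is my own invented example): distinct small weights all snap to the same coarse bucket.

```python
# A hypothetical 2-bit grid: four evenly spaced levels on [-1, 1].
# Round-to-nearest maps each weight to its closest level, and distinct
# small weights collapse into the same bucket.

def round_to_grid(w: float, levels: list[float]) -> float:
    return min(levels, key=lambda v: abs(v - w))

levels_2bit = [-1.0, -1/3, 1/3, 1.0]
for w in [0.05, 0.12, 0.20, 0.95]:
    print(w, '->', round(round_to_grid(w, levels_2bit), 3))
# 0.05, 0.12, and 0.20 all become 0.333: the grid cannot tell them apart.
```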

Symmetric vs Asymmetric Quantization

Symmetric quantization

$$\operatorname{scale} = \frac{\max(\lvert w \rvert)}{2^{b-1} - 1}, \quad b = \text{bits}$$
$$q = \operatorname{round}\left(\frac{w}{\operatorname{scale}}\right)$$
$$dq = q \cdot \operatorname{scale}$$

Simple and stable, but can waste representable range when data is skewed around zero.
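A direct transcription of those formulas, run on a handful of toy values:

```python
# Symmetric quantization, exactly as in the formulas above:
#   scale = max(|w|) / (2^(b-1) - 1);  q = round(w/scale);  dq = q * scale

def quantize_symmetric(weights, bits):
    qmax = 2 ** (bits - 1) - 1                    # e.g. 7 for 4-bit
    scale = max(abs(w) for w in weights) / qmax
    q = [round(w / scale) for w in weights]       # integer codes in [-qmax, qmax]
    dq = [qi * scale for qi in q]                 # dequantized values
    return q, dq, scale

weights = [-0.89, 0.16, 0.08, -0.13, 0.16, -0.54]
q, dq, scale = quantize_symmetric(weights, bits=4)
print(q)                          # [-7, 1, 1, -1, 1, -4]
print([round(v, 3) for v in dq])
```

Note how 0.16 and 0.08 land in the same bucket: the large -0.89 sets the scale for everything.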

Asymmetric quantization

$$\operatorname{scale} = \frac{v_{\max} - v_{\min}}{q_{\max} - q_{\min}}$$
$$\operatorname{zero} = q_{\min} - \operatorname{round}\left(\frac{v_{\min}}{\operatorname{scale}}\right)$$
$$q = \operatorname{round}\left(\frac{w}{\operatorname{scale}} + \operatorname{zero}\right)$$
$$dq = \operatorname{scale} \cdot (q - \operatorname{zero})$$

The zero-point offset shifts the integer buckets so they fit the real data range more tightly.
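The same transcription for the asymmetric scheme (the clamp to the unsigned code range is my addition for safety at the edges; the formulas otherwise follow the ones above). Run on the same toy values, it reproduces the per-parameter results shown in the explorer snapshot below:

```python
# Asymmetric quantization following the formulas above, with integer codes
# clamped to the unsigned range (the clamp is an added safety measure).

def quantize_asymmetric(weights, bits):
    qmin, qmax = 0, 2 ** bits - 1                  # e.g. 0..15 for 4-bit
    vmin, vmax = min(weights), max(weights)
    scale = (vmax - vmin) / (qmax - qmin)
    zero = qmin - round(vmin / scale)              # zero-point offset
    q = [min(qmax, max(qmin, round(w / scale + zero))) for w in weights]
    dq = [scale * (qi - zero) for qi in q]
    return q, dq

weights = [-0.89, 0.16, 0.08, -0.13, 0.16, -0.54]
q, dq = quantize_asymmetric(weights, bits=4)
print(q)                          # [0, 15, 14, 11, 15, 5]
print([round(v, 2) for v in dq])  # [-0.91, 0.14, 0.07, -0.14, 0.14, -0.56]
```

Unlike the symmetric version, 0.16 and 0.08 now get distinct codes, because the grid covers only [-0.89, 0.16] instead of [-0.89, 0.89].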

Quantization explorer snapshot from the source's interactive widget (4-bit, avg error 8.5%, 8.0x smaller than float32):

| Parameter | Original | Quantized | Error |
| --- | --- | --- | --- |
| param[0] | -0.89 | -0.91 | -2.2% |
| param[1] | 0.16 | 0.14 | -12.5% |
| param[2] | 0.08 | 0.07 | -12.5% |
| param[3] | -0.13 | -0.14 | -7.7% |
| param[4] | 0.16 | 0.14 | -12.5% |
| param[5] | -0.54 | -0.56 | -3.7% |

In the source examples, asymmetric 4-bit gave clearly lower average error than symmetric 4-bit on skewed toy values.

Why Outliers Matter (Block Quantization in Practice)

The article highlights a critical failure mode: a few outlier weights can stretch the global scale so far that most values collapse into a handful of coarse buckets.

That is why practical pipelines quantize in blocks (commonly 32-256 parameters per block), each with local scale and often local zero-point.

Trade-off:

  • smaller blocks: better local fit, higher metadata overhead
  • larger blocks: lower overhead, higher quantization error risk
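A minimal sketch of block-wise symmetric quantization, with a made-up weight list containing one outlier to show why local scales help. The per-block scales are exactly the metadata overhead in the trade-off above:

```python
# Block-wise symmetric quantization: each block gets its own scale, so a
# single outlier only coarsens its own block.

def quantize_blocks(weights, bits=4, block_size=4):
    qmax = 2 ** (bits - 1) - 1
    blocks = []
    for i in range(0, len(weights), block_size):
        block = weights[i:i + block_size]
        scale = max(abs(w) for w in block) / qmax
        if scale == 0:                 # all-zero block
            scale = 1.0
        blocks.append((scale, [round(w / scale) for w in block]))
    return blocks                      # list of (scale, codes) pairs

# One outlier (8.0) among small weights: only its block gets a coarse
# scale; the other block keeps enough resolution to tell values apart.
weights = [0.02, -0.05, 0.04, 8.0, 0.03, -0.01, 0.02, 0.05]
for scale, codes in quantize_blocks(weights):
    print(round(scale, 4), codes)
```

With a single global scale, all seven small weights would round to zero; here only the outlier's block loses them.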

How Quality Loss Is Measured

The source uses three complementary lenses:

  1. Perplexity: how confidently the model predicts the correct next token (lower is better)
  2. KL divergence: how far the quantized model's full next-token distributions drift from the baseline's
  3. Task benchmarks: end-to-end behavior on real evaluations
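The first two lenses are easy to sketch on toy distributions (the probability values below are invented for illustration, not from the source):

```python
import math

# Toy versions of the first two quality lenses.

def perplexity(correct_token_probs):
    """exp of the average negative log-likelihood of the correct tokens."""
    nll = -sum(math.log(p) for p in correct_token_probs) / len(correct_token_probs)
    return math.exp(nll)

def kl_divergence(p, q):
    """KL(p || q) between two next-token probability distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

print(round(perplexity([0.5, 0.5]), 3))   # 2.0: as uncertain as a coin flip

p      = [0.70, 0.20, 0.10]   # full-precision next-token distribution
q_mild = [0.68, 0.21, 0.11]   # small quantization-induced shift
q_bad  = [0.30, 0.40, 0.30]   # large shift
print(round(kl_divergence(p, q_mild), 4))
print(round(kl_divergence(p, q_bad), 4))
```

Perplexity only looks at the probability assigned to the correct token; KL compares the whole distribution, which is why the two metrics can disagree.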

Perplexity from source results (Qwen3.5 9B, wikitext-2):

| Format | Perplexity |
| --- | --- |
| bfloat16 | 8.186 |
| 8-bit symmetric | 8.193 |
| 4-bit asymmetric | 8.563 |
| 4-bit symmetric | 8.71 |
| 2-bit asymmetric | 66.1 |

Mean KL divergence from source results:

| Format | Mean KL |
| --- | --- |
| 8-bit symmetric | 0.0008 |
| 4-bit asymmetric | 0.0593 |
| 4-bit symmetric | 0.0675 |
| 2-bit asymmetric | 2.1447 |

GPQA Diamond outcomes in the source also show the same broad shape: 8-bit and 4-bit remain usable, 2-bit falls off hard.


Why Quantized Models Are Often Faster Too

I expected dequantization overhead to cancel most gains, but the article's measurements align with what people report in practice: memory bandwidth is often the bottleneck.

Smaller weights mean less data movement, which can dominate and improve throughput.
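A rough ceiling under the bandwidth-bound assumption: each generated token streams roughly all the weight bytes through memory once. The ~400 GB/s figure is the commonly cited M1 Max memory bandwidth; this is a back-of-envelope model, not a benchmark.

```python
# If decoding is memory-bandwidth bound, then roughly:
#   tokens/s ceiling ~= bandwidth / total weight bytes
# The 400 GB/s bandwidth and 9B parameter count are assumptions
# for illustration, not measurements.

def bandwidth_bound_tps(bandwidth_gb_s: float, params_billions: float,
                        bytes_per_param: float) -> float:
    return bandwidth_gb_s / (params_billions * bytes_per_param)

for name, bpp in [("bfloat16", 2.0), ("8-bit", 1.0), ("4-bit", 0.5)]:
    tps = bandwidth_bound_tps(400.0, 9, bpp)   # a 9B-parameter model
    print(f"{name}: ~{tps:.1f} tok/s ceiling")
```

If the assumptions hold, the measured bfloat16 rate below (19.45 tok/s on the M1 Max) sits just under the ~22 tok/s ceiling this predicts; the quantized formats fall further below theirs because dequantization adds compute.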

Lower precision speeds up inference because the model moves fewer bytes through memory.

Source benchmark snapshot (tokens/s):

| Format | M1 Max (tok/s) | H100 (tok/s) |
| --- | --- | --- |
| bfloat16 | 19.45 | 106.85 |
| 8-bit symmetric | 32.36 | 141.61 |
| 4-bit asymmetric | 43.32 | 175.70 |
| 4-bit symmetric | 46.05 | 177.06 |
| 2-bit asymmetric | 40.25 | 166.90 |

Interesting caveat from the source: 2-bit was not always faster than 4-bit in these runs, showing that implementation details and quantization schemes can dominate expectations.

Practical Deployment Takeaways

What changed in my workflow after reading and testing this:

  1. Start from 8-bit as the safest compression baseline.
  2. Evaluate 4-bit with workload-specific prompts and acceptance tests.
  3. Check perplexity/KL plus task behavior, not one metric only.
  4. Treat 2-bit as experimental unless it has proven stable for your use case.

If you have not benchmarked on your own tasks, you do not know your actual quantization budget.

Let's Keep Talking

This write-up helped me finally connect the dots between model quality, memory bandwidth, and why 8-bit/4-bit choices feel so different in practice.

If you're tuning quantized models and want to compare notes on what actually holds up in production, I'd love to hear from you.

Feedback welcome. Call out mistakes. I'd rather be corrected than stay wrong.

Reach me on GitHub, X, Peerlist, or LinkedIn.