Summary of Quantization from the Ground Up by Sam Rose

By Avik Mukherjee  |  Apr 2, 2026 · 12 min read · Updated Apr 2, 2026

I had heard the usual line for months: "just quantize to 4-bit and you'll be fine." I repeated that advice too, but I did not really understand why it worked, where it failed, or what the quality cliff actually looked like.

Then I read Sam Rose's original article, which is long (about 6.6k words) and deeply interactive. It finally connected the math, systems behavior, and benchmark outcomes for me.

Sam Rose is a Senior Developer Educator at ngrok, and this summary is my practical version of his post with the key ideas I want to remember when deploying local models.

What Makes LLMs So Large?

LLMs are mostly parameters (weights). Every edge between connected nodes has a learned value, and modern models contain billions to trillions of those values.

  • Qwen-3-Coder-Next (80B params) is about 159.4 GB.
  • Rumored frontier models with 1T+ params would need TB-scale memory.

The key point: model size is not just disk footprint. It determines whether your hardware can load and serve the model at all.
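The arithmetic behind those numbers is easy to sanity-check. A minimal sketch, assuming 2 bytes per bfloat16 parameter and ignoring the small overhead real checkpoints add for metadata:

```python
# Back-of-envelope: parameter count x bytes per parameter = weight memory.
# Assumes a uniform precision for every parameter; real checkpoints add
# some metadata overhead on top.

def weight_memory_gb(num_params: float, bytes_per_param: float) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9

# An 80B-parameter model in bfloat16 (2 bytes per parameter):
print(weight_memory_gb(80e9, 2))    # 160.0 -- same ballpark as the 159.4 GB above
# The same model at 4 bits per parameter (0.5 bytes):
print(weight_memory_gb(80e9, 0.5))  # 40.0
```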

How Numbers Are Stored (Why Precision Costs Memory)

Computers store numbers as bits. Integers are exact within their representable range; floats trade off range against precision.

A float is split into three fields:

  • sign bit
  • exponent bits (range)
  • significand bits (precision)

That split determines what numbers are representable and how much rounding error you accept.
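You can inspect the split directly with Python's standard `struct` module, whose `'e'` format packs a value as IEEE 754 half precision (float16's 1/5/10 field layout):

```python
import struct

# Split a float16 into its three fields: 1 sign bit, 5 exponent bits,
# 10 significand bits. struct's 'e' format is IEEE 754 half precision.

def float16_fields(x: float) -> tuple[int, int, int]:
    (bits,) = struct.unpack('<H', struct.pack('<e', x))
    sign = bits >> 15                 # 1 bit: positive or negative
    exponent = (bits >> 10) & 0x1F    # 5 bits: range (bias 15)
    significand = bits & 0x3FF        # 10 bits: precision
    return sign, exponent, significand

print(float16_fields(1.0))   # (0, 15, 0): 1.0 = +1.0 x 2^(15-15)
print(float16_fields(-2.5))  # (1, 16, 256): -2.5 = -1.25 x 2^(16-15)
```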

*Float formats (interactive widget in the source): a float16 bit layout, 16 bits total, 2.0x smaller than float32.*

Useful rule of thumb from the source article:

| Format | Bits | Typical precision | Approx range |
| --- | --- | --- | --- |
| float32 | 32 | ~7 significant figures | ±3.4e38 |
| float16 | 16 | ~3 significant figures | ±65504 |
| bfloat16 | 16 | ~2 significant figures | ±3.4e38 |
| float8 | 8 | ~1 significant figure | much smaller |

The important intuition: LLM weights cluster near zero, so we often do not need float32-level precision during inference.

What Quantization Actually Is

Quantization maps values from a larger range into a smaller representable set. It is lossy compression.

Naive round-to-nearest can fail badly at low bit-widths because the available buckets are too coarse or poorly aligned with actual weight distributions.
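A toy illustration of that failure (the 2-bit grid of four evenly spaced levels on [-1, 1] is my own invented example): distinct small weights all snap to the same coarse bucket.

```python
# A hypothetical 2-bit grid: four evenly spaced levels on [-1, 1].
# Round-to-nearest maps each weight to its closest level, and distinct
# small weights collapse into the same bucket.

def round_to_grid(w: float, levels: list[float]) -> float:
    return min(levels, key=lambda v: abs(v - w))

levels_2bit = [-1.0, -1/3, 1/3, 1.0]
for w in [0.05, 0.12, 0.20, 0.95]:
    print(w, '->', round(round_to_grid(w, levels_2bit), 3))
# 0.05, 0.12, and 0.20 all become 0.333: the grid cannot tell them apart.
```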

Symmetric vs Asymmetric Quantization

Symmetric quantization

$$\operatorname{scale} = \frac{\max(\lvert w \rvert)}{2^{b-1} - 1}, \quad b = \text{bits}$$
$$q = \operatorname{round}\left(\frac{w}{\operatorname{scale}}\right)$$
$$dq = q \cdot \operatorname{scale}$$

Simple and stable, but can waste representable range when data is skewed around zero.
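A direct transcription of those formulas, run on a handful of toy values:

```python
# Symmetric quantization, exactly as in the formulas above:
#   scale = max(|w|) / (2^(b-1) - 1);  q = round(w/scale);  dq = q * scale

def quantize_symmetric(weights, bits):
    qmax = 2 ** (bits - 1) - 1                    # e.g. 7 for 4-bit
    scale = max(abs(w) for w in weights) / qmax
    q = [round(w / scale) for w in weights]       # integer codes in [-qmax, qmax]
    dq = [qi * scale for qi in q]                 # dequantized values
    return q, dq, scale

weights = [-0.89, 0.16, 0.08, -0.13, 0.16, -0.54]
q, dq, scale = quantize_symmetric(weights, bits=4)
print(q)                          # [-7, 1, 1, -1, 1, -4]
print([round(v, 3) for v in dq])
```

Note how 0.16 and 0.08 land in the same bucket: the large -0.89 sets the scale for everything.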

Asymmetric quantization

$$\operatorname{scale} = \frac{v_{\max} - v_{\min}}{q_{\max} - q_{\min}}$$
$$\operatorname{zero} = q_{\min} - \operatorname{round}\left(\frac{v_{\min}}{\operatorname{scale}}\right)$$
$$q = \operatorname{round}\left(\frac{w}{\operatorname{scale}} + \operatorname{zero}\right)$$
$$dq = \operatorname{scale} \cdot (q - \operatorname{zero})$$

The zero-point offset shifts the integer buckets so they fit the real data range more tightly.
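The same transcription for the asymmetric scheme (the clamp to the unsigned code range is my addition for safety at the edges; the formulas otherwise follow the ones above). Run on the same toy values, it reproduces the per-parameter results shown in the explorer snapshot below:

```python
# Asymmetric quantization following the formulas above, with integer codes
# clamped to the unsigned range (the clamp is an added safety measure).

def quantize_asymmetric(weights, bits):
    qmin, qmax = 0, 2 ** bits - 1                  # e.g. 0..15 for 4-bit
    vmin, vmax = min(weights), max(weights)
    scale = (vmax - vmin) / (qmax - qmin)
    zero = qmin - round(vmin / scale)              # zero-point offset
    q = [min(qmax, max(qmin, round(w / scale + zero))) for w in weights]
    dq = [scale * (qi - zero) for qi in q]
    return q, dq

weights = [-0.89, 0.16, 0.08, -0.13, 0.16, -0.54]
q, dq = quantize_asymmetric(weights, bits=4)
print(q)                          # [0, 15, 14, 11, 15, 5]
print([round(v, 2) for v in dq])  # [-0.91, 0.14, 0.07, -0.14, 0.14, -0.56]
```

Unlike the symmetric version, 0.16 and 0.08 now get distinct codes, because the grid covers only [-0.89, 0.16] instead of [-0.89, 0.89].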

Quantization explorer snapshot from the source's interactive widget (4-bit, avg error 8.5%, 8.0x smaller than float32):

| Parameter | Original | Quantized | Error |
| --- | --- | --- | --- |
| param[0] | -0.89 | -0.91 | -2.2% |
| param[1] | 0.16 | 0.14 | -12.5% |
| param[2] | 0.08 | 0.07 | -12.5% |
| param[3] | -0.13 | -0.14 | -7.7% |
| param[4] | 0.16 | 0.14 | -12.5% |
| param[5] | -0.54 | -0.56 | -3.7% |

In the source examples, asymmetric 4-bit gave clearly lower average error than symmetric 4-bit on skewed toy values.

Why Outliers Matter (Block Quantization in Practice)

The article highlights a critical failure mode: a few outlier weights can stretch the global scale so far that most values collapse into a handful of coarse buckets.

That is why practical pipelines quantize in blocks (commonly 32-256 parameters per block), each with local scale and often local zero-point.

Trade-off:

  • smaller blocks: better local fit, higher metadata overhead
  • larger blocks: lower overhead, higher quantization error risk
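A minimal sketch of block-wise symmetric quantization, with a made-up weight list containing one outlier to show why local scales help. The per-block scales are exactly the metadata overhead in the trade-off above:

```python
# Block-wise symmetric quantization: each block gets its own scale, so a
# single outlier only coarsens its own block.

def quantize_blocks(weights, bits=4, block_size=4):
    qmax = 2 ** (bits - 1) - 1
    blocks = []
    for i in range(0, len(weights), block_size):
        block = weights[i:i + block_size]
        scale = max(abs(w) for w in block) / qmax
        if scale == 0:                 # all-zero block
            scale = 1.0
        blocks.append((scale, [round(w / scale) for w in block]))
    return blocks                      # list of (scale, codes) pairs

# One outlier (8.0) among small weights: only its block gets a coarse
# scale; the other block keeps enough resolution to tell values apart.
weights = [0.02, -0.05, 0.04, 8.0, 0.03, -0.01, 0.02, 0.05]
for scale, codes in quantize_blocks(weights):
    print(round(scale, 4), codes)
```

With a single global scale, all seven small weights would round to zero; here only the outlier's block loses them.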

How Quality Loss Is Measured

The source uses three complementary lenses:

  1. Perplexity: how confidently the model predicts the correct next token (lower is better)
  2. KL divergence: how far the quantized model's full next-token distributions drift from the baseline's
  3. Task benchmarks: end-to-end behavior on real evaluations
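The first two lenses are easy to sketch on toy distributions (the probability values below are invented for illustration, not from the source):

```python
import math

# Toy versions of the first two quality lenses.

def perplexity(correct_token_probs):
    """exp of the average negative log-likelihood of the correct tokens."""
    nll = -sum(math.log(p) for p in correct_token_probs) / len(correct_token_probs)
    return math.exp(nll)

def kl_divergence(p, q):
    """KL(p || q) between two next-token probability distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

print(round(perplexity([0.5, 0.5]), 3))   # 2.0: as uncertain as a coin flip

p      = [0.70, 0.20, 0.10]   # full-precision next-token distribution
q_mild = [0.68, 0.21, 0.11]   # small quantization-induced shift
q_bad  = [0.30, 0.40, 0.30]   # large shift
print(round(kl_divergence(p, q_mild), 4))
print(round(kl_divergence(p, q_bad), 4))
```

Perplexity only looks at the probability assigned to the correct token; KL compares the whole distribution, which is why the two metrics can disagree.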

Perplexity from source results (Qwen3.5 9B, wikitext-2):

| Format | Perplexity |
| --- | --- |
| bfloat16 | 8.186 |
| 8-bit symmetric | 8.193 |
| 4-bit asymmetric | 8.563 |
| 4-bit symmetric | 8.71 |
| 2-bit asymmetric | 66.1 |

Mean KL divergence from source results:

| Format | Mean KL |
| --- | --- |
| 8-bit symmetric | 0.0008 |
| 4-bit asymmetric | 0.0593 |
| 4-bit symmetric | 0.0675 |
| 2-bit asymmetric | 2.1447 |

GPQA Diamond outcomes in the source also show the same broad shape: 8-bit and 4-bit remain usable, 2-bit falls off hard.


Why Quantized Models Are Often Faster Too

I expected dequantization overhead to cancel most gains, but the article's measurements align with what people report in practice: memory bandwidth is often the bottleneck.

Smaller weights mean less data movement, which can dominate and improve throughput.
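A rough ceiling under the bandwidth-bound assumption: each generated token streams roughly all the weight bytes through memory once. The ~400 GB/s figure is the commonly cited M1 Max memory bandwidth; this is a back-of-envelope model, not a benchmark.

```python
# If decoding is memory-bandwidth bound, then roughly:
#   tokens/s ceiling ~= bandwidth / total weight bytes
# The 400 GB/s bandwidth and 9B parameter count are assumptions
# for illustration, not measurements.

def bandwidth_bound_tps(bandwidth_gb_s: float, params_billions: float,
                        bytes_per_param: float) -> float:
    return bandwidth_gb_s / (params_billions * bytes_per_param)

for name, bpp in [("bfloat16", 2.0), ("8-bit", 1.0), ("4-bit", 0.5)]:
    tps = bandwidth_bound_tps(400.0, 9, bpp)   # a 9B-parameter model
    print(f"{name}: ~{tps:.1f} tok/s ceiling")
```

If the assumptions hold, the measured bfloat16 rate below (19.45 tok/s on the M1 Max) sits just under the ~22 tok/s ceiling this predicts; the quantized formats fall further below theirs because dequantization adds compute.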

Lower precision speeds up inference because the model moves fewer bytes through memory.

Source benchmark snapshot (tokens/s):

| Format | M1 Max (tok/s) | H100 (tok/s) |
| --- | --- | --- |
| bfloat16 | 19.45 | 106.85 |
| 8-bit symmetric | 32.36 | 141.61 |
| 4-bit asymmetric | 43.32 | 175.70 |
| 4-bit symmetric | 46.05 | 177.06 |
| 2-bit asymmetric | 40.25 | 166.90 |

Interesting caveat from the source: 2-bit was not always faster than 4-bit in these runs, showing that implementation details and quantization schemes can dominate expectations.

Practical Deployment Takeaways

What changed in my workflow after reading and testing this:

  1. Start from 8-bit as the safest compression baseline.
  2. Evaluate 4-bit with workload-specific prompts and acceptance tests.
  3. Check perplexity/KL plus task behavior, not one metric only.
  4. Treat 2-bit as experimental unless it has proven stable for your use case.

If you have not benchmarked on your own tasks, you do not know your actual quantization budget.

Let's Keep Talking

This write-up helped me finally connect the dots between model quality, memory bandwidth, and why 8-bit/4-bit choices feel so different in practice.

If you're tuning quantized models and want to compare notes on what actually holds up in production, I'd love to hear from you.

Feedback welcome. Call out mistakes. I'd rather be corrected than stay wrong.

Reach me on GitHub, X, Peerlist, or LinkedIn.