Hands-On LLM Quantization: transformers & bitsandbytes
Part 2: Libraries, Coding, and Architecture
In the first part of this series, we explored the foundational concepts of LLM quantization, covering the what, why, and how behind this powerful technique.
If you’re landing directly on this article or would like a quick refresher on the theory, I highly recommend starting with Part 1: A Beginner’s Guide to LLM Quantization.
Now, let’s get hands-on. In this article, we’ll introduce the essential libraries you need for quantization and walk through a complete code example that shows how to implement quantization in practice and how it transforms the model’s architecture details.
Tools/Libraries for LLM Quantization
PyTorch & TensorFlow: Both major deep learning frameworks have built-in quantization modules (torch.quantization, TensorFlow Lite). These are great if you are building a model from scratch and need more granular control.
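If you want a quick taste of that framework-level route, here is a minimal PyTorch sketch using dynamic quantization on a toy model (not an LLM); the layer sizes are arbitrary and only for illustration.

```python
import torch
import torch.nn as nn

# A toy model with Linear layers, the main target of dynamic quantization
model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))

# Convert the Linear layers to int8 on the fly at inference time
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)
print(quantized)
```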
Apart from this, here are some of the top open-source tools and libraries for LLM quantization:
For this code walkthrough, I will focus on two essential libraries: Hugging Face transformers and bitsandbytes. Together, they allow us to load and quantize a model with just a few simple lines of code. Both transformers and bitsandbytes are built on top of torch, which is the foundation for everything.
transformers: This is the main, high-level library from Hugging Face. It provides an easy interface for almost everything we want to do with a language model.
bitsandbytes: This is a lower-level, specialized library that performs efficient, low-precision math on GPUs. It contains the actual code that knows how to convert numbers to 4-bit and 8-bit formats and how to use them in calculations.
accelerate: This is another Hugging Face library that makes running models on different hardware (CPU, single GPU, multiple GPUs, Apple Silicon) seamless. Just passing the argument device_map="auto" lets it figure out the best way to load the model onto the available hardware without you having to write complicated code.
A Code Walkthrough
For the purpose of learning, let’s use a small, friendly model called distilgpt2.
It's small enough to run on almost any recent home machine or laptop. I ran this code example directly in a Google Colab notebook.
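If you are following along in Colab, you can install everything at the top of the notebook; this is just a minimal setup, and exact versions are up to you.

```python
# Notebook cell: install the libraries (the '!' prefix is Colab/Jupyter syntax)
!pip install -q transformers accelerate bitsandbytes
```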
Load the Tokenizer
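Here is a minimal sketch of loading the tokenizer with AutoTokenizer (the variable names throughout this walkthrough are my own choices).

```python
from transformers import AutoTokenizer

# Download and load the tokenizer that matches distilgpt2
tokenizer = AutoTokenizer.from_pretrained("distilgpt2")

# GPT-2 tokenizers ship without a pad token; reuse the end-of-sequence token
tokenizer.pad_token = tokenizer.eos_token
```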
Load the Model (Model Architecture)
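One way to load the full-precision (FP32) model and print its module tree; the printout is exactly what the breakdown below walks through.

```python
from transformers import AutoModelForCausalLM

# Load the model in full precision (FP32 by default)
model_fp32 = AutoModelForCausalLM.from_pretrained("distilgpt2")

# Print the module tree: wte, wpe, six GPT2Blocks, and the final layers
print(model_fp32)
```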
There are two main components in the original Transformer architecture:
Decoder
Encoder
Since we are using distilgpt2, a GPT-2 variant, it was built using only the decoder part of the original Transformer architecture. Token embeddings represent the meaning of individual words, while positional embeddings encode the order of words in a sequence.
wte: Word Token Embedding, maps the 50,257 vocabulary tokens to a 768-dimensional embedding space.
wpe: Positional Embedding, maps position indices (up to 1024) to 768-dimensional vectors.
h: A ModuleList of 6 GPT2Block layers, each a Transformer decoder block.
Each GPT2Block includes:
Layer Normalization
Self-Attention Layer
Feed-Forward Network
Final Layers: a final Layer Normalization (ln_f) and the language-modeling head (lm_head), which projects the 768-dimensional hidden states back over the 50,257-token vocabulary.
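To see these sub-modules yourself, you can print a single decoder block, reusing the model_fp32 variable from the sketch above.

```python
# The first of the six decoder blocks: ln_1, attn, ln_2, mlp
print(model_fp32.transformer.h[0])
```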
Load the Quantized 4-bit Model
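Below is a sketch of loading the same model in 4-bit via a BitsAndBytesConfig. NF4 with FP16 compute is a common choice, but treat these settings as illustrative rather than definitive.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Describe how the weights should be quantized
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # store weights in 4-bit
    bnb_4bit_quant_type="nf4",             # NormalFloat4 data type
    bnb_4bit_compute_dtype=torch.float16,  # run the matmuls in FP16
)

# accelerate's device_map="auto" places the weights on the available hardware
model_4bit = AutoModelForCausalLM.from_pretrained(
    "distilgpt2",
    quantization_config=bnb_config,
    device_map="auto",
)
```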
Check Memory Usage
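One way to compare the two footprints, assuming the model_fp32 and model_4bit variables from the earlier sketches; get_memory_footprint() reports the size of the loaded weights in bytes.

```python
fp32_mb = model_fp32.get_memory_footprint() / 1024**2
q4_mb = model_4bit.get_memory_footprint() / 1024**2

print(f"Full Precision Model Memory: {fp32_mb:.2f} MB")
print(f"4-bit Quantized Model Memory: {q4_mb:.2f} MB")
print(f"Memory saved: {(1 - q4_mb / fp32_mb) * 100:.2f}%")
```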
Full Precision Model Memory: 333.94 MB
4-bit Quantized Model Memory: 106.42 MB
Memory saved: 68.13% 🎉
Test the Model Output Quality
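A quick generation test with the default (greedy) decoding; the prompt is just an example of my own.

```python
prompt = "The future of artificial intelligence is"

# Tokenize and move the inputs to the same device as the quantized model
inputs = tokenizer(prompt, return_tensors="pt").to(model_4bit.device)

# generate() falls back to greedy decoding when do_sample is not set
outputs = model_4bit.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```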
Result
Problem:
4-bit quantization, especially on CPU, reduces the precision of the model weights. This can flatten the attention distribution and degrade the logical coherence of the generated text. On top of that, the model defaults to greedy decoding, which:
Always picks the most probable next token
Can easily fall into loops
Let’s fix it with sampling parameters (see the sketch after this list):
do_sample=True: enables sampling (instead of greedy decoding)
temperature=0.7: controls randomness (lower = more deterministic)
top_k=50: limits sampling to the top 50 tokens
top_p=0.95: uses nucleus sampling
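Here is the same generation call with sampling enabled, reusing the inputs from the test above.

```python
outputs = model_4bit.generate(
    **inputs,
    max_new_tokens=50,
    do_sample=True,                       # sample instead of greedy decoding
    temperature=0.7,                      # lower = more deterministic
    top_k=50,                             # keep only the 50 most likely tokens
    top_p=0.95,                           # nucleus sampling
    pad_token_id=tokenizer.eos_token_id,  # silence the missing-pad-token warning
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```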
When Should You Avoid Quantization?
When your model is already small and fast (e.g., Tiny models)
During early-stage model development and evaluation
For tasks that involve complex logical reasoning
For tasks that are extremely sensitive (e.g., healthcare applications)