Running 70B Models on 4GB VRAM: The AirLLM Layer-Swap Hack
AirLLM trades disk I/O for VRAM, letting developers run massive models locally without renting enterprise GPU clusters.
The VRAM wall is the bane of local LLM development. To run an unquantized 70B parameter model, you need roughly 130GB of VRAM—meaning you either need a cluster of high-end enterprise GPUs or a steady stream of cloud credits just to see if your prompt engineering works.
For developers without access to "big iron," this hardware barrier has effectively gated local testing of state-of-the-art models. But a library called AirLLM challenges this paradigm. By trading disk I/O for memory capacity, AirLLM allows you to run a 70B model on a single 4GB GPU, and even the colossal Llama 3.1 405B on an 8GB GPU.
This is not a magic compression trick; it is a clever engineering hack that exploits the sequential nature of transformer architectures. Here is how it works, what the real-world performance trade-offs are, and how to integrate it into your workflow.
The Anatomy of the Layer-Swap
To understand how AirLLM pulls this off, we have to look at how a Transformer model executes. A typical 70B model (like Llama 2 or Llama 3) consists of an embedding layer, followed by 80 identical, sequential transformer layers, and a final normalization/projection layer.
During inference, these layers are executed one after another. The output of Layer 1 becomes the input to Layer 2, and so on. At any given millisecond of execution, the GPU only needs to perform calculations for a single layer. Keeping the other 79 layers sitting idle in precious VRAM is highly inefficient if you are memory-constrained.
AirLLM implements a "divide and conquer" strategy:
- Layer-by-Layer Loading: It loads only the active layer from disk into GPU memory.
- Execution: The GPU computes the activations for that layer.
- Eviction: The layer is immediately purged from VRAM, and the next layer is read from disk.
By keeping only one layer in memory at a time, the VRAM floor drops from the full 130GB model size to the size of a single layer—roughly 1.6GB for a 70B model.
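The load, execute, evict cycle above can be illustrated with a toy sketch. This is not AirLLM's code: the "layers" are tiny 2x2 matrices stored as JSON files, and `save_layers`, `run_layer_by_layer`, and `matvec` are hypothetical helper names chosen for illustration. The point is only that streaming one layer at a time produces the same result as keeping every layer resident.

```python
import json
import os
import tempfile


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def save_layers(layers, directory):
    """Write each layer's weights to its own file, one file per layer."""
    paths = []
    for index, weights in enumerate(layers):
        path = os.path.join(directory, f"layer_{index}.json")
        with open(path, "w") as handle:
            json.dump(weights, handle)
        paths.append(path)
    return paths


def run_layer_by_layer(layer_paths, hidden):
    """Run the layers in order, holding only one layer's weights at a time."""
    for path in layer_paths:
        with open(path) as handle:
            weights = json.load(handle)   # load: read only the active layer
        hidden = matvec(weights, hidden)  # execute: compute this layer's output
        del weights                       # evict: drop the layer before the next read
    return hidden


# Three tiny 2x2 "layers" stand in for the 80 transformer layers.
layers = [[[1, 2], [3, 4]], [[0, 1], [1, 0]], [[2, 0], [0, 2]]]
with tempfile.TemporaryDirectory() as directory:
    paths = save_layers(layers, directory)
    streamed = run_layer_by_layer(paths, [1, 1])

# Reference result with every layer resident in memory at once.
resident = [1, 1]
for weights in layers:
    resident = matvec(weights, resident)

print(streamed == resident)
```

In a real setup the files would hold tensors and the compute step would run on the GPU, but the control flow keeps the same shape: peak memory is bounded by the largest single layer, not by the sum of all layers.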
What about the KV cache? To avoid recalculating past tokens, the key-value (KV) cache must remain in VRAM. However, the KV cache footprint is surprisingly small for short-to-medium context lengths. The formula for KV cache size is:
$$\text{Size} = 2 \times \text{sequence\_length} \times \text{num\_layers} \times \text{num\_heads} \times \text{vector\_dim} \times 4\text{ bytes}$$
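For an input length of 100 tokens on a 70B model (assuming 80 layers and 8 key-value heads of 128 dimensions each), this cache takes up only tens of megabytes of VRAM: about 30MB if each value is stored in 2 bytes, or roughly twice that at the 4 bytes per value used in the formula. Combined with the 1.6GB layer size, the entire execution easily fits within a 4GB VRAM envelope.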
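To make the arithmetic concrete, here is a small sketch that evaluates the formula. The model shape (80 layers, 8 key-value heads, 128 dimensions per head) is an assumption based on the Llama 2 70B configuration, not something AirLLM reports, and `kv_cache_bytes` is a hypothetical helper name. Running it with 2 bytes per value lands near the 30MB figure, while 4 bytes per value gives about twice that.

```python
def kv_cache_bytes(sequence_length, num_layers, num_heads, vector_dim, bytes_per_value):
    """Keys and values (the factor of 2) for every token, layer, and head."""
    return 2 * sequence_length * num_layers * num_heads * vector_dim * bytes_per_value


# Assumed shape: 80 layers, 8 key-value heads, 128 dimensions per head.
cache_4_byte = kv_cache_bytes(100, 80, 8, 128, 4)
cache_2_byte = kv_cache_bytes(100, 80, 8, 128, 2)

# Per-layer weight size if 130GB is spread evenly across 80 layers.
layer_gb = 130 / 80

print(cache_4_byte)  # 65536000 bytes, about 65.5MB
print(cache_2_byte)  # 32768000 bytes, about 32.8MB
print(layer_gb)      # 1.625
```

Either way the cache is small next to the 1.6GB layer at this context length. Because the size grows linearly with sequence length, a context of tens of thousands of tokens would change that picture.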
Under the Hood: Meta Devices and Safetensors
Orchestrating this constant swapping without crashing PyTorch requires some low-level maneuvering. AirLLM relies on two key technologies under the hood: Hugging Face Accelerate and safetensors.
First, AirLLM uses Accelerate's Meta Device feature (init_empty_weights()). A meta device is a virtual target that allows PyTorch to load a model's skeleton and execution graph without actually allocating memory for the weights. The memory footprint starts at zero:
from accelerate import init_empty_weights

with init_empty_weights():
    # Model structure is loaded, but weights are empty
    my_model = ModelClass(...)
Second, standard Hugging Face model shards are typically saved in 10GB chunks. If AirLLM had to read a 10GB file just to extract a 1.6GB layer, the disk overhead would make inference completely unusable. To solve this, AirLLM pre-processes the model, splitting it strictly layer-by-layer and saving it using safetensors. Because safetensors maps files directly to memory, AirLLM can stream individual layers into VRAM with minimal serialization overhead.
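The regrouping step can be sketched without touching any tensors. The example below only decides which tensor names belong in which per-layer file. It assumes the `model.layers.N.` naming used by Hugging Face Llama checkpoints. The output shard names and the `group_keys_by_layer` helper are hypothetical, and this is not AirLLM's actual splitter. Writing each group to disk with safetensors is left out.

```python
import re
from collections import defaultdict

# Matches tensor names that belong to one of the repeated transformer layers.
LAYER_PATTERN = re.compile(r"^model\.layers\.(\d+)\.")


def group_keys_by_layer(tensor_names):
    """Map each output shard name to the tensor names it should contain."""
    shards = defaultdict(list)
    for name in tensor_names:
        match = LAYER_PATTERN.match(name)
        # Tensors outside the repeated layers (embeddings, final norm,
        # output head) go to a shard named after their module path.
        shard = f"layer_{int(match.group(1))}" if match else name.rsplit(".", 1)[0]
        shards[shard].append(name)
    return dict(shards)


names = [
    "model.embed_tokens.weight",
    "model.layers.0.self_attn.q_proj.weight",
    "model.layers.0.mlp.up_proj.weight",
    "model.layers.1.self_attn.q_proj.weight",
    "model.norm.weight",
    "lm_head.weight",
]
grouped = group_keys_by_layer(names)
for shard, members in grouped.items():
    print(shard, len(members))
```

With one file per group, reading layer N means opening exactly one file rather than scanning a multi-gigabyte shard that happens to contain it.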
The Elephant in the Room: The Disk I/O Bottleneck
Let's be completely clear: AirLLM is slow. Your PCIe bus and NVMe SSD—not your GPU cores—become the absolute bottleneck of the system. Reading 130GB of weights from disk for every single token generated means your generation speed will be measured in seconds per token, not tokens per second.
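A back-of-the-envelope estimate shows why. If every weight has to come off disk once per token, read bandwidth sets a floor on latency. The read speeds below are hypothetical round numbers rather than measurements, and `seconds_per_token` is an illustrative helper that ignores compute time, caching, and prefetching.

```python
def seconds_per_token(model_gb, read_gb_per_s):
    """Lower bound on per-token latency when every weight is read once per token."""
    return model_gb / read_gb_per_s


# Hypothetical sustained read speeds in GB/s; not measured figures.
for read_speed in (0.5, 3.0, 7.0):
    print(read_speed, round(seconds_per_token(130, read_speed), 1))
```

At an assumed 3 GB/s, reading 130GB takes a little over 43 seconds, so even a fast drive leaves an unquantized 70B model far from interactive speeds.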
To mitigate this, AirLLM introduced two critical optimizations:
- Prefetching: Introduced in version 2.5, this overlaps disk reads with GPU execution. While the GPU is busy calculating Layer $N$, AirLLM is already streaming Layer $N+1$ from the SSD into system RAM or a GPU staging area. This simple concurrency trick yields a ~10% speed improvement.
- Block-wise Weight-Only Quantization: By using bitsandbytes to compress the model to 4-bit or 8-bit, AirLLM reduces the physical size of the files on disk. Because the bottleneck is disk-read speed rather than compute, shrinking the files by roughly 4x yields up to about a 3x speedup in inference time rather than a full 4x, since the compute and dequantization costs remain.
Unlike standard quantization (which quantizes both weights and activations to speed up matrix multiplication), AirLLM only quantizes the weights on disk to speed up loading. This preserves model accuracy far better, avoiding the typical degradation associated with aggressive quantization.
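The prefetching idea described above can be sketched with a background thread. This is a simulation, not AirLLM's implementation: `load_layer` and `compute` are hypothetical stand-ins that sleep to mimic disk and GPU latency, and the "weights" are plain integers.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def load_layer(index):
    """Stand-in for a disk read; sleeps to simulate I/O latency."""
    time.sleep(0.02)
    return index + 1  # pretend the "weights" are a single number


def compute(weights, hidden):
    """Stand-in for the GPU work on one layer."""
    time.sleep(0.02)
    return hidden * weights


def run_sequential(num_layers, hidden):
    """Load, then compute, with no overlap between the two."""
    for index in range(num_layers):
        hidden = compute(load_layer(index), hidden)
    return hidden


def run_prefetched(num_layers, hidden):
    """Overlap the read of layer N+1 with the compute of layer N."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(load_layer, 0)
        for index in range(num_layers):
            weights = pending.result()  # wait for the layer requested earlier
            if index + 1 < num_layers:
                # Start reading layer N+1 while layer N is being computed.
                pending = pool.submit(load_layer, index + 1)
            hidden = compute(weights, hidden)
    return hidden


start = time.perf_counter()
sequential = run_sequential(5, 1)
sequential_time = time.perf_counter() - start

start = time.perf_counter()
prefetched = run_prefetched(5, 1)
prefetched_time = time.perf_counter() - start

print(sequential == prefetched, prefetched_time < sequential_time)
```

The simulated load and compute times are equal here, so the toy overlap hides a large share of the I/O. When the disk read dominates the compute, as it does in the setup this article describes, there is far less to hide, which is consistent with the modest gain reported for the real feature.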
Developer Workflow: Getting Hands-On
Adopting AirLLM is straightforward. It exposes an API that mimics standard Hugging Face transformers, meaning you don't have to rewrite your generation loops.
First, install the package and its compression dependencies:
pip install -U airllm bitsandbytes
Then, initialize the model using AutoModel. The first time you run this, AirLLM will download the model, decompose it layer-by-layer, and cache the sharded safetensors files locally. Make sure you have ample SSD space for this initial conversion step.
from airllm import AutoModel

MAX_LENGTH = 128

# Initialize with 4-bit weight compression to speed up disk loading
model = AutoModel.from_pretrained(
    "garage-bAInd/Platypus2-70B-instruct",
    compression='4bit'
)

input_text = ["What is the capital of United States?"]

input_tokens = model.tokenizer(
    input_text,
    return_tensors="pt",
    truncation=True,
    max_length=MAX_LENGTH
)

generation_output = model.generate(
    input_tokens['input_ids'].cuda(),
    max_new_tokens=20,
    use_cache=True,
    return_dict_in_generate=True
)

output = model.tokenizer.decode(generation_output.sequences[0])
print(output)
If you are tight on disk space, you can pass the delete_original=True flag to remove the raw Hugging Face files once the layer-wise sharding process is complete.
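Before kicking off the download, it can be worth checking free space with the standard library. The helper below is a hypothetical convenience, not part of AirLLM, and the budget of twice the model size (original checkpoint plus the layer-wise copy) is an assumed rule of thumb rather than a documented requirement.

```python
import shutil


def has_enough_space(path, required_gb):
    """Return True if the filesystem holding `path` has at least `required_gb` free."""
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= required_gb


# Hypothetical budget: the original checkpoint plus the layer-wise copy.
print(has_enough_space(".", 2 * 130))
```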
The Verdict: When to Use It
AirLLM is a brilliant engineering achievement, but it is not a silver bullet. It is vital to understand where it fits in the developer toolkit:
- What it is NOT for: Real-time chat applications, high-throughput production APIs, or interactive UI testing. The latency is simply too high.
- What it IS for: Local prototyping, prompt validation, offline batch processing, and privacy-first pipelines.
If you need to run a massive model locally and have a decent CPU with plenty of system RAM, tools like llama.cpp (using GGUF format) remain the standard for interactive speeds, though they require heavy quantization. However, if you lack system RAM, want to test unquantized weights, or need to validate how a 70B or 405B model behaves before deploying it to an expensive cloud cluster, AirLLM is an incredibly elegant way to bypass the VRAM tax.
Priya covers AI frameworks, developer productivity tooling, and the startup ecosystem across South and Southeast Asia, bringing a researcher's rigour and a practitioner's empathy to every story. She is deeply sceptical of benchmarks and asks hard questions so her readers don't have to.