Core Assessment
AirLLM (repository lyogavin/airllm) is one of the few LLM inference libraries that can “run a 70B model on a single 4GB card without quantization, distillation, or pruning.” Its core approach is straightforward: split the large model into layers and load them block by block. Typically, only the parameters for the layer currently being computed are loaded into VRAM, while other layers remain idle on NVMe/SSD storage.
Its moat lies in three “hard skills” that competitors haven’t achieved:
- 70B on 4GB VRAM: No quantization (preserves accuracy), no distillation (preserves capability), no pruning (preserves generality); relies purely on time-sliced swapping.
- 405B Llama3.1 on 8GB VRAM: 405B is among the largest openly available models; AirLLM brings its hardware requirement down from “8x H100” to a “single 8GB GPU.”
- Block-wise Quantization (4bit/8bit) 3x Speedup: Optional block-wise quantization compresses only the weights, not the activations. Smaller layers mean less data read from disk per step, which is where the up-to-3x speedup comes from.
Important Premise: AirLLM is not a “lossless acceleration” tool—layer-by-layer swapping means throughput is relatively low (running 70B takes approximately 4-6 seconds for 20 tokens). Its positioning is “can run on a single card / edge device,” rather than “high QPS service.”
Project Map
| Dimension | Key Information |
|---|---|
| Repository | lyogavin/airllm |
| PyPI | pypi.org/project/airllm |
| License | Apache-2.0 |
| Stars | 19,070 / Forks 2,086 |
| Supported Models | Llama / Llama2 / Llama3 / Llama3.1 (including 405B) / QWen / ChatGLM / Baichuan / Mistral / InternLM / Mixtral |
Version Milestones
| Time | Event |
|---|---|
| 2023-11 | AirLLM Initial Release |
| 2023-12 | v2.0: safetensors support, block-wise quantization 3x speedup |
| 2023-12 | Full support for ChatGLM / QWen / Baichuan / Mistral / InternLM |
| 2023-12 | v2.6: AutoModel automatic model type recognition |
| 2023-12 | v2.7: Mixtral support |
| 2023-12 | v2.8.2: Running 70B on macOS |
| 2024-04 | Native support for Llama3 70B on a single 4GB card |
| 2024-07 | Llama3.1 405B on 8GB VRAM + 8bit/4bit quantization |
| 2024-08 | v2.10: CPU inference support |
| 2024-08 | v2.11: Qwen2.5 support |
Quick Start
Installation
pip install airllm
If you want to use quantization for acceleration:
pip install -U bitsandbytes
Running 70B in a Few Lines of Code
from airllm import AutoModel
MAX_LENGTH = 128
model = AutoModel.from_pretrained("garage-bAInd/Platypus2-70B-instruct")
input_text = ["What is the capital of United States?"]
input_tokens = model.tokenizer(
    input_text,
    return_tensors="pt",
    return_attention_mask=False,
    truncation=True,
    max_length=MAX_LENGTH,
    padding=False,
)
generation_output = model.generate(
    input_tokens["input_ids"].cuda(),
    max_new_tokens=20,
    use_cache=True,
    return_dict_in_generate=True,
)
print(model.tokenizer.decode(generation_output.sequences[0]))
The first run will automatically split the model layer by layer and save it to disk—please ensure there is sufficient disk space in the HF cache directory (70B ≈ 140GB).
405B Llama3.1 (8GB Single GPU + 4bit)
from airllm import AutoModel
model = AutoModel.from_pretrained(
    "meta-llama/Meta-Llama-3.1-405B-Instruct",
    compression="4bit",  # or "8bit"
)
# ... same tokenizer + generate as above
Core Mechanisms
1. Layer-wise Loading
AirLLM does not load the entire model into VRAM at once. The process is as follows:
- Split the model into $N$ sharded checkpoints based on transformer blocks.
- During inference, only when computing layer $i$, read layer $i$ from NVMe/SSD into GPU VRAM.
- After computing layer $i$, and before computing layer $i+1$, release the memory for layer $i$.
- The KV cache remains resident in VRAM across layers (this part cannot be saved).
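The steps above can be sketched as a toy forward pass. This is a minimal illustration, not AirLLM's actual code: `save_layer_shards`, `layerwise_forward`, and the shard file names are hypothetical, and small NumPy matrices on local disk stand in for transformer blocks streamed to a GPU.

```python
import tempfile
from pathlib import Path

import numpy as np


def save_layer_shards(weights, shard_dir):
    """Write one file per layer, mimicking the one-time model split."""
    paths = []
    for i, w in enumerate(weights):
        path = Path(shard_dir) / f"layer_{i}.npy"  # hypothetical naming scheme
        np.save(path, w)
        paths.append(path)
    return paths


def layerwise_forward(x, shard_paths):
    """Run the layers in order while holding only one layer's weights at a time."""
    for path in shard_paths:
        w = np.load(path)    # load layer i from disk
        x = np.tanh(x @ w)   # compute layer i (toy stand-in for a transformer block)
        del w                # release layer i before loading layer i+1
    return x


rng = np.random.default_rng(0)
weights = [rng.standard_normal((8, 8)) for _ in range(4)]
x0 = rng.standard_normal((1, 8))

with tempfile.TemporaryDirectory() as tmp:
    shard_paths = save_layer_shards(weights, tmp)
    streamed = layerwise_forward(x0, shard_paths)

# Reference: the same computation with every layer resident in memory.
resident = x0
for w in weights:
    resident = np.tanh(resident @ w)

print("streamed result matches resident result:", np.allclose(streamed, resident))
```

The streamed and fully resident passes produce the same activations; only the peak memory differs, since the streamed version never holds more than one layer's weights.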
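Layer-wise loading explains why “running a 70B model on 4GB VRAM” is possible: a 70B model in FP16 occupies about 140GB, so each of its 80 transformer layers weighs about 140GB / 80 ≈ 1.75GB. Adding roughly 1GB of KV cache gives about 2.75GB, which fits within 4GB of VRAM.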
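The VRAM estimate can be reproduced with a few lines of arithmetic. This is a back-of-the-envelope sketch that assumes FP16 weights (2 bytes per parameter), 80 equally sized layers, and the roughly 1GB KV cache figure quoted above; it ignores embeddings, the output head, and activation buffers. The function name `per_layer_vram_gb` is made up for this example.

```python
def per_layer_vram_gb(total_params, num_layers, bytes_per_param=2.0, kv_cache_gb=1.0):
    """Return (GB per layer, GB per layer plus KV cache) for a layer-streamed model."""
    weights_gb = total_params * bytes_per_param / 1e9
    layer_gb = weights_gb / num_layers
    return layer_gb, layer_gb + kv_cache_gb


layer_gb, peak_gb = per_layer_vram_gb(total_params=70e9, num_layers=80)
print(f"{layer_gb:.2f} GB per layer, {peak_gb:.2f} GB with KV cache")
```

Changing `bytes_per_param` to 0.5 gives a rough picture of the per-layer footprint under 4-bit compression, again ignoring any per-block metadata.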
2. Block-wise Quantization
Block-wise quantization (see arXiv:2212.09720) is the headline feature added in AirLLM 2.0.
model = AutoModel.from_pretrained(
    "garage-bAInd/Platypus2-70B-instruct",
    compression="4bit",  # or "8bit"
)
Key differences from standard quantization:
| Approach | Quantization Scope | Inference Acceleration | Accuracy Loss | Adaptation Difficulty |
|---|---|---|---|---|
| Standard INT8/INT4 Quantization | Weights + Activations | High | Medium-High (sensitive to outliers) | High (requires calibration) |
| AirLLM Block-wise Quantization | Weights Only | Medium (3x acceleration in disk loading) | Low | Zero (calibration-free) |
Design philosophy: The bottleneck is disk I/O, not computation. Therefore, compressing only the weights yields a 3x speedup; leaving activations untouched prevents outliers from degrading accuracy.
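To make the weights-only, calibration-free idea concrete, here is a minimal NumPy sketch of 8-bit block-wise absmax quantization. It is not AirLLM's implementation (which builds on bitsandbytes); the function names and the block size of 64 are assumptions chosen for illustration.

```python
import numpy as np


def quantize_blockwise(weights, block_size=64):
    """Quantize a 1-D float array to int8 with one absmax scale per block."""
    pad = (-len(weights)) % block_size
    blocks = np.pad(weights, (0, pad)).reshape(-1, block_size)
    scales = np.abs(blocks).max(axis=1, keepdims=True) / 127.0
    scales[scales == 0] = 1.0  # avoid division by zero for all-zero blocks
    q = np.round(blocks / scales).astype(np.int8)
    return q, scales.astype(np.float32), len(weights)


def dequantize_blockwise(q, scales, length):
    """Recover approximate float32 weights from int8 blocks and their scales."""
    return (q.astype(np.float32) * scales).reshape(-1)[:length]


rng = np.random.default_rng(0)
w = rng.standard_normal(1000).astype(np.float32)
w[10] = 50.0  # an outlier only inflates the scale of its own block

q, scales, n = quantize_blockwise(w)
w_hat = dequantize_blockwise(q, scales, n)

stored_bytes = q.nbytes + scales.nbytes
print(f"float32: {w.nbytes} bytes, int8 blocks + scales: {stored_bytes} bytes")
print(f"max abs error: {np.abs(w - w_hat).max():.4f}")
```

No calibration data is needed because each scale is derived from the weights themselves, and an outlier degrades precision only within its own block rather than across the whole tensor.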
3. Prefetching
model = AutoModel.from_pretrained(
    "garage-bAInd/Platypus2-70B-instruct",
    prefetching=True,  # Default is True
)
This creates a pipeline overlap between “loading layer $i+1$” and “computing layer $i$,” resulting in approximately a 10% speed increase.
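The overlap can be sketched with a background thread that loads the next layer while the current one is computed. This is an illustrative toy, not AirLLM's code: `load_layer` and `compute_layer` are hypothetical stand-ins that simply sleep.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def load_layer(i):
    """Stand-in for reading layer i from disk (I/O-bound)."""
    time.sleep(0.02)
    return f"weights-{i}"


def compute_layer(i, weights):
    """Stand-in for running layer i on the GPU."""
    time.sleep(0.02)
    return f"out-{i}"


def run_sequential(num_layers):
    """Load, then compute, one layer at a time with no overlap."""
    return [compute_layer(i, load_layer(i)) for i in range(num_layers)]


def run_prefetched(num_layers):
    """Start loading layer i+1 in the background before computing layer i."""
    outputs = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(load_layer, 0)
        for i in range(num_layers):
            weights = pending.result()                    # wait for layer i
            if i + 1 < num_layers:
                pending = pool.submit(load_layer, i + 1)  # begin loading layer i+1
            outputs.append(compute_layer(i, weights))     # runs while the load proceeds
    return outputs


for fn in (run_sequential, run_prefetched):
    start = time.perf_counter()
    fn(8)
    print(f"{fn.__name__}: {time.perf_counter() - start:.2f}s")
```

In this toy, loading and computing take equal time, so the overlap saves a large fraction. When loading dominates, as the text says it does for AirLLM, overlap can hide at most the compute share of each step, which is consistent with a gain on the order of 10%.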
4. MacOS / Apple Silicon Support
# Native support for Apple Silicon (M1/M2/M3/M4)
# Requires installing mlx and torch first
pip install mlx torch
See run_on_macos.ipynb for details.
Configuration Options Quick Reference
| Parameter | Purpose | Default Value |
|---|---|---|
| compression | "4bit" / "8bit" / None | None |
| profiling_mode | Print time taken per layer | False |
| layer_shards_saving_path | Path to save split model shards | HF cache |
| hf_token | Used for pulling gated models | – |
| prefetching | Load-compute pipeline overlap | True |
| delete_original | Delete original model after splitting, saving half the disk space | False |
delete_original=True is a lifesaver when disk space is tight: the HF download is deleted and only the split version is kept (saving ~50% disk).
Performance Intuition
| Scenario | Hardware | Approx. Throughput | Notes |
|---|---|---|---|
| 70B Llama2 / Llama3 | Single 4GB GPU | ~4-6 s / 20 tokens | Layer swapping is the bottleneck |
| 70B + 4bit | Single 4GB GPU | ~1.5-2 s / 20 tokens | Block-wise quantization removes disk I/O bottleneck |
| 405B Llama3.1 + 4bit | Single 8GB GPU | ~10-15 s / 20 tokens | Due to the sheer size of 405B |
| 70B Mac M-series | 32GB Unified Memory | Similar to single GPU | Apple Silicon is suitable for individual developers |
| 70B CPU only | Multi-core x86 | Extremely slow | Temporary solution, not recommended |
Do not use AirLLM as a production inference server—its design goal is “to make it run,” not “high QPS.”
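For comparison with tools that report tokens per second, the "seconds per 20 tokens" figures in the table can be converted directly. The sketch below only restates the table's approximate numbers; `tokens_per_second` is a helper defined for this example.

```python
def tokens_per_second(seconds_low, seconds_high, tokens=20):
    """Convert an 'N seconds per `tokens` tokens' range into (slowest, fastest) tokens/s."""
    return tokens / seconds_high, tokens / seconds_low


# Approximate ranges taken from the table above (seconds per 20 tokens).
scenarios = {
    "70B, no compression": (4.0, 6.0),
    "70B + 4bit": (1.5, 2.0),
    "405B + 4bit": (10.0, 15.0),
}

for name, (low, high) in scenarios.items():
    slowest, fastest = tokens_per_second(low, high)
    print(f"{name}: {slowest:.1f}-{fastest:.1f} tokens/s")
```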
Typical Scenarios
Scenario A: Individual Developer Testing 70B Locally
# 1. Requires 4GB GPU + 256GB Disk (HF cache)
pip install airllm
# 2. Run
python -c "
from airllm import AutoModel
m = AutoModel.from_pretrained('garage-bAInd/Platypus2-70B-instruct', compression='4bit')
"
# First run downloads and splits the model (~1-2 hours)
# Subsequent launches use the split version directly
Scenario B: Testing Llama3 on MacBook M3
from airllm import AutoModel
m = AutoModel.from_pretrained("meta-llama/Meta-Llama-3-70B-Instruct", compression="4bit")
# ... Apple Silicon uses unified memory; after 4-bit quantization, usage is approx. 35GB
Scenario C: Edge / Embedded Deployment
Copy the sharded, 4-bit-quantized checkpoint of a 70B model onto a Jetson Orin (32GB shared memory). With prefetching=True, this can deliver a usable, if slow, “conversational” experience on edge devices.
Scenario D: Research / Evaluation
# profiling_mode prints time taken per layer
model = AutoModel.from_pretrained(
    "...",
    profiling_mode=True,
    compression="4bit",
)
# Output: layer 0: 0.3s, layer 1: 0.28s, ...
# Helps identify GPU or IO bottlenecks
Boundaries and Blind Spots
- Low Throughput: Swapping in a 70B model layer-by-layer takes 4-6 seconds to generate 20 tokens; not suitable for real-time chat.
- Large Disk Footprint: A split 70B model is ~140GB; a split 405B model is ~800GB.
- No Batch Support > 1: Layer-wise loading assumes batch=1.
- KV Cache Cannot Be Offloaded: Long contexts combined with large models will exhaust VRAM.
- MacOS Supports Apple Silicon Only: Intel Macs are not supported.
- Quantization Does Not Compress Activations: Compared to vLLM / TensorRT-LLM, pure computational acceleration is not directly comparable.
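To see why the KV cache becomes the limiting factor at long context lengths, its size can be estimated from the model shape. The defaults below (80 layers, 8 key/value heads, head dimension 128, FP16) are assumptions modeled on a Llama-2-70B-style configuration and are not taken from the text above; check the target model's config before relying on them.

```python
def kv_cache_bytes(seq_len, num_layers=80, num_kv_heads=8, head_dim=128, bytes_per_value=2):
    """Bytes needed to cache keys and values for seq_len tokens at batch size 1."""
    # Factor 2 accounts for storing both a key and a value vector per head.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return seq_len * per_token


for seq_len in (128, 4096, 32768):
    gib = kv_cache_bytes(seq_len) / 2**30
    print(f"{seq_len:>6} tokens: {gib:.2f} GiB")
```

Under these assumptions the cache stays small for short prompts, reaches roughly the 1GB scale around 4K tokens, and outgrows a 4-8GB card well before 32K tokens.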
Comparison with Similar Tools
| Tool | Core Approach | Run 70B on 4GB | Throughput | Ease of Use | Suitable Scenarios |
|---|---|---|---|---|---|
| AirLLM | Layer-wise swapping + Block-wise quantization | ✅ | Low | High (one line of code) | Edge / Individual / Research |
| vLLM | PagedAttention + Continuous batching | ❌ (requires full model) | Very High | Medium | Production Services |
| TensorRT-LLM | Compilation optimization + Quantization | ❌ | Very High | Low | NVIDIA Production |
| llama.cpp | GGUF quantization + CPU/GPU hybrid | ✅ (requires GGUF) | Medium | Medium | Cross-platform |
| Ollama | Wrapper around llama.cpp | ✅ | Medium | Very High | Local Chatting |
| SGLang | RadixAttention + Structured Generation | ❌ | Very High | Medium | High QPS Services |
AirLLM’s true niche is the “run large models on 4GB-8GB VRAM” segment—llama.cpp / Ollama are suitable for models under 13B, vLLM / TensorRT-LLM are for production services, and AirLLM is for “my GPU is small but I want to run 70B/405B.”
Frequently Asked Questions (FAQ)
1. MetadataIncompleteBuffer
safetensors_rust.SafetensorError: Error while deserializing header: MetadataIncompleteBuffer
The most likely cause is running out of disk space: splitting the model is disk-intensive. Run du -sh ~/.cache/huggingface to check usage, free up or expand storage, clear the Hugging Face cache, and rerun.
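A quick pre-flight check can catch this before a long download and split. The sketch below uses only the Python standard library; `has_free_space` is a hypothetical helper, and the 280GB requirement is an assumption (the original download plus the split copy of a 70B FP16 model at about 140GB each).

```python
import shutil
from pathlib import Path


def has_free_space(path, required_gb):
    """Return True if the filesystem containing `path` has at least required_gb free."""
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= required_gb


cache_dir = Path.home() / ".cache" / "huggingface"
# Fall back to the current directory if the cache directory does not exist yet.
check_path = cache_dir if cache_dir.exists() else Path.cwd()

required_gb = 2 * 140  # assumption: original download + split copy of a 70B FP16 model
print(f"Enough space for a 70B split at {check_path}: {has_free_space(check_path, required_gb)}")
```

If the shards are written elsewhere via layer_shards_saving_path, run the check against that path instead.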
2. ValueError: max() arg is an empty sequence
You may have used AirLLMLlama2 to load QWen or ChatGLM models. Use AutoModel consistently:
from airllm import AutoModel # Do not use AirLLMLlama2
m = AutoModel.from_pretrained("Qwen/Qwen-7B")
3. 401 Client Error: ... is gated
Some models are gated (e.g., meta-llama/Llama-2) and require a Hugging Face token:
m = AutoModel.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    hf_token="hf_xxx",
)
4. ValueError: Asking to pad but the tokenizer does not have a padding token
input_tokens = model.tokenizer(
    input_text,
    return_tensors="pt",
    return_attention_mask=False,
    truncation=True,
    max_length=MAX_LENGTH,
    padding=False,  # Disable padding
)
Adoption Guidelines
Suitable For
- Users with 4-8GB VRAM cards (M1/M2/M3, small RTX 3060/4060, Jetson Orin) who want to run 70B/405B models
- Personal research, paper experiments, or edge demos
- Those wanting to evaluate large models while preserving precision (no quantization, distillation, or pruning)
Not Suitable For
- Production-level QPS requirements (>10 req/s): use vLLM / TensorRT-LLM instead
- Users with 24GB+ GPUs (e.g., RTX 4090/A5000): using transformers' from_pretrained directly is simpler
- Workloads requiring batch size > 1: use vLLM / SGLang
Implementation Steps
- Install and test first: Run pip install airllm and try Platypus2-70B to get a feel for it.
- Enable 4-bit acceleration: Use compression="4bit" to observe performance gains.
- Upgrade to 405B models: With 8GB VRAM + 4-bit compression, run Meta-Llama-3.1-405B (provided disk space allows).
- Use delete_original=True: When disk space is tight, keep only the sharded versions.
One-Sentence Summary
AirLLM is currently the optimal solution in the niche of “running ultra-large models on low VRAM” — running 70B on 4GB and 405B on 8GB without quantization to preserve precision; the trade-off is lower throughput, high disk usage, and unsuitability for production. It is a pragmatic choice for individual researchers and edge deployment scenarios.