AirLLM Complete Guide: Running a 70B Model with 4GB VRAM

Author Info

Elena Volkov

Machine Learning Research Editor

Ph.D. Machine Learning (ETH Zürich); published work on efficient training and evaluation

Elena explains model architecture, training economics, and benchmark design for a technical audience. She reads primary papers and official technical reports, then summarizes assumptions, datasets, and known failure modes. She avoids hype by pairing capability claims with reproducibility notes.

#Model Architecture #Benchmarks #Training Economics #Open-Source Models

Core Assessment

AirLLM (repository lyogavin/airllm) is one of the few LLM inference libraries that can “run a 70B model on a single 4GB card without quantization, distillation, or pruning.” Its core approach is straightforward: split the large model into layers and load them block by block. Typically, only the parameters for the layer currently being computed are loaded into VRAM, while other layers remain idle on NVMe/SSD storage.

Three capabilities set it apart:

  1. 70B on 4GB VRAM: No quantization (preserves accuracy), no distillation (preserves capability), no pruning (preserves generality); relies purely on time-sliced swapping.
  2. 405B Llama3.1 on 8GB VRAM: 405B is among the largest openly released models; AirLLM lowers the hardware bar from “8x H100” to a “single 8GB GPU,” trading speed for memory.
  3. Block-wise Quantization (4bit/8bit), up to 3x Speedup: Optional block-wise quantization compresses only the weights, not the activations, which shrinks the amount of data read from disk and so relieves the main bottleneck.

Important Premise: AirLLM is not a “lossless acceleration” tool. Because every layer is streamed from storage for each generated token, throughput is low and depends heavily on disk bandwidth (this guide's rough figure for 70B is 4-6 seconds per 20 tokens). Its positioning is “can run on a single card / edge device,” rather than “high QPS service.”

Project Map

Dimension | Key Information
Repository | lyogavin/airllm
PyPI | pypi.org/project/airllm
License | Apache-2.0
Stars / Forks | 19,070 / 2,086
Supported Models | Llama / Llama2 / Llama3 / Llama3.1 (including 405B) / QWen / ChatGLM / Baichuan / Mistral / InternLM / Mixtral

Version Milestones

Time | Event
2023-11 | AirLLM Initial Release
2023-12 | v2.0: safetensors support, block-wise quantization 3x speedup
2023-12 | Full support for ChatGLM / QWen / Baichuan / Mistral / InternLM
2023-12 | v2.6: AutoModel automatic model type recognition
2023-12 | v2.7: Mixtral support
2023-12 | v2.8.2: Running 70B on macOS
2024-04 | Native support for Llama3 70B on a single 4GB card
2024-07 | Llama3.1 405B on 8GB VRAM + 8bit/4bit quantization
2024-08 | v2.10: CPU inference support
2024-08 | v2.11: Qwen2.5 support

Quick Start

Installation

pip install airllm

If you want to use quantization for acceleration:

pip install -U bitsandbytes

Running 70B in a Few Lines of Code

from airllm import AutoModel

MAX_LENGTH = 128
model = AutoModel.from_pretrained("garage-bAInd/Platypus2-70B-instruct")

input_text = ["What is the capital of United States?"]
input_tokens = model.tokenizer(
    input_text,
    return_tensors="pt",
    return_attention_mask=False,
    truncation=True,
    max_length=MAX_LENGTH,
    padding=False,
)

generation_output = model.generate(
    input_tokens["input_ids"].cuda(),
    max_new_tokens=20,
    use_cache=True,
    return_dict_in_generate=True,
)
print(model.tokenizer.decode(generation_output.sequences[0]))

The first run will automatically split the model layer by layer and save it to disk—please ensure there is sufficient disk space in the HF cache directory (70B ≈ 140GB).

405B Llama3.1 (8GB Single GPU + 4bit)

from airllm import AutoModel

model = AutoModel.from_pretrained(
    "meta-llama/Meta-Llama-3.1-405B-Instruct",
    compression="4bit",  # or "8bit"
)
# ... same tokenizer + generate as above

Core Mechanisms

1. Layer-wise Loading

AirLLM does not load the entire model into VRAM at once. The process is as follows:

  1. Split the model into $N$ sharded checkpoints based on transformer blocks.
  2. During inference, only when computing layer $i$, read layer $i$ from NVMe/SSD into GPU VRAM.
  3. After computing layer $i$, and before computing layer $i+1$, release the memory for layer $i$.
  4. The KV cache remains resident in VRAM across layers (layer swapping cannot reclaim this memory).
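The steps above can be sketched with a toy model in plain Python. This is a minimal illustration of the load, compute, release pattern, not AirLLM's implementation: the JSON shard files, the helper names (save_layer_shards, apply_layer, layerwise_forward), and the matrix-vector "layer" are all assumptions made for the example, and the KV cache from step 4 is left out.

```python
import json
import tempfile
from pathlib import Path


def save_layer_shards(layers, shard_dir):
    """Step 1: write each layer's weights to its own file (one shard per layer)."""
    paths = []
    for i, weights in enumerate(layers):
        path = Path(shard_dir) / f"layer_{i}.json"
        path.write_text(json.dumps(weights))
        paths.append(path)
    return paths


def apply_layer(weights, hidden):
    """Toy stand-in for a transformer block: matrix-vector product plus ReLU."""
    return [max(0.0, sum(w * h for w, h in zip(row, hidden))) for row in weights]


def layerwise_forward(shard_paths, hidden):
    """Run a forward pass while holding only one layer's weights at a time."""
    for path in shard_paths:
        weights = json.loads(path.read_text())  # step 2: load layer i from disk
        hidden = apply_layer(weights, hidden)   # compute layer i
        del weights                             # step 3: release before layer i+1
    return hidden


if __name__ == "__main__":
    toy_layers = [
        [[1.0, 0.0], [0.0, 1.0]],   # layer 0: identity
        [[2.0, 0.0], [0.0, -1.0]],  # layer 1: scale and negate
    ]
    with tempfile.TemporaryDirectory() as shard_dir:
        shards = save_layer_shards(toy_layers, shard_dir)
        print(layerwise_forward(shards, [1.0, 2.0]))  # [2.0, 0.0]
```

Peak memory in this loop is bounded by the largest single layer rather than by the whole model, which is the property the numbered steps rely on.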

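This explains why “running a 70B model on 4GB VRAM” is possible: at 16-bit precision (2 bytes per parameter) the 70B weights total about 140GB, so each of the 80 transformer layers holds roughly 140GB / 80 ≈ 1.75GB. Adding about 1GB for the KV cache gives roughly 2.75GB, which fits within 4GB of VRAM.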
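The same estimate can be written as a short calculation. This is a rough sketch: the function names are hypothetical, parameters are assumed to be spread evenly across layers (embeddings and the output head are ignored), and the 1GB KV cache is the figure used in this guide rather than a measured value.

```python
def per_layer_weight_gb(total_params, num_layers, bytes_per_param=2.0):
    """Approximate weight size of one transformer layer in GB (1 GB = 1e9 bytes)."""
    return total_params * bytes_per_param / num_layers / 1e9


def peak_vram_gb(total_params, num_layers, kv_cache_gb, bytes_per_param=2.0):
    """One resident layer plus the KV cache, in GB."""
    return per_layer_weight_gb(total_params, num_layers, bytes_per_param) + kv_cache_gb


print(per_layer_weight_gb(70e9, 80))            # 1.75
print(peak_vram_gb(70e9, 80, kv_cache_gb=1.0))  # 2.75
```

Applied to Llama 3.1 405B, which has 126 layers, the same function gives about 6.4GB per layer at 16-bit precision, which is consistent with that model needing the 8GB tier.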

2. Block-wise Quantization

Block-wise quantization, based on the approach described in arXiv:2212.09720, is the headline feature added in AirLLM 2.0.

model = AutoModel.from_pretrained(
    "garage-bAInd/Platypus2-70B-instruct",
    compression="4bit",  # or "8bit"
)
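To make the term concrete, here is a toy version of block-wise quantization in plain Python. It is a simplified sketch and not what AirLLM or bitsandbytes actually run: it assumes symmetric absmax rounding to signed integers and a tiny block size, whereas real libraries use different 4-bit data types and larger blocks. The function names are hypothetical.

```python
def quantize_blockwise(weights, block_size=4, bits=4):
    """Quantize a flat list of floats, with one absmax scale per block."""
    qmax = 2 ** (bits - 1) - 1  # 7 for signed 4-bit codes
    codes, scales = [], []
    for start in range(0, len(weights), block_size):
        block = weights[start:start + block_size]
        scale = max(abs(w) for w in block) or 1.0  # all-zero block: avoid dividing by 0
        scales.append(scale)
        codes.extend(round(w / scale * qmax) for w in block)
    return codes, scales


def dequantize_blockwise(codes, scales, block_size=4, bits=4):
    """Map integer codes back to floats using each block's scale."""
    qmax = 2 ** (bits - 1) - 1
    return [c / qmax * scales[i // block_size] for i, c in enumerate(codes)]


weights = [0.1, -0.3, 0.05, 0.4, 8.0, -1.0, 2.0, 0.5]  # second block holds an outlier
codes, scales = quantize_blockwise(weights)
restored = dequantize_blockwise(codes, scales)
print(scales)  # [0.4, 8.0]
```

Because each block carries its own scale, the outlier 8.0 only coarsens the second block; the first block is still quantized against a scale of 0.4. No calibration data is needed, since the scales come directly from the weights.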

Key differences from standard quantization:

Approach | Quantization Scope | Inference Acceleration | Accuracy Loss | Adaptation Difficulty
Standard INT8/INT4 Quantization | Weights + Activations | High | Medium-High (sensitive to outliers) | High (requires calibration)
AirLLM Block-wise Quantization | Weights Only | Medium (3x acceleration in disk loading) | Low | Zero (calibration-free)

Design philosophy: The bottleneck is disk I/O, not computation. Compressing only the weights therefore cuts the data read per layer and yields up to a 3x speedup, while leaving activations untouched keeps activation outliers from degrading accuracy.
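A back-of-the-envelope calculation shows where the I/O saving comes from. This sketch assumes every layer is read from storage once per generated token and ignores the small overhead of per-block scale factors; the function name is hypothetical.

```python
def weights_read_per_token_gb(total_params, bits_per_weight):
    """Weight data streamed from disk for one forward pass, in GB (1 GB = 1e9 bytes)."""
    return total_params * bits_per_weight / 8 / 1e9


fp16_gb = weights_read_per_token_gb(70e9, 16)
int8_gb = weights_read_per_token_gb(70e9, 8)
int4_gb = weights_read_per_token_gb(70e9, 4)

print(fp16_gb, int8_gb, int4_gb)  # 140.0 70.0 35.0
print(fp16_gb / int4_gb)          # 4.0
```

The 4x reduction in bytes is an upper bound on the I/O speedup from 4-bit compression. The roughly 3x figure quoted above sits below that bound, which is plausible once scale factors and dequantization time are counted.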

3. Prefetching

model = AutoModel.from_pretrained(
    "garage-bAInd/Platypus2-70B-instruct",
    prefetching=True,  # Default is True
)

This creates a pipeline overlap between “loading layer $i+1$” and “computing layer $i$,” resulting in approximately a 10% speed increase.
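The overlap can be illustrated with a background thread that fetches the next layer while the current one is being computed. This is a generic sketch of the pattern under assumed names (prefetching_forward, load_layer, compute); it does not reproduce AirLLM's internals.

```python
from concurrent.futures import ThreadPoolExecutor


def prefetching_forward(load_layer, compute, num_layers, hidden):
    """Compute layer i on the main thread while a worker loads layer i+1."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(load_layer, 0)  # start loading the first layer
        for i in range(num_layers):
            weights = pending.result()        # block until layer i is available
            if i + 1 < num_layers:
                pending = pool.submit(load_layer, i + 1)  # begin loading layer i+1
            hidden = compute(weights, hidden)  # runs while the next load is in flight
    return hidden


# Toy usage: "loading" layer i returns the number i + 1, "compute" multiplies by it.
result = prefetching_forward(lambda i: i + 1, lambda w, h: h * w, num_layers=4, hidden=1)
print(result)  # 24
```

The gain from this pattern is capped by the shorter of the two stages: when loading takes far longer than computing, only the compute time is hidden, which fits the modest improvement quoted above.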

4. MacOS / Apple Silicon Support

# Native support for Apple Silicon (M1/M2/M3/M4)
# Requires installing mlx and torch first
pip install mlx torch

See run_on_macos.ipynb for details.

Configuration Options Quick Reference

Parameter | Purpose | Default Value
compression | "4bit" / "8bit" / None | None
profiling_mode | Print time taken per layer | False
layer_shards_saving_path | Path to save split model shards | HF cache
hf_token | Used for pulling gated models | 
prefetching | Load-compute pipeline overlap | True
delete_original | Delete original model after splitting, saving half the disk space | False

delete_original=True is a lifesaver when disk space is tight—the HF download will be deleted, keeping only the split version (saving ~50% disk).
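The disk trade-off can be estimated ahead of time. The helper below is a hypothetical sketch: it assumes the Hugging Face download is stored at 16 bits per weight and that the split shards are stored at the chosen compression width, and it ignores tokenizer files and other small artifacts.

```python
def disk_footprint_gb(total_params, shard_bits=16, original_bits=16, delete_original=False):
    """Estimated steady-state disk usage in GB (1 GB = 1e9 bytes) after splitting."""
    shards_gb = total_params * shard_bits / 8 / 1e9
    original_gb = 0.0 if delete_original else total_params * original_bits / 8 / 1e9
    return shards_gb + original_gb


print(disk_footprint_gb(70e9))                        # 280.0
print(disk_footprint_gb(70e9, delete_original=True))  # 140.0
print(disk_footprint_gb(70e9, shard_bits=4))          # 175.0
```

For an uncompressed 70B model, dropping the original halves the footprint from about 280GB to about 140GB, matching the ~50% saving described above. Both copies may still coexist while splitting is in progress, so the larger number is the safer one to plan for.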

Performance Intuition

Scenario | Hardware | Approx. Throughput | Notes
70B Llama2 / Llama3 | Single 4GB GPU | ~4-6 s / 20 tokens | Layer swapping is the bottleneck
70B + 4bit | Single 4GB GPU | ~1.5-2 s / 20 tokens | Block-wise quantization removes disk I/O bottleneck
405B Llama3.1 + 4bit | Single 8GB GPU | ~10-15 s / 20 tokens | Due to the sheer size of 405B
70B Mac M-series | 32GB Unified Memory | Similar to single GPU | Apple Silicon is suitable for individual developers
70B CPU only | Multi-core x86 | Extremely slow | Temporary solution, not recommended

Do not use AirLLM as a production inference server—its design goal is “to make it run,” not “high QPS.”

Typical Scenarios

Scenario A: Individual Developer Testing 70B Locally

# 1. Requires 4GB GPU + 256GB Disk (HF cache)
pip install airllm

# 2. Run
python -c "
from airllm import AutoModel
m = AutoModel.from_pretrained('garage-bAInd/Platypus2-70B-instruct', compression='4bit')
"
# First run downloads and splits the model (~1-2 hours)
# Subsequent launches use the split version directly

Scenario B: Testing Llama3 on MacBook M3

from airllm import AutoModel
m = AutoModel.from_pretrained("meta-llama/Meta-Llama-3-70B-Instruct", compression="4bit")
# ... Apple Silicon uses unified memory; the 4-bit weights total approx. 35GB (70B parameters x 0.5 bytes)

Scenario C: Edge / Embedded Deployment

Store the sharded, 4-bit-quantized checkpoint of a 70B model on a Jetson Orin (32GB shared memory). With prefetching=True, short interactive exchanges become feasible on an edge device, although latency remains too high for real-time chat (see Boundaries and Blind Spots).

Scenario D: Research / Evaluation

# profiling_mode prints time taken per layer
model = AutoModel.from_pretrained(
    "...",
    profiling_mode=True,
    compression="4bit",
)
# Output: layer 0: 0.3s, layer 1: 0.28s, ...
# Helps identify GPU or IO bottlenecks

Boundaries and Blind Spots

  • Low Throughput: Swapping in a 70B model layer-by-layer takes 4-6 seconds to generate 20 tokens; not suitable for real-time chat.
  • Large Disk Footprint: A split 70B model is ~140GB; a split 405B model is ~800GB.
  • No Support for Batch Size > 1: Layer-wise loading assumes batch=1.
  • KV Cache Cannot Be Offloaded: Long contexts combined with large models will exhaust VRAM.
  • MacOS Supports Apple Silicon Only: Intel Macs are not supported.
  • Quantization Does Not Compress Activations: Only the weights are compressed, so the gain comes from reduced disk I/O; it is not comparable to the compute-side speedups of vLLM / TensorRT-LLM.
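The KV cache limit can be quantified with the standard size formula for a transformer KV cache. In this sketch the Llama 2 70B shape (80 layers, 8 key/value heads under grouped-query attention, head dimension 128) comes from that model's published configuration rather than from this guide, and the function name is hypothetical.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   bytes_per_value=2, batch_size=1):
    """Bytes needed to hold keys and values for every layer and token."""
    # Factor 2: one key tensor and one value tensor per layer.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value * batch_size


per_token = kv_cache_bytes(80, 8, 128, seq_len=1)
at_4k = kv_cache_bytes(80, 8, 128, seq_len=4096)
print(per_token)    # 327680
print(at_4k / 1e9)  # about 1.34 (GB)
```

Under these assumptions a 4,096-token context already needs about 1.34GB at 16-bit precision, and the cost grows linearly with context length, so long prompts consume the headroom left after the resident layer.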

Comparison with Similar Tools

Tool | Core Approach | Run 70B on 4GB | Throughput | Ease of Use | Suitable Scenarios
AirLLM | Layer-wise swapping + Block-wise quantization | ✅ | Low | High (one line of code) | Edge / Individual / Research
vLLM | PagedAttention + Continuous batching | ❌ (requires full model) | Very High | Medium | Production Services
TensorRT-LLM | Compilation optimization + Quantization |  | Very High | Low | NVIDIA Production
llama.cpp | GGUF quantization + CPU/GPU hybrid | ✅ (requires GGUF) | Medium | Medium | Cross-platform
Ollama | Wrapper around llama.cpp |  | Medium | Very High | Local Chatting
SGLang | RadixAttention + Structured Generation |  | Very High | Medium | High QPS Services

AirLLM’s niche is the “run large models on 4GB-8GB VRAM” segment: llama.cpp / Ollama are the more common choice for smaller models (roughly 13B and under) on such hardware, vLLM / TensorRT-LLM target production services, and AirLLM is for “my GPU is small but I want to run 70B/405B.”

Frequently Asked Questions (FAQ)

1. MetadataIncompleteBuffer

safetensors_rust.SafetensorError: Error while deserializing header: MetadataIncompleteBuffer

This error usually means the disk filled up while the model was being split, a step that consumes significant space. Run du -sh ~/.cache/huggingface to check usage, free up or expand storage, delete the partially written cache, and rerun.
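A quick pre-flight check can catch this before a long download starts. The sketch below uses only the Python standard library; the helper names are hypothetical, and the required size is something you estimate yourself (for example about 140GB for 16-bit 70B shards).

```python
import shutil
from pathlib import Path


def nearest_existing(path):
    """Walk up from path until a directory that exists is found."""
    p = Path(path).expanduser().resolve()
    while not p.exists():
        p = p.parent
    return p


def has_free_space(path, required_gb):
    """True if the filesystem holding path has at least required_gb free (1 GB = 1e9 bytes)."""
    free_gb = shutil.disk_usage(nearest_existing(path)).free / 1e9
    return free_gb >= required_gb


if not has_free_space("~/.cache/huggingface", required_gb=140):
    print("Not enough free disk space for 16-bit 70B shards")
```

If layer_shards_saving_path points at a different drive, run the check against that path instead.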

2. ValueError: max() arg is an empty sequence

You may have used AirLLMLlama2 to load QWen or ChatGLM models. Use AutoModel consistently:

from airllm import AutoModel  # Do not use AirLLMLlama2
m = AutoModel.from_pretrained("Qwen/Qwen-7B")

3. 401 Client Error: ... is gated

Some models are gated (e.g., meta-llama/Llama-2) and require a Hugging Face token:

m = AutoModel.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    hf_token="hf_xxx",
)

4. ValueError: Asking to pad but the tokenizer does not have a padding token

The tokenizer most likely has no padding token defined. Turn off padding in the tokenizer call:

input_tokens = model.tokenizer(
    input_text,
    return_tensors="pt",
    return_attention_mask=False,
    truncation=True,
    max_length=MAX_LENGTH,
    padding=False,  # Disable padding
)

Adoption Guidelines

Suitable For

  • Users with 4-8GB of VRAM or modest unified memory (Apple M1/M2/M3, entry-level RTX 3060/4060 variants, Jetson Orin) who want to run 70B/405B models
  • Personal research, paper experiments, or edge demos
  • Those wanting to evaluate large models while preserving precision (no quantization, distillation, or pruning)

Not Suitable For

  • Production-level QPS requirements (>10 req/s) — use vLLM / TensorRT-LLM instead
  • Users with 24GB+ GPUs (e.g., RTX 4090/A5000) running models that already fit in VRAM: loading them directly with Hugging Face transformers (from_pretrained) is simpler
  • Workloads requiring batch size > 1 — use vLLM / SGLang

Implementation Steps

  1. Install and test first: Run pip install airllm and try Platypus2-70B to get a feel for it.
  2. Enable 4-bit acceleration: Use compression="4bit" to observe performance gains.
  3. Upgrade to 405B models: With 8GB VRAM + 4-bit compression, run Meta-Llama-3.1-405B (provided disk space allows).
  4. Use delete_original=True: When disk space is tight, keep only the sharded versions.

One-Sentence Summary

AirLLM is a strong option in the niche of “running ultra-large models on low VRAM”: it can run 70B on 4GB and 405B on 8GB, with quantization optional rather than required, so full precision can be preserved. The trade-offs are lower throughput, high disk usage, and unsuitability for production. It is a pragmatic choice for individual researchers and edge deployment scenarios.

References

  1. AirLLM on GitHub
  2. AirLLM on PyPI
  3. Block-wise quantization paper (arXiv:2212.09720)
  4. macOS example notebook