AirLLM Complete Guide: Running a 70B Model with 4GB VRAM

Models & Benchmarks · Published: Jun 04, 2026 · Elena Volkov · ~8 min read

Author Info

Machine Learning Research Editor

Ph.D. Machine Learning (ETH Zürich); published work on efficient training and evaluation

Elena explains model architecture, training economics, and benchmark design for a technical audience. She reads primary papers and official technical reports, then summarizes assumptions, datasets, and known failure modes. She avoids hype by pairing capability claims with reproducibility notes.


Core Assessment

AirLLM (repository lyogavin/airllm) is one of the few LLM inference libraries that can “run a 70B model on a single 4GB card without quantization, distillation, or pruning.” Its core approach is straightforward: split the model into layers and load them one at a time. Only the parameters of the layer currently being computed are held in VRAM, while the remaining layers stay on NVMe/SSD storage.

Its appeal rests on three capabilities:

70B on 4GB VRAM: No quantization (preserves accuracy), no distillation (preserves capability), no pruning (preserves generality); it relies purely on swapping layers in and out over time.
405B Llama3.1 on 8GB VRAM: Combined with 4-bit or 8-bit compression, AirLLM brings a model that normally calls for a multi-GPU server (“8x H100”) down to a “single 8GB GPU.”
Block-wise Quantization (4bit/8bit), up to 3x Speedup: Optional block-wise quantization compresses only the weights, not the activations, which shrinks the amount of data read from disk and so relieves the main bottleneck.

Important Premise: AirLLM is not an acceleration tool. Layer-by-layer swapping keeps throughput low (this guide's reference figure for 70B is approximately 4-6 seconds per 20 tokens, and actual speed depends heavily on disk read bandwidth, because every generated token requires loading every layer). Its positioning is “can run on a single card / edge device,” rather than “high QPS service.”

Project Map

Dimension	Key Information
Repository	lyogavin/airllm
PyPI	pypi.org/project/airllm
License	Apache-2.0
Stars / Forks	19,070 / 2,086
Supported Models	Llama / Llama2 / Llama3 / Llama3.1 (including 405B) / QWen / ChatGLM / Baichuan / Mistral / InternLM / Mixtral

Version Milestones

Time	Event
2023-11	AirLLM Initial Release
2023-12	v2.0: safetensors support, block-wise quantization 3x speedup
2023-12	Full support for ChatGLM / QWen / Baichuan / Mistral / InternLM
2023-12	v2.6: `AutoModel` automatic model type recognition
2023-12	v2.7: Mixtral support
2023-12	v2.8.2: Running 70B on macOS
2024-04	Native support for Llama3 70B on a single 4GB card
2024-07	Llama3.1 405B on 8GB VRAM + 8bit/4bit quantization
2024-08	v2.10: CPU inference support
2024-08	v2.11: Qwen2.5 support

Quick Start

Installation

pip install airllm

If you want to use quantization for acceleration:

pip install -U bitsandbytes

Running 70B in a Few Lines of Code

from airllm import AutoModel

MAX_LENGTH = 128
model = AutoModel.from_pretrained("garage-bAInd/Platypus2-70B-instruct")

input_text = ["What is the capital of United States?"]
input_tokens = model.tokenizer(
    input_text,
    return_tensors="pt",
    return_attention_mask=False,
    truncation=True,
    max_length=MAX_LENGTH,
    padding=False,
)

generation_output = model.generate(
    input_tokens["input_ids"].cuda(),
    max_new_tokens=20,
    use_cache=True,
    return_dict_in_generate=True,
)
print(model.tokenizer.decode(generation_output.sequences[0]))

The first run automatically splits the model layer by layer and saves the shards to disk, so make sure the HF cache directory has enough free space (a 70B model in FP16 is about 140GB).

405B Llama3.1 (8GB Single GPU + 4bit)

from airllm import AutoModel

model = AutoModel.from_pretrained(
    "meta-llama/Meta-Llama-3.1-405B-Instruct",
    compression="4bit",  # or "8bit"
)
# ... same tokenizer + generate as above

Core Mechanisms

1. Layer-wise Loading

AirLLM does not load the entire model into VRAM at once. The process is as follows:

Split the model into $N$ per-layer checkpoint shards, one per transformer block.
During inference, read layer $i$ from NVMe/SSD into GPU VRAM only when layer $i$ is about to be computed.
After computing layer $i$, and before computing layer $i+1$, release the memory held by layer $i$.
The KV cache stays resident in VRAM across layers, so its memory cost cannot be avoided.
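
The steps above can be sketched with a toy model in plain Python. This is an illustration, not AirLLM's code: each "layer" is just a scale/bias pair, `split_into_shards` and `run_layerwise` are hypothetical names, and pickle files in a temporary directory stand in for per-layer checkpoints on NVMe/SSD.

```python
import pickle
import tempfile
from pathlib import Path


def split_into_shards(layers, shard_dir):
    """Write each layer to its own file so it can later be loaded on its own."""
    paths = []
    for i, layer in enumerate(layers):
        path = Path(shard_dir) / f"layer_{i}.pkl"
        path.write_bytes(pickle.dumps(layer))
        paths.append(path)
    return paths


def run_layerwise(shard_paths, x):
    """Load one layer, apply it, release it, then move on to the next."""
    for path in shard_paths:
        layer = pickle.loads(path.read_bytes())  # read layer i from disk
        x = layer["scale"] * x + layer["bias"]   # stand-in for the real layer computation
        del layer                                # drop layer i before loading layer i+1
    return x


# Toy "model": three layers, each a scale and a bias (assumed values).
toy_layers = [
    {"scale": 2.0, "bias": 1.0},
    {"scale": 0.5, "bias": -1.0},
    {"scale": 3.0, "bias": 0.0},
]

with tempfile.TemporaryDirectory() as shard_dir:
    shard_paths = split_into_shards(toy_layers, shard_dir)
    streamed = run_layerwise(shard_paths, x=4.0)

# Reference: the same computation with every layer held in memory at once.
reference = 4.0
for layer in toy_layers:
    reference = layer["scale"] * reference + layer["bias"]

print(streamed, reference)
```

Both paths produce the same result; the difference is that the streamed version never holds more than one layer's weights at a time.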

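Layer-wise loading is what makes “running a 70B model on 4GB VRAM” possible: a 70B model in FP16 is about 140GB, so with 80 transformer layers a single layer is 140GB / 80 ≈ 1.75GB. Adding roughly 1GB for the KV cache gives about 2.75GB, which fits within 4GB of VRAM.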
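
The memory budget can be checked in a few lines. The figures below are assumptions rather than values reported by AirLLM: FP16 weights (2 bytes per parameter), 80 transformer blocks, and the Llama-2-70B attention shape (8 KV heads of dimension 128 under grouped-query attention). The estimate ignores embeddings, activations, and framework overhead, and the function names are illustrative.

```python
GB = 1e9  # decimal gigabytes


def per_layer_weight_gb(total_params, num_layers, bytes_per_param=2):
    """Average weight size of one transformer block."""
    return total_params * bytes_per_param / num_layers / GB


def kv_cache_gb(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """KV cache size: one key and one value vector per token, layer, and KV head."""
    bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return bytes_per_token * num_tokens / GB


layer_gb = per_layer_weight_gb(total_params=70e9, num_layers=80)
cache_gb = kv_cache_gb(num_tokens=2048, num_layers=80, num_kv_heads=8, head_dim=128)
total_gb = layer_gb + cache_gb

print(f"one layer: {layer_gb:.2f} GB, KV cache: {cache_gb:.2f} GB, total: {total_gb:.2f} GB")
```

Under these assumptions a 2,048-token context needs well under 1GB of KV cache, so the 1GB allowance in the text leaves some headroom; longer contexts grow the cache linearly.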

2. Block-wise Quantization

Block-wise quantization, based on the approach described in arXiv:2212.09720, is the headline feature added in AirLLM 2.0.

model = AutoModel.from_pretrained(
    "garage-bAInd/Platypus2-70B-instruct",
    compression="4bit",  # or "8bit"
)

Key differences from standard quantization:

Approach	Quantization Scope	Inference Acceleration	Accuracy Loss	Adaptation Difficulty
Standard INT8/INT4 Quantization	Weights + Activations	High	Medium-High (sensitive to outliers)	High (requires calibration)
AirLLM Block-wise Quantization	Weights Only	Medium (up to 3x acceleration in disk loading)	Low	Zero (calibration-free)

Design philosophy: The bottleneck is disk I/O, not computation. Compressing only the weights therefore cuts loading time and yields up to a 3x speedup, while leaving activations untouched avoids the accuracy loss that activation outliers can cause.
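
The idea of block-wise weight quantization can be shown in plain Python. This is a minimal sketch using a simple absmax int8 scheme, not the kernel that AirLLM or bitsandbytes actually uses; the function names, the block size, and the sample weights are all assumptions made for illustration.

```python
def quantize_blockwise(weights, block_size=4):
    """Quantize to int8 codes, storing one absmax-derived scale per block."""
    blocks = []
    for start in range(0, len(weights), block_size):
        block = weights[start:start + block_size]
        absmax = max(abs(w) for w in block) or 1.0  # avoid dividing by zero
        scale = absmax / 127.0
        codes = [round(w / scale) for w in block]   # integers in [-127, 127]
        blocks.append((scale, codes))
    return blocks


def dequantize_blockwise(blocks):
    """Rebuild approximate weights from (scale, codes) pairs."""
    return [scale * code for scale, codes in blocks for code in codes]


# The second half contains an outlier (8.0); the first half holds small values.
weights = [0.02, -0.01, 0.03, 0.015, 8.0, -0.5, 0.25, 1.0]

blockwise = dequantize_blockwise(quantize_blockwise(weights, block_size=4))
per_tensor = dequantize_blockwise(quantize_blockwise(weights, block_size=len(weights)))

# Total error on the four small weights under each scheme.
err_blockwise = sum(abs(a - b) for a, b in zip(weights[:4], blockwise[:4]))
err_per_tensor = sum(abs(a - b) for a, b in zip(weights[:4], per_tensor[:4]))

print(f"block-wise error: {err_blockwise:.6f}, per-tensor error: {err_per_tensor:.6f}")
```

With a single scale for the whole tensor, the outlier forces the small weights to round to zero; giving each block its own scale keeps the outlier's effect local, which is why no calibration step is needed.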

3. Prefetching

model = AutoModel.from_pretrained(
    "garage-bAInd/Platypus2-70B-instruct",
    prefetching=True,  # Default is True
)

This creates a pipeline overlap between “loading layer $i+1$” and “computing layer $i$,” resulting in approximately a 10% speed increase.
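
The overlap can be illustrated with a background thread. This is a sketch of the general load/compute pipelining pattern, not AirLLM's implementation: `load_layer` and `compute` are hypothetical stand-ins that sleep to simulate disk and GPU time, and the durations are assumed.

```python
import time
from concurrent.futures import ThreadPoolExecutor

LOAD_SECONDS = 0.02     # simulated disk read time per layer (assumed)
COMPUTE_SECONDS = 0.02  # simulated compute time per layer (assumed)


def load_layer(i):
    time.sleep(LOAD_SECONDS)
    return i + 1  # the toy "weights" are a single multiplier


def compute(weights, x):
    time.sleep(COMPUTE_SECONDS)
    return x * weights


def run_sequential(num_layers, x):
    for i in range(num_layers):
        x = compute(load_layer(i), x)
    return x


def run_prefetched(num_layers, x):
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(load_layer, 0)
        for i in range(num_layers):
            weights = pending.result()                    # wait until layer i is loaded
            if i + 1 < num_layers:
                pending = pool.submit(load_layer, i + 1)  # start loading layer i+1
            x = compute(weights, x)                       # runs while layer i+1 loads
    return x


start = time.perf_counter()
sequential = run_sequential(5, 1)
sequential_seconds = time.perf_counter() - start

start = time.perf_counter()
prefetched = run_prefetched(5, 1)
prefetched_seconds = time.perf_counter() - start

print(sequential, prefetched)
print(f"sequential: {sequential_seconds:.2f}s, prefetched: {prefetched_seconds:.2f}s")
```

In this toy the two stages take equal time, so overlapping them removes a large share of the total. When loading dominates, overlap can hide at most the compute time, which is consistent with a modest gain such as the ~10% quoted above.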

4. macOS / Apple Silicon Support

# Native support for Apple Silicon (M1/M2/M3/M4)
# Requires installing mlx and torch first
pip install mlx torch

See run_on_macos.ipynb for details.

Configuration Options Quick Reference

Parameter	Purpose	Default Value
`compression`	Block-wise weight compression: `"4bit"`, `"8bit"`, or `None` (off)	`None`
`profiling_mode`	Print time taken per layer	`False`
`layer_shards_saving_path`	Path to save split model shards	HF cache
`hf_token`	Used for pulling gated models	–
`prefetching`	Load-compute pipeline overlap	`True`
`delete_original`	Delete original model after splitting, saving half the disk space	`False`

delete_original=True is a lifesaver when disk space is tight: the original HF download is deleted after splitting and only the split version is kept (saving ~50% disk).
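
A quick estimate shows how much disk the different settings need once splitting has finished. The helper below is a hypothetical sketch, not part of AirLLM; it assumes the original download is FP16 (2 bytes per parameter) and that 4-bit shards take about 0.5 bytes per parameter, ignoring the small overhead of per-block scales.

```python
def disk_footprint_gb(params_billion, shard_bytes_per_param=2.0,
                      original_bytes_per_param=2.0, delete_original=False):
    """Steady-state disk usage in GB: split shards plus (optionally) the original download."""
    shards_gb = params_billion * shard_bytes_per_param
    original_gb = 0.0 if delete_original else params_billion * original_bytes_per_param
    return shards_gb + original_gb


print(disk_footprint_gb(70))                             # FP16 shards + original
print(disk_footprint_gb(70, delete_original=True))       # FP16 shards only
print(disk_footprint_gb(70, shard_bytes_per_param=0.5))  # 4-bit shards + original
print(disk_footprint_gb(405, delete_original=True))      # 405B FP16 shards only
```

During splitting itself both copies exist at the same time, so the peak requirement is the larger figure even when delete_original=True is set.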

Performance Intuition

Scenario	Hardware	Approx. Latency	Notes
70B Llama2 / Llama3	Single 4GB GPU	~4-6 s / 20 tokens	Layer swapping is the bottleneck
70B + 4bit	Single 4GB GPU	~1.5-2 s / 20 tokens	Block-wise quantization reduces the disk I/O bottleneck
405B Llama3.1 + 4bit	Single 8GB GPU	~10-15 s / 20 tokens	Due to the sheer size of 405B
70B Mac M-series	32GB Unified Memory	Similar to single GPU	Apple Silicon is suitable for individual developers
70B CPU only	Multi-core x86	Extremely slow	Temporary solution, not recommended

Do not use AirLLM as a production inference server—its design goal is “to make it run,” not “high QPS.”
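
A rough I/O bound helps when judging what to expect on a particular machine. The sketch below assumes that every layer is read from storage once per generated token (which follows from the layer-wise design) and that nothing is served from the operating system's file cache. The bandwidth values are hypothetical examples, not measurements, and `io_bound_seconds_per_token` is an illustrative helper rather than part of AirLLM.

```python
def io_bound_seconds_per_token(weights_gb, read_gb_per_s):
    """Lower bound on time per token when all weights are read from disk each pass."""
    return weights_gb / read_gb_per_s


# Assumed sustained read bandwidths in GB/s (hypothetical, check your own drive).
drives = {"NVMe SSD": 3.5, "SATA SSD": 0.5}

# Weight sizes in GB: 70B at 2 bytes/param and at roughly 0.5 bytes/param.
models = {"70B FP16": 140.0, "70B 4bit": 35.0}

for model_name, weights_gb in models.items():
    for drive_name, bandwidth in drives.items():
        seconds = io_bound_seconds_per_token(weights_gb, bandwidth)
        print(f"{model_name} on {drive_name}: at least {seconds:.0f} s per token")
```

Measured speed can differ substantially from any table, including the one above, depending on storage bandwidth and on how much of the model the operating system keeps cached in RAM; profiling_mode reports what a given machine actually does.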

Typical Scenarios

Scenario A: Individual Developer Testing 70B Locally

# 1. Requires 4GB GPU + 256GB Disk (HF cache)
pip install airllm

# 2. Run
python -c "
from airllm import AutoModel
m = AutoModel.from_pretrained('garage-bAInd/Platypus2-70B-instruct', compression='4bit')
"
# First run downloads and splits the model (~1-2 hours)
# Subsequent launches use the split version directly

Scenario B: Testing Llama3 on MacBook M3

from airllm import AutoModel
m = AutoModel.from_pretrained("meta-llama/Meta-Llama-3-70B-Instruct", compression="4bit")
# ... Apple Silicon uses unified memory; after 4-bit compression the weights total approx. 35GB

Scenario C: Edge / Embedded Deployment

Copy the sharded, 4-bit-compressed checkpoint of a 70B model onto a Jetson Orin (32GB shared memory). With prefetching=True, this is enough for a conversational demo on an edge device, within the low throughput described under Performance Intuition.

Scenario D: Research / Evaluation

# profiling_mode prints time taken per layer
model = AutoModel.from_pretrained(
    "...",
    profiling_mode=True,
    compression="4bit",
)
# Example output (illustrative values): layer 0: 0.3s, layer 1: 0.28s, ...
# Helps identify GPU or IO bottlenecks

Boundaries and Blind Spots

Low Throughput: Swapping a 70B model in layer by layer takes 4-6 seconds to generate 20 tokens, which is not suitable for real-time chat.
Large Disk Footprint: A split 70B model is ~140GB; a split 405B model is ~800GB.
No Batch Support > 1: Layer-wise loading assumes batch=1.
KV Cache Cannot Be Offloaded: Long contexts combined with large models will exhaust VRAM.
macOS Supports Apple Silicon Only: Intel Macs are not supported.
Quantization Does Not Compress Activations: It speeds up weight loading rather than computation, so its gains are not directly comparable to the compute-level acceleration of vLLM / TensorRT-LLM.

Comparison with Similar Tools

Tool	Core Approach	Run 70B on 4GB	Throughput	Ease of Use	Suitable Scenarios
AirLLM	Layer-wise swapping + Block-wise quantization	✅	Low	High (one line of code)	Edge / Individual / Research
vLLM	PagedAttention + Continuous batching	❌ (requires full model)	Very High	Medium	Production Services
TensorRT-LLM	Compilation optimization + Quantization	❌	Very High	Low	NVIDIA Production
llama.cpp	GGUF quantization + CPU/GPU hybrid	✅ (requires GGUF)	Medium	Medium	Cross-platform
Ollama	Wrapper around llama.cpp	✅	Medium	Very High	Local Chatting
SGLang	RadixAttention + Structured Generation	❌	Very High	Medium	High QPS Services

AirLLM’s true niche is the “run large models on 4GB-8GB VRAM” segment: llama.cpp / Ollama cover local use with GGUF-quantized models and CPU/GPU hybrid execution, vLLM / TensorRT-LLM target production services, and AirLLM is for “my GPU is small but I want to run 70B/405B.”

Frequently Asked Questions (FAQ)

1. `MetadataIncompleteBuffer`

safetensors_rust.SafetensorError: Error while deserializing header: MetadataIncompleteBuffer

This error usually means the disk is full, since splitting the model consumes significant disk space. Run du -sh ~/.cache/huggingface to check usage, free up or expand storage if necessary, then delete the incomplete cache and rerun.
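
Free space can also be checked from Python before starting a long split. This is a small sketch using the standard library's `shutil.disk_usage`; the helper names and the 280GB threshold (FP16 download plus FP16 shards for a 70B model) are assumptions for illustration.

```python
import shutil
from pathlib import Path


def free_space_gb(path):
    """Free space, in decimal GB, on the filesystem that contains `path`."""
    return shutil.disk_usage(path).free / 1e9


def has_room_for(path, required_gb):
    """True if the filesystem containing `path` has at least `required_gb` free."""
    return free_space_gb(path) >= required_gb


# Default HF cache location; fall back to the current directory if it does not exist yet.
cache_dir = Path.home() / ".cache" / "huggingface"
check_path = cache_dir if cache_dir.exists() else Path(".")

required_gb = 280  # assumed: 140GB download + 140GB split shards
if has_room_for(check_path, required_gb):
    print(f"OK: {free_space_gb(check_path):.0f} GB free at {check_path}")
else:
    print(f"Not enough space: {free_space_gb(check_path):.0f} GB free, {required_gb} GB needed")
```

If the cache lives on a different drive, pass that location instead, or point layer_shards_saving_path at a disk with more room.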

2. `ValueError: max() arg is an empty sequence`

You may have used AirLLMLlama2 to load a QWen or ChatGLM model. Use AutoModel instead, which detects the model type automatically:

from airllm import AutoModel  # Do not use AirLLMLlama2
m = AutoModel.from_pretrained("Qwen/Qwen-7B")

3. `401 Client Error: ... is gated`

Some models are gated (e.g., meta-llama/Llama-2) and require a Hugging Face token:

m = AutoModel.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    hf_token="hf_xxx",
)

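4. `ValueError: Asking to pad but the tokenizer does not have a padding token`

Some tokenizers do not define a padding token. Either set one or turn padding off in the tokenizer call:

input_tokens = model.tokenizer(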
    input_text,
    return_tensors="pt",
    return_attention_mask=False,
    truncation=True,
    max_length=MAX_LENGTH,
    padding=False,  # Disable padding
)

Adoption Guidelines

Suitable For

Users with 4-8GB of VRAM or modest unified memory (low-VRAM RTX 3060/4060-class cards, Apple M1/M2/M3, Jetson Orin) who want to run 70B/405B models
Personal research, paper experiments, or edge demos
Those wanting to evaluate large models while preserving precision (no quantization, distillation, or pruning)

Not Suitable For

Production-level QPS requirements (>10 req/s) — use vLLM / TensorRT-LLM instead
Users with 24GB+ GPUs (e.g., RTX 4090/A5000) running models that already fit in VRAM: loading them directly with transformers (`AutoModelForCausalLM.from_pretrained`) is simpler
Workloads requiring batch size > 1 — use vLLM / SGLang

Implementation Steps

Install and test first: Run pip install airllm and try Platypus2-70B to get a feel for it.
Enable 4-bit acceleration: Use compression="4bit" to observe performance gains.
Upgrade to 405B models: With 8GB VRAM + 4-bit compression, run Meta-Llama-3.1-405B (provided disk space allows).
Use delete_original=True: When disk space is tight, keep only the sharded versions.

One-Sentence Summary

AirLLM is a strong option in the niche of “running ultra-large models on low VRAM”: it runs 70B on 4GB without quantization, preserving precision, and 405B on 8GB with 4-bit compression. The trade-off is lower throughput, high disk usage, and unsuitability for production. It is a pragmatic choice for individual researchers and edge deployment scenarios.

References

AirLLM on GitHub
AirLLM on PyPI
Block-wise quantization paper (arXiv:2212.09720)
macOS example notebook
