DeepSeek-V4 Local Deployment: SGLang Delivers Exceptional Performance

Author Info

Priya Sharma

Enterprise AI & Governance Editor

JD (technology policy focus); CIPP/US; former in-house counsel at a cloud provider

Priya writes about regulation, enterprise procurement, and responsible deployment. She separates legal fact from commentary, flags jurisdictional limits, and works with external counsel on high-risk governance topics. Her articles emphasize what changed, who is accountable, and what practitioners should verify locally.

Earlier coverage of DeepSeek-V4 focused on Flash-model smoke tests on dual H20 GPUs and why vLLM deployments are so finicky. This piece takes a different angle: why the V4 release is hard to serve, and how SGLang shipped day-zero support.

What Changed in V4

DeepSeek released two variants at once:

Variant             Total params   Active params   Single-node floor
DeepSeek-V4-Flash   284B           13B             B200 / GB200 / GB300 / 4× H200
DeepSeek-V4-Pro     1.6T           49B             8× B200 / 8× GB200 (2 nodes) / 4× GB300 / 8× H200 (FP4)

Both Instruct checkpoints use FP4 MoE expert weights plus FP8 attention/dense layers. One weight bundle serves Hopper, Blackwell, AMD, and NPU hardware. The models ship under the MIT license, support a 1M-token context, and were pretrained on more than 32T tokens.

The architectural stack has three major pieces:

  • Hybrid sparse attention (CSA + HCA): Every layer combines a 128-token sliding window (SWA) with one of two compression paths—C4 (4:1 compression + top-512 sparsity) or C128 (128:1 compression + dense). At 1M context, V4-Pro needs only 27% of V3.2 FLOPs per token and roughly 10% of the KV cache.
  • mHC (manifold-constrained hyper-connections): Residual paths become a mixture of parallel branches with Sinkhorn-normalized weights, improving gradient flow and representation quality.
  • Native FP4 expert weights: MoE experts run on Blackwell FP4 tensor cores, reducing decode bandwidth bottlenecks on small batches.
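
To make the compression ratios above concrete, here is a rough sketch of how many KV entries each path leaves per layer. It assumes that a C4 layer attends to the sliding window plus at most 512 selected compressed entries, that a C128 layer attends to the window plus every compressed entry, and that "1M" means 1,048,576 tokens. The function names are illustrative and not part of any SGLang or DeepSeek API.

```python
SWA_WINDOW = 128   # sliding-window size from the article
C4_RATIO = 4       # C4 path: 4:1 compression
C4_TOP_K = 512     # C4 path: top-512 sparse selection
C128_RATIO = 128   # C128 path: 128:1 compression, attended densely


def compressed_entries(context_len: int, ratio: int) -> int:
    """Number of compressed KV entries stored for a given context length."""
    return context_len // ratio


def attended_entries(context_len: int, path: str) -> int:
    """Approximate entries one query attends to in a layer (illustrative model)."""
    window = min(SWA_WINDOW, context_len)
    if path == "C4":
        return window + min(C4_TOP_K, compressed_entries(context_len, C4_RATIO))
    if path == "C128":
        return window + compressed_entries(context_len, C128_RATIO)
    raise ValueError(f"unknown path: {path}")


if __name__ == "__main__":
    for n in (1024, 1_048_576):
        print(n, attended_entries(n, "C4"), attended_entries(n, "C128"))
```

Under these assumptions a 1,048,576-token context stores 262,144 C4 entries per layer, which matches the 256K candidates the indexer selects from, yet a single query reads only 512 of them plus the 128-token window.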

V4 also ships a single-layer MTP head for speculative decoding, plus three reasoning modes: Non-think (fast answers), Think High (chain-of-thought), and Think Max (maximum reasoning; ≥384K context recommended).

Figure 4: Hybrid attention scope per V4 layer (example with N=1024).

What SGLang Built Under the Hood

The painful part of hybrid attention is three heterogeneous KV pools plus two compressed-state pools that must stay consistent across prefill, decode, and speculative passes. Classic prefix-cache assumptions no longer hold.

ShadowRadix: prefix cache for hybrid attention

SGLang indexes virtual full-token slots in a radix tree, then projects them into physical pools (SWA / C4 / C128). Compressed-state ring buffers nest inside SWA page indices with address swa_page * ring_size + pos % ring_size. When an SWA page is freed, the ring invalidates automatically—no extra bookkeeping.
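
The ring addressing can be sketched in a few lines. Only the formula swa_page * ring_size + pos % ring_size comes from the text; the helper names and the idea of exposing a page's region as a range are assumptions made for illustration.

```python
def ring_slot(swa_page: int, pos: int, ring_size: int) -> int:
    """Physical slot of token position `pos` inside the ring owned by `swa_page`."""
    return swa_page * ring_size + pos % ring_size


def page_region(swa_page: int, ring_size: int) -> range:
    """All physical slots that belong to one SWA page's ring (hypothetical helper)."""
    return range(swa_page * ring_size, (swa_page + 1) * ring_size)


if __name__ == "__main__":
    # Positions 10 and 18 wrap onto the same slot of an 8-entry ring on page 3.
    print(ring_slot(3, 10, 8), ring_slot(3, 18, 8))
```

Because every slot is derived from the SWA page index, the ring has no allocation table of its own: once the page is released, the whole region returned by page_region is implicitly free.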

Each node tracks full_lock_ref (source + C4/C128 shadows) and swa_lock_ref (sliding window only). When the SWA count hits zero, SWA slots are tombstoned while compressed shadows remain reusable on the tree. A 10K-token request therefore keeps only 128 SWA tokens plus full C4/C128 compression—that compressed KV is what gets reused.
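
The two reference counts can be illustrated with a toy node. This is a minimal sketch, not SGLang's data structure: only the names full_lock_ref and swa_lock_ref come from the text, while the class, its methods, and the use of None as the tombstone are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ShadowNode:
    """Hypothetical radix-tree node; only the two counter names come from the article."""
    swa_slots: Optional[List[int]]   # sliding-window KV slots, None once tombstoned
    c4_slots: List[int]              # compressed C4 shadow slots
    c128_slots: List[int]            # compressed C128 shadow slots
    full_lock_ref: int = 0           # holders of the source + C4/C128 shadows
    swa_lock_ref: int = 0            # holders of the sliding-window slots

    def lock(self) -> None:
        """A running request pins both the full entry and its SWA slots."""
        self.full_lock_ref += 1
        self.swa_lock_ref += 1

    def unlock_swa(self) -> None:
        """Drop one SWA reference; tombstone the SWA slots when none remain."""
        if self.swa_lock_ref <= 0:
            raise RuntimeError("swa_lock_ref underflow")
        self.swa_lock_ref -= 1
        if self.swa_lock_ref == 0:
            self.swa_slots = None

    def unlock_full(self) -> None:
        """Drop one full reference; the compressed shadows stay on the tree."""
        if self.full_lock_ref <= 0:
            raise RuntimeError("full_lock_ref underflow")
        self.full_lock_ref -= 1

    def reusable(self) -> dict:
        """What a later request sharing this prefix can still reuse."""
        return {"swa": self.swa_slots, "c4": self.c4_slots, "c128": self.c128_slots}
```

After lock() followed by unlock_swa(), reusable() reports no SWA slots but still returns the C4 and C128 shadows, which is the state the paragraph above describes for a finished long request.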

Figure 5: ShadowRadix storage layout.

Speculative decoding introduces a subtle hazard: draft tokens land in the ring before verification, so a rejected draft can overwrite slots that are still live when the step is retried. SGLang doubles the ring size in spec mode (C4: 8→16, C128: 128→256), so EAGLE works out of the box.
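
A simplified model shows why doubling the ring helps. The sketch below assumes that position p is stored in slot p % ring_size and that the last live_window committed positions must survive until the drafts are verified; both assumptions, and the function name, are illustrative rather than taken from SGLang.

```python
from typing import List


def clobbered_positions(committed_pos: int, num_draft: int,
                        live_window: int, ring_size: int) -> List[int]:
    """Live positions whose ring slot an unverified draft write would overwrite.

    Simplified model (an assumption, not SGLang's code): position p lives in
    slot p % ring_size, and the last `live_window` committed positions are live.
    """
    first_live = max(0, committed_pos - live_window + 1)
    draft_slots = {(committed_pos + i) % ring_size for i in range(1, num_draft + 1)}
    return [p for p in range(first_live, committed_pos + 1)
            if p % ring_size in draft_slots]


if __name__ == "__main__":
    # C4 ring sizes from the article: 8 normally, 16 under speculative decoding.
    print(clobbered_positions(committed_pos=100, num_draft=4, live_window=8, ring_size=8))
    # prints [93, 94, 95, 96]
    print(clobbered_positions(committed_pos=100, num_draft=4, live_window=8, ring_size=16))
    # prints []
```

In this model the doubled ring stays safe as long as the number of draft tokens does not exceed ring_size minus live_window.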

HiSparse: offload inactive KV to CPU

C4 layers activate only a small top-k of compressed positions each step, so most KV is cold at any moment. HiSparse mirrors the C4 pool on CPU, keeps a small GPU working set, and asynchronously pages data in each step with LRU eviction. On dual B200 with 200K input / 20K output, HiSparse improves peak throughput (the figure below shows the measured gain).
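
The working-set idea can be sketched with a bounded LRU cache in front of a full CPU-side pool. This is a toy, synchronous model: the class name and its fields are hypothetical, a dict stands in for pinned host memory, and the real system pages asynchronously.

```python
from collections import OrderedDict
from typing import Dict, Iterable, List


class LruWorkingSet:
    """Toy model: the full pool lives on the CPU, a bounded LRU cache stands in for GPU memory."""

    def __init__(self, cpu_pool: Dict[int, bytes], gpu_capacity: int) -> None:
        if gpu_capacity <= 0:
            raise ValueError("gpu_capacity must be positive")
        self.cpu_pool = cpu_pool
        self.gpu_capacity = gpu_capacity
        self.gpu_cache: "OrderedDict[int, bytes]" = OrderedDict()
        self.misses = 0

    def fetch(self, indices: Iterable[int]) -> List[bytes]:
        """Return entries for this step's top-k indices, paging in any that are cold."""
        out = []
        for idx in indices:
            if idx in self.gpu_cache:
                self.gpu_cache.move_to_end(idx)          # mark as most recently used
            else:
                self.misses += 1
                self.gpu_cache[idx] = self.cpu_pool[idx]  # page in from the CPU mirror
                if len(self.gpu_cache) > self.gpu_capacity:
                    self.gpu_cache.popitem(last=False)    # evict least recently used
            out.append(self.gpu_cache[idx])
        return out
```

The design bet is that consecutive decode steps select overlapping top-k sets, so the miss counter grows far more slowly than the number of entries touched.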

Figure 6: HiSparse architecture and peak throughput gains.

MTP speculative decoding with in-graph metadata

Hybrid attention metadata is heavy: SWA page indices, shadow maps, compressor/indexer plans, and per-pool write positions. Preparing it eagerly on the scheduler thread inflates speculative launch overhead.

SGLang embeds metadata preparation inside CUDA Graphs. Each replay copies raw batch state into fixed buffers; indexing arithmetic runs in device kernels. Combined with CPU-side overlap scheduling, speculative startup overhead drops sharply.

Figure 7: Decode throughput across context lengths.

With ShadowRadix and in-graph spec metadata, SGLang holds decode throughput nearly flat from 4K tokens through ~900K, close to the 1M window. On B200, throughput falls from 199 to 180 tok/s; on H200, from 266 to 240 tok/s, a drop of under 10% on both. A curve that flat is rare for long-context serving.
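
The "under 10%" figure follows directly from the four throughput numbers quoted above. A quick check (the helper name is only for illustration):

```python
def relative_drop(start: float, end: float) -> float:
    """Fractional throughput loss when going from `start` to `end` tok/s."""
    return (start - end) / start


if __name__ == "__main__":
    for name, start, end in (("B200", 199, 180), ("H200", 266, 240)):
        print(f"{name}: {relative_drop(start, end):.1%}")
    # B200: 9.5%
    # H200: 9.8%
```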

Kernel-level work

Other notable integrations:

  • FlashMLA extensions: SWA and extra attention (C4/C128) in one kernel call with shared metadata.
  • Flash Compressor: Collapses the compression pipeline into one on-chip pass, cutting HBM round trips from five to two (5→2) and reaching ~80% of peak H200 bandwidth, 10×+ over naive PyTorch pipelines.
  • Lightning TopK: At 1M context, the indexer selects top-512 from 256K candidates in ~15µs via cluster-of-8 radix select (vs 100µs+ naive).
  • FlashInfer TRTLLM-Gen MoE: MXFP8 activations × MXFP4 experts on Blackwell FP4 cores.
  • DeepGEMM Mega MoE: Fuses EP dispatch, first FP8×FP4 GEMM, SwiGLU, second GEMM, and EP combine with overlapped NVLink.
  • TileLang mHC kernels (split-K): Relieve the pre-GEMM bottleneck in low-latency decode.
  • DP/TP/CP attention, DeepEP MoE, PD disaggregation: Full parallel and disaggregated serving options.

How to Deploy

SGLang publishes per-platform Docker images:

Hardware             Image
NVIDIA B300          lmsysorg/sglang:deepseek-v4-b300
NVIDIA B200          lmsysorg/sglang:deepseek-v4-blackwell
NVIDIA GB200/GB300   lmsysorg/sglang:deepseek-v4-grace-blackwell
NVIDIA H200          lmsysorg/sglang:deepseek-v4-hopper

Minimal launch template:

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<your-hf-token>" \
    --ipc=host \
    lmsysorg/sglang:deepseek-v4-blackwell \
    sglang serve <use args below>

Use the official interactive command generator for exact flags:

https://docs.sglang.io/cookbook/autoregressive/DeepSeek/DeepSeek-V4#3-1-basic-configuration

Three main recipes:

  • low-latency: MTP steps=3, draft-tokens=4—best at batch size 1.
  • balanced: MTP steps=1, draft-tokens=2—better at high batch.
  • max-throughput: MTP off—when verify cost dominates at saturation.

Specialized options: cp (prefill context parallelism for long context) and pd-disagg (prefill/decode disaggregation).

After launch, call the standard OpenAI-compatible endpoint:

curl http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-ai/DeepSeek-V4-Flash",
    "messages": [{"role": "user", "content": "What is 15% of 240?"}]
  }'

Add the deepseek-v4 reasoning parser to split reasoning_content and content; use the deepseekv4 tool parser for structured tool calls.
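
For scripted use, the same request can be sent from Python with only the standard library. The sketch below assumes the server from the launch template is listening on localhost:30000 and that, with the reasoning parser enabled, reasoning_content appears next to content on the first choice's message. The helper names are illustrative.

```python
import json
import urllib.request
from typing import Optional, Tuple

BASE_URL = "http://localhost:30000"  # port taken from the docker template


def build_request(prompt: str,
                  model: str = "deepseek-ai/DeepSeek-V4-Flash") -> urllib.request.Request:
    """Build the same chat-completions POST that the curl example sends."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def split_reply(response: dict) -> Tuple[Optional[str], Optional[str]]:
    """Return (reasoning_content, content); reasoning is None if the parser is off."""
    message = response["choices"][0]["message"]
    return message.get("reasoning_content"), message.get("content")


def ask(prompt: str, timeout: float = 600.0) -> Tuple[Optional[str], Optional[str]]:
    """Send the prompt to a running server and split the reply."""
    with urllib.request.urlopen(build_request(prompt), timeout=timeout) as resp:
        return split_reply(json.load(resp))
```

Calling ask("What is 15% of 240?") against a running server returns the reasoning trace and the final answer as separate strings, so downstream code can log one and display the other.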

Production Pitfalls

  • DeepEP dispatch buffer: Require max-running-requests × MTP_draft_tokens ≤ SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK. Violations corrupt buffers under steady load; defaults are conservative—tune upward after smoke tests.
  • Hopper (H200) paths: Original FP4 checkpoints use Marlin w4a16 MoE (TP only; full Pro fits one node). For more parallelism, use SGLang FP8 conversions (sgl-project/DeepSeek-V4-Flash-FP8 / Pro-FP8).
  • PD-Disagg on H200: docker run needs --privileged --ulimit memlock=-1 (or IB device + IPC_LOCK); otherwise Mooncake falls back to TCP and large checkpoints may corrupt KV transfers.
  • Base models: Set SGLANG_FIX_DSV4_BASE_MODEL_LOAD=1.
  • GB300 cross-pod NVLink: If Mooncake reports nvlink_transport.cpp:497 Requested address ... not found!, add MC_FORCE_MNNVL=1 NCCL_MNNVL_ENABLE=1 NCCL_CUMEM_ENABLE=1 on both prefill and decode pods.
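
The DeepEP constraint in the first pitfall is easy to check before launch. The sketch below encodes the three recipes and the inequality max-running-requests × MTP_draft_tokens ≤ SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK. The dictionary keys and function names are hypothetical, and treating the MTP-off recipe as one token per request is an assumption.

```python
RECIPES = {
    # Values from the recipe list; key names are illustrative, not SGLang flags.
    "low-latency": {"mtp_steps": 3, "draft_tokens": 4},
    "balanced": {"mtp_steps": 1, "draft_tokens": 2},
    "max-throughput": {"mtp_steps": 0, "draft_tokens": 0},  # MTP off
}


def tokens_per_request(recipe: str) -> int:
    """Tokens one request dispatches per step; assumed to be 1 when MTP is off."""
    return max(RECIPES[recipe]["draft_tokens"], 1)


def fits_dispatch_buffer(recipe: str, max_running_requests: int,
                         dispatch_limit: int) -> bool:
    """Check max_running_requests * draft_tokens <= the per-rank dispatch limit."""
    return max_running_requests * tokens_per_request(recipe) <= dispatch_limit


def max_safe_requests(recipe: str, dispatch_limit: int) -> int:
    """Largest max-running-requests value that satisfies the constraint."""
    return dispatch_limit // tokens_per_request(recipe)


if __name__ == "__main__":
    limit = 256  # example value only; read the real one from your environment
    for name in RECIPES:
        print(name, max_safe_requests(name, limit))
```

Running a check like this in a launch script turns a buffer-corruption risk under load into an immediate configuration error.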

For implementation details, see sgl-project/sglang#23600—from V4Config registration through hybrid-attention kernels.

Takeaways

DeepSeek adopted an aggressive architecture to make 1M context affordable: at 1M context, V4-Pro needs 27% of the FLOPs per token and roughly 10% of the KV cache of V3.2. The cost is that inference engines must rebuild their KV, cache, and attention paths. SGLang’s day-zero stack (ShadowRadix, HiSparse, in-graph spec metadata, and new kernels) is systems engineering, not a patch set.

LMSYS day-zero charts show SGLang leading another open engine at 30K context single-batch decode, though rivals were still tuning MTP and long-context configs. SGLang’s own curves matter more: V4-Pro on B200 stays near-flat from 4K to 900K; V4-Flash on H200 drops less than 10%—the kind of stability that makes long context deployable in production.

References

  1. DeepSeek-V4 — SGLang Documentation
  2. SGLang DeepSeek V4 merge PR #23600
