Zero-Copy Model Loading

How to make model loading 20x faster? The magic of mmap.

Chapter 01

The Problem with Traditional Loading

When launching a large language model, the traditional loading path follows a simple but heavy sequence: read the file from disk → allocate memory → copy the bytes into place → model ready. The process seems natural, but it hides serious performance problems.
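That sequence can be made concrete with a minimal sketch. This is not any particular framework's loader, just the read-then-copy pattern described above (the file path is hypothetical):

```python
# Minimal sketch of the traditional loading path.
def load_traditional(path):
    with open(path, "rb") as f:
        buf = f.read()        # 1) read the whole file into a temporary buffer
    weights = bytearray(buf)  # 2) copy the bytes into the model's own storage
    # Until 'buf' goes out of scope, BOTH copies are alive: the ~2x peak.
    return weights
```

Everything after `f.read()` is pure memory traffic, and for the lifetime of `buf` the process holds two full copies of the model.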

Problem 1: 2x Memory Peak

During loading, the system must hold both the temporary disk buffer and the new model memory simultaneously. A 7B model in FP16 (14 GB on disk) therefore peaks at about 28 GB during loading. Only after loading completes is the buffer freed, dropping usage back to 14 GB.

Problem 2: Slow

Even on an SSD, fully reading 14 GB into memory and then copying it into the model's weight structures takes several seconds to tens of seconds. For interactive applications, this is an unacceptable wait.

Problem 3: Model Size Limited by RAM

Due to the 2x memory peak, a device with 24 GB of RAM can in theory only load models up to 12 GB. Beyond that threshold, loading fails with an out-of-memory error.

// INTERACTIVE: Traditional Loading Timeline

Observe the four stages of traditional loading and the memory peak spike:

Traditional loading, memory usage over time (7B FP16 = 14 GB):

Disk Read: 14 GB (buffer held)
Alloc + Copy: 28 GB (2x peak!)
Ready: 14 GB (buffer freed)

Traditional loading is like moving house by first loading everything from the old house onto a truck, then unloading from the truck into the new house — you need two spaces at once, and you move everything twice.

Chapter 02

mmap — Memory-Mapped Files

mmap (memory-mapped file) is an OS mechanism that maps a disk file directly into a process's virtual address space. The program doesn't need to explicitly call read() to load data — it accesses file contents directly through memory addresses, and the OS loads the corresponding pages on demand in the background.

Core Principles

No explicit read, no copy. When the program accesses a mapped address whose data isn't yet in physical memory, the CPU triggers a page fault. The OS then loads that page from disk into physical memory. The entire process is transparent to the program.

Lazy Loading. Only pages that are actually accessed get loaded into physical memory. For a 14 GB model file, if you only access 2 GB of weights, physical memory usage is just 2 GB.

The OS manages everything. Page swapping, cache eviction, disk I/O scheduling — all handled by the OS virtual memory subsystem. The program simply accesses data as if it were regular memory.
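The principles above map almost one-to-one onto Python's stdlib `mmap` module. A minimal sketch (the file name is hypothetical):

```python
import mmap
import os

def load_mmap(path):
    """Map a file read-only; returns almost instantly, data loads lazily."""
    fd = os.open(path, os.O_RDONLY)
    try:
        # length=0 maps the whole file. No bytes are read here --
        # pages are faulted in only when the mapping is actually accessed.
        mm = mmap.mmap(fd, 0, access=mmap.ACCESS_READ)
    finally:
        os.close(fd)  # the mapping holds its own reference to the file
    return mm
```

Slicing the returned object (e.g. `load_mmap("model.bin")[:256]`) is what actually triggers the page faults; the `mmap()` call itself only establishes the mapping.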

// INTERACTIVE: Traditional vs mmap

Compare the data flow paths of both approaches:

TRADITIONAL: Disk File (14 GB) → read() + copy → RAM Buffer (14 GB) → memcpy → Model Weights (14 GB)

MMAP: Disk File (14 GB) → mmap() virtual mapping → direct access = direct use

mmap is like reading a book at the library — you don't need to photocopy the entire book and take it home. Just flip to the page you need. The book stays on the shelf; you only "borrow" the page you're currently reading.

Chapter 03

The Magic of Zero-Copy — 20x Speedup

AtomGradient's OptMLX research applies mmap zero-copy loading to on-device model inference, achieving up to 20x loading speedup.

Why 20x Faster?

No memory peak. Model weights are read-only and mapped directly from the file — there's no "old buffer + new memory" double overhead. Memory usage stays at 1x instead of 2x.

Instant "loading." The mmap() system call returns almost immediately — it only establishes the virtual address-to-file mapping, with no actual I/O involved. Real data transfer happens lazily on subsequent access.

First inference slightly slower, then blazing fast. During the first inference, accessed weight pages are loaded from disk (page faults) with some latency. But once pages enter physical memory, subsequent accesses run at pure memory speed — identical to post-load traditional performance.
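If that first-touch latency matters, the pages can be faulted in ahead of time. One portable sketch is simply to touch one byte per page before inference begins (a deliberate warm-up, not part of any specific framework):

```python
import mmap

def warm(mm, start=0, length=None):
    # Touch one byte per page so the OS faults these pages in now,
    # rather than during the first inference request.
    page = mmap.PAGESIZE
    end = len(mm) if length is None else min(start + length, len(mm))
    for off in range(start, end, page):
        _ = mm[off]
```

On Linux, `mm.madvise(mmap.MADV_WILLNEED)` asks the kernel to prefetch asynchronously and avoids the Python-level loop, at the cost of portability.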

// INTERACTIVE: Load Speed & Memory Peak Comparison

Compare traditional vs mmap loading performance across model sizes:

7B model (14 GB FP16): traditional ~4.2 s → mmap ~0.2 s (~20x)
9B model (18 GB FP16): traditional ~5.4 s → mmap ~0.3 s (~18x)
35B model (70 GB FP16): traditional ~21 s → mmap ~1.4 s (~15x)

Memory peak: traditional 28 GB (2x peak) vs. mmap 14 GB (no peak)
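The shape of these numbers (not the absolute values, which depend on hardware) can be reproduced with a small timing sketch against a scratch file instead of a real model:

```python
import mmap
import os
import time

def time_traditional(path):
    t0 = time.perf_counter()
    with open(path, "rb") as f:
        data = f.read()  # full read: cost grows with file size
    return time.perf_counter() - t0

def time_mmap(path):
    t0 = time.perf_counter()
    fd = os.open(path, os.O_RDONLY)
    mm = mmap.mmap(fd, 0, access=mmap.ACCESS_READ)  # mapping only, no I/O
    os.close(fd)
    dt = time.perf_counter() - t0
    mm.close()
    return dt
```

`time_traditional` scales with file size; `time_mmap` stays roughly constant, because establishing the mapping does no data transfer.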

Traditional loading is "download the entire movie before playing." mmap is "stream it instantly" — the content is still on disk, but you're already watching.

Chapter 04

Why This Matters Most for Edge Devices

Edge devices (phones, laptops, embedded systems) typically have 8-32 GB of RAM, shared with the OS and other applications. In this constrained environment, the traditional 2x memory peak becomes a fatal bottleneck.

Real-World Impact

A MacBook with 24 GB of RAM has about 6 GB used by the OS and apps, leaving 18 GB available. With traditional loading and its 2x peak, the maximum loadable model is only 9 GB. But with mmap zero-copy, the same device can load 18 GB or even larger models (since lazy loading doesn't require all weights in memory simultaneously).
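The arithmetic behind those figures, with the peak factor as an explicit parameter (the 6 GB OS overhead is the estimate from the text, not a measured value):

```python
def max_model_gb(ram_gb, os_overhead_gb, peak_factor):
    # Largest model that fits, given the loader's peak-memory multiplier.
    return (ram_gb - os_overhead_gb) / peak_factor

print(max_model_gb(24, 6, 2))  # traditional (2x peak): 9.0
print(max_model_gb(24, 6, 1))  # mmap (no peak): 18.0
```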

Quantization + Zero-Copy = Optimal Solution

Combining model quantization (e.g., Q4, 4-bit quantization) with mmap zero-copy multiplies the benefits. A 7B model quantized to Q4 is only about 3.5 GB — via mmap it's ready almost instantly, with a memory peak of just 3.5 GB. This transforms devices from "barely runs" to "runs comfortably."
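The sizing above follows directly from bits per weight (this counts parameters only, ignoring small overheads like embeddings, metadata, and activation buffers):

```python
def model_size_gb(params_billion, bits_per_weight):
    # bytes per parameter = bits / 8; result in decimal GB
    return params_billion * bits_per_weight / 8

print(model_size_gb(7, 16))  # FP16: 14.0
print(model_size_gb(7, 4))   # Q4: 3.5
```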

// INTERACTIVE: Device Memory Capacity Analysis

Compare what models each device can run with traditional vs zero-copy loading:

(Interactive table: for each device/system, the largest model loadable via mmap vs. traditional loading, flagging where the 2x peak exceeds available RAM.)

Quantization is "compressing your luggage." Zero-copy is "using it in place without unpacking." Together, they're the most efficient on-device deployment strategy.

Chapter 05

Summary

🧲

mmap Eliminates Memory Peaks

Memory-mapped file access bypasses the traditional 2x memory spike, maximizing the value of limited RAM.

20x Loading Speedup

The mmap system call returns instantly with lazy on-demand loading — model "load" time drops from seconds to milliseconds.

📱

Edge Devices Benefit Most

Memory-constrained edge devices can run models that were previously "too large to fit" thanks to zero-copy loading.

🎯

Quantization + Zero-Copy = Best

Q4 quantization compresses model size while mmap eliminates loading overhead — dual optimization for peak on-device efficiency.

Zero-copy loading transforms models from "waiting to load" to "instant-on" — this isn't optimization, it's a paradigm shift.

OptMLX Research