Unified Memory Architecture

Why is Apple Silicon uniquely suited for running AI?

About 10 min read
Chapter 01

The Bottleneck of Traditional Architectures

CPU and GPU Each Have Their Own Memory

In traditional PC and server architectures, the CPU uses system memory (RAM) while a discrete GPU has its own video memory (VRAM). The two are connected via the PCIe bus. When an AI model needs to run on the GPU, data must first be copied from RAM to VRAM — and this transfer is a major bottleneck.

PCIe 4.0 x16 offers a theoretical bandwidth of about 32 GB/s, while GPU VRAM itself can deliver several TB/s internally. This means data transfer is far slower than computation — the GPU is often waiting for data.

NVIDIA's high-end GPUs use HBM (High Bandwidth Memory), delivering an impressive 2-4 TB/s bandwidth. But capacity is limited: consumer GPUs typically have only 8-24 GB VRAM, and even the data-center-grade H100 has just 80 GB. For large language models, if the model parameters don't fit entirely in VRAM, it simply cannot run.
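To put the gap in perspective, here is a back-of-the-envelope sketch using the theoretical peak figures quoted above (the 40 GB weight size is an illustrative value for a quantized 70B model):

```python
def transfer_seconds(size_gb: float, bandwidth_gb_s: float) -> float:
    """Time to move size_gb of data at a sustained bandwidth_gb_s (GB/s)."""
    return size_gb / bandwidth_gb_s

# Copying ~40 GB of quantized 70B weights from RAM to VRAM over PCIe 4.0 x16:
pcie_copy = transfer_seconds(40, 32)    # 1.25 s per full copy
# Reading the same 40 GB once from HBM at ~3 TB/s:
hbm_read = transfer_seconds(40, 3000)   # ~0.013 s

print(f"PCIe copy: {pcie_copy:.2f} s, HBM read: {hbm_read:.3f} s")
```

Roughly two orders of magnitude separate the bus from the memory itself, which is why the GPU spends so much time waiting.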

// Traditional Discrete Architecture

   CPU ◄────── PCIe ~32 GB/s ──────► GPU
    │                                 │
   RAM                              VRAM
 64-128 GB                         8-80 GB

⚠ Data must be copied between RAM and VRAM

Traditional architecture is like two kitchens connected by a narrow hallway — ingredients must be carried back and forth, and the chef (GPU) has to wait no matter how fast they can cook.

Chapter 02

The Apple Silicon Revolution

One Chip, One Memory Pool

Apple Silicon uses a Unified Memory Architecture (UMA). Under this design, the CPU, GPU, and Apple Neural Engine (ANE) all share a single pool of memory. There is no distinction between RAM and VRAM — every processor accesses the same physical memory at the same address space.

This means no data copying is needed. Once the CPU loads model weights into memory, the GPU can read the same data directly, and the ANE can access it simultaneously — zero-copy. This eliminates the biggest bottleneck in traditional architectures.
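As a loose analogy, Python's `memoryview` shows the same idea in miniature: two handles to one underlying buffer, so a write through one is immediately visible through the other, with no copy. (This illustrates the concept only; it is not how Metal or MLX actually exposes unified memory.)

```python
buf = bytearray(4)           # one pool of memory: the "unified" buffer
cpu_view = memoryview(buf)   # the "CPU's" handle to it
gpu_view = memoryview(buf)   # the "GPU's" handle to the same bytes

cpu_view[0] = 42             # the "CPU" writes model data...
print(gpu_view[0])           # ...the "GPU" reads it directly, prints 42
```

Both views reference the same physical bytes, just as the CPU, GPU, and ANE reference the same physical memory under UMA.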

More importantly, unified memory offers far greater capacity than traditional VRAM. The M2 Ultra provides up to 192 GB of unified memory, and the M4 Max offers up to 128 GB — far more than any consumer GPU's VRAM.

// Apple Silicon Unified Memory Architecture

   [ CPU ]    [ GPU ]    [ ANE ]
      │          │          │
      └──────────┼──────────┘
                 │
        Unified Memory Pool
Up to 192 GB · Shared by all processors

✓ No data copying · Same address space · Concurrent access by all processors

Unified memory is like one large open kitchen — all chefs (CPU, GPU, ANE) work around the same counter, with ingredients right at hand, no carrying needed.

Chapter 03

Memory Bandwidth — The Hidden Ceiling of AI Inference

LLM inference is memory-bandwidth bound, not compute-bound. During the decode phase (token-by-token generation), the entire model's weights must be read for each token generated. Therefore, memory bandwidth directly determines inference speed (tok/s).
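This bound is easy to estimate: if every token requires one full read of the weights, the ceiling is simply bandwidth divided by model size. A sketch with illustrative figures consistent with this chapter's numbers:

```python
def max_tokens_per_sec(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Theoretical decode-speed ceiling: one full weight read per generated token."""
    return bandwidth_gb_s / weights_gb

# A ~30B dense model at Q4 is roughly 17 GB of weights.
# At 800 GB/s of memory bandwidth (an Ultra-class chip):
print(f"{max_tokens_per_sec(800, 17):.0f} tok/s ceiling")  # ~47 tok/s, before any compute cost
```

Real throughput lands below this ceiling once compute and KV-cache reads are added, but the ceiling explains why bandwidth, not FLOPS, dominates decode speed.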

Apple Silicon's memory bandwidth has improved steadily across generations:

// Memory Bandwidth Comparison

Apple M1:            ~68 GB/s
Apple M2 Max:        400 GB/s
Apple M4 Max:        up to 546 GB/s
Apple M2 Ultra:      800 GB/s
NVIDIA RTX 4090:     ~1 TB/s
NVIDIA H100 (HBM3):  ~3.35 TB/s

The key trade-off:

NVIDIA GPUs offer higher bandwidth but limited VRAM capacity. Apple Silicon has lower bandwidth but much larger unified memory — enabling larger models to run, even if somewhat slower. For on-device AI, "can run" matters more than "runs fastest."

Chapter 04

Why This Matters for On-Device AI

Large models (30B+ parameters) still require 15-20 GB of memory after Q4 quantization, and 70B models need 35-40 GB. This far exceeds any consumer GPU's VRAM, but fits comfortably in the M2 Ultra's 192 GB or M4 Max's 128 GB unified memory.
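These figures follow directly from the quantized weight size: Q4 stores roughly 4.5 bits per parameter once quantization scales are included (the 4.5-bit constant is a common approximation, not a figure from the text above):

```python
def q4_weight_gb(params_billions: float, bits_per_param: float = 4.5) -> float:
    """Approximate weight footprint at ~4-bit quantization.
    4.5 bits/param accounts for the scales/zero-points stored alongside Q4 weights."""
    return params_billions * 1e9 * bits_per_param / 8 / 1e9

print(f"30B at Q4: ~{q4_weight_gb(30):.1f} GB")  # ~16.9 GB
print(f"70B at Q4: ~{q4_weight_gb(70):.1f} GB")  # ~39.4 GB
```

Add a few GB for the KV cache and runtime, and you arrive at the 15-20 GB and 35-40 GB figures quoted above.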

The zero-copy property enables CPU/GPU/ANE hybrid inference. Different layers of the model can be assigned to the most suitable processor without any data transfer between them. (See our ANE Hybrid Inference article for details.)

Mixture-of-Experts (MoE) models benefit especially from unified memory. Take Qwen 35B MoE as an example: all 35B parameters must reside in memory, but only about 3B are activated per inference step. Traditional architectures need all 35B parameters in VRAM; on Apple Silicon, unified memory easily holds all parameters while the GPU only computes on the active 3B — extremely efficient.
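The split between resident and active parameters can be quantified with the same Q4 sizing, about 0.56 GB per billion parameters (the constant is an approximation, not a figure from the text above):

```python
GB_PER_BILLION_Q4 = 0.56  # ~4.5 bits/param at Q4, in GB per billion parameters

def moe_memory(total_b: float, active_b: float) -> tuple[float, float]:
    """Return (resident GB that must fit in memory, GB of weights read per token)."""
    return total_b * GB_PER_BILLION_Q4, active_b * GB_PER_BILLION_Q4

resident_gb, per_token_gb = moe_memory(35, 3)
print(f"resident: {resident_gb:.1f} GB, read per token: {per_token_gb:.1f} GB")
# All 35B must be resident (~19.6 GB), but each token only streams ~1.7 GB of weights.
```

Capacity requirements scale with total parameters while the bandwidth bound scales with active parameters, which is exactly the combination unified memory provides.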

AtomGradient's research data shows that the M2 Ultra runs a 35B MoE model at 49.9 tok/s — data-center-level performance on consumer hardware.

// Model Size vs Device Memory

Which models fit on which devices? (Memory requirements after Q4 quantization)

30B class: 15-20 GB · exceeds most consumer GPU VRAM (8-24 GB) · fits easily in unified memory
70B class: 35-40 GB · exceeds any consumer GPU's VRAM · fits in M4 Max (128 GB) or M2 Ultra (192 GB)

Unified memory makes Apple Silicon the ideal platform for MoE — like having a massive bookshelf where you only read a few books at a time, but every book is within arm's reach, no trips to the warehouse needed.

Chapter 05

Summary

🔗

Unified > Discrete

Unified memory eliminates the data-copy bottleneck between CPU and GPU — all processors share one memory pool.

⚡

Bandwidth Drives Speed

LLM inference is memory-bandwidth bound — higher bandwidth means more tokens per second.

📦

More Capacity = Bigger Models

192 GB of unified memory makes running 70B+ models on consumer devices a reality.

🔄

Zero-Copy Enables Hybrid Inference

CPU/GPU/ANE can collaborate on inference without data transfers — MoE models benefit the most.

Unified memory lets Apple Silicon achieve data-center-level model deployment on consumer hardware.

Next: Zero-Copy Model Loading