Research
Original research on on-device AI: model compression, hardware-aware inference, and personal data integration.
Layer-Streaming Offloading: Running 9B+ LLMs on 8GB Edge Devices
Per-layer weight streaming from NVMe storage lets models that exceed device memory run inference on iPad and iPhone, with up to 88% peak-memory reduction and verified bandwidth scaling across Apple Silicon devices.
- 60–88% memory reduction: a 27B model runs with only 1.7 GB peak memory
- A 9B 6-bit model OOMs on an 8GB iPad, showing streaming is necessary for 9B+ models
- iPad/iPhone TPS ratio of 1.92x closely matches the 2x memory-bandwidth ratio
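The core idea can be sketched in a few lines: hold only one layer's weights in memory at a time by memory-mapping each layer from disk, running it, and releasing the mapping before the next layer. This is a minimal illustration with numpy, not the project's implementation; the toy layer files, shapes, and `stream_layers` helper are all hypothetical.

```python
import numpy as np
import tempfile, os

def stream_layers(layer_paths, x):
    """Forward pass that keeps only one layer's weights resident at a time.

    Each path points to a saved (d, d) weight matrix; np.load(mmap_mode="r")
    maps it from disk so pages are read on demand rather than loaded up front.
    """
    for path in layer_paths:
        w = np.load(path, mmap_mode="r")  # memory-mapped view of on-disk weights
        x = np.tanh(x @ w)                # toy layer; real layers add bias, norms, etc.
        del w                             # drop the mapping; peak memory stays ~1 layer
    return x

# Toy usage: three 4x4 "layers" streamed from temporary files.
d = 4
tmp = tempfile.mkdtemp()
rng = np.random.default_rng(0)
paths = []
for i in range(3):
    p = os.path.join(tmp, f"layer{i}.npy")
    np.save(p, rng.standard_normal((d, d)).astype(np.float32))
    paths.append(p)
out = stream_layers(paths, np.ones(d, dtype=np.float32))
```

The peak-memory win comes from the `del w` step: the OS can evict a layer's pages as soon as its matmul finishes, so resident weight memory is bounded by the largest single layer rather than the whole model.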
Does Speculative Decoding Help Mixture-of-Experts?
306-run empirical study showing that speculative decoding provides 1.18–1.30× speedup on Qwen3.5-35B-A3B MoE despite <4% draft acceptance, through a batch verification amortization mechanism that reduces memory bandwidth cost.
- 1.30× MoE speedup with a 0.8B draft at γ=16, despite <0.2% acceptance
- Speedup scales with total params (memory bandwidth), not active params
- Batch verification amortization: an SD speedup mechanism beyond acceptance rate
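Standard speculative-decoding accounting makes the result surprising: with per-token acceptance rate a and draft length γ, the expected tokens produced per target verification pass is (1 − a^(γ+1)) / (1 − a), which is barely above 1 when a is tiny. A quick sketch of that standard formula (from the speculative decoding literature, not the study's amortization model) shows why acceptance alone cannot explain the speedup:

```python
def expected_tokens_per_pass(a, gamma):
    """Expected tokens generated per target verification pass under standard
    speculative decoding: 1 + a + a^2 + ... + a^gamma = (1 - a**(gamma+1)) / (1 - a)."""
    if a >= 1.0:
        return gamma + 1
    return (1 - a ** (gamma + 1)) / (1 - a)

# At <0.2% acceptance and gamma=16, barely more than one token per pass:
low = expected_tokens_per_pass(0.002, 16)  # ~1.002
```

Since acceptance contributes almost nothing here, the study attributes the observed 1.18–1.30× gain to the verification pass itself: batching γ tokens through the target streams the (total-parameter) weights once, amortizing memory-bandwidth cost that autoregressive decoding pays per token.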
Efficient On-Device LLM Inference on Apple Silicon: From Quantization to Speculative Decoding
Systematic benchmarking of 7 GGUF quantization levels and speculative decoding for Qwen3.5 on three Apple Silicon machines (M2 Ultra, M1 Max, M2 Pro), establishing Q6_K as Pareto-optimal and a ≥2.5× draft/target speed ratio as the SD viability rule.
- Q6_K Pareto-optimal: 1.68× faster, 59% smaller, 0.54% PPL loss
- +25.7% throughput via speculative decoding (0.8B→9B, k=4)
- GGML_RPC cross-device SD: 79% overhead, not production-viable
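The viability rule above reduces to a one-line check. A minimal sketch, assuming measured decode throughputs for draft and target (the `sd_viable` helper and the example tok/s figures are illustrative, not from the benchmark):

```python
def sd_viable(draft_tps, target_tps, min_ratio=2.5):
    """Viability rule from the benchmark: speculative decoding only pays off
    when the draft decodes at least ~2.5x faster than the target."""
    return draft_tps / target_tps >= min_ratio

# e.g. a draft at 150 tok/s against a target at 45 tok/s (ratio ~3.3):
ok = sd_viable(150, 45)
```

Below the 2.5× ratio, the time spent generating draft tokens eats the savings from batched verification, so SD is not worth enabling.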
Cross-Domain Personal Data Integration on Consumer Hardware
Integrating finance, diet, mood, and reading data entirely on consumer Apple Silicon, producing emergent cross-domain insights with zero data leakage.
- 1.48x cross-domain insight emergence (IIR)
- 125.5x federation compression, zero data leakage
- 49.9 TPS real-time inference (35B on M2 Ultra)
ANE Batch Prefill for On-Device Parallel LLM Inference
Enabling concurrent ANE prefill and GPU decode on Apple Silicon via fused batch matrix-vector kernels, achieving 11.3x speedup over sequential dispatch.
- 11.3x batch dispatch speedup (268 tok/s)
- 79% power reduction with concurrent pipeline
- 27ms TTFT on multi-turn conversations
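The fusion idea is that N separate matrix-vector dispatches over a shared weight matrix are mathematically one matrix-matrix product, so a single kernel launch can amortize weight loads across the whole batch. A numpy sketch of the equivalence (the shapes are arbitrary; this illustrates the algebra, not the ANE kernel):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 8)).astype(np.float32)    # shared weight matrix
xs = rng.standard_normal((16, 8)).astype(np.float32)  # 16 token vectors

# Sequential dispatch: one matrix-vector product per token.
seq = np.stack([W @ x for x in xs])

# Fused batch dispatch: a single matrix-matrix product over all tokens,
# letting one kernel launch reuse W across the whole batch.
fused = xs @ W.T

assert np.allclose(seq, fused, atol=1e-4)
```

On bandwidth-bound hardware the fused form wins because W is read from memory once instead of once per token, which is where the reported dispatch speedup comes from.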
Disaggregated LLM Inference on Apple Silicon
Benchmarking CoreML ANE prefill with MLX GPU decode for Qwen3.5 on Apple Silicon, comparing four inference strategies.
- ANE prefill matches GPU at ~410 tokens
- 282x GPU power reduction during prefill
- 4 inference pipelines benchmarked
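The ~410-token crossover suggests a simple routing policy: use the GPU for short prompts where it is still faster, and the ANE at or beyond the crossover, where it matches GPU speed at far lower power. A sketch of that policy (the `pick_prefill_backend` helper is hypothetical; only the crossover figure comes from the benchmark):

```python
CROSSOVER_TOKENS = 410  # prompt length where ANE prefill matches the GPU

def pick_prefill_backend(prompt_tokens, crossover=CROSSOVER_TOKENS):
    """Route prefill: below the crossover the GPU finishes faster; at or
    above it the ANE matches GPU speed at a fraction of the power."""
    return "ane" if prompt_tokens >= crossover else "gpu"
```

Decode would still run on the GPU in either case; only the prefill stage is routed.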
On-Device Text-to-Speech
Native Swift implementation of Qwen3 TTS 0.6B for real-time, on-device speech synthesis.
- 67% model compression (2.35 GB → 808 MB)
- Real-time synthesis (RTF 0.68x)
- 12 languages supported
On-Device Vision Language Model
Multi-stage compression pipeline for deploying Gemma 3 4B VLM on consumer hardware.
- 25% model compression (2.8 GB → 2.1 GB)
- 110 tok/s text generation
- 3.4x image processing speedup
MLX Memory Optimization Research
Exploring memory optimization techniques for the MLX framework on Apple Silicon.
- Up to 20x faster mmap loading
- Zero-copy model loading
- Comprehensive benchmarks
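Zero-copy loading means the weight file is mapped into the address space rather than read into RAM, so "loading" is near-instant and only the pages actually touched are faulted in. A minimal numpy illustration of the pattern (a toy file, not MLX's loader):

```python
import numpy as np
import tempfile, os

# Save a weight tensor, then map it back without copying it into RAM.
path = os.path.join(tempfile.mkdtemp(), "weights.npy")
np.save(path, np.arange(1024, dtype=np.float32).reshape(32, 32))

weights = np.load(path, mmap_mode="r")  # pages fault in lazily from disk
row_sum = float(weights[3].sum())       # touches only the pages holding row 3
```

Because no upfront copy happens, load time scales with the pages accessed instead of the model size, which is where large mmap loading speedups come from.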