Research

Original research in on-device AI — model compression, hardware-aware inference, and personal data integration.

MLXLayerStream

Layer-Streaming Offloading: Running 9B+ LLMs on 8GB Edge Devices

Per-layer weight streaming from NVMe storage enables models exceeding device memory to run inference on iPad and iPhone. 88% peak memory reduction with verified bandwidth scaling across Apple Silicon devices.

  • 60–88% memory reduction: 27B model runs with only 1.7 GB peak memory
  • 9B-6bit baseline OOMs on an 8GB iPad, confirming streaming is necessary for 9B+ models
  • iPad/iPhone TPS ratio of 1.92x closely matches the 2x memory-bandwidth ratio
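The streaming idea behind these numbers can be sketched in a few lines of plain Python (illustrative only; the function names and file format here are assumptions, not the MLXLayerStream API): each layer's weights live on disk, are loaded just before use, and are freed before the next layer runs, so peak weight memory is roughly one layer rather than the whole model.

```python
import os, tempfile, array

# Hypothetical per-layer weight streaming sketch (not the MLXLayerStream API).
# Weights for each layer are stored in a separate file; inference loads one
# layer, applies it, and drops it before the next layer is read in.

HIDDEN = 4

def save_layer(path, weights):
    # weights: flat row-major list of floats for one HIDDEN x HIDDEN layer
    with open(path, "wb") as f:
        array.array("f", weights).tofile(f)

def load_layer(path):
    a = array.array("f")
    with open(path, "rb") as f:
        a.fromfile(f, HIDDEN * HIDDEN)
    return list(a)

def matvec(w, x):
    # w is a flat row-major HIDDEN x HIDDEN matrix
    return [sum(w[i * HIDDEN + j] * x[j] for j in range(HIDDEN))
            for i in range(HIDDEN)]

def streamed_forward(layer_paths, x):
    for path in layer_paths:
        w = load_layer(path)   # stream one layer in from storage
        x = matvec(w, x)       # run it
        del w                  # free it before the next layer loads
    return x

tmp = tempfile.mkdtemp()
identity = [1.0 if i == j else 0.0
            for i in range(HIDDEN) for j in range(HIDDEN)]
paths = []
for k in range(3):
    p = os.path.join(tmp, f"layer{k}.bin")
    save_layer(p, identity)
    paths.append(p)

print(streamed_forward(paths, [1.0, 2.0, 3.0, 4.0]))  # → [1.0, 2.0, 3.0, 4.0]
```

With identity layers the output equals the input, which makes the round-trip through storage easy to verify; the point is that only one layer's weights are ever resident at a time.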

speculative-moe-research

Does Speculative Decoding Help Mixture-of-Experts?

306-run empirical study showing that speculative decoding provides 1.18–1.30× speedup on Qwen3.5-35B-A3B MoE despite <4% draft acceptance, through a batch verification amortization mechanism that reduces memory bandwidth cost.

  • 1.30× MoE speedup with 0.8B draft at γ=16, <0.2% acceptance
  • Speedup scales with total params (memory bandwidth), not active params
  • Batch verification amortization: new SD mechanism beyond acceptance rate
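The standard speculative-decoding arithmetic behind findings like these can be sketched as a toy cost model (the constants below are illustrative assumptions, not numbers from the study). The key variable is the verification cost: under a memory-bandwidth bound, one batched verification pass over γ draft tokens reads the weights roughly once, so its cost stays near a single decode step instead of growing with γ, which is the amortization effect the study points to.

```python
# Toy speculative-decoding cost model. alpha = per-token draft acceptance,
# gamma = draft length, c_draft / c_verify = cost of one draft step / one
# batched verification pass, in units of one target decode step.
# All constants below are illustrative, not measurements from the study.

def expected_tokens_per_round(alpha, gamma):
    # Standard expected-length formula: the accepted run plus the
    # bonus/correction token emitted at the first rejection.
    if alpha == 1.0:
        return gamma + 1
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)

def sd_speedup(alpha, gamma, c_draft, c_verify):
    round_cost = gamma * c_draft + c_verify
    return expected_tokens_per_round(alpha, gamma) / round_cost

# Modest acceptance, cheap draft, batched verification ~ one decode step:
print(round(sd_speedup(alpha=0.3, gamma=4, c_draft=0.05, c_verify=1.0), 2))  # → 1.19
```

In this simple model the speedup hinges on c_verify staying near 1 while the draft is cheap; the study's batch-verification-amortization argument is that for a bandwidth-bound MoE this verification cost is what speculative decoding exploits, not the acceptance rate.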

apple-silicon-llm-inference

Efficient On-Device LLM Inference on Apple Silicon: From Quantization to Speculative Decoding

Systematic benchmarking of 7 GGUF quantization levels and speculative decoding for Qwen3.5 on three Apple Silicon machines (M2 Ultra, M1 Max, M2 Pro), establishing Q6_K as Pareto-optimal and a ≥2.5× draft/target speed ratio as the SD viability rule.

  • Q6_K Pareto-optimal: 1.68× faster, 59% smaller, 0.54% PPL loss
  • +25.7% throughput via speculative decoding (0.8B→9B, k=4)
  • GGML_RPC cross-device SD: 79% overhead — not production-viable
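The Pareto-optimality claim is a three-axis dominance check: a quantization level survives if no other level is at least as fast, at least as small, and at least as accurate, with a strict win somewhere. A minimal sketch (the tuples below are made-up placeholder values, not the benchmark's data):

```python
# Toy Pareto-front check over quantization levels. Metrics per level are
# (tok/s, size in GB, perplexity); higher throughput, smaller size, and
# lower perplexity are better. The numbers are illustrative placeholders,
# not results from the benchmark.

def dominates(a, b):
    ge = a[0] >= b[0] and a[1] <= b[1] and a[2] <= b[2]
    strict = a[0] > b[0] or a[1] < b[1] or a[2] < b[2]
    return ge and strict

def pareto_front(levels):
    return sorted(name for name, m in levels.items()
                  if not any(dominates(o, m)
                             for n, o in levels.items() if n != name))

levels = {                  # (tok/s, GB, PPL) -- made-up values
    "Q4_K": (52.0, 5.1, 8.90),
    "Q5_1": (45.0, 7.8, 8.50),   # dominated by Q6_K on every axis
    "Q6_K": (48.0, 7.2, 8.45),
    "Q8_0": (40.0, 9.6, 8.42),
    "F16":  (29.0, 18.0, 8.40),
}
front = pareto_front(levels)
print(front)  # → ['F16', 'Q4_K', 'Q6_K', 'Q8_0']
```

Calling a single level like Q6_K "Pareto-optimal" in prose usually means it is the front member with the best practical trade-off, not the only undominated point.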

Prism

Cross-Domain Personal Data Integration on Consumer Hardware

Integrating finance, diet, mood, and reading data entirely on consumer Apple Silicon, producing emergent cross-domain insights with zero data leakage.

  • 1.48x cross-domain insight emergence (IIR)
  • 125.5x federation compression, zero data leakage
  • 49.9 TPS real-time inference (35B on M2 Ultra)

hybird-batch-prefill-on-ane

ANE Batch Prefill for On-Device Parallel LLM Inference

Enabling concurrent ANE prefill and GPU decode on Apple Silicon via fused batch matrix-vector kernels, achieving 11.3x speedup over sequential dispatch.

  • 11.3x batch dispatch speedup (268 tok/s)
  • 79% power reduction with concurrent pipeline
  • 27ms TTFT on multi-turn conversations
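Why fused batch dispatch wins can be shown with a pure-Python sketch (illustrative only; the real work here uses ANE kernels). Per-token dispatch applies the weight matrix to each vector separately, so in a bandwidth-bound regime the weights are re-read once per token; the fused version sweeps the weights once for the whole batch and produces identical results.

```python
# Illustrative contrast between per-token dispatch and a fused batch
# matrix-vector pass (pure Python, not the ANE kernel).

def matvec(W, x):
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in W]

def per_token(W, X):
    # one dispatch per token: len(X) separate passes over W
    return [matvec(W, x) for x in X]

def fused_batch(W, X):
    # one dispatch for the batch: a single sweep over W's rows,
    # each row dotted with every token vector
    out = [[0.0] * len(W) for _ in X]
    for i, row in enumerate(W):
        for t, x in enumerate(X):
            out[t][i] = sum(wi * xi for wi, xi in zip(row, x))
    return out

W = [[1.0, 2.0], [3.0, 4.0]]
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(fused_batch(W, X))  # → [[1.0, 3.0], [2.0, 4.0], [3.0, 7.0]]
```

Both paths compute the same outputs; the speedup comes entirely from how many times the (large) weight matrix must be traversed, which is the effect the 11.3x batch-dispatch number reflects.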

hybrid-ane-mlx-bench

Disaggregated LLM Inference on Apple Silicon

Benchmarking CoreML ANE prefill + MLX GPU decode for Qwen3.5 on Apple Silicon, with four inference strategies compared.

  • ANE prefill matches GPU at ~410 tokens
  • 282x GPU power reduction during prefill
  • 4 inference pipelines benchmarked

swift-qwen3-tts

On-Device Text-to-Speech

Native Swift implementation of Qwen3 TTS 0.6B for real-time, on-device speech synthesis.

  • 67% model compression (2.35 GB → 808 MB)
  • Real-time synthesis (RTF 0.68x)
  • 12 languages supported

Gemma-Prune

On-Device Vision Language Model

Multi-stage compression pipeline for deploying Gemma 3 4B VLM on consumer hardware.

  • 25% model compression (2.8 GB → 2.1 GB)
  • 110 tok/s text generation
  • 3.4x image processing speedup

OptMLX

MLX Memory Optimization Research

Exploring memory optimization techniques for the MLX framework on Apple Silicon.

  • Up to 20x faster mmap loading
  • Zero-copy model loading
  • Comprehensive benchmarks
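The mmap/zero-copy idea can be sketched with the standard library alone (a minimal sketch, not the OptMLX API): instead of read()-ing the weight file into a heap buffer, map it and take a typed memoryview over the mapped pages, so bytes are faulted in lazily by the OS and never copied a second time.

```python
import mmap, os, struct, tempfile

# Minimal mmap loading sketch (illustrative; not the OptMLX API).
# Write a tiny "weights" file, then map it read-only and view the float
# payload zero-copy via memoryview.cast.

path = os.path.join(tempfile.mkdtemp(), "weights.bin")
with open(path, "wb") as f:
    f.write(struct.pack("4f", 1.0, 2.0, 3.0, 4.0))  # native-endian floats

with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    floats = memoryview(mm).cast("f")   # zero-copy typed view of mapped pages
    loaded = floats.tolist()
    floats.release()                    # drop the buffer export before closing
    mm.close()

print(loaded)  # → [1.0, 2.0, 3.0, 4.0]
```

The speedup from this pattern comes from skipping the eager read-and-copy: pages are only touched when the corresponding weights are first used, which is what "up to 20x faster" cold loading refers to in spirit.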