Research
Original research on on-device AI: model compression, hardware-aware inference, and personal data integration.
Layer-Streaming Offloading: Running 9B+ LLMs on 8GB Edge Devices
Per-layer weight streaming from NVMe storage lets models that exceed device memory run inference on iPad and iPhone, with up to 88% peak-memory reduction and verified bandwidth scaling across Apple Silicon devices.
- 60–88% memory reduction: a 27B model runs with only 1.7 GB peak memory
- A 9B 6-bit model OOMs on an 8GB iPad, showing streaming is necessary for 9B+ models
- iPad/iPhone TPS ratio of 1.92x closely matches the 2x memory-bandwidth ratio
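The core idea can be sketched in a few lines: hold only one layer's weights in memory at a time by memory-mapping each layer from disk, running it, and releasing the mapping before the next layer. This is a minimal illustration with numpy, not the project's implementation; the toy layer files, shapes, and `stream_layers` helper are all hypothetical.

```python
import numpy as np
import tempfile, os

def stream_layers(layer_paths, x):
    """Forward pass that keeps only one layer's weights resident at a time.

    Each path points to a saved (d, d) weight matrix; np.load(mmap_mode="r")
    maps it from disk so pages are read on demand rather than loaded up front.
    """
    for path in layer_paths:
        w = np.load(path, mmap_mode="r")  # memory-mapped view of on-disk weights
        x = np.tanh(x @ w)                # toy layer; real layers add bias, norms, etc.
        del w                             # drop the mapping; peak memory stays ~1 layer
    return x

# Toy usage: three 4x4 "layers" streamed from temporary files.
d = 4
tmp = tempfile.mkdtemp()
rng = np.random.default_rng(0)
paths = []
for i in range(3):
    p = os.path.join(tmp, f"layer{i}.npy")
    np.save(p, rng.standard_normal((d, d)).astype(np.float32))
    paths.append(p)
out = stream_layers(paths, np.ones(d, dtype=np.float32))
```

The peak-memory win comes from the `del w` step: the OS can evict a layer's pages as soon as its matmul finishes, so resident weight memory is bounded by the largest single layer rather than the whole model.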
Does Speculative Decoding Help Mixture-of-Experts?
306-run empirical study showing that speculative decoding provides 1.18–1.30× speedup on Qwen3.5-35B-A3B MoE despite <4% draft acceptance, through a batch verification amortization mechanism that reduces memory bandwidth cost.
- 1.30× MoE speedup with a 0.8B draft at γ=16, despite <0.2% acceptance
- Speedup scales with total params (memory bandwidth), not active params
- Batch verification amortization: an SD speedup mechanism beyond acceptance rate
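Standard speculative-decoding accounting makes the result surprising: with per-token acceptance rate a and draft length γ, the expected tokens produced per target verification pass is (1 − a^(γ+1)) / (1 − a), which is barely above 1 when a is tiny. A quick sketch of that standard formula (from the speculative decoding literature, not the study's amortization model) shows why acceptance alone cannot explain the speedup:

```python
def expected_tokens_per_pass(a, gamma):
    """Expected tokens generated per target verification pass under standard
    speculative decoding: 1 + a + a^2 + ... + a^gamma = (1 - a**(gamma+1)) / (1 - a)."""
    if a >= 1.0:
        return gamma + 1
    return (1 - a ** (gamma + 1)) / (1 - a)

# At <0.2% acceptance and gamma=16, barely more than one token per pass:
low = expected_tokens_per_pass(0.002, 16)  # ~1.002
```

Since acceptance contributes almost nothing here, the study attributes the observed 1.18–1.30× gain to the verification pass itself: batching γ tokens through the target streams the (total-parameter) weights once, amortizing memory-bandwidth cost that autoregressive decoding pays per token.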
Efficient On-Device LLM Inference on Apple Silicon: From Quantization to Speculative Decoding
Systematic benchmarking of 7 GGUF quantization levels and speculative decoding for Qwen3.5 on three Apple Silicon machines (M2 Ultra, M1 Max, M2 Pro), establishing Q6_K as Pareto-optimal and a ≥2.5× draft/target speed ratio as the SD viability rule.
- Q6_K Pareto-optimal: 1.68× faster, 59% smaller, 0.54% PPL loss
- +25.7% throughput via speculative decoding (0.8B→9B, k=4)
- GGML_RPC cross-device SD: 79% overhead, not production-viable
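The viability rule above reduces to a one-line check. A minimal sketch, assuming measured decode throughputs for draft and target (the `sd_viable` helper and the example tok/s figures are illustrative, not from the benchmark):

```python
def sd_viable(draft_tps, target_tps, min_ratio=2.5):
    """Viability rule from the benchmark: speculative decoding only pays off
    when the draft decodes at least ~2.5x faster than the target."""
    return draft_tps / target_tps >= min_ratio

# e.g. a draft at 150 tok/s against a target at 45 tok/s (ratio ~3.3):
ok = sd_viable(150, 45)
```

Below the 2.5× ratio, the time spent generating draft tokens eats the savings from batched verification, so SD is not worth enabling.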
Cross-Domain Personal Data Integration on Consumer Hardware
Integrating finance, diet, mood, and reading data entirely on consumer Apple Silicon, producing emergent cross-domain insights with zero data leakage.
- 1.48x cross-domain insight emergence (IIR)
- 125.5x federation compression, zero data leakage
- 49.9 TPS real-time inference (35B on M2 Ultra)
ANE Batch Prefill for On-Device Parallel LLM Inference
Enabling concurrent ANE prefill and GPU decode on Apple Silicon via fused batch matrix-vector kernels, achieving 11.3x speedup over sequential dispatch.
- 11.3x batch dispatch speedup (268 tok/s)
- 79% power reduction with concurrent pipeline
- 27ms TTFT on multi-turn conversations
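The fusion idea is that N separate matrix-vector dispatches over a shared weight matrix are mathematically one matrix-matrix product, so a single kernel launch can amortize weight loads across the whole batch. A numpy sketch of the equivalence (the shapes are arbitrary; this illustrates the algebra, not the ANE kernel):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 8)).astype(np.float32)    # shared weight matrix
xs = rng.standard_normal((16, 8)).astype(np.float32)  # 16 token vectors

# Sequential dispatch: one matrix-vector product per token.
seq = np.stack([W @ x for x in xs])

# Fused batch dispatch: a single matrix-matrix product over all tokens,
# letting one kernel launch reuse W across the whole batch.
fused = xs @ W.T

assert np.allclose(seq, fused, atol=1e-4)
```

On bandwidth-bound hardware the fused form wins because W is read from memory once instead of once per token, which is where the reported dispatch speedup comes from.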
Disaggregated LLM Inference on Apple Silicon
Benchmarking CoreML ANE prefill with MLX GPU decode for Qwen3.5 on Apple Silicon, comparing four inference strategies.
- ANE prefill matches GPU at ~410 tokens
- 282x GPU power reduction during prefill
- 4 inference pipelines benchmarked
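The ~410-token crossover suggests a simple routing policy: use the GPU for short prompts where it is still faster, and the ANE at or beyond the crossover, where it matches GPU speed at far lower power. A sketch of that policy (the `pick_prefill_backend` helper is hypothetical; only the crossover figure comes from the benchmark):

```python
CROSSOVER_TOKENS = 410  # prompt length where ANE prefill matches the GPU

def pick_prefill_backend(prompt_tokens, crossover=CROSSOVER_TOKENS):
    """Route prefill: below the crossover the GPU finishes faster; at or
    above it the ANE matches GPU speed at a fraction of the power."""
    return "ane" if prompt_tokens >= crossover else "gpu"
```

Decode would still run on the GPU in either case; only the prefill stage is routed.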
On-Device Text-to-Speech
Native Swift implementation of Qwen3 TTS 0.6B for real-time, on-device speech synthesis.
- 67% model compression (2.35 GB → 808 MB)
- Real-time synthesis (RTF 0.68x)
- 12 languages supported
On-Device Vision Language Model
Multi-stage compression pipeline for deploying Gemma 3 4B VLM on consumer hardware.
- 25% model compression (2.8 GB → 2.1 GB)
- 110 tok/s text generation
- 3.4x image processing speedup
MLX Memory Optimization Research
Exploring memory optimization techniques for the MLX framework on Apple Silicon.
- Up to 20x faster mmap loading
- Zero-copy model loading
- Comprehensive benchmarks
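Zero-copy loading means the weight file is mapped into the address space rather than read into RAM, so "loading" is near-instant and only the pages actually touched are faulted in. A minimal numpy illustration of the pattern (a toy file, not MLX's loader):

```python
import numpy as np
import tempfile, os

# Save a weight tensor, then map it back without copying it into RAM.
path = os.path.join(tempfile.mkdtemp(), "weights.npy")
np.save(path, np.arange(1024, dtype=np.float32).reshape(32, 32))

weights = np.load(path, mmap_mode="r")  # pages fault in lazily from disk
row_sum = float(weights[3].sum())       # touches only the pages holding row 3
```

Because no upfront copy happens, load time scales with the pages accessed instead of the model size, which is where large mmap loading speedups come from.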