We're working around the clock to bring you the ultimate on-device AI experience. Below are our latest real-device benchmarks — every number measured on actual hardware, not simulators.
9B parameter model, 20-turn continuous conversation, 20,000+ token context — equivalent to a short book.
tok/s = tokens generated per second, higher is smoother. Standard = stock open-source MLX.
| Turn | Context (tokens) | Edge Runtime (tok/s) | Standard (tok/s) | Gain |
|---|---|---|---|---|
| T1 | 1K | 16.5 | 17.8 | -7% |
| T5 | 5K | 9.6 | 6.8 | +41% |
| T10 | 10K | 6.9 | 5.4 | +28% |
| T15 | 16K | 4.8 | 3.6 | +33% |
| T20 | 21K | 3.5 | 1.8 | +94% |
Qwen3.5-9B · 20-turn deep technical discussion · Edge Runtime
Standard implementation crashes on turn 2 on iPhones
Without Edge Runtime's core inference algorithms, the stock implementation crashes on turn 2 on both the iPhone 17 Pro and the iPhone 17e; with them, all 20 turns complete smoothly on the same devices.
We tested 20 turns totaling 21,000+ tokens. With our current algorithms we can sustain roughly 26,000 tokens of continuous conversation, the equivalent of reading and remembering a short book, all on your phone.
Note: iPhone Air, iPhone 17 Pro, and Pro Max all have 12GB of physical RAM, yet iOS limits each app to roughly 6GB. We're not sure why Apple imposes this ceiling. If Apple unlocks more memory for apps in the future, we believe on-device AI performance will be even more impressive.
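The budget is something an app can observe directly. Below is a minimal sketch using Apple's public `os_proc_available_memory` API (iOS 13+) to check how much of the per-app allotment remains before loading a model; the gigabyte formatting is ours, not part of any SDK.

```swift
import Foundation
import os

// Ask iOS how much memory this process may still allocate.
// On a 12GB iPhone 17 Pro, this tops out near the ~6GB
// per-app ceiling described in the note above.
let remainingBytes = os_proc_available_memory()
let remainingGB = Double(remainingBytes) / 1_073_741_824
print(String(format: "Remaining app memory budget: %.2f GB", remainingGB))
```

Apps that need more headroom can also request Apple's `com.apple.developer.kernel.increased-memory-limit` entitlement, which raises the ceiling somewhat on supported devices.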
AI-powered model analysis & surgical optimization
Intelligent analysis engine that automatically detects redundant layers, inefficient neurons, and optimization opportunities. Our proprietary 7-step progressive pipeline performs neuron-level surgical pruning — not coarse-grained compression — with real-time perplexity monitoring to guarantee output quality.
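As an illustration of that quality gate, here is a minimal sketch of perplexity-based acceptance: perplexity is the exponential of the mean negative log-likelihood over held-out tokens, and a pruning step is kept only while it stays within a small tolerance of the unpruned baseline. The function names and threshold are illustrative, not the pipeline's actual API.

```swift
import Foundation

/// Perplexity = exp(mean negative log-likelihood) over held-out tokens.
func perplexity(tokenLogProbs: [Double]) -> Double {
    let meanNLL = -tokenLogProbs.reduce(0, +) / Double(tokenLogProbs.count)
    return exp(meanNLL)
}

/// Keep a pruning step only if perplexity regresses by at most
/// `tolerance` (relative) versus the unpruned baseline.
func acceptPruningStep(baseline: Double, pruned: Double,
                       tolerance: Double = 0.02) -> Bool {
    pruned <= baseline * (1 + tolerance)
}

// Example: baseline PPL 8.41, PPL 8.52 after pruning one neuron group.
// 8.52 / 8.41 ≈ 1.013, within the 2% budget, so the step is kept.
let keep = acceptPruningStep(baseline: 8.41, pruned: 8.52)  // true
```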
Proprietary inference algorithms for Apple Silicon
Purpose-built inference engine with proprietary ANE-GPU co-scheduling, disaggregated inference architecture, and zero-copy model loading. Not a wrapper — original algorithms that achieve 11.3x prefill speedup and 79% GPU power reduction through ANE batch dispatch and concurrent pipeline execution.
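While the engine itself is proprietary, the idea behind zero-copy loading can be shown in a few lines: memory-map the weight file so pages are faulted in on demand rather than copied into the app heap. This is a conceptual sketch using Foundation's mapped-read option, not Edge Runtime's actual loader.

```swift
import Foundation

/// Memory-map a weight file instead of copying it into RAM.
/// With `.mappedIfSafe`, Foundation backs the Data with mmap'd
/// pages where possible, so loading a multi-gigabyte model
/// does not double its memory footprint.
func mapWeights(at url: URL) throws -> Data {
    try Data(contentsOf: url, options: .mappedIfSafe)
}
```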
The definitive edge AI deployment solution
The only end-to-end pipeline from optimized model to published App Store app. Integrates Edge Runtime's proprietary inference, On-Demand Resources for intelligent model delivery, and built-in ESG carbon tracking — a complete deployment solution that no other platform offers.
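Model delivery builds on Apple's public On-Demand Resources API. A minimal sketch, assuming the weights are bundled under a hypothetical "qwen3.5-0.8b" tag:

```swift
import Foundation

// Fetch model weights tagged in the app bundle as On-Demand Resources.
// NSBundleResourceRequest is Apple's public ODR API; the tag is ours.
let request = NSBundleResourceRequest(tags: ["qwen3.5-0.8b"])
try await request.beginAccessingResources()
// The tagged files are now accessible via Bundle.main; call
// request.endAccessingResources() once the model is unloaded.
```

This keeps the initial App Store download small and lets the model arrive only when the user first needs it.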
```swift
import EdgeInference

// Create the engine and load a model from the local registry.
let engine = LLMEngine()
try await engine.load(config: .find(modelID: "qwen3.5-0.8b")!)

// Stream generated tokens to the console as they arrive.
for try await chunk in engine.generate(
    messages: [.user("What is edge AI?")]
) {
    print(chunk.text, terminator: "")
}
```

5 lines of Swift: load a model, stream tokens. That's it.
Sign up to be notified when the AtomGradient Edge suite is publicly available. We'll send setup guides and invite you to our developer preview.
Hundreds of developers already on the waitlist
AtomGradient is bringing intelligence to every edge, *not just Apple*. Stay tuned.