Engineering
May 24, 2026
12 min read

Two Phones, 200 Turns of Deep Conversation: The On-Device AI Computing Revolution

We ran a 9-billion-parameter large language model on an iPhone Air and an iPhone 17 Pro, with each phone completing 200 rounds of continuous deep technical conversation. No cloud, no network, no interruption. Each run took about 5 to 7 hours and produced roughly 200,000 tokens, with no sustained speed degradation as context grew.

What Does This Mean?

Imagine sitting in a café with a senior architect, designing a distributed message queue system from scratch. You talk for an entire afternoon—from core architecture to consensus protocols, from disaster recovery to multi-region deployment, from performance tuning to capacity planning. 200 back-and-forth exchanges, each one a deep technical discussion.

All of this happened on the phone in your pocket.

No WiFi needed. No 5G needed. No need to send your conversation to any server. Your thoughts, your questions, your data never left your device.

The Data Speaks

We ran the same test on two different iPhones to verify reproducibility:

| Metric | iPhone Air | iPhone 17 Pro |
|---|---|---|
| Chip | A19 Pro | A19 Pro |
| Memory | 12 GB | 12 GB |
| Model | 9B params (4-bit quantized) | Same |
| Turns | 200 | 200 |
| Total output | 204,800 tokens | 204,800 tokens |
| Duration | ~6.5 hours | ~5.1 hours |
| Avg speed | 8.2 tokens/sec | 11.5 tokens/sec |
| Speed at turn 200 (T200) | 8.75 tokens/sec | 11.4 tokens/sec |
| Peak memory | 5.5 GB | 5.5 GB |
| Memory growth | +166 MB | +166 MB |
| Crashes | 0 | 0 |
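
A quick sanity check of the table's totals, as a minimal sketch. It assumes every turn generated exactly the 1,024-token per-turn limit and that the reported durations are wall-clock time for the whole run. The helper names are illustrative, not part of any real tooling.

```python
# Figures copied from the results table; helper names are illustrative.
TURNS = 200
TOKENS_PER_TURN = 1024  # assumption: every turn hit the per-turn limit


def total_tokens(turns: int, tokens_per_turn: int) -> int:
    """Total generated tokens if each turn produces tokens_per_turn."""
    return turns * tokens_per_turn


def end_to_end_rate(tokens: int, hours: float) -> float:
    """Tokens per second over the whole run, from wall-clock duration."""
    return tokens / (hours * 3600)


total = total_tokens(TURNS, TOKENS_PER_TURN)   # 204800
air_rate = end_to_end_rate(total, 6.5)         # iPhone Air, ~6.5 h
pro_rate = end_to_end_rate(total, 5.1)         # iPhone 17 Pro, ~5.1 h
print(total, round(air_rate, 2), round(pro_rate, 2))
```

The end-to-end rates land close to the reported averages (8.2 and 11.5 tokens/sec). Exact agreement is not expected, because the durations are rounded and the table does not say how the average speed was computed.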

Three counterintuitive findings:

1. Speed did not keep declining—it recovered. On iPhone Air, T1 speed was ~9.1 tokens/sec, dropping to ~7.3 at T20, then steadily recovering to 8.75 at T200. iPhone 17 Pro stayed stable at 11–13 tokens/sec throughout. No sustained degradation.

2. Memory barely grew. Both devices added only 166 MB over their full runs (~6.5 and ~5.1 hours). There was no linear memory bloat, and the system memory limit was never triggered across the entire 200-turn window.

3. Hours later, the model still retained threads from early discussions. At diagnostic recall checkpoints, the model could still cover most early architectural points (iPhone Air ~85%, iPhone 17 Pro ~77%). This is an auxiliary metric, not a complete long-term memory evaluation, but it suggests that on-device hybrid inference can retain early context over a long session.

We Also Found the Hardware’s True Boundary

During testing, we tried increasing the per-turn generation limit from 1,024 to 2,048 tokens. Result: the iPhone 17 Pro ran out of memory (OOM) and crashed at turn 13, with only 630 MB available.

This tells us: under current engine and model conditions, 1,024 tokens per turn is the verified stable operating point; 2,048 exposed insufficient memory safety margin. KV cache and intermediate tensor growth made longer single-turn generation unreliable.
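
To make the KV cache part of that statement concrete, here is a minimal sketch of the standard size formula for an attention KV cache. The layer count, head count, and head dimension below are hypothetical placeholders, not the tested model's configuration, so the output does not reproduce the 630 MB figure. It only shows that the cache added within a single turn scales linearly with the per-turn token limit.

```python
def kv_cache_bytes(tokens: int, attn_layers: int, kv_heads: int,
                   head_dim: int, bytes_per_value: int = 2) -> int:
    """Bytes needed to cache keys and values for `tokens` positions.

    The leading 2 accounts for one key tensor plus one value tensor
    per attention layer. bytes_per_value=2 assumes fp16 storage.
    """
    return 2 * attn_layers * kv_heads * head_dim * bytes_per_value * tokens


# HYPOTHETICAL model shape, chosen only for illustration.
HYPOTHETICAL_SHAPE = dict(attn_layers=8, kv_heads=4, head_dim=128)

per_turn_1k = kv_cache_bytes(1024, **HYPOTHETICAL_SHAPE)
per_turn_2k = kv_cache_bytes(2048, **HYPOTHETICAL_SHAPE)

MIB = 2 ** 20
print(per_turn_1k // MIB, "MiB vs", per_turn_2k // MIB, "MiB per turn")
```

Under these assumed dimensions, doubling the per-turn limit doubles the KV cache added before any trimming policy can run. Intermediate tensors are not modeled here.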

This finding is valuable in its own right: it helps us precisely define the “safe operating zone” for on-device 9B inference and points the way for future memory optimization.

Why This Matters

Computational Sustainability: From “Use and Discard” to “Continuous Companionship”

Today’s AI assistants are mostly stateless—you ask, it answers, close the window and everything resets. Even the most advanced cloud models show “forgetting” and “degradation” in ultra-long conversations.

Our tests demonstrate a different possibility: AI can work alongside you for hours like a real colleague, retaining a significant proportion of early context signals across long discussions.

This is not the future. This is now, happening on a phone.

Computational Universality: From Data Centers to Everyone’s Pocket

Previously, stable long-duration inference with a 9-billion-parameter model required desktop-class devices, dedicated GPUs, or cloud services. This test ran on two consumer iPhones with 12 GB of physical memory, where iOS’s jetsam memory limits leave an app a window of only ~6 GB. Even so, each phone completed 200 turns and about 200,000 tokens of continuous conversation.
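
A back-of-the-envelope estimate shows why 4-bit quantization is what makes a 9B model fit in that window. This is a rough sketch: it ignores quantization scales and zero-points, any layers kept at higher precision, and runtime buffers, and it uses decimal gigabytes. The function name is illustrative.

```python
def quantized_weight_bytes(params: int, bits_per_weight: float) -> float:
    """Approximate weight storage, ignoring scales and other metadata."""
    return params * bits_per_weight / 8


PARAMS = 9_000_000_000       # 9B parameters, as stated in the post
REPORTED_PEAK_GB = 5.5       # peak memory from the results table

weights_4bit_gb = quantized_weight_bytes(PARAMS, 4) / 1e9
weights_fp16_gb = quantized_weight_bytes(PARAMS, 16) / 1e9
headroom_gb = REPORTED_PEAK_GB - weights_4bit_gb

print(weights_4bit_gb, weights_fp16_gb, headroom_gb)
```

By this estimate the 4-bit weights take about 4.5 GB, leaving roughly 1 GB of the reported 5.5 GB peak for caches, recurrent state, and activations. The same model in fp16 would need about 18 GB, more than the phone's physical memory.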

This means world-class AI capability is no longer the privilege of tech companies and research institutions. A rural teacher can have an AI teaching assistant without internet; an indie developer can pair-program with AI on a plane; a doctor can use AI to organize medical records in areas with no signal.

Democratizing computing power isn’t about giving everyone access to cloud APIs—it’s about making every device inherently intelligent enough.

Computational Equity: Privacy Is Not a Luxury

When AI capabilities run entirely on-device, a profound change follows: your data never needs to leave your phone.

No uploads, no server logs, no third party peeking at your AI conversations. Your medical consultations, financial planning, personal journal, startup ideas—all exist only on the device in your hand.

In an era of frequent data breaches, this isn’t a feature—it’s a right. On-device AI significantly reduces the data leakage surface from cloud transmission and server logs, transforming privacy protection from “trusting a company’s promise” to “architectural-level assurance.”

What We Did Technically

We didn’t use bigger chips or more memory. iPhone Air and iPhone 17 Pro aren’t servers or GPU workstations—they’re mass-produced consumer phones.

What we did was make software smarter about using hardware:

Intelligent Memory Management. Traditional LLM attention caches grow linearly with conversation length, eventually exhausting memory. Our inference algorithm, designed for next-generation hybrid architecture models (many fixed-state recurrent layers + few attention layers), implements controlled KV memory policies that prevent memory from growing unboundedly with conversation length. Like the human brain: short-term memory constantly refreshes while core understanding remains stable.
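
One simple way to keep KV memory bounded is a sliding window that evicts the oldest entries once a fixed capacity is reached. The sketch below illustrates that general idea only. It is an assumption chosen for clarity, not a description of the actual policy used in the engine, and the class name and window size are hypothetical.

```python
from collections import deque


class SlidingWindowKVCache:
    """Keeps at most max_tokens (position, key, value) entries.

    Illustrative only: a real engine would store tensors, not Python
    objects, and may use a very different eviction policy.
    """

    def __init__(self, max_tokens: int):
        # deque with maxlen drops the oldest entry on overflow
        self._entries = deque(maxlen=max_tokens)

    def append(self, position: int, key, value) -> None:
        self._entries.append((position, key, value))

    def oldest_position(self) -> int:
        return self._entries[0][0]

    def __len__(self) -> int:
        return len(self._entries)


WINDOW = 4096              # hypothetical capacity
TOTAL_TOKENS = 204_800     # total output from the 200-turn run

cache = SlidingWindowKVCache(max_tokens=WINDOW)
for pos in range(TOTAL_TOKENS):
    cache.append(pos, key=None, value=None)  # placeholders for tensors

print(len(cache), cache.oldest_position())
```

After 204,800 tokens the cache still holds only 4,096 entries, so its memory footprint is flat. In a hybrid model, information older than the window would have to be carried by the fixed-size recurrent state instead.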

Computational Efficiency Optimization. Through proprietary Metal kernel and inference scheduling optimizations, we reduced dispatch and materialization overhead during inference. Less overhead means more of the available compute goes to the model itself.

Long-Range Stable Inference Engine. Traditional inference engines slow down in long conversations because attention computation grows with context length. Our hybrid inference algorithm breaks this limit—leveraging the constant computation cost of recurrent state layers to significantly reduce sustained degradation from long contexts. This is why speed remains stable at turn 200.
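
The scaling argument can be illustrated with a toy per-token cost model. Everything here is a hypothetical simplification: costs are in arbitrary units, each layer has a fixed cost, each attention layer adds a cost proportional to the number of cached tokens it attends over, and the layer counts and KV window are invented for the example.

```python
from typing import Optional


def per_token_cost(context: int, attn_layers: int, recurrent_layers: int,
                   fixed_cost: int = 4096,
                   kv_window: Optional[int] = None) -> int:
    """Toy cost of generating one token at a given context length."""
    # Attention only sees what is cached; a bounded cache caps this.
    attended = context if kv_window is None else min(context, kv_window)
    layers = attn_layers + recurrent_layers
    return layers * fixed_cost + attn_layers * attended


def slowdown(short_ctx: int, long_ctx: int, **model) -> float:
    """How much more one token costs at long_ctx than at short_ctx."""
    return per_token_cost(long_ctx, **model) / per_token_cost(short_ctx, **model)


# HYPOTHETICAL 32-layer configurations.
pure_attention = dict(attn_layers=32, recurrent_layers=0)
hybrid = dict(attn_layers=4, recurrent_layers=28)
hybrid_bounded = dict(attn_layers=4, recurrent_layers=28, kv_window=4096)

s_pure = slowdown(1024, 204_800, **pure_attention)
s_hybrid = slowdown(1024, 204_800, **hybrid)
s_bounded = slowdown(1024, 204_800, **hybrid_bounded)
print(round(s_pure, 1), round(s_hybrid, 1), round(s_bounded, 2))
```

In this toy model, replacing most attention layers with constant-cost recurrent layers shrinks the long-context slowdown substantially, and capping the attended context brings it close to 1. Real kernels, memory bandwidth, and thermal behavior are not captured.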

This Is Just the Beginning

200 rounds of conversation show that for this class of long-duration continuous inference tasks, hardware is no longer the only bottleneck. The real frontier is the model’s “long-term memory”, and that is where we want to do better.

Our next steps:

  • Memory Enhancement: Help models actively summarize and compress early information in ultra-long conversations instead of passively forgetting.
  • Multi-Device Collaboration: Your iPhone handles daily conversations, your MacBook handles complex reasoning, and your Mac Studio handles training, with all devices forming a private AI network.
  • Continuous Learning and Personalization: We are applying our proprietary RPP algorithm and HALO architecture to on-device continuous learning and personalization validation, with the goal of models that gradually understand your preferences, habits, and knowledge structure locally. Unlike cloud-based personalization, your learning data never leaves the device, and personalized updates happen entirely on-device.

On-device AI is not a cheap substitute for cloud AI. It is an entirely new computing paradigm—always on, fully private, continuously evolving, belonging to everyone.

AtomGradient: Bringing AI to the Edge