exo: Run Llama 405B on Old Phones and Laptops

Cloud APIs drain budgets. Single devices hit memory limits. exo lets you cluster everyday hardware—phones, old laptops, Raspberry Pis—to run models like Llama 405B without configuration. But with 160 unresolved issues and network latency concerns, it's exciting alpha software, not production-ready infrastructure.


Running Llama 405B through a hosted API costs thousands for non-trivial workloads. Single machines hit memory walls trying to load 70B+ parameter models. exo takes a different approach: cluster your old phones, laptops, and Raspberry Pis into a distributed inference engine that runs massive models locally—no configuration required.

Since launching in June 2024, the project has climbed past 40,000 stars and grabbed the #1 trending spot on GitHub with 615 stars in a single day. But 160 open issues against only 49 closed in the last 90 days tell a different story.

The Cloud Bill Problem

As model sizes pushed past 100B parameters, inference costs scaled with them. Developers running Llama 70B locally watch RAM counters max out. Cloud APIs drain budgets on anything beyond prototyping. Large model inference became too expensive to rent, too big to run.

exo's core mechanism targets that gap. It dynamically partitions models across heterogeneous devices using pipeline parallelism—splitting layers across whatever hardware you connect. An iPhone talks to a MacBook talks to a Raspberry Pi, all discovered automatically through peer-to-peer networking.
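To make the layer splitting concrete, here is a minimal sketch of memory-weighted pipeline partitioning: devices with more memory take more contiguous layers, smaller ones take fewer. The device names, memory figures, and the `partition_layers` helper are illustrative assumptions, not exo's actual API or partitioning strategy.

```python
# Illustrative sketch: split a model's layers across devices in proportion
# to each device's available memory. Hypothetical names, not exo's API.

def partition_layers(num_layers: int, devices: dict[str, float]) -> dict[str, range]:
    """Assign contiguous layer ranges, weighted by each device's memory (GB)."""
    total_mem = sum(devices.values())
    assignments, start = {}, 0
    for i, (name, mem) in enumerate(devices.items()):
        # The last device takes whatever remains, so rounding never drops a layer.
        count = num_layers - start if i == len(devices) - 1 else round(num_layers * mem / total_mem)
        assignments[name] = range(start, start + count)
        start += count
    return assignments

if __name__ == "__main__":
    cluster = {"macbook-pro": 36.0, "iphone-15": 8.0, "raspberry-pi-5": 8.0}  # assumed specs
    for device, layers in partition_layers(126, cluster).items():  # Llama 3.1 405B has 126 layers
        print(f"{device}: layers {layers.start}-{layers.stop - 1}")
```

A real scheduler has to weigh more than memory (compute speed, link bandwidth), but the proportional split is the core idea: no single device ever has to hold the whole model.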

What You Can Actually Run

The project supports iOS, Android, Mac, and Linux. Developers report running Llama 405B across four to five mixed devices. LLaVA works on heterogeneous clusters. For faster interconnect between compatible machines, exo supports RDMA over Thunderbolt.

It exposes a ChatGPT-compatible API endpoint. Route your existing inference calls to localhost instead of api.openai.com.
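Because the endpoint mimics the Chat Completions API, existing client code only needs a new base URL. A minimal sketch using the official `openai` Python package; the port and model name below are assumptions, so use whatever address and model your running exo instance actually reports.

```python
from openai import OpenAI

# Point the standard OpenAI client at the local exo endpoint instead of api.openai.com.
# Port and model name are assumptions for this example; check exo's startup output.
client = OpenAI(base_url="http://localhost:52415/v1", api_key="not-needed-locally")

response = client.chat.completions.create(
    model="llama-3.1-405b",
    messages=[{"role": "user", "content": "Summarize pipeline parallelism in one sentence."}],
)
print(response.choices[0].message.content)
```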

The Reality Check

The backlog of unresolved issues shows hardware compatibility problems, model loading errors, and performance optimization needs. Network latency over consumer WiFi creates unpredictable delays during autoregressive token generation—each token waits for network round-trips across your home cluster.
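A back-of-envelope estimate shows why those round-trips dominate interactive use. Every figure below is an illustrative assumption, not a measurement of exo:

```python
# Rough per-token latency estimate for pipelined generation over home WiFi.
# All numbers are assumed for illustration, not measured exo performance.
devices = 4                   # devices chained in the pipeline
hop_latency_ms = 10.0         # assumed one-way latency per WiFi hop
compute_ms_per_token = 40.0   # assumed total compute time per token across all devices

# Each token's activations must cross every hop before the next token can start.
network_ms = (devices - 1) * hop_latency_ms
total_ms = compute_ms_per_token + network_ms

print(f"~{total_ms:.0f} ms/token -> ~{1000 / total_ms:.1f} tokens/s "
      f"({network_ms / total_ms:.0%} spent on the network)")
```

Under those assumptions the cluster spends roughly 43% of each token waiting on the network, and any WiFi jitter lands directly in that per-token budget.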

Early-stage bugs undercut stability across platforms. The Hacker News launch thread surfaced concerns about distributed inference reliability on consumer networks. No companies or open source projects are publicly listed as using exo in production environments.

Where exo Stands Apart

vLLM handles tensor and pipeline parallelism for GPU clusters. Ray distributes LLM inference across enterprise infrastructure. Both assume datacenter hardware and network conditions.

exo targets everyday devices without GPU requirements or master-worker hierarchies. Devices contribute equally regardless of specs. The zero-config P2P discovery removes deployment friction that makes enterprise frameworks impractical for home labs.
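For intuition, zero-config discovery on a LAN can be as simple as each node broadcasting its presence over UDP and listening for peers doing the same. The sketch below is a generic illustration of that pattern, not exo's actual discovery protocol or message format:

```python
import json
import socket
import time

# Generic LAN peer-announcement sketch (illustrative only, not exo's protocol).
BROADCAST_ADDR, PORT = "255.255.255.255", 50505  # port chosen arbitrarily for the example

def announce(node_id: str, memory_gb: float) -> None:
    """Broadcast this node's presence and capabilities so peers can find it."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
    payload = json.dumps({"node_id": node_id, "memory_gb": memory_gb, "ts": time.time()})
    sock.sendto(payload.encode(), (BROADCAST_ADDR, PORT))
    sock.close()

def listen(timeout_s: float = 5.0) -> list[dict]:
    """Collect announcements from other nodes on the same network."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("", PORT))
    sock.settimeout(timeout_s)
    peers = []
    try:
        while True:
            data, addr = sock.recvfrom(4096)
            peers.append({"addr": addr[0], **json.loads(data)})
    except socket.timeout:
        pass
    finally:
        sock.close()
    return peers
```

No central coordinator, no config file, no fixed IP list: a node joins the cluster by being switched on and reachable on the same network.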

Production Readiness: Not Yet

Stability isn't there. Latency remains unpredictable on home networks. The rapid issue accumulation suggests core reliability problems still need solving. Production workloads requiring SLAs should look elsewhere.

Who Should Use exo Now (and Who Should Wait)

This fits ML tinkerers with spare hardware lying around who want to avoid cloud lock-in. Indie developers experimenting with local inference on a budget. Anyone learning distributed systems concepts with tangible hardware.

Wait if you need production reliability or predictable performance. The gap between GitHub stars and closed issues signals alpha software, not infrastructure you'd bet a product launch on. But for cutting cloud bills during experimentation—while accepting rough edges—exo delivers on its promise: running models too large for one machine by making your drawer of old devices useful again.


exo-explore/exo
Run frontier AI locally.
40.7k stars · 2.7k forks