LLMs at the edge - a false promise
Disclaimer: I am an amateur technologist. My writing should be treated as an opinion piece rather than a rigorous, evidence-based analysis.
Llama was a watershed moment in open-source AI. Even though its restrictive license prohibits anyone from using the model or its derivatives for commercial applications, it proved to be a high-quality open LLM that the community could build on publicly. Combined with computationally efficient fine-tuning techniques like LoRA (sketched below), Llama set off an explosion of model diversity. But the open-source hype train carried one more thread: an interest in edge inferencing. I want to review where edge LLMs might be appropriate.
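For readers unfamiliar with LoRA, here is a minimal sketch of what such a fine-tune looks like using Hugging Face's peft library. The checkpoint id and hyperparameters are illustrative assumptions, not a recipe:

```python
# A minimal sketch of LoRA fine-tuning setup with Hugging Face's peft library.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Hypothetical checkpoint id - substitute a base model you have access to.
base = AutoModelForCausalLM.from_pretrained("your-org/llama-7b-base")

config = LoraConfig(
    r=8,                                  # rank of the low-rank update matrices
    lora_alpha=16,                        # scaling factor for the update
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, config)
# Typically reports well under 1% of parameters as trainable, which is why
# LoRA fine-tunes are cheap to train and small to distribute.
model.print_trainable_parameters()
```

Because only the small adapter matrices are trained, anyone with a modest GPU can produce a new model variant, which is what drove the diversity mentioned above.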
Edge inferencing has several benefits:
Latency: Network delays are a large - and often the largest - component of response times. Moreover, as anyone who has consumed cloud APIs knows, network delays are highly variable, which frustrates end users. By inferencing at the edge, we largely eliminate this source of latency.
Compute: GPUs are in high demand right now. Even if a company manages to get its hands on enough GPUs to power inference for millions of users, these chips are not cheap. By shifting the compute to the edge, a provider drives its marginal cost of inference down to effectively zero.
Low internet dependence: This is often cited as a benefit of edge applications, but for LLMs I think the gains are minimal. It is safe to assume that most people using code generation, text processing, and similar features are already connected to the internet.
Edge inferencing does have some glaring drawbacks too:
Updates: In a field as dynamic as AI, model checkpoints are created almost daily. In a cloud architecture, a superior model can be adopted easily by exposing a simple consumption API that abstracts away the model details (case in point: the Hugging Face Inference API; see the sketch after this list). In contrast, even assuming that package and config requirements stay constant, edge devices have to download each new checkpoint - some of which run into multiple gigabytes even after compression.
Quality: This is perhaps the single biggest drawback of edge LLMs. The most popular LLMs for text generation, in both usage and quality, are OpenAI's GPT family. In the open-source world, Llama and Falcon are the leaders, with Llama even sporting impressive edge performance on Macs. Even so, generation speed is slower than what a typical consumer is used to - especially on mobile devices.
We don’t need it: AI is a sustaining innovation. It is better suited to adding features to existing platforms and products than to delivering new products from scratch. If you are already shipping SaaS or a consumer product over the web or through an internet-enabled app, it is easier to integrate and deliver the AI feature over the internet, just like the rest of the product. Not to mention, if your AI features are powered by a proprietary model or a fine-tune on a closed dataset, you must deliver them over the cloud anyway.
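To make the Updates point concrete, here is a minimal sketch of the kind of consumption API that makes model swaps invisible to clients. The endpoint URL and model name are hypothetical; the Hugging Face Inference API mentioned above follows the same pattern:

```python
# A minimal sketch of a client against a hosted inference endpoint.
# The model id and token below are hypothetical placeholders.
import requests

API_URL = "https://api-inference.huggingface.co/models/my-org/my-model"
HEADERS = {"Authorization": "Bearer <your-api-token>"}

def generate(prompt: str) -> str:
    # The client only knows the endpoint and the payload schema. If the
    # provider swaps in a better checkpoint behind this URL tomorrow,
    # nothing here changes - and no multi-gigabyte download hits the device.
    response = requests.post(API_URL, headers=HEADERS, json={"inputs": prompt})
    response.raise_for_status()
    return response.json()[0]["generated_text"]

print(generate("Explain edge inference in one sentence."))
```

An edge deployment has no equivalent of this indirection: the checkpoint lives on the device, so every model upgrade is a large download.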
To conclude, I think edge LLMs will find commercially viable use cases - especially if Apple makes it more efficient to run them on its A-series chips. But unless we beat the scaling laws and restrict ourselves to open data and models (unlikely), it will make sense more often than not to run LLMs in the cloud - whether public or private.