GPU frameworks waste 92% of their time on overhead — sending tasks one by one instead of all at once. I proved it and fixed it. Two preprints, two domains, one technique.
No jargon. Here's the intuition.
GPU frameworks (PyTorch, JAX, WebLLM) send one small task to the GPU, wait for it to finish, then send the next. For a 64-token generation with 4 layers and 4 GPU operations per layer, that's 64 × 4 × 4 = 1,024 separate round-trips. Each round-trip takes longer than the actual math.
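To see where a number like 92% can come from, here's a back-of-envelope sketch in Python. The per-dispatch and per-op timings are illustrative assumptions, not measurements from the papers:

```python
# Back-of-envelope model of per-operation GPU dispatch overhead.
# All timing constants below are assumptions for illustration only.

TOKENS = 64          # tokens generated sequentially
LAYERS = 4           # transformer layers
OPS_PER_LAYER = 4    # assumed GPU ops per layer (attention, MLP, norms, ...)

DISPATCH_OVERHEAD_US = 60   # assumed CPU->GPU round-trip cost per op (µs)
COMPUTE_US = 5              # assumed useful math time per op (µs)

round_trips = TOKENS * LAYERS * OPS_PER_LAYER
total_overhead = round_trips * DISPATCH_OVERHEAD_US
total_compute = round_trips * COMPUTE_US
idle_fraction = total_overhead / (total_overhead + total_compute)

print(f"round trips: {round_trips}")              # 1024
print(f"GPU idle fraction: {idle_fraction:.0%}")  # 92% with these numbers
```

With these (assumed) numbers, the GPU spends 92% of wall-clock time waiting on dispatches. The exact constants vary by framework and driver; the point is that overhead scales with the *count* of round-trips, not the amount of math.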
Pack the entire computation — all tokens, all layers, all operations — into a single GPU instruction. The GPU loops internally. No round-trips. No waiting. Same math, same result.
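A minimal Python sketch of the difference, using a stand-in `dispatch` counter instead of a real GPU API (all names here are illustrative, not the papers' code):

```python
# Contrast: one dispatch per operation vs. one dispatch for everything.
# `dispatch` stands in for a real GPU launch (CUDA kernel, WebGPU compute
# pass, ...); here it just counts calls.

calls = {"n": 0}

def dispatch(kernel, *args):
    calls["n"] += 1          # in reality: a CPU->GPU round trip
    return kernel(*args)

def layer_op(x):
    return x + 1             # stand-in for the actual layer math

# Naive: one round trip per layer, per token.
def generate_naive(tokens=64, layers=4):
    x = 0
    for _ in range(tokens):
        for _ in range(layers):
            x = dispatch(layer_op, x)
    return x

# Fused: the loops live *inside* one kernel; a single round trip.
def fused_kernel(tokens, layers):
    x = 0
    for _ in range(tokens):      # the GPU loops internally
        for _ in range(layers):
            x = layer_op(x)
    return x

def generate_fused(tokens=64, layers=4):
    return dispatch(fused_kernel, tokens, layers)

calls["n"] = 0
a = generate_naive()
naive_calls = calls["n"]     # 256 dispatches (one op per layer here)

calls["n"] = 0
b = generate_fused()
fused_calls = calls["n"]     # 1 dispatch

assert a == b                # same math, same result
print(naive_calls, fused_calls)  # 256 1
```

The fused version does identical work but crosses the CPU-GPU boundary once instead of hundreds of times. That boundary crossing, not the arithmetic, is what the speedups in both papers eliminate.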
Paper 1: 159-720× on sequential fitness evaluation across CUDA, WebGPU, JAX, Triton. Paper 2: 66-458× on transformer inference with a parallel kernel. Both on the same hardware as the baselines. All code and data public.
16,000 tokens per second at D=32 in Chrome. No Python, no CUDA, no cloud. A laptop running a browser outperforms PyTorch on the same GPU by up to 161×.
Not theoretical. Here's what's different tomorrow.
ChatGPT in your browser types 5 words per second. You assume your laptop isn't powerful enough.
Your GPU was idle 92% of the time. The waiting is eliminated. Same GPU, 6-458× faster.
Running AI locally means installing Python, CUDA, PyTorch, downloading model weights, debugging driver conflicts.
Open a browser tab. That's it. The AI runs on the GPU you already have, at near-native speed.
Every AI feature costs $2-4/hour in cloud GPU. 100K users = $50K/month in servers.
The user's GPU does the work. Server cost: $0. The browser IS the infrastructure.
A student in rural India can't afford a GPU cluster or cloud API credits to learn AI.
A $300 phone with Chrome can run transformer inference locally. No internet needed after model download.
Browser-based AI assistants could respond 6-458× faster. Not by buying better hardware — by fixing how the software talks to your GPU.
Run AI models live in the classroom. Every student's laptop becomes an AI workstation. No lab, no cloud account, no IT department.
Ship a live demo of your model as a URL. Reviewers run it in their browser instead of fighting with your Docker container.
Add AI features to your web app without GPU servers. Your users' devices do the compute. Scale to millions at zero marginal cost.
Healthcare, legal, finance — the AI runs on the device. Data never leaves the laptop. Compliance by architecture.
3 billion people have a WebGPU-capable device. Browser-native AI makes intelligence a capability your device already has, not a service you rent.
Run the benchmarks on your hardware, right now.