GPU frameworks waste 92% of their time on overhead — sending tasks one by one instead of all at once. I proved it and fixed it. Two preprints, two domains, one technique.
No jargon. Here's the intuition.
GPU frameworks (PyTorch, JAX, WebLLM) send one small task to the GPU, wait for it to finish, then send the next. For a 64-token generation with 4 layers and 4 GPU operations per layer, that's 64 × 4 × 4 = 1,024 separate round-trips. Each round-trip takes longer than the actual math.
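To see where a number like 92% can come from, here's a back-of-envelope sketch in Python. The per-dispatch and per-op timings are illustrative assumptions, not measurements from the papers:

```python
# Back-of-envelope model of per-operation GPU dispatch overhead.
# All timing constants below are assumptions for illustration only.

TOKENS = 64          # tokens generated sequentially
LAYERS = 4           # transformer layers
OPS_PER_LAYER = 4    # assumed GPU ops per layer (attention, MLP, norms, ...)

DISPATCH_OVERHEAD_US = 60   # assumed CPU->GPU round-trip cost per op (µs)
COMPUTE_US = 5              # assumed useful math time per op (µs)

round_trips = TOKENS * LAYERS * OPS_PER_LAYER
total_overhead = round_trips * DISPATCH_OVERHEAD_US
total_compute = round_trips * COMPUTE_US
idle_fraction = total_overhead / (total_overhead + total_compute)

print(f"round trips: {round_trips}")              # 1024
print(f"GPU idle fraction: {idle_fraction:.0%}")  # 92% with these numbers
```

With these (assumed) numbers, the GPU spends 92% of wall-clock time waiting on dispatches. The exact constants vary by framework and driver; the point is that overhead scales with the *count* of round-trips, not the amount of math.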
Pack the entire computation — all tokens, all layers, all operations — into a single GPU instruction. The GPU loops internally. No round-trips. No waiting. Same math, same result.
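A minimal Python sketch of the difference, using a stand-in `dispatch` counter instead of a real GPU API (all names here are illustrative, not the papers' code):

```python
# Contrast: one dispatch per operation vs. one dispatch for everything.
# `dispatch` stands in for a real GPU launch (CUDA kernel, WebGPU compute
# pass, ...); here it just counts calls.

calls = {"n": 0}

def dispatch(kernel, *args):
    calls["n"] += 1          # in reality: a CPU->GPU round trip
    return kernel(*args)

def layer_op(x):
    return x + 1             # stand-in for the actual layer math

# Naive: one round trip per layer, per token.
def generate_naive(tokens=64, layers=4):
    x = 0
    for _ in range(tokens):
        for _ in range(layers):
            x = dispatch(layer_op, x)
    return x

# Fused: the loops live *inside* one kernel; a single round trip.
def fused_kernel(tokens, layers):
    x = 0
    for _ in range(tokens):      # the GPU loops internally
        for _ in range(layers):
            x = layer_op(x)
    return x

def generate_fused(tokens=64, layers=4):
    return dispatch(fused_kernel, tokens, layers)

calls["n"] = 0
a = generate_naive()
naive_calls = calls["n"]     # 256 dispatches (one op per layer here)

calls["n"] = 0
b = generate_fused()
fused_calls = calls["n"]     # 1 dispatch

assert a == b                # same math, same result
print(naive_calls, fused_calls)  # 256 1
```

The fused version does identical work but crosses the CPU-GPU boundary once instead of hundreds of times. That boundary crossing, not the arithmetic, is what the speedups in both papers eliminate.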
Paper 1: 159-720× on sequential fitness evaluation across CUDA, WebGPU, JAX, Triton. Paper 2: 66-458× on transformer inference with a parallel kernel. Both on the same hardware as the baselines. All code and data public.
16,000 tokens per second at D=32 in Chrome. No Python, no CUDA, no cloud. A laptop running a browser outperforms PyTorch on the same GPU by up to 161×.
Not theoretical. Here's what's different tomorrow.
ChatGPT in your browser types 5 words per second. You assume your laptop isn't powerful enough.
Your GPU was idle 92% of the time. The waiting is eliminated. Same GPU, 6-458× faster.
Running AI locally means installing Python, CUDA, PyTorch, downloading model weights, debugging driver conflicts.
Open a browser tab. That's it. The AI runs on the GPU you already have, at near-native speed.
Every AI feature costs $2-4/hour in cloud GPU. 100K users = $50K/month in servers.
The user's GPU does the work. Server cost: $0. The browser IS the infrastructure.
A student in rural India can't afford a GPU cluster or cloud API credits to learn AI.
A $300 phone with Chrome can run transformer inference locally. No internet needed after model download.
Browser-based AI assistants could respond 6-458× faster. Not by buying better hardware — by fixing how the software talks to your GPU.
Run AI models live in the classroom. Every student's laptop becomes an AI workstation. No lab, no cloud account, no IT department.
Ship a live demo of your model as a URL. Reviewers run it in their browser instead of fighting with your Docker container.
Add AI features to your web app without GPU servers. Your users' devices do the compute. Scale to millions at zero marginal cost.
Healthcare, legal, finance — the AI runs on the device. Data never leaves the laptop. Compliance by architecture.
3 billion people have a WebGPU-capable device. Browser-native AI makes intelligence a capability your device already has, not a service you rent.
Run the benchmarks on your hardware, right now.