Kernel fusion eliminates per-dispatch overhead by packing entire computations into single GPU dispatches. Proven across evolutionary algorithms and transformer inference, with up to 720× speedup. Zero installation. Any browser.
How frameworks work
dispatch step 1 → wait → dispatch step 2 → wait... × 1,500 steps = 22,500 round-trips
92% of time = waiting, not computing
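As a back-of-the-envelope model of where that 92% comes from (the per-dispatch latency, per-kernel compute time, and dispatches-per-step below are illustrative assumptions chosen to reproduce the figures above, not measurements):

```python
# Model of per-step dispatch: every GPU kernel launch costs a full
# CPU->GPU round-trip before the next launch can be issued.
DISPATCH_LATENCY_US = 92   # assumed round-trip cost (illustrative)
COMPUTE_US = 8             # assumed useful GPU work per dispatch (illustrative)
KERNELS_PER_STEP = 15      # assumed dispatches per optimization step
STEPS = 1_500

round_trips = STEPS * KERNELS_PER_STEP
total_us = round_trips * (DISPATCH_LATENCY_US + COMPUTE_US)
waiting_fraction = round_trips * DISPATCH_LATENCY_US / total_us

print(round_trips)                        # 22500
print(f"{waiting_fraction:.0%} waiting")  # 92% waiting
```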
Kernel fusion
dispatch once
→ GPU loops internally: 1,500 steps in 1 round-trip
100% of time = computing
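The fused counterpart under the same assumed numbers: the host pays the round-trip latency exactly once, and the step loop runs on-device, so essentially all wall-clock time is compute:

```python
# Fused dispatch: one round-trip total, then the GPU loops internally.
DISPATCH_LATENCY_US = 92     # assumed round-trip cost (illustrative)
COMPUTE_US_PER_KERNEL = 8    # assumed useful GPU work per kernel (illustrative)
KERNELS_PER_STEP = 15
STEPS = 1_500

total_compute_us = STEPS * KERNELS_PER_STEP * COMPUTE_US_PER_KERNEL
fused_wall_us = DISPATCH_LATENCY_US + total_compute_us  # one round-trip overall
unfused_wall_us = STEPS * KERNELS_PER_STEP * (DISPATCH_LATENCY_US + COMPUTE_US_PER_KERNEL)

print(f"{total_compute_us / fused_wall_us:.1%} computing")  # 99.9% computing
print(f"{unfused_wall_us / fused_wall_us:.1f}x faster")     # 12.5x faster
```

The speedup in this toy model is bounded by the latency-to-compute ratio; the measured speedups above vary with how much work each real kernel does.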
via WebGPU Compute Shaders
Fusing sequential fitness evaluations into single GPU dispatches eliminates per-step kernel launch overhead. Proven across CUDA, WebGPU, JAX/XLA, and Triton on two hardware platforms.
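A hardware-agnostic sketch of the pattern (the toy (1+1)-ES and sphere fitness below are illustrative stand-ins, not the benchmarked kernels): the fused version moves the entire generation loop inside one callable, the way the fused kernel moves it inside one dispatch:

```python
import random

def sphere(x):
    """Toy fitness: sum of squares (lower is better)."""
    return sum(v * v for v in x)

def fused_es(steps, dim=4, sigma=0.1, seed=0):
    """Run every generation of a (1+1)-ES inside one call -- the host
    'dispatches' once and the loop runs to completion internally."""
    rng = random.Random(seed)
    parent = [rng.uniform(-1, 1) for _ in range(dim)]
    best = sphere(parent)
    for _ in range(steps):
        child = [v + rng.gauss(0, sigma) for v in parent]
        f = sphere(child)
        if f < best:            # greedy selection: keep the better point
            parent, best = child, f
    return best

print(fused_es(1_500))  # best fitness after 1,500 fused steps
```

In the GPU version, the body of that loop is a sequence of compute-shader passes executed on-device, with no host synchronization between generations.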
via WebGPU Compute Shaders
Browser LLM engines dispatch 1,024 separate GPU kernels per generation. We fuse everything into one dispatch. Single-threaded fused kernel: 6.6-13.5× speedup. Parallel kernel (64 threads + shared memory): 66-458×. Beats PyTorch MPS by 7.5-161× at every tested size up to D=256. 16,410 tok/s at D=32.
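The same principle at operator granularity, as an illustrative pure-Python sketch (the benchmarked kernel fuses far more than three ops): instead of one pass, and one intermediate buffer, per operation, a fused loop produces each output element end-to-end:

```python
def unfused(W, x, b):
    # Three separate passes, each standing in for one GPU dispatch,
    # with intermediate buffers written and re-read between them.
    t1 = [sum(w * v for w, v in zip(row, x)) for row in W]  # W @ x
    t2 = [a + c for a, c in zip(t1, b)]                     # + b
    return [max(0.0, v) for v in t2]                        # relu

def fused(W, x, b):
    # One pass: each output element is computed end-to-end, no temporaries.
    return [max(0.0, sum(w * v for w, v in zip(row, x)) + c)
            for row, c in zip(W, b)]

W = [[1.0, -2.0], [0.5, 0.5]]
x = [3.0, 1.0]
b = [0.0, -2.0]
print(fused(W, x, b))  # [1.0, 0.0], identical to unfused(W, x, b)
```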