805ms → 23ms
10,316 samples/s
PhysioNet 2012
NVIDIA + AMD
This talk presents the systems story behind real-time ICU early-warning: why a sub-50ms bedside budget is so hard to hit, and how co-designing preprocessing with model inference clears it. The obstacle isn't the neural network. ICU vital-sign streams are irregularly sampled with 30%+ missing values, and the interpolation needed to handle that collapses into sequential GPU work that consumes over 85% of total inference time. Fusing that preprocessing and the state-space model into a single GPU kernel removes the bottleneck, taking end-to-end latency from 805ms to 23ms.
The bottleneck, in one number
Profiling shows sequential preprocessing (neighbor search plus time-aware interpolation) eating over 85% of wall-clock time, leaving the GPU's compute capacity idle behind memory-bound, Python-driven work. The talk frames this as the central thesis: in clinical time-series pipelines the preprocessing, not the model architecture, is the dominant cost. Fix the preprocessing and the whole pipeline clears the real-time budget.
The fix: one fused kernel
A single Triton kernel fuses the entire time-aware interpolation and feeds the state-space model in one launch. It eliminates the redundant trips to global memory and the per-operation launch overhead that dominated latency, keeping intermediates on-chip. The result is a kernel that is bandwidth-bound and runs near the hardware ceiling, with three properties that matter for the bedside: it eliminates sequential preprocessing, maximizes GPU utilization, and drastically reduces latency.
What the results show
Meets the bedside target
End-to-end latency drops from 805ms to 23ms at batch 32, comfortably under the 50ms real-time budget for 20 Hz bedside updates.
Throughput scaling
Throughput climbs from ~124 to 10,316 samples/s, an 82.8× improvement, scaling near-linearly with batch size on a single GPU.
Better predictions
On PhysioNet 2012 the SSM beats GRU-D by +0.037 AUROC (95% CI excludes zero), and the same kernel adds +0.024 macro-AUROC on MIMIC-III 25-task phenotyping.
Portable across vendors
The identical Triton source compiles and runs on NVIDIA (CUDA) and AMD (ROCm) with under 8% latency overhead on AMD and no source-level changes.
The talk's takeaway: from 805ms to 23ms, from bottleneck to breakthrough. Co-designing preprocessing and inference into one fused GPU kernel makes real-time, deep-learning ICU monitoring practical on commodity hardware from either vendor.
This talk covers the same work as the SD4H · ICML 2026 paper. For the full methodology, ablations, statistical protocol, and tables, see the detailed paper findings →