Work done at Embedded LLM
ACM ICS 2026 Arch4Health workshop · United Kingdom · Accepted talk

From 805ms to 23ms: Accelerating State-Space Models for Real-Time ICU Monitoring

35.7× lower latency (805ms → 23ms)
82.8× higher throughput (10,316 samples/s)
+0.037 AUROC over GRU-D (PhysioNet 2012)
2 vendors, one source (NVIDIA + AMD)

This talk presents the systems story behind real-time ICU early-warning: why a sub-50ms bedside budget is so hard to hit, and how co-designing preprocessing with model inference clears it. The obstacle isn't the neural network. ICU vital-sign streams are irregularly sampled with 30%+ missing values, and the interpolation needed to handle that collapses into sequential GPU work that consumes over 85% of total inference time. Fusing that preprocessing and the state-space model into a single GPU kernel removes the bottleneck, taking end-to-end latency from 805ms to 23ms.

Overview infographic: the talk in one view. Headline gains at top (35.7× faster inference, 82.8× higher throughput, portability across NVIDIA and AMD, better predictions); the challenge of irregular, missing ICU data and the 85% preprocessing bottleneck; the fused Triton kernel pipeline that combines interpolation and SSM inference into a single launch; and results across latency, predictive performance, and throughput, plus the key design choices and cross-vendor portability.

The bottleneck, in one number

Profiling shows sequential preprocessing (neighbor search plus time-aware interpolation) eating over 85% of wall-clock time, leaving the GPU's compute capacity idle behind memory-bound, Python-driven work. The talk frames this as the central thesis: in clinical time-series pipelines the preprocessing, not the model architecture, is the dominant cost. Fix the preprocessing and the whole pipeline clears the real-time budget.
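To make the preprocessing stage concrete, here is a minimal sketch of neighbor search plus time-aware linear interpolation for a single vital-sign channel. It is an illustration under stated assumptions, not the authors' implementation: the function name `interpolate_channel`, the use of `None` for missing samples, and the hold-at-the-edges rule are all hypothetical choices. Note the shape of the work: one search and one small arithmetic step per query point, which is the sequential, memory-bound pattern the talk identifies.

```python
from bisect import bisect_left
from typing import List, Optional, Sequence


def interpolate_channel(
    times: Sequence[float],
    values: Sequence[Optional[float]],
    query_times: Sequence[float],
) -> List[float]:
    """Fill one irregularly sampled channel onto query_times.

    `times` must be sorted ascending with no repeats; values[i] is None
    where the measurement is missing. Illustrative only.
    """
    # Keep only the observed (time, value) pairs.
    obs_t = [t for t, v in zip(times, values) if v is not None]
    obs_v = [v for v in values if v is not None]
    if not obs_t:
        raise ValueError("channel has no observed values")

    out: List[float] = []
    for q in query_times:
        # Neighbor search: index of the first observation at or after q.
        j = bisect_left(obs_t, q)
        if j == 0:
            out.append(obs_v[0])       # before the first observation: hold
        elif j == len(obs_t):
            out.append(obs_v[-1])      # after the last observation: hold
        else:
            t0, t1 = obs_t[j - 1], obs_t[j]
            w = (q - t0) / (t1 - t0)   # weight by elapsed time, not by index
            out.append((1.0 - w) * obs_v[j - 1] + w * obs_v[j])
    return out
```

For example, with heart-rate readings of 80 at t=0 and 90 at t=2.5 and two missing samples in between, querying t=1 yields a value 40% of the way from 80 to 90.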

The fix: one fused kernel

A single Triton kernel performs the entire time-aware interpolation and feeds the result to the state-space model in one launch. Keeping intermediates on-chip eliminates the redundant round trips to global memory and the per-operation launch overhead that dominated latency. The resulting kernel is bandwidth-bound and runs near the hardware ceiling, with three properties that matter at the bedside: it eliminates sequential preprocessing, maximizes GPU utilization, and cuts end-to-end latency from 805ms to 23ms.
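The dataflow difference between a two-stage pipeline and a fused one can be sketched in plain Python. This is not the Triton kernel from the talk; it only shows the idea of consuming each interpolated value immediately instead of writing a full intermediate series first. The scalar recurrence `h = A*h + B*x`, `y = C*h` and the constants `A`, `B`, `C` are assumptions chosen for illustration, as are the helper names.

```python
from typing import List, Sequence, Tuple

# Assumed toy scalar state-space model: h_k = A*h_{k-1} + B*x_k, y_k = C*h_k.
A, B, C = 0.9, 0.1, 1.0

Obs = Sequence[Tuple[float, float]]  # (time, value), strictly increasing times


def lerp_at(obs: Obs, j: int, q: float) -> float:
    """Value at time q, where obs[j] is the first observation with time >= q."""
    if j == 0:
        return obs[0][1]
    if j == len(obs):
        return obs[-1][1]
    (t0, v0), (t1, v1) = obs[j - 1], obs[j]
    w = (q - t0) / (t1 - t0)
    return (1.0 - w) * v0 + w * v1


def two_stage(obs: Obs, grid: Sequence[float]) -> List[float]:
    """Unfused: materialize the interpolated series, then scan it."""
    xs, j = [], 0
    for q in grid:                      # stage 1 writes an intermediate buffer
        while j < len(obs) and obs[j][0] < q:
            j += 1
        xs.append(lerp_at(obs, j, q))
    ys, h = [], 0.0
    for x in xs:                        # stage 2 reads the buffer back
        h = A * h + B * x
        ys.append(C * h)
    return ys


def fused(obs: Obs, grid: Sequence[float]) -> List[float]:
    """Fused: interpolate and update the state in one loop, no buffer."""
    ys, h, j = [], 0.0, 0
    for q in grid:
        while j < len(obs) and obs[j][0] < q:
            j += 1
        x = lerp_at(obs, j, q)          # lives only in a local variable
        h = A * h + B * x
        ys.append(C * h)
    return ys
```

Both functions return the same outputs for a sorted `grid`; the fused version simply never stores `xs`. On a GPU the analogous saving is that the interpolated value stays in registers or shared memory rather than making a round trip through global memory between two kernel launches.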

What the results show

23ms

Meets the bedside target

End-to-end latency drops from 805ms to 23ms at batch 32, comfortably under the 50ms real-time budget for 20 Hz bedside updates.
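The budget arithmetic behind that claim is short enough to spell out. The sketch below uses only the figures quoted above (20 Hz updates, 805ms and 23ms latencies); the helper names are hypothetical.

```python
def frame_budget_ms(update_rate_hz: float) -> float:
    """Time available per update, in milliseconds."""
    return 1000.0 / update_rate_hz


def headroom_ms(latency_ms: float, update_rate_hz: float) -> float:
    """Positive when inference finishes before the next update is due."""
    return frame_budget_ms(update_rate_hz) - latency_ms


budget = frame_budget_ms(20.0)              # 50.0 ms per update at 20 Hz
fused_slack = headroom_ms(23.0, 20.0)       # 27.0 ms to spare
baseline_slack = headroom_ms(805.0, 20.0)   # -755.0 ms: deadline missed
```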

82.8×

Throughput scaling

Throughput climbs from ~124 to 10,316 samples/s, an 82.8× improvement, scaling near-linearly with batch size on a single GPU.

+0.037

Better predictions

On PhysioNet 2012 the SSM beats GRU-D by +0.037 AUROC (95% CI excludes zero), and the same kernel adds +0.024 macro-AUROC on MIMIC-III 25-task phenotyping.
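A confidence interval on an AUROC difference is commonly obtained with a paired bootstrap over patients. The talk summary does not say which procedure was used, so the following is only a sketch of one standard option, assuming binary labels and percentile intervals; the function names and defaults are hypothetical, and the paper's statistical protocol is the authoritative description.

```python
import random
from typing import List, Sequence, Tuple


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUROC as the chance a positive outranks a negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def paired_bootstrap_ci(
    labels: Sequence[int],
    scores_a: Sequence[float],
    scores_b: Sequence[float],
    n_boot: int = 1000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile CI for AUROC(a) - AUROC(b), resampling patients in pairs."""
    rng = random.Random(seed)
    n = len(labels)
    deltas: List[float] = []
    while len(deltas) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        y = [labels[i] for i in idx]
        if len(set(y)) < 2:
            continue  # resample lacked one class; draw again
        # Same resampled patients for both models keeps the comparison paired.
        a = auroc(y, [scores_a[i] for i in idx])
        b = auroc(y, [scores_b[i] for i in idx])
        deltas.append(a - b)
    deltas.sort()
    lo = deltas[int((alpha / 2) * n_boot)]
    hi = deltas[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi
```

If the returned interval lies entirely above zero, model `a` is better than model `b` at the chosen level under this procedure.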

2

Portable across vendors

The identical Triton source compiles and runs on NVIDIA (CUDA) and AMD (ROCm) with under 8% latency overhead on AMD and no source-level changes.

The talk's takeaway: from 805ms to 23ms, from bottleneck to breakthrough. Co-designing preprocessing and inference into one fused GPU kernel makes real-time, deep-learning ICU monitoring practical on commodity hardware from either vendor.


This talk covers the same work as the SD4H · ICML 2026 paper. For the full methodology, ablations, statistical protocol, and tables, see the detailed paper findings.