From 805 ms to 23 ms: ICU State-Space Models

35.7× lower end-to-end latency (805ms → 23ms)

84.6× peak kernel speedup vs. compiled PyTorch

+0.037 AUROC over GRU-D (95% CI excludes zero)

10,316s/s peak throughput (single GPU)

Real-time ICU early-warning systems need to update at the bedside fast enough to matter, with a practical budget under 50ms per inference. The obstacle turns out not to be the neural network. ICU vital-sign streams are irregularly sampled with 30%+ missing values, and the preprocessing needed to handle that (neighbor search plus interpolation) collapses into sequential GPU work that eats over 85% of total wall-clock time. Even a fast model is clinically unusable behind that bottleneck.

This work removes the bottleneck by co-designing preprocessing and inference. A single fused GPU kernel performs time-aware interpolation of the irregular series and feeds a state-space model in one launch, eliminating the intermediate memory traffic and Python-level overhead that dominated latency. The same kernel source runs on both NVIDIA and AMD hardware without changes.

**FIG 1** The end-to-end system. Irregular physiological streams (heart rate, blood pressure, SpO₂) are interpolated by a single fused kernel that coalesces memory access, runs bidirectional neighbor search, and performs time-aware interpolation in one pass. A 4-layer SSM then produces a real-time in-hospital mortality risk score. Bottom: headline systems and predictive gains.

The core idea

For each missing value, the kernel locates the nearest valid observations within a bounded window of K=10 timesteps and computes a time-weighted interpolation, so each output depends only on a local temporal window and the whole operation is fully parallel. The interpolation is fused into one Triton kernel that launches one thread block per (batch, time, feature-tile) combination, exposing parallelism across all three dimensions. Each block does a coalesced load of a 128-feature tile, a bidirectional search with early-exit flags, the time-delta weighting, and a coalesced store, with no intermediate trips back to global memory.
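A plain-Python reference can make the per-element logic concrete. This is a minimal sketch and not the Triton kernel itself: the linear time-delta weighting, the single-neighbor fallback, and the names `interpolate_series` and `grid_size` are assumptions for illustration, since the text only specifies the window K=10, the 128-feature tile, and the one-block-per-(batch, time, feature-tile) launch. Timestamps are assumed to be strictly increasing.

```python
from math import ceil, isnan

K = 10        # search window in timesteps (from the text)
TILE_F = 128  # features handled per thread block (from the text)

def interpolate_series(times, values, k=K):
    """Fill NaN entries of one feature's series from nearby valid samples.

    Assumed weighting: linear in the time deltas to the two neighbors.
    Assumed fallbacks: copy the single neighbor when only one side has a
    valid observation within k steps; leave NaN when neither side does.
    """
    n = len(values)
    out = list(values)
    for t in range(n):  # every t is independent, hence parallel on a GPU
        if not isnan(values[t]):
            continue
        left = right = None
        for d in range(1, k + 1):  # bidirectional search with early exit
            if left is None and t - d >= 0 and not isnan(values[t - d]):
                left = t - d
            if right is None and t + d < n and not isnan(values[t + d]):
                right = t + d
            if left is not None and right is not None:
                break
        if left is not None and right is not None:
            dl = times[t] - times[left]    # time since the left neighbor
            dr = times[right] - times[t]   # time until the right neighbor
            out[t] = (dr * values[left] + dl * values[right]) / (dl + dr)
        elif left is not None:
            out[t] = values[left]
        elif right is not None:
            out[t] = values[right]
    return out

def grid_size(batch, seq_len, n_features, tile_f=TILE_F):
    """Thread blocks launched: one per (batch, time, feature-tile)."""
    return batch * seq_len * ceil(n_features / tile_f)

nan = float("nan")
filled = interpolate_series([0.0, 1.0, 4.0, 5.0], [10.0, nan, nan, 20.0])
print(filled)                # [10.0, 12.0, 18.0, 20.0]
print(grid_size(2, 3, 200))  # 12
```

Because the search always reads the original `values` and never the partially filled output, the result for one timestep cannot depend on another, which is the property that lets the real kernel assign each (batch, time, feature-tile) to its own block.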

Key findings

53.6–84.6×

Kernel-level acceleration

The fused kernel runs in 0.02–0.05ms across sequence lengths, while the runtime of the optimized PyTorch baseline grows linearly with sequence length. The speedup therefore widens as sequences get longer, and the two implementations agree numerically to within 5×10⁻⁷.

23ms

Meets the bedside target

End-to-end latency drops from 805ms to 23ms at batch 32, comfortably under the 50ms real-time budget. p99 tail latency stays at 44.6ms, whereas the PyTorch baseline exceeds 50ms on 100% of calls.

+0.037

Significant AUROC gain

On PhysioNet 2012 across 5 seeds, the Triton SSM beats GRU-D by +0.037 AUROC (paired bootstrap 95% CI [0.018, 0.058], Wilcoxon p=0.008) and +0.087 AUPRC, using 2.4× fewer parameters.
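For readers unfamiliar with the statistic, a paired bootstrap interval on an AUROC difference can be sketched as follows. This is an illustrative sketch under assumptions, not the evaluation code behind the numbers above: the resampling unit (test examples here), the percentile method, and the names `auroc` and `paired_bootstrap_ci` are choices made for the example, and the inputs must contain both classes.

```python
import random

def auroc(labels, scores):
    """AUROC as P(score_pos > score_neg), counting ties as 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def paired_bootstrap_ci(labels, scores_a, scores_b,
                        n_boot=2000, alpha=0.05, seed=0):
    """Percentile CI for AUROC(a) - AUROC(b) on shared resamples."""
    rng = random.Random(seed)
    n = len(labels)
    deltas = []
    while len(deltas) < n_boot:
        # "Paired": both models are scored on the same resampled examples.
        idx = [rng.randrange(n) for _ in range(n)]
        y = [labels[i] for i in idx]
        if len(set(y)) < 2:  # AUROC is undefined without both classes
            continue
        a = auroc(y, [scores_a[i] for i in idx])
        b = auroc(y, [scores_b[i] for i in idx])
        deltas.append(a - b)
    deltas.sort()
    lo = deltas[int((alpha / 2) * n_boot)]
    hi = deltas[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

print(auroc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

An interval whose lower bound stays above zero, as in the reported [0.018, 0.058], indicates that the gain persists across resamples rather than depending on a few favorable examples.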

1.6×

BF16 is free speed

Switching FP32 → BF16 yields a further 1.6× kernel speedup with AUROC unchanged within seed variance. FP16 matches the speed but shows sporadic NaNs on extreme outliers, so BF16 is the recommended precision.
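One plausible mechanism for the FP16 failures is range overflow, although the text does not state the cause. FP16 tops out at 65,504, while BF16 keeps FP32's 8-bit exponent and trades away mantissa bits instead. The sketch below emulates both formats with the standard library; the helper names and the outlier value 1e5 are hypothetical, and BF16 is emulated by truncation where real hardware typically rounds to nearest.

```python
import struct
from math import isnan

def to_bf16(x):
    """Round-trip x through bfloat16 by keeping the top 16 bits of float32."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

def to_fp16(x):
    """Round-trip x through IEEE half precision; overflow becomes infinity."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:  # raised by struct when |x| exceeds the FP16 range
        return float("inf") if x > 0 else float("-inf")

outlier = 1.0e5                   # hypothetical extreme input value
print(to_fp16(outlier))           # inf
print(to_bf16(outlier))           # 99840.0 (coarse, but finite)
h = to_fp16(outlier)
print(isnan(h - h))               # True: inf - inf yields NaN
```

The BF16 result loses precision (99,840 instead of 100,000) but stays finite, which is consistent with the observation that accuracy is unchanged within seed variance while NaNs disappear.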

**FIG 2** Results at a glance. (1) End-to-end acceleration and training/throughput gains. (2) Kernel speedup grows with sequence length, up to 84.6× with numerical error below 5×10⁻⁷. (3) +0.037 AUROC over GRU-D on PhysioNet 2012. (4) Ablations across lookback window K, tile size F, and precision. (5) Robustness to missingness and gap size, plus cross-vendor portability on NVIDIA RTX PRO 6000 and AMD MI300X.

Head-to-head on PhysioNet 2012

Under a matched parameter budget and averaged over 5 seeds, the state-space model is both faster and more accurate than the GRU-D baseline, while using fewer parameters.

Metric	Triton SSM	GRU-D	Δ vs. GRU-D
Parameters	827,266	2,012,865	2.4× fewer
Training time	149.9s	343.3s	2.29× faster
AUROC	0.659 ± 0.011	0.622 ± 0.014	+0.037 (p=0.008)
AUPRC	0.328 ± 0.014	0.241 ± 0.018	+0.087 (p=0.012)
Latency	34.1ms	117.5ms	3.44× lower

It generalizes beyond mortality

The same fused kernel transfers to the MIMIC-III 25-task phenotyping benchmark without modification, with only the input feature dimension and SSM head re-instantiated. It delivers +0.024 macro-AUROC over GRU-D (95% CI [0.014, 0.036], p<0.01) and cuts per-stay inference latency from 142ms to 27ms (5.3× lower).

Why it's fast

A roofline analysis confirms the kernel is bandwidth-bound (arithmetic intensity 0.30 FLOP/byte) and operates near the hardware ceiling, reaching 1,344 GB/s of effective memory bandwidth (75% of peak) versus 143 GB/s (8%) for the PyTorch baseline. The remaining gap to the roofline comes from control-flow divergence in the bidirectional neighbor search and partial write masking at irregular sampling boundaries.
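The roofline figures can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers quoted above; the implied peak bandwidth and the attainable FLOP rate are derived here and are not stated in the original, so treat them as back-of-the-envelope values.

```python
ARITHMETIC_INTENSITY = 0.30   # FLOP per byte moved (from the text)
FUSED_BW_GBS = 1344.0         # effective bandwidth of the fused kernel
BASELINE_BW_GBS = 143.0       # effective bandwidth of the PyTorch baseline
FUSED_FRACTION = 0.75         # fused kernel's share of peak bandwidth

# Peak bandwidth implied by "1,344 GB/s is 75% of peak".
implied_peak_gbs = FUSED_BW_GBS / FUSED_FRACTION

# The baseline's share of that same peak.
baseline_fraction = BASELINE_BW_GBS / implied_peak_gbs

# In the bandwidth-bound regime, attainable compute = intensity * bandwidth.
fused_gflops = ARITHMETIC_INTENSITY * FUSED_BW_GBS

print(implied_peak_gbs)             # 1792.0
print(round(baseline_fraction, 2))  # 0.08
print(round(fused_gflops, 1))       # 403.2
```

At roughly 0.4 TFLOP/s the kernel is nowhere near the compute limit of a modern GPU, which is why the optimization effort targets memory traffic rather than arithmetic.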

Runs on both vendors

The identical Triton source compiles and runs on AMD MI300X with under 8% latency overhead versus NVIDIA, with no source-level changes. On MI300X the much larger HBM3 capacity (192 GB) permits batch sizes up to 4,096 without spilling, and the higher peak bandwidth partially offsets the wavefront divergence in neighbor search.

The thesis in one line: in clinical time-series pipelines the preprocessing, not the model architecture, is the dominant cost, and fusing it into a single GPU kernel turns an order-of-magnitude latency problem into a solved one, on commodity hardware from either vendor.

Honest limitations

The 0.66 AUROC on PhysioNet 2012 trails state-of-the-art ensembles that use richer engineered features. This work optimizes the systems substrate rather than the predictive model itself. Cross-vendor evaluation is also limited to a single AMD SKU, and the energy figures are steady-state averages from NVML/ROCm-SMI sampling.

Part of an ongoing line of work on portable GPU kernel acceleration for biological and medical AI, benchmarked across NVIDIA H100 and AMD MI300X. See more on the research page.

From 805 ms to 23 ms: Accelerating State-Space Models for Real-Time ICU Monitoring with Fused Triton Kernels
