Research — Manpreet Singh

COLM 2026 DAIH workshop · On-Prem Healthcare AI · Energy-Aware

Deploying Clinical Language and Vision-Language Models Where the Data Lives: An Energy-Aware, Cross-Vendor Protocol for On-Premises Healthcare AI

A vendor-portable fused-operator protocol runs clinical LMs and VLMs on-premises so PHI never leaves the hospital: 1.64× faster encoding at half the energy per slide and 2.03× as many slides cleared per shift, unchanged across NVIDIA H100 and AMD MI300X with ΔAUC ≤ 0.001.

COLM 2026 · San Francisco, USA · paper · poster

ICML 2026 SD4H workshop · State-Space Models · Clinical AI

From 805ms to 23ms: Accelerating State-Space Models for Real-Time ICU Monitoring

A fused GPU kernel folds irregular-sampling interpolation and SSM inference into a single launch, cutting end-to-end latency 35.7× and clearing the sub-50ms bedside target while improving AUROC over GRU-D.

ICML 2026 · South Korea · paper

EurIPS 2025 SimBioChem workshop · Molecular Dynamics · ML Force Fields

Accelerating Molecular Simulations with Triton: Fused GPU Kernels for TensorNet Neural Potentials

Profiling-driven kernel fusion folds 3–8 TensorNet operations into single GPU launches, cutting kernel launches by 67–88% for a 2.82× end-to-end speedup that turns a 13-hour MD run into 4.6 hours, with physical accuracy preserved exactly.

EurIPS 2025 · Denmark · paper · reviewer

ISCA 2026 HotInfra workshop · Infrastructure

When the LLM-Tuned Stack Misses: An Infrastructure View of Biological Foundation Model Inference Across NVIDIA and AMD

A measurement study across six bio models and three GPUs: 60–80% of runtime sits outside dense compute, vendor-stack asymmetry turns a trivial bug into a 38.8× swing, and one Triton source closes the gap on both NVIDIA and AMD.

ISCA 2026 · USA · paper · talk

ISC High Performance 2026 · Cross-Vendor · Bio Foundation Models

Portable GPU Kernel Acceleration for Biological Foundation Models & Algorithms using OpenAI Triton

One portable Triton fusion framework accelerates six biological models and algorithms (DualBind, AlphaGenome, Enformer, ESM-2, ProtBERT, Needleman-Wunsch) up to 720× across NVIDIA and AMD GPUs, with zero accuracy loss.

ISC 2026 · Germany · poster

RECOMB 2026 ARCH · Hardware-Algorithm Co-design · Bioinformatics

Hardware-Portable Fused GPU Kernels for High-Throughput Biological Foundation Models

Per-model drop-in Triton kernels accelerate protein and genomic foundation models plus classical algorithms (Smith-Waterman, Needleman-Wunsch, k-mer indexing, BWT) up to 2,986× across NVIDIA and AMD, with correctness preserved to machine precision.

RECOMB-ARCH 2026 · Greece · poster

MLSys 2026 YPS · Write-Once Kernels · Bioinformatics

BioTriton: Portable Cross-Vendor GPU Kernels for High-Throughput Bioinformatics via OpenAI Triton

A library of 20+ write-once Triton kernels delivers 10× to 19,000× speedups for sequence alignment, k-mer indexing, and quality control, compiling through MLIR to run identically on NVIDIA and AMD with no source changes.

MLSys YPS 2026 · USA · poster

ACM ICS 2026 Arch4Health · talk · Clinical AI

From 805ms to 23ms: Accelerating State-Space Models for Real-Time ICU Monitoring

An accepted talk on the fused-kernel ICU work, presenting the systems story behind sub-50ms bedside inference: the 85% preprocessing bottleneck, the single-launch fix, and the latency, throughput, and accuracy gains across NVIDIA and AMD.

ACM ICS 2026 · United Kingdom · talk