Manpreet Singh — GPU / ML Systems Research

News

most recent first

Jul 2026
ICPP 2026 · DC4AI · Singapore · “Error-Bounded Fused Attention Compression for Long-Context Genomic Foundation Models Across Heterogeneous GPUs” accepted to the International Workshop on Data Compression for AI and Big Data Applications. paper · poster
Jul 2026
COLM 2026 · DAIH · San Francisco, USA · “Deploying Clinical Language and Vision-Language Models Where the Data Lives: An Energy-Aware, Cross-Vendor Protocol for On-Premises Healthcare AI” accepted for a poster at the Workshop on Deployable AI in Healthcare. paper · poster
Jul 2026
ECCV 2026 · MedFM-Bench · Invited by the program committee to serve as a reviewer for the Workshop on Medical Foundation Models and Benchmarks. reviewer
Jul 2026
MICCAI 2026 · Efficient Medical AI · Invited to serve as a reviewer for the 2nd Workshop on Efficient Medical AI. reviewer
Jun 2026
ISCA 2026 · HotInfra · USA · “When the LLM-Tuned Stack Misses: An Infrastructure View of Biological Foundation Model Inference Across NVIDIA and AMD” accepted as a short talk at the 3rd Workshop on Hot Topics in System Infrastructure. paper · talk
May 2026
ICML 2026 · SD4H · South Korea · “From 805ms to 23ms: Accelerating State-Space Models for Real-Time ICU Monitoring with Fused Triton Kernels” accepted as a research paper at the Workshop on Structured Data for Health. paper · poster
May 2026
MLSys 2026 · YPS · USA · “BioTriton: Portable Cross-Vendor GPU Kernels for High-Throughput Bioinformatics via OpenAI Triton” accepted for a poster at the Young Professionals Symposium. poster
May 2026
ACM ICS 2026 · Arch4Health · UK · “From 805ms to 23ms: Accelerating State-Space Models for Real-Time ICU Monitoring” accepted as a short talk at the 3rd Workshop on Architecture for Health. talk
Apr 2026
RECOMB 2026 · Arch · Greece · “Hardware-Portable Fused GPU Kernels for High-Throughput Biological Foundation Models” accepted for a poster presentation. poster
Mar 2026
ISC High Performance 2026 · Germany · “Portable GPU Kernel Acceleration for Biological Foundation Models & Algorithms using OpenAI Triton” accepted for a research poster. poster
Oct 2025
EurIPS 2025 · SimBioChem · Denmark · “Accelerating Molecular Simulations with OpenAI Triton: Fused GPU Kernels for TensorNet Neural Potentials” accepted for presentation in Copenhagen. paper · poster
Oct 2025
EurIPS 2025 · SimBioChem · Denmark · Invited by the program committee to serve as a reviewer for the EurIPS Workshop on Machine Learning for Simulations in Biology & Chemistry. reviewer
Oct 2025
Embedded LLM · Singapore · Joined as an ML Systems Research Intern.
Jun 2025
CloudCosmos · Joined as a Software Engineering Intern.

Experience

Embedded LLM · Oct 2025 – Present · Singapore · Remote

ML Systems Research Intern · current

Led solo research producing 5 papers, 3 posters, and 2 accepted talks across ICML 2026, ISCA 2026, COLM 2026, ACM ICS 2026, ICPP 2026, EurIPS 2025, MLSys YPS, RECOMB-ARCH, and ISC High Performance 2026.
Building BioTriton, an open-source Triton acceleration library for biology and chemistry workflows across heterogeneous GPU backends.
Accelerated ProteinMPNN inference with custom Triton kernels: 1.5× average (1.61× peak), 100% accuracy preserved.
Compressed ESM-2 embeddings via TurboQuant: 4× smaller (439MB → 110MB), +46% Recall@10 over FAISS PQ, 10M proteins in 11GB RAM.
Upstream contributions to LinkedIn Liger Kernel, AMD ROCm, and PrunaAI.

CloudCosmos · Jul – Sep 2025 · North Carolina, USA

Software Engineering Intern

Accelerated a financial reconciliation pipeline 33.0s → 3.6s (9.1×) through compute and memory optimization.
Built text segmentation and information extraction for Sanskrit literature and architectural design documents.

Stealth Startup · Mar – Apr 2025 · India · Freelance

Freelance Machine Learning Engineer

Fine-tuned domain LLMs on federal immigration documents to auto-generate appeal drafts with precedent-based citations.
Improved an IELTS predictor's R² from 0.86 to 0.97 and shipped a 90%-accuracy visa approval prediction system.

Selected Projects

all numbers measured

DeepMind Genomic Model Acceleration

TRITON · BIOINFORMATICS · ATTENTION

5.05×

Custom Triton kernels for AlphaGenome (5.05× faster, ~0.99999997 cosine similarity) and Enformer across 196K-base DNA: 1.43× on MI300X, 1.82× on H100.

↗ GitHub · cross-vendor

38× Faster DualBind on AMD

TRITON · PYTORCH · ROCm

38×

Accelerated NVIDIA Bio Group's protein-ligand binding model on ROCm: 41.3s → 1.06s on MI300X, 4.2× on H100, 100% fidelity. Throughput 19.4 → 752.9 samples/s, 97% cost cut.

↗ Benchmarks · vendor asymmetry

BioCUDA-Triton · ESM-2

CUDA · TRITON · PROTEIN LM

43.1×

CUDA+Triton inference engine for Facebook's esm2_t6_8M (1.21M downloads/mo): 43.1× speedup, 97.7% latency reduction, 68.9% memory savings, $1,100+/mo/GPU saved.

↗ GitHub · production

T1Converter · MRI Synthesis

PYTORCH · MEDICAL IMAGING · GAN

0.95 SSIM

GANs that synthesize T2, T1CE, and FLAIR MRI from a single T1 scan, replacing four scans with one. Cuts scan time by 30–44 min and cost by ~70% per patient, with SSIM > 0.95.

↗ Paper under review · clinical

Liger Kernel LinkedIn · 6k★ PR #887

AMD ROCm Iris · ROCm 7.0

PrunaAI PR #348 · #347
