GPU Kernels · ML Systems · Biological & Medical AI

Making biological foundation models run orders of magnitude faster.

I'm a GPU/ML systems researcher writing portable GPU kernels that fuse and accelerate biological and medical foundation models across NVIDIA and AMD hardware, turning order-of-magnitude speedups into clinically and scientifically usable systems.

Embedded LLM · ML Systems Research Intern
Thapar Institute · B.E. CS, CGPA 9.22
MI300X · H100
38× · DualBind on MI300X · 41.3s → 1.06s
43.1× · ESM-2 vs HF baseline · 97.7% latency cut
805 → 23 ms · ICU state-space model · real-time monitoring
8 accepted works · 2025–26 · solo-authored
01 · Research & Publications

02 · Experience

Embedded LLM · Oct 2025 – Present · Singapore · Remote
ML Systems Research Intern (current)
  • Led solo research producing 3 papers, 3 posters, and an accepted talk across ICML 2026, ACM ICS 2026, EurIPS 2025, MLSys YPS, RECOMB-ARCH, and ISC High Performance 2026.
  • Building BioTriton, an open-source Triton acceleration library for biology and chemistry workflows across heterogeneous GPU backends.
  • Accelerated ProteinMPNN inference with custom Triton kernels: 1.5× average speedup (1.61× peak), 100% accuracy preserved.
  • Compressed ESM-2 embeddings with TurboQuant: 4× smaller (439 MB → 110 MB), +46% Recall@10 over FAISS PQ, with 10M proteins fitting in 11 GB of RAM.
  • Upstream contributions to LinkedIn Liger Kernel, AMD ROCm, and PrunaAI.
CloudCosmos · Jul – Sep 2025 · North Carolina, USA
Software Engineering Intern
  • Accelerated a financial reconciliation pipeline from 33.0s to 3.6s (9.1×) through compute and memory optimization.
  • Built text segmentation and information extraction for Sanskrit literature and architectural design documents.
Visa Guru Immigration · Mar – Apr 2025 · India · Freelance
Freelance Machine Learning Engineer
  • Fine-tuned domain LLMs on federal immigration documents to auto-generate appeal drafts with precedent-based citations.
  • Improved an IELTS predictor's R² from 0.86 to 0.97 and shipped a visa approval prediction system with 90% accuracy.
03 · Selected Projects

all numbers measured
DeepMind Genomic Model Acceleration
TRITON · BIOINFORMATICS · ATTENTION
5.05×

Custom Triton kernels for AlphaGenome (5.05× faster, ~0.99999997 cosine similarity) and for Enformer on 196K-base DNA sequences (1.43× on MI300X, 1.82× on H100).

↗ GitHub · cross-vendor
38× Faster DualBind on AMD
TRITON · PYTORCH · ROCm
38×

Accelerated NVIDIA Bio Group's protein-ligand binding model on ROCm: 41.3s → 1.06s on MI300X, 4.2× on H100, 100% fidelity. Throughput 19.4 → 752.9 samples/s, 97% cost cut.

↗ Benchmarks · vendor asymmetry
BioCUDA-Triton · ESM-2
CUDA · TRITON · PROTEIN LM
43.1×

CUDA+Triton inference engine for Facebook's esm2_t6_8M (1.21M downloads/mo): 43.1× speedup, 97.7% latency reduction, 68.9% memory savings, $1,100+/mo/GPU saved.

↗ GitHub · production
T1Converter · MRI Synthesis
PYTORCH · MEDICAL IMAGING · GAN
0.95 SSIM

GANs that synthesize T2, T1CE, and FLAIR MRI contrasts from a single T1 scan, replacing four scans with one. Cuts scan time by 30–44 min and cost by ~70% per patient, with SSIM > 0.95.
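
The speedup and latency-cut figures quoted in the cards above are two views of the same measurement. A minimal sketch of the conversion, assuming both numbers come from the same baseline and optimized latencies; the function names are illustrative and not taken from any repository or library mentioned on this page:

```python
def speedup(baseline: float, optimized: float) -> float:
    """Speedup factor: how many times faster the optimized run is.

    Both arguments must use the same unit (seconds, milliseconds, ...).
    """
    if baseline <= 0 or optimized <= 0:
        raise ValueError("latencies must be positive")
    return baseline / optimized


def latency_reduction_pct(speedup_factor: float) -> float:
    """Percent of the baseline latency removed, given a speedup factor."""
    if speedup_factor <= 0:
        raise ValueError("speedup factor must be positive")
    return (1.0 - 1.0 / speedup_factor) * 100.0


if __name__ == "__main__":
    # Illustrative inputs copied from the figures quoted on this page.
    esm2_cut = latency_reduction_pct(43.1)   # ESM-2 speedup factor
    icu_factor = speedup(805.0, 23.0)        # ICU model, milliseconds
    print(f"ESM-2 latency reduction: {esm2_cut:.1f}%")
    print(f"ICU model speedup: {icu_factor:.1f}x")
```

The conversion is one-directional in practice: a speedup factor fixes the latency reduction, but large factors compress into a narrow percentage band, which is why both numbers are usually reported together.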

Liger Kernel · LinkedIn · 6k★ · PR #887
AMD ROCm · Iris · ROCm 7.0
PrunaAI · PR #348 · #347

Let's make something fast.

Open to GPU systems research, kernel optimization, and collaborations at the intersection of high-performance computing and biology.