Manpreet Singh

ML Systems Research Intern @ Embedded LLM — Singapore

singhman2005123@gmail.com

01

Bio

I'm Manpreet Singh, a B.E. Computer Engineering student at Thapar Institute (2023–2027, CGPA 9.22) and an ML Systems Research Intern at Embedded LLM, Singapore. I write portable GPU kernels that fuse and accelerate biological and medical foundation models across NVIDIA and AMD hardware, turning order-of-magnitude speedups into clinically and scientifically usable systems.

My goal is simple: make biological foundation models run orders of magnitude faster. So far this work has produced 3 papers, 3 posters, and an accepted talk across ICML, ACM ICS, EurIPS, MLSys, RECOMB, and ISC High Performance, alongside upstream contributions to LinkedIn's Liger Kernel, AMD ROCm, and PrunaAI.

02

News

most recent first
03

Research & Publications

04

Experience

Embedded LLM Oct 2025 — Present Singapore · Remote
ML Systems Research Intern current
  • Conducted solo research producing 3 papers, 3 posters, and an accepted talk across ICML 2026, ACM ICS 2026, EurIPS 2025, MLSys YPS, RECOMB-ARCH, and ISC High Performance 2026.
  • Building BioTriton, an open-source Triton acceleration library for biology and chemistry workflows across heterogeneous GPU backends.
  • Accelerated ProteinMPNN inference with custom Triton kernels: 1.5× average (1.61× peak), 100% accuracy preserved.
  • Compressed ESM-2 embeddings via TurboQuant: 4× smaller (439MB → 110MB), +46% Recall@10 over FAISS PQ, 10M proteins in 11GB RAM.
  • Upstream contributions to LinkedIn Liger Kernel, AMD ROCm, and PrunaAI.
CloudCosmos Jul — Sep 2025 North Carolina, USA
Software Engineering Intern
  • Accelerated a financial reconciliation pipeline 33.0s → 3.6s (9.1×) through compute and memory optimization.
  • Built text segmentation and information extraction for Sanskrit literature and architectural design documents.
Stealth Startup Mar — Apr 2025 India · Freelance
Freelance Machine Learning Engineer
  • Fine-tuned domain LLMs on federal immigration documents to auto-generate appeal drafts with precedent-based citations.
  • Improved an IELTS predictor's R² from 0.86 to 0.97 and shipped a visa approval prediction system with 90% accuracy.
05

Selected Projects

all numbers measured
DeepMind Genomic Model Acceleration
TRITON · BIOINFORMATICS · ATTENTION
5.05×

Custom Triton kernels for AlphaGenome (5.05× faster, ~0.99999997 cosine similarity) and Enformer across 196K-base DNA: 1.43× on MI300X, 1.82× on H100.

↗ GitHub · cross-vendor
38× Faster DualBind on AMD
TRITON · PYTORCH · ROCm
38×

Accelerated NVIDIA Bio Group's protein-ligand binding model on ROCm: 41.3s → 1.06s on MI300X, 4.2× on H100, 100% fidelity. Throughput 19.4 → 752.9 samples/s, 97% cost cut.

↗ Benchmarks · vendor asymmetry
BioCUDA-Triton · ESM-2
CUDA · TRITON · PROTEIN LM
43.1×

CUDA+Triton inference engine for Facebook's esm2_t6_8M (1.21M downloads/mo): 43.1× speedup, 97.7% latency reduction, 68.9% memory savings, $1,100+/mo/GPU saved.

↗ GitHub · production
T1Converter · MRI Synthesis
PYTORCH · MEDICAL IMAGING · GAN
0.95 SSIM

GANs that synthesize T2, T1CE, and FLAIR MRI from a single T1 scan, replacing four scans with one. Cuts scan time by 30–44 min and cost by ~70% per patient, with SSIM > 0.95.

Liger Kernel LinkedIn · 6k★ PR #887
AMD ROCm Iris · ROCm 7.0
PrunaAI PR #348 · #347