ML Systems Research Intern @ Embedded LLM — Singapore
singhman2005123@gmail.com
I'm Manpreet Singh, a B.E. Computer Engineering student at Thapar Institute (2023–2027, CGPA 9.22) and an ML Systems Research Intern at Embedded LLM, Singapore. I write portable GPU kernels that fuse and accelerate biological and medical foundation models across NVIDIA and AMD hardware, turning order-of-magnitude speedups into clinically and scientifically usable systems.
My goal is simple: make biological foundation models run orders of magnitude faster. That work has produced 3 papers, 3 posters, and an accepted talk across ICML, ACM ICS, EurIPS, MLSys, RECOMB, and ISC High Performance, alongside upstream contributions to LinkedIn's Liger Kernel, AMD ROCm, and PrunaAI.
Custom Triton kernels for AlphaGenome (5.05× faster, ~0.99999997 cosine similarity) and for Enformer on 196K-base DNA sequences (1.43× on MI300X, 1.82× on H100).
Accelerated NVIDIA Bio Group's protein-ligand binding model on ROCm: 41.3 s → 1.06 s on MI300X, 4.2× on H100, 100% fidelity. Throughput rose from 19.4 to 752.9 samples/s, with a 97% cost cut.
CUDA+Triton inference engine for Facebook's esm2_t6_8M (1.21M downloads/mo): 43.1× speedup, 97.7% latency reduction, 68.9% memory savings, $1,100+/mo/GPU saved.
GANs that synthesize T2, T1CE, and FLAIR MRI from a single T1 scan, replacing four scans with one. Cuts scan time by 30–44 min and cost by ~70% per patient, with SSIM > 0.95.