RECOMB-ARCH 2026 Greece · Solo-authored poster / abstract

Hardware-Portable Fused GPU Kernels for High-Throughput Biological Foundation Models

Up to 2,986× faster k-mer indexing vs. Jellyfish
10+ models & algorithms co-designed
60–80% of runtime in memory-bound operations, targeted
2 vendors, unmodified kernels (NVIDIA + AMD)

Hardware-algorithm co-design is becoming critical for computational molecular biology because modern workloads increasingly stress memory bandwidth, parallel execution, and heterogeneous hardware. Despite rapid progress in protein and genomic foundation models, practical performance is held back by inefficient GPU execution. Profiling shows that 60–80% of runtime goes to memory-bound operations, compounded by fragmented kernel launches and repeated device memory transfers, and hardware utilization degrades further as sequence lengths and batch sizes grow.
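A roofline-style check is one way to make "memory-bound" concrete: an operation is memory-bound when its arithmetic intensity (FLOPs per byte moved) falls below the device's ratio of peak compute to peak bandwidth. The sketch below is a back-of-envelope illustration, not the profiling method used in this work. The device figures (`PEAK_FLOPS`, `PEAK_BANDWIDTH`) are round hypothetical values, not measurements of any GPU named here, and the function names are illustrative.

```python
def arithmetic_intensity(flops: float, bytes_moved: float) -> float:
    """FLOPs performed per byte read from or written to device memory."""
    if bytes_moved <= 0:
        raise ValueError("bytes_moved must be positive")
    return flops / bytes_moved


def is_memory_bound(intensity: float, peak_flops: float, peak_bandwidth: float) -> bool:
    """An op is memory-bound when its intensity is below the machine balance."""
    machine_balance = peak_flops / peak_bandwidth  # FLOPs per byte
    return intensity < machine_balance


# Hypothetical device: 20 TFLOP/s compute, 2 TB/s memory bandwidth (assumed values).
PEAK_FLOPS = 20e12
PEAK_BANDWIDTH = 2e12

# Elementwise fp32 add over n values: 1 FLOP per element,
# two 4-byte reads and one 4-byte write per element.
n = 1_000_000
elementwise = arithmetic_intensity(flops=n, bytes_moved=12 * n)

# Square fp32 matmul (m x m), assuming each matrix crosses memory once:
# 2*m^3 FLOPs, three m*m matrices of 4-byte values.
m = 4096
matmul = arithmetic_intensity(flops=2 * m**3, bytes_moved=3 * m * m * 4)

print(is_memory_bound(elementwise, PEAK_FLOPS, PEAK_BANDWIDTH))  # True
print(is_memory_bound(matmul, PEAK_FLOPS, PEAK_BANDWIDTH))       # False
```

Under these assumed numbers, elementwise work sits far below the machine balance of 10 FLOPs per byte, which is why layer norms, activations, and similar operations tend to be limited by bandwidth rather than compute.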

This work implements hardware-aware GPU optimizations in OpenAI Triton: each model is optimized independently with custom fused kernels tailored to its architecture. Multiple PyTorch operations are fused into single kernels to improve memory reuse and cut launch overhead, and the same kernels run unmodified on both NVIDIA and AMD GPUs while preserving numerical correctness within machine precision. The kernels are designed as per-model drop-in replacements, not as a single monolithic framework.
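To see why fusion helps, consider a simple traffic model for a chain of elementwise operations. The sketch below is an idealized estimate, not the Triton kernels from this work. It assumes fp32 values, unary operations, and that an unfused pipeline writes every intermediate to global memory, while a fused kernel keeps intermediates in registers. The `Traffic` class and both function names are hypothetical.

```python
from dataclasses import dataclass

BYTES_PER_ELEMENT = 4  # fp32 (assumption)


@dataclass
class Traffic:
    bytes_moved: int  # global-memory bytes read plus written
    launches: int     # number of kernel launches


def unfused_traffic(num_elements: int, num_ops: int) -> Traffic:
    """Each op is its own kernel: read the input once, write the output once."""
    if num_ops < 1:
        raise ValueError("num_ops must be at least 1")
    per_op = 2 * num_elements * BYTES_PER_ELEMENT
    return Traffic(bytes_moved=per_op * num_ops, launches=num_ops)


def fused_traffic(num_elements: int, num_ops: int) -> Traffic:
    """One kernel for the whole chain: a single read and a single write."""
    if num_ops < 1:
        raise ValueError("num_ops must be at least 1")
    return Traffic(bytes_moved=2 * num_elements * BYTES_PER_ELEMENT, launches=1)


n = 1 << 20  # about one million elements
before = unfused_traffic(n, num_ops=3)
after = fused_traffic(n, num_ops=3)

print(before.bytes_moved / after.bytes_moved)  # 3.0
print(before.launches, after.launches)         # 3 1
```

In this model, memory traffic and launch count both shrink in proportion to the number of fused operations. Real kernels deviate from this (extra operands, reductions, cache effects), so the ratio is a rough guide and not a prediction of measured speedup.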

Pipeline: biological foundation models (ESM-2, ProtBERT, AlphaGenome, Enformer, ProteinMPNN, DualBind) and classical bioinformatics algorithms (Smith-Waterman, Needleman-Wunsch, k-mer indexing, Burrows-Wheeler Transform) pass through profiling, hardware-aware Triton kernel design, and hardware-portable execution on NVIDIA and AMD GPUs, with per-model performance gains shown at the bottom.
FIG 1 The co-design pipeline. Foundation models and classical algorithms are profiled to expose memory-bound bottlenecks, optimized through hardware-aware Triton kernel design (operator fusion, memory reuse and tiling, reduced global memory access, launch-overhead elimination, architecture-specific tuning), and executed with the same unmodified kernels across NVIDIA and AMD GPUs. Bottom: per-model and per-algorithm performance gains.

Foundation model results

Protein and genomic foundation models from Meta, Google DeepMind, RostLab, Baker Lab, and NVIDIA are the key beneficiaries, each accelerated with no loss in fidelity.

43.1×

Meta ESM-2

43.1× throughput improvement, 97.7% latency reduction, and up to 68.9% memory savings over Hugging Face baselines.

5.05×

DeepMind genomic models

AlphaGenome accelerated 5.05×; the Enformer PyTorch port up to 1.82× on 196K-base inputs, with cosine similarity preserved above 0.99999.

3.6×

RostLab ProtBERT

3.6× average speedup with identical predictions and a 72.2% reduction in GPU-hours.

38.8×

NVIDIA DualBind & Baker Lab ProteinMPNN

38.8× end-to-end runtime reduction for DualBind on AMD MI300X, plus a 1.51× speedup for ProteinMPNN, both with no loss in fidelity.
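The percentage reductions quoted alongside the speedups above are consistent with the relation reduction = 1 - 1/speedup. That this is how the figures were derived is an assumption. The short check below applies it to the ESM-2 and ProtBERT numbers, and the helper name is illustrative.

```python
def reduction_from_speedup(speedup: float) -> float:
    """Fraction of time (or GPU-hours) saved for a given speedup factor."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return 1.0 - 1.0 / speedup


# ESM-2: 43.1x throughput improvement, reported 97.7% latency reduction.
print(round(100 * reduction_from_speedup(43.1), 1))  # 97.7

# ProtBERT: 3.6x average speedup, reported 72.2% reduction in GPU-hours.
print(round(100 * reduction_from_speedup(3.6), 1))   # 72.2
```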

Classical bioinformatics results

Beyond foundation models, the same per-kernel approach accelerates classical algorithms against their established baselines, demonstrating that the method generalizes well past neural networks.

Algorithm                    Speedup          Baseline
k-mer indexing               up to 2,986×     vs. Jellyfish
fastp (preprocessing)        94× average      vs. fastp
Smith-Waterman               up to 49×        vs. Parasail (SIMD)
Needleman-Wunsch             up to 49×        vs. Parasail (SIMD)
Burrows-Wheeler Transform    up to 49×        vs. SeqAn

Portable by construction

Every kernel compiles and runs unmodified across NVIDIA and AMD GPUs (A100, H100, RTX on one side; MI250, MI300X on the other). Because the gains come from eliminating memory round-trips rather than vendor-specific intrinsics, performance portability holds without source-level changes, and numerical correctness is preserved within machine precision on both backends.
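A fidelity check of this kind can be sketched as a comparison between a reference output and the output of an optimized kernel. The example below is a minimal pure-Python illustration, not the validation harness used in this work. The 0.99999 cosine threshold mirrors the figure reported for Enformer, while the absolute tolerance `atol` and all function names are assumptions.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def max_abs_error(a: Sequence[float], b: Sequence[float]) -> float:
    """Largest elementwise absolute difference."""
    return max(abs(x - y) for x, y in zip(a, b))


def outputs_match(reference: Sequence[float], candidate: Sequence[float],
                  min_cosine: float = 0.99999, atol: float = 1e-5) -> bool:
    """True when the candidate agrees with the reference on both criteria."""
    if len(reference) != len(candidate):
        raise ValueError("outputs must have the same length")
    return (cosine_similarity(reference, candidate) > min_cosine
            and max_abs_error(reference, candidate) <= atol)


reference = [0.1, -0.5, 2.0, 3.25]
close = [x + 1e-7 for x in reference]   # tiny rounding-level drift
flipped = [0.1, 0.5, 2.0, 3.25]         # one sign error

print(outputs_match(reference, close))    # True
print(outputs_match(reference, flipped))  # False
```

Running the same comparison against outputs from each backend would give a vendor-independent acceptance criterion. The right tolerance depends on dtype and model depth, so the default here is a placeholder.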

The thesis in one line: domain-specific GPU kernel design, applied per model and per algorithm, significantly accelerates both modern foundation models and classical bioinformatics workloads on heterogeneous hardware, up to 2,986× and portably across NVIDIA and AMD.


Part of an ongoing line of work on portable GPU kernel acceleration for biological and medical AI, benchmarked across NVIDIA H100 and AMD MI300X.