Needleman-Wunsch, H100
one framework
traffic via fusion
across workloads
Modern biological foundation models span DNA sequence alignment, protein language models, genomics, molecular vision, and scientific machine learning. In practice their performance is limited not by model design but by inefficient GPU execution: fragmented kernel launches, repeated memory transfers, and memory-bound operations dominate runtime, and the problem only grows with long sequences and large batches. The result is that end-to-end biological pipelines run far below what modern GPUs can deliver.
This work presents a single, portable Triton acceleration framework spanning all of those domains at once. Profiling shows 60–80% of runtime is spent in memory-bound and index-heavy operations, which it addresses by fusing 3–8 operations into single Triton kernels. That cuts memory traffic by 67–88%, lowers launch overhead, and improves GPU utilization, producing large and consistent speedups while preserving 100% accuracy (or near-identical numerical outputs) across every workload. The same kernel source runs on both NVIDIA and AMD without modification.
Results across six workloads
Rather than optimizing a single model, the framework demonstrates that one portable approach transfers across very different computational patterns, from dynamic-programming sequence alignment to attention-based genomic models.
NVIDIA DualBind
Protein-ligand binding model (NVIDIA Bio Group, ICML 2025 Workshop). 38× on AMD MI300X and 4.2× on NVIDIA H100, with 100% identical outputs to the original PyTorch model.
DeepMind genomic models
AlphaGenome runs 5.05× faster than the original. Enformer, the 196K-base DNA gene-expression predictor, gains 1.82× on H100 and 1.43× on MI300X.
Meta ESM-2 protein LM
43.1× faster inference and 97.7% latency reduction on H100, saving over $1,100 per GPU each month in AWS L4 rental costs.
RostLab ProtBERT
BERT-based protein language model: up to 4.6× (3.6× average) faster inference, cutting GPU hours by 72.2% with the Triton-optimized model.
Needleman-Wunsch
Dynamic-programming global sequence alignment: up to 720× over PyTorch on H100 and 490× on MI300X, reaching 1.7 GOPS peak throughput.
Micro-benchmark peak
Average speedup achieved on micro-benchmarks through systematic kernel fusion, with peak fused-kernel gains reaching 82.7× on the underlying operations.
One framework, two vendors
The unifying method is portable kernel fusion: the same Triton source compiles natively on NVIDIA (CUDA) and AMD (ROCm) backends with no source-level changes. Because the speedups come from eliminating memory round-trips rather than vendor-specific tricks, they hold across both ecosystems and across workloads as different as binding-affinity prediction, genomic attention, and dynamic-programming alignment.
The poster
Presented at ISC High Performance 2026, Hamburg. The full poster collects the per-model results, the unified framework diagram, and the original-vs-optimized comparisons in one view.
The thesis in one line: biological foundation models are bottlenecked by memory traffic, not model design, and one portable Triton fusion framework accelerates DNA, protein, genomic, and molecular workloads up to 720× across NVIDIA and AMD with zero accuracy loss.
Part of an ongoing line of work on portable GPU kernel acceleration for biological and medical AI, benchmarked across NVIDIA H100 and AMD MI300X. See more on the research page.