Portable GPU Kernel Acceleration for Biological Foundation Models

720×

peak speedup
Needleman-Wunsch, H100

models & algorithms
one framework

67–88%

less memory
traffic via fusion

accuracy loss
across workloads

Modern biological foundation models span DNA sequence alignment, protein language models, genomics, molecular vision, and scientific machine learning. In practice their performance is limited not by model design but by inefficient GPU execution: fragmented kernel launches, repeated memory transfers, and memory-bound operations dominate runtime, and the problem only grows with long sequences and large batches. The result is that end-to-end biological pipelines run far below what modern GPUs can deliver.

This work presents a single, portable Triton acceleration framework spanning all of those domains at once. Profiling shows 60–80% of runtime is spent in memory-bound and index-heavy operations, which it addresses by fusing 3–8 operations into single Triton kernels. That cuts memory traffic by 67–88%, lowers launch overhead, and improves GPU utilization, producing large and consistent speedups while preserving 100% accuracy (or near-identical numerical outputs) across every workload. The same kernel source runs on both NVIDIA and AMD without modification.

Results across six workloads

Rather than optimizing a single model, the framework demonstrates that one portable approach transfers across very different computational patterns, from dynamic-programming sequence alignment to attention-based genomic models.

38× / 4.2×

NVIDIA DualBind

Protein-ligand binding model (NVIDIA Bio Group, ICML 2025 Workshop). 38× on AMD MI300X and 4.2× on NVIDIA H100, with 100% identical outputs to the original PyTorch model.

5.05×

DeepMind genomic models

AlphaGenome runs 5.05× faster than the original. Enformer, the 196K-base DNA gene-expression predictor, gains 1.82× on H100 and 1.43× on MI300X.

43.1×

Meta ESM-2 protein LM

43.1× faster inference and 97.7% latency reduction on H100, saving over $1,100 per GPU each month in AWS L4 rental costs.

4.6×

RostLab ProtBERT

BERT-based protein language model: up to 4.6× (3.6× average) faster inference, cutting GPU hours by 72.2% with the Triton-optimized model.

720× / 490×

Needleman-Wunsch

Dynamic-programming global sequence alignment: up to 720× over PyTorch on H100 and 490× on MI300X, reaching 1.7 GOPS peak throughput.

82.7×

Micro-benchmark peak

Average speedup achieved on micro-benchmarks through systematic kernel fusion, with peak fused-kernel gains reaching 82.7× on the underlying operations.

One framework, two vendors

The unifying method is portable kernel fusion: the same Triton source compiles natively on NVIDIA (CUDA) and AMD (ROCm) backends with no source-level changes. Because the speedups come from eliminating memory round-trips rather than vendor-specific tricks, they hold across both ecosystems and across workloads as different as binding-affinity prediction, genomic attention, and dynamic-programming alignment.

The poster

Presented at ISC High Performance 2026, Hamburg. The full poster collects the per-model results, the unified framework diagram, and the original-vs-optimized comparisons in one view.

ISC High Performance 2026 poster: Portable GPU Kernel Acceleration for Biological Foundation Models using OpenAI Triton, showing speedups for DualBind, DeepMind genomic models, ESM-2, ProtBERT, and Needleman-Wunsch across NVIDIA and AMD GPUs. — **POSTER** The full ISC 2026 poster. Motivation and the unified Triton framework at top; per-model original-vs-optimized comparisons below for DualBind, DeepMind genomic models, ESM-2, ProtBERT, and Needleman-Wunsch. Click to open full size.

The thesis in one line: biological foundation models are bottlenecked by memory traffic, not model design, and one portable Triton fusion framework accelerates DNA, protein, genomic, and molecular workloads up to 720× across NVIDIA and AMD with zero accuracy loss.

Part of an ongoing line of work on portable GPU kernel acceleration for biological and medical AI, benchmarked across NVIDIA H100 and AMD MI300X. See more on the research page.

Portable GPU Kernel Acceleration for Biological Foundation Models & Algorithms using OpenAI Triton

Results across six workloads

NVIDIA DualBind

DeepMind genomic models

Meta ESM-2 protein LM

RostLab ProtBERT

Needleman-Wunsch

Micro-benchmark peak

One framework, two vendors

The poster