BioTriton: Portable Cross-Vendor GPU Kernels for Bioinformatics

20+ Triton kernels, written once

10–19K× speedups over CPU/GPU baselines

NVIDIA + AMD: both vendors, one source

100% correct: identical or bitwise-equivalent outputs

Core bioinformatics workloads (sequence alignment, k-mer indexing, quality control) are still bottlenecked by CPU tools such as BioPython, samtools, and Jellyfish. The GPU solutions that do exist are CUDA-only and therefore locked to a single vendor, which leaves AMD users falling back to the CPU. BioTriton is a library of 20+ Triton GPU kernels that delivers 10–19,000× speedups over CPU and GPU baselines while staying portable: one Triton source compiles to both NVIDIA PTX and AMDGPU code (for AMD's ROCm stack) with no modification.

The key insight is that bioinformatics maps naturally onto Triton's block-structured programming model. Its data structures (2-bit DNA encoding, 5-bit amino acids, shared-memory-resident scoring matrices) and algorithms (wavefront-parallel dynamic programming) line up with how Triton wants to express work, which enables domain-aware kernel fusion and memory-hierarchy optimization without vendor lock-in.
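To illustrate what wavefront-parallel dynamic programming means here, the following is a minimal CPU sketch in plain Python. It is not BioTriton's kernel: the function names, the scoring parameters (`match=2`, `mismatch=-1`, `gap=-1`), and the linear gap model are assumptions chosen for illustration. It fills the Smith-Waterman matrix one anti-diagonal at a time and checks the result against the conventional row-by-row order.

```python
def smith_waterman_rowwise(a, b, match=2, mismatch=-1, gap=-1):
    """Reference: best local-alignment score, filled row by row."""
    n, m = len(a), len(b)
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0, H[i - 1][j - 1] + sub,
                          H[i - 1][j] + gap, H[i][j - 1] + gap)
            best = max(best, H[i][j])
    return best


def smith_waterman_wavefront(a, b, match=2, mismatch=-1, gap=-1):
    """Same recurrence, filled one anti-diagonal (wavefront) at a time."""
    n, m = len(a), len(b)
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best = 0
    # Cells with i + j == d form one anti-diagonal.
    for d in range(2, n + m + 1):
        i_lo = max(1, d - m)   # keeps j = d - i <= m
        i_hi = min(n, d - 1)   # keeps j = d - i >= 1
        # Each cell below reads only anti-diagonals d-1 and d-2, so the
        # iterations of this inner loop are independent of one another.
        # On a GPU they could be computed by parallel lanes.
        for i in range(i_lo, i_hi + 1):
            j = d - i
            sub = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0, H[i - 1][j - 1] + sub,
                          H[i - 1][j] + gap, H[i][j - 1] + gap)
            best = max(best, H[i][j])
    return best
```

With these illustrative parameters, `smith_waterman_wavefront("ACGT", "ACGT")` returns 8 (four matches at 2 each), and both functions agree on any input because they evaluate the same recurrence in a different order.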

**FIG 1** The BioTriton design. (1) Domain-aware Triton kernels exploit bioinformatics data structures (2-bit DNA, 5-bit amino acids, scoring matrices) and wavefront-parallel dynamic programming. (2) Kernels compile through MLIR to vendor backends (PTX for NVIDIA, AMDGPU for AMD) with no source changes. (3) Identical kernels run on NVIDIA and AMD GPUs. (4) Representative speedups and peak throughputs on an NVIDIA RTX PRO 6000 Blackwell for Smith-Waterman, k-mer indexing, Hamming distance, quality filtering, consensus calling, motif scan, and BLOSUM scoring; the same kernels run correctly on AMD MI300X.

Write once, run on both vendors

BioTriton kernels are written once and compile through MLIR to PTX (NVIDIA) and AMDGPU (AMD) with no source modification. The library was benchmarked on NVIDIA RTX PRO 6000 Blackwell, H100, and L4 GPUs and on the AMD MI300X, and all 20+ kernels compile and pass correctness tests on both vendors. That closes a real gap: existing GPU bioinformatics tools like NVBIO and GASAL2 are CUDA-only, forcing AMD users back onto CPU fallbacks.

Benchmark results

The table below lists representative speedups on NVIDIA RTX PRO 6000 Blackwell against standard CPU and GPU baselines. Every kernel produces numerically identical outputs (alignment, scoring) or bitwise-equivalent results (quality control, translation), and the identical kernels run on AMD MI300X without source changes.

Kernel	Baseline	Speedup	Peak throughput
BLOSUM scoring	PyTorch (GPU)	up to 1,688,154×	1.09 × 10¹⁰ ops/s
Motif scan (fuzzy)	PyTorch (GPU)	up to 274,855×	1.43 × 10⁹ ops/s
Consensus calling	samtools	up to 19,567×	1.10 × 10¹⁰ ops/s
k-mer indexing	Jellyfish (C++)	up to 2,986×	1.85 × 10¹⁰ ops/s
Hamming distance	PyTorch (GPU)	up to 2,712×	2.79 × 10¹⁰ ops/s
Smith-Waterman	BioPython / Parasail	up to 1,093× / 49×	7.4 × 10⁹ ops/s
Quality filter	fastp (C++)	avg 94×, max 251×	1.41 × 10¹⁰ ops/s

The speedups against compiled C++ tools (Jellyfish, fastp, samtools) matter most: they show the gains hold against highly optimized native baselines, not just Python reference implementations.
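As a concrete picture of the work behind the k-mer indexing row and the 2-bit DNA encoding it relies on, here is a minimal CPU reference sketch in plain Python. It is not BioTriton's kernel and says nothing about its memory layout: the names `CODE`, `count_kmers`, and `decode_kmer`, the A/C/G/T to 0..3 mapping, and the choice to reset the window on ambiguous bases such as `N` are all assumptions made for illustration.

```python
from collections import Counter

# Assumed 2-bit code for illustration; the bit assignment is not taken
# from BioTriton.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def count_kmers(seq, k):
    """Count k-mers, keyed by a rolling 2-bit-packed integer."""
    mask = (1 << (2 * k)) - 1  # keeps only the last k bases (2k bits)
    counts = Counter()
    key = 0
    valid = 0  # consecutive unambiguous bases seen so far
    for base in seq:
        code = CODE.get(base)
        if code is None:
            # Ambiguous base (e.g. 'N'): restart the window.
            key, valid = 0, 0
            continue
        # Shift in the new base and drop the oldest one.
        key = ((key << 2) | code) & mask
        valid += 1
        if valid >= k:
            counts[key] += 1
    return counts


def decode_kmer(key, k):
    """Turn a packed key back into its k-letter string."""
    return "".join("ACGT"[(key >> (2 * (k - 1 - i))) & 3] for i in range(k))
```

For `count_kmers("ACGTACGT", 3)`, decoding the keys gives ACG and CGT twice each and GTA and TAC once each. Packing each k-mer into an integer means a k-mer of length up to 32 fits in one 64-bit word, which is the property that makes the encoding attractive for GPU kernels.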

Why it matters

BioTriton shows that Triton is production-ready for portable scientific GPU computing. Because Triton lowers through MLIR rather than a single vendor-specific toolchain, the library gets write-once portability across NVIDIA and AMD without sacrificing performance. That is exactly what HPC centers running heterogeneous clusters need: DOE's Frontier and El Capitan use AMD, while most academic clusters run NVIDIA. Future directions include cross-architecture roofline analysis quantifying memory-vs-compute bottlenecks on each vendor, multi-GPU scaling for whole-genome alignment, and BioKernelBench, a reproducible cross-vendor benchmark suite for bioinformatics kernels.

The thesis in one line: bioinformatics data structures and algorithms map cleanly onto Triton's block model, so a single write-once kernel library can hit 10–19,000× speedups and run identically across NVIDIA and AMD, with no CUDA lock-in.

Part of an ongoing line of work on portable GPU kernel acceleration for biological and medical AI, benchmarked across NVIDIA H100 and AMD MI300X. See more on the research page.

BioTriton: Portable Cross-Vendor GPU Kernels for High-Throughput Bioinformatics via OpenAI Triton
