NVIDIA's FlashAttention-4 Doubles AI Performance on Blackwell GPUs
January 26, 2026
Audio archived. Episodes older than 60 days are removed to save server storage. Story details remain below.
NVIDIA has released FlashAttention-4, a major optimisation for AI systems that squeezes over 1,600 teraflops from its Blackwell B200 chips. The upgrade delivers a 3.6x speedup over previous versions and slashes memory usage by 20 times, making it cheaper and faster to run large language models with long context windows.
The Petaflop Barrier Falls
NVIDIA has released FlashAttention-4, marking the first time an attention kernel has broken the petaflop barrier in GPU computing. The optimisation delivers 1,605 teraflops on the company's Blackwell B200 graphics processors, capturing 71 percent of the hardware's theoretical maximum performance.

The algorithm was first announced at the Hot Chips 2025 conference by TogetherAI chief scientist Tri Dao, who originally proposed the FlashAttention concept in 2022. The project has since accumulated over 19,000 stars on GitHub.
Solving the Memory Problem
Traditional attention mechanisms in AI models face a brutal scaling challenge: when you double the length of text a model can process, memory requirements quadruple. This mathematical reality has constrained how much context AI systems can handle.

FlashAttention-4 tackles this head-on by reducing memory scaling from quadratic to linear. NVIDIA claims the optimisation uses 20 times less memory compared to standard implementations, enabling models to work with much longer documents without running out of resources.
Blackwell-Specific Engineering
The algorithm was built specifically for NVIDIA's latest Blackwell architecture, which presents an unusual challenge: computing power has roughly doubled while memory bandwidth has not kept pace.

FlashAttention-4 exploits the dedicated Tensor Memory built into each Blackwell processor, storing intermediate calculations directly on the chip rather than shuttling data back and forth to slower memory. Two key innovations drive the performance gains: a new online softmax algorithm that skips roughly 90 percent of its rescaling steps, and software-based approximations of the exponential function that boost throughput.
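The online softmax at the heart of the FlashAttention family can be sketched in a few lines. The version below is a minimal single-row illustration of the general streaming technique, not FlashAttention-4's kernel: it keeps only a running maximum, a running denominator, and a running weighted sum, rescaling them as new scores arrive, so the full score vector never needs to be stored. (FA4's specific trick of skipping most of these rescaling steps is not reproduced here.)

```python
import math

def online_softmax_weighted_sum(scores, values):
    # Streaming (online) softmax: consume scores one at a time while keeping
    # only three running quantities, so memory is O(1) in the score count.
    m = float("-inf")  # running maximum (for numerical stability)
    d = 0.0            # running denominator: sum of exp(s - m)
    acc = 0.0          # running exp-weighted sum of values
    for s, v in zip(scores, values):
        m_new = max(m, s)
        # Rescale previous partial results to the new maximum.
        scale = math.exp(m - m_new) if m != float("-inf") else 0.0
        w = math.exp(s - m_new)
        d = d * scale + w
        acc = acc * scale + w * v
        m = m_new
    return acc / d
```

This produces exactly the same result as computing the full softmax first; the point is that it never holds more than a constant amount of state, which is what lets attention scores stay in fast on-chip memory.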
Already in Production
Major AI inference platforms including SGLang and vLLM have already integrated FlashAttention-4 support. NVIDIA has incorporated the techniques into its cuDNN 9.14 software library, making the improvements accessible to developers without requiring custom programming.

For comparison, the previous FlashAttention-3 achieved roughly 740 teraflops on H100 chips. FlashAttention-4 now exceeds 1,600 teraflops on Blackwell hardware, representing a generational leap in capability.
Published January 26, 2026 at 5:32am