Assistant Professor & Chief Scientist | Princeton University / Together AI
Creator of FlashAttention and co-creator of Mamba. Stanford PhD in hardware-aware ML algorithms. His IO-aware attention kernel is now used in virtually every Transformer and enabled context lengths to scale from 4K to 1M tokens.
Biography
Tri Dao is an Assistant Professor of Computer Science at Princeton University and Co-founder & Chief Scientist of Together AI. He earned his PhD in Computer Science from Stanford University, co-advised by Christopher Ré and Stefano Ermon, with a dissertation on hardware-aware algorithms for efficient machine learning. His research sits at the intersection of machine learning and systems, with a focus on hardware-aware algorithms and sequence models with long-range memory. He is the creator of FlashAttention, the IO-aware exact attention algorithm now used in virtually every Transformer-based model, and co-creator (with Albert Gu) of Mamba, the selective state space model that achieves linear-time sequence modeling with 5x higher inference throughput than Transformers. He joined Together AI as Chief Scientist in July 2023 and Princeton as Assistant Professor in September 2024. His work has received the ICML 2022 Outstanding Paper runner-up (Monarch), the COLM 2024 Outstanding Paper (Mamba), the MLSys 2025 Outstanding Paper Honorable Mention (Marconi), and the Schmidt Sciences AI2050 Early Career Fellowship.
FlashAttention
IO-aware exact attention algorithm that reduces memory from quadratic to linear in sequence length and runs 2-4x faster than standard attention. Now used in virtually every Transformer-based model; it enabled context lengths to grow from 2-4K to 128K-1M tokens.
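The core trick behind FlashAttention's linear memory footprint is the online softmax: process keys/values block by block while maintaining a running row-max and running normalizer, so the full N x N score matrix is never materialized. A minimal NumPy sketch of that tiling idea (not the actual fused CUDA kernel; function names and block size are illustrative):

```python
import numpy as np

def naive_attention(Q, K, V):
    # Standard attention: materializes the full N x N score matrix.
    S = Q @ K.T / np.sqrt(Q.shape[-1])
    P = np.exp(S - S.max(axis=-1, keepdims=True))
    P /= P.sum(axis=-1, keepdims=True)
    return P @ V

def tiled_attention(Q, K, V, block=16):
    # FlashAttention-style streaming: visit K/V in blocks, keeping a
    # running max m and running denominator l, so memory stays linear
    # in sequence length while the result is exact.
    N, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    O = np.zeros((N, d))
    m = np.full(N, -np.inf)   # running row max
    l = np.zeros(N)           # running softmax denominator
    for j in range(0, K.shape[0], block):
        Kj, Vj = K[j:j + block], V[j:j + block]
        S = Q @ Kj.T * scale                  # only N x block scores in memory
        m_new = np.maximum(m, S.max(axis=-1))
        alpha = np.exp(m - m_new)             # rescale previously accumulated stats
        P = np.exp(S - m_new[:, None])
        l = alpha * l + P.sum(axis=-1)
        O = alpha[:, None] * O + P @ Vj
        m = m_new
    return O / l[:, None]
```

The rescaling factor `alpha` is what makes the streaming softmax exact: whenever a new block raises the running max, all previously accumulated sums are corrected in place.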
Mamba
Linear-time sequence model co-created with Albert Gu that achieves 5x higher inference throughput than Transformers by letting the SSM parameters be functions of the input, computed with a hardware-aware parallel scan algorithm. 17.5K+ GitHub stars.
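"Parameters as functions of the input" is the selection mechanism: the step size and the B/C matrices of the state-space recurrence are computed from each token, so the model can choose what to remember. A toy scalar-input sketch of that sequential recurrence (the real Mamba uses learned projections, per-channel step sizes, and a parallel scan; all names and shapes here are illustrative):

```python
import numpy as np

def selective_ssm(x, w_dt, W_B, W_C, A):
    # Toy selective state-space recurrence (Mamba-style, sequential form).
    # x: (T,) scalar input sequence; A: (n,) diagonal of a fixed state matrix.
    # The step size dt and the vectors B, C depend on x_t: the "selection".
    n = A.shape[0]
    h = np.zeros(n)
    ys = np.empty_like(x)
    for t, xt in enumerate(x):
        dt = np.log1p(np.exp(w_dt * xt))   # softplus: positive step size
        B = W_B * xt                        # input-dependent input map (n,)
        C = W_C * xt                        # input-dependent output map (n,)
        Abar = np.exp(dt * A)               # zero-order-hold discretization of diag(A)
        Bbar = dt * B                       # simple Euler discretization of B
        h = Abar * h + Bbar * xt            # state update
        ys[t] = C @ h                       # readout
    return ys
```

With negative entries in `A`, a large `dt` decays the state quickly (forget) while a small `dt` preserves it (remember); making `dt` input-dependent is what lets the model gate information by content.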
Mamba-2 (State Space Duality)
Theoretical framework proving that SSMs and attention are dual views of the same class of models, enabling tensor-core-optimized SSM computation that is 2-8x faster than Mamba-1's parallel scan.
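The duality is easiest to see in the simplest special case (no decay): a causal linear-attention map can be computed either as a masked matrix product (the "attention view", quadratic in T) or as a linear recurrence on a small state (the "SSM view", linear in T), and both give identical outputs. A NumPy sketch of this special case; the full state space duality framework generalizes it with input-dependent decay:

```python
import numpy as np

def causal_linear_attention(Q, K, V):
    # "Attention view": materialize the causal T x T interaction matrix.
    T = Q.shape[0]
    L = np.tril(np.ones((T, T)))      # causal mask (decay fixed to 1 here)
    return (L * (Q @ K.T)) @ V

def recurrent_form(Q, K, V):
    # "SSM view": the same map as a recurrence on a d x d_v state,
    # S_t = S_{t-1} + k_t v_t^T,  y_t = q_t^T S_t.
    T, d = Q.shape
    S = np.zeros((d, V.shape[1]))
    Y = np.empty((T, V.shape[1]))
    for t in range(T):
        S = S + np.outer(K[t], V[t])
        Y[t] = Q[t] @ S
    return Y
```

The practical payoff of the duality is hardware choice: the quadratic form maps onto matrix-multiply units (tensor cores) for training, while the recurrent form gives constant-memory, linear-time generation.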
FlashAttention-3
Third-generation attention kernel that exploits H100 asynchrony and FP8 low precision to reach up to 740 TFLOPS, 1.5-2x faster than FlashAttention-2 on Hopper GPUs.
Monarch
Class of hardware-efficient structured matrices (products of permuted block-diagonal matrices) enabling 2x training speedups for ViT and GPT-2 at comparable quality. ICML 2022 Outstanding Paper runner-up.
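The hardware efficiency comes from the factorization: instead of a dense n x n weight, multiply by two block-diagonal matrices interleaved with a stride permutation, which costs O(n*sqrt(n)) operations, all of them dense small matmuls that map well onto tensor cores. A toy sketch for n = m^2 (the published Monarch parameterization differs in details; this only illustrates the block-diagonal-plus-permutation structure):

```python
import numpy as np

def block_diag_apply(blocks, x):
    # blocks: (nb, b, b) stack of diagonal blocks; x: (n,) with n = nb * b.
    nb, b, _ = blocks.shape
    return np.einsum('nij,nj->ni', blocks, x.reshape(nb, b)).reshape(-1)

def monarch_apply(L, R, x):
    # Monarch-style product M = P L P R with L, R block-diagonal (m blocks
    # of size m, so n = m*m) and P the reshape-transpose stride permutation.
    # Applying M costs O(n*sqrt(n)) versus O(n^2) for a dense matrix.
    m = L.shape[1]
    y = block_diag_apply(R, x)
    y = y.reshape(m, m).T.reshape(-1)   # stride permutation P
    y = block_diag_apply(L, y)
    y = y.reshape(m, m).T.reshape(-1)   # P again (P is an involution here)
    return y
```

For n = m^2 this stores 2 * m * m^2 = 2n*sqrt(n) parameters instead of n^2, which is where the training speedup over a dense layer comes from.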
Dao-AILab
GitHub organization hosting flash-attention (22.8K stars), causal-conv1d, fast-hadamard-transform, and other high-performance CUDA kernels widely adopted across the ML community.
Try to understand both the algorithm and the systems that these algorithms run on.
I'm excited to announce that I'm joining Together AI as Chief Scientist, with the goal of making open source AI more accessible and cost-competitive.
The memory is linear in sequence length. In terms of computation, it's still quadratic, but we managed to make it much more hardware friendly.
I think my prior is that as long as your model architecture is reasonable and is hardware efficient, and you have lots of compute, and you have lots of data, the model would just do well.
Research generated March 19, 2026