Assistant Professor & Chief Scientist | Princeton University / Together AI
Creator of FlashAttention and co-creator of Mamba. Stanford PhD in hardware-aware ML algorithms. His IO-aware attention kernel is now used in virtually every Transformer and enabled context lengths to scale from 4K to 1M tokens.
Biography
Tri Dao is an Assistant Professor of Computer Science at Princeton University and Co-founder & Chief Scientist of Together AI. He earned his PhD in Computer Science from Stanford University, co-advised by Christopher Ré and Stefano Ermon, with a dissertation on hardware-aware algorithms for efficient machine learning. His research sits at the intersection of machine learning and systems, with a focus on hardware-aware algorithms and sequence models with long-range memory. He is the creator of FlashAttention, the IO-aware exact attention algorithm now used in virtually every Transformer-based model, and co-creator (with Albert Gu) of Mamba, the selective state space model that achieves linear-time sequence modeling with 5x higher inference throughput than Transformers. He joined Together AI as Chief Scientist in July 2023 and Princeton as Assistant Professor in September 2024. His work has received the ICML 2022 Outstanding Paper runner-up (Monarch), the COLM 2024 Outstanding Paper (Mamba), the MLSys 2025 Outstanding Paper Honorable Mention (Marconi), and the Schmidt Sciences AI2050 Early Career Fellowship.
FlashAttention
IO-aware exact attention algorithm that reduces memory from quadratic to linear in sequence length and runs 2-4x faster than standard attention. Now used in virtually every Transformer-based model; it enabled context lengths to grow from 2-4K to 128K-1M tokens.
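The core trick behind FlashAttention's linear memory footprint is the online softmax: process keys/values block by block while maintaining a running row-max and running normalizer, so the full N x N score matrix is never materialized. A minimal NumPy sketch of that tiling idea (not the actual fused CUDA kernel; function names and block size are illustrative):

```python
import numpy as np

def naive_attention(Q, K, V):
    # Standard attention: materializes the full N x N score matrix.
    S = Q @ K.T / np.sqrt(Q.shape[-1])
    P = np.exp(S - S.max(axis=-1, keepdims=True))
    P /= P.sum(axis=-1, keepdims=True)
    return P @ V

def tiled_attention(Q, K, V, block=16):
    # FlashAttention-style streaming: visit K/V in blocks, keeping a
    # running max m and running denominator l, so memory stays linear
    # in sequence length while the result is exact.
    N, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    O = np.zeros((N, d))
    m = np.full(N, -np.inf)   # running row max
    l = np.zeros(N)           # running softmax denominator
    for j in range(0, K.shape[0], block):
        Kj, Vj = K[j:j + block], V[j:j + block]
        S = Q @ Kj.T * scale                  # only N x block scores in memory
        m_new = np.maximum(m, S.max(axis=-1))
        alpha = np.exp(m - m_new)             # rescale previously accumulated stats
        P = np.exp(S - m_new[:, None])
        l = alpha * l + P.sum(axis=-1)
        O = alpha[:, None] * O + P @ Vj
        m = m_new
    return O / l[:, None]
```

The rescaling factor `alpha` is what makes the streaming softmax exact: whenever a new block raises the running max, all previously accumulated sums are corrected in place.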
Mamba
Linear-time sequence model co-created with Albert Gu that achieves 5x higher inference throughput than Transformers by letting the SSM parameters be functions of the input, computed with a hardware-aware parallel scan algorithm. 17.5K+ GitHub stars.
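"Parameters as functions of the input" is the selection mechanism: the step size and the B/C matrices of the state-space recurrence are computed from each token, so the model can choose what to remember. A toy scalar-input sketch of that sequential recurrence (the real Mamba uses learned projections, per-channel step sizes, and a parallel scan; all names and shapes here are illustrative):

```python
import numpy as np

def selective_ssm(x, w_dt, W_B, W_C, A):
    # Toy selective state-space recurrence (Mamba-style, sequential form).
    # x: (T,) scalar input sequence; A: (n,) diagonal of a fixed state matrix.
    # The step size dt and the vectors B, C depend on x_t: the "selection".
    n = A.shape[0]
    h = np.zeros(n)
    ys = np.empty_like(x)
    for t, xt in enumerate(x):
        dt = np.log1p(np.exp(w_dt * xt))   # softplus: positive step size
        B = W_B * xt                        # input-dependent input map (n,)
        C = W_C * xt                        # input-dependent output map (n,)
        Abar = np.exp(dt * A)               # zero-order-hold discretization of diag(A)
        Bbar = dt * B                       # simple Euler discretization of B
        h = Abar * h + Bbar * xt            # state update
        ys[t] = C @ h                       # readout
    return ys
```

With negative entries in `A`, a large `dt` decays the state quickly (forget) while a small `dt` preserves it (remember); making `dt` input-dependent is what lets the model gate information by content.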
Mamba-2 (State Space Duality)
Theoretical framework proving that SSMs and attention are dual views of the same class of models, enabling tensor-core-optimized SSM computation that is 2-8x faster than Mamba-1's parallel scan.
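The duality is easiest to see in the simplest special case (no decay): a causal linear-attention map can be computed either as a masked matrix product (the "attention view", quadratic in T) or as a linear recurrence on a small state (the "SSM view", linear in T), and both give identical outputs. A NumPy sketch of this special case; the full state space duality framework generalizes it with input-dependent decay:

```python
import numpy as np

def causal_linear_attention(Q, K, V):
    # "Attention view": materialize the causal T x T interaction matrix.
    T = Q.shape[0]
    L = np.tril(np.ones((T, T)))      # causal mask (decay fixed to 1 here)
    return (L * (Q @ K.T)) @ V

def recurrent_form(Q, K, V):
    # "SSM view": the same map as a recurrence on a d x d_v state,
    # S_t = S_{t-1} + k_t v_t^T,  y_t = q_t^T S_t.
    T, d = Q.shape
    S = np.zeros((d, V.shape[1]))
    Y = np.empty((T, V.shape[1]))
    for t in range(T):
        S = S + np.outer(K[t], V[t])
        Y[t] = Q[t] @ S
    return Y
```

The practical payoff of the duality is hardware choice: the quadratic form maps onto matrix-multiply units (tensor cores) for training, while the recurrent form gives constant-memory, linear-time generation.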
FlashAttention-3
Third-generation attention kernel that exploits H100 asynchrony and FP8 low precision to reach up to 740 TFLOPS, 1.5-2x faster than FlashAttention-2 on Hopper GPUs.
Monarch
Class of hardware-efficient structured matrices (products of permuted block-diagonal matrices) enabling 2x training speedups for ViT and GPT-2 at comparable quality. ICML 2022 Outstanding Paper runner-up.
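The hardware efficiency comes from the factorization: instead of a dense n x n weight, multiply by two block-diagonal matrices interleaved with a stride permutation, which costs O(n*sqrt(n)) operations, all of them dense small matmuls that map well onto tensor cores. A toy sketch for n = m^2 (the published Monarch parameterization differs in details; this only illustrates the block-diagonal-plus-permutation structure):

```python
import numpy as np

def block_diag_apply(blocks, x):
    # blocks: (nb, b, b) stack of diagonal blocks; x: (n,) with n = nb * b.
    nb, b, _ = blocks.shape
    return np.einsum('nij,nj->ni', blocks, x.reshape(nb, b)).reshape(-1)

def monarch_apply(L, R, x):
    # Monarch-style product M = P L P R with L, R block-diagonal (m blocks
    # of size m, so n = m*m) and P the reshape-transpose stride permutation.
    # Applying M costs O(n*sqrt(n)) versus O(n^2) for a dense matrix.
    m = L.shape[1]
    y = block_diag_apply(R, x)
    y = y.reshape(m, m).T.reshape(-1)   # stride permutation P
    y = block_diag_apply(L, y)
    y = y.reshape(m, m).T.reshape(-1)   # P again (P is an involution here)
    return y
```

For n = m^2 this stores 2 * m * m^2 = 2n*sqrt(n) parameters instead of n^2, which is where the training speedup over a dense layer comes from.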
Dao-AILab
GitHub organization hosting flash-attention (22.8K stars), causal-conv1d, fast-hadamard-transform, and other high-performance CUDA kernels widely adopted across the ML community.
Try to understand both the algorithm and the systems that these algorithms run on.
I'm excited to announce that I'm joining Together AI as Chief Scientist, with the goal of making open source AI more accessible and cost-competitive.
The memory is linear in sequence length. In terms of computation, it's still quadratic, but we managed to make it much more hardware friendly.
I think my prior is that as long as your model architecture is reasonable and is hardware efficient, and you have lots of compute, and you have lots of data, the model would just do well.
Research generated March 19, 2026