Co-founder & CTO | Inferact
Co-created vLLM, the most adopted open-source LLM inference engine (2,000+ contributors). Invented PagedAttention for efficient KV-cache management, reducing GPU memory waste from 60-80% to under 4%. Ph.D. from UC Berkeley under Ion Stoica. Inferact launched Jan 2026 with $150M seed at $800M valuation (a16z, Lightspeed, Sequoia).
GitHub
4 repositories · 195 total stars
Biography
Woosuk Kwon is the co-founder and CTO of Inferact, an AI inference infrastructure startup that launched in January 2026 with $150M in seed funding at an $800M valuation led by Andreessen Horowitz and Lightspeed. He co-created vLLM, the most widely adopted open-source LLM inference engine with over 2,000 contributors, and invented PagedAttention, an algorithm inspired by OS virtual memory that reduces KV-cache memory waste from 60-80% to under 4%. Kwon earned dual B.S. degrees in Computer Science and Mathematical Sciences from Seoul National University (ranked 1st of 134 students) and a Ph.D. in Computer Science from UC Berkeley under Ion Stoica with a 4.0 GPA, completing his dissertation in December 2025. Before founding Inferact, he was a Research Scientist at Google DeepMind (2024-2025) and a Member of Technical Staff at Thinking Machines Lab (May-Nov 2025).
The most widely adopted open-source LLM inference and serving engine, achieving up to 24x the throughput of HuggingFace Transformers and 3.5x that of TGI. Over 2,000 contributors and used by major AI labs globally.
Novel attention algorithm inspired by OS virtual memory paging that partitions KV caches into non-contiguous blocks, reducing memory waste from 60-80% to under 4% and enabling copy-on-write memory sharing across requests.
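The core idea can be sketched in a few lines: a pool of fixed-size physical blocks, a per-request block table mapping logical token positions to blocks scattered anywhere in memory, and reference counting for copy-on-write sharing. This is a minimal illustrative sketch, not vLLM's actual API; the class names, block size, and pool size are assumptions.

```python
class BlockPool:
    """Fixed pool of physical KV-cache blocks with reference counts (illustrative)."""

    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))
        self.refcount = [0] * num_blocks

    def allocate(self) -> int:
        block = self.free.pop()
        self.refcount[block] = 1
        return block

    def release(self, block: int) -> None:
        self.refcount[block] -= 1
        if self.refcount[block] == 0:
            self.free.append(block)


class Sequence:
    """Maps a request's logical token positions to non-contiguous physical blocks."""

    BLOCK_SIZE = 16  # tokens per block; a hypothetical value for illustration

    def __init__(self, pool: BlockPool):
        self.pool = pool
        self.block_table: list[int] = []  # logical block index -> physical block id
        self.num_tokens = 0

    def append_token(self) -> None:
        # Allocate a new physical block only when the current one fills up,
        # so at most one partially used block is wasted per sequence.
        if self.num_tokens % self.BLOCK_SIZE == 0:
            self.block_table.append(self.pool.allocate())
        self.num_tokens += 1

    def fork(self) -> "Sequence":
        """Copy-on-write sharing: the child reuses the parent's blocks and
        bumps their reference counts instead of copying the KV cache."""
        child = Sequence(self.pool)
        child.block_table = list(self.block_table)
        child.num_tokens = self.num_tokens
        for b in self.block_table:
            self.pool.refcount[b] += 1
        return child
```

Because blocks are allocated on demand and shared by refcount, two requests with a common prompt waste at most one partially filled block each, rather than each reserving contiguous memory for the maximum possible sequence length.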
AI inference infrastructure company commercializing vLLM with a serverless managed offering, observability, and multi-hardware support. Raised $150M seed at $800M valuation from a16z, Lightspeed, Sequoia, and Databricks.
Memory management system for serving LLMs with heterogeneous hardware, extending PagedAttention principles to mixed GPU environments.
Intercloud broker for sky computing that optimizes workload placement across multiple cloud providers for cost and performance.
Method for pruning tokens in Transformer models to accelerate inference while maintaining accuracy.
Lightweight and parallel GPU task scheduling framework for deep learning workloads.
We see a future where serving AI becomes effortless. Today, deploying a frontier model at scale requires a dedicated infrastructure team. Tomorrow, it should be as simple as spinning up a serverless database. The complexity doesn't disappear; it gets absorbed into the infrastructure we're building.
vLLM taught me a tough lesson: to keep the GPU fully utilized, we need to pay close attention to everything happening on the CPU.
vLLM improves the throughput of popular LLMs by 2-4x with the same level of latency compared to the state-of-the-art systems, such as FasterTransformer and Orca.
Research generated March 19, 2026