Co-founder & CTO | Inferact
Co-created vLLM, the most adopted open-source LLM inference engine (2,000+ contributors). Invented PagedAttention for efficient KV-cache management, reducing GPU memory waste from 60-80% to under 4%. Ph.D. from UC Berkeley under Ion Stoica. Inferact launched Jan 2026 with $150M seed at $800M valuation (a16z, Lightspeed, Sequoia).
GitHub
4 repositories · 195 total stars
Biography
Woosuk Kwon is the co-founder and CTO of Inferact, an AI inference infrastructure startup that launched in January 2026 with $150M in seed funding at an $800M valuation led by Andreessen Horowitz and Lightspeed. He co-created vLLM, the most widely adopted open-source LLM inference engine with over 2,000 contributors, and invented PagedAttention, an algorithm inspired by OS virtual memory that reduces KV-cache memory waste from 60-80% to under 4%. Kwon earned dual B.S. degrees in Computer Science and Mathematical Sciences from Seoul National University (ranked 1st of 134 students) and a Ph.D. in Computer Science from UC Berkeley under Ion Stoica with a 4.0 GPA, completing his dissertation in December 2025. Before founding Inferact, he was a Research Scientist at Google DeepMind (2024-2025) and a Member of Technical Staff at Thinking Machines Lab (May-Nov 2025).
The most widely adopted open-source LLM inference and serving engine, achieving up to 24x the throughput of HuggingFace Transformers and 3.5x that of TGI. Over 2,000 contributors and used by major AI labs globally.
Novel attention algorithm inspired by OS virtual memory paging that partitions KV caches into non-contiguous blocks, reducing memory waste from 60-80% to under 4% and enabling copy-on-write memory sharing across requests.
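The core idea can be sketched in a few lines: a pool of fixed-size physical blocks, a per-request block table mapping logical token positions to blocks scattered anywhere in memory, and reference counting for copy-on-write sharing. This is a minimal illustrative sketch, not vLLM's actual API; the class names, block size, and pool size are assumptions.

```python
class BlockPool:
    """Fixed pool of physical KV-cache blocks with reference counts (illustrative)."""

    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))
        self.refcount = [0] * num_blocks

    def allocate(self) -> int:
        block = self.free.pop()
        self.refcount[block] = 1
        return block

    def release(self, block: int) -> None:
        self.refcount[block] -= 1
        if self.refcount[block] == 0:
            self.free.append(block)


class Sequence:
    """Maps a request's logical token positions to non-contiguous physical blocks."""

    BLOCK_SIZE = 16  # tokens per block; a hypothetical value for illustration

    def __init__(self, pool: BlockPool):
        self.pool = pool
        self.block_table: list[int] = []  # logical block index -> physical block id
        self.num_tokens = 0

    def append_token(self) -> None:
        # Allocate a new physical block only when the current one fills up,
        # so at most one partially used block is wasted per sequence.
        if self.num_tokens % self.BLOCK_SIZE == 0:
            self.block_table.append(self.pool.allocate())
        self.num_tokens += 1

    def fork(self) -> "Sequence":
        """Copy-on-write sharing: the child reuses the parent's blocks and
        bumps their reference counts instead of copying the KV cache."""
        child = Sequence(self.pool)
        child.block_table = list(self.block_table)
        child.num_tokens = self.num_tokens
        for b in self.block_table:
            self.pool.refcount[b] += 1
        return child
```

Because blocks are allocated on demand and shared by refcount, two requests with a common prompt waste at most one partially filled block each, rather than each reserving contiguous memory for the maximum possible sequence length.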
AI inference infrastructure company commercializing vLLM with a serverless managed offering, observability, and multi-hardware support. Raised $150M seed at $800M valuation from a16z, Lightspeed, Sequoia, and Databricks.
Memory management system for serving LLMs with heterogeneous hardware, extending PagedAttention principles to mixed GPU environments.
Intercloud broker for sky computing that optimizes workload placement across multiple cloud providers for cost and performance.
Method for pruning tokens in Transformer models to accelerate inference while maintaining accuracy.
Lightweight and parallel GPU task scheduling framework for deep learning workloads.
We see a future where serving AI becomes effortless. Today, deploying a frontier model at scale requires a dedicated infrastructure team. Tomorrow, it should be as simple as spinning up a serverless database. The complexity doesn't disappear; it gets absorbed into the infrastructure we're building.
vLLM taught me a tough lesson: to keep the GPU fully utilized, we need to pay close attention to everything happening on the CPU.
vLLM improves the throughput of popular LLMs by 2-4x with the same level of latency compared to the state-of-the-art systems, such as FasterTransformer and Orca.
Research generated March 19, 2026