CTO & Co-Founder | Databricks
Creator of Apache Spark. Co-founder and CTO of Databricks ($134B valuation). Associate Professor at UC Berkeley. Co-created MLflow, Delta Lake, ColBERT, and DSPy. Pioneered the concept of compound AI systems. ACM Doctoral Dissertation Award winner.
Biography
Matei Zaharia is a Romanian-Canadian computer scientist, co-founder and CTO of Databricks, and Associate Professor of EECS at UC Berkeley. Born in Bucharest, Romania, he earned his BMath (a double major in Computer Science and Combinatorics & Optimization) from the University of Waterloo in 2007, where his ICPC team placed 4th globally and 1st in North America.

He completed his PhD at UC Berkeley's AMPLab in 2013, winning the 2014 ACM Doctoral Dissertation Award. During his PhD he created Apache Spark (2009), now the most widely used engine for large-scale data processing. In 2013 he co-founded Databricks alongside Ali Ghodsi, Ion Stoica, and four other Berkeley researchers, growing it to a $134 billion valuation by late 2025.

He has also driven the creation of MLflow, Delta Lake, ColBERT, DSPy, DBRX, and Dolly. His 2024 blog post 'The Shift from Models to Compound AI Systems' has become a defining reference for the industry's move beyond monolithic LLMs. He was previously an assistant professor at MIT (2015) and Stanford (2016-2023) before returning to UC Berkeley in 2023.
Created the Apache Spark distributed computing framework during his PhD at UC Berkeley's AMPLab in 2009. Spark introduced Resilient Distributed Datasets (RDDs) for fault-tolerant in-memory cluster computing, becoming the most widely used engine for large-scale data processing, with over 2,000 contributors and adoption across virtually every Fortune 500 company.
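The core RDD idea, lazy transformations backed by a lineage that lets lost data be recomputed rather than replicated, can be illustrated in a few lines of plain Python. This is a toy sketch only; the `ToyRDD` class and its methods are invented for illustration and are not Spark's actual API.

```python
# Toy sketch of the RDD idea: transformations are lazy, and each RDD
# remembers how to recompute its data from its parent (its "lineage").
# Invented for illustration; not Spark's real API.

class ToyRDD:
    def __init__(self, compute):
        # 'compute' rebuilds the dataset from scratch. Storing the recipe
        # (not the data) is what makes recovery-by-recomputation possible.
        self._compute = compute

    @classmethod
    def parallelize(cls, data):
        items = list(data)
        return cls(lambda: iter(items))

    def map(self, f):
        # Lazy: nothing runs yet, we just extend the lineage.
        return ToyRDD(lambda: (f(x) for x in self._compute()))

    def filter(self, pred):
        return ToyRDD(lambda: (x for x in self._compute() if pred(x)))

    def collect(self):
        # An action: only now does the whole lineage actually execute.
        return list(self._compute())


squares = ToyRDD.parallelize(range(5)).map(lambda x: x * x)
big = squares.filter(lambda x: x > 4)
print(big.collect())   # runs the lineage end to end
print(big.collect())   # a second run recomputes the same result from lineage
```

In real Spark the same lineage mechanism lets a lost partition on a failed node be rebuilt by re-running only its chain of transformations.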
Co-founded Databricks in 2013 and serves as CTO, building the Lakehouse platform that unifies data warehousing and data lakes. The company reached a $134B valuation by late 2025 with $4.8B+ annual revenue run-rate, becoming the leading enterprise data and AI platform.
Created MLflow in 2018 as an open-source platform for managing the complete machine learning lifecycle -- experiment tracking, reproducibility, and model deployment. MLflow is now the most popular ML lifecycle management tool, with tens of thousands of users.
Co-created Delta Lake, an open-source ACID transaction layer for cloud data lakes, enabling reliable data engineering at scale. Published at VLDB 2020, Delta Lake is a cornerstone of the Lakehouse architecture.
Authored the influential 2024 BAIR blog post 'The Shift from Models to Compound AI Systems,' arguing that state-of-the-art AI results increasingly come from multi-component systems rather than monolithic models, shaping industry direction toward retrieval-augmented and agent-based architectures.
Co-developed DSPy, a framework for programming (rather than prompting) language models, enabling declarative composition of LLM pipelines that self-optimize. Accepted at ICLR 2024.
Co-developed ColBERT (SIGIR 2020) and its successor ColBERTv2 (NAACL 2022), late-interaction neural retrieval models that achieve state-of-the-art passage search quality with practical efficiency.
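Late interaction's scoring rule, often called MaxSim, is simple to state: embed every query and document token separately, then score a document as the sum, over query tokens, of each token's best match among the document tokens. A pure-Python sketch with hand-made toy vectors (real ColBERT uses learned BERT token embeddings):

```python
# Toy MaxSim scoring in the style of late-interaction retrieval (ColBERT):
# each query token vector is matched against its most similar document
# token vector, and the per-token maxima are summed. The vectors below
# are hand-made toys, not learned embeddings.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def maxsim_score(query_vecs, doc_vecs):
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)

# Query with two token vectors; doc_b covers both, doc_a only one.
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0]]
doc_b = [[1.0, 0.0], [0.0, 1.0]]

print(maxsim_score(query, doc_a))  # 1.0: second query token finds no match
print(maxsim_score(query, doc_b))  # 2.0: both query tokens matched
```

Because documents keep one vector per token rather than one vector total, MaxSim preserves fine-grained term matching while still allowing document vectors to be indexed ahead of time.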
Interesting trend in AI: the best results are increasingly obtained by compound systems, not monolithic models.
AI or machine learning should really have been called something like 'data extrapolation', because that's basically what machine learning algorithms do by definition.
The fewer data copy, ETL, and transport steps you have, the more likely your system is to be reliable.
The open source model is a good basis to start from, but I think everyone will then kind of tune it for their domain and get something better.
Compound AI systems will be the best way to maximize the quality, reliability, and measurement of AI applications going forward, and may be one of the most important trends in AI in 2024.
Research generated March 19, 2026