CTO & Co-Founder | Databricks
Creator of Apache Spark. Co-founder and CTO of Databricks ($134B valuation). Associate Professor at UC Berkeley. Co-created MLflow, Delta Lake, ColBERT, and DSPy. Pioneered the concept of compound AI systems. ACM Doctoral Dissertation Award winner.
Biography
Matei Zaharia is a Romanian-Canadian computer scientist, co-founder and CTO of Databricks, and Associate Professor of EECS at UC Berkeley. Born in Bucharest, Romania, he earned his BMath (a double major in Computer Science and Combinatorics & Optimization) from the University of Waterloo in 2007, where his ICPC team placed 4th globally and 1st in North America.

He completed his PhD at UC Berkeley's AMPLab in 2013, winning the 2014 ACM Doctoral Dissertation Award. During his PhD he created Apache Spark (2009), now the most widely used engine for large-scale data processing. In 2013 he co-founded Databricks alongside Ali Ghodsi, Ion Stoica, and four other Berkeley researchers, growing it to a $134 billion valuation by late 2025.

He has also driven the creation of MLflow, Delta Lake, ColBERT, DSPy, DBRX, and Dolly. His 2024 blog post 'The Shift from Models to Compound AI Systems' has become a defining reference for the industry's move beyond monolithic LLMs. He was previously an assistant professor at MIT (2015) and Stanford (2016-2023) before returning to UC Berkeley in 2023.
Created the Apache Spark distributed computing framework during his PhD at UC Berkeley's AMPLab in 2009. Spark introduced Resilient Distributed Datasets (RDDs) for fault-tolerant in-memory cluster computing, becoming the most widely used engine for large-scale data processing, with over 2,000 contributors and adoption across virtually every Fortune 500 company.
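The core RDD idea, lazy transformations backed by a lineage that lets lost data be recomputed rather than replicated, can be illustrated in a few lines of plain Python. This is a toy sketch only; the `ToyRDD` class and its methods are invented for illustration and are not Spark's actual API.

```python
# Toy sketch of the RDD idea: transformations are lazy, and each RDD
# remembers how to recompute its data from its parent (its "lineage").
# Invented for illustration; not Spark's real API.

class ToyRDD:
    def __init__(self, compute):
        # 'compute' rebuilds the dataset from scratch. Storing the recipe
        # (not the data) is what makes recovery-by-recomputation possible.
        self._compute = compute

    @classmethod
    def parallelize(cls, data):
        items = list(data)
        return cls(lambda: iter(items))

    def map(self, f):
        # Lazy: nothing runs yet, we just extend the lineage.
        return ToyRDD(lambda: (f(x) for x in self._compute()))

    def filter(self, pred):
        return ToyRDD(lambda: (x for x in self._compute() if pred(x)))

    def collect(self):
        # An action: only now does the whole lineage actually execute.
        return list(self._compute())


squares = ToyRDD.parallelize(range(5)).map(lambda x: x * x)
big = squares.filter(lambda x: x > 4)
print(big.collect())   # runs the lineage end to end
print(big.collect())   # a second run recomputes the same result from lineage
```

In real Spark the same lineage mechanism lets a lost partition on a failed node be rebuilt by re-running only its chain of transformations.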
Co-founded Databricks in 2013 and serves as CTO, building the Lakehouse platform that unifies data warehousing and data lakes. The company reached a $134B valuation by late 2025 with $4.8B+ annual revenue run-rate, becoming the leading enterprise data and AI platform.
Created MLflow in 2018 as an open-source platform for managing the complete machine learning lifecycle -- experiment tracking, reproducibility, and model deployment. MLflow is now the most popular ML lifecycle management tool, with tens of thousands of users.
Co-created Delta Lake, an open-source ACID transaction layer for cloud data lakes, enabling reliable data engineering at scale. Published at VLDB 2020, Delta Lake is a cornerstone of the Lakehouse architecture.
Authored the influential 2024 BAIR blog post 'The Shift from Models to Compound AI Systems,' arguing that state-of-the-art AI results increasingly come from multi-component systems rather than monolithic models, shaping industry direction toward retrieval-augmented and agent-based architectures.
Co-developed DSPy, a framework for programming (rather than prompting) language models, enabling declarative composition of LLM pipelines that self-optimize. Accepted at ICLR 2024.
Co-developed ColBERT (SIGIR 2020) and its successor ColBERTv2 (NAACL 2022), late-interaction neural retrieval models that achieve state-of-the-art passage search quality with practical efficiency.
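Late interaction's scoring rule, often called MaxSim, is simple to state: embed every query and document token separately, then score a document as the sum, over query tokens, of each token's best match among the document tokens. A pure-Python sketch with hand-made toy vectors (real ColBERT uses learned BERT token embeddings):

```python
# Toy MaxSim scoring in the style of late-interaction retrieval (ColBERT):
# each query token vector is matched against its most similar document
# token vector, and the per-token maxima are summed. The vectors below
# are hand-made toys, not learned embeddings.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def maxsim_score(query_vecs, doc_vecs):
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)

# Query with two token vectors; doc_b covers both, doc_a only one.
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0]]
doc_b = [[1.0, 0.0], [0.0, 1.0]]

print(maxsim_score(query, doc_a))  # 1.0: second query token finds no match
print(maxsim_score(query, doc_b))  # 2.0: both query tokens matched
```

Because documents keep one vector per token rather than one vector total, MaxSim preserves fine-grained term matching while still allowing document vectors to be indexed ahead of time.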
Interesting trend in AI: the best results are increasingly obtained by compound systems, not monolithic models.
AI or machine learning should really have been called something like 'data extrapolation', because that's basically what machine learning algorithms do by definition.
The fewer data copy, ETL, and transport steps you have, the more likely your system is to be reliable.
The open source model is a good basis to start from, but I think everyone will then kind of tune it for their domain and get something better.
Compound AI systems will be the best way to maximize the quality, reliability, and measurement of AI applications going forward, and may be one of the most important trends in AI in 2024.
Research generated March 19, 2026