Co-founder & CSO | Hugging Face
Co-founded Hugging Face and created the Transformers and Datasets libraries. Led BigScience/BLOOM, the largest open AI research collaboration. Now pushing open-source robotics with LeRobot.
Biography
Thomas Wolf is co-founder and Chief Science Officer (CSO) of Hugging Face, the collaborative open-source platform for machine learning that hosts over 1 million public models and is valued at $4.5 billion. After graduating from Ecole Polytechnique (Paris), he researched laser-plasma interactions at the BELLA Center of Lawrence Berkeley National Laboratory, then completed a Ph.D. in statistical and quantum physics at Sorbonne University and ESPCI, working on superconducting materials.

He then changed fields entirely, earning a law degree from Pantheon Sorbonne University and spending five years as a European patent attorney at Cabinet Plasseraud. In 2015, while consulting for deep-learning startups, he recognized that many AI methods were re-branded statistical-physics approaches and taught himself modern machine learning. In 2016, alongside Clement Delangue and Julien Chaumond, he co-founded Hugging Face in New York City; the company began as a chatbot app before pivoting to become the central hub for open-source AI.

At Hugging Face, Wolf created the Transformers library (158k+ GitHub stars) and the Datasets library, co-authored the O'Reilly book 'Natural Language Processing with Transformers,' and initiated and led the BigScience research workshop that produced BLOOM, a 176-billion-parameter multilingual LLM. He now leads Hugging Face's push into open-source robotics with LeRobot. His papers have been cited over 55,000 times.
Created the Transformers library (158k+ GitHub stars), providing a unified API for state-of-the-art pretrained models across PyTorch, TensorFlow, and JAX. Used by 5,000+ research organizations worldwide.
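For a sense of what that unified API looks like in practice, here is a minimal sketch using the library's high-level pipeline helper; the task string is real, but the input sentence and printed output are only illustrative.

# pip install transformers
from transformers import pipeline

# pipeline() hides the tokenizer and pretrained model behind one call;
# with no model argument it downloads a default checkpoint on first use.
classifier = pipeline("sentiment-analysis")
print(classifier("Open-source AI keeps getting better."))
# Illustrative output: [{'label': 'POSITIVE', 'score': 0.99}]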
Created the Datasets library for efficient access to thousands of ML datasets. Won EMNLP 2021 Best Demonstration Paper.
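A minimal sketch of that access pattern, using the well-known "imdb" dataset as a stand-in for any Hub dataset:

from datasets import load_dataset

# A single call resolves, downloads, and caches a dataset from the Hub.
ds = load_dataset("imdb", split="train")
print(ds[0]["text"][:100])  # first 100 characters of the first review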
Initiated and led BigScience, the largest open research collaboration in AI (1,000+ researchers, 60+ countries). Produced BLOOM, a 176B-parameter multilingual open LLM trained on 46 languages.
Co-authored DistilBERT (NeurIPS 2019), showing that knowledge distillation can produce a BERT model 40% smaller while retaining 97% of its performance, enabling wider deployment.
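The mechanism behind that result can be sketched as a generic knowledge-distillation loss in PyTorch; the temperature value below is illustrative, not the paper's exact training recipe.

import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # Soften both output distributions, then pull the student's
    # predictions toward the teacher's with a KL-divergence penalty.
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    # The T**2 factor keeps gradient magnitudes comparable across temperatures.
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * temperature ** 2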
Co-authored the reference book on building language applications with Hugging Face, published by O'Reilly (2022).
Leading Hugging Face's open-source robotics initiative, bringing community-driven development to physical AI with affordable hardware like the $100 SO100 robotic arm.
Co-authored FineWeb (15T tokens) and FineWeb2 (3T+ words, multilingual): open pretraining datasets that produce better-performing LLMs than other open data sources.
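A corpus that size is typically streamed rather than downloaded; here is a hedged sketch with the datasets library, where the dataset ID and the "text" column are assumed from the Hugging Face Hub listing.

from datasets import load_dataset

# streaming=True iterates over shards lazily instead of fetching 15T tokens.
fw = load_dataset("HuggingFaceFW/fineweb", split="train", streaming=True)
print(next(iter(fw))["text"][:200])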
Led the release of the SmolLM family of small language models (135M to 1.7B parameters) with fully open training data, demonstrating that small open models can rival larger proprietary ones.
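A short generation sketch with the smallest model; the checkpoint ID is assumed from the Hub's HuggingFaceTB organization.

from transformers import AutoModelForCausalLM, AutoTokenizer

name = "HuggingFaceTB/SmolLM-135M"  # assumed checkpoint ID
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

inputs = tokenizer("Small open models can", return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(output[0], skip_special_tokens=True))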
It's nice to give a fish to someone to feed them; it's even better to teach them to fish.
Everyone feels like they can build with AI and not just consume AI.
All of these people also become roboticists in a way, if you give them the tools.
The missing brick was really software that could adapt, that could be dynamic.
The big bet was, can you build a big community in robotics as well?
Research generated March 19, 2026