Senior AI Engineer | Technology Innovation Institute (TII)
Core developer of PEFT, TRL, and bitsandbytes integration at Hugging Face. Co-author of LLM.int8() and Falcon-H1.
Biography
Younes Belkada is a Senior AI Engineer at the Technology Innovation Institute (TII) in Abu Dhabi, where he works on pre-training, evaluation, and tooling for the Falcon family of large language models. Previously, he spent three years (2021-2024) as a Machine Learning Engineer on the Hugging Face Open Source team, where he became a core developer of PEFT (Parameter-Efficient Fine-Tuning), TRL (Transformer Reinforcement Learning), and the bitsandbytes quantization integration in Transformers. He co-authored the landmark LLM.int8() paper with Tim Dettmers, led the native Flash Attention 2 integration in Hugging Face Transformers, and wrote the widely used 4-bit QLoRA integration blog post.

He holds an MSc in Mathematics, Vision, and Learning (MVA) from ENS Paris-Saclay and studied Applied Mathematics and Computer Science at Polytech Sorbonne (Sorbonne Université), with an exchange semester in Data Science at EPFL. He has co-authored more than ten papers spanning BLOOM, StarCoder 2, Zephyr, Petals, Falcon Mamba, and Falcon-H1, and co-instructed the DeepLearning.AI course 'Open Source Models with Hugging Face'.
Core developer of Hugging Face's PEFT library (20.8k stars) enabling LoRA, QLoRA, and other parameter-efficient methods for fine-tuning large models on consumer hardware
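To make that concrete, here is a minimal sketch of attaching LoRA adapters with PEFT; the base checkpoint, rank, and target modules are illustrative choices, not anything prescribed by the library.

```python
# Minimal LoRA sketch with PEFT; model id and hyperparameters are illustrative.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m")

# Inject low-rank adapters into the attention projections; only these train.
lora_config = LoraConfig(
    r=16,                                 # rank of the low-rank update matrices
    lora_alpha=32,                        # scaling applied to the update
    target_modules=["q_proj", "v_proj"],  # OPT attention projection names
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()        # typically well under 1% of all weights
```

Because only the adapter weights receive gradients, optimizer state shrinks accordingly, which is what makes fine-tuning on consumer hardware feasible.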
Built the native 4-bit and 8-bit quantization integration in Hugging Face Transformers via bitsandbytes, making it possible to load and fine-tune 65B-parameter models on a single 48GB GPU
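A hedged sketch of that loading path using Transformers' BitsAndBytesConfig; the checkpoint id is a placeholder and the flags shown are a common, not mandatory, combination.

```python
# Sketch of on-the-fly 4-bit loading via bitsandbytes; repo id is a placeholder.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # or load_in_8bit=True for LLM.int8()
    bnb_4bit_quant_type="nf4",              # NormalFloat4 data type from QLoRA
    bnb_4bit_compute_dtype=torch.bfloat16,  # matmuls run in bf16 after dequant
)
model = AutoModelForCausalLM.from_pretrained(
    "huggyllama/llama-65b",                 # placeholder 65B-class checkpoint
    quantization_config=bnb_config,
    device_map="auto",                      # dispatch layers across available memory
)
```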
Core contributor to the TRL library (17.7k stars) for training transformer language models with RLHF, DPO, and PPO
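As a sketch of what TRL's trainer API looks like for DPO (argument names have shifted across releases, and the model and dataset ids below follow the pattern in TRL's own documentation rather than anything specific to this work):

```python
# Hedged DPO sketch with TRL; in older releases the tokenizer kwarg was
# `tokenizer=` rather than `processing_class=`.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

args = DPOConfig(output_dir="Qwen2-0.5B-DPO", beta=0.1)  # beta scales the implicit KL penalty
trainer = DPOTrainer(
    model=model,
    args=args,
    train_dataset=dataset,
    processing_class=tokenizer,
)
trainer.train()
```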
Co-authored the LLM.int8() paper (NeurIPS 2022) introducing mixed-precision decomposition that halves inference memory without accuracy loss, enabling 175B models on consumer GPUs
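The key equation, paraphrased loosely from the paper: a small set O of outlier feature dimensions (columns with unusually large magnitudes) is multiplied in fp16, while the bulk of the matmul runs in int8 and is dequantized with row and column scales c_x and c_w.

```latex
\[
\mathbf{X}\mathbf{W} \;\approx\;
\underbrace{\sum_{h \in O} \mathbf{X}^{\mathrm{fp16}}_{:,h}\,\mathbf{W}^{\mathrm{fp16}}_{h,:}}_{\text{outlier dimensions, full precision}}
\;+\;
\underbrace{\frac{1}{c_x\, c_w}\sum_{h \notin O} \mathbf{X}^{\mathrm{int8}}_{:,h}\,\mathbf{W}^{\mathrm{int8}}_{h,:}}_{\text{remaining dimensions, 8-bit}}
\]
```

Since O typically covers well under 1% of dimensions, nearly all weights are stored in 8 bits instead of 16, which is where the roughly 2x memory saving comes from.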
Led the native integration of Flash Attention 2 into Hugging Face Transformers, enabling faster and more memory-efficient training and inference across 30+ model architectures
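Opting in is a single flag at load time; a sketch, assuming the flash-attn package and a supported GPU (the model id is a placeholder):

```python
# Flash Attention 2 opt-in sketch; needs `pip install flash-attn` and an
# fp16/bf16 dtype, since the FA2 kernels do not run in fp32.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",                # placeholder supported architecture
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",    # replaces the eager attention path
)
```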
Co-authored the Falcon-H1 family of hybrid Transformer-SSM models at TII, spanning 0.5B to 34B parameters with state-of-the-art efficiency
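A hedged loading sketch; the exact Hub repo id is an assumption (TII publishes under the tiiuae organization), and the hybrid attention-SSM blocks require a recent transformers release:

```python
# Illustrative Falcon-H1 load; the repo id is assumed, not verified here.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tiiuae/Falcon-H1-0.5B-Instruct"     # assumed Hub id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

inputs = tokenizer("The Falcon models are", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))
```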
Contributed to the BigScience Workshop's BLOOM, a 176B-parameter open-access multilingual language model trained collaboratively by 1000+ researchers
Co-instructed the DeepLearning.AI short course teaching NLP, audio, image, and multimodal tasks using open-source Hugging Face models
Use PEFT (Parameter-Efficient Fine-Tuning)! In the Hugging Face ecosystem, you can now fine-tune large language models with a fraction of the memory using LoRA and QLoRA.
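A condensed sketch of the QLoRA recipe that post popularized, combining 4-bit loading with trainable adapters; the model id and hyperparameters are illustrative rather than taken from the post:

```python
# QLoRA in three steps: quantize the base model, prepare it, attach adapters.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-350m",                        # illustrative base model
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    ),
)
model = prepare_model_for_kbit_training(model)  # cast norms, enable input grads
model = get_peft_model(model, LoraConfig(
    r=16, lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
))
# Gradients flow only through the LoRA weights; the 4-bit base stays frozen.
```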
GPTQ-quantized models can now be loaded out of the box in Transformers, making large-model inference accessible to everyone.
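"Out of the box" means the quantization config stored in the repo is detected automatically; a sketch, assuming optimum and a GPTQ kernel backend are installed (the repo id is a community example, not an endorsement):

```python
# Loading a pre-quantized GPTQ checkpoint; no extra quantization arguments
# are needed, since the config ships with the repo.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TheBloke/Llama-2-7B-Chat-GPTQ"      # example community GPTQ repo
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
```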
The Falcon has landed in the Hugging Face ecosystem.