Senior AI Engineer | Technology Innovation Institute (TII)
Core developer of PEFT, TRL, and bitsandbytes integration at Hugging Face. Co-author of LLM.int8() and Falcon-H1.
Biography
Younes Belkada is a Senior AI Engineer at the Technology Innovation Institute (TII) in Abu Dhabi, where he works on pre-training, evaluation, and tooling for the Falcon family of large language models. Previously, he spent three years (2021-2024) as a Machine Learning Engineer on the Hugging Face Open Source team, where he became a core developer of PEFT (Parameter-Efficient Fine-Tuning), TRL (Transformer Reinforcement Learning), and the bitsandbytes quantization integration in Transformers. He co-authored the landmark LLM.int8() paper with Tim Dettmers, led the native Flash Attention 2 integration in Hugging Face Transformers, and wrote the widely used 4-bit QLoRA integration blog post.

He holds an MSc in Mathematics, Vision, and Learning (MVA) from ENS Paris-Saclay and studied Applied Mathematics and Computer Science at Polytech Sorbonne (Sorbonne Université), with an exchange semester in Data Science at EPFL. He has co-authored more than ten papers spanning BLOOM, StarCoder 2, Zephyr, Petals, Falcon Mamba, and Falcon-H1, and co-instructed the DeepLearning.AI course 'Open Source Models with Hugging Face'.
Core developer of Hugging Face's PEFT library (20.8k stars) enabling LoRA, QLoRA, and other parameter-efficient methods for fine-tuning large models on consumer hardware
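To make that concrete, here is a minimal sketch of attaching LoRA adapters with PEFT; the base checkpoint, rank, and target modules are illustrative choices, not anything prescribed by the library.

```python
# Minimal LoRA sketch with PEFT; model id and hyperparameters are illustrative.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m")

# Inject low-rank adapters into the attention projections; only these train.
lora_config = LoraConfig(
    r=16,                                 # rank of the low-rank update matrices
    lora_alpha=32,                        # scaling applied to the update
    target_modules=["q_proj", "v_proj"],  # OPT attention projection names
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()        # typically well under 1% of all weights
```

Because only the adapter weights receive gradients, optimizer state shrinks accordingly, which is what makes fine-tuning on consumer hardware feasible.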
Built the native 4-bit and 8-bit quantization integration in Hugging Face Transformers via bitsandbytes, making it possible to load and fine-tune 65B-parameter models on a single 48GB GPU
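A hedged sketch of that loading path using Transformers' BitsAndBytesConfig; the checkpoint id is a placeholder and the flags shown are a common, not mandatory, combination.

```python
# Sketch of on-the-fly 4-bit loading via bitsandbytes; repo id is a placeholder.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # or load_in_8bit=True for LLM.int8()
    bnb_4bit_quant_type="nf4",              # NormalFloat4 data type from QLoRA
    bnb_4bit_compute_dtype=torch.bfloat16,  # matmuls run in bf16 after dequant
)
model = AutoModelForCausalLM.from_pretrained(
    "huggyllama/llama-65b",                 # placeholder 65B-class checkpoint
    quantization_config=bnb_config,
    device_map="auto",                      # dispatch layers across available memory
)
```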
Core contributor to the TRL library (17.7k stars) for training transformer language models with RLHF, DPO, and PPO
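As a sketch of what TRL's trainer API looks like for DPO (argument names have shifted across releases, and the model and dataset ids below follow the pattern in TRL's own documentation rather than anything specific to this work):

```python
# Hedged DPO sketch with TRL; in older releases the tokenizer kwarg was
# `tokenizer=` rather than `processing_class=`.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

args = DPOConfig(output_dir="Qwen2-0.5B-DPO", beta=0.1)  # beta scales the implicit KL penalty
trainer = DPOTrainer(
    model=model,
    args=args,
    train_dataset=dataset,
    processing_class=tokenizer,
)
trainer.train()
```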
Co-authored the LLM.int8() paper (NeurIPS 2022) introducing mixed-precision decomposition that halves inference memory without accuracy loss, enabling 175B models on consumer GPUs
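The key equation, paraphrased loosely from the paper: a small set O of outlier feature dimensions (columns with unusually large magnitudes) is multiplied in fp16, while the bulk of the matmul runs in int8 and is dequantized with row and column scales c_x and c_w.

```latex
\[
\mathbf{X}\mathbf{W} \;\approx\;
\underbrace{\sum_{h \in O} \mathbf{X}^{\mathrm{fp16}}_{:,h}\,\mathbf{W}^{\mathrm{fp16}}_{h,:}}_{\text{outlier dimensions, full precision}}
\;+\;
\underbrace{\frac{1}{c_x\, c_w}\sum_{h \notin O} \mathbf{X}^{\mathrm{int8}}_{:,h}\,\mathbf{W}^{\mathrm{int8}}_{h,:}}_{\text{remaining dimensions, 8-bit}}
\]
```

Since O typically covers well under 1% of dimensions, nearly all weights are stored in 8 bits instead of 16, which is where the roughly 2x memory saving comes from.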
Led the native integration of Flash Attention 2 into Hugging Face Transformers, enabling faster and more memory-efficient training and inference across 30+ model architectures
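Opting in is a single flag at load time; a sketch, assuming the flash-attn package and a supported GPU (the model id is a placeholder):

```python
# Flash Attention 2 opt-in sketch; needs `pip install flash-attn` and an
# fp16/bf16 dtype, since the FA2 kernels do not run in fp32.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",                # placeholder supported architecture
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",    # replaces the eager attention path
)
```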
Co-authored the Falcon-H1 family of hybrid Transformer-SSM models at TII, spanning 0.5B to 34B parameters with state-of-the-art efficiency
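A hedged loading sketch; the exact Hub repo id is an assumption (TII publishes under the tiiuae organization), and the hybrid attention-SSM blocks require a recent transformers release:

```python
# Illustrative Falcon-H1 load; the repo id is assumed, not verified here.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tiiuae/Falcon-H1-0.5B-Instruct"     # assumed Hub id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

inputs = tokenizer("The Falcon models are", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))
```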
Contributed to the BigScience Workshop's BLOOM, a 176B-parameter open-access multilingual language model trained collaboratively by 1000+ researchers
Co-instructed the DeepLearning.AI short course teaching NLP, audio, image, and multimodal tasks using open-source Hugging Face models
Use PEFT (Parameter-Efficient Fine-Tuning)! In the Hugging Face ecosystem, you can now fine-tune large language models with a fraction of the memory using LoRA and QLoRA.
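A condensed sketch of the QLoRA recipe that post popularized, combining 4-bit loading with trainable adapters; the model id and hyperparameters are illustrative rather than taken from the post:

```python
# QLoRA in three steps: quantize the base model, prepare it, attach adapters.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-350m",                        # illustrative base model
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    ),
)
model = prepare_model_for_kbit_training(model)  # cast norms, enable input grads
model = get_peft_model(model, LoraConfig(
    r=16, lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
))
# Gradients flow only through the LoRA weights; the 4-bit base stays frozen.
```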
GPTQ-quantized models can now be loaded out of the box in Transformers, making large-model inference accessible to everyone.
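"Out of the box" means the quantization config stored in the repo is detected automatically; a sketch, assuming optimum and a GPTQ kernel backend are installed (the repo id is a community example, not an endorsement):

```python
# Loading a pre-quantized GPTQ checkpoint; no extra quantization arguments
# are needed, since the config ships with the repo.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TheBloke/Llama-2-7B-Chat-GPTQ"      # example community GPTQ repo
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
```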
The Falcon has landed in the Hugging Face ecosystem.