I'm currently a research engineer at Hugging Face in 🥐 Paris, where I'm part of the nanotron distributed training team and work on various research reproduction efforts on the Hugging Face science team (FP8 research, Infini-Attention, MoE's Expert Parallelism, DoMiNo, DoReMi). These Twitter threads document my research experiments; scroll through earlier and later tweets to see the full journey and discoveries.
I maintain a life-long learning progress thread where I've shared my study notes over the past two years (3D parallelism); scroll down the Twitter thread to see the learning journey.
Before Hugging Face, I designed a study plan spanning many subjects, then consistently studied from 3:30 AM to 3:30 PM, slept from 5:20 PM to 3:00 AM, and repeated this for two years.
Selected Work
Found two stable FP8 pretraining recipes that pretrained a LLaMA 2 architecture in FP8 for both the forward and backward passes, as well as both optimizer moments (a 50% memory reduction; see the back-of-the-envelope byte count after the recipe list), while matching the standard BF16 mixed-precision baseline after 100B tokens with the same pretraining learning rate as LLaMA 2.
Recipe 1: with architecture and optimizer modifications
Recipe 2: an ablation of Recipe 1, without the architecture modifications
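To make the 50% figure concrete, here is a rough, illustrative per-parameter byte count. The storage layout is an assumption for illustration only (FP32 master weights kept in both setups, Adam with two moments, activations ignored), not necessarily the recipes' exact layout.

```python
# Illustrative per-parameter memory for model weights, gradients and Adam moments.
# Assumption (not necessarily the recipes' exact layout): FP32 master weights
# are kept in both setups; activation memory is ignored.
BYTES = {"fp32": 4, "bf16": 2, "fp8": 1}

def bytes_per_param(weight: str, grad: str, moment: str) -> int:
    # master weights + model weights + gradients + Adam m and v
    return BYTES["fp32"] + BYTES[weight] + BYTES[grad] + 2 * BYTES[moment]

bf16_baseline = bytes_per_param("bf16", "bf16", "fp32")  # 4 + 2 + 2 + 8 = 16
fp8_recipe    = bytes_per_param("fp8",  "fp8",  "fp8")   # 4 + 1 + 1 + 2 = 8

print(f"BF16 mixed precision: {bf16_baseline} bytes/param")
print(f"FP8 recipe:           {fp8_recipe} bytes/param "
      f"({1 - fp8_recipe / bf16_baseline:.0%} reduction)")
```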
Nouamane Tazi, Ferdinand Mom, Haojun Zhao, Phuc Nguyen, Mohamed Mekkouri, Leandro von Werra, Thomas Wolf.
A 3D parallelism distributed training library used for SmolLM and FineWeb.
Implemented a 3D parallelism library from scratch with ZeRO-1.
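For a flavor of what this involves, here is a minimal sketch (not the library's actual code) of building a 3D process grid with torch.distributed; the degrees DP, PP, TP and the rank layout are assumptions for illustration. The data-parallel group is the one a ZeRO-1 optimizer would shard its states across.

```python
# Minimal sketch of a 3D parallel process grid (data x pipeline x tensor).
# Launch with e.g.: torchrun --nproc_per_node=8 grid.py
import torch.distributed as dist

DP, PP, TP = 2, 2, 2  # assumed degrees; DP * PP * TP must equal the world size

def build_3d_grid():
    dist.init_process_group(backend="gloo")  # "nccl" on GPUs
    rank, world = dist.get_rank(), dist.get_world_size()
    assert world == DP * PP * TP

    # Assumed rank layout: rank = dp * (PP * TP) + pp * TP + tp
    dp_rank = rank // (PP * TP)
    pp_rank = (rank // TP) % PP
    tp_rank = rank % TP

    # Every rank must call new_group for every group, even groups it is not in.
    tp_group = dp_group = pp_group = None
    for d in range(DP):
        for p in range(PP):
            ranks = [d * PP * TP + p * TP + t for t in range(TP)]
            g = dist.new_group(ranks)
            if rank in ranks:
                tp_group = g          # tensor-parallel all-reduces happen here
    for p in range(PP):
        for t in range(TP):
            ranks = [d * PP * TP + p * TP + t for d in range(DP)]
            g = dist.new_group(ranks)
            if rank in ranks:
                dp_group = g          # ZeRO-1 shards optimizer states across this group
    for d in range(DP):
        for t in range(TP):
            ranks = [d * PP * TP + p * TP + t for p in range(PP)]
            g = dist.new_group(ranks)
            if rank in ranks:
                pp_group = g          # point-to-point activation/grad sends happen here
    return (dp_rank, pp_rank, tp_rank), (dp_group, pp_group, tp_group)

if __name__ == "__main__":
    coords, groups = build_3d_grid()
    print(f"rank {dist.get_rank()} -> (dp, pp, tp) = {coords}")
```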
Research reproduction of the Infini-Attention paper. TL;DR: Infini-Attention's performance gets worse as we increase the number of times we compress the memory. To the best of our knowledge, ring attention, YaRN, and RoPE scaling are still the best ways to extend a pretrained model to a longer context length.
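For context, here is a simplified sketch of the compressive memory at the heart of Infini-Attention, based on my reading of the paper's linear-attention-style update (the learned gate that mixes local and memory attention and the delta-rule variant are omitted; shapes, names, and the epsilon are my own). Each call to `update` is one compression step, and it is the growing count of these steps that tracked the degradation described above.

```python
# Simplified sketch of Infini-Attention's compressive memory (single head).
# Assumed shapes for illustration: q, k, v are (seq, d_head).
import torch
import torch.nn.functional as F

def sigma(x: torch.Tensor) -> torch.Tensor:
    # ELU(x) + 1: the usual linear-attention feature map.
    return F.elu(x) + 1.0

def retrieve(memory: torch.Tensor, z: torch.Tensor, q: torch.Tensor) -> torch.Tensor:
    # Read compressed past segments back out: sigma(Q) M / (sigma(Q) z).
    sq = sigma(q)                                            # (seq, d_head)
    return (sq @ memory) / (sq @ z + 1e-6)                   # (seq, d_head)

def update(memory: torch.Tensor, z: torch.Tensor,
           k: torch.Tensor, v: torch.Tensor):
    # Compress the current segment's KV pairs into the fixed-size memory.
    sk = sigma(k)                                            # (seq, d_head)
    memory = memory + sk.transpose(-2, -1) @ v               # (d_head, d_head)
    z = z + sk.sum(dim=0, keepdim=True).transpose(-2, -1)    # (d_head, 1)
    return memory, z

if __name__ == "__main__":
    d = 64
    memory, z = torch.zeros(d, d), torch.zeros(d, 1)
    for segment in range(4):                  # stream segments through the memory
        q, k, v = (torch.randn(128, d) for _ in range(3))
        ctx = retrieve(memory, z, q)          # attend to the compressed history
        memory, z = update(memory, z, k, v)   # one more compression step
```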