current learning goals
Math [why: my ultimate goal for learning math is to build a strong mathematical background so I can contribute to the science of almost any fundamental problem in deep learning, such as new optimization techniques, new architectures, ...]: Working towards Metric Spaces and Geometry, Real and Complex Analysis, Measure Theory and Functional Analysis, Mean Field Theory,
18.950 Differential Geometry,
18.715 Introduction to Representation Theory,
18.705 Commutative Algebra,
18.676 Stochastic Calculus,
18.675 Theory of Probability,
18.655 Mathematical Statistics,
18.656[J] Mathematical Statistics: a Non-Asymptotic Approach,
18.615 Introduction to Stochastic Processes,
18.515 Mathematical Logic,
18.435[J] Quantum Computation, and 10 more courses.
Physics [why: my ultimate goal for learning physics is to build a technical foundation to pursue sci-fi and disruptive technology projects, and to bring the latest advances in science and technology to reality]: Working towards Condensed Matter Physics, Fission and Fusion, Nuclear Physics, Particle Physics, Radiation and Relativity, Exoplanets, 8.14 Experimental Physics II, String Theory, Introduction to Particle Accelerators, 8.370[J] Quantum Computation, 8.914 Plasma Astrophysics II, and 5 more courses.
intro
I'm currently an ML research engineer building pretraining infrastructure at
Nous Research in 🥐 Paris.
Previously, I was a research engineer at 🤗 Hugging Face, where I was part of the nanotron distributed training team and worked on various research reproduction efforts on the Hugging Face science team (FP8 research, Infini-Attention, MoE Expert Parallelism, DoMiNo, DoReMi). These Twitter threads document my research experiments; scroll through earlier and later tweets to see the full journey and discoveries.
I maintain a life-long learning progress thread where I've shared my study notes over the past two years (scroll down the Twitter thread to see the learning journey).
Before Hugging Face, I designed a study plan spanning many subjects, then consistently studied from 3:30 AM to 3:30 PM, slept from 5:20 PM to 3:00 AM, and repeated this for two years.
Selected Work
Built a new expert parallelism implementation and validated near-linear scaling across 16 nodes (128 GPUs) on the Qwen3-30B-A3B model:
• 1.49× the throughput of torchtitan's default expert parallelism (14,796 vs 9,930 tok/s/GPU at EP=8)
• Scales near-linearly to 16 nodes (128 GPUs, 13,856 tok/s/GPU), enabling 10T tokens/month throughput at 256 GPUs
• Identified a potential optimization that could bump throughput by 30% (new fix coming up, translating to ~21k tok/s/GPU, equivalent to 30 trillion tokens in 2 months; for reference, Qwen3-30B-A3B was trained on 36T tokens). A quick sanity check of these token budgets is sketched after this list.
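The monthly and two-month token budgets above follow from simple arithmetic. Here is a back-of-the-envelope sketch, assuming the measured per-GPU throughput holds at 256 GPUs and a 30-day month:

```python
# Back-of-the-envelope check of the token budgets quoted above.
SECONDS_PER_MONTH = 30 * 24 * 3600  # assume a 30-day month

def total_tokens(per_gpu_tok_s: float, n_gpus: int, months: float) -> float:
    """Tokens processed at a fixed per-GPU throughput over a given period."""
    return per_gpu_tok_s * n_gpus * months * SECONDS_PER_MONTH

# Measured 13,856 tok/s/GPU, assumed to hold at 256 GPUs:
print(f"{total_tokens(13_856, 256, 1) / 1e12:.1f}T tokens/month")       # ~9.2T  -> "10T tokens/month"
# Projected ~21k tok/s/GPU after the identified optimization:
print(f"{total_tokens(21_000, 256, 2) / 1e12:.1f}T tokens in 2 months")  # ~27.9T -> "30T in 2 months"
```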
Found two stable FP8 pretraining recipes that pretrain a LLaMA 2 architecture in FP8 for both the forward and backward passes, as well as both optimizer moments (50% memory reduction), while matching the standard BF16 mixed-precision baseline after 100B tokens with the same pretraining learning rate as Llama 2.
Recipe 1: with architecture and optimizer modifications
Recipe 2: an ablation of Recipe 1 without the architecture modifications (a minimal sketch of the FP8 casting primitive follows below)
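To make "pretrained in FP8 for both forward and backward passes" concrete, here is a minimal sketch of the basic casting primitive such recipes build on: per-tensor dynamic scaling into torch.float8_e4m3fn and back (requires PyTorch ≥ 2.1). This is an illustration only, not the actual recipe; the per-tensor dynamic scaling strategy is an assumption.

```python
import torch

E4M3_MAX = 448.0  # largest finite magnitude representable in float8_e4m3

def fp8_cast_roundtrip(x: torch.Tensor) -> torch.Tensor:
    """Quantize a tensor to FP8 (e4m3) with per-tensor dynamic scaling,
    then dequantize back to the original dtype. The round-trip error is
    the precision given up by keeping the tensor (weights, activations,
    or optimizer moments) in FP8."""
    scale = E4M3_MAX / x.abs().max().clamp(min=1e-12)
    x_fp8 = (x * scale).to(torch.float8_e4m3fn)  # 1 byte per element
    return x_fp8.to(x.dtype) / scale

x = torch.randn(4096, 4096, dtype=torch.bfloat16)
print(f"max round-trip error: {(x - fp8_cast_roundtrip(x)).abs().max().item():.4f}")
```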
Nouamane Tazi, Ferdinand Mom, Haojun Zhao, Phuc Nguyen, Mohamed Mekkouri, Leandro von Werra, Thomas Wolf.
A 3D parallelism distributed training library used for SmolLM and FineWeb.
Implemented a 3D parallelism library from scratch with ZeRO-1.
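For context, ZeRO-1 shards the optimizer state across data-parallel ranks: each rank stores and updates state only for its own slice of the parameters, then shares the updated parameters with everyone. Below is a minimal sketch of the idea, assuming an already-initialized torch.distributed process group; the round-robin parameter-level sharding and SGD-with-momentum update are simplifications for illustration, not the nanotron implementation.

```python
import torch
import torch.distributed as dist

@torch.no_grad()
def zero1_step(params, grads, momentum_buffers, rank, world_size, lr=1e-4):
    """One ZeRO-1-style optimizer step: gradients are fully replicated,
    but each parameter is 'owned' by exactly one rank, which is the only
    rank holding its momentum buffer and applying the update."""
    for i, (p, g) in enumerate(zip(params, grads)):
        owner = i % world_size                      # naive round-robin sharding
        if rank == owner:
            m = momentum_buffers.setdefault(i, torch.zeros_like(p))
            m.mul_(0.9).add_(g)                     # SGD momentum as a stand-in
            p.add_(m, alpha=-lr)
        dist.broadcast(p, src=owner)                # everyone gets the update
```

The memory saving comes from each rank holding only 1/world_size of the optimizer state; ZeRO-2 and ZeRO-3 extend the same idea to gradients and parameters.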
Research reproduction of the Infini-Attention paper. TL;DR: Infini-Attention's performance degrades as we increase the number of times the memory is compressed. To the best of our knowledge, ring attention, YaRN, and RoPE scaling are still the best ways to extend a pretrained model to a longer context length.
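For intuition about what "compressing the memory" means here: Infini-Attention folds each segment's keys and values into a fixed-size associative memory, so repeated compression forces more and more context through the same small state. Below is a rough, simplified sketch of that linear-attention-style recurrence (no delta rule, no gating with local attention; shapes and the ELU+1 feature map follow my reading of the paper and should be treated as illustrative).

```python
import torch
import torch.nn.functional as F

def infini_memory_step(M, z, q, k, v):
    """One segment of a fixed-size compressive memory.
    M: (d_k, d_v) associative memory, z: (d_k,) normalizer,
    q, k: (seq, d_k), v: (seq, d_v)."""
    sq, sk = F.elu(q) + 1, F.elu(k) + 1            # non-negative feature map
    # read what previous segments wrote into the memory
    retrieved = (sq @ M) / (sq @ z).clamp(min=1e-6).unsqueeze(-1)
    # then fold this segment's keys/values into the same fixed-size state
    M = M + sk.transpose(0, 1) @ v
    z = z + sk.sum(dim=0)
    return retrieved, M, z
```

Every additional segment is squeezed into the same (d_k, d_v) matrix, which is why performance drops as the number of compression steps grows.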