I'm currently a research engineer at Hugging Face in 🥐 Paris, where I'm part of the nanotron distributed training team and work on various research reproduction efforts on the Hugging Face science team (FP8 research, Infini-Attention, MoE's Expert Parallelism, DoMiNo, DoReMi). These Twitter threads document my research experiments; scroll through earlier and later tweets to see the full journey and discoveries.
I maintain a life-long learning progress thread where I've shared my study notes over the past two years (3D parallelism); scroll down the Twitter thread to see the learning journey.
Before Hugging Face, I designed a study plan spanning many subjects, then consistently studied from 3:30 AM to 3:30 PM, slept from 5:20 PM to 3:00 AM, and repeated this for two years.
Selected Work
Found two stable FP8 pretraining recipes that pretrained a LLaMA 2 architecture in FP8 for both the forward and backward passes, as well as both optimizer moments (a 50% memory reduction; see the back-of-the-envelope byte count after the recipe list), while matching the standard BF16 mixed-precision baseline after 100B tokens with the same pretraining learning rate as LLaMA 2.
Recipe 1: with architecture and optimizer modifications
Recipe 2: an ablation of Recipe 1, without the architecture modifications
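To make the 50% figure concrete, here is a rough, illustrative per-parameter byte count. The storage layout is an assumption for illustration only (FP32 master weights kept in both setups, Adam with two moments, activations ignored), not necessarily the recipes' exact layout.

```python
# Illustrative per-parameter memory for model weights, gradients and Adam moments.
# Assumption (not necessarily the recipes' exact layout): FP32 master weights
# are kept in both setups; activation memory is ignored.
BYTES = {"fp32": 4, "bf16": 2, "fp8": 1}

def bytes_per_param(weight: str, grad: str, moment: str) -> int:
    # master weights + model weights + gradients + Adam m and v
    return BYTES["fp32"] + BYTES[weight] + BYTES[grad] + 2 * BYTES[moment]

bf16_baseline = bytes_per_param("bf16", "bf16", "fp32")  # 4 + 2 + 2 + 8 = 16
fp8_recipe    = bytes_per_param("fp8",  "fp8",  "fp8")   # 4 + 1 + 1 + 2 = 8

print(f"BF16 mixed precision: {bf16_baseline} bytes/param")
print(f"FP8 recipe:           {fp8_recipe} bytes/param "
      f"({1 - fp8_recipe / bf16_baseline:.0%} reduction)")
```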
Nouamane Tazi, Ferdinand Mom, Haojun Zhao, Phuc Nguyen, Mohamed Mekkouri, Leandro von Werra, Thomas Wolf.
A 3D parallelism distributed training library used for SmolLM and FineWeb.
Implemented a 3D parallelism library from scratch with ZeRO-1.
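For a flavor of what this involves, here is a minimal sketch (not the library's actual code) of building a 3D process grid with torch.distributed; the degrees DP, PP, TP and the rank layout are assumptions for illustration. The data-parallel group is the one a ZeRO-1 optimizer would shard its states across.

```python
# Minimal sketch of a 3D parallel process grid (data x pipeline x tensor).
# Launch with e.g.: torchrun --nproc_per_node=8 grid.py
import torch.distributed as dist

DP, PP, TP = 2, 2, 2  # assumed degrees; DP * PP * TP must equal the world size

def build_3d_grid():
    dist.init_process_group(backend="gloo")  # "nccl" on GPUs
    rank, world = dist.get_rank(), dist.get_world_size()
    assert world == DP * PP * TP

    # Assumed rank layout: rank = dp * (PP * TP) + pp * TP + tp
    dp_rank = rank // (PP * TP)
    pp_rank = (rank // TP) % PP
    tp_rank = rank % TP

    # Every rank must call new_group for every group, even groups it is not in.
    tp_group = dp_group = pp_group = None
    for d in range(DP):
        for p in range(PP):
            ranks = [d * PP * TP + p * TP + t for t in range(TP)]
            g = dist.new_group(ranks)
            if rank in ranks:
                tp_group = g          # tensor-parallel all-reduces happen here
    for p in range(PP):
        for t in range(TP):
            ranks = [d * PP * TP + p * TP + t for d in range(DP)]
            g = dist.new_group(ranks)
            if rank in ranks:
                dp_group = g          # ZeRO-1 shards optimizer states across this group
    for d in range(DP):
        for t in range(TP):
            ranks = [d * PP * TP + p * TP + t for p in range(PP)]
            g = dist.new_group(ranks)
            if rank in ranks:
                pp_group = g          # point-to-point activation/grad sends happen here
    return (dp_rank, pp_rank, tp_rank), (dp_group, pp_group, tp_group)

if __name__ == "__main__":
    coords, groups = build_3d_grid()
    print(f"rank {dist.get_rank()} -> (dp, pp, tp) = {coords}")
```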
Research reproduction of the Infini-Attention paper. TL;DR: Infini-Attention's performance gets worse as we increase the number of times we compress the memory. To the best of our knowledge, ring attention, YaRN, and RoPE scaling are still the best ways to extend a pretrained model to a longer context length.
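For context, here is a simplified sketch of the compressive memory at the heart of Infini-Attention, based on my reading of the paper's linear-attention-style update (the learned gate that mixes local and memory attention and the delta-rule variant are omitted; shapes, names, and the epsilon are my own). Each call to `update` is one compression step, and it is the growing count of these steps that tracked the degradation described above.

```python
# Simplified sketch of Infini-Attention's compressive memory (single head).
# Assumed shapes for illustration: q, k, v are (seq, d_head).
import torch
import torch.nn.functional as F

def sigma(x: torch.Tensor) -> torch.Tensor:
    # ELU(x) + 1: the usual linear-attention feature map.
    return F.elu(x) + 1.0

def retrieve(memory: torch.Tensor, z: torch.Tensor, q: torch.Tensor) -> torch.Tensor:
    # Read compressed past segments back out: sigma(Q) M / (sigma(Q) z).
    sq = sigma(q)                                            # (seq, d_head)
    return (sq @ memory) / (sq @ z + 1e-6)                   # (seq, d_head)

def update(memory: torch.Tensor, z: torch.Tensor,
           k: torch.Tensor, v: torch.Tensor):
    # Compress the current segment's KV pairs into the fixed-size memory.
    sk = sigma(k)                                            # (seq, d_head)
    memory = memory + sk.transpose(-2, -1) @ v               # (d_head, d_head)
    z = z + sk.sum(dim=0, keepdim=True).transpose(-2, -1)    # (d_head, 1)
    return memory, z

if __name__ == "__main__":
    d = 64
    memory, z = torch.zeros(d, d), torch.zeros(d, 1)
    for segment in range(4):                  # stream segments through the memory
        q, k, v = (torch.randn(128, d) for _ in range(3))
        ctx = retrieve(memory, z, q)          # attend to the compressed history
        memory, z = update(memory, z, k, v)   # one more compression step
```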