Phuc Nguyen

High school dropout. ML research engineer.

DMs open. The best way to reach me is Twitter, Discord (neuralink), or email: phucnh791 [at] gmail [dot] com. If you're ever in Paris, feel free to drop me a message through any of these!

current learning goals

Math [why: my ultimate goal for learning math is to build a strong mathematical background so I can contribute to the science of almost any fundamental problem in deep learning, such as new optimization techniques or new architectures]: Working towards Metric Spaces and Geometry, Real and Complex Analysis, Measure Theory and Functional Analysis, Mean Field Theory, and 10 more courses.

Physics [why: my ultimate goal for learning physics is to build a technical foundation to pursue sci-fi and disruptive technology projects, and to bring the latest advances in science and technology to reality]: Working towards Condensed Matter Physics, Fission and Fusion, Nuclear Physics, Particle Physics, Radiation and Relativity, Exoplanets, and 5 more courses.

intro

I'm currently an ML research engineer building pretraining infrastructure at Nous Research in 🥐 Paris.

Previously, I was a research engineer at 🤗 Hugging Face, where I was part of the nanotron distributed training team and worked on various research reproduction efforts on the Hugging Face science team (FP8 research, Infini-Attention, MoE's Expert Parallelism, DoMiNo, DoReMi). These Twitter threads document my research experiments; scroll through earlier and later tweets to see the full journey and discoveries.

I maintain a lifelong learning progress thread where I've shared my study notes over the past two years; scroll down the Twitter thread to see the learning journey.

Before Hugging Face, I designed a study plan spanning many subjects, then consistently studied from 3:30 AM to 3:30 PM, slept from 5:20 PM to 3:00 AM, and repeated this for two years.

Selected Work

MoE Pretraining Infrastructure @ NousResearch

Built a new expert parallelism implementation and validated near-linear scaling across 16 nodes (128 GPUs) on the Qwen3-30B-A3B model:
• Outperforms torchtitan's default expert parallelism, which is 33% slower (9,930 vs 14,796 tok/s/GPU at EP=8)
• Scales near-linearly to 16 nodes (128 GPUs, 13,856 tok/s/GPU), enabling 10T tokens/month throughput at 256 GPUs
• Identified a potential optimization that could bump throughput by 30% (new fix coming up), translating to ~21k tok/s/GPU, equivalent to 30 trillion tokens in 2 months (for reference, Qwen3-30B-A3B was trained on 36T tokens)
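
For readers who want a feel for what the implementation does (and where the figures above come from: 13,856 tok/s/GPU × 256 GPUs × ~30 days ≈ 9T tokens, i.e. roughly the 10T/month quoted), here is a minimal, hedged sketch of expert-parallel dispatch and combine using all-to-all communication. It illustrates the general technique only; it is not the Nous Research or torchtitan code, and the function names, shapes, and simplifications are mine.

```python
# Minimal sketch of expert-parallel MoE dispatch/combine (illustration only; not
# the actual Nous Research or torchtitan implementation). Assumes torch.distributed
# is already initialized over the EP group, tensors live on the GPU, and each rank
# hosts len(experts) expert MLPs whose output dim matches the input dim.
import torch
import torch.distributed as dist


def ep_forward(tokens, expert_ids, experts, ep_group):
    """tokens: [n, d]; expert_ids: [n] global expert index chosen by the router."""
    world = dist.get_world_size(ep_group)
    rank = dist.get_rank(ep_group)
    per_rank = len(experts)

    # 1) Sort tokens by destination rank so each rank's chunk is contiguous.
    dest = expert_ids // per_rank
    order = torch.argsort(dest)
    tokens, expert_ids, dest = tokens[order], expert_ids[order], dest[order]

    # 2) Exchange how many tokens each rank sends to every other rank.
    send = torch.bincount(dest, minlength=world)
    recv = torch.empty_like(send)
    dist.all_to_all_single(recv, send, group=ep_group)
    send_l, recv_l = send.tolist(), recv.tolist()

    # 3) All-to-all the tokens (and their expert ids) to the ranks owning the experts.
    in_tokens = tokens.new_empty(sum(recv_l), tokens.shape[-1])
    in_ids = expert_ids.new_empty(sum(recv_l))
    dist.all_to_all_single(in_tokens, tokens, recv_l, send_l, group=ep_group)
    dist.all_to_all_single(in_ids, expert_ids, recv_l, send_l, group=ep_group)

    # 4) Run each local expert on its share of the received tokens.
    out = torch.empty_like(in_tokens)
    local = in_ids - rank * per_rank
    for i, expert in enumerate(experts):
        mask = local == i
        if mask.any():
            out[mask] = expert(in_tokens[mask])

    # 5) Reverse all-to-all to return results, then undo the original sort.
    back = torch.empty_like(tokens)
    dist.all_to_all_single(back, out, send_l, recv_l, group=ep_group)
    return back[torch.argsort(order)]
```

The two all-to-all exchanges (dispatch and combine) are the communication pattern that has to keep scaling as nodes are added; in EP implementations generally, a common source of throughput wins is overlapping these exchanges with expert compute.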

A failed experiment: Infini-Attention, and why we should keep trying?

Research reproduction of the Infini-Attention paper. TL;DR: Infini-Attention's performance gets worse as we increase the number of times we compress the memory. To the best of our knowledge, Ring Attention, YaRN, and RoPE scaling are still the best ways to extend a pretrained model to a longer context length.
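
For readers unfamiliar with what "compressing the memory" means here, below is a small sketch of Infini-Attention's compressive memory as described in the paper (the linear update, without the delta rule); the variable names and the epsilon clamp are mine, not the reproduction code.

```python
# Sketch of Infini-Attention's compressive memory (linear update, no delta rule),
# following the paper's equations; naming and the clamp are my own choices.
import torch
import torch.nn.functional as F


def update_memory(M, z, K, V):
    """Fold one segment's keys/values into the fixed-size memory.
    M: [d_k, d_v] memory matrix, z: [d_k] normalizer, K: [L, d_k], V: [L, d_v]."""
    sigma_K = F.elu(K) + 1.0       # the paper's non-linearity (ELU + 1)
    M = M + sigma_K.T @ V          # every segment is squashed into the same d_k x d_v matrix
    z = z + sigma_K.sum(dim=0)
    return M, z


def retrieve(M, z, Q):
    """Read from the memory state *before* folding in the current segment.
    Q: [L, d_k] -> memory readout [L, d_v]."""
    sigma_Q = F.elu(Q) + 1.0
    return (sigma_Q @ M) / (sigma_Q @ z).unsqueeze(-1).clamp_min(1e-6)
```

Because every segment's keys and values are folded into the same fixed d_k × d_v matrix, each additional compression step adds interference with earlier entries, which is one intuition for the degradation we observed as the number of compressions grows.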

Selected Study Notes

Talks

Get in Touch

Twitter · GitHub · Discord · Twitch · Email · LinkedIn · CV