Projects

This page summarises the open-source work I have contributed to, either personally (in blue) or professionally (in green). It mainly lists content published on sites other than my personal blog: primarily course translations, along with dataset and model releases.

Key projects

French models and datasets

FAT5

FAT5 is a PyTorch implementation of T5 with a UL2 objective, optimised for GPGPU and developed with Boris ALBAR.
It uses custom CUDA and Triton kernels along with specific optimisations to increase throughput and reduce memory usage for training and inference by a factor of 2 compared to the original Hugging Face implementation.
We used it to pre-train a 147M-parameter French model on a single A100. The estimated pre-training cost for such a model is only €1,200 (estimate based on a Sesterce instance).
The pre-training code is available on GitHub under Apache-2.0 and the trained model weights are available on CATIE’s Hugging Face account. A blog post detailing our methodology is available here.

NER

The NERmemBERT family consists of French Named Entity Recognition models capable of labelling up to 4 entity types (Persons, Locations, Organisations, and Misc, such as work titles or diseases). The models are available in base (110M or 136M parameters) and large (336M) sizes and handle contexts from 512 to 8,192 tokens. The weights and the training datasets are freely available as open source on CATIE’s Hugging Face account. A blog post detailing the methodology is available here.
They have been downloaded more than 185,000 times since their release.

Question Answering

The QAmemBERT family consists of French question answering models capable of determining whether the answer to a question is present or absent in an associated context text. The models are available in base (110M or 136M parameters) and large (335M) sizes and handle contexts from 512 to 8,192 tokens. The weights and the training dataset are freely available as open source. A blog post detailing the methodology is available here.
They have been downloaded more than 160,000 times since their release.
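To make the "present or absent" decision concrete, here is a minimal sketch of one common way extractive QA models trained on SQuAD 2.0-style data decide that a question is unanswerable: the score of the best answer span is compared with the score of the "no answer" position (index 0). This illustrates the general technique and is an assumption, not a description of the QAmemBERT code. The best_span and has_answer helpers, the logits and the threshold are all hypothetical.

```python
def best_span(start_logits, end_logits, max_answer_len=30):
    """Return (score, start, end) of the best span, skipping index 0 (the no-answer slot)."""
    best = (float("-inf"), 0, 0)
    for s in range(1, len(start_logits)):
        # Only consider spans that end at or after their start and stay short enough.
        for e in range(s, min(s + max_answer_len, len(end_logits))):
            score = start_logits[s] + end_logits[e]
            if score > best[0]:
                best = (score, s, e)
    return best


def has_answer(start_logits, end_logits, threshold=0.0):
    """Decide that an answer is present if the best span beats the no-answer score."""
    null_score = start_logits[0] + end_logits[0]
    span_score, _, _ = best_span(start_logits, end_logits)
    return span_score - null_score > threshold


# Hypothetical logits for a 4-token input (index 0 is the no-answer position).
answerable = has_answer([0.1, 0.2, 4.0, 0.3], [0.2, 0.1, 0.5, 3.5])
unanswerable = has_answer([5.0, 0.2, 0.1, 0.3], [4.5, 0.1, 0.2, 0.3])
print(answerable, unanswerable)
```

In practice the threshold would be tuned on a validation set rather than fixed at zero.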

DFP

The Dataset of French Prompts (DFP) contains 113,129,978 rows covering 30 different NLP tasks.
A total of 724 prompts were written in the imperative form and in both informal and formal registers, to cover as broadly as possible the pre-training data used by the models that will consume these inputs.
The inputs and targets columns follow the same format as the xP3 dataset by Muennighoff et al. Full details are available on Hugging Face.
It has been downloaded more than 90,000 times since its release.
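As a rough illustration of the inputs/targets format described above, the sketch below applies a few prompt templates to one raw example and produces one row per template. Only the two column names come from the description. The templates, the build_rows helper and the sentiment example are hypothetical and are not copied from the dataset.

```python
# Hypothetical prompt templates for a sentiment task, one per register.
# They are illustrative only and are not taken from DFP.
TEMPLATES = [
    'Donne le sentiment de la critique suivante : "{text}"',               # imperative
    'Peux-tu donner le sentiment de la critique suivante ? "{text}"',      # informal
    'Pouvez-vous donner le sentiment de la critique suivante ? "{text}"',  # formal
]


def build_rows(text: str, label: str) -> list[dict[str, str]]:
    """Return one {inputs, targets} row per template for a single raw example."""
    return [
        {"inputs": template.format(text=text), "targets": label}
        for template in TEMPLATES
    ]


rows = build_rows("Un film magnifique, je recommande.", "positif")
for row in rows:
    print(row)
```

Each raw example therefore yields as many rows as there are templates for its task, which is how a prompted dataset can grow much larger than its source data.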

La marmite

This project is still in progress.
The goal is to provide a French equivalent of The Cauldron dataset in order to train a French VLM. It will include OCR data (available here), captioning data (available here), VQA data (available here) and reasoning data.
The sub-datasets already online have been downloaded more than 50,000 times since their release.

Translations

Yann LeCun & Alfredo Canziani's NYU course

This translation was the longest to complete, spanning from 2020 to 2022.
The content is structured in 19 units across 33 lecture videos 🎥 (lectures and lab sessions) totalling approximately 45 hours, 74 web pages 🌐 summarising the videos via student notes, and 16 Jupyter notebooks 📓 (PyTorch) used during labs. A parallel dataset of over 3,000 manually verified pairs was also created to train a translation model.
All resources are available on the dedicated website: https://lbourdois.github.io/cours-dl-nyu/.

Hugging Face 🤗 courses

NLP course
In 2022, I translated the Hugging Face natural language processing course.
The content is structured in 10 chapters comprising 76 videos 🎥 (~5h), 78 web pages 🌐 and 61 Jupyter notebooks 📓 (PyTorch and TensorFlow).
All resources are available on the Hugging Face website.

Audio course
In 2023, I translated the Hugging Face audio course.
The content is structured in 8 units across 46 web pages 🌐.
All resources are available on the Hugging Face website.

Diffusion models course
In 2023, I translated the Hugging Face diffusion models course.
The content is structured in 4 chapters covering 17 web pages 🌐 and 8 Jupyter notebooks 📓 (PyTorch).
All resources are available on Hugging Face’s GitHub.

AI agents course
In 2025, I contributed, together with Kim NOEL, to the translation of the Hugging Face AI agents course.
The content is structured in 4 units (+ 3 bonus) across 74 web pages 🌐 and 16 Jupyter notebooks 📓.
All resources are available on the Hugging Face website.

LLM evaluation guide
In 2025, I translated Clémentine FOURRIER’s LLM evaluation guide.
The content is structured in 5 chapters across 30 web pages 🌐 and 3 Jupyter notebooks 📓.
All resources are available here.