This page summarizes the open-source content I have worked on, either personally (in blue) or professionally (in green).
It mainly lists content published on sites other than my personal blog: primarily course translations, along with dataset and model releases.


All my projects



Key projects

French models and datasets

FAT5

FAT5 is a PyTorch implementation of T5 with a UL2 objective optimized for GPGPU, developed with Boris ALBAR.
It uses custom CUDA and Triton kernels along with specific optimizations to increase throughput and reduce memory usage for training and inference by a factor of 2 compared to the original implementation available in Hugging Face.
We applied it by pre-training a 147M-parameter French model on a single A100. Based on this run, we estimate that the pre-training cost of such a model can be brought down to just €1,200 (estimate based on a Sesterce instance).
The pre-training code is available on GitHub under the Apache-2.0 license and the trained model weights on CATIE's Hugging Face account. A blog post detailing our methodology is available here.

NER

The NERmemBERT models form a family of French Named Entity Recognition models capable of labeling up to 4 entity types (People, Locations, Organizations, and Miscellaneous, such as artwork names, disease names, etc.). They are available in base size (110M or 136M parameters) and large size (336M), handling contexts ranging from 512 to 8,192 tokens. The weights are released as open source, as are the datasets used for training. Everything is available on CATIE's Hugging Face account. A blog post detailing the methodology adopted is available here.
They have been downloaded more than 185,000 times since their release.
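
As a rough illustration of what "labeling up to 4 entity types" produces, here is a minimal sketch that groups token-level labels into entities. The label names (`PER`, `LOC`, `ORG`, `MISC`, `O`), the absence of B-/I- prefixes, the example sentence, and the `group_entities` helper are all assumptions made for illustration; this is not the models' actual post-processing code.

```python
def group_entities(tokens, labels):
    """Merge consecutive tokens sharing the same non-"O" label into (text, label) pairs."""
    entities = []
    current_tokens, current_label = [], "O"
    for token, label in zip(tokens, labels):
        if label == current_label and label != "O":
            # Same entity type as the previous token: extend the current entity.
            current_tokens.append(token)
            continue
        if current_label != "O":
            # The label changed: close the entity that was being built.
            entities.append((" ".join(current_tokens), current_label))
        current_tokens, current_label = [token], label
    if current_label != "O":
        # Close an entity that runs until the end of the sentence.
        entities.append((" ".join(current_tokens), current_label))
    return entities


# Hypothetical model output for one sentence (assumed labels, not real predictions).
tokens = ["Marie", "Curie", "travaillait", "à", "Paris", "."]
labels = ["PER", "PER", "O", "O", "LOC", "O"]
entities = group_entities(tokens, labels)
print(entities)
```

One limitation of this simple grouping is that two adjacent entities of the same type are merged into one; a tagging scheme with explicit begin markers would be needed to tell them apart.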

Question Answering

The QAmemBERT models form a family of French question answering models capable of indicating whether or not the answer to a question is present in an associated context text. They are available in base size (110M or 136M parameters) and large size (335M), handling contexts ranging from 512 to 8,192 tokens. The weights are released as open source, as is the dataset used for training. Everything is available on CATIE's Hugging Face account. A blog post detailing the methodology adopted is available here.
They have been downloaded more than 160,000 times since their release.

DFP

Dataset of French Prompts (DFP) contains 113,129,978 rows covering 30 different NLP tasks.
A total of 724 prompts were written in the imperative, using both the informal (tu) and formal (vous) forms of address. The aim is to match as broadly as possible the pre-training data of the model that will consume this dataset, which is unknown to us.
The inputs and targets columns follow the same format as the xP3 dataset by Muennighoff et al.
All details are available on Hugging Face.
It has been downloaded more than 90,000 times since its release.
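
To make the inputs/targets format more concrete, here is a minimal sketch of how one labeled example could be expanded into prompted rows, once with the informal address and once with the formal one. The task, the template wording, and the `build_rows` helper are hypothetical and are not taken from DFP itself; only the `inputs` and `targets` column names come from the description above.

```python
# Hypothetical templates for a sentiment task, written in the imperative
# with informal (tu) and formal (vous) address. Not actual DFP prompts.
TEMPLATES = {
    "tu": 'Donne le sentiment du texte suivant : "{text}"',
    "vous": 'Donnez le sentiment du texte suivant : "{text}"',
}


def build_rows(text, target):
    """Return one row with `inputs` and `targets` columns per template."""
    return [
        {"inputs": template.format(text=text), "targets": target}
        for template in TEMPLATES.values()
    ]


rows = build_rows("Ce film est excellent.", "positif")
for row in rows:
    print(row)
```

Each source example yields as many rows as there are templates, which is how a modest number of prompts can expand into a very large number of rows.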

La marmite

This project is still in progress.
The goal is to provide a French equivalent of The Cauldron dataset in order to train a French VLM.
This dataset will include OCR data (see those already available online here), captioning data (see those already available online here), VQA data (see those already available online here), and reasoning data.
The sub-datasets already online have been downloaded more than 50,000 times since their release.



Translations

NYU course by Yann LeCun and Alfredo Canziani

This translation was the longest to carry out, spanning from 2020 to 2022.
The content is structured into 19 units spread across 33 🎥 videos (lectures and tutorials) with a total duration of approximately 45 hours, 74 web pages 🌐 summarizing the videos through notes taken by students during class, and 16 Jupyter notebooks 📓 (in PyTorch) used during the tutorials. Additionally, a dataset of more than 3,000 manually verified parallel entries was created to train a translation model.
You can find all these resources on the dedicated website created for the occasion: https://lbourdois.github.io/cours-dl-nyu/.

Hugging Face 🤗 courses

NLP course

In 2022, I translated the Hugging Face natural language processing course.
The content is structured into 10 chapters comprising a total of 76 🎥 videos with a total duration of approximately 5 hours, 78 web pages 🌐, and 61 Jupyter notebooks 📓 (in PyTorch and TensorFlow).
You can find all these resources on the Hugging Face website.

Audio course

In 2023, I translated the Hugging Face audio course.
The content is structured into 8 units spread across 46 web pages 🌐.
You can find all these resources on the Hugging Face website.

Diffusion models course

In 2023, I translated the Hugging Face diffusion models course.
The content is structured into 4 chapters covering 17 web pages 🌐 and 8 Jupyter notebooks 📓 (in PyTorch).
You can find all these resources on Hugging Face's GitHub (the content has not yet been published on the official website).

AI agents course

In 2025, I collaborated with Kim NOEL on the translation of the Hugging Face AI agents course.
The content is structured into 4 units (+ 3 bonus) spread across 74 web pages 🌐 and 16 Jupyter notebooks 📓.
You can find all these resources on the Hugging Face website.

LLM evaluation guide

In 2025, I translated the LLM evaluation guide by Clémentine FOURRIER.
The content is structured into 5 chapters spread across 30 web pages 🌐 and 3 Jupyter notebooks 📓.
You can find all these resources here.