Welcome to the CSD PhD Seminar webpage!
The PhD Seminar is a weekly meeting that brings together students of the Center around interactive paper presentations, classes or tutorials. The seminar is not limited to students from the Center: everybody is welcome to join us!
The talks will take place in the Conference Room of the center on Mondays from 12pm to 1pm. Sandwiches will be distributed to participants during the seminar.
Our ambition is to vary presentation formats. Feel free to contact us if you have ideas for classes or tutorials that could be organized during the seminar.
The CSD PhD Seminar team - Mathieu, Maria, Othmane & Gaspar
(On site at the CSD - Online at meet.google.com/atc-oiee-ezr)
17th October: Thomas Wang (HuggingFace) - BigScience: Collaboratively training a large multilingual language model
Over the last few years, pre-trained self-supervised language models have proven useful for many applications in many domains. These models aim to discover general representations from a large amount of text without the need for human annotation - a time-consuming and costly task. GPT-3 has shown, through scaling laws, that increasing model size is a path towards better models. Unfortunately, only a few organizations across the world can train such models, mainly because of the computational resources required. As a result, the scientific community relies on what these resource-rich groups are willing to publish to understand how the models are built, how they work, and how they can be further improved.
BigScience is a one-year research project whose main ambition is to train a 176-billion-parameter language model, of the same order of magnitude as GPT-3 (OpenAI’s proprietary solution), in a transparent, public and collaborative manner. To do so, 1000+ researchers from both academia and industry have gathered and contributed to 30 working groups in order to make decisions at every step of the way: the creation of multilingual datasets, the design of the model, the engineering challenges, the formulation of a new license for the model, legal considerations around personally identifiable information within training datasets, the development of evaluation tools and, finally, reflections on downstream applications in different domains, such as biomedicine.
- Paper presentation