Seminar 4: Thomas Wang
Title: BigScience: Collaboratively training a large multilingual language model
Abstract:
Over the last few years, pre-trained self-supervised language models have proven useful for many applications across many domains. These models aim to discover general representations from large amounts of text without the need for human annotation, a time-consuming and costly task. GPT-3 has shown, through scaling laws, that increasing model size is a path towards better models. Unfortunately, only a few organizations across the world have the resources to train such models, especially because of the computational resources required. As a result, the scientific community relies on what these resource-rich groups are willing to publish to understand how such models are built, how they work, and how they can be further improved.
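For context, these scaling laws (Kaplan et al., 2020) describe test loss as a power law in the number of non-embedding parameters N; the form and constants below are the published empirical fits, quoted here only as an illustrative sketch rather than as part of the project itself:

L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N}, \qquad \alpha_N \approx 0.076, \quad N_c \approx 8.8 \times 10^{13}

Under this fit, each order-of-magnitude increase in model size yields a further, predictable drop in loss, which is the motivation for training models at the GPT-3 scale.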
BigScience is a one-year research project whose main ambition is to train a 176-billion-parameter language model, on the same order of magnitude as GPT-3 (OpenAI’s proprietary model), in a transparent, public, and collaborative manner. To do so, 1000+ researchers from both academia and industry have gathered into 30 working groups to make decisions at every step of the way: the creation of multilingual datasets, the design of the model, the engineering challenges, the formulation of a new license for the model, the legal considerations around personally identifiable information within training datasets, the development of evaluation tools, and finally reflections on downstream applications in different domains, such as the biomedical domain.