Title | Authors | Venue | Cited by | Year |
BLOOM: A 176B-parameter open-access multilingual language model | BS Workshop, T Le Scao, A Fan, C Akiki, E Pavlick, S Ilić, D Hesslow, ... | JMLR | 1642* | 2023 |
OBELICS: An open web-scale filtered dataset of interleaved image-text documents | H Laurençon, L Saulnier, L Tronchon, S Bekman, A Singh, A Lozhkov, ... | NeurIPS | 228 | 2024 |
The BigScience ROOTS corpus: A 1.6TB composite multilingual dataset | H Laurençon, L Saulnier, T Wang, C Akiki, A Villanova del Moral, ... | NeurIPS | 192* | 2022 |
What matters when building vision-language models? | H Laurençon, L Tronchon, M Cord, V Sanh | NeurIPS | 107 | 2024 |
The ROOTS search tool: Data transparency for LLMs | A Piktus, C Akiki, P Villegas, H Laurençon, G Dupont, AS Luccioni, ... | ACL | 29 | 2023 |
DP-Parse: Finding word boundaries from raw speech with an instance lexicon | R Algayres, T Ricoul, J Karadayi, H Laurençon, S Zaiem, A Mohamed, ... | TACL | 16 | 2022 |
Building and better understanding vision-language models: Insights and future directions | H Laurençon, A Marafioti, V Sanh, L Tronchon | NeurIPS | 12 | 2024 |
Unlocking the conversion of web screenshots into HTML code with the WebSight dataset | H Laurençon, L Tronchon, V Sanh | arXiv preprint | 10 | 2024 |
Continuous homeostatic reinforcement learning for self-regulated autonomous agents | H Laurençon, CR Ségerie, J Lussange, BS Gutkin | arXiv preprint | 6 | 2021 |
CALM: A multi-task benchmark for comprehensive assessment of language model bias | V Gupta, PN Venkit, H Laurençon, S Wilson, RJ Passonneau | COLM | 5 | 2023 |