Zeroth order optimization for pretraining language models

Nathan Allaire, Mahsa Ghazvini Nejad, Sébastien Le Digabel and Vahid Partovi Nia

Conference paper (2025)

Open access document in PolyPublie and at the official publisher

Open access to the full text of this document
Official publisher version
Terms of use: Creative Commons: Attribution-NonCommercial-NoDerivatives (CC BY-NC-ND)

Abstract

The memory required to train Large Language Models (LLMs) grows with model size and is bounded by the available GPU memory. In particular, back-propagation, which requires computing first-order derivatives, adds to this memory overhead. Training extremely large language models with memory-efficient algorithms remains a challenge with theoretical and practical implications. Back-propagation-free training algorithms, also known as zeroth-order methods, have recently been examined to address this challenge. Their usefulness has been shown in fine-tuning language models. However, so far, no study has addressed language model pretraining with zeroth-order optimization, where the memory constraint is more severe. We establish theoretical connections between second-order, first-order, and zeroth-order methods. We then apply zeroth-order optimization to pretraining lightweight language models and discuss why it cannot be readily applied. In particular, we show that the curse of dimensionality is the main obstacle, and we pave the way towards modifications of zeroth-order methods for pretraining such models.
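
As an illustration of the dimensionality issue mentioned in the abstract, the following is a minimal sketch of a generic two-point zeroth-order gradient estimator with Gaussian directions. It is not taken from the paper: the function names (`zo_gradient`, `quadratic`, `relative_error`), the toy quadratic loss, and the smoothing step `mu` are all assumptions chosen for the example.

```python
import random


def zo_gradient(f, x, mu=1e-4, rng=random):
    """Two-point zeroth-order gradient estimate (hypothetical helper).

    Draws a Gaussian direction u and returns
    ((f(x + mu*u) - f(x - mu*u)) / (2*mu)) * u.
    Only two loss evaluations are needed; no back-propagation is used.
    """
    u = [rng.gauss(0.0, 1.0) for _ in x]
    x_plus = [xi + mu * ui for xi, ui in zip(x, u)]
    x_minus = [xi - mu * ui for xi, ui in zip(x, u)]
    slope = (f(x_plus) - f(x_minus)) / (2.0 * mu)
    return [slope * ui for ui in u]


def quadratic(x):
    """Toy loss f(x) = 0.5 * ||x||^2, whose true gradient is x itself."""
    return 0.5 * sum(xi * xi for xi in x)


def relative_error(dim, trials=200, seed=0):
    """Average squared error of one estimate, relative to ||grad||^2.

    Evaluated at x = (1, ..., 1), where ||grad||^2 equals dim.
    """
    rng = random.Random(seed)
    x = [1.0] * dim
    total = 0.0
    for _ in range(trials):
        g = zo_gradient(quadratic, x, rng=rng)
        total += sum((gi - xi) ** 2 for gi, xi in zip(g, x))
    return total / (trials * dim)


if __name__ == "__main__":
    for dim in (2, 10, 100):
        print(dim, relative_error(dim))
```

On this toy problem, the relative squared error of a single estimate grows roughly in proportion to the number of parameters, which gives some intuition for why such estimators become noisy at the scale of language model pretraining.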

Department:	Department of Mathematics and Industrial Engineering
Research centre:	GERAD - Groupe d'études et de recherche en analyse des décisions
ISBN:	9789897587306
PolyPublie URL:	https://publications.polymtl.ca/64441/
Conference name:	14th International Conference on Pattern Recognition Applications and Methods (ICPRAM 2025)
Conference location:	Porto, Portugal
Conference date(s):	2025-02-23 - 2025-02-25
Publisher:	Scitepress
DOI:	10.5220/0013261100003905
Official URL:	https://doi.org/10.5220/0013261100003905
Deposit date:	07 Apr. 2025 11:27
Last modified:	03 Feb. 2026 21:17

Cite in APA 7:	Allaire, N., Ghazvini Nejad, M., Le Digabel, S., & Partovi Nia, V. (2025, February). Zeroth order optimization for pretraining language models [Paper presentation]. 14th International Conference on Pattern Recognition Applications and Methods (ICPRAM 2025), Porto, Portugal. https://doi.org/10.5220/0013261100003905
