On the Convergence of Encoder-only Shallow Transformers

Wu, Yongtao; Liu, Fanghui; Chrysos, Grigorios; Cevher, Volkan

2023

Abstract

In this paper, we aim to build the global convergence theory of encoder-only shallow Transformers under a realistic setting, from the perspective of architectures, initialization, and scaling in a finite-width regime. The difficulty lies in how to tackle the softmax in the self-attention mechanism, the core ingredient of the Transformer. In particular, we diagnose the scaling scheme, carefully tackle the input/output of the softmax, and prove that quadratic overparameterization is sufficient for global convergence of our shallow Transformers under the He/LeCun initialization commonly used in practice. In addition, a neural tangent kernel (NTK) based analysis is given, which facilitates a comprehensive comparison. Our theory demonstrates a separation in the importance of different scaling schemes and initializations. We believe our results can pave the way for a better understanding of modern Transformers, particularly of their training dynamics.

Details

Title On the Convergence of Encoder-only Shallow Transformers

Author(s) Wu, Yongtao ; Liu, Fanghui ; Chrysos, Grigorios ; Cevher, Volkan

Pagination 41

Conference 37th Annual Conference on Neural Information Processing Systems, New Orleans, USA, December 10-16, 2023

Date 2023

Keywords

AI-ML

Other identifier(s) View record in ArXiv

Laboratories LIONS

Record Appears in Scientific production and competences > STI - School of Engineering > IEM - Institut d'Electricité et de Microtechnique > LIONS - Laboratory for Information and Inference Systems
Peer-reviewed publications
Conference Papers
Work produced at EPFL

Record creation date 2024-03-14
