Abstract

Artificial Neural Networks (ANNs) are typically trained via the back-propagation (BP) algorithm. This approach has been extremely successful: current models like GPT-3 have O(10^11) parameters, are trained on O(10^11) words, and produce awe-inspiring results. However, there are good reasons to look for alternative training methods: with current algorithms and hardware, sometimes only half of the available computing power is actually used. This is due to a complicated interplay between the size of the ANN, the available memory, the throughput limitations of interconnects, the architecture of the network of computers, and the training algorithm. Training a model like the aforementioned GPT-3 takes months and costs millions of dollars. A different training paradigm, one that could make clever use of specialized hardware, might train large ANNs more efficiently.

Details