Abstract

Artificial intelligence, particularly the subfield of machine learning, has seen a paradigm shift towards data-driven models that learn from and adapt to data. This has resulted in unprecedented advancements in various domains such as natural language processing and computer vision, largely attributed to deep learning, a special class of machine learning models. Deep learning arguably surpasses traditional approaches by learning the relevant features from raw data through a series of computational layers. This thesis explores the theoretical foundations of deep learning by studying the relationship between the architecture of these models and the inherent structures found within the data they process. In particular, we ask: What drives the efficacy of deep learning algorithms and allows them to beat the so-called curse of dimensionality, i.e. the difficulty of learning generic functions in high dimension, where the number of training points required grows exponentially with the dimension? Is it their ability to learn relevant representations of the data by exploiting its structure? How do different architectures exploit different data structures?

To address these questions, we put forward the idea that the structure of the data can be effectively characterized by its invariances, i.e. the aspects of the data that are irrelevant for the task at hand. Our methodology takes an empirical approach to deep learning, combining experimental studies with physics-inspired toy models. These simplified models allow us to investigate and interpret the complex behaviors we observe in deep learning systems, offering insights into their inner workings, with the far-reaching goal of bridging the gap between theory and practice.

Specifically, we compute tight generalization error rates for shallow fully connected networks, demonstrating that they can perform well by learning linear invariances, i.e. by becoming insensitive to irrelevant linear directions in input space. However, we show that these architectures can perform poorly at learning non-linear invariances such as rotation invariance or invariance with respect to smooth deformations of the input. This result illustrates that an architecture that is ill-suited to a task may overfit, in which case a kernel method, for which representations are not learned, can be the better choice.

Modern architectures like convolutional neural networks, however, are particularly well suited to learning the non-linear invariances present in real data. In image classification, for example, the exact position of an object or feature might not be crucial for recognizing it; this property gives rise to an invariance with respect to small deformations. Our findings show that neural networks that are more invariant to deformations tend to achieve higher performance, underscoring the importance of exploiting such invariance.

Another key property that gives structure to real data is that high-level features are a hierarchical composition of lower-level features: a dog is made of a head and limbs, the head is made of eyes, nose, and mouth, and these are in turn made of simple textures and edges. These features can be realized in multiple synonymous ways, giving rise to a further invariance. To investigate the synonymic invariance that arises from the hierarchical structure of data, we introduce a toy data model that allows us to examine how features are extracted and combined to form increasingly complex representations.
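As a worked illustration of the curse of dimensionality invoked above, consider the standard nearest-neighbour estimate (a textbook argument, not a result specific to this thesis): for a generic Lipschitz target function of d-dimensional inputs, the typical distance from a test point to the closest of n training points scales as n^{-1/d}, so the generalization error one can guarantee decays only as

    \epsilon(n) \sim n^{-1/d},

meaning that reaching a fixed error level \epsilon requires n \sim \epsilon^{-d} training points, a number growing exponentially with the dimension d. Exploiting the invariances of the data, as discussed in the abstract, is the proposed route around this scaling.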
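The claim that better-performing networks are more invariant to small deformations can be probed with a simple measurement. The sketch below is a minimal stand-in, not the thesis's exact protocol: the model is an untrained toy CNN, a one-pixel translation stands in for a smooth deformation, and all tensor shapes are arbitrary. It compares the output change caused by the deformation with the change caused by Gaussian noise of matched norm; a ratio well below one indicates relative invariance to the deformation.

    import torch
    import torch.nn as nn

    torch.manual_seed(0)
    net = nn.Sequential(                          # untrained stand-in for a real model
        nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(),
        nn.AdaptiveAvgPool2d(4), nn.Flatten(), nn.Linear(8 * 16, 10),
    )

    x = torch.randn(32, 1, 28, 28)                # stand-in images
    x_shift = torch.roll(x, shifts=1, dims=-1)    # small "deformation": 1-pixel translation
    delta = x_shift - x

    # Gaussian perturbation rescaled to have the same norm as the deformation
    noise = torch.randn_like(x)
    noise = noise * delta.flatten(1).norm(dim=1).view(-1, 1, 1, 1) \
                  / noise.flatten(1).norm(dim=1).view(-1, 1, 1, 1)

    with torch.no_grad():
        d_deform = (net(x_shift) - net(x)).pow(2).sum(1).mean()
        d_noise = (net(x + noise) - net(x)).pow(2).sum(1).mean()

    print(float(d_deform / d_noise))  # < 1: output moves less under the deformation than under noise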
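To make the closing idea of a hierarchical, synonym-based toy data model concrete, here is a minimal generative sketch under assumptions of our own; the function names, the fixed branching factor, and the fixed number of synonyms per feature are illustrative choices, not the thesis's actual model. Each class label is expanded level by level into tuples of lower-level features, and every feature can be realized by any of several synonymous tuples, so that many distinct low-level inputs share the same label.

    import random

    def build_grammar(n_features, n_synonyms, branching, n_levels, seed=0):
        """For each level and each feature, draw n_synonyms alternative
        realizations, each a tuple of `branching` lower-level features."""
        rng = random.Random(seed)
        grammar = []
        for _ in range(n_levels):
            rules = {f: [tuple(rng.randrange(n_features) for _ in range(branching))
                         for _ in range(n_synonyms)]
                     for f in range(n_features)}
            grammar.append(rules)
        return grammar

    def sample_input(label, grammar, rng):
        """Expand a top-level label into low-level symbols, picking one synonym
        uniformly at random at every node of the hierarchy."""
        symbols = [label]
        for rules in grammar:
            symbols = [s for f in symbols for s in rng.choice(rules[f])]
        return symbols

    grammar = build_grammar(n_features=8, n_synonyms=4, branching=2, n_levels=3)
    rng = random.Random(1)
    for _ in range(3):
        # the same label maps to many distinct low-level strings: synonymic invariance
        print(0, sample_input(label=0, grammar=grammar, rng=rng))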
