Abstract

This thesis focuses on two selected learning problems: 1) statistical inference on graph models, and 2) gradient descent on neural networks, with the common objective of defining and analyzing the measures that characterize their fundamental limits.

In the first part of the thesis, we consider spin synchronization problems on graphs, which consist of reconstructing a vector of n independent spins living on the vertices of a graph from noisy observations of their interactions on the edges. In particular, we consider synchronization models with binary erasure channel (BEC) side-information, where the spins of a small fraction of nodes are revealed, and investigate how the addition of such side-information influences the correlations between spins at distant sites. We show that on trees, whenever spins at distant sites are nearly independent given the edge observations, they remain nearly independent given the edge observations and the side-information. We conjecture that this holds for any graph. On the other hand, Kanade et al. (2014) conjectured that, on regular trees and on Galton-Watson trees, whenever any small fraction of node labels is revealed, the boundary at infinite depth becomes ineffective for detecting the root bit, even in the reconstruction regime. We explain how this can be used to compute the limiting entropy of the sparse Stochastic Block Model (SBM) with two symmetric communities. Finally, we show that the latter conjecture does not hold for every tree.

In the second part of the thesis, we consider the problem of learning Boolean target functions with gradient descent (GD) on fully connected neural networks. We introduce the notion of "Initial Alignment" (INAL) between a neural network at initialization and a target function, and prove that if a network and a target do not have a noticeable INAL, then noisy gradient descent on a fully connected network with i.i.d. Gaussian initialization cannot learn the target in polynomial time. We show that for finite-depth networks trained with the correlation loss, the result extends beyond Boolean inputs. Moreover, we prove that in a similar setting, the generalization error can be lower-bounded in terms of the noise-stability of the target function, supporting a conjecture made by Zhang et al. (2021). We then show that in the distribution-shift setting, when the data withholding corresponds to freezing a single feature, the generalization error admits a tight characterization in terms of the Boolean influence for several relevant architectures. This is proved for linear models and supported experimentally on other models such as MLPs and Transformers. In particular, this puts forward the hypothesis that for such architectures, when learning logical functions, GD tends to have an implicit bias towards low-degree representations. Finally, we consider a 'curriculum learning' (CL) strategy for learning k-parities on d bits of a binary string. We show that a wise choice of training examples, involving two or more product distributions, allows learning k-parities in d^{O(1)} time with a fully connected neural network trained with GD. We further show that for another class of functions, namely the 'Hamming mixtures', CL strategies involving a bounded number of product distributions are not beneficial.
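To make the synchronization-with-side-information setting of the first part concrete, the following is a minimal simulation sketch, not the thesis's formal setup: i.i.d. uniform spins on a regular tree, edge observations given by the product of endpoint spins flipped with probability delta, BEC side-information revealing each spin independently with probability eps, and belief propagation for the root posterior. All parameter names and values (depth, arity, delta, eps) are illustrative choices.

```python
import random

def sample_tree(depth, arity, delta, eps):
    """Sample the model on a complete `arity`-ary tree of the given depth.

    Nodes are indexed so that the children of v are arity*v + 1, ..., arity*v + arity.
    Returns (n, spins, edge_obs, revealed).
    """
    n = sum(arity ** d for d in range(depth + 1))
    spins = [random.choice([-1, 1]) for _ in range(n)]
    edge_obs, revealed = {}, {}
    for v in range(n):
        # BEC side-information: reveal the spin of v with probability eps.
        revealed[v] = spins[v] if random.random() < eps else None
        for c in range(arity * v + 1, arity * v + arity + 1):
            if c < n:
                # Noisy edge observation: product of endpoint spins,
                # flipped with probability delta.
                flip = -1 if random.random() < delta else 1
                edge_obs[(v, c)] = spins[v] * spins[c] * flip
    return n, spins, edge_obs, revealed

def posterior(v, n, arity, delta, edge_obs, revealed):
    """Belief propagation on the tree: P(x_v = +1 | subtree of v)."""
    belief = {+1: 1.0, -1: 1.0}
    if revealed[v] is not None:
        belief = {+1: float(revealed[v] == 1), -1: float(revealed[v] == -1)}
    for c in range(arity * v + 1, arity * v + arity + 1):
        if c >= n:
            continue
        pc = posterior(c, n, arity, delta, edge_obs, revealed)
        child_belief = {+1: pc, -1: 1.0 - pc}
        for s in (+1, -1):
            # Sum over the child's spin, weighted by the edge-noise channel.
            belief[s] *= sum(
                ((1 - delta) if edge_obs[(v, c)] == s * t else delta) * child_belief[t]
                for t in (+1, -1)
            )
    return belief[+1] / (belief[+1] + belief[-1])

random.seed(0)
n, spins, obs, rev = sample_tree(depth=8, arity=2, delta=0.15, eps=0.05)
p_root = posterior(0, n, 2, 0.15, obs, rev)
print(f"true root spin: {spins[0]:+d}, posterior P(x_root = +1): {p_root:.3f}")
```

Comparing the root posterior with eps = 0 against eps > 0, over many samples, gives an empirical handle on how much the revealed spins sharpen the root estimate.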
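For the INAL result of the second part, the following is a hypothetical Monte Carlo proxy, not the thesis's formal definition: it estimates, over random i.i.d. Gaussian initializations, the mean squared correlation between a Boolean target and each first-layer post-activation neuron of a one-hidden-layer network, and reports the largest value. The width, sample counts, and the two example targets are illustrative.

```python
import torch

def inal_proxy(target, d, width=256, n_inputs=4096, n_inits=16):
    """Largest (over neurons) average squared target-neuron correlation at init."""
    scores = torch.zeros(width)
    for _ in range(n_inits):
        # Uniform Boolean inputs on {-1, +1}^d.
        x = torch.randint(0, 2, (n_inputs, d)).float() * 2 - 1
        y = target(x)                            # target values, shape (n_inputs,)
        w = torch.randn(d, width) / d ** 0.5     # i.i.d. Gaussian initialization
        acts = torch.relu(x @ w)                 # first-layer neuron outputs
        corr = (acts * y[:, None]).mean(dim=0)   # estimate of E_x[f(x) * neuron(x)]
        scores += corr ** 2 / n_inits
    return scores.max().item()

full_parity = lambda x: x.prod(dim=1)                 # high-degree: near-zero proxy
majority = lambda x: torch.sign(x.sum(dim=1) + 0.5)   # correlates with linear parts

print(inal_proxy(full_parity, d=11))
print(inal_proxy(majority, d=11))
```

Under this proxy, the full parity (which the lower bound covers) shows a vanishing alignment at initialization, while majority shows a noticeable one.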
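The distribution-shift result is stated in terms of the Boolean influence, which is the standard quantity Inf_i(f) = P_x[f(x) != f(x^(i))], where x is uniform on {-1,+1}^d and x^(i) flips coordinate i. A short Monte Carlo estimator, with an illustrative sample size:

```python
import torch

def influence(f, d, i, n=100_000):
    """Monte Carlo estimate of the Boolean influence of coordinate i on f."""
    x = torch.randint(0, 2, (n, d)).float() * 2 - 1
    x_flip = x.clone()
    x_flip[:, i] *= -1                      # flip coordinate i
    return (f(x) != f(x_flip)).float().mean().item()

# Example: for the parity of the first 3 of 8 bits, each of those 3 bits has
# influence 1 and the remaining bits have influence 0.
parity3 = lambda x: x[:, :3].prod(dim=1)
print([round(influence(parity3, d=8, i=i), 2) for i in range(8)])
```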
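Finally, for the curriculum-learning result, the following is a minimal two-phase sketch loosely following the idea of mixing product distributions: the first phase draws biased Rademacher bits (P[x_j = +1] = p != 1/2), under which a k-parity correlates with low-degree functions, and the second phase switches to the uniform distribution, training with the correlation loss throughout. The architecture, bias p, step counts, and learning rate are all illustrative and are not the thesis's exact schedule.

```python
import torch

torch.manual_seed(0)
d, k = 30, 5
parity = lambda x: x[:, :k].prod(dim=1)   # k-parity on the first k of d bits

net = torch.nn.Sequential(
    torch.nn.Linear(d, 512), torch.nn.ReLU(), torch.nn.Linear(512, 1)
)
opt = torch.optim.SGD(net.parameters(), lr=0.01)

def batch(n, p):
    """n inputs with i.i.d. bits equal to +1 with probability p."""
    return (torch.rand(n, d) < p).float() * 2 - 1

for step in range(4000):
    p = 0.9 if step < 2000 else 0.5                  # curriculum: biased, then uniform
    x = batch(256, p)
    loss = -(net(x).squeeze(1) * parity(x)).mean()   # correlation loss
    opt.zero_grad()
    loss.backward()
    opt.step()

x_test = batch(4096, 0.5)
acc = (torch.sign(net(x_test).squeeze(1)) == parity(x_test)).float().mean()
print(f"uniform-distribution test accuracy: {acc:.3f}")
```

Rerunning the same loop with p fixed to 0.5 throughout gives a simple baseline against which to compare the curriculum's effect.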
