TOMASO POGGIO
Massachusetts Institute of Technology
H-index: 149
North America-United States
Description
TOMASO POGGIO, a distinguished researcher at the Massachusetts Institute of Technology with an exceptional h-index of 149 overall (70 since 2020), specializes in Machine Learning, Learning Theory, AI, Neuroscience, and Computational Vision.
His recent articles reflect a diverse array of research interests and contributions to the field:
Compositional Sparsity of Learnable Functions
Norm-based Generalization Bounds for Sparse Neural Networks
System identification of neural systems: If we got it right, would we know?
Dynamics in deep classifiers trained with the square loss: Normalization, low rank, neural collapse, and generalization bounds
The Janus effects of SGD vs GD: high noise and low rank
SGD and Weight Decay Provably Induce a Low-Rank Bias in Deep Neural Networks
How to guess a gradient
For interpolating kernel machines, minimizing the norm of the ERM solution maximizes stability
Professor Information
| University | Massachusetts Institute of Technology |
| --- | --- |
| Position | McDermott Professor in Brain Sciences |
| Citations (all) | 126,491 |
| Citations (since 2020) | 23,787 |
| Cited by | 122,550 |
| h-index (all) | 149 |
| h-index (since 2020) | 70 |
| i10-index (all) | 468 |
| i10-index (since 2020) | 223 |
| University profile page | Massachusetts Institute of Technology |
Research & Interests List
Machine Learning
Learning Theory
AI
Neuroscience
Computational Vision
Top articles of TOMASO POGGIO
Compositional Sparsity of Learnable Functions
Neural networks have demonstrated impressive success in various domains, raising the question of what fundamental principles underlie the effectiveness of the best AI systems, and quite possibly of human intelligence. This perspective argues that compositional sparsity, i.e., the property that a compositional function has "few" constituent functions, each depending on only a small subset of inputs, is a key principle underlying successful learning architectures. Surprisingly, all functions that are efficiently Turing computable have a compositionally sparse representation. Furthermore, deep networks that are also sparse can exploit this general property to avoid the "curse of dimensionality". This framework suggests interesting implications about the role that machine learning may play in mathematics.
Authors
Tomaso Poggio,Maia Fraser
Published Date
2024/2/8
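The idea of compositional sparsity can be illustrated with a toy example (the constituent function `g` below is a hypothetical choice, not one from the paper): a function of 8 variables built as a binary tree of bivariate constituents, so every constituent depends on only 2 of its inputs rather than all 8.

```python
# Illustrative sketch of a compositionally sparse function: f takes
# 8 inputs, but is composed of bivariate constituents arranged in a
# binary tree, so each constituent sees only 2 arguments.

def g(a, b):
    # a simple bivariate constituent (the choice is arbitrary)
    return max(a, b) + 0.5 * min(a, b)

def f_compositional(x):
    assert len(x) == 8
    # layer 1: pair up the 8 inputs -> 4 intermediate values
    h1 = [g(x[i], x[i + 1]) for i in range(0, 8, 2)]
    # layer 2: 4 -> 2
    h2 = [g(h1[0], h1[1]), g(h1[2], h1[3])]
    # layer 3: 2 -> 1
    return g(h2[0], h2[1])

print(f_compositional([1, 2, 3, 4, 5, 6, 7, 8]))
```

A deep network whose layers mirror this tree only needs to learn a few low-dimensional constituents, which is the intuition behind escaping the curse of dimensionality.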
Norm-based Generalization Bounds for Sparse Neural Networks
In this paper, we derive norm-based generalization bounds for sparse ReLU neural networks, including convolutional neural networks. These bounds differ from previous ones because they consider the sparse structure of the neural network architecture and the norms of the convolutional filters, rather than the norms of the (Toeplitz) matrices associated with the convolutional layers. Theoretically, we demonstrate that these bounds are significantly tighter than standard norm-based generalization bounds. Empirically, they offer relatively tight estimations of generalization for various simple classification problems. Collectively, these findings suggest that the sparsity of the underlying target function and the model's architecture plays a crucial role in the success of deep learning.
Authors
Tomer Galanti,Mengjia Xu,Liane Galanti,Tomaso Poggio
Journal
Advances in Neural Information Processing Systems
Published Date
2024/2/13
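The scale gap the abstract refers to can be sketched numerically (this is an illustration of why filter norms are much smaller than Toeplitz-matrix norms, not the paper's actual bound, whose constants and structure are omitted): for circular convolution of a single-channel 32x32 input, every output position's row of the Toeplitz matrix contains the same 3x3 filter entries, so its Frobenius norm is the filter norm times the square root of the number of spatial positions.

```python
import numpy as np

rng = np.random.default_rng(0)
dense_W = rng.normal(size=(64, 64))     # a dense layer's weights
conv_filter = rng.normal(size=(3, 3))   # a 3x3 convolutional filter

# Frobenius norm of the small filter ...
filter_norm = np.linalg.norm(conv_filter)

# ... vs the Frobenius norm of the (circulant) matrix applying that
# filter at every position of a 32x32 input: each of the 1024 rows
# repeats the 9 filter entries, so the norm grows by sqrt(1024).
n_positions = 32 * 32
toeplitz_fro = filter_norm * np.sqrt(n_positions)

capacity_filter_based = filter_norm * np.linalg.norm(dense_W)
capacity_toeplitz_based = toeplitz_fro * np.linalg.norm(dense_W)
print(capacity_toeplitz_based / capacity_filter_based)  # factor of 32
```

A product-of-norms capacity term built from filter norms is therefore smaller by a factor of sqrt(spatial positions) per convolutional layer, which is the intuition behind the tighter bounds.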
System identification of neural systems: If we got it right, would we know?
Artificial neural networks are being proposed as models of parts of the brain. The networks are compared to recordings of biological neurons, and good performance in reproducing neural responses is considered to support a model's validity. A key question is how much this system identification approach tells us about brain computation. Does it validate one model architecture over another? By replacing brain recordings with known ground-truth models, we evaluate whether the most commonly used comparison techniques, such as linear encoding models and centered kernel alignment, can correctly identify a model. System identification performance is quite variable; it also depends significantly on factors independent of the ground-truth architecture, such as the stimulus images. In addition, we show the limitations of using functional similarity scores to identify higher-level architectural motifs.
Authors
Yena Han,Tomaso A Poggio,Brian Cheung
Published Date
2023/7/3
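One of the comparison scores mentioned, linear centered kernel alignment (CKA), is easy to sketch (a minimal version for response matrices of shape stimuli x units; the paper's full evaluation pipeline is not reproduced here). A score of 1.0 means the two response sets are identical up to rotation and scaling, which already hints at why high similarity scores underdetermine the architecture.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between two response matrices (n_stimuli x n_units)."""
    X = X - X.mean(axis=0)                       # center each unit
    Y = Y - Y.mean(axis=0)
    num = np.linalg.norm(Y.T @ X, "fro") ** 2
    den = np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")
    return num / den

rng = np.random.default_rng(0)
A = rng.normal(size=(100, 20))                   # "model" responses
R = np.linalg.qr(rng.normal(size=(20, 20)))[0]   # a random rotation

cka_rot = linear_cka(A, A @ R)                   # rotated copy: score ~1.0
cka_rand = linear_cka(A, rng.normal(size=(100, 20)))  # unrelated: much lower
print(cka_rot, cka_rand)
```

Because CKA is invariant to rotations of the unit basis, very different internal mechanisms can produce near-identical scores, which is one facet of the identifiability problem the paper studies.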
Dynamics in deep classifiers trained with the square loss: Normalization, low rank, neural collapse, and generalization bounds
We overview several properties, old and new, of training overparameterized deep networks under the square loss. We first consider a model of the dynamics of gradient flow under the square loss in deep homogeneous rectified linear unit (ReLU) networks. We study convergence to a solution with the absolute minimum ρ, defined as the product of the Frobenius norms of the layer weight matrices, when normalization by Lagrange multipliers is used together with weight decay under different forms of gradient descent. The quantity ρ is the main property of the minimizers that bounds their expected error for a specific network architecture. In particular, we derive novel norm-based bounds for convolutional layers that are orders of magnitude better than classical bounds for dense networks. Next, we prove that quasi-interpolating solutions obtained by stochastic gradient descent in the presence of weight decay have a bias toward low-rank …
Authors
Mengjia Xu,Akshay Rangamani,Qianli Liao,Tomer Galanti,Tomaso Poggio
Journal
Research
Published Date
2023/3/8
The Janus effects of SGD vs GD: high noise and low rank
It was always obvious that SGD has higher fluctuations at convergence than GD. It has also often been reported that SGD in deep ReLU networks has a low-rank bias in the weight matrices. A recent theoretical analysis linked SGD noise with the low-rank bias induced by the SGD updates associated with small minibatch sizes [1]. In this paper, we provide an empirical and theoretical analysis of the convergence of SGD vs GD, first for deep ReLU networks and then for the case of linear regression, where sharper estimates can be obtained and which is of independent interest. In the linear case, we prove that the components of the matrix W corresponding to the null space of the data matrix X converge to zero for both SGD and GD, provided the regularization term is non-zero (in the case of square loss; for exponential loss the result holds independently of regularization). The convergence rate, however, is exponential for SGD and linear for GD. Thus SGD has a much stronger bias than GD towards low-rank weight matrices W with high fluctuations, provided the initialization is from a random matrix (but not if W is initialized as a zero matrix). Thus SGD under exponential loss, or under the square loss with non-zero regularization, shows the coupled phenomena of low rank and asymptotic noise.
Authors
Mengjia Xu,Tomer Galanti,Akshay Rangamani,Lorenzo Rosasco,Tomaso Poggio
Published Date
2023/12/21
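The linear-regression part of the claim can be sketched in a few lines (an illustration of the shared null-space decay under weight decay, with arbitrary hyperparameters; it does not reproduce the paper's rate comparison between SGD and GD): minibatch gradients of the data term always lie in the row space of X, so the null-space component of w is shrunk purely by the regularizer.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, r = 40, 10, 4
X = rng.normal(size=(n, r)) @ rng.normal(size=(r, d))  # rank-4 data matrix
X /= np.linalg.norm(X, 2)          # normalize spectral norm for stable steps
y = rng.normal(size=n)
lam, lr, batch_size = 0.5, 0.05, 8  # illustrative hyperparameters

# orthonormal basis for the null space of X (last d - r right singular vectors)
_, _, Vt = np.linalg.svd(X)
null_basis = Vt[r:]

w = rng.normal(size=d)              # random init: nonzero null component
for _ in range(500):
    b = rng.choice(n, size=batch_size, replace=False)   # SGD minibatch
    grad = 2 * X[b].T @ (X[b] @ w - y[b]) / batch_size + 2 * lam * w
    w -= lr * grad

# the data term never touches the null space, so weight decay alone
# drives this component to zero (the same holds for full-batch GD)
null_norm = float(np.linalg.norm(null_basis @ w))
print(null_norm)
```

With the regularization term set to zero, the null-space component would instead be frozen at its initial value, which is why non-zero regularization is required for the square-loss result.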
SGD and Weight Decay Provably Induce a Low-Rank Bias in Deep Neural Networks
In this paper, we study the bias of Stochastic Gradient Descent (SGD) to learn low-rank weight matrices when training deep ReLU neural networks. Our results show that training neural networks with mini-batch SGD and weight decay causes a bias towards rank minimization over the weight matrices. Specifically, we show, both theoretically and empirically, that this bias is more pronounced when using smaller batch sizes, higher learning rates, or increased weight decay. Additionally, we predict and observe empirically that weight decay is necessary to achieve this bias. Finally, we empirically investigate the connection between this bias and generalization, finding that it has a marginal effect on generalization. Our analysis is based on a minimal set of assumptions and applies to neural networks of any width or depth, including those with residual connections and convolutional layers.
Authors
Tomer Galanti,Zachary Siegel,Aparna Gupte,Tomaso Poggio
Published Date
2023/2/14
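A common diagnostic for this kind of low-rank bias is the stable rank of a weight matrix. The sketch below illustrates the measurement only: instead of actually training with SGD and weight decay, it hand-collapses a random matrix's singular value spectrum as a stand-in for what such training does.

```python
import numpy as np

def stable_rank(W):
    # ||W||_F^2 / ||W||_2^2: a smooth surrogate for matrix rank
    s = np.linalg.svd(W, compute_uv=False)
    return float((s ** 2).sum() / s[0] ** 2)

rng = np.random.default_rng(0)
W_init = rng.normal(size=(64, 64))              # random init: high stable rank

# stand-in for SGD + weight decay: exponentially damp the spectrum
U, s, Vt = np.linalg.svd(W_init)
W_trained = U @ np.diag(s * np.exp(-np.arange(s.size))) @ Vt

print(stable_rank(W_init), stable_rank(W_trained))
```

Tracking this quantity per layer over training is how a rank-minimization bias is typically observed empirically.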
How to guess a gradient
How much can you say about the gradient of a neural network without computing a loss or knowing the label? This may sound like a strange question: surely the answer is "very little." However, in this paper, we show that gradients are more structured than previously thought. Gradients lie in a predictable low-dimensional subspace which depends on the network architecture and incoming features. Exploiting this structure can significantly improve gradient-free optimization schemes based on directional derivatives, which have struggled to scale beyond small networks trained on toy datasets. We study how to narrow the gap in optimization performance between methods that calculate exact gradients and those that use directional derivatives. Furthermore, we highlight new challenges in overcoming the large gap between optimizing with exact gradients and guessing the gradients.
Authors
Utkarsh Singhal,Brian Cheung,Kartik Chandra,Jonathan Ragan-Kelley,Joshua B Tenenbaum,Tomaso A Poggio,Stella X Yu
Journal
arXiv preprint arXiv:2312.04709
Published Date
2023/12/7
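The directional-derivative schemes mentioned above can be sketched as follows (a simplified illustration, not the paper's method: here we "cheat" by constructing the informed direction from the true gradient, whereas the paper derives the subspace from the architecture and incoming features). The guess is the finite-difference directional derivative along a direction v, scaled back along v.

```python
import numpy as np

def guess_gradient(f, w, v, eps=1e-6):
    # finite-difference directional derivative along v, times v
    v = v / np.linalg.norm(v)
    return (f(w + eps * v) - f(w)) / eps * v

rng = np.random.default_rng(0)
d = 1000
a = rng.normal(size=d)
f = lambda w: float(a @ w)        # toy loss whose true gradient is a
w = rng.normal(size=d)

g_full = guess_gradient(f, w, rng.normal(size=d))            # blind guess
g_sub = guess_gradient(f, w, a + 0.1 * rng.normal(size=d))   # informed guess

cos = lambda g: float(a @ g / (np.linalg.norm(a) * np.linalg.norm(g)))
cos_full, cos_sub = cos(g_full), cos(g_sub)
print(cos_full, cos_sub)   # ~1/sqrt(d) vs close to 1
```

A blind random direction yields cosine similarity around 1/sqrt(d) with the true gradient, while sampling from a low-dimensional subspace that (approximately) contains the gradient yields a far better guess; narrowing that subspace is the paper's central theme.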
For interpolating kernel machines, minimizing the norm of the ERM solution maximizes stability
In this paper, we study kernel ridge-less regression, including the case of interpolating solutions. We prove that maximizing leave-one-out stability minimizes the expected error. Further, we also prove that the minimum norm solution, to which gradient algorithms are known to converge, is the most stable solution. More precisely, we show that the minimum norm interpolating solution minimizes a bound on stability, which in turn is controlled by the smallest singular value, and hence the condition number, of the empirical kernel matrix. These quantities can be characterized in the asymptotic regime where both the dimension and the cardinality of the data go to infinity (at a fixed ratio). Our results suggest that the property of stability of the learning algorithm with respect to perturbations of the training set may provide a more general framework than the classical theory of Empirical Risk Minimization …
Authors
Akshay Rangamani,Lorenzo Rosasco,Tomaso Poggio
Journal
Analysis and Applications
Published Date
2023/1/28
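The two objects the abstract connects, the minimum norm interpolant and the condition number of the empirical kernel matrix, can be computed directly (a minimal sketch with an RBF kernel on grid-spaced toy inputs chosen to keep K well conditioned; the paper's stability bound itself is not reproduced).

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

# toy training inputs on a small 2-D grid, with random targets
X = np.array([[i, j] for i in range(5) for j in range(6)], dtype=float)
rng = np.random.default_rng(0)
y = rng.normal(size=len(X))

K = rbf_kernel(X, X)                  # empirical kernel matrix
alpha = np.linalg.pinv(K) @ y         # minimum norm interpolating solution
residual = float(np.abs(K @ alpha - y).max())
cond = float(np.linalg.cond(K))       # the quantity controlling stability
print(residual, cond)                 # residual ~ 0: it interpolates
```

When K is nearly singular (huge condition number), small perturbations of the training set move the interpolant wildly, which is exactly the stability degradation the bound captures.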
Professor FAQs
What is TOMASO POGGIO's h-index at Massachusetts Institute of Technology?
The h-index of TOMASO POGGIO is 149 in total and 70 since 2020.
What are TOMASO POGGIO's top articles?
The articles titled
Compositional Sparsity of Learnable Functions
Norm-based Generalization Bounds for Sparse Neural Networks
System identification of neural systems: If we got it right, would we know?
Dynamics in deep classifiers trained with the square loss: Normalization, low rank, neural collapse, and generalization bounds
The Janus effects of SGD vs GD: high noise and low rank
SGD and Weight Decay Provably Induce a Low-Rank Bias in Deep Neural Networks
How to guess a gradient
For interpolating kernel machines, minimizing the norm of the ERM solution maximizes stability
...
are the top articles of TOMASO POGGIO at Massachusetts Institute of Technology.
What are TOMASO POGGIO's research interests?
The research interests of TOMASO POGGIO are: Machine Learning, Learning Theory, AI, Neuroscience, Computational Vision
What is TOMASO POGGIO's total number of citations?
TOMASO POGGIO has 126,491 citations in total.
What are the co-authors of TOMASO POGGIO?
The co-authors of TOMASO POGGIO are Earl K. Miller, Stephen SMALE, Lior Wolf, Shimon Edelman, Thomas Vetter, Manfred Fahle.