TOMASO POGGIO
Massachusetts Institute of Technology
H-index: 149
North America-United States
Description
TOMASO POGGIO, a distinguished researcher at the Massachusetts Institute of Technology with an exceptional h-index of 149 overall (70 since 2020), specializes in Machine Learning, Learning Theory, AI, Neuroscience, and Computational Vision.
His recent articles reflect a diverse array of research interests and contributions to the field:
Compositional Sparsity of Learnable Functions
Norm-based Generalization Bounds for Sparse Neural Networks
System identification of neural systems: If we got it right, would we know?
Dynamics in deep classifiers trained with the square loss: Normalization, low rank, neural collapse, and generalization bounds
The Janus effects of SGD vs GD: high noise and low rank
SGD and Weight Decay Provably Induce a Low-Rank Bias in Deep Neural Networks
How to guess a gradient
For interpolating kernel machines, minimizing the norm of the ERM solution maximizes stability
Professor Information
| University | Massachusetts Institute of Technology |
| --- | --- |
| Position | McDermott Professor in Brain Sciences |
| Citations (all) | 126,491 |
| Citations (since 2020) | 23,787 |
| Cited by | 122,550 |
| h-index (all) | 149 |
| h-index (since 2020) | 70 |
| i10-index (all) | 468 |
| i10-index (since 2020) | 223 |
| University profile page | Massachusetts Institute of Technology |
Research & Interests List
Machine Learning
Learning Theory
AI
Neuroscience
Computational Vision
Top articles of TOMASO POGGIO
Compositional Sparsity of Learnable Functions
Neural networks have demonstrated impressive success in various domains, raising the question of what fundamental principles underlie the effectiveness of the best AI systems, and quite possibly of human intelligence. This perspective argues that compositional sparsity, i.e., the property that a compositional function has "few" constituent functions, each depending on only a small subset of inputs, is a key principle underlying successful learning architectures. Surprisingly, all functions that are efficiently Turing computable have a compositionally sparse representation. Furthermore, deep networks that are also sparse can exploit this general property to avoid the "curse of dimensionality". This framework suggests interesting implications about the role that machine learning may play in mathematics.
Authors
Tomaso Poggio,Maia Fraser
Published Date
2024/2/8
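The idea of compositional sparsity can be illustrated with a toy example (the constituent function `g` below is a hypothetical choice, not one from the paper): a function of 8 variables built as a binary tree of bivariate constituents, so every constituent depends on only 2 of its inputs rather than all 8.

```python
# Illustrative sketch of a compositionally sparse function: f takes
# 8 inputs, but is composed of bivariate constituents arranged in a
# binary tree, so each constituent sees only 2 arguments.

def g(a, b):
    # a simple bivariate constituent (the choice is arbitrary)
    return max(a, b) + 0.5 * min(a, b)

def f_compositional(x):
    assert len(x) == 8
    # layer 1: pair up the 8 inputs -> 4 intermediate values
    h1 = [g(x[i], x[i + 1]) for i in range(0, 8, 2)]
    # layer 2: 4 -> 2
    h2 = [g(h1[0], h1[1]), g(h1[2], h1[3])]
    # layer 3: 2 -> 1
    return g(h2[0], h2[1])

print(f_compositional([1, 2, 3, 4, 5, 6, 7, 8]))
```

A deep network whose layers mirror this tree only needs to learn a few low-dimensional constituents, which is the intuition behind escaping the curse of dimensionality.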
Norm-based Generalization Bounds for Sparse Neural Networks
In this paper, we derive norm-based generalization bounds for sparse ReLU neural networks, including convolutional neural networks. These bounds differ from previous ones because they consider the sparse structure of the neural network architecture and the norms of the convolutional filters, rather than the norms of the (Toeplitz) matrices associated with the convolutional layers. Theoretically, we demonstrate that these bounds are significantly tighter than standard norm-based generalization bounds. Empirically, they offer relatively tight estimations of generalization for various simple classification problems. Collectively, these findings suggest that the sparsity of the underlying target function and the model's architecture plays a crucial role in the success of deep learning.
Authors
Tomer Galanti,Mengjia Xu,Liane Galanti,Tomaso Poggio
Journal
Advances in Neural Information Processing Systems
Published Date
2024/2/13
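The scale gap the abstract refers to can be sketched numerically (this is an illustration of why filter norms are much smaller than Toeplitz-matrix norms, not the paper's actual bound, whose constants and structure are omitted): for circular convolution of a single-channel 32x32 input, every output position's row of the Toeplitz matrix contains the same 3x3 filter entries, so its Frobenius norm is the filter norm times the square root of the number of spatial positions.

```python
import numpy as np

rng = np.random.default_rng(0)
dense_W = rng.normal(size=(64, 64))     # a dense layer's weights
conv_filter = rng.normal(size=(3, 3))   # a 3x3 convolutional filter

# Frobenius norm of the small filter ...
filter_norm = np.linalg.norm(conv_filter)

# ... vs the Frobenius norm of the (circulant) matrix applying that
# filter at every position of a 32x32 input: each of the 1024 rows
# repeats the 9 filter entries, so the norm grows by sqrt(1024).
n_positions = 32 * 32
toeplitz_fro = filter_norm * np.sqrt(n_positions)

capacity_filter_based = filter_norm * np.linalg.norm(dense_W)
capacity_toeplitz_based = toeplitz_fro * np.linalg.norm(dense_W)
print(capacity_toeplitz_based / capacity_filter_based)  # factor of 32
```

A product-of-norms capacity term built from filter norms is therefore smaller by a factor of sqrt(spatial positions) per convolutional layer, which is the intuition behind the tighter bounds.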
System identification of neural systems: If we got it right, would we know?
Artificial neural networks are being proposed as models of parts of the brain. The networks are compared to recordings of biological neurons, and good performance in reproducing neural responses is considered to support a model's validity. A key question is how much this system identification approach tells us about brain computation. Does it validate one model architecture over another? By replacing brain recordings with known ground-truth models, we evaluate whether the most commonly used comparison techniques, such as linear encoding models and centered kernel alignment, can correctly identify a model. System identification performance is quite variable; it also depends significantly on factors independent of the ground-truth architecture, such as the stimulus images. In addition, we show the limitations of using functional similarity scores to identify higher-level architectural motifs.
Authors
Yena Han,Tomaso A Poggio,Brian Cheung
Published Date
2023/7/3
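One of the comparison scores mentioned, linear centered kernel alignment (CKA), is easy to sketch (a minimal version for response matrices of shape stimuli x units; the paper's full evaluation pipeline is not reproduced here). A score of 1.0 means the two response sets are identical up to rotation and scaling, which already hints at why high similarity scores underdetermine the architecture.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between two response matrices (n_stimuli x n_units)."""
    X = X - X.mean(axis=0)                       # center each unit
    Y = Y - Y.mean(axis=0)
    num = np.linalg.norm(Y.T @ X, "fro") ** 2
    den = np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")
    return num / den

rng = np.random.default_rng(0)
A = rng.normal(size=(100, 20))                   # "model" responses
R = np.linalg.qr(rng.normal(size=(20, 20)))[0]   # a random rotation

cka_rot = linear_cka(A, A @ R)                   # rotated copy: score ~1.0
cka_rand = linear_cka(A, rng.normal(size=(100, 20)))  # unrelated: much lower
print(cka_rot, cka_rand)
```

Because CKA is invariant to rotations of the unit basis, very different internal mechanisms can produce near-identical scores, which is one facet of the identifiability problem the paper studies.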
Dynamics in deep classifiers trained with the square loss: Normalization, low rank, neural collapse, and generalization bounds
We overview several properties, old and new, of training overparameterized deep networks under the square loss. We first consider a model of the dynamics of gradient flow under the square loss in deep homogeneous rectified linear unit (ReLU) networks. We study convergence to a solution with the absolute minimum ρ, defined as the product of the Frobenius norms of the layer weight matrices, when normalization by Lagrange multipliers is used together with weight decay under different forms of gradient descent. The quantity ρ is the main property of the minimizers that bounds their expected error for a specific network architecture. In particular, we derive novel norm-based bounds for convolutional layers that are orders of magnitude better than classical bounds for dense networks. Next, we prove that quasi-interpolating solutions obtained by stochastic gradient descent in the presence of weight decay have a bias toward low-rank …
Authors
Mengjia Xu,Akshay Rangamani,Qianli Liao,Tomer Galanti,Tomaso Poggio
Journal
Research
Published Date
2023/3/8
The Janus effects of SGD vs GD: high noise and low rank
It was always obvious that SGD has higher fluctuations at convergence than GD. It has also often been reported that SGD in deep ReLU networks has a low-rank bias in the weight matrices. A recent theoretical analysis linked SGD noise with the low-rank bias induced by the SGD updates associated with small minibatch sizes [1]. In this paper, we provide an empirical and theoretical analysis of the convergence of SGD vs GD, first for deep ReLU networks and then for the case of linear regression, where sharper estimates can be obtained and which is of independent interest. In the linear case, we prove that the components of the matrix W corresponding to the null space of the data matrix X converge to zero for both SGD and GD, provided the regularization term is non-zero (in the case of square loss; for exponential loss the result holds independently of regularization). The convergence rate, however, is exponential for SGD and linear for GD. Thus SGD has a much stronger bias than GD towards low-rank weight matrices W with high fluctuations, provided the initialization is from a random matrix (but not if W is initialized as a zero matrix). Thus SGD under exponential loss, or under the square loss with non-zero regularization, shows the coupled phenomena of low rank and asymptotic noise.
Authors
Mengjia Xu,Tomer Galanti,Akshay Rangamani,Lorenzo Rosasco,Tomaso Poggio
Published Date
2023/12/21
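The linear-regression part of the claim can be sketched in a few lines (an illustration of the shared null-space decay under weight decay, with arbitrary hyperparameters; it does not reproduce the paper's rate comparison between SGD and GD): minibatch gradients of the data term always lie in the row space of X, so the null-space component of w is shrunk purely by the regularizer.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, r = 40, 10, 4
X = rng.normal(size=(n, r)) @ rng.normal(size=(r, d))  # rank-4 data matrix
X /= np.linalg.norm(X, 2)          # normalize spectral norm for stable steps
y = rng.normal(size=n)
lam, lr, batch_size = 0.5, 0.05, 8  # illustrative hyperparameters

# orthonormal basis for the null space of X (last d - r right singular vectors)
_, _, Vt = np.linalg.svd(X)
null_basis = Vt[r:]

w = rng.normal(size=d)              # random init: nonzero null component
for _ in range(500):
    b = rng.choice(n, size=batch_size, replace=False)   # SGD minibatch
    grad = 2 * X[b].T @ (X[b] @ w - y[b]) / batch_size + 2 * lam * w
    w -= lr * grad

# the data term never touches the null space, so weight decay alone
# drives this component to zero (the same holds for full-batch GD)
null_norm = float(np.linalg.norm(null_basis @ w))
print(null_norm)
```

With the regularization term set to zero, the null-space component would instead be frozen at its initial value, which is why non-zero regularization is required for the square-loss result.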
SGD and Weight Decay Provably Induce a Low-Rank Bias in Deep Neural Networks
In this paper, we study the bias of Stochastic Gradient Descent (SGD) to learn low-rank weight matrices when training deep ReLU neural networks. Our results show that training neural networks with mini-batch SGD and weight decay causes a bias towards rank minimization over the weight matrices. Specifically, we show, both theoretically and empirically, that this bias is more pronounced when using smaller batch sizes, higher learning rates, or increased weight decay. Additionally, we predict and observe empirically that weight decay is necessary to achieve this bias. Finally, we empirically investigate the connection between this bias and generalization, finding that it has a marginal effect on generalization. Our analysis is based on a minimal set of assumptions and applies to neural networks of any width or depth, including those with residual connections and convolutional layers.
Authors
Tomer Galanti,Zachary Siegel,Aparna Gupte,Tomaso Poggio
Published Date
2023/2/14
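A common diagnostic for this kind of low-rank bias is the stable rank of a weight matrix. The sketch below illustrates the measurement only: instead of actually training with SGD and weight decay, it hand-collapses a random matrix's singular value spectrum as a stand-in for what such training does.

```python
import numpy as np

def stable_rank(W):
    # ||W||_F^2 / ||W||_2^2: a smooth surrogate for matrix rank
    s = np.linalg.svd(W, compute_uv=False)
    return float((s ** 2).sum() / s[0] ** 2)

rng = np.random.default_rng(0)
W_init = rng.normal(size=(64, 64))              # random init: high stable rank

# stand-in for SGD + weight decay: exponentially damp the spectrum
U, s, Vt = np.linalg.svd(W_init)
W_trained = U @ np.diag(s * np.exp(-np.arange(s.size))) @ Vt

print(stable_rank(W_init), stable_rank(W_trained))
```

Tracking this quantity per layer over training is how a rank-minimization bias is typically observed empirically.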
How to guess a gradient
How much can you say about the gradient of a neural network without computing a loss or knowing the label? This may sound like a strange question: surely the answer is "very little." However, in this paper, we show that gradients are more structured than previously thought. Gradients lie in a predictable low-dimensional subspace which depends on the network architecture and incoming features. Exploiting this structure can significantly improve gradient-free optimization schemes based on directional derivatives, which have struggled to scale beyond small networks trained on toy datasets. We study how to narrow the gap in optimization performance between methods that calculate exact gradients and those that use directional derivatives. Furthermore, we highlight new challenges in overcoming the large gap between optimizing with exact gradients and guessing the gradients.
Authors
Utkarsh Singhal,Brian Cheung,Kartik Chandra,Jonathan Ragan-Kelley,Joshua B Tenenbaum,Tomaso A Poggio,Stella X Yu
Journal
arXiv preprint arXiv:2312.04709
Published Date
2023/12/7
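The directional-derivative schemes mentioned above can be sketched as follows (a simplified illustration, not the paper's method: here we "cheat" by constructing the informed direction from the true gradient, whereas the paper derives the subspace from the architecture and incoming features). The guess is the finite-difference directional derivative along a direction v, scaled back along v.

```python
import numpy as np

def guess_gradient(f, w, v, eps=1e-6):
    # finite-difference directional derivative along v, times v
    v = v / np.linalg.norm(v)
    return (f(w + eps * v) - f(w)) / eps * v

rng = np.random.default_rng(0)
d = 1000
a = rng.normal(size=d)
f = lambda w: float(a @ w)        # toy loss whose true gradient is a
w = rng.normal(size=d)

g_full = guess_gradient(f, w, rng.normal(size=d))            # blind guess
g_sub = guess_gradient(f, w, a + 0.1 * rng.normal(size=d))   # informed guess

cos = lambda g: float(a @ g / (np.linalg.norm(a) * np.linalg.norm(g)))
cos_full, cos_sub = cos(g_full), cos(g_sub)
print(cos_full, cos_sub)   # ~1/sqrt(d) vs close to 1
```

A blind random direction yields cosine similarity around 1/sqrt(d) with the true gradient, while sampling from a low-dimensional subspace that (approximately) contains the gradient yields a far better guess; narrowing that subspace is the paper's central theme.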
For interpolating kernel machines, minimizing the norm of the ERM solution maximizes stability
In this paper, we study kernel ridge-less regression, including the case of interpolating solutions. We prove that maximizing leave-one-out stability minimizes the expected error. Further, we also prove that the minimum norm solution, to which gradient algorithms are known to converge, is the most stable solution. More precisely, we show that the minimum norm interpolating solution minimizes a bound on stability, which in turn is controlled by the smallest singular value, and hence the condition number, of the empirical kernel matrix. These quantities can be characterized in the asymptotic regime where both the dimension and the cardinality of the data go to infinity (at a fixed ratio). Our results suggest that the property of stability of the learning algorithm with respect to perturbations of the training set may provide a more general framework than the classical theory of Empirical Risk Minimization …
Authors
Akshay Rangamani,Lorenzo Rosasco,Tomaso Poggio
Journal
Analysis and Applications
Published Date
2023/1/28
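The two objects the abstract connects, the minimum norm interpolant and the condition number of the empirical kernel matrix, can be computed directly (a minimal sketch with an RBF kernel on grid-spaced toy inputs chosen to keep K well conditioned; the paper's stability bound itself is not reproduced).

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

# toy training inputs on a small 2-D grid, with random targets
X = np.array([[i, j] for i in range(5) for j in range(6)], dtype=float)
rng = np.random.default_rng(0)
y = rng.normal(size=len(X))

K = rbf_kernel(X, X)                  # empirical kernel matrix
alpha = np.linalg.pinv(K) @ y         # minimum norm interpolating solution
residual = float(np.abs(K @ alpha - y).max())
cond = float(np.linalg.cond(K))       # the quantity controlling stability
print(residual, cond)                 # residual ~ 0: it interpolates
```

When K is nearly singular (huge condition number), small perturbations of the training set move the interpolant wildly, which is exactly the stability degradation the bound captures.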
Professor FAQs
What is TOMASO POGGIO's h-index at Massachusetts Institute of Technology?
The h-index of TOMASO POGGIO is 149 in total and 70 since 2020.
What are TOMASO POGGIO's top articles?
The articles titled
Compositional Sparsity of Learnable Functions
Norm-based Generalization Bounds for Sparse Neural Networks
System identification of neural systems: If we got it right, would we know?
Dynamics in deep classifiers trained with the square loss: Normalization, low rank, neural collapse, and generalization bounds
The Janus effects of SGD vs GD: high noise and low rank
SGD and Weight Decay Provably Induce a Low-Rank Bias in Deep Neural Networks
How to guess a gradient
For interpolating kernel machines, minimizing the norm of the ERM solution maximizes stability
...
are the top articles of TOMASO POGGIO at Massachusetts Institute of Technology.
What are TOMASO POGGIO's research interests?
The research interests of TOMASO POGGIO are: Machine Learning, Learning Theory, AI, Neuroscience, Computational Vision
What is TOMASO POGGIO's total number of citations?
TOMASO POGGIO has 126,491 citations in total.
What are the co-authors of TOMASO POGGIO?
The co-authors of TOMASO POGGIO are Earl K. Miller, Stephen SMALE, Lior Wolf, Shimon Edelman, Thomas Vetter, Manfred Fahle.