Yann LeCun

New York University

H-index: 145

North America-United States

Professor Information

University

New York University

Position

Chief AI Scientist at Facebook & Silver Professor at the Courant Institute

Citations (all)

338,733

Citations (since 2020)

228,006

Cited By

196,357

h-index (all)

145

h-index (since 2020)

113

i10-index (all)

364

i10-index (since 2020)

286

University Profile Page

New York University

Research & Interests List

AI

machine learning

computer vision

robotics

image compression

Top articles of Yann LeCun

Learning and Leveraging World Models in Visual Representation Learning

Joint-Embedding Predictive Architecture (JEPA) has emerged as a promising self-supervised approach that learns by leveraging a world model. While previously limited to predicting missing parts of an input, we explore how to generalize the JEPA prediction task to a broader set of corruptions. We introduce Image World Models (IWM), an approach that goes beyond masked image modeling and learns to predict the effect of global photometric transformations in latent space. We study the recipe for learning performant IWMs and show that it relies on three key aspects: conditioning, prediction difficulty, and capacity. Additionally, we show that the predictive world model learned by IWM can be adapted through finetuning to solve diverse tasks; a fine-tuned IWM world model matches or surpasses the performance of previous self-supervised methods. Finally, we show that learning with an IWM allows one to control the abstraction level of the learned representations, learning invariant representations such as contrastive methods, or equivariant representations such as masked image modeling.

Authors

Quentin Garrido, Mahmoud Assran, Nicolas Ballas, Adrien Bardes, Laurent Najman, Yann LeCun

Journal

arXiv preprint arXiv:2403.00504

Published Date

2024/3/1
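
The abstract above describes predicting the effect of photometric transformations directly in latent space. Below is a minimal, hypothetical PyTorch sketch of that idea (layer sizes, the `IWMSketch` name, and the detached target branch are assumptions, not the paper's implementation): an encoder embeds the clean and transformed views, and a predictor conditioned on the transformation parameters must map the clean latent to the transformed one.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class IWMSketch(nn.Module):
    """Minimal sketch of an Image-World-Model-style training step (hypothetical sizes)."""

    def __init__(self, in_dim=3 * 32 * 32, latent_dim=128, cond_dim=4):
        super().__init__()
        self.encoder = nn.Sequential(nn.Flatten(), nn.Linear(in_dim, latent_dim), nn.ReLU(),
                                     nn.Linear(latent_dim, latent_dim))
        # The predictor sees the source latent plus the transformation parameters
        # (e.g. brightness/contrast/hue/saturation strengths).
        self.predictor = nn.Sequential(nn.Linear(latent_dim + cond_dim, latent_dim), nn.ReLU(),
                                       nn.Linear(latent_dim, latent_dim))

    def forward(self, x_clean, x_transformed, transform_params):
        z_src = self.encoder(x_clean)
        with torch.no_grad():                     # target branch detached here; JEPA-style
            z_tgt = self.encoder(x_transformed)   # setups often use an EMA target encoder instead
        z_pred = self.predictor(torch.cat([z_src, transform_params], dim=1))
        return F.mse_loss(z_pred, z_tgt)          # predict the transform's effect in latent space

# Toy usage with random tensors standing in for a batch of images:
model = IWMSketch()
x = torch.randn(8, 3, 32, 32)
x_aug = x * 1.2 + 0.1                             # toy photometric change (contrast/brightness)
params = torch.tensor([[1.2, 0.1, 0.0, 0.0]]).repeat(8, 1)
loss = model(x, x_aug, params)
```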

EgoPet: Egomotion and Interaction Data from an Animal's Perspective

Animals perceive the world to plan their actions and interact with other agents to accomplish complex tasks, demonstrating capabilities that are still unmatched by AI systems. To advance our understanding and reduce the gap between the capabilities of animals and AI systems, we introduce a dataset of pet egomotion imagery with diverse examples of simultaneous egomotion and multi-agent interaction. Current video datasets separately contain egomotion and interaction examples, but rarely both at the same time. In addition, EgoPet offers a radically distinct perspective from existing egocentric datasets of humans or vehicles. We define two in-domain benchmark tasks that capture animal behavior, and a third benchmark to assess the utility of EgoPet as a pretraining resource for robotic quadruped locomotion, showing that models trained on EgoPet outperform those trained on prior datasets.

Authors

Amir Bar, Arya Bakhtiar, Danny Tran, Antonio Loquercio, Jathushan Rajasegaran, Yann LeCun, Amir Globerson, Trevor Darrell

Journal

arXiv preprint arXiv:2404.09991

Published Date

2024/4/15

Fast and Exact Enumeration of Deep Networks Partitions Regions

One fruitful formulation of Deep Networks (DNs) enabling their theoretical study and providing practical guidelines to practitioners relies on Piecewise Affine Splines. In that realm, a DN's input mapping is expressed as a per-region affine mapping, where those regions are implicitly determined by the model's architecture and form a partition of its input space. That partition, which is involved in all the results spanned by this line of research, has so far only been computed on 2/3-dimensional slices of the DN's input space or estimated by random sampling. In this paper, we provide the first parallel algorithm that does exact enumeration of the DN's partition regions. The proposed algorithm enables one to finally assess the closeness of the commonly employed approximation methods, e.g. based on random sampling of the DN input space. One of our key findings is that if one is only interested in regions with "large" …

Authors

Randall Balestriero, Yann LeCun

Published Date

2023/6/4
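
To make the notion of "partition regions" concrete, here is a toy sketch (not the paper's parallel algorithm) that exactly enumerates the non-empty regions of a single ReLU layer on a 2-D input by testing every activation sign pattern for feasibility with a small linear program; `W`, `b`, and the bounding box are illustrative placeholders.

```python
import itertools
import numpy as np
from scipy.optimize import linprog

# For one ReLU layer on a 2-D input, each sign pattern s in {-1, +1}^k defines a
# candidate region { x : s_i * (w_i . x + b_i) >= 0 for all i }.  The pattern
# contributes a region of the input-space partition iff that linear system is
# feasible inside a bounded box, which we check with a tiny LP.
rng = np.random.default_rng(0)
W = rng.normal(size=(5, 2))      # 5 hypothetical ReLU units, 2-D input
b = rng.normal(size=5)
BOX = 10.0                       # restrict the search to [-BOX, BOX]^2

regions = []
for signs in itertools.product([-1.0, 1.0], repeat=W.shape[0]):
    s = np.array(signs)
    # s_i*(w_i.x + b_i) >= 0  <=>  (-s_i*w_i).x <= s_i*b_i
    A_ub = -(s[:, None] * W)
    b_ub = s * b
    res = linprog(c=np.zeros(2), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(-BOX, BOX)] * 2, method="highs")
    if res.success:
        regions.append(signs)

print(f"{len(regions)} non-empty regions out of {2 ** W.shape[0]} sign patterns")
```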

An Information Theory Perspective on Variance-Invariance-Covariance Regularization

Variance-Invariance-Covariance Regularization (VICReg) is a self-supervised learning (SSL) method that has shown promising results on a variety of tasks. However, the fundamental mechanisms underlying VICReg remain unexplored. In this paper, we present an information-theoretic perspective on the VICReg objective. We begin by deriving information-theoretic quantities for deterministic networks as an alternative to unrealistic stochastic network assumptions. We then relate the optimization of the VICReg objective to mutual information optimization, highlighting underlying assumptions and facilitating a constructive comparison with other SSL algorithms, and we derive a generalization bound for VICReg, revealing its inherent advantages for downstream tasks. Building on these results, we introduce a family of SSL methods derived from information-theoretic principles that outperform existing SSL techniques.

Authors

Ravid Shwartz-Ziv, Randall Balestriero, Kenji Kawaguchi, Tim GJ Rudner, Yann LeCun

Journal

Advances in Neural Information Processing Systems

Published Date

2024/2/13
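
For readers unfamiliar with the objective being analyzed, the following is a compact sketch of the VICReg loss as it is commonly described (invariance, variance, and covariance terms); the weights and the `vicreg_loss` helper are placeholders rather than the exact configuration studied in the paper.

```python
import torch
import torch.nn.functional as F

def vicreg_loss(z_a, z_b, sim_w=25.0, var_w=25.0, cov_w=1.0, eps=1e-4):
    """Sketch of the VICReg objective on two batches of embeddings of shape (N, D).

    Three terms, following the usual description of the method:
      - invariance: mean-squared error between the two views' embeddings,
      - variance:   hinge keeping each embedding dimension's std above 1,
      - covariance: penalty on off-diagonal entries of each view's covariance.
    """
    n, d = z_a.shape

    invariance = F.mse_loss(z_a, z_b)

    std_a = torch.sqrt(z_a.var(dim=0) + eps)
    std_b = torch.sqrt(z_b.var(dim=0) + eps)
    variance = torch.mean(F.relu(1.0 - std_a)) + torch.mean(F.relu(1.0 - std_b))

    z_a_c = z_a - z_a.mean(dim=0)
    z_b_c = z_b - z_b.mean(dim=0)
    cov_a = (z_a_c.T @ z_a_c) / (n - 1)
    cov_b = (z_b_c.T @ z_b_c) / (n - 1)
    off_diag = lambda m: m - torch.diag(torch.diag(m))
    covariance = off_diag(cov_a).pow(2).sum() / d + off_diag(cov_b).pow(2).sum() / d

    return sim_w * invariance + var_w * variance + cov_w * covariance

# Example with random embeddings standing in for two augmented views:
loss = vicreg_loss(torch.randn(256, 128), torch.randn(256, 128))
```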

Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs

Is vision good enough for language? Recent advancements in multimodal models primarily stem from the powerful reasoning abilities of large language models (LLMs). However, the visual component typically depends only on the instance-level contrastive language-image pre-training (CLIP). Our research reveals that the visual capabilities in recent multimodal LLMs (MLLMs) still exhibit systematic shortcomings. To understand the roots of these errors, we explore the gap between the visual embedding space of CLIP and vision-only self-supervised learning. We identify "CLIP-blind pairs" - images that CLIP perceives as similar despite their clear visual differences. With these pairs, we construct the Multimodal Visual Patterns (MMVP) benchmark. MMVP exposes areas where state-of-the-art systems, including GPT-4V, struggle with straightforward questions across nine basic visual patterns, often providing incorrect answers and hallucinated explanations. We further evaluate various CLIP-based vision-and-language models and find a notable correlation between visual patterns that challenge CLIP models and those problematic for multimodal LLMs. As an initial effort to address these issues, we propose a Mixture of Features (MoF) approach, demonstrating that integrating vision self-supervised learning features with MLLMs can significantly enhance their visual grounding capabilities. Together, our research suggests visual representation learning remains an open challenge, and accurate visual grounding is crucial for future successful multimodal systems.

Authors

Shengbang Tong, Zhuang Liu, Yuexiang Zhai, Yi Ma, Yann LeCun, Saining Xie

Journal

arXiv preprint arXiv:2401.06209

Published Date

2024/1/11
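
A rough sketch of how "CLIP-blind pairs" can be mined from precomputed embeddings is shown below; the thresholds and the `clip_blind_pairs` helper are illustrative assumptions, not the paper's exact selection criterion.

```python
import numpy as np

def clip_blind_pairs(clip_emb, ssl_emb, clip_thresh=0.95, ssl_thresh=0.6):
    """Flag candidate 'CLIP-blind pairs': image pairs a CLIP-style encoder embeds
    almost identically while a vision-only SSL encoder separates them.

    Both inputs are (N, D) arrays of precomputed image embeddings; the thresholds
    are illustrative placeholders.
    """
    def cosine_matrix(x):
        x = x / np.linalg.norm(x, axis=1, keepdims=True)
        return x @ x.T

    clip_sim = cosine_matrix(clip_emb)
    ssl_sim = cosine_matrix(ssl_emb)

    pairs = []
    n = clip_emb.shape[0]
    for i in range(n):
        for j in range(i + 1, n):
            if clip_sim[i, j] >= clip_thresh and ssl_sim[i, j] <= ssl_thresh:
                pairs.append((i, j))
    return pairs

# Toy usage with random vectors standing in for real image embeddings:
rng = np.random.default_rng(0)
print(clip_blind_pairs(rng.normal(size=(50, 512)), rng.normal(size=(50, 768))))
```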

G-Retriever: Retrieval-Augmented Generation for Textual Graph Understanding and Question Answering

Given a graph with textual attributes, we enable users to `chat with their graph': that is, to ask questions about the graph using a conversational interface. In response to a user's questions, our method provides textual replies and highlights the relevant parts of the graph. While existing works integrate large language models (LLMs) and graph neural networks (GNNs) in various ways, they mostly focus on either conventional graph tasks (such as node, edge, and graph classification), or on answering simple graph queries on small or synthetic graphs. In contrast, we develop a flexible question-answering framework targeting real-world textual graphs, applicable to multiple applications including scene graph understanding, common sense reasoning, and knowledge graph reasoning. Toward this goal, we first develop our Graph Question Answering (GraphQA) benchmark with data collected from different tasks. Then, we propose our G-Retriever approach, which integrates the strengths of GNNs, LLMs, and Retrieval-Augmented Generation (RAG), and can be fine-tuned to enhance graph understanding via soft prompting. To resist hallucination and to allow for textual graphs that greatly exceed the LLM's context window size, G-Retriever performs RAG over a graph by formulating this task as a Prize-Collecting Steiner Tree optimization problem. Empirical evaluations show that our method outperforms baselines on textual graph tasks from multiple domains, scales well with larger graph sizes, and resists hallucination. (Our codes and datasets are available at: https://github.com/XiaoxinHe/G-Retriever.)

Authors

Xiaoxin He, Yijun Tian, Yifei Sun, Nitesh V Chawla, Thomas Laurent, Yann LeCun, Xavier Bresson, Bryan Hooi

Journal

arXiv preprint arXiv:2402.07630

Published Date

2024/2/12
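
As a simplified illustration of retrieval over a textual graph: the sketch below scores nodes against the query embedding and connects the top-scoring ones with networkx's Steiner-tree approximation. This is a stand-in for the paper's Prize-Collecting Steiner Tree formulation, and every name in it (`retrieve_subgraph`, the toy graph, the embeddings) is hypothetical.

```python
import numpy as np
import networkx as nx
from networkx.algorithms.approximation import steiner_tree

def retrieve_subgraph(graph, node_emb, query_emb, k=3):
    """Simplified stand-in for a graph-retrieval step.

    (a) Score nodes by cosine similarity between their text embeddings and the
    query embedding, then (b) connect the top-k nodes with a Steiner-tree
    approximation so the retrieved context stays small and connected.
    Embeddings and the graph are assumed to be precomputed.
    """
    q = query_emb / np.linalg.norm(query_emb)
    scores = {n: float(node_emb[n] @ q / np.linalg.norm(node_emb[n])) for n in graph.nodes}
    terminals = sorted(scores, key=scores.get, reverse=True)[:k]
    return steiner_tree(graph, terminals)   # connected subgraph covering the top-scoring nodes

# Toy textual graph: a path of 6 nodes with random "text" embeddings.
rng = np.random.default_rng(0)
G = nx.path_graph(6)
emb = {n: rng.normal(size=32) for n in G.nodes}
sub = retrieve_subgraph(G, emb, rng.normal(size=32))
print(sorted(sub.nodes))
```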

Learning by Reconstruction Produces Uninformative Features For Perception

Input space reconstruction is an attractive representation learning paradigm. Despite the interpretability of reconstruction and generation, we identify a misalignment between learning by reconstruction and learning for perception. We show that the former allocates a model's capacity towards a subspace of the data explaining the observed variance, a subspace with uninformative features for the latter. For example, the supervised TinyImagenet task with images projected onto the top subspace explaining 90% of the pixel variance can be solved with 45% test accuracy. Using the bottom subspace instead, accounting for only 20% of the pixel variance, reaches 55% test accuracy. The fact that the features useful for perception are learned last explains the need for long training times, e.g., with Masked Autoencoders. Learning by denoising is a popular strategy to alleviate that misalignment. We prove that while some noise strategies, such as masking, are indeed beneficial, others, such as additive Gaussian noise, are not. Yet, even in the case of masking, we find that the benefits vary as a function of the mask's shape, ratio, and the considered dataset. While tuning the noise strategy without knowledge of the perception task seems challenging, we provide first clues on how to detect if a noise strategy is never beneficial regardless of the perception task.

Authors

Randall Balestriero, Yann LeCun

Journal

arXiv preprint arXiv:2402.11337

Published Date

2024/2/17
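
The top-versus-bottom variance experiment quoted above can be pictured with a short numpy sketch that splits pixel space into the principal subspace explaining roughly 90% of the variance and its complement; the helper below is an illustrative assumption, and the classifier probing step that produces the reported accuracies is omitted.

```python
import numpy as np

def variance_split_projections(X, top_fraction=0.90):
    """Split pixel space into the principal subspace explaining ~top_fraction of the
    variance and its complement, and project the data onto each.

    X is an (N, D) array of flattened images; training a classifier on each
    projection (not shown) is what exposes the accuracy gap described above.
    """
    Xc = X - X.mean(axis=0)
    # Principal directions of pixel variance via SVD of the centered data.
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained = (S ** 2) / np.sum(S ** 2)
    k = int(np.searchsorted(np.cumsum(explained), top_fraction)) + 1

    top_proj = Xc @ Vt[:k].T        # coordinates in the high-variance subspace
    bottom_proj = Xc @ Vt[k:].T     # coordinates in the low-variance complement
    return top_proj, bottom_proj, k

# Toy run on random data standing in for flattened images:
X = np.random.default_rng(0).normal(size=(500, 64))
top, bottom, k = variance_split_projections(X)
print(k, top.shape, bottom.shape)
```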

To Compress or Not to Compress—Self-Supervised Learning and Information Theory: A Review

Deep neural networks excel in supervised learning tasks but are constrained by the need for extensive labeled data. Self-supervised learning emerges as a promising alternative, allowing models to learn without explicit labels. Information theory has shaped deep neural networks, particularly the information bottleneck principle. This principle optimizes the trade-off between compression and preserving relevant information, providing a foundation for efficient network design in supervised contexts. However, its precise role and adaptation in self-supervised learning remain unclear. In this work, we scrutinize various self-supervised learning approaches from an information-theoretic perspective, introducing a unified framework that encapsulates the self-supervised information-theoretic learning problem. This framework includes multiple encoders and decoders, suggesting that all existing work on self-supervised learning can be seen as specific instances. We aim to unify these approaches to understand their underlying principles better and address the main challenge: many works present different frameworks with differing theories that may seem contradictory. By weaving existing research into a cohesive narrative, we delve into contemporary self-supervised methodologies, spotlight potential research areas, and highlight inherent challenges. Moreover, we discuss how to estimate information-theoretic quantities and their associated empirical problems. Overall, this paper provides a comprehensive review of the intersection of information theory, self-supervised learning, and deep neural networks, aiming for a better understanding through our …

Authors

Ravid Shwartz-Ziv, Yann LeCun

Published Date

2024/3/12
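
For reference, the information bottleneck principle the review builds on is usually written as the trade-off below (standard textbook formulation, with Z the representation, X the input, Y the target, and β trading compression against preserved task information):

```latex
\min_{p(z \mid x)} \; \mathcal{L}_{\mathrm{IB}} \;=\; I(Z;X) \;-\; \beta\, I(Z;Y)
```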

Professor FAQs

What is Yann LeCun's h-index at New York University?

Yann LeCun's h-index is 145 overall and 113 since 2020.

What are Yann LeCun's research interests?

Yann LeCun's research interests are AI, machine learning, computer vision, robotics, and image compression.

What is Yann LeCun's total number of citations?

Yann LeCun has 338,733 citations in total.

Who are Yann LeCun's co-authors?

Yann LeCun's co-authors include Yoshua Bengio, Rob Fergus, Bernhard Boser, Richard E. Howard, Joan Bruna, and Clement Farabet.

Co-Authors

Yoshua Bengio, Université de Montréal (h-index: 227)

Rob Fergus, New York University (h-index: 83)

Bernhard Boser, University of California, Berkeley (h-index: 66)

Richard E. Howard, Rutgers, The State University of New Jersey (h-index: 60)

Joan Bruna, New York University (h-index: 45)

Clement Farabet, New York University (h-index: 21)
