Trevor Darrell
University of California, Berkeley
H-index: 161
North America-United States
Description
Trevor Darrell is a distinguished researcher at the University of California, Berkeley, with an exceptional h-index of 161 overall and a recent h-index of 116 (since 2020). He specializes in Computer Vision, Artificial Intelligence, AI, Machine Learning, and Deep Learning.
His recent articles reflect a diverse array of research interests and contributions to the field:
Real-world humanoid locomotion with reinforcement learning
Diffusion hyperfeatures: Searching through time and space for semantic correspondence
Humanoid Locomotion as Next Token Prediction
PromptonomyViT: Multi-task prompt learning improves video transformers using synthetic scene data
InstanceDiffusion: Instance-level Control for Image Generation
EgoPet: Egomotion and Interaction Data from an Animal's Perspective
Neural Network Diffusion
Shape-guided diffusion with inside-outside attention
Professor Information
| University | University of California, Berkeley |
| --- | --- |
| Position | Professor of Computer Science |
| Citations (all) | 242,160 |
| Citations (since 2020) | 158,435 |
| Cited By | 144,112 |
| h-index (all) | 161 |
| h-index (since 2020) | 116 |
| i10-index (all) | 468 |
| i10-index (since 2020) | 342 |
| University Profile Page | University of California, Berkeley |
Research & Interests List
Computer Vision
Artificial Intelligence
AI
Machine Learning
Deep Learning
Top articles of Trevor Darrell
Real-world humanoid locomotion with reinforcement learning
Humanoid robots that can autonomously operate in diverse environments have the potential to help address labor shortages in factories, assist the elderly at home, and colonize new planets. Although classical controllers for humanoid robots have shown impressive results in a number of settings, they are challenging to generalize and adapt to new environments. Here, we present a fully learning-based approach for real-world humanoid locomotion. Our controller is a causal transformer that takes the history of proprioceptive observations and actions as input and predicts the next action. We hypothesized that the observation-action history contains useful information about the world that a powerful transformer model can use to adapt its behavior in context, without updating its weights. We trained our model with large-scale model-free reinforcement learning on an ensemble of randomized environments in simulation …
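The controller described above hinges on causal masking: each timestep of the observation-action history may attend only to itself and earlier steps, which is what lets the model adapt "in context" without weight updates. A minimal NumPy sketch of one causally masked self-attention layer (names, seeds, and dimensions are illustrative, not the paper's implementation):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def causal_attention(history):
    """One causally masked self-attention layer over a (T, d) history of
    concatenated observation-action vectors. Each timestep attends only
    to itself and earlier timesteps."""
    T, d = history.shape
    rng = np.random.default_rng(0)  # fixed weights for the sketch
    Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
    q, k, v = history @ Wq, history @ Wk, history @ Wv
    scores = q @ k.T / np.sqrt(d)
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)  # strictly-future positions
    scores[mask] = -np.inf                            # forbid attending ahead
    return softmax(scores) @ v

# toy history of 8 concatenated (observation, action) vectors
hist = np.random.default_rng(1).standard_normal((8, 16))
out = causal_attention(hist)
```

Because of the mask, perturbing the most recent step of the history changes only the latest output, never earlier ones.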
Authors
Ilija Radosavovic,Tete Xiao,Bike Zhang,Trevor Darrell,Jitendra Malik,Koushil Sreenath
Journal
Science Robotics
Published Date
2024/4/17
Diffusion hyperfeatures: Searching through time and space for semantic correspondence
Diffusion models have been shown to be capable of generating high-quality images, suggesting that they could contain meaningful internal representations. Unfortunately, the feature maps that encode a diffusion model's internal information are spread not only over layers of the network, but also over diffusion timesteps, making it challenging to extract useful descriptors. We propose Diffusion Hyperfeatures, a framework for consolidating multi-scale and multi-timestep feature maps into per-pixel feature descriptors that can be used for downstream tasks. These descriptors can be extracted for both synthetic and real images using the generation and inversion processes. We evaluate the utility of our Diffusion Hyperfeatures on the task of semantic keypoint correspondence: our method achieves superior performance on the SPair-71k real image benchmark. We also demonstrate that our method is flexible and transferable: our feature aggregation network trained on the inversion features of real image pairs can be used on the generation features of synthetic image pairs with unseen objects and compositions. Our code is available at https://diffusion-hyperfeatures.github.io.
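The core consolidation step can be pictured as a learned weighted sum over (timestep, layer) feature maps. The sketch below stands in for the paper's aggregation network with scalar mixing weights, and assumes all maps are already resized to a common (H, W, C); everything here is illustrative, not the released code:

```python
import numpy as np

def aggregate_hyperfeatures(feature_maps, weights):
    """Consolidate multi-layer, multi-timestep feature maps into one
    per-pixel descriptor map via a normalized weighted sum.

    feature_maps: dict mapping (timestep, layer) -> (H, W, C) array.
    weights: dict with the same keys, one scalar mixing weight each
    (a stand-in for the learned aggregation network)."""
    keys = sorted(feature_maps)
    w = np.array([weights[k] for k in keys])
    w = np.exp(w) / np.exp(w).sum()                       # softmax-normalize
    stacked = np.stack([feature_maps[k] for k in keys])   # (N, H, W, C)
    return np.tensordot(w, stacked, axes=1)               # (H, W, C)

rng = np.random.default_rng(0)
maps = {(t, l): rng.standard_normal((4, 4, 8)) for t in (0, 10) for l in (1, 2)}
ws = {k: rng.standard_normal() for k in maps}
desc = aggregate_hyperfeatures(maps, ws)
```

The per-pixel descriptors in `desc` are what downstream tasks like keypoint correspondence would then compare.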
Authors
Grace Luo,Lisa Dunlap,Dong Huk Park,Aleksander Holynski,Trevor Darrell
Journal
Advances in Neural Information Processing Systems
Published Date
2024/2/13
Humanoid Locomotion as Next Token Prediction
We cast real-world humanoid control as a next token prediction problem, akin to predicting the next word in language. Our model is a causal transformer trained via autoregressive prediction of sensorimotor trajectories. To account for the multi-modal nature of the data, we perform prediction in a modality-aligned way, and for each input token predict the next token from the same modality. This general formulation enables us to leverage data with missing modalities, like video trajectories without actions. We train our model on a collection of simulated trajectories coming from prior neural network policies, model-based controllers, motion capture data, and YouTube videos of humans. We show that our model enables a full-sized humanoid to walk in San Francisco zero-shot. Our model can transfer to the real world even when trained on only 27 hours of walking data, and can generalize to commands not seen during training like walking backward. These findings suggest a promising path toward learning challenging real-world control tasks by generative modeling of sensorimotor trajectories.
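The modality-aligned formulation above means each token's training target is the next token of the same modality, and positions whose target is missing (e.g., actions absent from video-only trajectories) are simply dropped from the loss. A toy sketch of that target-selection rule (pure illustration, not the paper's code):

```python
OBS, ACT = "obs", "act"  # modality tags for an interleaved trajectory

def modality_aligned_targets(modalities, present):
    """For each position, the target is the index of the NEXT token of
    the SAME modality; -1 marks positions with no such token or whose
    target token is missing (so they are masked out of the loss)."""
    T = len(modalities)
    targets = [-1] * T
    for i in range(T):
        for j in range(i + 1, T):
            if modalities[j] == modalities[i]:
                targets[i] = j if present[j] else -1
                break
    return targets

mods = [OBS, ACT, OBS, ACT, OBS]
tg = modality_aligned_targets(mods, [True] * 5)   # [2, 3, 4, -1, -1]
```

A video trajectory without actions just marks its action slots as absent: those prediction targets become -1 and contribute nothing to training.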
Authors
Ilija Radosavovic,Bike Zhang,Baifeng Shi,Jathushan Rajasegaran,Sarthak Kamat,Trevor Darrell,Koushil Sreenath,Jitendra Malik
Journal
arXiv preprint arXiv:2402.19469
Published Date
2024/2/29
PromptonomyViT: Multi-task prompt learning improves video transformers using synthetic scene data
Action recognition models have achieved impressive results by incorporating scene-level annotations, such as objects, their relations, 3D structure, and more. However, obtaining annotations of scene structure for videos requires a significant amount of effort to gather and annotate, making these methods expensive to train. In contrast, synthetic datasets generated by graphics engines provide powerful alternatives for generating scene-level annotations across multiple tasks. In this work, we propose an approach to leverage synthetic scene data for improving video understanding. We present a multi-task prompt learning approach for video transformers, where a shared video transformer backbone is enhanced by a small set of specialized parameters for each task. Specifically, we add a set of "task prompts", each corresponding to a different task, and let each prompt predict task-related annotations. This design allows the model to capture information shared among synthetic scene tasks as well as information shared between synthetic scene tasks and a real video downstream task throughout the entire network. We refer to this approach as "Promptonomy", since the prompts model task-related structure. We propose the PromptonomyViT model (PViT), a video transformer that incorporates various types of scene-level information from synthetic data using the "Promptonomy" approach. PViT shows strong performance improvements on multiple video understanding tasks and datasets. Project page: https://ofir1080.github.io/PromptonomyViT/
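Mechanically, task prompts are a few extra learnable tokens prepended to the video token sequence; after the shared backbone runs, each task's prediction is read off its own prompt position. A hypothetical NumPy sketch of that wiring (the function names and the toy linear backbone are illustrative assumptions, not the paper's API):

```python
import numpy as np

def run_with_task_prompts(video_tokens, task_prompts, backbone):
    """Prepend one learnable prompt token per synthetic-scene task to the
    video tokens, run the shared backbone over the joint sequence, and
    split the outputs back into per-task and per-patch streams."""
    K = task_prompts.shape[0]
    seq = np.concatenate([task_prompts, video_tokens], axis=0)
    out = backbone(seq)              # shared backbone sees prompts + video
    return out[:K], out[K:]          # task-prompt outputs, video outputs

rng = np.random.default_rng(0)
d = 8
prompts = rng.standard_normal((3, d))       # one prompt per synthetic task
tokens = rng.standard_normal((10, d))       # patch tokens of one video
W = rng.standard_normal((d, d)) / np.sqrt(d)
task_out, video_out = run_with_task_prompts(tokens, prompts, lambda x: x @ W)
```

Each row of `task_out` would feed a small task-specific head predicting that task's scene-level annotations, while `video_out` serves the downstream video task.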
Authors
Roei Herzig,Ofir Abramovich,Elad Ben-Avraham,Assaf Arbelle,Leonid Karlinsky,Ariel Shamir,Trevor Darrell,Amir Globerson
Journal
Winter Conference on Applications of Computer Vision (WACV), 2024
Published Date
2022/12/8
InstanceDiffusion: Instance-level Control for Image Generation
Text-to-image diffusion models produce high quality images but do not offer control over individual instances in the image. We introduce InstanceDiffusion that adds precise instance-level control to text-to-image diffusion models. InstanceDiffusion supports free-form language conditions per instance and allows flexible ways to specify instance locations such as simple single points, scribbles, bounding boxes or intricate instance segmentation masks, and combinations thereof. We propose three major changes to text-to-image models that enable precise instance-level control. Our UniFusion block enables instance-level conditions for text-to-image models, the ScaleU block improves image fidelity, and our Multi-instance Sampler improves generations for multiple instances. InstanceDiffusion significantly surpasses specialized state-of-the-art models for each location condition. Notably, on the COCO dataset, we outperform previous state-of-the-art by 20.4% AP for box inputs, and 25.4% IoU for mask inputs.
Authors
Xudong Wang,Trevor Darrell,Sai Saketh Rambhatla,Rohit Girdhar,Ishan Misra
Journal
arXiv preprint arXiv:2402.03290
Published Date
2024/2/5
EgoPet: Egomotion and Interaction Data from an Animal's Perspective
Animals perceive the world to plan their actions and interact with other agents to accomplish complex tasks, demonstrating capabilities that are still unmatched by AI systems. To advance our understanding and reduce the gap between the capabilities of animals and AI systems, we introduce a dataset of pet egomotion imagery with diverse examples of simultaneous egomotion and multi-agent interaction. Current video datasets separately contain egomotion and interaction examples, but rarely both at the same time. In addition, EgoPet offers a radically distinct perspective from existing egocentric datasets of humans or vehicles. We define two in-domain benchmark tasks that capture animal behavior, and a third benchmark to assess the utility of EgoPet as a pretraining resource to robotic quadruped locomotion, showing that models trained from EgoPet outperform those trained from prior datasets.
Authors
Amir Bar,Arya Bakhtiar,Danny Tran,Antonio Loquercio,Jathushan Rajasegaran,Yann LeCun,Amir Globerson,Trevor Darrell
Journal
arXiv preprint arXiv:2404.09991
Published Date
2024/4/15
Neural Network Diffusion
Diffusion models have achieved remarkable success in image and video generation. In this work, we demonstrate that diffusion models can also *generate high-performing neural network parameters*. Our approach is simple, utilizing an autoencoder and a standard latent diffusion model. The autoencoder extracts latent representations of a subset of the trained network parameters. A diffusion model is then trained to synthesize these latent parameter representations from random noise. It then generates new representations that are passed through the autoencoder's decoder, whose outputs are ready to use as new subsets of network parameters. Across various architectures and datasets, our diffusion process consistently generates models of comparable or improved performance over trained networks, with minimal additional cost. Notably, we empirically find that the generated models perform differently from the trained networks. Our results encourage more exploration of the versatile use of diffusion models.
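Before any autoencoder or diffusion model can touch network parameters, the chosen parameter subset has to be turned into a flat vector and, after decoding, split back into ready-to-use tensors. A small sketch of that round trip (helper names are illustrative, not from the paper's code):

```python
import numpy as np

def flatten_params(param_subset):
    """Flatten a list of parameter tensors into one vector, the form an
    autoencoder would compress before latent diffusion; also return the
    shapes needed to undo the flattening."""
    flat = np.concatenate([p.ravel() for p in param_subset])
    shapes = [p.shape for p in param_subset]
    return flat, shapes

def unflatten_params(flat, shapes):
    """Invert flatten_params: split a decoded vector back into parameter
    tensors that can be loaded into a network."""
    out, i = [], 0
    for s in shapes:
        n = int(np.prod(s))
        out.append(flat[i:i + n].reshape(s))
        i += n
    return out

rng = np.random.default_rng(0)
subset = [rng.standard_normal((3, 4)), rng.standard_normal((5,))]
flat, shapes = flatten_params(subset)
restored = unflatten_params(flat, shapes)
```

In the paper's pipeline, the diffusion model's decoded output takes the place of `flat` here, and the restored tensors are plugged straight back into the network.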
Authors
Kai Wang,Zhaopan Xu,Yukun Zhou,Zelin Zang,Trevor Darrell,Zhuang Liu,Yang You
Journal
arXiv preprint arXiv:2402.13144
Published Date
2024/2/20
Shape-guided diffusion with inside-outside attention
We introduce precise object silhouette as a new form of user control in text-to-image diffusion models, which we dub Shape-Guided Diffusion. Our training-free method uses an Inside-Outside Attention mechanism during the inversion and generation process to apply a shape constraint to the cross- and self-attention maps. Our mechanism designates which spatial region is the object (inside) vs. the background (outside), then associates edits to the correct region. We demonstrate the efficacy of our method on the shape-guided editing task, where the model must replace an object according to a text prompt and object mask. We curate a new ShapePrompts benchmark derived from MS-COCO and achieve SOTA results in shape faithfulness without a degradation in text alignment or image realism according to both automatic metrics and annotator ratings. Our data and code will be made available at https://shape-guided-diffusion.github.io.
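The inside/outside constraint can be pictured as zeroing attention mass on the wrong side of the silhouette and renormalizing: object tokens may only attend inside the mask, background tokens only outside. A minimal NumPy sketch of that masking step (illustrative only; the paper applies this across the diffusion model's actual cross- and self-attention maps):

```python
import numpy as np

def inside_outside_attention(attn, object_mask, token_is_object):
    """Constrain an attention map with an object silhouette.

    attn: (T, P) attention from T text tokens to P image patches.
    object_mask: (P,) bool, True for patches inside the silhouette.
    token_is_object: (T,) bool, True for tokens describing the object.
    Object tokens keep only inside patches, background tokens only
    outside patches; each row is then renormalized."""
    out = attn.copy()
    for t in range(attn.shape[0]):
        allowed = object_mask if token_is_object[t] else ~object_mask
        out[t, ~allowed] = 0.0
        s = out[t].sum()
        if s > 0:
            out[t] /= s
    return out

rng = np.random.default_rng(0)
attn = rng.random((2, 6))
attn /= attn.sum(axis=1, keepdims=True)
obj = np.array([True, True, False, False, False, False])
tok = np.array([True, False])    # token 0 names the object, token 1 the background
masked = inside_outside_attention(attn, obj, tok)
```

After masking, edits driven by the object token land only inside the silhouette, which is what keeps the generated shape faithful to the mask.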
Authors
Dong Huk Park,Grace Luo,Clayton Toste,Samaneh Azadi,Xihui Liu,Maka Karalashvili,Anna Rohrbach,Trevor Darrell
Published Date
2024
Professor FAQs
What is Trevor Darrell's h-index at University of California, Berkeley?
The h-index of Trevor Darrell has been 116 since 2020 and 161 in total.
What are Trevor Darrell's top articles?
The top articles of Trevor Darrell at University of California, Berkeley include:
Real-world humanoid locomotion with reinforcement learning
Diffusion hyperfeatures: Searching through time and space for semantic correspondence
Humanoid Locomotion as Next Token Prediction
PromptonomyViT: Multi-task prompt learning improves video transformers using synthetic scene data
InstanceDiffusion: Instance-level Control for Image Generation
EgoPet: Egomotion and Interaction Data from an Animal's Perspective
Neural Network Diffusion
Shape-guided diffusion with inside-outside attention
...
What are Trevor Darrell's research interests?
The research interests of Trevor Darrell are: Computer Vision, Artificial Intelligence, AI, Machine Learning, and Deep Learning.
What is Trevor Darrell's total number of citations?
Trevor Darrell has 242,160 citations in total.
Who are the co-authors of Trevor Darrell?
The co-authors of Trevor Darrell include Pieter Abbeel, Alex 'Sandy' Pentland, Alexei A. Efros, Louis-Philippe Morency, Kate Saenko, and Judy Hoffman.