Ruslan Salakhutdinov

Carnegie Mellon University

H-index: 115

North America-United States

Professor Information

University

Carnegie Mellon University

Position

UPMC Professor, Machine Learning Department

Citations(all)

191959

Citations(since 2020)

137051

Cited By

115559

hIndex(all)

115

hIndex(since 2020)

103

i10Index(all)

270

i10Index(since 2020)

264

Email

University Profile Page

Carnegie Mellon University

Research & Interests List

Machine Learning

Artificial Intelligence

Deep Learning

Top articles of Ruslan Salakhutdinov

Automated Black-box Prompt Engineering for Personalized Text-to-Image Generation

Prompt engineering is effective for controlling the output of text-to-image (T2I) generative models, but it is also laborious due to the need for manually crafted prompts. This challenge has spurred the development of algorithms for automated prompt generation. However, these methods often struggle with transferability across T2I models, require white-box access to the underlying model, and produce non-intuitive prompts. In this work, we introduce PRISM, an algorithm that automatically identifies human-interpretable and transferable prompts that can effectively generate desired concepts given only black-box access to T2I models. Inspired by large language model (LLM) jailbreaking, PRISM leverages the in-context learning ability of LLMs to iteratively refine the distribution of candidate prompts for given reference images. Our experiments demonstrate the versatility and effectiveness of PRISM in generating accurate prompts for objects, styles, and images across multiple T2I models, including Stable Diffusion, DALL-E, and Midjourney.
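
A rough sketch of the black-box refinement loop described above, for illustration only: an LLM proposes candidate prompts, each candidate is rendered by the T2I model, and the results are scored against the reference image, with scored attempts fed back as in-context examples. The propose, generate, and score callables (e.g., an in-context-prompted LLM, a T2I API, and a CLIP-style similarity) are caller-supplied assumptions, not components of PRISM itself.

```python
# Hypothetical sketch of a PRISM-style black-box prompt refinement loop.
# propose(reference, history, n) -> list[str]   (LLM with in-context examples)
# generate(prompt) -> image                     (black-box text-to-image model)
# score(image, reference) -> float              (e.g., CLIP similarity)
# All three are caller-supplied placeholders, not released PRISM components.

def refine_prompt(reference_image, propose, generate, score,
                  n_rounds=5, n_candidates=4):
    history = []                                  # scored attempts, reused as in-context examples
    best_prompt, best_score = None, float("-inf")
    for _ in range(n_rounds):
        for prompt in propose(reference_image, history, n_candidates):
            image = generate(prompt)              # query the black-box T2I model
            s = score(image, reference_image)     # how well the output matches the reference
            history.append((prompt, s))
            if s > best_score:
                best_prompt, best_score = prompt, s
    return best_prompt
```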

Authors

Yutong He,Alexander Robey,Naoki Murata,Yiding Jiang,Joshua Williams,George J Pappas,Hamed Hassani,Yuki Mitsufuji,Ruslan Salakhutdinov,J Zico Kolter

Journal

arXiv preprint arXiv:2403.19103

Published Date

2024/3/28

SPRING: Studying Papers and Reasoning to play Games

Open-world survival games pose significant challenges for AI algorithms due to their multi-tasking, deep exploration, and goal prioritization requirements. Despite reinforcement learning (RL) being popular for solving games, its high sample complexity limits its effectiveness in complex open-world games like Crafter or Minecraft. We propose a novel approach, SPRING, to read Crafter's original academic paper and use the knowledge learned to reason and play the game through a large language model (LLM). Prompted with the LaTeX source as game context and a description of the agent's current observation, our SPRING framework employs a directed acyclic graph (DAG) with game-related questions as nodes and dependencies as edges. We identify the optimal action to take in the environment by traversing the DAG and calculating LLM responses for each node in topological order, with the LLM's answer to the final node directly translating to environment actions. In our experiments, we study the quality of in-context "reasoning" induced by different forms of prompts under the setting of the Crafter environment. Our experiments suggest that LLMs, when prompted with consistent chain-of-thought, have great potential in completing sophisticated high-level trajectories. Quantitatively, SPRING with GPT-4 outperforms all state-of-the-art RL baselines, trained for 1M steps, without any training. Finally, we show the potential of Crafter as a test bed for LLMs. Code at github.com/holmeswww/SPRING
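
A minimal sketch of the DAG traversal described above, assuming a caller-supplied ask_llm(context, question, prior_answers) call, that every node referenced in the edge map appears in the question dictionary, and that the last node in topological order is the action node; this illustrates the control flow only and is not the SPRING code.

```python
# Hypothetical sketch of SPRING-style reasoning over a question DAG.
# ask_llm(context, question, prior_answers) -> str is a caller-supplied LLM call.

from graphlib import TopologicalSorter

def traverse_question_dag(questions, edges, context, ask_llm):
    # questions: {node_id: question text}
    # edges: {node_id: set of prerequisite node_ids}, assumed to reference only known nodes
    order = list(TopologicalSorter({n: edges.get(n, set()) for n in questions}).static_order())
    answers = {}
    for node in order:
        # Answers to prerequisite questions are passed along with the game context.
        prior = {dep: answers[dep] for dep in edges.get(node, set())}
        answers[node] = ask_llm(context, questions[node], prior)
    # The answer at the final node is the one translated into an environment action.
    return answers[order[-1]]
```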

Authors

Yue Wu,So Yeon Min,Shrimai Prabhumoye,Yonatan Bisk,Ruslan Salakhutdinov,Amos Azaria,Tom Mitchell,Yuanzhi Li

Published Date

2023

Inference via Interpolation: Contrastive Representations Provably Enable Planning and Inference

Given time series data, how can we answer questions like "what will happen in the future?" and "how did we get here?" These sorts of probabilistic inference questions are challenging when observations are high-dimensional. In this paper, we show how these questions can have compact, closed-form solutions in terms of learned representations. The key idea is to apply a variant of contrastive learning to time series data. Prior work already shows that the representations learned by contrastive learning encode a probability ratio. By extending prior work to show that the marginal distribution over representations is Gaussian, we can then prove that the joint distribution of representations is also Gaussian. Taken together, these results show that representations learned via temporal contrastive learning follow a Gauss-Markov chain, a graphical model where inference (e.g., prediction, planning) over representations corresponds to inverting a low-dimensional matrix. In one special case, inferring intermediate representations will be equivalent to interpolating between the learned representations. We validate our theory using numerical simulations on tasks with up to 46 dimensions.
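
A small numerical illustration of the Gauss-Markov point above: once representations are jointly Gaussian, inferring an intermediate representation from the two endpoints is ordinary Gaussian conditioning, i.e., a single low-dimensional linear solve. The zero-mean assumption and the block ordering of the covariance are conventions chosen for this sketch, not the paper's notation.

```python
# Hypothetical illustration: conditional mean of an intermediate representation
# given the endpoint representations, under a zero-mean joint Gaussian whose
# covariance Sigma is ordered as blocks [psi_t, psi_0, psi_T].

import numpy as np

def conditional_intermediate(psi_0, psi_T, Sigma):
    d = len(psi_0)
    S_te = Sigma[:d, d:]                 # cross-covariance: intermediate vs. stacked endpoints
    S_ee = Sigma[d:, d:]                 # covariance of the stacked endpoints
    endpoints = np.concatenate([psi_0, psi_T])
    # Standard Gaussian conditioning: E[x_t | x_e] = Sigma_te Sigma_ee^{-1} x_e
    return S_te @ np.linalg.solve(S_ee, endpoints)
```

In the special case the abstract mentions, this conditional mean reduces to an interpolation between the two endpoint representations.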

Authors

Benjamin Eysenbach,Vivek Myers,Ruslan Salakhutdinov,Sergey Levine

Journal

arXiv preprint arXiv:2403.04082

Published Date

2024/3/6

Factorized contrastive learning: Going beyond multi-view redundancy

In a wide range of multimodal tasks, contrastive learning has become a particularly appealing approach since it can successfully learn representations from abundant unlabeled data with only pairing information (e.g., image-caption or video-audio pairs). Underpinning these approaches is the assumption of multi-view redundancy: that shared information between modalities is necessary and sufficient for downstream tasks. However, in many real-world settings, task-relevant information is also contained in modality-unique regions: information that is only present in one modality but still relevant to the task. How can we learn self-supervised multimodal representations to capture both shared and unique information relevant to downstream tasks? This paper proposes FactorCL, a new multimodal representation learning method to go beyond multi-view redundancy. FactorCL is built from three new contributions: (1) factorizing task-relevant information into shared and unique representations, (2) capturing task-relevant information via maximizing MI lower bounds and removing task-irrelevant information via minimizing MI upper bounds, and (3) multimodal data augmentations to approximate task relevance without labels. On large-scale real-world datasets, FactorCL captures both shared and unique information and achieves state-of-the-art results on six benchmarks.
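
For context on the "maximizing MI lower bounds" ingredient, the snippet below is a standard InfoNCE-style lower-bound estimator of the kind such contrastive objectives build on; it is generic code, not FactorCL's factorized loss, and the temperature value is an arbitrary choice.

```python
# Generic InfoNCE-style mutual-information lower bound (not FactorCL's loss).
import torch
import torch.nn.functional as F

def infonce_lower_bound(z1, z2, temperature=0.1):
    """z1, z2: (batch, dim) paired representations from two modalities."""
    z1 = F.normalize(z1, dim=-1)
    z2 = F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / temperature                 # (batch, batch) similarity matrix
    labels = torch.arange(z1.size(0), device=z1.device)
    # Diagonal entries are positive pairs; the rest serve as negatives.
    # Maximizing this quantity tightens the InfoNCE lower bound on MI
    # (up to an additive log batch-size constant).
    return -F.cross_entropy(logits, labels)
```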

Authors

Paul Pu Liang,Zihao Deng,Martin Q Ma,James Y Zou,Louis-Philippe Morency,Ruslan Salakhutdinov

Journal

Advances in Neural Information Processing Systems

Published Date

2024/2/13

Automatic question-answer generation for long-tail knowledge

Pretrained Large Language Models (LLMs) have gained significant attention for addressing open-domain Question Answering (QA). While they exhibit high accuracy in answering questions related to common knowledge, LLMs encounter difficulties in learning about uncommon long-tail knowledge (tail entities). Since manually constructing QA datasets demands substantial human resources, the types of existing QA datasets are limited, leaving us with a scarcity of datasets to study the performance of LLMs on tail entities. In this paper, we propose an automatic approach to generate specialized QA datasets for tail entities and present the associated research challenges. We conduct extensive experiments by employing pretrained LLMs on our newly generated long-tail QA datasets, comparing their performance with and without external resources including Wikipedia and Wikidata knowledge graphs.
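
As a toy illustration of building QA pairs for tail entities from knowledge-graph facts, the snippet below turns a (subject, relation, object) triple into a question-answer pair with a fixed template; the template and example are hypothetical and far simpler than the paper's generation pipeline.

```python
# Hypothetical template-based QA generation from a knowledge-graph triple;
# not the paper's pipeline, just an illustration of the input/output shape.

def qa_from_triple(subject, relation, obj):
    question = f"What is the {relation} of {subject}?"
    return {"question": question, "answer": obj}

# Example with a Wikidata-style fact about a long-tail entity (illustrative only).
pair = qa_from_triple("Kvitøya", "country", "Norway")
```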

Authors

Rohan Kumar,Youngmin Kim,Sunitha Ravi,Haitian Sun,Christos Faloutsos,Ruslan Salakhutdinov,Minji Yoon

Journal

arXiv preprint arXiv:2403.01382

Published Date

2024/3/3

Generating images with multimodal language models

We propose a method to fuse frozen text-only large language models (LLMs) with pre-trained image encoder and decoder models, by mapping between their embedding spaces. Our model demonstrates a wide suite of multimodal capabilities: image retrieval, novel image generation, and multimodal dialogue. Ours is the first approach capable of conditioning on arbitrarily interleaved image and text inputs to generate coherent image (and text) outputs. To achieve strong performance on image generation, we propose an efficient mapping network to ground the LLM to an off-the-shelf text-to-image generation model. This mapping network translates hidden representations of text into the embedding space of the visual models, enabling us to leverage the strong text representations of the LLM for visual outputs. Our approach outperforms baseline generation models on tasks with longer and more complex language. In addition to novel image generation, our model is also capable of image retrieval from a prespecified dataset, and decides whether to retrieve or generate at inference time. This is done with a learnt decision module which conditions on the hidden representations of the LLM. Our model exhibits a wider range of capabilities compared to prior multimodal language models. It can process image-and-text inputs, and produce retrieved images, generated images, and generated text—outperforming non-LLM based generation models across several text-to-image tasks that measure context dependence.
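
A schematic sketch of the kind of mapping network described above: a small learned module that projects frozen-LLM hidden states into a fixed number of embeddings in the text-to-image model's conditioning space. The dimensions, token count, and single linear layer are placeholder assumptions, not the paper's architecture or hyperparameters.

```python
# Hypothetical mapping network from LLM hidden states to a T2I conditioning space.
import torch.nn as nn

class MappingNetwork(nn.Module):
    def __init__(self, llm_dim=4096, t2i_dim=768, n_query_tokens=4):
        super().__init__()
        self.t2i_dim = t2i_dim
        self.n_query_tokens = n_query_tokens
        # Learned projection producing n_query_tokens embeddings per input.
        self.proj = nn.Linear(llm_dim, t2i_dim * n_query_tokens)

    def forward(self, llm_hidden):                      # (batch, llm_dim) pooled hidden states
        out = self.proj(llm_hidden)                     # (batch, t2i_dim * n_query_tokens)
        return out.view(-1, self.n_query_tokens, self.t2i_dim)
```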

Authors

Jing Yu Koh,Daniel Fried,Russ R Salakhutdinov

Journal

Advances in Neural Information Processing Systems

Published Date

2024/2/13

OmniACT: A Dataset and Benchmark for Enabling Multimodal Generalist Autonomous Agents for Desktop and Web

For decades, human-computer interaction has fundamentally been manual. Even today, almost all productive work done on the computer necessitates human input at every step. Autonomous virtual agents represent an exciting step in automating many of these menial tasks. Virtual agents would empower users with limited technical proficiency to harness the full possibilities of computer systems. They could also enable the efficient streamlining of numerous computer tasks, ranging from calendar management to complex travel bookings, with minimal human intervention. In this paper, we introduce OmniACT, the first-of-its-kind dataset and benchmark for assessing an agent's capability to generate executable programs to accomplish computer tasks. Our scope extends beyond traditional web automation, covering a diverse range of desktop applications. The dataset consists of fundamental tasks such as "Play the next song", as well as longer-horizon tasks such as "Send an email to John Doe mentioning the time and place to meet". Specifically, given a screen image and a visually grounded natural language task, the goal is to generate a script capable of fully executing the task. We run several strong baseline language model agents on our benchmark. The strongest baseline, GPT-4, performs the best on our benchmark. However, its performance level still reaches only 15% of human proficiency in generating executable scripts capable of completing the task, demonstrating the challenge of our task for conventional web agents. Our benchmark provides a platform to measure and evaluate the progress of language model agents in …
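
To make the task format concrete, the sketch below iterates over (screenshot, instruction) pairs and asks an agent for an executable script for each; the field names and the agent interface are assumptions for illustration, not OmniACT's released format or evaluation code.

```python
# Hypothetical harness for a screenshot + instruction -> script benchmark.
# agent(screenshot, task) -> str returns a script (e.g., PyAutoGUI-style code as text).

def collect_predictions(examples, agent):
    predictions = []
    for ex in examples:                     # each ex: {"screenshot": image, "task": str}
        script = agent(ex["screenshot"], ex["task"])
        predictions.append({"task": ex["task"], "script": script})
    return predictions
```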

Authors

Raghav Kapoor,Yash Parag Butala,Melisa Russak,Jing Yu Koh,Kiran Kamble,Waseem Alshikh,Ruslan Salakhutdinov

Journal

arXiv preprint arXiv:2402.17553

Published Date

2024/2/27

Stylus: Automatic Adapter Selection for Diffusion Models

Beyond scaling base models with more data or parameters, fine-tuned adapters provide an alternative way to generate high-fidelity, custom images at reduced costs. As such, adapters have been widely adopted by open-source communities, accumulating a database of over 100K adapters, most of which are highly customized with insufficient descriptions. This paper explores the problem of matching the prompt to a set of relevant adapters, built on recent work that highlights the performance gains of composing adapters. We introduce Stylus, which efficiently selects and automatically composes task-specific adapters based on a prompt's keywords. Stylus outlines a three-stage approach that first summarizes adapters with improved descriptions and embeddings, retrieves relevant adapters, and then further assembles adapters based on prompts' keywords by checking how well they fit the prompt. To evaluate Stylus, we developed StylusDocs, a curated dataset featuring 75K adapters with pre-computed adapter embeddings. In our evaluation on popular Stable Diffusion checkpoints, Stylus achieves greater CLIP-FID Pareto efficiency and is twice as preferred, with humans and multimodal models as evaluators, over the base model. See stylus-diffusion.github.io for more.
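
A minimal sketch of the retrieval stage described above, assuming each adapter carries a precomputed description embedding and a caller-supplied embed function maps the prompt into the same space; it illustrates embedding-based ranking only and omits the summarization and composition stages.

```python
# Hypothetical adapter retrieval by cosine similarity between a prompt embedding
# and precomputed adapter description embeddings (not the Stylus implementation).

import numpy as np

def retrieve_adapters(prompt, adapters, embed, top_k=3):
    # adapters: list of {"name": str, "embedding": 1-D numpy array}
    q = embed(prompt)
    q = q / np.linalg.norm(q)
    scored = []
    for a in adapters:
        e = a["embedding"] / np.linalg.norm(a["embedding"])
        scored.append((float(q @ e), a["name"]))
    scored.sort(reverse=True)                      # highest cosine similarity first
    return [name for _, name in scored[:top_k]]
```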

Authors

Michael Luo,Justin Wong,Brandon Trabucco,Yanping Huang,Joseph E Gonzalez,Zhifeng Chen,Ruslan Salakhutdinov,Ion Stoica

Journal

arXiv preprint arXiv:2404.18928

Published Date

2024/4/29

Professor FAQs

What is Ruslan Salakhutdinov's h-index at Carnegie Mellon University?

Ruslan Salakhutdinov's h-index is 115 overall and 103 since 2020.

What are Ruslan Salakhutdinov's research interests?

The research interests of Ruslan Salakhutdinov are Machine Learning, Artificial Intelligence, and Deep Learning.

What is Ruslan Salakhutdinov's total number of citations?

Ruslan Salakhutdinov has 191,959 citations in total.

Who are the co-authors of Ruslan Salakhutdinov?

The co-authors of Ruslan Salakhutdinov include Yoshua Bengio, Antonio Torralba, Joshua B. Tenenbaum, Eric Xing, Kyunghyun Cho, and Aaron Courville.

Co-Authors

Yoshua Bengio, Université de Montréal (H-index: 227)

Antonio Torralba, Massachusetts Institute of Technology (H-index: 138)

Joshua B. Tenenbaum, Massachusetts Institute of Technology (H-index: 137)

Eric Xing, Carnegie Mellon University (H-index: 114)

Kyunghyun Cho, New York University (H-index: 98)

Aaron Courville, Université de Montréal (H-index: 98)
