Andrew Zisserman

University of Oxford

H-index: 194

Location: United Kingdom, Europe

Professor Information

University: University of Oxford

Position: ___

Citations (all): 396,658

Citations (since 2020): 228,604

Cited By: 253,692

h-index (all): 194

h-index (since 2020): 120

i10-index (all): 632

i10-index (since 2020): 445

University Profile Page: University of Oxford

Research Interests

Computer Vision

Machine Learning

Top articles of Andrew Zisserman

Action classification in video clips using attention-based neural networks

Methods, systems, and apparatus, including computer programs encoded on computer storage media, for classifying actions in a video. One of the methods includes obtaining a feature representation of a video clip; obtaining data specifying a plurality of candidate agent bounding boxes in the key video frame; and for each candidate agent bounding box: processing the feature representation through an action transformer neural network.

Published Date

2024/1/25
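
As a rough illustration of the mechanism described in this patent abstract (a per-box query cross-attending to clip features), here is a minimal PyTorch sketch; the class name, dimensions, and pooling choices are my own assumptions, not the patented implementation.

```python
# Minimal sketch (not the patented implementation): for each candidate agent box,
# pool a query from the key frame and cross-attend to all spatio-temporal clip
# features to classify the action.
import torch
import torch.nn as nn
import torchvision.ops as ops


class ActionTransformerSketch(nn.Module):
    def __init__(self, dim=256, num_actions=80):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.classifier = nn.Linear(dim, num_actions)

    def forward(self, video_feats, boxes):
        # video_feats: (T, C, H, W) clip features; boxes: (N, 4) candidate agent boxes
        T, C, H, W = video_feats.shape
        key_frame = video_feats[T // 2].unsqueeze(0)          # centre ("key") frame
        # RoI-pool one query vector per candidate box from the key frame
        rois = ops.roi_align(key_frame, [boxes], output_size=1, spatial_scale=1.0)
        queries = rois.flatten(1).unsqueeze(0)                # (1, N, C)
        # Cross-attend each box query to all spatio-temporal features
        context = video_feats.permute(0, 2, 3, 1).reshape(1, T * H * W, C)
        attended, _ = self.attn(queries, context, context)
        return self.classifier(attended.squeeze(0))           # (N, num_actions)


feats = torch.randn(8, 256, 14, 14)                   # toy clip features
boxes = torch.tensor([[1.0, 1.0, 8.0, 12.0]])         # one candidate agent box
logits = ActionTransformerSketch()(feats, boxes)
```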

FlexCap: Generating Rich, Localized, and Flexible Captions in Images

We introduce a versatile vision-language model (VLM) capable of generating region-specific descriptions of varying lengths. The model, FlexCap, is trained to produce length-conditioned captions for input bounding boxes, and this allows control over the information density of its output, with descriptions ranging from concise object labels to detailed captions. To achieve this, we create large-scale training datasets of image region descriptions of varying length, starting from captioned images. This flexible-captioning capability has several valuable applications. First, FlexCap demonstrates superior performance in dense captioning tasks on the Visual Genome dataset. Second, a visual question answering (VQA) system can be built by employing FlexCap to generate localized descriptions as inputs to a large language model. The resulting system achieves state-of-the-art zero-shot performance on a number of VQA datasets. We also demonstrate that an approach using FlexCap can be better at open-ended object detection than approaches using other VLMs. We highlight a novel characteristic of FlexCap, which is its ability to extract diverse visual information through prefix conditioning. Finally, we qualitatively demonstrate FlexCap's broad applicability in tasks such as image labeling, object attribute recognition, and visual dialog. Project webpage: https://flex-cap.github.io

Authors

Debidatta Dwibedi,Vidhi Jain,Jonathan Tompson,Andrew Zisserman,Yusuf Aytar

Journal

arXiv preprint arXiv:2403.12026

Published Date

2024/3/18
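
The length-conditioning idea in the FlexCap abstract above can be illustrated with a small sketch: one way to build length-conditioned training targets is to prepend a desired-length token to each region description. The prefix format and helper names below are hypothetical, not FlexCap's actual tokenization.

```python
# Illustrative sketch only (not the FlexCap code): length-conditioned captioning
# framed as prepending a requested-length token to the decoder input, so a model
# could learn to emit captions of the requested word count for a given box.
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class RegionCaption:
    box: Tuple[float, float, float, float]   # (x1, y1, x2, y2), normalised
    text: str


def make_length_conditioned_targets(regions: List[RegionCaption]) -> List[dict]:
    """Turn region descriptions of varying length into (condition, target) pairs."""
    examples = []
    for r in regions:
        n_words = len(r.text.split())
        examples.append({
            "box": r.box,
            # hypothetical conditioning prefix: box coordinates + requested length
            "prefix": f"<box {r.box}> <len {n_words}>",
            "target": r.text,
        })
    return examples


regions = [
    RegionCaption((0.1, 0.2, 0.4, 0.6), "dog"),
    RegionCaption((0.1, 0.2, 0.4, 0.6), "a brown dog lying on a striped sofa"),
]
for ex in make_length_conditioned_targets(regions):
    print(ex["prefix"], "->", ex["target"])
```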

Parallel video processing systems

Methods, systems, and apparatus, including computer programs encoded on computer storage media, for parallel processing of video frames using neural networks. One of the methods includes receiving a video sequence comprising a respective video frame at each of a plurality of time steps; and processing the video sequence using a video processing neural network to generate a video processing output for the video sequence, wherein the video processing neural network includes a sequence of network components, wherein the network components comprise a plurality of layer blocks each comprising one or more neural network layers, wherein each component is active for a respective subset of the plurality of time steps, and wherein each layer block is configured to, at each time step at which the layer block is active, receive an input generated at a previous time step and to process the input to generate a …

Published Date

2023/6/15
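
My reading of this patent abstract is a depth-parallel schedule: at each time step a layer block consumes the output its predecessor produced at the previous step, so blocks can run concurrently on different frames. The toy loop below sketches that schedule only; the block structure and names are assumptions, not the patented system.

```python
# Toy sketch of a depth-parallel schedule: at time step t, block i consumes the
# output that block i-1 produced at time step t-1, so all blocks could run
# concurrently on different frames.
import torch
import torch.nn as nn

num_blocks, dim = 3, 16
blocks = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_blocks))
frames = [torch.randn(dim) for _ in range(6)]            # incoming video frames

# pending[i] holds the input that block i will process at the current step
pending = [None] * num_blocks
outputs = []
for t in range(len(frames) + num_blocks):
    new_pending = [None] * num_blocks
    # Each active block processes its pending input ("in parallel"; sequential here for clarity)
    for i, block in enumerate(blocks):
        if pending[i] is not None:
            out = block(pending[i])
            if i + 1 < num_blocks:
                new_pending[i + 1] = out                 # hand off to the next block at t+1
            else:
                outputs.append(out)                      # final per-frame output
    if t < len(frames):
        new_pending[0] = frames[t]                       # first block receives the newest frame
    pending = new_pending

print(len(outputs))   # one output per frame, each lagging the input by num_blocks steps
```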

Look, Listen and Recognise: Character-Aware Audio-Visual Subtitling

The goal of this paper is automatic character-aware subtitle generation. Given a video and a minimal amount of metadata, we propose an audio-visual method that generates a full transcript of the dialogue, with precise speech timestamps, and the character speaking identified. The key idea is to first use audio-visual cues to select a set of high-precision audio exemplars for each character, and then use these exemplars to classify all speech segments by speaker identity. Notably, the method does not require face detection or tracking. We evaluate the method over a variety of TV sitcoms, including Seinfeld, Frasier and Scrubs. We envision this system being useful for the automatic generation of subtitles to improve the accessibility of the vast amount of videos available on modern streaming services. Project page: https://www.robots.ox.ac.uk/~vgg/research/look-listen-recognise/

Authors

Bruno Korbar,Jaesung Huh,Andrew Zisserman

Journal

arXiv preprint arXiv:2401.12039

Published Date

2024/1/22
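
The second stage described in the abstract above (classifying every speech segment by its nearest character exemplar) can be sketched in a few lines, assuming speaker embeddings for exemplars and segments already exist; the embeddings and character names below are toy data, not the paper's pipeline.

```python
# Sketch of nearest-exemplar speaker assignment, given precomputed embeddings.
import numpy as np

def assign_speakers(segment_embs, exemplar_embs, exemplar_names):
    """Label each speech segment with the character whose exemplar is most similar."""
    def normalise(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)
    sims = normalise(segment_embs) @ normalise(exemplar_embs).T   # cosine similarity
    return [exemplar_names[i] for i in sims.argmax(axis=1)]

rng = np.random.default_rng(0)
exemplars = rng.normal(size=(3, 192))               # one toy exemplar embedding per character
segments = exemplars[[0, 2, 1]] + 0.05 * rng.normal(size=(3, 192))
print(assign_speakers(segments, exemplars, ["Jerry", "Elaine", "George"]))
# expected: ['Jerry', 'George', 'Elaine']
```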

AutoAD III: The Prequel--Back to the Pixels

Generating Audio Description (AD) for movies is a challenging task that requires fine-grained visual understanding and an awareness of the characters and their names. Currently, visual language models for AD generation are limited by a lack of suitable training data, and also their evaluation is hampered by using performance measures not specialized to the AD domain. In this paper, we make three contributions: (i) We propose two approaches for constructing AD datasets with aligned video data, and build training and evaluation datasets using these. These datasets will be publicly released; (ii) We develop a Q-former-based architecture which ingests raw video and generates AD, using frozen pre-trained visual encoders and large language models; and (iii) We provide new evaluation metrics to benchmark AD quality that are well-matched to human performance. Taken together, we improve the state of the art on AD generation.

Authors

Tengda Han,Max Bain,Arsha Nagrani,Gül Varol,Weidi Xie,Andrew Zisserman

Journal

Conference on Computer Vision and Pattern Recognition (CVPR 2024)

Published Date

2024/4/22
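
The Q-former pattern mentioned in the AutoAD III abstract (learnable queries bridging a frozen visual encoder and a frozen LLM) can be sketched as follows; the module name, dimensions, and projection are my own assumptions rather than the paper's architecture.

```python
# Architectural skeleton only: a small set of learnable queries cross-attends to
# frozen visual features, and the resulting tokens are projected into the token
# space of a frozen language model that would generate the audio description.
import torch
import torch.nn as nn


class QFormerBridge(nn.Module):
    def __init__(self, num_queries=32, vis_dim=1024, llm_dim=2048):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, vis_dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(vis_dim, num_heads=8, batch_first=True)
        self.to_llm = nn.Linear(vis_dim, llm_dim)    # project into the LLM's token space

    def forward(self, frozen_video_feats):
        # frozen_video_feats: (B, num_patches, vis_dim) from a frozen visual encoder
        B = frozen_video_feats.shape[0]
        q = self.queries.unsqueeze(0).expand(B, -1, -1)
        fused, _ = self.cross_attn(q, frozen_video_feats, frozen_video_feats)
        return self.to_llm(fused)                    # (B, num_queries, llm_dim) prefix tokens


video_feats = torch.randn(2, 8 * 196, 1024)          # e.g. 8 frames x 196 patches, frozen
prefix = QFormerBridge()(video_feats)
print(prefix.shape)                                   # torch.Size([2, 32, 2048])
```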

N2F2: Hierarchical Scene Understanding with Nested Neural Feature Fields

Understanding complex scenes at multiple levels of abstraction remains a formidable challenge in computer vision. To address this, we introduce Nested Neural Feature Fields (N2F2), a novel approach that employs hierarchical supervision to learn a single feature field, wherein different dimensions within the same high-dimensional feature encode scene properties at varying granularities. Our method allows for a flexible definition of hierarchies, tailored to either the physical dimensions or semantics or both, thereby enabling a comprehensive and nuanced understanding of scenes. We leverage a 2D class-agnostic segmentation model to provide semantically meaningful pixel groupings at arbitrary scales in the image space, and query the CLIP vision-encoder to obtain language-aligned embeddings for each of these segments. Our proposed hierarchical supervision method then assigns different nested dimensions of the feature field to distill the CLIP embeddings using deferred volumetric rendering at varying physical scales, creating a coarse-to-fine representation. Extensive experiments show that our approach outperforms the state-of-the-art feature field distillation methods on tasks such as open-vocabulary 3D segmentation and localization, demonstrating the effectiveness of the learned nested feature field.

Authors

Yash Bhalgat,Iro Laina,João F Henriques,Andrew Zisserman,Andrea Vedaldi

Journal

arXiv preprint arXiv:2403.10997

Published Date

2024/3/16
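
One way to picture the nested supervision described in the N2F2 abstract above: prefixes of increasing length within a single rendered feature vector are each aligned with CLIP embeddings at a corresponding scale. The loss below is a hedged sketch of that idea under my reading of the abstract, with made-up dimensions and per-scale linear heads; it is not the N2F2 code.

```python
# Hedged sketch of nested-prefix distillation: each prefix of the rendered feature
# is mapped through its own head and pulled towards the CLIP embedding of the
# segment covering that pixel at the corresponding scale.
import torch
import torch.nn as nn
import torch.nn.functional as F

feat_dim, clip_dim = 256, 512
nested_dims = [64, 128, 256]                     # coarse -> fine prefixes of the field
heads = nn.ModuleList(nn.Linear(d, clip_dim) for d in nested_dims)

rendered = torch.randn(1024, feat_dim)           # rendered features for a batch of rays
clip_targets = torch.randn(len(nested_dims), 1024, clip_dim)  # toy per-scale CLIP embeddings

loss = 0.0
for level, (d, head) in enumerate(zip(nested_dims, heads)):
    pred = F.normalize(head(rendered[:, :d]), dim=-1)          # use only the first d dims
    target = F.normalize(clip_targets[level], dim=-1)
    loss = loss + (1 - (pred * target).sum(-1)).mean()          # cosine-distance distillation
print(float(loss))
```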

The Manga Whisperer: Automatically Generating Transcriptions for Comics

In the past few decades, Japanese comics, commonly referred to as Manga, have transcended both cultural and linguistic boundaries to become a true worldwide sensation. Yet, the inherent reliance on visual cues and illustration within manga renders it largely inaccessible to individuals with visual impairments. In this work, we seek to address this substantial barrier, with the aim of ensuring that manga can be appreciated and actively engaged with by everyone. Specifically, we tackle the problem of diarisation, i.e. generating a transcription of who said what and when, in a fully automatic way. To this end, we make the following contributions: (1) we present a unified model, Magi, that is able to (a) detect panels, text boxes and character boxes, (b) cluster characters by identity (without knowing the number of clusters a priori), and (c) associate dialogues to their speakers; (2) we propose a novel approach that is able to sort the detected text boxes in their reading order and generate a dialogue transcript; (3) we annotate an evaluation benchmark for this task using publicly available [English] manga pages. The code, evaluation datasets and the pre-trained model can be found at: https://github.com/ragavsachdeva/magi.

Authors

Ragav Sachdeva,Andrew Zisserman

Journal

arXiv preprint arXiv:2401.10224

Published Date

2024/1/18
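
To make the reading-order task in the Magi abstract above concrete, here is a naive geometric heuristic (right-to-left, top-to-bottom within panels); the paper proposes a learned ordering, which this simple baseline does not reproduce.

```python
# Naive heuristic sketch for manga reading order, included only to illustrate the task.
from typing import List, Tuple

Box = Tuple[float, float, float, float]   # (x1, y1, x2, y2)

def naive_reading_order(panels: List[Box], texts: List[Box]) -> List[int]:
    """Return indices of text boxes in approximate manga reading order."""
    def centre(b):
        return ((b[0] + b[2]) / 2, (b[1] + b[3]) / 2)

    def panel_of(t):
        cx, cy = centre(t)
        for i, (x1, y1, x2, y2) in enumerate(panels):
            if x1 <= cx <= x2 and y1 <= cy <= y2:
                return i
        return -1

    # Order panels top-to-bottom then right-to-left, and texts the same way inside each panel
    panel_rank = {i: r for r, i in enumerate(
        sorted(range(len(panels)), key=lambda i: (panels[i][1], -panels[i][0])))}
    return sorted(range(len(texts)),
                  key=lambda j: (panel_rank.get(panel_of(texts[j]), len(panels)),
                                 centre(texts[j])[1], -centre(texts[j])[0]))

panels = [(0, 0, 50, 40), (55, 0, 100, 40)]       # two panels on one row
texts = [(10, 5, 20, 10), (60, 5, 70, 10)]        # one text box in each
print(naive_reading_order(panels, texts))          # right panel read first -> [1, 0]
```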

Moving Object Segmentation: All You Need Is SAM (and Flow)

The objective of this paper is motion segmentation -- discovering and segmenting the moving objects in a video. This is a much studied area with numerous careful, and sometimes complex, approaches and training schemes including: self-supervised learning, learning from synthetic datasets, object-centric representations, amodal representations, and many more. Our interest in this paper is to determine if the Segment Anything model (SAM) can contribute to this task. We investigate two models for combining SAM with optical flow that harness the segmentation power of SAM with the ability of flow to discover and group moving objects. In the first model, we adapt SAM to take optical flow, rather than RGB, as an input. In the second, SAM takes RGB as an input, and flow is used as a segmentation prompt. These surprisingly simple methods, without any further modifications, outperform all previous approaches by a considerable margin in both single- and multi-object benchmarks. We also extend these frame-level segmentations to sequence-level segmentations that maintain object identity. Again, this simple model outperforms previous methods on multiple video object segmentation benchmarks.

Authors

Junyu Xie,Charig Yang,Weidi Xie,Andrew Zisserman

Journal

arXiv preprint arXiv:2404.12389

Published Date

2024/4/18
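
The second variant described in the abstract above (RGB SAM prompted by flow) can be sketched by deriving point prompts from high flow-magnitude pixels; the helper below is illustrative only, and the commented SAM calls assume the reference segment_anything package.

```python
# Sketch: turn optical flow into point prompts for a promptable segmenter.
import numpy as np

def flow_to_point_prompts(flow: np.ndarray, top_k: int = 3) -> np.ndarray:
    """Pick (x, y) prompt points where optical-flow magnitude is largest."""
    mag = np.linalg.norm(flow, axis=-1)                      # flow: (H, W, 2)
    ys, xs = np.unravel_index(np.argsort(mag, axis=None)[-top_k:], mag.shape)
    return np.stack([xs, ys], axis=1).astype(np.float32)     # SAM expects (x, y) order

flow = np.zeros((64, 64, 2), dtype=np.float32)
flow[20:30, 40:50] = [5.0, -2.0]                             # a patch of moving pixels
points = flow_to_point_prompts(flow)
print(points)

# With a SAM checkpoint available, the prompts would be used roughly like this:
# from segment_anything import sam_model_registry, SamPredictor
# predictor = SamPredictor(sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth"))
# predictor.set_image(rgb_frame)                             # HxWx3 uint8 frame
# masks, scores, _ = predictor.predict(point_coords=points,
#                                      point_labels=np.ones(len(points)))
```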

Professor FAQs

What is Andrew Zisserman's h-index at the University of Oxford?

Andrew Zisserman's h-index is 194 overall, and 120 counting only citations since 2020.

What are Andrew Zisserman's research interests?

Andrew Zisserman's research interests are Computer Vision and Machine Learning.

What is Andrew Zisserman's total number of citations?

Andrew Zisserman has 396,658 citations in total.

Who are Andrew Zisserman's co-authors?

Andrew Zisserman's co-authors include Philip Torr, Pietro Perona, Andrea Vedaldi, David Forsyth, Richard Hartley, and Josef Sivic.

Co-Authors

Philip Torr, University of Oxford (H-index: 131)

Pietro Perona, California Institute of Technology (H-index: 121)

Andrea Vedaldi, University of Oxford (H-index: 97)

David Forsyth, University of Illinois at Urbana-Champaign (H-index: 86)

Richard Hartley, Australian National University (H-index: 85)

Josef Sivic, Czech Technical University in Prague (H-index: 75)
