Vision and hearing are our primary channels of communication and sensation. Audio-visual collaboration helps humans better perceive and interpret the world, and we can readily associate mixed sounds with object instances in complicated real-world scenes. Imagine a cocktail-party scenario: when a group of people is speaking, we can not only locate the sound sources but also determine how many people are talking. Inspired by this human perception, we explore instance-level sound source localization in long videos and propose a new task, namely audio-visual instance segmentation (AVIS), which requires a model to simultaneously classify, segment, and track sounding object instances: identifying which objects are making sounds, inferring where the sounding objects are, and monitoring when they are making sounds. This new task facilitates a wide range of practical applications, including embodied robotics, virtual reality, video surveillance, and video editing. Moreover, it can serve as a fundamental task for evaluating the comprehension capabilities of multi-modal large models.
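For concreteness, the snippet below sketches one plausible way to represent an AVIS prediction, with each tracked instance carrying a category, per-frame masks, and per-frame sounding flags. The class and field names are illustrative assumptions, not the benchmark's official format.

```python
# A minimal sketch (not the official AVISeg format) of what an AVIS prediction
# could look like. All names and shapes here are illustrative assumptions.
from dataclasses import dataclass
import numpy as np


@dataclass
class SoundingInstance:
    instance_id: int      # identity kept consistent across frames
    category: str         # e.g. "person", "ukulele"
    masks: np.ndarray     # (T, H, W) boolean masks; all-false where the object is absent
    sounding: np.ndarray  # (T,) booleans: is this instance audible at frame t?


def summarize(instances: list[SoundingInstance]) -> None:
    """Print, for each instance, the frames where it is both visible and sounding."""
    for inst in instances:
        visible = inst.masks.any(axis=(1, 2))
        active = np.flatnonzero(visible & inst.sounding)
        print(f"#{inst.instance_id} {inst.category}: sounding in frames {active.tolist()}")
```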
Comparison of different audio-visual segmentation tasks. (a) Audio-Visual Object Segmentation (AVOS) only requires binary segmentation. (b) Audio-Visual Semantic Segmentation (AVSS) associates one category with every pixel. (c) Audio-Visual Instance Segmentation (AVIS) treats each sounding object of the same class as an individual instance.
Audio-visual instance segmentation is related to several existing tasks. For example, audio-visual object segmentation (AVOS) separates sounding objects from the background of a given audible video, as shown in Figure \ref{fig_intro} (a). Unlike AVOS, which is limited to binary foreground segmentation, audio-visual semantic segmentation (AVSS) aims to predict semantic maps that assign each pixel a specific category. To accomplish these tasks, many works extend image segmentation frameworks to the video domain and design various audio-visual fusion modules for sound source localization. Despite promising performance on the AVSBench dataset, current methods still suffer from two limitations in real-world scenarios. First, they fail to differentiate two sounding objects of the same category, such as the woman, the man, and the left and right ukuleles depicted in Figure \ref{fig_intro}. Second, they focus on 5- or 10-second trimmed short videos and ignore long-range temporal modeling, which may lead to weak performance in the real world.
One potential reason that the AVIS task is rarely studied is the absence of a high-quality dataset. Despite the existence of audio-visual segmentation datasets, none are directly applicable to our proposed task due to the lack of instance-level annotations and long-form videos. To explore audio-visual instance segmentation and evaluate the proposed methods, we create a new large-scale benchmark called AVISeg.
Our released AVISeg dataset satisfies the following criteria: 1) It focuses on long-term videos, bringing the benchmark much closer to real applications. 2) It contains 26 common sound categories spanning 4 dynamic scenarios: Music, Speaking, Machine, and Animal. 3) It covers challenging cases, such as videos with silent sound sources, a single sound source, and multiple sound sources.
Videos in AVISeg are publicly available on YouTube and annotated via crowdsourcing. Our dataset does not contain personally identifiable information or offensive content.
Illustrations of our AVISeg dataset statistics. (a) Ratio of different sound sources. (b) Number of videos in 4 real-world scenarios. (c) Distribution of video lengths. (d) Number of videos and objects for the 26 categories. (e) Relations between different categories.
All datasets and benchmarks on this page are copyrighted by us and published under the Creative Commons Attribution-NonCommercial 4.0 International License. This means that you must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use. You may not use the material for commercial purposes.
We introduce a new baseline model, termed AVISM, for the audio-visual instance segmentation task. Built upon Mask2Former and VITA, AVISM adopts a query-based Transformer architecture that learns a set of query vectors representing sounding objects for instance segmentation and tracking. To better model audio-visual semantic correlations in long and complicated videos, we present a frame-level audio-visual fusion module and a video-level audio-visual fusion module to integrate audio and visual features.
Overview of the proposed AVISM for audio-visual instance segmentation. (a) The frame-level sound source localizer segments sounding objects within each frame independently and condenses dense image features into frame queries. (b) The video-level sounding object tracker takes frame queries and audio features as input, and then performs temporal audio-visual communication across frames.
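To make the two fusion stages concrete, the sketch below shows one plausible way to realize them with standard cross-attention in PyTorch. Module names, tensor shapes, and hyper-parameters are our assumptions for illustration and do not reproduce the released AVISM implementation.

```python
# A hedged sketch of frame-level and video-level audio-visual fusion using
# standard PyTorch cross-attention. Shapes and names are assumptions.
import torch
import torch.nn as nn


class FrameLevelFusion(nn.Module):
    """Inject per-frame audio features into dense per-frame visual features."""
    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, visual: torch.Tensor, audio: torch.Tensor) -> torch.Tensor:
        # visual: (T, N, C) image tokens of each frame; audio: (T, 1, C) per-frame audio
        fused, _ = self.attn(query=visual, key=audio, value=audio)
        return self.norm(visual + fused)


class VideoLevelFusion(nn.Module):
    """Let condensed frame queries attend to the audio features of the whole clip."""
    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, frame_queries: torch.Tensor, audio: torch.Tensor) -> torch.Tensor:
        # frame_queries: (1, T*Q, C) queries gathered from all frames
        # audio: (1, T, C) audio features over the full video
        fused, _ = self.attn(query=frame_queries, key=audio, value=audio)
        return self.norm(frame_queries + fused)
```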
To study the contribution of different input modalities and validate the effectiveness of the proposed model, we conduct extensive ablations and compare with recent approaches.
Zero-shot results of different multi-modal large models for audio-referred visual grounding on the AVISeg test set.
Quantitative evaluation of different models from related tasks on the AVISeg test set. The best results are highlighted in bold.
Our AVISM model accurately localizes the sounding object across both spatial and temporal dimensions, e.g., ``lion'' in video (d). In complex scenes with multiple sound sources, our model is able to handle the numerous mixed semantics, e.g., ``person'' and ``ukulele'' in video (a). When an object begins producing sound in intermediate frames, AVISM segments it and assigns a new identity, as evidenced in video (b). This case also shows the effectiveness of our model in identifying and distinguishing objects with similar appearances or sounds. Moreover, if a sounding object disappears and reappears, AVISM still tracks it correctly, e.g., ``tree harvester'' in video (c).
Sample results of our baseline model on the AVISeg dataset from four scenarios: (a) Music; (b) Speaking; (c) Machine; (d) Animal. Each row shows six sampled frames from a video sequence. Zoom in to see details.