Making AI Better Understand the Physical World: New Multimodal Segmentation Method Proposed by Renmin University, Beijing University of Posts and Telecommunications, Shanghai AI Lab

Author Info

Amara Okonkwo

Robotics & Embodied AI Editor

M.Eng. Robotics (Imperial College London); former field applications engineer

Amara covers humanoids, industrial automation, and simulation-to-real transfer. She interviews practitioners about safety cases, unit economics, and dataset quality rather than demo videos alone. Her reviews call out what is lab-only versus commercially deployed.


A new approach lets AI locate objects of interest using multimodal cues, much as humans do.

Researchers from Renmin University’s GeWu-Lab, Beijing University of Posts and Telecommunications (BUPT), Shanghai AI Laboratory, and other institutions have proposed Ref-AVS (Refer and Segment Objects in Audio-Visual Scenes). This method enables AI to see, hear, and better understand the real physical world.

The related paper has been accepted by ECCV 2024, a top-tier conference in computer vision.

[Figure 2]

For example, in the image below, how can a machine accurately locate the person actually playing an instrument?

[Figure 3]

Single-modal analysis is not enough here, yet existing research has largely worked this way, treating visual, textual, and audio cues independently:

  • Video Object Segmentation (VOS): Typically uses an object mask from the first frame as a reference to guide the segmentation of specific objects in subsequent frames. (Heavily relies on precise annotations in the first frame.)
  • Referring Video Object Segmentation (Ref-VOS): Segments objects in videos based on natural language descriptions, replacing the mask annotations used in VOS. (Text is easier to provide than a first-frame mask, but the method takes no account of audio.)
  • Audio-Visual Segmentation (AVS): Uses audio as guidance to segment sound-emitting objects in videos. (Cannot handle silent objects.)

The new Ref-AVS method integrates relationships across multiple modalities (text, audio, and vision) to adapt to more realistic dynamic audio-visual scenes.

Now, individuals singing while playing the guitar can be easily identified.

[Figure 4]

The same clip can also be queried again with a different expression, for example to identify which guitar is currently being played.

[Figure 5]

The researchers also built a dataset named Ref-AVS Bench and designed an end-to-end framework to process multimodal cues efficiently.

The details are as follows.

Construction of the Ref-AVS Bench Dataset

In total, the Ref-AVS Bench dataset comprises 40,020 video frames, containing 6,888 objects and 20,261 reference expressions.

Each data point includes audio corresponding to the video frame and provides pixel-level annotations for each frame.
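To make the shape of one data point concrete, here is a minimal sketch of a sample record. The class name, field names, and file names are hypothetical and chosen for illustration only; the article does not describe the dataset's actual file layout.

```python
from dataclasses import dataclass
from typing import List

Mask = List[List[int]]  # binary mask: rows of 0/1 pixel labels


@dataclass
class RefAVSSample:
    """Hypothetical record layout; field names are not taken from the paper."""
    video_id: str
    expression: str          # the reference expression for the target object
    audio_path: str          # audio corresponding to the video frames
    frame_paths: List[str]   # one entry per annotated frame
    masks: List[Mask]        # one pixel-level mask per frame

    def is_consistent(self) -> bool:
        # Every frame must come with its own pixel-level annotation.
        return len(self.frame_paths) == len(self.masks)


sample = RefAVSSample(
    video_id="clip_0001",
    expression="the guitar that is being played",
    audio_path="clip_0001.wav",
    frame_paths=["f0.jpg", "f1.jpg"],
    masks=[[[0, 1], [0, 0]], [[0, 1], [0, 1]]],
)
```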

To ensure diversity in the referenced objects, the team selected 52 categories in total: 48 categories of sound-emitting objects, 3 categories of static, non-sound-emitting objects, and a background category.

During video collection, all videos were sourced from YouTube and clipped to 10 seconds.

Throughout the manual collection process, the team deliberately avoided videos with the following characteristics:

a) Videos containing a large number of identical semantic instances;
b) Videos with extensive editing and camera perspective switches;
c) Synthetic or unrealistic videos created through post-production.

Additionally, to improve consistency with real-world distributions, the team selected videos that contributed to diversity in scene types within the dataset.

For instance, this includes videos involving interactions between multiple objects (such as instruments, people, vehicles, etc.).

[Figure 6]

Furthermore, diversity in expressions is one of the core elements in constructing the Ref-AVS dataset.

Beyond their inherent textual semantics, expressions draw on three dimensions: auditory, visual, and temporal.

The auditory dimension includes features such as volume and rhythm, while the visual dimension encompasses attributes like object appearance and spatial properties.

The team also utilized temporal cues to generate references with sequential hints, such as “(object) that makes sound first” or “(object) that appears later.”
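As an illustration of how a temporal cue such as "(object) that makes sound first" could be resolved, the sketch below picks the object with the earliest sound onset. The per-second activity flags and the function name are assumptions made for this example, not part of the dataset tooling.

```python
from typing import Dict, List, Optional


def first_to_sound(activity: Dict[str, List[bool]]) -> Optional[str]:
    """Return the object whose sound starts earliest, or None if unclear.

    `activity` maps an object name to per-second flags (True = sounding).
    """
    # Onset = index of the first second in which the object makes sound.
    onsets = {name: flags.index(True)
              for name, flags in activity.items() if True in flags}
    if not onsets:
        return None  # nothing makes sound at all
    earliest = min(onsets.values())
    winners = [name for name, t in onsets.items() if t == earliest]
    # A tie is ambiguous, so no object is returned in that case.
    return winners[0] if len(winners) == 1 else None


clip = {
    "guitar": [False, False, True, True],
    "piano": [False, True, True, True],
    "chair": [False, False, False, False],
}
tied = {"guitar": [True, False], "piano": [True, True]}
```

Here `first_to_sound(clip)` selects the piano, while the tied case yields no answer, since a sequential hint only helps when the order is unambiguous.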

By integrating auditory, visual, and temporal information, the researchers designed rich expressions that not only accurately reflect multimodal scenes but also meet users’ specific needs for precise referencing.

[Figure 7]

The accuracy of expressions is another core focus.

The team followed three rules to generate high-quality expressions:

  1. Uniqueness: An expression must refer to a single unique object and cannot simultaneously point to multiple objects.
  2. Necessity: While complex expressions can be used to identify objects, every adjective in the sentence should narrow down the scope of the target object, avoiding unnecessary or redundant descriptions.
  3. Clarity: Some expression templates involve subjective factors, such as “the louder __.” Such expressions should only be used when the situation is clear enough to avoid ambiguity.
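The first two rules can be checked mechanically once a scene is described by attributes. The following is a minimal sketch under that assumption: objects are plain dictionaries, an expression is a set of attribute constraints, and all names are hypothetical. It is not the team's annotation tooling.

```python
from typing import Dict, List

Obj = Dict[str, object]


def candidates(objects: List[Obj], constraints: Obj) -> List[str]:
    """Names of all objects that satisfy every constraint."""
    return [o["name"] for o in objects
            if all(o.get(k) == v for k, v in constraints.items())]


def is_unique(objects: List[Obj], constraints: Obj) -> bool:
    # Uniqueness: the expression must point to exactly one object.
    return len(candidates(objects, constraints)) == 1


def is_necessary(objects: List[Obj], constraints: Obj) -> bool:
    # Necessity: dropping any single constraint must widen the match set.
    full = len(candidates(objects, constraints))
    for key in constraints:
        reduced = {k: v for k, v in constraints.items() if k != key}
        if len(candidates(objects, reduced)) <= full:
            return False
    return True


scene = [
    {"name": "guitar_left", "category": "guitar", "sounding": True, "position": "left"},
    {"name": "guitar_right", "category": "guitar", "sounding": False, "position": "right"},
    {"name": "singer", "category": "person", "sounding": True, "position": "left"},
]

good = {"category": "guitar", "sounding": True}
redundant = {"category": "guitar", "sounding": True, "position": "left"}
ambiguous = {"category": "guitar"}
```

In this toy scene, `good` passes both checks, `redundant` is unique but fails the necessity check because "left" adds nothing, and `ambiguous` matches two guitars. The Clarity rule concerns subjective judgments and is not modeled here.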

The team divided each 10-second video into ten equal one-second segments. They utilized Grounding SAM to segment and label key frames, then required annotators to manually check and correct these key frames.

This process enabled the team to generate masks and labels for multiple target objects within key frames.

Once the masks for key frames were determined, tracking algorithms were applied to follow the target objects, obtaining the final mask labels (Ground Truth Masks) for the targets across the 10-second span.

Regarding data splitting and statistics, videos in the test set and their corresponding annotations underwent careful review and correction by trained annotators.

To comprehensively evaluate model performance on the Ref-AVS task, the test set was further divided into three distinct subsets.

[Figure 8]

Specifically, the three test subsets include:

  • Seen Subset: Includes object categories that appeared in the training set. This subset was established to evaluate the model’s baseline performance.
  • Unseen Subset: Specifically designed to assess the model’s generalization ability in unseen audio-visual scenarios.
  • Null Subset: Tests the model’s robustness against null references, where the expression is unrelated to any object in the video.
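The three-way split can be sketched as a simple routing function. This is an illustration only: it assumes each test sample is labeled with the category of its referred object (or `None` when the expression matches nothing), and that "unseen" means the category is absent from the training set. The function and category names are hypothetical.

```python
from typing import Optional, Set


def assign_test_subset(target_category: Optional[str],
                       train_categories: Set[str]) -> str:
    """Route a test sample to the seen, unseen, or null subset."""
    if target_category is None:
        # Null reference: the expression is unrelated to any object in the video.
        return "null"
    if target_category in train_categories:
        return "seen"
    return "unseen"


train_categories = {"guitar", "piano", "person"}
```

For example, a sample referring to a guitar lands in the seen subset, one referring to a category outside `train_categories` lands in the unseen subset, and a sample with no valid target lands in the null subset.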

Implementation Details

After preparing the dataset, the team proposed a baseline called Expression Enhancing with Multimodal Cues (EEMC), which uses multimodal cues to enrich the reference expression and improve audio-visual referring segmentation.

[Figure 9]

Specifically, in the Temporal Bi-Modal Transformer module, the team fused the audio and visual features ($F_V$, $F_A$), which carry temporal information, with the textual features ($F_T$).

Note: To help the model better perceive temporal information, the study proposes an intuitive Cached Memory mechanism ($C_V$, $C_A$).

Cached memory stores the time-averaged modality features from the beginning up to the current moment to capture the magnitude of change in multimodal information over time. The multimodal features ($Q_V$, $Q_A$) are calculated as follows:

[Figure 10: equation for computing $Q_V$ and $Q_A$]

Here $t$ denotes a time step in the sequence, and $\alpha$ is an adjustable hyperparameter that controls the model’s sensitivity to changes in the features over the course of the sequence.

When the current audio or visual features do not change significantly compared to the mean of past features, the output features remain almost unchanged.

However, when the change is more pronounced, the cached memory amplifies the difference in the current features, producing more distinctive output features.
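The behavior described above can be illustrated with a small sketch. The exact formula appears only in the paper's figure, so the form used here is an assumption that matches the prose: the cache $C^t$ is the running mean of the features up to step $t$, and the output is $Q^t = F^t + \alpha\,(F^t - C^t)$. The function name is hypothetical.

```python
from typing import List


def cached_memory_enhance(features: List[List[float]],
                          alpha: float = 1.0) -> List[List[float]]:
    """Amplify each step's deviation from the running mean of past features.

    Assumed form (not confirmed by the source): Q_t = F_t + alpha * (F_t - C_t),
    where C_t is the mean of F_1..F_t.
    """
    if not features:
        return []
    running_sum = [0.0] * len(features[0])
    enhanced = []
    for t, f_t in enumerate(features, start=1):
        running_sum = [s + x for s, x in zip(running_sum, f_t)]
        cache = [s / t for s in running_sum]  # time-averaged features so far
        enhanced.append([x + alpha * (x - c) for x, c in zip(f_t, cache)])
    return enhanced


# Constant features: nothing deviates from the mean, so output equals input.
steady = cached_memory_enhance([[1.0, 2.0]] * 3, alpha=2.0)

# A sudden change at the last step is amplified: 3 + 2 * (3 - 1) = 7.
burst = cached_memory_enhance([[0.0], [0.0], [3.0]], alpha=2.0)
```

With this form, a larger `alpha` makes the output more sensitive to sudden changes in the audio or visual stream, while `alpha = 0` returns the features unchanged.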

Subsequently, the concatenated multimodal features are fed into the Multimodal Integration Transformer module for fusion, generating the final feature ($Q_M$) of the reference expression containing multimodal information as input to the mask decoder.

The mask decoder is a segmentation foundation model based on Transformer architecture, such as MaskFormer, Mask2Former, or SAM.

The team selected Mask2Former as the segmentation foundation model, using its pre-trained mask queries as $Q_{mask}$, and treating the multimodal reference expression features as $K$ and $V$.

A cross-attention transformer (CATF) transfers the multimodal reference expression features into the mask queries, thereby enabling the segmentation foundation model to perform segmentation based on multimodal features.
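The core operation can be sketched as single-head scaled dot-product cross-attention, with the mask queries as queries and the multimodal expression features as keys and values. This is a simplified illustration in plain Python: the learned projections, multiple heads, and residual connections of a real transformer layer are omitted, and the function names are hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]


def softmax(row: List[float]) -> List[float]:
    peak = max(row)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries: Matrix, keys: Matrix, values: Matrix) -> Matrix:
    """Each query attends over all keys and returns a weighted sum of values."""
    scale = math.sqrt(len(keys[0]))
    width = len(values[0])
    output = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        output.append([sum(w * v[j] for w, v in zip(weights, values))
                       for j in range(width)])
    return output


# Two mask queries attend over three multimodal expression tokens (K = V).
q_mask = [[0.0, 0.0], [4.0, 0.0]]
q_m = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
updated = cross_attention(q_mask, q_m, q_m)
```

The all-zero query attends uniformly and receives the mean of the expression tokens, while the second query is pulled toward tokens with a large first component. This is how information from the expression features flows into the mask queries.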

Experimental Results

In quantitative experiments, the team compared the proposed baseline with existing methods. For a fair comparison, the modality missing from each existing method was supplied to it as additional input.

Test results on the Seen subset show that the new Ref-AVS method outperforms other methods.

On the Unseen and Null subsets, Ref-AVS also generalized well and accurately followed the reference expressions.

[Figure 11]

In qualitative experiments, the team visualized segmentation masks on the Ref-AVS Bench test set and compared them with AVSegFormer and ReferFormer.

[Figure 12]

The results show that neither ReferFormer (a Ref-VOS method) nor AVSegFormer (an AVS method) accurately segmented the objects described in the expressions.

Specifically, AVSegFormer struggled with understanding expressions, often directly generating masks for sound sources.

For example, in the bottom-left sample, AVSegFormer incorrectly segmented the vacuum cleaner as the target instead of the boy.

ReferFormer, on the other hand, may fail to fully comprehend the audio-visual scene, misidentifying a toddler as the piano player, as shown in the top-right sample.

In contrast, the Ref-AVS method demonstrated superior capabilities, handling both multimodal expressions and scenes simultaneously, thus accurately understanding user instructions and segmenting target objects.

Future work could explore higher-quality multimodal fusion techniques, real-time inference, and larger, more diverse datasets, with the aim of applying multimodal referring segmentation to video analysis, medical image processing, autonomous driving, and robot navigation.

For more details, please refer to the original paper.
