
Large Scale Audiovisual Learning of Sounds with Weakly Labeled Data

conference contribution
posted on 2024-11-03, 13:53 authored by Haytham Abokela, Anurag Kumar
Recognizing sounds is a key aspect of computational audio scene analysis and machine perception. In this paper, we advocate that sound recognition is inherently a multi-modal audiovisual task, in that sounds are easier to differentiate using both the audio and visual modalities than using either alone. We present an audiovisual fusion model that learns to recognize sounds from weakly labeled video recordings. The proposed fusion model utilizes an attention mechanism to dynamically combine the outputs of the individual audio and visual models. Experiments on AudioSet, a large-scale sound event dataset, demonstrate the efficacy of the proposed model, which outperforms single-modal models as well as state-of-the-art fusion and multi-modal models. We achieve a mean Average Precision (mAP) of 46.16 on AudioSet, outperforming the prior state of the art by approximately +4.35 mAP (relative: 10.4%).
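
The record itself contains no code. As a minimal sketch of the general idea described in the abstract (attention weights dynamically gating the predictions of separate audio and visual models), the following PyTorch module illustrates attention-based late fusion. All names, dimensions, and layer choices here are illustrative assumptions, not the authors' implementation; only the 527-class output matches AudioSet's label set.

import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    """Illustrative attention-weighted fusion of two modality predictions."""

    def __init__(self, audio_dim: int, visual_dim: int, num_classes: int):
        super().__init__()
        # Per-modality classifiers (hypothetical stand-ins for the paper's
        # audio and visual backbones).
        self.audio_head = nn.Linear(audio_dim, num_classes)
        self.visual_head = nn.Linear(visual_dim, num_classes)
        # Attention branches score each modality's reliability per class.
        self.audio_attn = nn.Linear(audio_dim, num_classes)
        self.visual_attn = nn.Linear(visual_dim, num_classes)

    def forward(self, audio_emb: torch.Tensor, visual_emb: torch.Tensor) -> torch.Tensor:
        # Per-modality class probabilities (multi-label tagging, so sigmoid).
        p_audio = torch.sigmoid(self.audio_head(audio_emb))     # (B, C)
        p_visual = torch.sigmoid(self.visual_head(visual_emb))  # (B, C)
        # Attention weights normalized across the two modalities per class.
        attn = torch.softmax(
            torch.stack([self.audio_attn(audio_emb),
                         self.visual_attn(visual_emb)], dim=0),
            dim=0)                                              # (2, B, C)
        # Convex combination of the two modality predictions.
        return attn[0] * p_audio + attn[1] * p_visual           # (B, C)

# Usage with dummy embeddings (batch of 4, assumed embedding sizes):
model = AttentionFusion(audio_dim=128, visual_dim=512, num_classes=527)
p = model(torch.randn(4, 128), torch.randn(4, 512))  # (4, 527) fused probabilities

Because the attention weights are computed per class, the fusion can lean on audio for some sound categories and on vision for others, which is the intuition behind combining the modalities rather than using either alone.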

History

Start page

558

End page

565

Total pages

8

Outlet

Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence (IJCAI-20)

Editors

Christian Bessiere

Name of conference

IJCAI-PRICAI 2020

Publisher

International Joint Conferences on Artificial Intelligence

Place published

United States

Start date

2021-01-07

End date

2021-01-15

Language

English

Copyright

Copyright © 2020 International Joint Conferences on Artificial Intelligence. All rights reserved.

Former Identifier

2006100750

Esploro creation date

2021-06-01