Multi-Head Attention with Diversity for Learning Grounded Multilingual Multimodal Representations
conference contribution
posted on 2024-11-03, 14:39 authored by Po-Yao Huang, Xiaojun Chang, Alexander Hauptmann

With the aim of promoting and understanding multilingual image search, we leverage visual object detection and propose a model with diverse multi-head attention to learn grounded multilingual multimodal representations. Specifically, our model attends to different types of textual semantics in two languages and to visual objects, producing fine-grained alignments between sentences and images. We introduce a new objective function that explicitly encourages attention diversity in order to learn an improved visual-semantic embedding space. We evaluate our model on the German-Image and English-Image matching tasks of the Multi30K dataset, and on the Semantic Textual Similarity task with English descriptions of visual content. Results show that our model yields a significant performance gain over other methods on all three tasks.
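The abstract does not spell out the form of the diversity objective, so the following is only a minimal sketch of the general idea: standard multi-head cross-attention between sentence tokens and detected visual objects, plus a regularizer that penalizes pairwise similarity between the heads' attention maps so that heads attend to different content. The class and function names (DiverseMultiHeadAttention, attention_diversity_loss), the PyTorch framing, and the cosine-similarity penalty are illustrative assumptions, not the paper's exact method.

```python
# Illustrative sketch only; the penalty form is an assumption, not the
# paper's published objective.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DiverseMultiHeadAttention(nn.Module):
    """Cross-attention from text tokens (queries) to visual objects
    (keys/values), returning per-head attention maps for the penalty."""
    def __init__(self, dim, num_heads):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, query, context):
        # query: (B, Tq, D) text features; context: (B, Tk, D) object features
        B, Tq, _ = query.shape
        Tk = context.size(1)
        q = self.q_proj(query).view(B, Tq, self.num_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(context).view(B, Tk, self.num_heads, self.head_dim).transpose(1, 2)
        v = self.v_proj(context).view(B, Tk, self.num_heads, self.head_dim).transpose(1, 2)
        # Scaled dot-product attention per head: (B, H, Tq, Tk)
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.head_dim ** 0.5, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, Tq, -1)
        return self.out_proj(out), attn

def attention_diversity_loss(attn):
    """Penalize cosine similarity between different heads' attention maps,
    encouraging each head to attend to distinct content. attn: (B, H, Tq, Tk)."""
    B, H, _, _ = attn.shape
    flat = F.normalize(attn.reshape(B, H, -1), dim=-1)   # unit-norm per head
    sim = flat @ flat.transpose(1, 2)                    # (B, H, H) similarities
    off_diag = sim - torch.eye(H, device=attn.device)    # drop self-similarity
    return off_diag.pow(2).sum(dim=(1, 2)).mean() / max(H * (H - 1), 1)

# Hypothetical usage: fuse sentence tokens with detected objects and add
# the diversity penalty to the cross-modal matching loss.
mha = DiverseMultiHeadAttention(dim=512, num_heads=4)
text = torch.randn(2, 10, 512)   # e.g. token features for one language
objs = torch.randn(2, 36, 512)   # e.g. detected visual object features
fused, attn = mha(text, objs)
loss = attention_diversity_loss(attn)
```

In this framing, the penalty would be applied to the attention over each language's sentences as well as over visual objects, alongside the usual ranking objective for the visual-semantic embedding space.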
History
Start page: 1461
End page: 1467
Total pages: 7
Outlet: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP 2019)
Name of conference: EMNLP-IJCNLP 2019
Publisher: Association for Computational Linguistics
Place published: United States
Start date: 2019-11-03
End date: 2019-11-07
Language: English
Copyright: © 2019 Association for Computational Linguistics
Former Identifier: 2006109380
Esploro creation date: 2021-08-29