Bridging Vision and Language over Time with Neural Cross-modal Embeddings
Giving computers the ability to fully comprehend an image is one of the main goals of computer vision. While humans excel at this task, it remains a challenge for machines, as it requires bridging vision and language, which inherently have distinct, heterogeneous computational representations. Cross-modal embeddings tackle this challenge by learning a common embedding space that unifies these representations. In real-world settings, emerging events change the way humans interpret and describe the same visual element over time: e.g. depending on the time instant, images of wreckage may be described as "tsunami", "flood", "collapsed building", etc. However, the temporal footprint of data interactions and its impact have been overlooked by state-of-the-art approaches, which are designed under a static-world model assumption. This research extends previous work by seeking models that capture patterns of visual and textual interactions over time. We present novel cross-modal embedding models that incorporate time: 1) in a relative manner, using pairwise temporal correlations to aid data structuring, yielding a model that provides better visual-textual correspondences on dynamic corpora; and 2) in a diachronic manner, where the temporal dimension is fully preserved in the embedding space. With the diachronic approach, which captures the evolution of visual and language interactions, we were able to extract rich insights from a large-scale dataset spanning 20 years.
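The core idea of a common embedding space can be illustrated with a minimal sketch: two modality-specific projections map image and text features into a shared space, trained with a bidirectional pairwise ranking loss so that matching image-caption pairs end up closer than mismatched ones. This is a generic cross-modal embedding recipe, not the specific temporal models of the thesis; the feature dimensions, margin value and random (untrained) projections are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed illustrative dimensions: 2048-d image features, 300-d text
# features, projected into a 128-d shared embedding space.
D_IMG, D_TXT, D_EMB = 2048, 300, 128

# Linear projections into the shared space (randomly initialised here;
# in practice these are learned by minimising the loss below).
W_img = rng.normal(scale=0.01, size=(D_IMG, D_EMB))
W_txt = rng.normal(scale=0.01, size=(D_TXT, D_EMB))

def embed(x, W):
    """Project features into the shared space and L2-normalise,
    so that dot products become cosine similarities."""
    z = x @ W
    return z / np.linalg.norm(z, axis=-1, keepdims=True)

def bidirectional_hinge_loss(img_feats, txt_feats, margin=0.2):
    """Pairwise ranking loss over a batch of matching (image, caption)
    pairs: each image should be closer to its own caption than to any
    other caption in the batch (and vice versa) by at least `margin`."""
    v = embed(img_feats, W_img)          # (B, D_EMB)
    t = embed(txt_feats, W_txt)          # (B, D_EMB)
    sim = v @ t.T                        # (B, B) cosine similarities
    pos = np.diag(sim)                   # matching-pair similarities
    cost_i2t = np.maximum(0.0, margin + sim - pos[:, None])  # image -> text
    cost_t2i = np.maximum(0.0, margin + sim - pos[None, :])  # text -> image
    np.fill_diagonal(cost_i2t, 0.0)      # ignore the positive pairs
    np.fill_diagonal(cost_t2i, 0.0)
    return (cost_i2t.sum() + cost_t2i.sum()) / len(sim)

# Toy batch of 4 matching image/caption feature pairs.
imgs = rng.normal(size=(4, D_IMG))
txts = rng.normal(size=(4, D_TXT))
loss = bidirectional_hinge_loss(imgs, txts)
```

A temporal variant, as in the abstract, would additionally condition these projections on time, e.g. by weighting pairs with pairwise temporal correlations or by learning a separate projection per time period.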
David Semedo is a Ph.D. candidate at NOVA University of Lisbon and a researcher in the Web and Media Search group (NOVA Search) at NOVA LINCS. In his Ph.D. thesis, advised by Prof. João Magalhães, also from NOVA LINCS, he addresses the semantic gap between visual and textual modalities, towards computationally bridging vision and language by modelling cross-modal data interactions over time. In particular, he focuses on Temporal Cross-modal Embeddings and Neural-based Representation Learning for Multimedia Understanding. His research interests are in multimodal machine learning, at the intersection of computer vision, natural language processing, neural networks and data mining. He holds a Master's degree in Computer Science and Engineering and has been a researcher in several national and international projects (COGNITUS, GoLocal, Smartyflow and Restrict to Plan).