Bridging Vision and Language over Time with Neural Cross-modal Embeddings

Publication date: 1 June 2021
Giving computers the ability to fully comprehend an image is one of the main goals of computer vision. While humans excel at this task, it remains a challenge for machines, as it requires bridging vision and language, which inherently have distinct, heterogeneous computational representations. Cross-modal embeddings tackle this challenge by learning a common embedding space that unifies these representations. In a real-world setting, emerging events change the way humans interpret and describe the same visual element over time: e.g. depending on the time instant, images with wreckage may be described as “tsunami”, “flood”, “collapsed building”, etc. However, the temporal footprint of data interactions and its impact have been overlooked by state-of-the-art approaches, which are designed under a static-world model assumption. This research extends previous work by seeking models that capture patterns of visual and textual interactions over time. We present novel cross-modal embedding models that incorporate time: 1) in a relative manner, where pairwise temporal correlations aid data structuring, yielding a model that provides better visual-textual correspondences on dynamic corpora, and 2) in a diachronic manner, where the temporal dimension is fully preserved in the embedding space. With our diachronic approach, which captures the evolution of visual and language interactions, we were able to extract rich insights from a large-scale dataset spanning 20 years.
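For illustration only, the sketch below shows the general idea of a time-aware cross-modal embedding: pre-extracted image and text features are projected into a shared space, a learned embedding of a coarse time bin (e.g. the year) conditions both modalities, and a hinge-based triplet loss over in-batch negatives pulls matching image-text pairs together. All dimensions, the time-binning scheme, and the loss are assumptions for the example and are not taken from the work presented in this seminar.

```python
# Hypothetical sketch of a time-conditioned cross-modal embedding.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalCrossModalEmbedding(nn.Module):
    def __init__(self, img_dim=2048, txt_dim=300, emb_dim=256, num_time_bins=20):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, emb_dim)            # visual branch
        self.txt_proj = nn.Linear(txt_dim, emb_dim)            # textual branch
        self.time_emb = nn.Embedding(num_time_bins, emb_dim)   # e.g. one bin per year

    def forward(self, img_feat, txt_feat, time_bin):
        t = self.time_emb(time_bin)
        # Condition both modalities on the time bin, then L2-normalise.
        img = F.normalize(self.img_proj(img_feat) + t, dim=-1)
        txt = F.normalize(self.txt_proj(txt_feat) + t, dim=-1)
        return img, txt

def triplet_loss(img, txt, margin=0.2):
    """Hinge-based ranking loss over in-batch negatives."""
    sim = img @ txt.t()                      # cosine similarities
    pos = sim.diag().unsqueeze(1)            # matching pairs on the diagonal
    cost_txt = (margin + sim - pos).clamp(min=0)
    cost_img = (margin + sim - pos.t()).clamp(min=0)
    mask = torch.eye(sim.size(0), dtype=torch.bool)
    return cost_txt.masked_fill(mask, 0).mean() + cost_img.masked_fill(mask, 0).mean()

# Usage on random features: 4 image-text pairs from different years.
model = TemporalCrossModalEmbedding()
img_emb, txt_emb = model(torch.randn(4, 2048), torch.randn(4, 300),
                         torch.tensor([0, 5, 10, 19]))
loss = triplet_loss(img_emb, txt_emb)
loss.backward()
```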
Date: 29/01/2020
State: Concluded