Ziyan Yang 1*, Leticia Pinto-Alva 2, Franck Dernoncourt 3 and Vicente Ordonez 1,4

1 Department of Computer Science, University of Virginia, Charlottesville, VA, United States.
2 Department of Computer Science, Universidad Católica San Pablo, Arequipa, Perú.
3 Adobe Research, San José, CA, United States.
4 Department of Computer Science, Rice University, Houston, TX, United States.

People are able to describe images using thousands of languages, but languages share only one visual world. The aim of this work is to use the learned intermediate visual representations from a deep convolutional neural network to transfer information across languages for which paired data is not available in any form. Our work proposes using backpropagation-based decoding coupled with transformer-based multilingual-multimodal language models to obtain translations between any of the languages used during training. We particularly show the capabilities of this approach in the translation of German-Japanese and Japanese-German sentence pairs, given training data of images freely associated with text in English, German, and Japanese, but for which no single image contains annotations in both Japanese and German. Moreover, we demonstrate that our approach is also generally useful in the multilingual image captioning task when sentences in a second language are available at test time. The results of our method also compare favorably on the Multi30k dataset against recently proposed methods that also aim to leverage images as an intermediate source of translations.

Learning a new language is a difficult task for humans, as it involves significant repetition and internalization of the association between words and concepts. People can effortlessly associate the visual stimulus of an apple sitting on top of a table with either the word “apple” in English or 苹果 in Chinese. However, current machine translation models usually learn these mappings between languages from large amounts of parallel multilingual text-only data. In the past few years, there have been several efforts to take advantage of images to discover and enhance connections across different languages (Gella et al., 2017; Nakayama and Nishida, 2017; Elliott and Kádár, 2017). While some works have exploited alignments at the word level (Bergsma and Van Durme, 2011; Hewitt et al., 2018), recent work has moved toward finding alignments between complex sentences (Barrault et al., 2018; Surís et al., 2020; Sigurdsson et al., 2020; Yang et al., 2020). Multimodal machine translation aims to build word associations grounded in the visual world; however, there are still some challenges. For instance, multimodal machine translation is most effective when images are provided on top of parallel text, where the images enhance traditional machine translation corpora, and a second limitation is that translation models are still required for every language pair even when there is a single common visual representation. The present work significantly extends our prior work on backpropagation-based decoding (Yang et al., 2020), which used LSTMs for language pairs. Instead, we adapt transformer-based decoders for language triplets and beyond. Since our proposed approach does not train models to associate parallel texts (we do not use a language encoder), it does not require access to parallel text for any specific language pair, as long as enough images with text in each target language are available. We show in Figure 1 two sample images used during training, where the first image has two captions associated with it, one in German and another in English, and the second image has one caption in English and another in Japanese. Notice how the English caption for the first image is “A black and white dog leaps to catch a frisbee,” but the color of the dog is not mentioned by the German caption for the same image.
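To make the core mechanism concrete, the sketch below illustrates one way backpropagation-based decoding through a shared visual feature space can work: a per-language transformer decoder scores captions conditioned on a single visual feature, that feature is optimized by gradient descent so the source sentence becomes likely under the source-language decoder, and a target-language sentence is then generated from the optimized feature. This is a minimal, hypothetical PyTorch-style illustration of the idea rather than the authors' implementation; the names (`VisualPivotDecoder`, `translate_via_visual_pivot`), the feature dimension, the special token ids, the optimizer settings, and the greedy decoding loop are all assumptions made here for clarity.

```python
import torch
import torch.nn as nn


class VisualPivotDecoder(nn.Module):
    """Transformer decoder that scores a caption conditioned on one visual feature.

    Illustrative sketch only; not the authors' released architecture.
    """

    def __init__(self, vocab_size, feat_dim=2048, d_model=512, n_layers=4, n_heads=8):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.feat_proj = nn.Linear(feat_dim, d_model)  # CNN feature -> one "memory" token
        layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens, visual_feat):
        # tokens: (batch, seq) token ids; visual_feat: (batch, feat_dim)
        memory = self.feat_proj(visual_feat).unsqueeze(1)             # (batch, 1, d_model)
        seq_len = tokens.size(1)
        causal = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)
        hidden = self.decoder(self.embed(tokens), memory, tgt_mask=causal)
        return self.lm_head(hidden)                                    # (batch, seq, vocab)


def translate_via_visual_pivot(src_decoder, tgt_decoder, src_tokens,
                               feat_dim=2048, steps=100, lr=0.1, max_len=30):
    """Optimize a pivot feature so src_tokens is likely under the source-language
    decoder, then greedily generate a target-language sentence from that pivot."""
    for model in (src_decoder, tgt_decoder):
        model.requires_grad_(False)                    # trained decoders stay frozen

    pivot = torch.zeros(1, feat_dim, requires_grad=True)
    optimizer = torch.optim.Adam([pivot], lr=lr)
    loss_fn = nn.CrossEntropyLoss()

    # Backpropagation-based decoding: only the pivot feature is updated.
    for _ in range(steps):
        optimizer.zero_grad()
        logits = src_decoder(src_tokens[:, :-1], pivot)
        loss = loss_fn(logits.reshape(-1, logits.size(-1)),
                       src_tokens[:, 1:].reshape(-1))
        loss.backward()
        optimizer.step()

    # Greedy decoding in the target language, conditioned on the optimized pivot.
    bos_id, eos_id = 1, 2                              # assumed special token ids
    out = torch.tensor([[bos_id]])
    for _ in range(max_len):
        next_tok = tgt_decoder(out, pivot.detach())[:, -1].argmax(-1, keepdim=True)
        out = torch.cat([out, next_tok], dim=1)
        if next_tok.item() == eos_id:
            break
    return out
```

In this reading of the approach, each language's decoder would be trained independently on image-caption pairs in that language (with the visual feature supplied by the convolutional network), so at translation time no image and no parallel text are needed; the optimized feature serves as the pivot between languages.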