VQA: In the most common form of Visual Question Answering (VQA), the computer is presented with an image and a textual question about that image. It must then determine the correct answer, typically a few words or a short phrase. Variants include binary (yes/no) and multiple-choice settings, in which candidate answers are proposed. A major distinction between VQA and other tasks in computer vision is that the question to be answered is not determined until run time. In traditional problems such as segmentation or object detection, the single question to be answered by an algorithm is predetermined and only the input image changes. In VQA, in contrast, the form that the question will take is unknown, as is the set of operations required to answer it. In this sense, VQA more closely reflects the challenge of general image understanding.
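The task structure above can be sketched as follows. This is a minimal, hypothetical illustration of the open-ended and multiple-choice settings, not any benchmark's actual API: the function names, the answer vocabulary, and the toy word-overlap scorer (standing in for a learned model) are all invented for exposition.

```python
# Sketch of the two common VQA settings: open-ended answering selects from
# a fixed answer vocabulary; multiple-choice answering selects from
# candidates supplied alongside the question. All names are hypothetical.

def answer_open_ended(image_repr, question, answer_vocab, score_fn):
    """Open-ended setting: rank every answer in a fixed vocabulary."""
    return max(answer_vocab, key=lambda a: score_fn(image_repr, question, a))

def answer_multiple_choice(image_repr, question, candidates, score_fn):
    """Multiple-choice setting: the same scorer, but candidates are given."""
    return max(candidates, key=lambda a: score_fn(image_repr, question, a))

def toy_score(image_repr, question, answer):
    """Toy stand-in for a learned model: count word overlap between the
    question+answer and a dummy caption attached to the 'image'."""
    caption_words = set(image_repr["caption"].split())
    return sum(w in caption_words for w in (question + " " + answer).lower().split())

image = {"caption": "two dogs play on green grass"}
print(answer_open_ended(image, "what animals are shown",
                        ["dogs", "cats", "birds"], toy_score))   # -> dogs
```

Note that only the scorer changes between settings; this is why a single model is often trained once and evaluated in both, with the multiple-choice variant serving as an easier, more tightly constrained diagnostic.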
VLN: A robot that can carry out a natural-language instruction has been a dream since before the Jetsons cartoon series imagined a life of leisure mediated by a fleet of attentive robot helpers. It is a dream that remains stubbornly distant. However, recent advances in vision and language methods have made remarkable progress in closely related areas. This is significant because a robot interpreting a natural-language navigation instruction on the basis of what it sees is carrying out a vision and language process similar to Visual Question Answering. Both tasks can be interpreted as visually grounded sequence-to-sequence translation problems, and many of the same methods are applicable. To enable and encourage the application of vision and language methods to the problem of interpreting visually grounded navigation instructions, we provide the first benchmark dataset for visually grounded natural language navigation in real buildings.
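The sequence-to-sequence framing above can be sketched concretely: an instruction (a token sequence) is decoded into an action sequence, with each step grounded in what the agent currently observes. The toy agent below is a hypothetical illustration only; keyword matching over landmark names stands in for the learned encoder-decoder that real VLN systems use, and the list-of-landmarks "environment" is invented for exposition.

```python
# Toy visually grounded seq2seq: translate an instruction into actions,
# conditioning each step on the landmark currently "seen". The environment
# is a hypothetical linear path of landmark names; keyword matching stands
# in for a learned decoder.

def follow(instruction, path):
    """For each comma-separated clause, advance along `path` until the
    clause's landmark is visible, then handle any turn; finish with stop."""
    pos, trace = 0, []
    for clause in instruction.lower().split(","):
        words = clause.split()
        # Ground the clause: find a mentioned landmark that exists ahead.
        target = next((w for w in words if w in path), None)
        while target and pos + 1 < len(path) and path[pos] != target:
            pos += 1
            trace.append("forward")
        if "left" in words:
            trace.append("left")
        elif "right" in words:
            trace.append("right")
    trace.append("stop")
    return trace

path = ["start", "hallway", "door", "kitchen"]
print(follow("walk to the hallway, turn left, go to the kitchen", path))
# -> ['forward', 'left', 'forward', 'forward', 'stop']
```

The point of the sketch is the shared shape with VQA: both map a (visual context, token sequence) pair to an output sequence, which is why encoder-decoder architectures and attention mechanisms transfer between the two tasks.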