2019 Annual Report

The ability to process vision takes up more of the human brain than any other function, and language is our primary means of communication. Any robot that is going to communicate flexibly about the world with a human will inevitably need to relate vision and language in much the same way that humans do.

Team Members

Anton van den Hengel

University of Adelaide, Australia

Prof van den Hengel and his team have developed world-leading methods across computer vision and machine learning, including methods that have placed first on a variety of international leaderboards, such as PASCAL VOC (2015 & 2016), Cityscapes (2016 & 2017), Virginia Tech VQA (2016 & 2017), and the Microsoft COCO Captioning Challenge (2016).

Prof van den Hengel’s team placed 4th in the ImageNet detection challenge in 2015, ahead of Google, Intel, Oxford, CMU and Baidu, and 2nd in ImageNet Scene Parsing in 2016. ImageNet is one of the most hotly contested challenges in computer vision.

Stephen Gould

Australian National University (ANU), Australia

Stephen Gould is a Professor in the Research School of Computer Science in the College of Engineering and Computer Science at the Australian National University. He received his BSc degree in mathematics and computer science and his BE degree in electrical engineering from the University of Sydney in 1994 and 1996, respectively. He received his MS degree in electrical engineering from Stanford University in 1998. He then worked in industry for a number of years, during which he co-founded Sensory Networks, which was sold to Intel in 2013. In 2005 he returned to PhD studies and earned his PhD degree from Stanford University in 2010. His research interests include computer and robotic vision, machine learning, probabilistic graphical models, deep learning and optimization.

In 2017, Steve spent a year in Seattle leading a team of computer vision researchers and engineers at Amazon before returning to Australia in 2018.

Chunhua Shen

University of Adelaide, Australia

Chunhua Shen is a Professor at the School of Computer Science, University of Adelaide. He is also an adjunct Professor of Data Science and AI at Monash University.

Prior to that, he was with the computer vision program at NICTA (National ICT Australia), Canberra Research Laboratory, for about six years. His research interests lie at the intersection of computer vision and statistical machine learning. He studied at Nanjing University and the Australian National University, and received his PhD degree from the University of Adelaide. From 2012 to 2016, he held an Australian Research Council Future Fellowship. He is an Associate Editor (AE) of Pattern Recognition and IEEE Transactions on Circuits and Systems for Video Technology, and has served as an AE for several other journals, including IEEE Transactions on Neural Networks and Learning Systems.

Anthony Dick

University of Adelaide, Australia

Anthony is an Associate Professor at The University of Adelaide’s School of Computer Science. He holds a Bachelor of Mathematics and Computer Science (Hons) from the University of Adelaide and received his PhD from the University of Cambridge in 2001. Anthony’s research area is computer vision: the problem of teaching computers how to see. He is particularly interested in tracking many people or objects at once, and in building 3D models from video. His work has attracted over 2,400 citations, with an h-index of 24.

Yuankai Qi

University of Adelaide, Australia

Yuankai is a postdoctoral research fellow based at The University of Adelaide. He is working with Prof. Anton van den Hengel and Dr. Qi Wu. Yuankai received his B.E., M.S., and Ph.D. degrees from Harbin Institute of Technology in 2011, 2013 and 2018, respectively.

His research focuses on computer vision tasks, especially visual object tracking and instance-level video segmentation. He is currently working on the vision-and-language navigation task.

Hui Li

University of Adelaide, Australia

Hui Li completed her PhD in 2018 at The University of Adelaide under the supervision of Chief Investigator Chunhua Shen and Associate Investigator Qi Wu. She received a Dean’s Commendation for Doctoral Thesis Excellence. Hui’s research interests include visual question answering, text detection and recognition, car license plate detection and recognition, and deep learning techniques. She became a Research Fellow with the Centre in July 2018.

Violetta Shevchenko

University of Adelaide, Australia

Violetta joined the Centre as a PhD researcher in 2018. She received her Bachelor’s degree in Computer Science from Southern Federal University, Russia, in 2015. She then took part in a double-degree program with Lappeenranta University of Technology in Finland, where she completed her Master’s in Computational Engineering in 2017. Her research interests lie in computer vision and deep learning and, in particular, in solving the task of visual question answering.

Yicong Hong

Australian National University (ANU), Australia

Yicong completed his Bachelor of Engineering at the Australian National University in 2018, majoring in Mechatronic Systems. He was a research student at Data61/CSIRO from 2017 to 2018, working on his honours project on human shape and pose estimation. Yicong joined the Centre as a PhD researcher in 2019 under the supervision of Chief Investigator Professor Stephen Gould. His research interests include visual grounding and textual grounding problems, and he is currently working on the Centre’s Vision and Language research project.

Zheyuan ‘David’ Liu

Australian National University (ANU), Australia

David graduated from the Australian National University in 2018 with first-class honours in a Bachelor of Engineering (Research and Development), majoring in Electronics and Communication Systems and minoring in Mechatronic Systems. David joined the Centre in 2019 as a PhD student at ANU under the supervision of Chief Investigator Professor Stephen Gould. His research interests centre on vision-and-language tasks in deep learning, particularly visual grounding and reasoning.

Sam Bahrami

University of Adelaide, Australia

Sam is a Research Programmer with a software engineering background, having worked at technology and defence companies throughout Australia. He is working on implementing solutions for novel navigation behaviour in robots and self-driving cars based on deep machine learning.

Sam has a Bachelor of Engineering (Electrical & Electronic) Honours and a Bachelor of Mathematics & Computer Science from the University of Adelaide.

Project Aim

The ability to process vision takes up more of the human brain than any other function, and language is our primary means of communication. Any robot that is going to communicate flexibly about the world with a human will inevitably need to relate vision and language in much the same way that humans do. It is not that this is the best way to sense or communicate, only that it is the human way, and communicating with humans is central to what robots need to be able to do.

This project uses technology developed for vision and language purposes to develop capabilities relevant to visual robotics. This is more than just Visual Question Answering (VQA) for robots or Dialogue for Tasking. It includes questions of what needs to be learned, stored, and reasoned over for a robot to be able to carry out a general task specified by a human through natural language.


Key Results

In 2019, the project team extended the Room-to-Room dataset to evaluate a robot’s ability to identify a specific household object, such as a cup, spoon or pillow, in another room. This produced a new task and dataset called REVERIE (Remote Embodied Visual Referring Expression in Real Indoor Environments), which has been submitted to the 2020 Conference on Computer Vision and Pattern Recognition (CVPR). This is a significant step towards the ‘Bring me a spoon’ challenge, and an important extension of the existing dataset because it develops the challenge from merely navigating to the right location to also identifying a specific object in that location. The project team plans to complete the ‘Bring me a spoon’ challenge (in simulation) in 2020.

The team proposed an Object-and-Action Aware Model for Robust Visual-and-Language Navigation (VLN). VLN is unique in that it requires turning relatively general natural language commands into actions on the basis of the visible environment. This requires extracting value from two very different types of natural language information: action specifications (describing movements the robot must achieve) and object descriptions (specifying items visible in the environment). The proposed approach is to process these two different forms of natural-language instruction separately. This research is important because a robot can perform the correct actions only after it fully understands both the visual and semantic information. The work has been submitted to the Annual Meeting of the Association for Computational Linguistics (ACL 2020).
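
As a rough illustration (not the team’s implementation), the sketch below encodes the action-related and object-related parts of an instruction with two independent branches before fusing them. The module names, dimensions, and the assumption that the instruction tokens have already been split into the two groups are all illustrative.

```python
# Minimal sketch of processing action words and object words separately.
# Everything here (names, sizes, the pre-split inputs) is an assumption
# for illustration, not the Centre's actual model.
import torch
import torch.nn as nn


class TwoBranchInstructionEncoder(nn.Module):
    """Encodes an instruction with one branch for action phrases
    (e.g. "turn left, go upstairs") and one for object phrases
    (e.g. "the red sofa"), then fuses the two summaries."""

    def __init__(self, vocab_size=1000, embed_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.action_rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.object_rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.fuse = nn.Linear(2 * hidden_dim, hidden_dim)

    def forward(self, action_tokens, object_tokens):
        # Both inputs are (batch, seq_len) word indices, assumed to have
        # been split into action/object groups by an upstream step.
        _, h_act = self.action_rnn(self.embed(action_tokens))
        _, h_obj = self.object_rnn(self.embed(object_tokens))
        fused = torch.cat([h_act[-1], h_obj[-1]], dim=-1)
        return torch.relu(self.fuse(fused))  # (batch, hidden_dim) context


if __name__ == "__main__":
    enc = TwoBranchInstructionEncoder()
    actions = torch.randint(0, 1000, (2, 6))  # toy action-word indices
    objects = torch.randint(0, 1000, (2, 4))  # toy object-word indices
    print(enc(actions, objects).shape)        # torch.Size([2, 128])
```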

The project team proposed a novel deep learning-based model, a Sub-Instruction Aware Vision-and-Language Navigation model, which focuses on the granularity of visual and language sequences as well as the ability to track a robot’s progress through an instruction. In this model, robots are provided with fine-grained annotations during training, and were found to follow instructions more closely and to have a greater chance of reaching the target at test time. This work has also been submitted to ACL 2020.
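
The sketch below gives a simplified picture of what sub-instruction tracking can look like at test time. The clause-level splitting rule and the step-by-step completion check are stand-ins for the dataset’s fine-grained annotations and the learned components; they are assumptions for illustration only.

```python
# Illustrative sketch of sub-instruction tracking during navigation.
# The splitting heuristic and the "advance every step" rule are stand-ins
# for the annotated sub-instructions and the learned completion signal.
from typing import List


def split_into_subinstructions(instruction: str) -> List[str]:
    # Rough clause-level split; real annotations come with the dataset.
    parts = [p.strip() for p in instruction.replace(" and ", ", ").split(",")]
    return [p for p in parts if p]


def navigate(instruction: str, max_steps: int = 20) -> None:
    subs = split_into_subinstructions(instruction)
    current = 0  # index of the sub-instruction being followed
    for step in range(max_steps):
        if current >= len(subs):
            print(f"step {step}: all sub-instructions completed, stop")
            return
        print(f"step {step}: following -> '{subs[current]}'")
        # A learned module would decide whether the current sub-instruction
        # is complete; here we simply advance after every step.
        current += 1


if __name__ == "__main__":
    navigate("Exit the bedroom, turn left and walk down the hallway, "
             "stop next to the potted plant")
```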

The team proposed a Visual Question Answering (VQA) model with Prior Class Semantics that can deal with out-of-domain answers for VQA problems. Out-of-domain answers are those that have never been seen in the training set. This involved presenting a novel mechanism for embedding prior knowledge in a VQA model. The open-set nature of the task is at odds with the ubiquitous approach of training a fixed classifier. The project team showed how to exploit additional information pertaining to the semantics of candidate answers, and extended the answer prediction process with a regression objective in a semantic space, in which candidate answers were projected using prior knowledge derived from word embeddings.
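
The toy sketch below shows the general idea of answer prediction as regression into a semantic space: candidate answers are represented by word-embedding vectors, and the predicted vector is matched to the nearest candidate, so answers never seen during training can still be scored. The embedding values and the predicted vector here are made up for illustration.

```python
# Toy sketch of answer prediction as regression in a semantic space.
# The 3-D "embeddings" and the predicted vector are invented for illustration;
# in practice they would come from pretrained word vectors and a VQA network.
import numpy as np

answer_embeddings = {
    "red":    np.array([0.9, 0.1, 0.0]),
    "blue":   np.array([0.1, 0.9, 0.0]),
    "banana": np.array([0.0, 0.2, 0.9]),
}


def predict_answer(predicted_vec: np.ndarray) -> str:
    """Return the candidate whose embedding is closest (by cosine similarity)
    to the regressed vector, instead of using a fixed classifier head."""
    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))
    return max(answer_embeddings, key=lambda a: cosine(answer_embeddings[a], predicted_vec))


# Because scoring only needs an embedding, an answer unseen at training time
# can be added to the candidate pool without retraining a classifier.
answer_embeddings["yellow"] = np.array([0.8, 0.3, 0.1])

if __name__ == "__main__":
    print(predict_answer(np.array([0.85, 0.25, 0.05])))  # prints "yellow"
```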

Finally, the project team developed a demonstrator on a robotic arm (a UR5 by Universal Robots) at the Centre’s University of Adelaide node. The robotic arm followed natural language instructions to draw a human face.


Activity Plan for 2020

  • Develop a robust, state-of-the-art model for vision-and-language navigation and for our new task, REVERIE (Remote Embodied Visual Referring Expression in Real Indoor Environments).
  • Develop a demonstrator on the robotic arm located at the Centre’s University of Adelaide node. The team has previously demonstrated V2L technology on the Pepper robot. This demonstrator will extend that work, with the aim of enabling a robotic arm to follow novel natural language instructions.
  • Develop technology to enable a robot to identify information that it needs to specify and then complete its task. Moving from VQA into Visual Dialogue will provide the capability to ask questions that seek information necessary to complete a task, and to identify when enough information has been gathered and an action should be taken.