Vision and Language

Humans communicate using language and conversations that frequently refer to things that can be seen in the world. Any robot that is going to communicate effectively about the world with a human will inevitably need to relate vision and language in much the same way that humans do.

Project Leader

Team Members

Project Aim

The ability to process vision takes up more of the human brain than any other function, and language is our primary means of communication. Any robot that is going to communicate flexibly about the world with a human will inevitably need to relate vision and language in much the same way that humans do. It’s not that this is the best way to sense, or communicate, only that it’s the human way, and communicating with humans is central to what robots need to be able to do.

This project used technology developed for vision and language purposes to develop capabilities relevant to visual robotics. More than just Visual Question and Answering (VQA) for robots or Dialogue for Tasking, it included questions about what needed to be learnt, stored, and reasoned over for a robot to be able to carry out a general task specified by a human through natural language.

Key Results

The team had two major milestones for 2020. The first was to achieve state-of-the-art for Visual Language Navigation (VLN). In relation to this milestone, Yuankai proposed an Object-and-Action Aware Model and Yicong proposed a Language and Visual Entity Relationship Graph model. Vision-and-Language Navigation requires an agent to navigate in a real-world environment following natural language instructions. From both the textual and visual perspectives, we find that the relationships among the scene, its objects, and directional clues are essential for the agent to interpret complex instructions and correctly perceive the environment. To capture and utilize the relationships, we propose a novel Language and Visual Entity Relationship Graph for modelling the inter-modal relationships between text and vision, and the intra-modal relationships among visual entities. We propose a message passing algorithm for propagating information between language elements and visual entities in the graph, which we then combine to determine the next action to take. This model has achieved state-of-the-art performance in the emerging field of VLN.

The second was the development of a demonstrator in collaboration with the Manipulation Demonstrator team. Specifically, the aim was to deliver an interface that receives a text description and a tabletop image as the input, and outputs a bounding box of the queried object to the robot, so that the robot can grab, manipulate and pass the object on to a human.

In 2020, the project team had six papers accepted for the Conference on Computer Vision and Pattern Recognition and two papers for the International Joint Conference on Artificial Intelligence, one paper at the annual conference on Neural Information Processing Systems, one paper for the Empirical Methods in Natural Language conference, four papers for the European Conference on Computer Vision and four papers at the Association for Computing Machinery Multimedia conference. The team also won the TextVQA Challenge and the MediVQA Challenge and hosted their first Remoted Embodied Visual Referring Expression in Real Indoor Environments challenge at the 2020 meeting of the Association for Computational Linguistics.

2020 Annual Report

Vision and Language

Humans communicate using language and conversations that frequently refer to things that can be seen in the world. Any robot that is going to communicate effectively about the world with a human will inevitably need to relate vision and language in much the same way that humans do.

Project Leader

Qi Wu

Team Members

Anton van den Hengel

Stephen Gould

Chunhua Shen

Anthony Dick

Yuankai Qi

Hui Li

Violetta Shevchenko

Yicong Hong

Zheyuan ‘David’ Liu

Sam Bahrami

Project Aim

Key Results