2019 Annual Report

This research project expands the capabilities of robots by building representations and models that will allow a robot to understand human activity and the intent of human actions.

Team Members

Hongdong Li

Australian National University (ANU), Australia

Chief Investigator Professor Hongdong Li is with the College of Engineering and Computer Science at the Australian National University (ANU). He joined the ANU Research School of Information Sciences and Engineering (RSISE) in 2004, first as a Postdoctoral Fellow and later as a Research Fellow and Senior Fellow. He was seconded to National ICT Australia (NICTA) as a Senior Research Scientist during 2008-2010, working on the Australia Bionic Eye Project. In 2010 he took a tenured position with ANU, teaching and researching computer vision and robotics. He joined the ACRV as one of its founding Chief Investigators in 2014. During 2017-2018 he was a Visiting Professor with the Robotics Institute at Carnegie Mellon University (CMU), Pittsburgh.

Richard Hartley

Australian National University (ANU), Australia

Richard is renowned as the founder of the field of multi-view geometry in computer vision; his text has received over 12,500 citations. Richard has been at the Australian National University since January 2001. He was also the Program Leader for the Autonomous Systems and Sensor Technology Program of NICTA. Richard worked at the General Electric Research and Development Center from 1985 to 2001, where he became involved with image understanding and scene reconstruction while working with GE's Simulation and Control Systems Division, which built large-scale flight simulators.

From 1995 he was GE project leader for a shared-vision project with Lockheed-Martin involving the design and implementation of algorithms for an AFIS (fingerprint analysis) system being developed under a Lockheed-Martin contract with the FBI. This involved work in feature extraction, interactive fingerprint editing and fingerprint database matching. In 2000, he co-authored (with Andrew Zisserman) a book for Cambridge University Press summarising the previous decade's research in this area (over 34,000 citations and an h-index of 57).

Dylan Campbell

Australian National University (ANU), Australia

Dylan joined the Centre as a Research Fellow at the Australian National University (ANU) in August 2018. Previously, he was a PhD student at ANU and Data61/CSIRO, where he worked on geometric vision problems, and a research assistant in the Cyber-Physical Systems group of Data61/CSIRO, where he worked on Resource Constrained Vision.  Dylan received a BE in Mechatronic Engineering from the University of New South Wales. He has broad research interests within computer vision and robotics, including geometric vision and human-centred vision. In particular, he has investigated geometric sensor data alignment problems, such as camera localisation, simultaneous localisation and mapping, and structure from motion.

He is currently looking at the problems of recognising, modelling, and predicting human actions, poses and human-object interactions, with a view to facilitating robot-human interaction as part of a Centre project.

Xin Yu

Australian National University (ANU), Australia

Xin Yu was a Research Fellow at the Australian National University (ANU). He received his PhD from the Australian National University under the supervision of Prof. Richard Hartley, Fatih Porikli and Basura Fernando. He also obtained a PhD from Tsinghua University, China, under the supervision of Prof. Li Zhang.

Xin is now a lecturer at the University of Technology Sydney.

Fatemeh Saleh

Australian National University (ANU), Australia

Fatemeh joined the Centre as a Research Fellow at the Australian National University (ANU) in January 2019. Prior to that, she was a PhD student at the Australian National University and Data61-CSIRO, working on weakly-supervised semantic segmentation of images and videos. Within the Centre, she is now working on the problem of video understanding and latent-variable generative models, with a focus on multiple object tracking, human motion prediction, and video activity analysis.

Itzik Ben Shabat

Australian National University (ANU), Australia

Itzik joined the Centre as a Research Fellow at the Australian National University (ANU) node in July 2019. Previously, he was a PhD student at the Technion – Israel Institute of Technology, where he worked on “Classification, segmentation, and geometric analysis of 3D point clouds using deep learning” under the supervision of Prof. Anath Fischer and Michael Lindenbaum.

Itzik completed his BSc cum laude in 2008 (Mechanical Engineering, Technion) and his MSc summa cum laude in 2015 under the supervision of Prof. Anath Fischer (Mechanical Engineering, Technion). His research interests lie at the intersection of robotic perception, 3D computer vision, and geometric analysis, usually using 3D point cloud data.

Cristian Rodriguez Opazo

Australian National University (ANU), Australia

Cristian joined the Centre as a PhD researcher under the supervision of Chief Investigator Hongdong Li and Research Fellow Basura Fernando. His research interests are machine learning and pattern recognition, with a focus on object detection, scene understanding and occlusion handling. Cristian completed a bachelor's degree in Computer Engineering at the Metropolitan Technological University (UTEM) in Chile before moving to Australia, supported by a Chilean 'Becas Chile' scholarship, to complete a Master of Computing (Advanced) with a specialisation in artificial intelligence at the Australian National University. Before joining ANU, he also worked as a research assistant and developer at the Web Intelligence Centre in Chile.

Samira Kaviani

Australian National University (ANU), Australia

Samira started her PhD at the Australian National University (ANU) in 2018 and is supervised by Chief Investigators Richard Hartley and Stephen Gould. Her current research focuses on human pose estimation and anticipation.

Frederic ‘Zhen’ Zhang

Australian National University

Fred is a PhD student at the Australian National University under the supervision of Prof. Stephen Gould. In 2018, he received a Bachelor of Engineering from the Australian National University and a Bachelor of Science from the Beijing Institute of Technology.

Fred has been working on the task of Human-Object Interaction (HOI) detection, and is generally interested in vision-based problems and their deep learning solutions.

Sadegh Aliakbarian

Australian National University (ANU), Australia

Sadegh is an Associated PhD Researcher at our Australian National University node and a researcher at Smart Vision Systems, Data61, CSIRO. He is working on vision-based action anticipation, which is particularly crucial in scenarios where one needs to react before an action is completed, for example an autonomous car braking to avoid hitting a pedestrian. Currently, he is working on action anticipation in driving scenarios to predict human-centric actions (e.g., driver manoeuvres, traffic rule violations, front-car intention, pedestrian intention, and accidents with cars and pedestrians) before they happen. Prior to starting his PhD, he was a researcher at NICTA. Before joining NICTA, Sadegh worked in the computer vision industry for more than two years. He also has a Bachelor of Science with honours in Computer Software Engineering.

Sadegh is also interested in other computer vision applications such as weakly-supervised semantic segmentation, usage of synthetic data in computer vision, and sequence learning.

Shihao ‘Zac’ Jiang

Australian National University (ANU), Australia

Shihao Jiang is a PhD researcher at the Australian National University, working under the supervision of Richard Hartley, Dylan Campbell, Miaomiao Liu and Stephen Gould. His research interests include geometric vision, optimisation and deep learning. Prior to his PhD, he received his Bachelor of Engineering degree with first class honours in 2016, majoring in Electronic and Communication Systems.

Project Aim

For robots to understand human actions and intent, they need to make inferences from visual cues, just as humans do. This research project expands the capabilities of robots by building representations and models that will allow a robot to understand human activity and the intent of human actions. The research is important because it will ultimately enable robots to co-operate with humans to complete a complex task in an unstructured environment, for example, assembling a piece of furniture in the home.


Key Results

The project team celebrated a number of significant key results and achievements in 2019. First and foremost, Rodrigo Fonseca Santa Cruz Oliveira, who joined the Centre as a PhD candidate at The Australian National University node to work on semantic scene understanding (the challenging task of building machines that can interpret images and video as humans do at first glance), was awarded his PhD. Rodrigo's thesis, titled "Visual Recognition from Structured Supervision", contributed methods for learning better models from unsupervised data and understanding activities from primitive actions. Rodrigo has moved from the Centre to take up the position of Postdoctoral Research Fellow at CSIRO in Brisbane.

The team continued to improve results on activity recognition with work by PhD Researcher Cristian Rodriguez-Opazo that aims to locate activities specified by a natural language query in long video sequences. This work, conducted in 2019, will be published at the 2020 Winter Conference on Applications of Computer Vision (WACV). 
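
To make the task concrete, the sketch below shows one generic formulation of this kind of temporal grounding; it is illustrative only, not the method in the WACV paper, and the function name, mean-pooled features and cosine scoring are placeholder assumptions. The idea is to embed the query and the video segments in a common space, score candidate temporal windows against the query, and return the best-scoring interval.

    import numpy as np

    def localise_activity(clip_features, query_embedding, window_sizes=(8, 16, 32)):
        """Score sliding temporal windows of a video against a query embedding.

        clip_features:   (T, D) array, one feature vector per video segment.
        query_embedding: (D,) array embedding the natural language query.
        Returns ((start, end), score) for the best-scoring window.
        """
        # Normalise so that dot products become cosine similarities.
        clips = clip_features / np.linalg.norm(clip_features, axis=1, keepdims=True)
        query = query_embedding / np.linalg.norm(query_embedding)

        T = clips.shape[0]
        best_score, best_span = -np.inf, (0, 0)
        for w in window_sizes:
            for s in range(max(T - w, 0) + 1):
                # A window is scored by the similarity of its mean feature to the query.
                score = float(clips[s:s + w].mean(axis=0) @ query)
                if score > best_score:
                    best_score, best_span = score, (s, min(s + w, T))
        return best_span, best_score

    # Toy usage with random features standing in for real video and text encoders.
    rng = np.random.default_rng(0)
    feats = rng.normal(size=(100, 128))
    query = feats[40:56].mean(axis=0)  # pretend the query describes segments 40-56
    print(localise_activity(feats, query))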

Other work on fine-grained understanding of human activities, such as human pose forecasting (led by Associated PhD Researcher Sadegh Aliakbarian) and multiple object tracking in crowded environments (led by Research Fellow Fatemeh Saleh), is under review. Research on human-object interaction and hand pose estimation, being undertaken by PhD Researchers Frederic Zhang and Samira Kaviani respectively, is ongoing.

In an exciting piece of research in collaboration with the Learning project team, we have developed theory that allows traditional computer vision models to be combined with contemporary deep learning models in a single framework known as a deep declarative network. 
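
To illustrate the idea, here is a minimal sketch (in NumPy, and not the Centre's implementation; the example problem and function names are assumptions for exposition). A declarative node defines its output as the solution of an optimisation problem, y(x) = argmin_u f(x, u); the forward pass runs a solver, here Newton's method on a robust averaging problem with a pseudo-Huber penalty, while the backward pass differentiates the optimality condition implicitly, so no gradients need to flow through the solver iterations.

    import numpy as np

    def pseudo_huber_mean(x, iters=50):
        """Declarative node: y(x) = argmin_u sum_i sqrt(1 + (u - x_i)^2) - 1."""
        u = x.mean()  # initialise at the ordinary mean
        for _ in range(iters):
            r = u - x
            g = np.sum(r / np.sqrt(1.0 + r**2))   # df/du
            h = np.sum((1.0 + r**2) ** -1.5)      # d2f/du2, strictly positive
            u -= g / h                            # Newton step
        return u

    def pseudo_huber_mean_grad(x, u):
        """Implicit gradient: differentiating the stationarity condition
        df/du = 0 gives dy/dx_i = -(d2f/(du dx_i)) / (d2f/du2)."""
        w = (1.0 + (u - x)**2) ** -1.5
        return w / w.sum()  # weights are positive and sum to one

    x = np.array([0.0, 0.1, -0.2, 5.0])  # the 5.0 acts as an outlier
    y = pseudo_huber_mean(x)
    dydx = pseudo_huber_mean_grad(x, y)

    # Finite-difference check of the implicit gradient.
    eps = 1e-6
    fd = np.array([(pseudo_huber_mean(x + eps * np.eye(4)[i]) - y) / eps
                   for i in range(4)])
    print(y, dydx, fd)

The appeal of this construction is that the forward solver can be any black-box optimiser; the gradient remains exact and cheap because it depends only on the optimality condition at the solution.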

A workshop on deep declarative networks will be held at the premier conference on computer vision, CVPR, in June 2020. The workshop is organised by Chief Investigators Stephen Gould and Richard Hartley, Research Fellow Dylan Campbell and former Research Fellow Anoop Cherian, and has attracted speakers from Stanford University, Carnegie Mellon University, Facebook Inc., École polytechnique fédérale de Lausanne, the University of Toronto, and the University of Massachusetts. A tutorial on deep declarative networks will also take place at the European Conference on Computer Vision in August 2020. The tutorial is organised by Research Fellow Itzik Ben-Shabat and will feature lectures by researchers from The Australian National University and Stanford University.

This body of work, achieved in 2019, establishes the necessary foundation for the goal of demonstrating human-robot cooperation for the task of assembling a piece of Ikea furniture in 2020. To this end, Research Fellows Itzik Ben-Shabat and Xin Yu have led the collection of a large dataset that contains more than 350 video examples of people assembling various pieces of Ikea furniture ranging from tables to drawer units. The dataset is being annotated and is planned for release in early 2020.


Activity Plan for 2020

  • Model interaction between humans and objects (initially in images and then extended to video). 
  • Forecast human pose and (a) predict object interaction, and (b) anticipate activity within a known context.
  • Automatically learn activities as grammars/state machines from video that can be understood by a robot. 
  • Demonstrate 3D pose estimation and tracking of human joints on video in real time. Render the human skeleton from an arbitrary viewpoint (see the sketch after this list).
  • Project demonstrator showing monitoring and understanding of a person assembling a piece of Ikea furniture. 
  • Robot replication of a sequence of human actions to complete some simple task. 
  • Human-robot cooperation in completing a task where the robot monitors a task, predicts future human actions and provides guidance, hands over a tool or part required next, or holds a part in place while a human performs some action.  Ability to recover from unexpected actions.
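
As a small illustration of the skeleton-rendering item above (a sketch only, assuming a pinhole camera model; the joint coordinates and camera parameters are made-up examples), rendering tracked 3D joints from an arbitrary viewpoint reduces to a rotation, a translation and a perspective division:

    import numpy as np

    def project_skeleton(joints_3d, R, t, f=800.0, cx=320.0, cy=240.0):
        """Project 3D joint positions into a virtual pinhole camera.

        joints_3d: (J, 3) joint coordinates in world space.
        R, t:      rotation (3, 3) and translation (3,) defining the viewpoint.
        f, cx, cy: focal length and principal point of the virtual camera.
        Returns (J, 2) pixel coordinates.
        """
        cam = joints_3d @ R.T + t            # world -> camera coordinates
        u = f * cam[:, 0] / cam[:, 2] + cx   # perspective division
        v = f * cam[:, 1] / cam[:, 2] + cy
        return np.stack([u, v], axis=1)

    # Toy three-joint skeleton viewed from a camera rotated 30 degrees about y.
    joints = np.array([[0.0, 1.7, 0.0], [0.0, 1.0, 0.0], [0.2, 0.0, 0.1]])
    a = np.deg2rad(30.0)
    R = np.array([[np.cos(a), 0.0, np.sin(a)],
                  [0.0, 1.0, 0.0],
                  [-np.sin(a), 0.0, np.cos(a)]])
    t = np.array([0.0, -1.0, 3.0])  # place the skeleton in front of the camera
    print(project_skeleton(joints, R, t))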