Learning Intuitive Physics

Figure: What will happen when the robot arm moves left? Will the tape dispenser collide with the banana? Hind4sight-Net learns an unsupervised structured dynamics model which decomposes the scene into objects and predicts their motion conditioned on an action.

What will happen if the robot shown in the figure moves the arm to the left? We can all foresee that the tape dispenser will move to the left, probably colliding with the banana. Intelligent beings have the remarkable ability to effectively interact with unseen objects by leveraging intuitive models of their environment’s physics learned from experience. Predicting the effect of one’s actions is a cornerstone of intelligent behavior and also enables reasoning about sequences of actions needed to achieve desired goals. Thus, the ability to learn dynamics models autonomously from physical interaction provides an appealing avenue for improving a robot’s understanding of its physical environment, as robots can collect virtually unlimited experience through their own exploration.

In this work, we propose a novel approach to learning real-world dynamics. Our method requires neither labeled data nor human supervision, which allows a robot to improve its understanding of its environment’s physics in a lifelong learning manner. Our formulation leads to useful, interpretable models that can be used for visuomotor control and planning.

Technical Approach

Our approach, denoted Hind4sight-Net, jointly learns a forward and an inverse dynamics model. It decomposes the scene into salient object parts and predicts their 3D motion. Our object-centric formulation allows us to capture several desirable inductive biases that help in learning more efficient and interpretable models: a scene comprises several objects, actions can affect these objects, and the objects can, in turn, affect each other. Thus, our network outputs action-conditioned 3D scene flow, object masks and 2D optical flow as emergent properties. Unlike previous approaches, our method does not require ground-truth point-wise data associations, typically provided by a tracker, or a pre-trained perception network.
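The relation between 3D scene flow and 2D optical flow can be illustrated with a small sketch. It assumes a pinhole camera with intrinsics `fx`, `fy`, `cx`, `cy` and points given in the camera frame; these names and the function names are illustrative assumptions and are not taken from the Hind4sight-Net implementation.

```python
def project(point, fx, fy, cx, cy):
    """Pinhole projection of a camera-frame 3D point (x, y, z) to a pixel (u, v)."""
    x, y, z = point
    return (fx * x / z + cx, fy * y / z + cy)


def optical_flow_from_scene_flow(points, scene_flow, fx, fy, cx, cy):
    """Project each point before and after its 3D displacement.

    The pixel difference between the two projections is the 2D optical flow
    induced by the 3D scene flow (assumed camera model, see lead-in).
    """
    flows = []
    for p, f in zip(points, scene_flow):
        moved = (p[0] + f[0], p[1] + f[1], p[2] + f[2])
        u0, v0 = project(p, fx, fy, cx, cy)
        u1, v1 = project(moved, fx, fy, cx, cy)
        flows.append((u1 - u0, v1 - v0))
    return flows
```

For example, a point one meter in front of the camera that moves 10 cm along the x axis shifts by `fx * 0.1` pixels horizontally and does not move vertically.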

Network architecture
Figure: Structure of Hind4sight-Net: we jointly learn forward and inverse scene dynamics models from unlabeled interaction data. The forward model segments a 3D point cloud of the scene into salient object parts and predicts their SE(3) motion under the effect of an applied poking action. The inverse model takes two consecutive 3D point clouds as input and reasons over the poking action.
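As a minimal sketch of how per-part masks and SE(3) motions combine into point-wise scene flow, the following plain-Python example blends the rigidly transformed position of each point using its mask weights. The data layout (`masks[k][i]` as the weight of point `i` for part `k`, weights summing to one per point) and the function names are assumptions made for illustration, not the authors' code.

```python
def apply_se3(rotation, translation, point):
    """Apply a rigid transform (3x3 rotation, 3-vector translation) to a 3D point."""
    return tuple(
        sum(rotation[i][j] * point[j] for j in range(3)) + translation[i]
        for i in range(3)
    )


def scene_flow(points, masks, transforms):
    """Blend per-part rigid motions into one 3D displacement per point.

    masks[k][i] is the (assumed) soft assignment of point i to part k.
    transforms[k] is the (rotation, translation) pair predicted for part k.
    """
    flow = []
    for i, p in enumerate(points):
        moved = [0.0, 0.0, 0.0]
        for k, (rotation, translation) in enumerate(transforms):
            q = apply_se3(rotation, translation, p)
            for axis in range(3):
                moved[axis] += masks[k][i] * q[axis]
        flow.append(tuple(moved[axis] - p[axis] for axis in range(3)))
    return flow
```

With an identity transform for a static background part and a pure translation for a poked object, only the points assigned to the object receive non-zero flow.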

To learn from unlabeled real-world interaction data, we enforce consistency of the estimated 3D point clouds, actions and 2D images with the observed ones. The main loss functions operate on observational changes and enable learning scene dynamics in the real world without the need for data associations provided by a tracker. The image reconstruction loss uses the predicted 2D flow to minimize a photometric consistency error. The Chamfer distance loss enforces geometric consistency between the predicted and observed point clouds. The inverse model predicts spatial distributions of the actions that caused the scene to change.
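To make the geometric consistency term concrete, here is a minimal sketch of a symmetric Chamfer distance between two point clouds. It uses squared Euclidean distances averaged over each cloud, which is one common variant; the exact variant and weighting used to train Hind4sight-Net are not stated here, so treat this form as an assumption.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two 3D points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def chamfer_distance(cloud_a, cloud_b):
    """Symmetric Chamfer distance between two non-empty point clouds.

    Each point is matched to its nearest neighbor in the other cloud, so no
    point-wise correspondences between the clouds are needed.
    """
    a_to_b = sum(min(squared_distance(p, q) for q in cloud_b) for p in cloud_a)
    b_to_a = sum(min(squared_distance(q, p) for p in cloud_a) for q in cloud_b)
    return a_to_b / len(cloud_a) + b_to_a / len(cloud_b)
```

Because the nearest-neighbor search replaces explicit correspondences, a loss of this kind can compare a predicted cloud with the next observed cloud without a tracker. This brute-force version is quadratic in the number of points and is meant only as an illustration.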

Freiburg Poking Dataset

For experiments on real data, we collect 40K interactions with a KUKA LBR iiwa manipulator and a fixed Azure Kinect RGB-D camera. We built a styrofoam arena with walls to prevent objects from falling off. At any given time, 3-7 objects, randomly chosen from a set of 34 distinct objects, were present in the arena. The objects differed from each other in shape, appearance, material, mass and friction.

For the quantitative evaluation of the learned structured forward dynamics model, we use the Bullet physics engine to collect a dataset of poking interactions. We pick four representative objects from the KIT kitchen object models database, which differ in geometry, size and texture. We record a dataset of 200K interactions, with randomized object start poses and poke actions.

License Agreement

This data is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. If you use the data in an academic context, please consider citing our paper mentioned in the Publications section.

Hind4sight-Net Performance

We evaluate the performance of our unsupervised structured dynamics model on both simulated and real-world datasets and demonstrate its applicability in a real-world model-predictive control experiment. Hind4sight-Net achieves a lower 3D scene flow error than SE3-Nets, a supervised method for predicting scene flow, even though it is fully unsupervised and not directly trained to predict 3D scene flow. Moreover, in the 2D optical flow estimation task, Hind4sight-Net outperforms FlowNet 2.0, a state-of-the-art optical flow prediction network, despite FlowNet 2.0 having access to two consecutive images as input and being trained with explicit optical flow supervision.

Figure: Visualization of the optical flow predicted by FlowNet 2.0 and the implicit action-conditioned flow learned by our model. Hind4sight-Net outperforms FlowNet 2.0 as it shows sharper object masks, models collisions better and is less prone to visual distractors such as shadows.

Furthermore, to analyze the influence of the different building blocks on the learned dynamics model, we conduct an ablation study. It validates that reasoning jointly over the 3D and image domains significantly improves the results, and that the full model, which additionally incorporates the action loss of the inverse model, achieves the best result.

Hind4sight-Net for Planning

To evaluate the effectiveness of the learned dynamics model, we use the cross-entropy method (CEM) to find poke action sequences that lead to a desired goal on both simulated and real data. We define the planning cost function as a combination of the 3D and 2D domains the network has been trained on. We observe that in most cases we can reach the goal configuration with around 10 poke actions.
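The cross-entropy method itself can be sketched in a few lines. The version below is a generic CEM loop over a flat action vector with a diagonal Gaussian sampling distribution; the hyperparameters and the toy cost are placeholders chosen for illustration. In the setting described above, the cost would instead be computed from the learned dynamics model's 3D and 2D predictions, which this sketch does not include.

```python
import random


def cem_plan(cost_fn, dim, iterations=20, population=64, n_elite=8, seed=0):
    """Generic cross-entropy method: minimize cost_fn over a dim-dimensional vector.

    Each iteration samples candidates from a diagonal Gaussian, keeps the
    n_elite lowest-cost candidates, and refits the Gaussian to them.
    """
    rng = random.Random(seed)
    mean = [0.0] * dim
    std = [1.0] * dim
    for _ in range(iterations):
        samples = [
            [rng.gauss(mean[d], std[d]) for d in range(dim)]
            for _ in range(population)
        ]
        samples.sort(key=cost_fn)
        elite = samples[:n_elite]
        mean = [sum(s[d] for s in elite) / n_elite for d in range(dim)]
        # Small floor keeps the sampling distribution from collapsing to zero width.
        std = [
            (sum((s[d] - mean[d]) ** 2 for s in elite) / n_elite) ** 0.5 + 1e-6
            for d in range(dim)
        ]
    return mean


def toy_cost(action, goal=(0.5, -0.3)):
    """Placeholder cost: squared distance of a 2D action to a hypothetical goal."""
    return sum((a - g) ** 2 for a, g in zip(action, goal))
```

Calling `cem_plan(toy_cost, dim=2)` returns an action close to the hypothetical goal. For action sequences, `dim` would be the sequence length times the per-action dimension, and in a model-predictive control loop only the first planned action is typically executed before replanning.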





Iman Nematollahi, Oier Mees, Lukas Hermann and Wolfram Burgard, "Hindsight for Foresight: Unsupervised Structured Dynamics Models from Physical Interaction",
arXiv preprint arXiv:2008.00456, 2020.


author = {Iman Nematollahi and Oier Mees and Lukas Hermann and Wolfram Burgard},
title = {Hindsight for Foresight: Unsupervised Structured Dynamics Models from Physical Interaction},
booktitle = {Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)},
address = {Las Vegas, USA}