Automation Research Team

Visual Imitation Learning of Non-Prehensile Manipulation Tasks with Dynamics-Supervised Models

Abdullah Mustafa1, Ryo Hanai1, Ixchel Ramirez1, Floris Erich1, Ryoichi Nakajo1, Yukiyasu Domae1, Tetsuya Ogata2
 
1: National Institute of Advanced Industrial Science and Technology
2: Waseda University
 
IEEE International Conference on Automation Science and Engineering (CASE) 2024
 

Abstract

Unlike quasi-static robotic manipulation tasks such as pick-and-place, dynamic tasks such as non-prehensile manipulation pose greater challenges, especially for vision-based control. Successful control requires the extraction of features relevant to the target task. In visual imitation learning settings, these features can be learnt by backpropagating the policy loss through the vision backbone. Yet, this approach tends to learn task-specific features with limited generalizability. Alternatively, learning world models can realize more generalizable vision backbones. Utilizing the learnt features, task-specific policies are subsequently trained. Commonly, these models are trained solely to predict the next RGB state from the current state and the action taken. However, only-RGB prediction might not fully capture the task-relevant dynamics. In this work, we hypothesize that direct supervision of target dynamic states (Dynamics Mapping) yields better dynamics-informed world models. Besides reconstructing the next RGB state, the world model is also trained to directly predict the position, velocity, and acceleration of the environment's rigid bodies. To verify our hypothesis, we designed a non-prehensile 2D environment tailored to two tasks: "Balance-Reaching" and "Bin-Dropping". When trained on the first task, dynamics mapping enhanced task performance across different training configurations (Decoupled, Joint, End-to-End) and policy architectures (Feedforward, Recurrent). Notably, its most significant impact was on world model pretraining, boosting the success rate from 21% to 85%. Frozen dynamics-informed world models generalized well to a task with in-domain dynamics, but poorly to one with out-of-domain dynamics.

Acknowledgements

This work was supported by JST [Moonshot R&D][Grant Number JPMJMS2031].


Proposal: Dynamics Mapping

Instead of solely reconstructing the future RGB state, dynamics supervision enables learning of dynamics-informed world models. This can benefit both pretraining and joint training of world models. The overall model comprises a world model, a policy, and a set of decoding networks. The model inputs are the RGB state $I_t$, action $a_t$, and goal $g_t$. The internal states are the latent $z_t$ and the hidden states $h_t$ and $h_{\pi_t}$. The outputs are the predicted action $\hat{a}_t$, the dynamics $[\hat{P}_t, \hat{V}_t, \hat{A}_t]$, and the reconstructed RGB $\hat{I}_t$.
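A minimal PyTorch sketch of such a model is shown below. The layer sizes, the number of rigid bodies, and the feedforward policy head are illustrative assumptions rather than the exact networks used in this work: an encoder maps $I_t$ to $z_t$, a recurrent core updates $h_t$ from $(z_t, a_t)$, and separate heads decode the reconstructed RGB frame and the dynamic states.

```python
import torch
import torch.nn as nn

class DynamicsSupervisedWorldModel(nn.Module):
    """Sketch of a recurrent world model with an RGB reconstruction head
    and direct dynamics (position/velocity/acceleration) heads."""

    def __init__(self, latent_dim=64, hidden_dim=256, action_dim=2, n_bodies=3):
        super().__init__()
        # Vision backbone: encodes the RGB state I_t into a latent z_t.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, latent_dim),
        )
        # Recurrent core: updates the hidden state h_t from (z_t, a_t).
        self.rnn = nn.GRUCell(latent_dim + action_dim, hidden_dim)
        # RGB head: reconstructs the next frame (schematic output size).
        self.rgb_head = nn.Sequential(
            nn.Linear(hidden_dim, 64 * 8 * 8), nn.ReLU(),
            nn.Unflatten(1, (64, 8, 8)),
            nn.ConvTranspose2d(64, 32, 4, stride=2), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2),
        )
        # Dynamics Mapping heads: directly predict P_t, V_t, A_t of the
        # environment rigid bodies (2D coordinates per body in this sketch).
        self.pos_head = nn.Linear(hidden_dim, n_bodies * 2)
        self.vel_head = nn.Linear(hidden_dim, n_bodies * 2)
        self.acc_head = nn.Linear(hidden_dim, n_bodies * 2)

    def forward(self, image, action, hidden):
        z = self.encoder(image)
        hidden = self.rnn(torch.cat([z, action], dim=-1), hidden)
        return {
            "rgb": self.rgb_head(hidden),
            "pos": self.pos_head(hidden),
            "vel": self.vel_head(hidden),
            "acc": self.acc_head(hidden),
            "hidden": hidden,
        }

class FeedforwardPolicy(nn.Module):
    """Goal-conditioned policy head: maps (h_t, g_t) to a predicted action."""

    def __init__(self, hidden_dim=256, goal_dim=2, action_dim=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden_dim + goal_dim, 128), nn.ReLU(),
            nn.Linear(128, action_dim),
        )

    def forward(self, hidden, goal):
        return self.net(torch.cat([hidden, goal], dim=-1))
```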

Tasks Description

We design three tasks: one main task, on which the base world model is trained, and two others for evaluating the generalizability of the model.

  1. Balance-Reach: The agent must reach a target position while balancing a pole. This is our main task, on which the base world model is trained.
  2. Balance-Reach v2: Similar to the main task, with an additional obstacle to avoid. This task shares similar dynamics with the main task.
  3. Bin-Drop: The agent must drop a block into a bin. This task has different dynamics from the main task.

[Figure: task environments BRv1, BRv2, and BinDrop]

Dataset Generation

Obtaining expert demonstrations for the proposed tasks was challenging: neither human demonstrations nor scripted policies were feasible. We therefore opted for Deep Reinforcement Learning (DRL) to generate expert demonstrations. The expert policy was trained utilizing ground-truth dynamical states. In a realistic setup, such states are not accessible and must instead be inferred from vision. An image/dynamics/action dataset is generated to train our dynamics-informed model.
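As a rough illustration, the collection loop could look like the sketch below, which assumes a classic gym-style environment API and a pretrained state-based expert; `env`, `expert_policy`, and the dynamics keys are placeholders, not the exact interfaces used in this work.

```python
import numpy as np

def collect_demonstrations(env, expert_policy, n_episodes=100):
    """Roll out a state-based DRL expert and record (image, dynamics, action)
    tuples for training the dynamics-informed world model."""
    dataset = []
    for _ in range(n_episodes):
        obs = env.reset()  # ground-truth dynamical state, only available in sim
        done = False
        episode = []
        while not done:
            action = expert_policy(obs)           # expert acts on true states
            image = env.render(mode="rgb_array")  # what the learner will see
            dynamics = {                          # direct supervision targets
                "position": np.asarray(obs["position"]),
                "velocity": np.asarray(obs["velocity"]),
                "acceleration": np.asarray(obs["acceleration"]),
            }
            episode.append({"image": image,
                            "dynamics": dynamics,
                            "action": np.asarray(action)})
            obs, reward, done, info = env.step(action)
        dataset.append(episode)
    return dataset
```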

Results

  1. In terms of the policy loss, Dynamics-Supervised models achieve lower losses than only-RGB ones (velocity supervision was most effective on average). Comparing the different training configurations, Decoupled training was the most effective, as it is less sensitive to hyperparameter tuning (Joint training) and overfitting (End-to-End training); a loss sketch is given after this list.

  2. With only-RGB latents and a Feedforward policy, low success rates are attained due to consistent object drops and high position errors. Dynamics-Supervised models achieve higher success rates and lower position errors. The most significant impact was for world model pretraining, boosting the success rate from 21% to 85%.

  3. We compare the different architectures incorporating dynamical states:

    1. Balance-Reach [Main Task]: Dynamics-Supervised models always improved over only-RGB performance for all architecture choices: Decoupled (Feedforward, Recurrent) and Joint.

    2. Balance-Reach v2 [In-Domain Dynamics]: Trained world models transferred fairly well to a task with similar dynamics.

    3. Bin-Drop [Out-of-Domain Dynamics]: Poor generalization to tasks with out-of-domain dynamics was observed.
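To make the comparison in item 1 concrete, a minimal sketch of the two loss terms and of how the training configurations combine them is given below; the squared-error losses and unit weights are assumptions, not the exact objectives used in this work.

```python
import torch.nn.functional as F

def world_model_loss(pred, target, w_rgb=1.0, w_p=1.0, w_v=1.0, w_a=1.0):
    """RGB reconstruction plus direct dynamics supervision (Dynamics Mapping).
    The individual weights are illustrative placeholders."""
    return (w_rgb * F.mse_loss(pred["rgb"], target["rgb"])
            + w_p * F.mse_loss(pred["pos"], target["pos"])
            + w_v * F.mse_loss(pred["vel"], target["vel"])
            + w_a * F.mse_loss(pred["acc"], target["acc"]))

def policy_loss(pred_action, expert_action):
    """Behaviour-cloning loss against the DRL expert's action."""
    return F.mse_loss(pred_action, expert_action)

# Decoupled:  pretrain the world model with world_model_loss, freeze it, then
#             train the policy alone with policy_loss on the frozen latents.
# Joint:      optimise world_model_loss + policy_loss together, which makes the
#             relative loss weights a sensitive hyperparameter.
# End-to-End: train the full stack at once, which the results suggest is the
#             most prone to overfitting.
```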