Inspired by the success of transfer learning in computer vision, roboticists have investigated visual pre-training as a means to improve the learning efficiency and generalization ability of policies learned from pixels. To that end, past work has favored large object interaction datasets, such as first-person videos of humans completing diverse tasks, in pursuit of manipulation-relevant features. Although this approach improves the efficiency of policy learning, it remains unclear how reliable these representations are in the presence of distribution shifts that arise commonly in robotic applications. Surprisingly, we find that visual representations designed for manipulation and control tasks do not necessarily generalize under subtle changes in lighting and scene texture or the introduction of distractor objects. To understand what properties do lead to robust representations, we compare the performance of 15 pre-trained vision models under different visual appearances. We find that emergent segmentation ability is a strong predictor of out-of-distribution generalization among ViT models. The rank order induced by this metric is more predictive than metrics that have previously guided generalization research within computer vision and machine learning, such as downstream ImageNet accuracy, in-domain accuracy, or shape-bias as evaluated by cue-conflict performance. We test this finding extensively on a suite of distribution shifts in ten tasks across two simulated manipulation environments. On the ALOHA setup, segmentation score predicts real-world performance after offline training with 50 demonstrations.
We find that the emergent segmentation ability of a ViT model, a property we refer to as segmenting-features, strongly predicts generalization performance. In the example below, we visualize the attention of a ViT with high emergent segmentation performance: MoCo-v3. Policies trained on top of this model succeed zero-shot in 6 out of the 10 distribution shifts depicted below.
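As a rough sketch of how an attention visualization like this can be produced: given the self-attention tensor from the last ViT block (obtained, for example, through a forward_selfattention hook like the one described in the code instructions below), the CLS token's attention can be averaged over heads and reshaped into a per-patch heatmap. The function below is illustrative only, not the code used here, and it assumes a ViT-B/16 operating on 224 x 224 images.

import torch

def cls_attention_heatmap(attn, patch_grid=(14, 14)):
    # attn: [batch, heads, tokens, tokens] self-attention from the last ViT block,
    # where token 0 is the CLS token and the remaining tokens are image patches
    # (14 x 14 patches for a ViT-B/16 at 224 x 224 resolution).
    cls_to_patches = attn[:, :, 0, 1:]       # attention from the CLS token to every patch
    heatmap = cls_to_patches.mean(dim=1)     # average over attention heads
    return heatmap.reshape(-1, *patch_grid)  # upsample and overlay this on the input image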
Models that don't have segmenting-features, such as Masked Visual Pre-training (MVP), are less successful at generalizing to new visual appearances. In the example below, a policy learned on top of MVP only generalizes to 3 out of 10 visual shifts. This is surprising because MVP is designed for manipulation and control tasks.
We can evaluate the example above quantitatively. Specifically, we measure the predictive power of segmenting-features by correlating the Jaccard index of each pre-trained model's attention heads, i.e., their overlap with ground-truth object masks, with the out-of-distribution performance of a downstream policy. This metric is more predictive than other metrics of generalizability, such as downstream ImageNet accuracy, in-domain accuracy, or shape-bias as evaluated by cue-conflict performance.
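To make the computation concrete, here is a rough sketch of the two steps involved: binarize an attention map and measure its overlap with a ground-truth object mask, then rank-correlate that score with downstream success rates across models. The helper and the numbers below are placeholders for illustration, not the actual evaluation code, and the thresholding rule (keeping a fixed fraction of attention mass) is one common choice rather than necessarily the one used in the codebase.

import numpy as np
from scipy.stats import spearmanr

def jaccard_index(attn_map, gt_mask, keep_mass=0.6):
    # Binarize the attention map by keeping the patches that account for the top
    # `keep_mass` fraction of total attention, then compute intersection-over-union
    # with the ground-truth object mask (both arrays are [H, W]).
    flat = attn_map.flatten()
    order = np.argsort(flat)[::-1]
    cumulative = np.cumsum(flat[order]) / flat.sum()
    cutoff = flat[order[np.searchsorted(cumulative, keep_mass)]]
    pred = attn_map >= cutoff
    intersection = np.logical_and(pred, gt_mask).sum()
    union = np.logical_or(pred, gt_mask).sum()
    return intersection / union

# One (Jaccard index, out-of-distribution success rate) pair per pre-trained model.
# The values below are made up purely to show the correlation step.
jaccard_scores = [0.31, 0.45, 0.52, 0.60]
ood_success = [0.10, 0.35, 0.40, 0.62]
rho, p = spearmanr(jaccard_scores, ood_success)
print(f"Spearman rank correlation: {rho:.2f} (p = {p:.3f})")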
There's a lot of excellent work on pre-trained visual representations. This work wouldn't have been possible without the development and open-sourcing of existing PVR models such as VIP, MVP, and R3M. Please also see concurrent work that studies visual pre-training for robotics.
To evaluate the Jaccard index for a new ViT model, make the following changes in our fork of the IP-ViT codebase:
In get_model() in utils.py:
elif "my_model_name" == args.model_name:
import my_model
my_model.load_weights()
mean = (0.485, 0.456, 0.406) # imagenet example
std = (0.229, 0.224, 0.225)
In evaluate_segmentation.sh:
model_names=("my_model")
Then run:
./evaluate_segmentation.sh
Ensure that forward_selfattention or forward_attention is implemented in the model class for my_model.
An example wrapper for CLIP is provided at the top of utils.py.
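If your model does not already expose its attention maps, a wrapper along the following lines can work. This is only a sketch under the assumption of a timm-style ViT (with patch_embed, cls_token, pos_embed, and blocks whose attn modules expose qkv, num_heads, and scale); it is not the CLIP wrapper from utils.py, and the attribute names may need to be adapted to your model.

import torch
import torch.nn as nn

class AttentionWrapper(nn.Module):
    # Hypothetical wrapper that exposes the last block's self-attention map
    # in addition to the usual forward pass.

    def __init__(self, vit):
        super().__init__()
        self.vit = vit

    def forward(self, x):
        return self.vit(x)

    def forward_selfattention(self, x):
        # Re-run the last transformer block manually so we can return the
        # attention weights instead of the attended features.
        B = x.shape[0]
        x = self.vit.patch_embed(x)                        # [B, num_patches, dim]
        cls_tok = self.vit.cls_token.expand(B, -1, -1)
        x = torch.cat([cls_tok, x], dim=1) + self.vit.pos_embed
        for blk in self.vit.blocks[:-1]:
            x = blk(x)
        x = self.vit.blocks[-1].norm1(x)
        attn_mod = self.vit.blocks[-1].attn
        qkv = attn_mod.qkv(x).reshape(B, x.shape[1], 3, attn_mod.num_heads, -1)
        q, k, _ = qkv.permute(2, 0, 3, 1, 4)               # each: [B, heads, tokens, head_dim]
        attn = (q @ k.transpose(-2, -1)) * attn_mod.scale
        return attn.softmax(dim=-1)                        # [B, heads, tokens, tokens]

Returning the full per-head attention tensor (rather than a head-averaged map) matters here, since the segmentation score is computed from the attention heads.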
@article{Burns2023WhatMakesPVR,
  author  = {Burns, Kaylee and Witzel, Zach and Hamid, Jubayer Ibn and Yu, Tianhe and Finn, Chelsea and Hausman, Karol},
  title   = {What Makes Pre-Trained Visual Representations Successful for Robust Manipulation?},
  journal = {arXiv},
  year    = {2023},
}