Enabling robots to learn novel tasks in a data-efficient manner is a long-standing challenge. Common strategies involve carefully leveraging prior experiences, especially transition data collected on related tasks. Although much progress has been made for general pick-and-place manipulation, far fewer studies have investigated contact-rich assembly tasks, where precise control is essential.
We introduce SRSA (Skill Retrieval and Skill Adaptation), a novel framework designed to address this problem by utilizing a pre-existing skill library containing policies for diverse assembly tasks. The challenge lies in identifying which skill from the library is most relevant for fine-tuning on a new task. Our key hypothesis is that skills showing higher zero-shot success rates on a new task are better suited for rapid and effective fine-tuning on that task. To this end, we propose to predict the transfer success for all skills in the skill library on a novel task, and then use this prediction to guide the skill retrieval process. We establish a framework that jointly captures features of object geometry, physical dynamics, and expert actions to represent the tasks, allowing us to efficiently learn the transfer success predictor.
Extensive experiments demonstrate that SRSA significantly outperforms the leading baseline. When retrieving and fine-tuning skills on unseen tasks, SRSA achieves a 19% relative improvement in success rate, exhibits 2.6x lower standard deviation across random seeds, and requires 2.4x fewer transition samples to reach a satisfactory success rate, compared to the baseline. In a continual learning setup, SRSA efficiently learns policies for new tasks and incorporates them into the skill library, enhancing future policy learning. Furthermore, policies trained with SRSA in simulation achieve a 90% mean success rate when deployed in the real world.
In this work, we consider the problem setting of solving a new target task by leveraging pre-existing skills from a skill library. We focus on two-part assembly tasks as shown below.
Given a target task, we assume access to a prior task set; the skill library contains a policy that solves each of these prior tasks. To solve the target task, the goal of reinforcement learning is to find a policy that maps each state to an action so as to maximize the expected return. We propose to first retrieve the skill (i.e., policy) of the most relevant prior task, and then rapidly and effectively adapt it to the target task by fine-tuning.
To effectively retrieve the skills that are useful for a new target task $T$, we require a means to estimate the potential of applying a source policy to the target task. We are inspired by two intuitive points:
We treat any two tasks from the prior task set as a source-target task pair. For each pair, we evaluate the source policy on the target task to obtain its zero-shot transfer success rate. The transfer success predictor takes information about the source and target tasks as input and outputs the predicted zero-shot transfer success rate.
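To make this data-collection step concrete, the following minimal sketch assembles the (source task, target task, zero-shot success rate) triples used to supervise the predictor; `prior_tasks`, `skill_library`, and `evaluate_zero_shot` are hypothetical placeholders for the actual task set, policy library, and simulation-based evaluation routine.

```python
# Minimal sketch of building the transfer-success dataset from the prior task set.
def build_transfer_dataset(prior_tasks, skill_library, evaluate_zero_shot, num_episodes=1000):
    """Evaluate every source policy on every other prior task.

    Returns a list of (source_task, target_task, success_rate) tuples that
    serve as training labels for the transfer success predictor.
    """
    dataset = []
    for source_task in prior_tasks:
        policy = skill_library[source_task]          # skill trained on the source task
        for target_task in prior_tasks:
            if target_task == source_task:
                continue                             # only cross-task pairs are informative
            # Zero-shot evaluation: roll out the source policy on the target task
            success_rate = evaluate_zero_shot(policy, target_task, num_episodes)
            dataset.append((source_task, target_task, success_rate))
    return dataset
```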
We need a strong featurization of both the source and target tasks for efficient learning of the transfer success predictor. We propose a framework that jointly captures features of geometry, dynamics, and expert actions to represent the tasks.
The geometry, dynamics, and expert action features are concatenated together to form task features. We then pass the concatenated task features through an MLP to predict the transfer success.
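As an illustration, a predictor of this form could be implemented as below; the feature dimensions, hidden sizes, and output activation are illustrative assumptions rather than the exact architecture used in SRSA.

```python
import torch
import torch.nn as nn

class TransferSuccessPredictor(nn.Module):
    """Minimal sketch of the transfer success predictor F."""

    def __init__(self, geom_dim=128, dyn_dim=128, expert_dim=128, hidden_dim=256):
        super().__init__()
        # Source and target tasks are each described by geometry, dynamics,
        # and expert-action features; the source-target pair is concatenated
        # before being passed through the MLP.
        in_dim = 2 * (geom_dim + dyn_dim + expert_dim)
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 1), nn.Sigmoid(),  # success rate lies in [0, 1]
        )

    def forward(self, source_feat, target_feat):
        # source_feat / target_feat: concatenated (geometry, dynamics, expert-action) features
        return self.mlp(torch.cat([source_feat, target_feat], dim=-1)).squeeze(-1)
```

The sigmoid output keeps the predicted success rate in $[0, 1]$, matching the range of the labels collected above.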
At test time, we use the trained transfer success predictor to estimate the transfer success of applying each prior policy to the new task, and retrieve the source policies with the highest predicted transfer success.
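A minimal retrieval routine, assuming the predictor above and hypothetical dictionaries of task features and library policies, could look like this:

```python
import torch

def retrieve_skill(predictor, target_feat, prior_task_feats, skill_library):
    """Score every prior skill on the new task and return the policy with the
    highest predicted transfer success. The arguments are hypothetical
    stand-ins for the actual data structures."""
    with torch.no_grad():
        scores = {
            task: predictor(feat, target_feat).item()
            for task, feat in prior_task_feats.items()
        }
    best_task = max(scores, key=scores.get)
    return skill_library[best_task], scores[best_task]
```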
We compare our approach (SRSA) with the baseline retrieval strategies (Signature, Behavior, Forward, Geometry), as shown in the figure below. Overall, SRSA retrieves source policies that obtain around 10% higher success rates on the test tasks.
Our ultimate goal is to solve the new task as an RL problem. The retrieved skill is used to initialize the policy network, which we then fine-tune on the target task with proximal policy optimization (PPO) and self-imitation learning. We compare SRSA with the leading baseline, AutoMate, for learning specialist policies on two-part assembly tasks. We consider two settings: fine-tuning with dense rewards and fine-tuning with sparse rewards.
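For illustration, the adaptation stage can be sketched as follows; `collect_rollouts`, `ppo_update`, and `sil_update` are placeholders for the corresponding components of the underlying RL training code, not SRSA's actual implementation.

```python
import copy

def fine_tune_on_target(retrieved_policy, collect_rollouts, ppo_update, sil_update,
                        num_iters=1000):
    """High-level sketch of the adaptation stage."""
    policy = copy.deepcopy(retrieved_policy)   # initialize from the retrieved skill
    success_buffer = []                        # successful trajectories for self-imitation

    for _ in range(num_iters):
        trajectories = collect_rollouts(policy)                      # on-policy rollouts on the target task
        ppo_update(policy, trajectories)                             # standard PPO step
        success_buffer.extend(t for t in trajectories if t.success)  # keep the successes
        if success_buffer:
            sil_update(policy, success_buffer)                       # imitate the policy's own successes
    return policy
```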
We deploy the trained SRSA policies in simulation on 5 distinct assemblies and show the policy performance in the videos below:
[Videos: SRSA policy rollouts on 5 distinct assemblies in simulation]
We deploy the trained SRSA policies in the real world and directly compare them to AutoMate.
The experimental setup is the same as in AutoMate: we place the robot in lead-through (a.k.a. manual guidance) mode, grasp a plug, guide it into the socket, and record the pose as a target pose. We then programmatically lift the plug until it is free from contact; apply an $xy$ perturbation of $\pm 10$ mm, a $z$ perturbation of $15 \pm 5$ mm, and a yaw perturbation of $\pm 5^\circ$; apply $x$, $y$, and $z$ observation noise of $\pm 2$ mm each; and deploy a policy.
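For reference, a sketch of this randomization is given below; the function name and signature are illustrative assumptions, not taken from the AutoMate code.

```python
import numpy as np

def sample_initial_condition(target_x, target_y, target_z, target_yaw_deg):
    """Sample a perturbed start pose and observation noise (positions in meters, yaw in degrees)."""
    start_x = target_x + np.random.uniform(-0.010, 0.010)            # xy perturbation of +/-10 mm
    start_y = target_y + np.random.uniform(-0.010, 0.010)
    start_z = target_z + 0.015 + np.random.uniform(-0.005, 0.005)    # lift of 15 +/- 5 mm
    start_yaw = target_yaw_deg + np.random.uniform(-5.0, 5.0)        # yaw perturbation of +/-5 degrees
    obs_noise = np.random.uniform(-0.002, 0.002, size=3)             # +/-2 mm observation noise on x, y, z
    return (start_x, start_y, start_z, start_yaw), obs_noise
```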
We show the comparison between SRSA policies and AutoMate policies on 5 distinct assemblies in the videos below.
Answer: All source policies are initially trained with dense rewards, following the AutoMate approach for specialist policies. When fine-tuning on a target task with dense rewards, we initialize the critic network using the weights of the source policy's critic, as the reward scale is consistent with the one used in the original training. However, if we fine-tune with sparse rewards, we randomly initialize the critic network to accommodate the different reward scale and structure.
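A minimal sketch of this initialization rule is shown below; `build_fresh_critic` is a hypothetical constructor for a randomly initialized critic.

```python
import copy

def initialize_networks(source_actor, source_critic, build_fresh_critic, reward_type):
    """Sketch of the initialization rule described above."""
    # The actor is always warm-started from the retrieved source policy.
    actor = copy.deepcopy(source_actor)

    if reward_type == "dense":
        # Dense rewards keep the original reward scale, so the source critic's
        # value estimates remain meaningful and its weights are reused.
        critic = copy.deepcopy(source_critic)
    else:
        # Sparse rewards change the scale and structure of the return,
        # so the critic is re-initialized from scratch.
        critic = build_fresh_critic()
    return actor, critic
```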
Answer: Task 1053 is relatively simple since both the plug and socket are large and easy to manipulate. In the dense reward setting, following the AutoMate framework, there’s an imitation reward component that encourages the agent to follow a disassembly trajectory. However, this disassembly trajectory may introduce bias, potentially creating local minima or artifacts in the dense reward function. For simpler tasks like this one, sparse rewards that only reward success can provide a more direct and unbiased signal, leading to better performance.
Answer: This approach is indeed a straightforward method for skill retrieval, but it’s computationally intensive. Evaluating a single source policy on a target task requires 1,000 episodes to reliably measure the zero-shot transfer success rate, which takes about 20 minutes on a single V100 GPU. As the skill library grows, this process becomes increasingly costly and difficult to scale, since evaluating every source policy on each target task to identify the most relevant one would require significant computational resources. Therefore, we train the transfer success predictor F to provide effective and efficient guidance for retrieval. While its predictions may not always be perfectly accurate, we can retrieve the top-k source skills ranked by the predictor F. Among these k candidates, we identify the most relevant skill by evaluating their zero-shot transfer success on the target task, and ultimately select the skill with the best transfer performance. This top-k selection technique is grounded in the same intuition as introduced in Sec. 4.1: zero-shot transfer success serves as a reliable metric for skill relevance. We provide more details in Appendix A.12.
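The top-k procedure described above can be sketched as follows; all helper names and data structures are hypothetical stand-ins.

```python
import torch

def retrieve_top_k_then_verify(predictor, target_task, target_feat, prior_task_feats,
                               skill_library, evaluate_zero_shot, k=3):
    """Rank skills by predicted transfer success, then verify only the k best
    candidates in simulation and return the one that transfers best."""
    with torch.no_grad():
        ranked = sorted(
            prior_task_feats,
            key=lambda task: predictor(prior_task_feats[task], target_feat).item(),
            reverse=True,
        )
    candidates = ranked[:k]
    # Only k simulation evaluations instead of one per skill in the library.
    measured = {task: evaluate_zero_shot(skill_library[task], target_task)
                for task in candidates}
    best = max(measured, key=measured.get)
    return skill_library[best], measured[best]
```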
Answer: To manage storage as the skill library expands, we propose maintaining a fixed number of policies that maximize coverage of the task embedding space. Specifically, we propose clustering tasks based on a similarity matrix, where the similarity between two tasks is measured by the zero-shot transfer success rate of one task's policy on the other task. For each cluster, we can retain a single policy capable of addressing all tasks within that cluster.
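A possible instantiation of this pruning strategy, assuming a precomputed matrix of pairwise zero-shot transfer success rates and using off-the-shelf spectral clustering (an illustrative choice, not necessarily the clustering method we would ultimately use), is sketched below.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def prune_skill_library(transfer_success, task_ids, skill_library, num_keep):
    """Sketch of the proposed pruning strategy.

    transfer_success[i, j] is the zero-shot success rate of task i's policy on task j;
    all names are illustrative.
    """
    # Symmetrize the transfer-success matrix into a similarity matrix for clustering.
    similarity = 0.5 * (transfer_success + transfer_success.T)
    labels = SpectralClustering(
        n_clusters=num_keep, affinity="precomputed"
    ).fit_predict(similarity)

    kept = {}
    for c in range(num_keep):
        members = np.where(labels == c)[0]
        # Keep the policy that transfers best, on average, to the tasks in its own cluster.
        scores = transfer_success[np.ix_(members, members)].mean(axis=1)
        rep = members[int(np.argmax(scores))]
        kept[task_ids[rep]] = skill_library[task_ids[rep]]
    return kept
```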