DigiRL: Training In-The-Wild Device-Control Agents with Autonomous Reinforcement Learning

1 UC Berkeley, 2 UIUC, 3 Google DeepMind
*Equal contribution, alphabetic order; work done at UC Berkeley


Success rate and corresponding trajectories. A green final screen indicates a successful trajectory; a red final screen indicates a failed trajectory.


Training corpora for vision language models typically lack sufficient amounts of decision-centric data. This renders off-the-shelf VLMs sub-optimal for decision-making tasks such as in-the-wild device control through graphical user interfaces (GUIs). While training with static demonstrations has shown some promise, we show that such methods fall short when controlling real GUIs due to their failure to deal with real-world stochasticity not captured in static observational data. This paper introduces a novel autonomous RL approach, called DigiRL, for training in-the-wild device-control agents through fine-tuning a pre-trained VLM in two stages: offline RL to initialize the model, followed by offline-to-online RL. To do this, we build a scalable and parallelizable Android learning environment equipped with a VLM-based evaluator and develop a simple yet effective RL approach for learning in this domain. Our approach runs advantage-weighted RL with advantage estimators enhanced to account for stochasticity, along with an automatic curriculum for deriving maximal learning signal. We demonstrate the effectiveness of DigiRL using the Android-in-the-Wild (AitW) dataset, where our 1.5B VLM trained with RL achieves a 49.5% absolute improvement -- from 17.7% to 67.2% success rate -- over supervised fine-tuning with static human demonstration data. These results significantly surpass not only the prior best agents, including AppAgent with GPT-4V (8.3% success rate) and the 17B CogAgent trained with AitW data (14.4%), but also the prior best autonomous RL approach based on filtered behavior cloning (57.8%), thereby establishing a new state of the art for digital agents for in-the-wild device control.

AitW General
Search for some good Italian restaurants

AitW Web Shopping
Go to, and search for "Alienware Aurora"

DigiRL solves open-ended, realistic Android tasks with a novel online reinforcement learning algorithm and an autonomous VLM evaluator.


This table contains the success rate of all approaches measured in the DigiRL paper, covering two subsets: AitW General and AitW Web Shopping. The GPT-4V model version we use is gpt-4-vision-preview, and the Gemini-1.5-Pro version is gemini-1.5-pro-latest.

DigiRL: Autonomous RL for Building a Strong Device-Control Agent

Why RL over the alternatives?

  • LLM agent data such as device-control actions is poorly represented in the pre-training corpora of off-the-shelf proprietary VLMs such as GPT-4V and Gemini-1.5-Pro.
  • Supervised fine-tuning 1) requires a large amount of human demonstration data and 2) cannot recover from degrading model performance when real websites/applications change. As shown in the plot below, a frozen policy trained on prior data experiences a gradual drop in performance as websites change over time, while the DigiRL policy, which constantly updates on fresh autonomous data, maintains stable performance.

What are we using RL for?

DigiRL consists of two steps:
  • First, we use Offline RL to make the most out of a potentially sub-optimal existing offline dataset.
  • Then, we use Offline-to-Online RL to encourage the agent to learn from its own trials and errors.
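The two stages above can be sketched as a simple control flow. This is an illustrative stub, not the released training code: the policy class, rollout collector, and update rule are placeholders standing in for advantage-weighted updates and the parallel Android environment.

```python
# Illustrative sketch of DigiRL's two-stage recipe; all classes and
# functions here are placeholder stubs, not the actual released API.

class StubPolicy:
    def __init__(self):
        self.num_updates = 0

    def update(self, batch):
        # Stand-in for an advantage-weighted gradient step on a batch.
        self.num_updates += 1

def collect_rollouts(policy):
    # Stand-in for autonomous rollouts gathered in the parallel
    # Android environment and scored by the VLM evaluator.
    return [{"obs": "screenshot", "action": "tap", "reward": 0.0}]

def train_digirl(offline_dataset, num_online_iters=3):
    policy = StubPolicy()
    # Stage 1: offline RL extracts signal from the existing
    # (potentially sub-optimal) dataset.
    for batch in offline_dataset:
        policy.update(batch)
    # Stage 2: offline-to-online RL lets the agent learn from its
    # own trials and errors on fresh rollouts.
    for _ in range(num_online_iters):
        policy.update(collect_rollouts(policy))
    return policy
```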
DigiRL identifies the simplest yet most effective RL design choices for device-control agent problems. Our RL algorithmic framework automatically achieves the following advantages over state-of-the-art alternatives such as rejection sampling (or Filtered Behavior Cloning):
  • We use an instruction-level value function to implicitly construct an automatic curriculum that prioritizes the tasks most informative to the agent.
  • We use a step-level value function to pick out the advantageous actions (actions that mark progress toward the goal) in a trajectory while filtering out the noisy actions (actions that do not contribute to the goal).
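A minimal numeric sketch of these two value functions, under simplifying assumptions: the threshold-based filter and the quadratic curriculum weight below are illustrations of the idea, not the paper's exact weighting scheme.

```python
# Illustrative sketch of DigiRL's step-level and instruction-level value
# functions; the hard threshold and the curriculum weight formula are
# simplifications chosen for clarity, not the paper's exact estimators.

def step_advantage(reward, v_s, v_next, gamma=1.0):
    # Step-level advantage: positive when an action moves the agent
    # toward the goal, negative for noisy or counterproductive steps.
    return reward + gamma * v_next - v_s

def select_training_steps(trajectory, threshold=0.0):
    # Keep only the advantageous actions from a trajectory for the
    # policy update, dropping steps that do not contribute to the goal.
    return [
        t for t in trajectory
        if step_advantage(t["reward"], t["v_s"], t["v_next"]) > threshold
    ]

def curriculum_weight(task_value):
    # Instruction-level value ~ the task's success probability under the
    # current policy. Tasks that are neither trivially easy nor hopeless
    # carry the most learning signal (this toy weight peaks at 0.5).
    return task_value * (1.0 - task_value)
```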
Please check out our paper for more details of our algorithm!

Learning Curves

In addition to the convergence performance reported in the paper, we also present a sample-complexity comparison of DigiRL against the state-of-the-art alternative, Filtered Behavior Cloning (or rejection sampling). We find that DigiRL not only converges to better final performance but also learns more efficiently.

Autonomous Evaluation

Our main results are autonomously evaluated with Gemini-1.5-Pro. We also manually evaluate some subsets and find that the autonomous evaluation aligns closely with manual evaluation, with an average difference of less than 3%:
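The evaluator reduces to a prompt over the final screenshot plus a parser that maps the VLM's reply to a binary success reward. The template and parsing rule below are a hedged illustration of this pattern, not the exact prompt sent to Gemini-1.5-Pro.

```python
# Illustrative autonomous-evaluator scaffolding; the prompt wording and
# yes/no parsing rule are assumptions, not DigiRL's exact evaluator.

EVAL_PROMPT = (
    "Task: {task}\n"
    "You are shown the final screenshot of an Android agent's trajectory.\n"
    "Answer with a single word, Yes or No: did the agent complete the task?"
)

def build_eval_prompt(task):
    # The prompt and the final screenshot would be sent together to the
    # VLM evaluator (e.g. Gemini-1.5-Pro via its multimodal API).
    return EVAL_PROMPT.format(task=task)

def parse_eval_answer(answer):
    # Map the evaluator's free-form reply to a binary success reward.
    return 1.0 if answer.strip().lower().startswith("yes") else 0.0
```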

Failure Mode Analysis

While all failure modes are reduced by offline and offline-to-online RL training, the most consistent and significant reduction is in failing to recover from mistakes. By training on autonomously collected rollouts, our agent DigiRL learns from its own mistakes and steadily reduces this failure mode over the course of training.


Re the Charts 📈

Try clicking on the legend of the charts!

Re the Icon logo

Infinity: Our environment is open-ended and, with our open-ended evaluator, easily generalizes to an unbounded set of open-ended tasks.
Loop: We use online reinforcement learning, which is closed-loop: the agent interacts with the environment and learns from its own trials and errors.


@article{bai2024digirl,
  title={DigiRL: Training In-The-Wild Device-Control Agents with Autonomous Reinforcement Learning},
  author={Bai, Hao and Zhou, Yifei and Cemri, Mert and Pan, Jiayi and Suhr, Alane and Levine, Sergey and Kumar, Aviral},
  journal={arXiv preprint arXiv:2406.11896},
  year={2024}
}