
Digi-Q: Learning VLM Q-Value Functions for Training Device-Control Agents

1 UC Berkeley, 2 UIUC, 3 Amazon, 4 CMU
*Equal contribution

Demo


Method overview of Digi-Q compared to policy-based methods. Click the button to switch between methods.

Abstract

While most paradigms for building foundation model agents rely on prompting or fine-tuning on demonstrations, these approaches are not sufficient in dynamic environments (e.g., mobile device control). On-policy reinforcement learning (RL) should address these limitations, but collecting actual rollouts in the environment is often undesirable for truly open-ended agentic tasks such as mobile device control, where simulation is a bottleneck. In such scenarios, an offline method for policy improvement that uses a trained value function to train the policy is much more practical. In this paper, we develop a scalable, value-based offline RL approach called Digi-Q to train VLM agents for device control entirely from static data. The key idea in Digi-Q is to train a value function using offline temporal-difference (TD) learning. We show that this can be done by fine-tuning a Q-function on top of frozen, intermediate-layer features of a VLM rather than fine-tuning the whole VLM itself, which saves compute and improves training stability. To make the VLM features amenable to representing the value function, we employ an initial phase of fine-tuning that amplifies coverage of actionable information critical for Q-functions. Once trained, we use this value function alongside a best-of-N policy extraction operator that imitates the best action out of multiple candidate actions from the current policy, as ranked by the value function, enabling policy improvement without ever needing to use the simulator. Digi-Q outperforms several prior methods on user-scale device-control tasks in Android-in-the-Wild, attaining a 9.9% improvement over the prior best-performing method.

Digi-Q: Training VLM Q-Value Functions for Agentic Policy Learning

Why do we need to train an off-policy Q function for device-control agents?

  • While policy-based methods such as PPO and DigiRL can achieve strong performance by sampling large amounts of on-policy data, they may be impractical for real-world device-control tasks, where simulation is slow and restricted due to safety concerns.
  • Value-based methods instead learn an off-policy Q function from historically collected data that can reliably score a policy's actions (a generic form of this objective is sketched below). This significantly simplifies the recipe for policy improvement, removing the need for costly simulation.
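
For reference, a Q function of this kind is typically trained with a one-step temporal-difference (TD) objective of the following form (a generic formulation for illustration; see the paper for the exact objective used in Digi-Q):

$$\mathcal{L}(\theta) \;=\; \mathbb{E}_{(s,\,a,\,r,\,s') \sim \mathcal{D}}\!\left[\Big(Q_\theta(s, a) - \big(r + \gamma\, \bar{Q}(s', a')\big)\Big)^{2}\right], \qquad a' \sim \pi(\cdot \mid s'),$$

where $\mathcal{D}$ is the offline dataset, $\gamma$ is the discount factor, and $\bar{Q}$ is a slow-moving target network used to stabilize TD learning.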

What are the challenges for training an off-policy Q function with foundation models?

  • (1) Instability when running temporal-difference (TD) learning with models that have billions of parameters.
  • (2) Inefficiency of TD backups per unit of compute (i.e., gradient steps spent).

How does Digi-Q deal with these challenges?

  • Digi-Q trains the Q function on top of a frozen intermediate layer of the VLM, after an initial phase of representation fine-tuning that primes the VLM's features to be more amenable to TD learning.
  • Once the Q function is trained, a best-of-N policy-extraction objective trains the agentic policy to imitate the highest-rated action under the Q function, without any additional environment interaction (a minimal sketch of both steps is shown below).
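
Below is a minimal sketch of these two steps in PyTorch. It is illustrative only: the vlm_features(obs, action) helper (returning frozen, intermediate-layer VLM features for a batch of screenshot-action pairs), the policy.sample / policy.imitation_loss interfaces, and all hyperparameters are assumptions made for the sketch, not the paper's exact implementation.

# Minimal sketch (not the paper's exact implementation). Assumes a hypothetical
# `vlm_features(obs, action)` helper that returns frozen, intermediate-layer VLM
# features for a batch of (screenshot, action) pairs, and a `policy` object with
# `sample` and `imitation_loss` methods; hyperparameters are illustrative.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

class QHead(nn.Module):
    """Small MLP Q-head trained on top of frozen VLM features."""
    def __init__(self, feat_dim, hidden=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, feats):
        return self.net(feats).squeeze(-1)

q = QHead(feat_dim=4096)
q_target = copy.deepcopy(q)                  # slow-moving target network
opt = torch.optim.Adam(q.parameters(), lr=3e-4)
gamma, tau = 0.98, 0.005

def td_step(batch, policy, vlm_features):
    """One TD(0) update on a batch of offline transitions (s, a, r, s')."""
    feats = vlm_features(batch["obs"], batch["action"])       # frozen VLM features
    with torch.no_grad():
        next_action = policy.sample(batch["next_obs"])        # a' ~ pi(. | s')
        next_feats = vlm_features(batch["next_obs"], next_action)
        target = batch["reward"] + gamma * (1 - batch["done"]) * q_target(next_feats)
    loss = F.mse_loss(q(feats), target)
    opt.zero_grad(); loss.backward(); opt.step()
    with torch.no_grad():                                     # Polyak-average the target
        for p, p_t in zip(q.parameters(), q_target.parameters()):
            p_t.mul_(1 - tau).add_(tau * p)
    return loss.item()

def best_of_n_step(batch, policy, vlm_features, n=16):
    """Rank N candidate actions with the learned Q-function; imitate the best one."""
    obs = batch["obs"]
    with torch.no_grad():
        candidates = [policy.sample(obs) for _ in range(n)]   # N proposals per state
        scores = torch.stack([q(vlm_features(obs, a)) for a in candidates])  # (N, B)
        best = scores.argmax(dim=0).tolist()                  # index of best candidate
    best_actions = [candidates[best[i]][i] for i in range(len(best))]
    return policy.imitation_loss(obs, best_actions)           # supervised policy update

The initial representation fine-tuning phase, which makes the VLM features more action-aware before they are frozen, is omitted from this sketch.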

Experiment results on Android-in-the-Wild

Digi-Q achieves superior performance compared to other state-of-the-art RL baselines trained on historically collected data. Surprisingly, it is even comparable to the strongest online RL baseline, DigiRL, without relying on costly online simulation! Please refer to our paper for more analysis.


This table reports the success rates of all approaches evaluated in our paper on two subsets: AitW General and AitW Web Shopping.

Example Policy Rollouts


Task: go to ebay.com, search for duracell a, and select the first item
Click on the buttons to switch between the views.

Scaling performance of Digi-Q

We compare the performance of Digi-Q and the state-of-the-art offline baseline, DigiRL, across different scales of offline data, and we find that Digi-Q has the following benefits:

  • (1) Data efficiency. By reusing off-policy data from the replay buffer, Digi-Q achieves better performance with less data.
  • (2) Convergence performance. By performing per-step credit assignment, Digi-Q converges to better final performance.

Misc

About the logo

The “Q” hat on the Android is Chinese brush calligraphy with a hidden “∞” inside, a nod to DigiRL (where the Android wears an “∞” hat).

BibTeX

@article{bai2025digiq,
  title={Digi-Q: Learning VLM Q-Value Functions for Training Device-Control Agents},
  author={Bai, Hao and Zhou, Yifei and Li, Erran Li and Levine, Sergey and Kumar, Aviral},
  journal={},
  year={2025}
}