Weasel: Out-of-Domain Generalization for Web Agents via Importance-Diversity Data Selection

1Seoul National University   2McGill University   3Mila – Quebec AI Institute   4Canada CIFAR AI Chair
ICML 2026
*Equal contribution
Challenge

Agents struggle to generalize out of domain, i.e., to unseen websites at evaluation

In-Domain train: training data matches the test distribution and looks the same as the test environment.
Out-of-Domain train: training data comes from visibly different sites and interaction patterns than the test environment.

Test domain is fixed (e.g., WebArena-Lite). Realistic offline corpora like AgentTrek are out-of-domain; agents trained on them must transfer to unseen websites and interaction patterns.

Approach

How Weasel selects training data

Result

Trained on out-of-domain data, Weasel matches, and even exceeds, the in-domain ceiling

Bar chart, WebArena-Lite SR. In-Domain: WebArena 20.0. Out-of-Domain: w/o Train 16.4, AgentTrek 17.7, AgentTrek w/ Weasel 21.2.

Qwen3-8B on WebArena-Lite. The in-domain ceiling is 20.0 (training on WebArena). Without any in-domain data, the untrained base model reaches only 16.4, and SFT on AgentTrek reaches 17.7. With Weasel, training on AgentTrek reaches 21.2, overtaking the in-domain ceiling without seeing any WebArena task.

Conventional web agents drop sharply under out-of-domain shifts to unseen websites and interaction patterns. Weasel scores offline demonstration steps for goal relevance and diversity, then runs greedy subset selection under a fixed budget, yielding a +4.8 SR gain over the untrained base model (16.4 → 21.2) on zero-shot transfer (AgentTrek → WebArena-Lite, Qwen3-8B).

Abstract

Large language models (LLMs) have enabled web agents that follow natural language goals through multi-step browser interactions. However, agents fine-tuned on specific trajectories and domains often struggle to generalize out of domain, and offline training can be compute-inefficient due to noisy, redundant trajectories and long accessibility-tree (AXTree) states.

To address both issues, we propose Weasel, a trajectory selection method for offline training of web agents. Weasel selects a fixed-budget subset of trajectory steps by optimizing an objective that balances unary importance with pairwise diversity over states, websites, and interaction patterns; the objective is solved efficiently with a greedy algorithm. We further improve efficiency with action-centered AXTree pruning that keeps only content around the ground-truth action target, and we mitigate style mismatch for reasoning-native models by replacing expert traces with model-generated, style-consistent rationales.

Across AgentTrek and NNetNav training datasets, evaluations on WebArena, WorkArena, and MiniWob, and experiments with Qwen2.5-7B, Gemma3-4B, and Qwen3-8B, Weasel improves out-of-domain performance while reducing training cost, producing roughly 9.7–12.5× training speed-ups over standard fine-tuning.

Method Overview

WEASEL method figure: left shows a curated trajectory with noisy and erroneous steps removed; right shows the greedy importance-diversity subset selection procedure.

Left: An example curated trajectory after applying Weasel. Although the original demonstration contains noisy ($t=4$) and erroneous ($t=0$) steps, Weasel retains only the most informative steps (highlighted in red).

Right: We compute a unary importance score $\Phi(t) = \text{BERTScore}(g, s_t)$ that measures semantic alignment between the goal $g$ and each step's state $s_t$, and a pairwise diversity score $D(i,j) = \max\!\bigl(\delta(s_i,s_j),\,\delta(y_i,y_j)\bigr)$ where $\delta(\cdot,\cdot)=1-\text{BERTScore}$ and $y_t=[r_t;a_t]$ is the reasoning–action pair. Greedy subset selection initializes with the best-scoring pair and incrementally adds the index with the largest marginal gain on the importance–diversity objective until the budget $T_0$ is met.
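The two scores can be sketched as follows. This is a minimal illustration, not the authors' implementation: the similarity function is passed in as a parameter, and `jaccard_similarity` is a toy token-overlap stand-in used here only so the example runs without a model download; in Weasel the similarity is BERTScore. All function names are hypothetical.

```python
from typing import Callable, List, Sequence

# Any function mapping two strings to a similarity in [0, 1].
Similarity = Callable[[str, str], float]


def jaccard_similarity(a: str, b: str) -> float:
    """Toy stand-in for BERTScore (assumption): token-set overlap in [0, 1]."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def importance_scores(goal: str, states: Sequence[str], sim: Similarity) -> List[float]:
    """Phi(t) = sim(g, s_t): alignment between the goal and each step's state."""
    return [sim(goal, state) for state in states]


def diversity_matrix(
    states: Sequence[str], answers: Sequence[str], sim: Similarity
) -> List[List[float]]:
    """D(i, j) = max(delta(s_i, s_j), delta(y_i, y_j)) with delta = 1 - sim.

    `answers[t]` is the concatenated reasoning-action text y_t = [r_t; a_t].
    """
    n = len(states)
    div = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            state_dist = 1.0 - sim(states[i], states[j])
            answer_dist = 1.0 - sim(answers[i], answers[j])
            div[i][j] = div[j][i] = max(state_dist, answer_dist)
    return div
```

The max-composition means two steps count as diverse if either their states or their reasoning-action pairs differ, so a repeated action on a new page is not treated as redundant.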

Subset selection objective

$\displaystyle \max_{J \subseteq \{0,\dots,T-1\}}\;\sum_{j \in J}\Phi(j)\;+\;\lambda\!\!\sum_{\substack{i\lt j\\ i,j\in J}}\!\!D(i,j)\quad\text{s.t.}\;|J|=T_0$

Although this max-sum diversification problem is NP-hard, a simple greedy algorithm is near-optimal in practice: it matches the exact optimum in over 96% of NNetNav trajectories, with a greedy/optimal objective ratio of $0.9999 \pm 0.0005$.
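A minimal sketch of the greedy procedure, assuming the importance scores and the diversity matrix are already computed. It reads "best-scoring pair" as the pair that maximizes the objective restricted to two indices, which is an interpretation rather than something the text states; `greedy_select`, `exhaustive_select`, and the default `lam=1.0` are hypothetical. The brute-force search is included only to compare against greedy on short trajectories.

```python
from itertools import combinations
from typing import List, Sequence


def objective(subset: Sequence[int], phi: Sequence[float],
              div: Sequence[Sequence[float]], lam: float) -> float:
    """Sum of Phi(j) over the subset plus lam times the sum of D(i, j) over pairs i < j."""
    unary = sum(phi[j] for j in subset)
    pairwise = sum(div[i][j] for i, j in combinations(sorted(subset), 2))
    return unary + lam * pairwise


def greedy_select(phi: Sequence[float], div: Sequence[Sequence[float]],
                  budget: int, lam: float = 1.0) -> List[int]:
    """Greedy approximation to the fixed-budget importance-diversity objective."""
    n = len(phi)
    if budget <= 0:
        return []
    if budget >= n:
        return list(range(n))
    if budget == 1:
        return [max(range(n), key=lambda j: phi[j])]
    # Initialize with the pair that scores best on the objective (assumed reading).
    selected = list(max(combinations(range(n), 2),
                        key=lambda pair: objective(pair, phi, div, lam)))
    while len(selected) < budget:
        remaining = [k for k in range(n) if k not in selected]
        # Marginal gain of adding k: its importance plus its diversity to the chosen set.
        best = max(remaining,
                   key=lambda k: phi[k] + lam * sum(div[k][j] for j in selected))
        selected.append(best)
    return sorted(selected)


def exhaustive_select(phi: Sequence[float], div: Sequence[Sequence[float]],
                      budget: int, lam: float = 1.0) -> List[int]:
    """Exact optimum by enumeration; only feasible for short trajectories."""
    return list(max(combinations(range(len(phi)), budget),
                    key=lambda subset: objective(subset, phi, div, lam)))
```

Each greedy step costs time linear in the number of remaining steps times the current subset size, whereas the exhaustive search enumerates every size-$T_0$ subset, which is why the comparison against the exact optimum is only practical offline.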

Target-Centered AXTree Pruning

Figure: Target-centered AXTree pruning keeps a window of length $2w+1$ around the gold action node.
Figure: Token-count distribution of 10K AgentTrek samples before (green) and after (blue) pruning; sequences are substantially shorter after pruning.

A single linearized AXTree state in AgentTrek can reach 180K tokens: costly to train on, and most of those tokens are irrelevant to the expert's next action.

For node-grounded actions, we retain a contiguous window of length $2w+1$ centered at the gold target node $v_{t,k_t^\star}$, producing a compact pruned state $\tilde{s}_t$ that preserves the local context most useful for predicting the expert action. This is applied as a preprocessing step before Weasel's subset selection.
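The windowing step can be sketched as below. The sketch assumes the linearized AXTree is available as a list of node strings, that the window is measured in nodes, and that it is truncated (not shifted) when the target sits near either end of the tree; none of these details are specified above, and `prune_around_target` is a hypothetical name.

```python
from typing import List


def prune_around_target(nodes: List[str], target_index: int, w: int) -> List[str]:
    """Keep up to 2w + 1 contiguous nodes centered at the gold target node.

    Assumption: near the start or end of the tree the window is clipped,
    so the result can be shorter than 2w + 1.
    """
    if not 0 <= target_index < len(nodes):
        raise IndexError("target_index must point at a node in the tree")
    if w < 0:
        raise ValueError("w must be non-negative")
    start = max(0, target_index - w)
    end = min(len(nodes), target_index + w + 1)
    return nodes[start:end]
```

For example, with 10 nodes, a target at index 5, and `w = 2`, the pruned state contains nodes 3 through 7; with the target at index 0 it contains only nodes 0 through 2.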

Pruning alone yields a training speed-up at no extra inference cost, and our target-centered window outperforms prefix and semantic-ranking baselines under a matched 32% token budget.

Self-Reasoning Synthesis

Figure: Naive SFT (style mismatch) trains the Qwen3 base model on an expert trace written by another model, which degrades performance. Self-reasoning synthesis (ours) trains the Qwen3 base model on a self-generated trace produced by Qwen3, which is style-consistent.
WebArena-Lite SR, Qwen3-8B (10K subset): SFT 15.8; SFT + RS 17.6; Weasel + RS 21.2.

Reasoning-native models (e.g., Qwen3) emit intermediate reasoning traces before their final outputs. When the reasoning traces in the training corpus were produced by a different model, naive SFT introduces a style mismatch that can destabilize training and even drag performance below the base model.

To fix this, we replace the expert reasoning trace with a self-generated rationale produced by the same target model, conditioned on the goal, interaction history, pruned state, and the gold action, while keeping the original action supervision intact.
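A rough sketch of how one training step might be rewritten under this scheme. The prompt wording, the `Step` fields, and the `generate` callable (a stand-in for sampling from the target model) are all assumptions made for illustration; the actual prompt and generation setup are not given here.

```python
from dataclasses import dataclass, replace
from typing import Callable, Tuple


@dataclass(frozen=True)
class Step:
    goal: str
    history: Tuple[str, ...]   # previous actions in the episode
    pruned_state: str          # AXTree after target-centered pruning
    reasoning: str             # expert trace, to be replaced
    action: str                # gold action, kept as supervision


def build_rationale_prompt(step: Step) -> str:
    """Condition on goal, history, pruned state, and the gold action (hypothetical wording)."""
    history = "\n".join(step.history) if step.history else "(no previous actions)"
    return (
        f"Goal: {step.goal}\n"
        f"Previous actions:\n{history}\n"
        f"Current page (pruned AXTree):\n{step.pruned_state}\n"
        f"Correct next action: {step.action}\n"
        "Explain, in your own words, the reasoning that leads to this action."
    )


def synthesize_self_reasoning(step: Step, generate: Callable[[str], str]) -> Step:
    """Swap the expert trace for a rationale from the target model; the action is unchanged."""
    rationale = generate(build_rationale_prompt(step)).strip()
    return replace(step, reasoning=rationale)
```

Because only the `reasoning` field is replaced, the action supervision stays identical to the original demonstration while the trace matches the target model's own style.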

For Qwen3-8B on WebArena-Lite, self-reasoning synthesis (RS) is critical: adding RS to SFT lifts SR from 15.8 to 17.6, and combining RS with Weasel's selection reaches 21.2, the highest result among all variants.

Zero-Shot Transfer Results

Agents are fine-tuned on AgentTrek and evaluated on WebArena-Lite, WebArena, MiniWob, and WorkArena without any in-domain training. Weasel delivers the best accuracy–efficiency trade-off across all three base models.

| Model | WebArena-Lite SR | WebArena SR | MiniWob SR | WorkArena L1 / L2 | Training Time | Speed-up |
| --- | --- | --- | --- | --- | --- | --- |
| Qwen2.5-7B-Instruct | | | | | | |
| Base (no train) | 5.5 | 5.2 | 41.8 | 4.8 / 0.0 | n/a | n/a |
| + Full SFT (52K) | 10.9 | 8.7 | 44.6 | 12.1 / 0.4 | 136.0 hr | 1.0× |
| + Pruning + Random (10K) | 9.1 | 8.1 | 46.7 | 9.8 / 3.0 | 12.0 hr | 11.3× |
| + Pruning + LLM-Judge (10K) | 8.5 | 7.8 | 45.4 | 8.5 / 3.0 | 12.0 hr | 11.3× |
| + Weasel (10K) | 14.5 | 9.5 | 48.0 | 12.4 / 4.7 | 12.0 hr | 11.3× |
| Gemma3-4B-IT | | | | | | |
| Base (no train) | 6.7 | 2.7 | 27.4 | 3.6 / 0.0 | n/a | n/a |
| + Full SFT (52K) | 9.1 | 4.3 | 28.6 | 3.3 / 0.0 | 80.0 hr | 1.0× |
| + Pruning + Random (10K) | 8.5 | 4.7 | 29.4 | 2.4 / 2.1 | 6.4 hr | 12.5× |
| + Pruning + LLM-Judge (10K) | 6.7 | 5.2 | 29.3 | 4.5 / 2.1 | 6.4 hr | 12.5× |
| + Weasel (10K) | 11.5 | 5.5 | 30.6 | 4.5 / 3.0 | 6.4 hr | 12.5× |
| Qwen3-8B (reasoning-native) | | | | | | |
| Base (no train) | 16.4 | 18.0 | 61.1 | 35.2 / 1.7 | n/a | n/a |
| + Full SFT (52K) | 17.7 | 18.2 | 59.4 | 33.3 / 2.1 | 88.5 hr | 1.0× |
| + Pruning + Random (10K) | 16.5 | 17.5 | 61.4 | 33.9 / 3.4 | 7.0 hr | 12.6× |
| + Pruning + LLM-Judge (10K) | 19.4 | 16.6 | 61.9 | 35.2 / 3.8 | 7.0 hr | 12.6× |
| + Weasel (10K) | 21.2 | 19.2 | 61.9 | 38.8 / 4.3 | 8.3 hr | 10.7× |
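
As a consistency check on the last two columns, and assuming the speed-up is the Full SFT training time divided by the method's training time (the formula is not stated, but the reported entries agree with this reading):

$\displaystyle \frac{136.0}{12.0}\approx 11.3,\qquad \frac{80.0}{6.4}=12.5,\qquad \frac{88.5}{7.0}\approx 12.6,\qquad \frac{88.5}{8.3}\approx 10.7$

The last ratio is the Qwen3-8B Weasel row, whose longer training time (8.3 hr versus 7.0 hr for the other 10K subsets) accounts for its lower speed-up.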

Robustness Across Datasets: NNetNav-Live

Replacing AgentTrek with the real-world NNetNav-Live training set, Weasel remains the strongest method, with a 9.7× training speed-up over the 52K full run.

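| Qwen2.5-7B-Instruct | WebArena-Lite | WebArena | MiniWob | WorkArena L1 | WorkArena L2 | Speed-up |
| --- | --- | --- | --- | --- | --- | --- |
| Base (no train) | 5.5 | 5.2 | 41.8 | 4.8 | 0.0 | n/a |
| + Full SFT (52K) | 10.9 | 6.9 | 38.9 | 5.2 | 6.4 | 1.0× |
| + Pruning + Random (10K) | 10.3 | 7.4 | 32.3 | 7.0 | 6.0 | 9.7× |
| + Weasel (10K) | 12.1 | 8.3 | 41.8 | 7.6 | 6.8 | 9.7× |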

Ablation

WebArena-Lite SR (Qwen2.5-7B, 10K subset of AgentTrek).

Importance × Diversity is complementary

Combining the unary importance term $\Phi$ with the pairwise diversity term $D$ outperforms either alone.

Importance only: 10.9
Diversity only: 7.9
Weasel (both): 14.5

State + answer diversity

Diversifying over both states and reasoning–action pairs via a max-composition is best.

State only: 9.7
Answer only: 13.9
Weasel (both): 14.5

BibTeX

@inproceedings{pesaranzadeh2026weasel,
  title     = {{WEASEL}: Out-of-Domain Generalization for Web Agents via Importance-Diversity Data Selection},
  author    = {Pesaran zadeh, Fatemeh and Choi, Seyeon and L\`u, Xing Han and Reddy, Siva and Kim, Gunhee},
  booktitle = {Proceedings of the 43rd International Conference on Machine Learning},
  year      = {2026}
}