TOPPO: Rethinking PPO for Multi-Task Reinforcement Learning with Critic Balancing
Rui Miao
Department of Mathematical Sciences
Texas AI Research Institute
University of Texas at Dallas
Meta-World (MT50, https://meta-world.github.io/)
The minimum RL vocabulary
An MDP is a sequential data-generating process. The policy changes the data distribution; the critic estimates the expected future return.
From one MDP to \(K\) related MDPs
Multi-task RL trains one task-aware actor and critic across many related tasks. The task identity is part of the input, but parameters are shared.
- Common structure: shared state/action representation.
- Heterogeneity: task-specific dynamics, rewards, and learning speed.
- Question: whose gradient controls the shared estimator?
Why the worst-\(k\) tasks matter
A high mean can hide tasks that never become reliable. In a pipeline of skills, the weakest stages set the usable system reliability.
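A minimal sketch of a worst-\(k\) summary, assuming it is defined as the mean success rate over the \(k\) lowest-scoring tasks. The helper name `worst_k_mean` and the example rates are hypothetical and not taken from the paper.

```python
def worst_k_mean(success_rates, k):
    """Mean success over the k lowest-scoring tasks (assumed definition)."""
    if not 1 <= k <= len(success_rates):
        raise ValueError("k must be between 1 and the number of tasks")
    lowest = sorted(success_rates)[:k]
    return sum(lowest) / k


# Hypothetical per-task success rates (%), chosen only for illustration.
rates = [100.0, 100.0, 95.0, 90.0, 0.0, 10.0]
overall_mean = sum(rates) / len(rates)   # looks acceptable on its own
tail_mean = worst_k_mean(rates, k=2)     # exposes the two unreliable tasks
```

In this toy case the overall mean is about 66% while the worst-2 mean is 5%, which is the gap the tail metric is meant to surface.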
Official Meta-World broad-suite GIF. MT50 trains one task-aware policy on all 50 tasks; the official site does not provide a separate MT50 GIF.
Soft Actor-Critic became the default MTRL backbone
Soft Actor-Critic (Haarnoja et al., 2018)
- Maximum-entropy actor-critic objective.
- Off-policy replay reuses past transitions.
- Target critics and replay stabilize TD targets.
- Strong sample efficiency made SAC natural for robot manipulation benchmarks.
The follow-up MT50 line was SAC-style
- Meta-World benchmark: Yu et al. (2020).
- Soft Modularization: Yang et al. (2020).
- CARE: Sodhani et al. (2021).
- PaCo (2022), MOORE (2023), and ARS-family baselines (ARS / ARS-LN / ARS-LoRA, 2025) continue the SAC-centered comparison set.
Proximal Policy Optimization (PPO) is a strong on-policy workhorse, but it was not the historical center of MT50 method design. We use it to expose what breaks in shared multi-task updates.
SAC optimizes reward plus entropy
What this buys in MTRL
- Entropy keeps exploration alive across tasks.
- Replay mixes old and new data across the task set.
- Soft values make target changes less abrupt.
action-value critic
soft state value
SAC trains from replayed Bellman targets
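For reference, the SAC quantities named above can be written in their standard form from Haarnoja et al. (2018). The notation is an assumption of this sketch: temperature \(\alpha\), discount \(\gamma\), replay buffer \(\mathcal D\), critic parameters \(\phi\), and target-critic parameters \(\bar\phi\). Multi-task variants additionally condition the policy and critic on a task ID.

```latex
\begin{aligned}
J(\pi) &= \sum_t \mathbb{E}_{(s_t,a_t)\sim\rho_\pi}
          \big[\, r(s_t,a_t) + \alpha\,\mathcal{H}\big(\pi(\cdot\mid s_t)\big) \big]
          && \text{(reward plus entropy)} \\
V(s)   &= \mathbb{E}_{a\sim\pi}\big[\, Q(s,a) - \alpha \log \pi(a\mid s) \big]
          && \text{(soft state value)} \\
y      &= r + \gamma\, \mathbb{E}_{a'\sim\pi}\big[\, Q_{\bar\phi}(s',a') - \alpha \log \pi(a'\mid s') \big]
          && \text{(Bellman target from a replayed transition)} \\
J_Q(\phi) &= \mathbb{E}_{(s,a,r,s')\sim\mathcal{D}}
          \Big[\tfrac12 \big(Q_\phi(s,a) - y\big)^2\Big]
          && \text{(action-value critic loss)}
\end{aligned}
```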
Multi-task SAC variants: a compact map
| Family | Typical change | Examples and references |
|---|---|---|
| SAC backbone | Maximum-entropy off-policy actor-critic | Soft Actor-Critic: Haarnoja et al. (2018) |
| Shared / multi-head | Shared trunk with task ID or task-specific heads | SAC-MT, MT-MH-SAC; Meta-World / Meta-World+ protocols |
| Gradient surgery | Modify cross-task gradient aggregation | PCGrad: Yu et al. (2020); CAGrad: Liu et al. (2021); FairGrad: Ban et al. (2024) |
| Modules / experts | Specialize representations or compose sub-networks | Soft Modularization: Yang et al. (2020); CARE: Sodhani et al. (2021); PaCo: Sun et al. (2022); MOORE: Hendawy et al. (2023) |
| ARS-family | Modern strong actor-regularized SAC-style baselines | ARS / ARS-LN / ARS-LoRA: Cho et al. (2025) |
We now diagnose PPO directly and argue for a paradigm shift in multi-task RL: fix the critic-side update before adding more architecture.
PPO: collect, estimate, clip, fit value
Multi-task PPO averages per-task losses
Important implementation detail
- Rollouts are stratified by task.
- Advantages are normalized within each task slice before the actor surrogate.
- The actor and critic share information only through the common parameters.
- Every minibatch still needs one actor direction and one critic direction.
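The bullets above can be sketched as follows: advantages are standardized inside each task slice, the standard PPO clipped surrogate is evaluated per task, and the per-task losses are averaged. The function names, the `batch` layout (task id mapped to probability ratios and advantages), and the clip value 0.2 are illustrative assumptions, not the paper's code.

```python
import math


def normalize_within_task(advantages, eps=1e-8):
    """Standardize one task's advantages to zero mean and unit variance."""
    n = len(advantages)
    mean = sum(advantages) / n
    var = sum((a - mean) ** 2 for a in advantages) / n
    return [(a - mean) / (math.sqrt(var) + eps) for a in advantages]


def clipped_surrogate(ratios, advantages, clip_eps=0.2):
    """Standard PPO clipped objective, returned as a loss (negated mean)."""
    terms = []
    for r, a in zip(ratios, advantages):
        r_clipped = min(max(r, 1.0 - clip_eps), 1.0 + clip_eps)
        terms.append(min(r * a, r_clipped * a))
    return -sum(terms) / len(terms)


def multitask_actor_loss(batch, clip_eps=0.2):
    """Average per-task clipped losses; batch maps task id -> (ratios, advantages)."""
    losses = []
    for ratios, advantages in batch.values():
        adv = normalize_within_task(advantages)
        losses.append(clipped_surrogate(ratios, adv, clip_eps))
    return sum(losses) / len(losses)


# Hypothetical minibatch: task 1 has advantages 100x larger than task 0.
batch = {
    0: ([1.0, 1.0, 1.0, 1.0], [0.1, -0.1, 0.2, -0.2]),
    1: ([1.0, 1.0, 1.0, 1.0], [10.0, -10.0, 20.0, -20.0]),
}
adv0 = normalize_within_task(batch[0][1])
adv1 = normalize_within_task(batch[1][1])  # matches adv0 after normalization
```

Normalizing per slice removes the 100x scale gap on the actor side; nothing in this sketch does the same for the critic's regression targets.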
Multi-task PPO is gradient aggregation
With shared parameters, the optimizer never applies \(K\) separate updates. It applies one aggregate actor direction and one aggregate critic direction.
- Large norms make a task loud.
- Negative cosines create direction conflict.
- Near-collinear gradients reduce task-specific directions.
- The aggregate direction decides which tasks get update budget.
The task-gradient Gram matrix
- \(G_{ii}\) records per-task gradient scale.
- \(G_{ij}/\sqrt{G_{ii}G_{jj}}\) records alignment or conflict.
- Once computed, the Gram matrix is enough to score any weighted aggregate \(d=\sum_i w_i g_i\) without storing full gradients: \(\|d\|^2=w^\top G w\) and \(\langle g_i,d\rangle=(Gw)_i\).
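A small sketch of these Gram-matrix diagnostics in plain Python. The gradients are hypothetical three-task vectors chosen so that task 0 is loud and task 2 conflicts with task 1; the helper names are illustrative.

```python
import math


def gram_matrix(grads):
    """G[i][j] = <g_i, g_j> for a list of flattened per-task gradients."""
    return [[sum(a * b for a, b in zip(gi, gj)) for gj in grads] for gi in grads]


def cosine_matrix(G):
    """Pairwise cosines G_ij / sqrt(G_ii * G_jj)."""
    K = len(G)
    return [[G[i][j] / math.sqrt(G[i][i] * G[j][j]) for j in range(K)]
            for i in range(K)]


def aggregate_norm_sq(G, w):
    """||sum_i w_i g_i||^2 = w^T G w, computed from the Gram matrix alone."""
    K = len(G)
    return sum(w[i] * G[i][j] * w[j] for i in range(K) for j in range(K))


def contributions(G, w):
    """<g_i, d> = (G w)_i for the aggregate d = sum_j w_j g_j."""
    return [sum(gij * wj for gij, wj in zip(row, w)) for row in G]


# Hypothetical gradients: task 0 is "loud", task 2 conflicts with task 1.
grads = [[10.0, 0.0], [0.0, 1.0], [1.0, -1.0]]
G = gram_matrix(grads)
C = cosine_matrix(G)
w_mean = [1.0 / 3] * 3                 # plain mean aggregation
x_mean = contributions(G, w_mean)      # inner product of each task with d
```

Under mean aggregation in this toy example, task 1's inner product with the aggregate is zero, so the shared step makes no first-order progress on it even though its own gradient is nonzero.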
Geometry summary
Diagonal: \(\log_{10}G_{ii}\), task loudness
Off-diagonal: cosine, from conflict to alignment
The critic is where PPO breaks
Diagnostics: critic gradient norm spread; actor spread after task-wise advantage normalization.
The intervention target is not PPO as a whole. It is the shared critic update.
The failure is specific, not generic conflict
Easy tasks are loud
Mean critic aggregation inherits high-gradient task scale.
Directions compress
Critic features lose task-specific directions.
Budget goes to dominant tasks
A better aggregator must use gradient geometry, not only loss values.
PopArt normalizes value targets before gradients form
PopArt standardizes each task's return targets, while reparameterizing the affine head so raw predictions do not jump.
Paper Fig. 3, panels (a) vanilla and (b) +PopArt: early critic Gram diagnostics.
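A minimal sketch of the PopArt mechanics for a single task's value head, assuming exponential-moving-average statistics and a linear last layer. The class name, the step size `beta`, and the initial statistics are illustrative choices; a multi-task critic would keep one such set of statistics per task.

```python
import math


class PopArtHead:
    """Scalar value head with PopArt statistics for one task (illustrative)."""

    def __init__(self, weights, bias, beta=0.1):
        self.w = list(weights)   # last-layer weights on critic features
        self.b = bias
        self.mu = 0.0            # running mean of return targets
        self.nu = 1.0            # running second moment of return targets
        self.beta = beta         # EMA step size (assumed value)

    @property
    def sigma(self):
        return math.sqrt(max(self.nu - self.mu ** 2, 1e-8))

    def normalized(self, features):
        """Prediction in normalized target space."""
        return sum(w * h for w, h in zip(self.w, features)) + self.b

    def raw(self, features):
        """Prediction in the original return scale."""
        return self.sigma * self.normalized(features) + self.mu

    def update_stats(self, targets):
        old_mu, old_sigma = self.mu, self.sigma
        batch_mean = sum(targets) / len(targets)
        batch_sq = sum(t * t for t in targets) / len(targets)
        self.mu = (1 - self.beta) * self.mu + self.beta * batch_mean
        self.nu = (1 - self.beta) * self.nu + self.beta * batch_sq
        new_sigma = self.sigma
        # Preserve outputs: rescale the affine head so raw() does not jump.
        self.w = [w * old_sigma / new_sigma for w in self.w]
        self.b = (old_sigma * self.b + old_mu - self.mu) / new_sigma

    def normalize_targets(self, targets):
        return [(t - self.mu) / self.sigma for t in targets]


head = PopArtHead(weights=[0.5, -0.25], bias=0.1)
features = [2.0, 4.0]
before = head.raw(features)
head.update_stats([100.0, 200.0, 300.0])   # large-return task
after = head.raw(features)                 # unchanged despite new statistics
```

The critic loss is then computed between `normalized(features)` and `normalize_targets(...)`, so tasks with very different return scales produce regression errors of comparable size.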
Layer Normalization (LN) conditions the critic features
LN-c applies pre-activation LayerNorm in the critic's hidden linear layers. It is a feature-conditioning fix, not another reward-scale normalization.
- Stabilizes hidden activation scale.
- Reduces collinear collapse in critic gradients.
- In early MT10 diagnostics, mean off-diagonal \(|\cos|\) drops from 0.34 to 0.20.
Paper Fig. 3, panels (c) +LN and (d) +PopArt+LN: cleaner diagonal and less off-diagonal collapse.
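A plain-Python sketch of one critic hidden layer with LayerNorm applied to the linear output before the nonlinearity, which is the reading of "pre-activation" assumed here. The learnable gain and bias of LayerNorm are omitted, ReLU is an assumed activation, and the weights are hypothetical.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize a feature vector to zero mean and unit variance (no affine)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def linear(x, W, b):
    """Dense layer: one output per row of W."""
    return [sum(w * v for w, v in zip(row, x)) + bi for row, bi in zip(W, b)]


def critic_hidden_layer(x, W, b):
    """Linear -> LayerNorm -> ReLU: normalization precedes the activation."""
    z = layer_norm(linear(x, W, b))
    return [max(0.0, v) for v in z]


# Hypothetical weights; the second input is the first scaled by 100.
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b = [0.0, 0.0, 0.0]
z_small = layer_norm(linear([1.0, 2.0], W, b))
z_large = layer_norm(linear([100.0, 200.0], W, b))
hidden = critic_hidden_layer([1.0, 2.0], W, b)
```

With zero bias, the normalized pre-activations are nearly identical for the two inputs, which illustrates how the layer fixes hidden activation scale regardless of input magnitude.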
FairGrad with \(\alpha=1\): equalize contribution, not loss
- Mean aggregation gives more update budget to large-norm critic tasks.
- FairGrad increases weights for low-contribution tasks.
- The Gram matrix \(G^c\) is the right summary because both the per-task contributions \(x_i\) and \(\|d^c\|\) are computed from pairwise inner products.
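A hedged sketch of how such weights could be computed from the Gram matrix alone. It assumes the \(\alpha=1\) condition takes the form \((Gw)_i = 1/w_i\) with \(w>0\), and it uses exact coordinate updates on the convex function \(\tfrac12 w^\top G w-\sum_i\log w_i\) as a stand-in solver; this is not necessarily the solver used by FairGrad or by TOPPO. The example Gram matrix is hypothetical.

```python
import math


def fairgrad_alpha1_weights(G, sweeps=500):
    """Solve (G w)_i = 1 / w_i with w > 0 by exact coordinate updates.

    The condition is the stationarity condition of the convex function
    0.5 * w^T G w - sum_i log(w_i); each coordinate step solves a quadratic.
    """
    K = len(G)
    w = [1.0 / math.sqrt(G[i][i]) for i in range(K)]
    for _ in range(sweeps):
        for i in range(K):
            a = G[i][i]
            b = sum(G[i][j] * w[j] for j in range(K) if j != i)
            s = math.sqrt(b * b + 4.0 * a)
            # Positive root of a*w_i^2 + b*w_i - 1 = 0, in a stable form.
            w[i] = 2.0 / (b + s) if b >= 0.0 else (s - b) / (2.0 * a)
    return w


def matvec(G, w):
    return [sum(gij * wj for gij, wj in zip(row, w)) for row in G]


# Hypothetical positive-definite critic Gram matrix: task 0 is loud.
G = [[100.0, 0.0, 10.0],
     [0.0, 1.0, -1.0],
     [10.0, -1.0, 3.0]]
w = fairgrad_alpha1_weights(G)
Gw = matvec(G, w)
norm_sq = sum(wi * gi for wi, gi in zip(w, Gw))   # w^T G w

# Rescale task gradients by c_i: the Gram matrix becomes c_i * c_j * G_ij.
c = [0.1, 1.0, 5.0]
G_scaled = [[c[i] * c[j] * G[i][j] for j in range(3)] for i in range(3)]
w_scaled = fairgrad_alpha1_weights(G_scaled)
```

At the solution every task has the same weighted contribution, \(w_i\,(Gw)_i = 1\), so the loud task receives a small weight and the quiet tasks receive larger ones.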
FairGrad \(\alpha=1\): fixed norm and scale invariance
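Let \(G=[\langle g_i,g_j\rangle]\succ0\). If \(w>0\) solves
\[ Gw = w^{-1} \quad\text{(elementwise: } (Gw)_i = 1/w_i\text{)}, \]
then the aggregate \(d=\sum_i w_i g_i\) has fixed norm
\[ \|d\|^2 = w^\top G w = K. \]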
For any task rescaling \(\tilde g_i=c_i g_i\), \(c_i>0\), the weights transform as \(\tilde w_i=w_i/c_i\), so \(\sum_i \tilde w_i\tilde g_i=d\).
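Both claims follow in a few lines. The derivation below assumes the \(\alpha=1\) condition reads \((Gw)_i = 1/w_i\), which is the form consistent with the stated scaling rule; \(K\) is the number of tasks and \(C=\mathrm{diag}(c_1,\ldots,c_K)\) is notation introduced here.

```latex
\begin{aligned}
\|d\|^2 &= \Big\|\sum_i w_i g_i\Big\|^2 = w^\top G w
         = \sum_i w_i\,(Gw)_i = \sum_i w_i \cdot \frac{1}{w_i} = K, \\[4pt]
\tilde G &= C\,G\,C, \quad \tilde w = C^{-1} w
  \;\;\Longrightarrow\;\;
  (\tilde G \tilde w)_i = c_i\,(Gw)_i = \frac{c_i}{w_i} = \frac{1}{\tilde w_i}, \\[4pt]
\sum_i \tilde w_i\,\tilde g_i &= \sum_i \frac{w_i}{c_i}\,c_i\,g_i = \sum_i w_i g_i = d .
\end{aligned}
```

So \(\|d\|=\sqrt{K}\) regardless of how large any individual task gradient is, and rescaling a task's loss changes its weight but not the update direction.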
Paper Fig. 3 panel (a): the raw critic Gram matrix shows the scale problem that FairGrad is designed to neutralize.
TOPPO: a PPO collect-and-update phase
Roll out old policy
Collect \(\mathcal B=\cup_i\mathcal B_i\) with \(\pi_{\theta^a_{\mathrm{old}}}\). Compute \(\hat A_t^{(i)}\) and \(\hat V_t^{(i),\mathrm{targ}}\).
PopArt
Update \((\mu_i,\sigma_i)\), renormalize the affine head, and use \((\hat V_t^{(i),\mathrm{targ}}-\mu_i)/\sigma_i\).
PPO epochs
For \(r=1,\ldots,R\), form stratified minibatches and normalize advantages within each task slice.
Separate actor and critic
For every task \(i\), compute \(g_i^a\) from the actor surrogate and \(g_i^{c,\mathrm{PopArt}}\) from the normalized critic loss.
FG-c
Aggregate the critic with \(d^c=\sum_i w_i g_i^{c,\mathrm{PopArt}}\), where FairGrad chooses weights from the critic Gram matrix.
PCGrad-a and Adam
Set \(d^a=\mathrm{PCGrad}(\{g_i^a\})\), clip \((d^a,d^c)\) by \(\tau\), and take one Adam step on \(\theta\).
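The actor-side step can be sketched with PCGrad as described by Yu et al. (2020): each task gradient is projected away from the gradients it conflicts with, and the projected gradients are summed. The norm clipping shown is one plausible reading of "clip by \(\tau\)"; whether TOPPO clips \(d^a\) and \(d^c\) jointly or separately is not specified here. Function names and the example gradients are illustrative.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def pcgrad(grads, rng=None):
    """PCGrad: remove conflicting components from each gradient, then sum."""
    rng = rng or random.Random(0)
    projected = []
    for i, g in enumerate(grads):
        g = list(g)
        others = [j for j in range(len(grads)) if j != i]
        rng.shuffle(others)              # random task order, as in PCGrad
        for j in others:
            inner = dot(g, grads[j])
            if inner < 0.0:              # conflict: negative inner product
                scale = inner / dot(grads[j], grads[j])
                g = [a - scale * b for a, b in zip(g, grads[j])]
        projected.append(g)
    return [sum(col) for col in zip(*projected)]


def clip_by_norm(d, tau):
    """Rescale d so that its Euclidean norm is at most tau."""
    norm = math.sqrt(dot(d, d))
    if norm <= tau:
        return list(d)
    return [x * tau / norm for x in d]


# Two hypothetical actor gradients with a negative inner product.
g_actor = [[1.0, 0.0], [-1.0, 1.0]]
d_a = pcgrad(g_actor)                    # [0.5, 1.5]
d_a_clipped = clip_by_norm(d_a, tau=1.0)
```

After projection, neither task's contribution opposes the other task's original gradient, and the clipped direction would then be handed to the optimizer together with the critic direction.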
TOPPO recovers mean and tail performance
TOPPO on MT50 (values from the table below): mean success 90.9%, worst-10 tail success 56.5%, 717K parameters.
MT50 mean and tail success
| Algorithm | Params | Worst-5 | Worst-10 | Worst-20 | All / Mean |
|---|---|---|---|---|---|
| SAC-based and SAC-family baselines | | | | | |
| SAC-MT | 2597K | 0.0 | 0.0 | 0.0 | 47.6 |
| MT-MH-SAC | 2754K | 0.0 | 0.0 | 0.0 | 47.2 |
| Soft Modularization | 3534K | 0.0 | 0.0 | 1.8 | 54.1 |
| PCGrad-SAC | 2597K | 0.0 | 0.0 | 0.0 | 51.9 |
| PaCo | -- | 0.0 | 0.0 | 4.6 | 55.6 |
| MOORE | 7403K | 0.0 | 0.0 | 15.9 | 64.2 |
| ARS | 2597K | -- | 9.1 | 26.3 | 65.9 |
| ARS-LN (400) | 2611K | -- | 21.0 | 49.1 | 78.3 |
| ARS-LoRA | 16123K | -- | 29.3 | 58.4 | 83.2 |
| PPO-based methods | | | | | |
| Vanilla MT-PPO | 716K | 0.0 | 6.6 | 46.3 | 78.5 |
| +LN-c +PopArt +FG-c | 717K | 22.3 | 52.4 | 75.1 | 90.1 |
| TOPPO | 717K | 24.2 | 56.5 | 77.2 | 90.9 |
Values are final-checkpoint success rates (%). The full paper table also reports MT10 and standard deviations where available.
Ablation study
Ablation ladder (worst-10 success): vanilla, after LN-c, TOPPO.
The ablation ladder isolates each critic-side change. LN-c gives the largest jump in tail success, PopArt stabilizes target scale, and FG-c rebalances the gradient budget. The combined path is consistent with the critic-side diagnosis and leads into the final leaderboard comparison.
TOPPO Optimizes Proximal Policy Optimization for multi-task reinforcement learning with critic balancing.
Contact
rui.miao@utdallas.edu
rui-miao.github.io