01
Multi-task reinforcement learning

TOPPO: Rethinking PPO for Multi-Task Reinforcement Learning with Critic Balancing

Rui Miao
Department of Mathematical Sciences
Texas AI Research Institute
University of Texas at Dallas

Meta-World manipulation task animation

Meta-World (MT50, https://meta-world.github.io/)

02
Markov decision processes

The minimum RL vocabulary

An MDP is a sequential data-generating process. The policy changes the data distribution; the critic estimates future reward.

\[ \mathcal M=(\mathcal S,\mathcal A,P,r,\mu,\gamma),\quad s'\sim P(\cdot\mid s,a),\quad r=r(s,a) \]
\[ Q^\pi(s,a)=\mathbb E\!\left[\sum_{t\ge0}\gamma^t r_t\mid s_0=s,a_0=a\right],\quad V^\pi(s)=\mathbb E_{a\sim\pi}Q^\pi(s,a) \]
\[ A^\pi(s,a)=Q^\pi(s,a)-V^\pi(s) \]
Diagram: the policy \(\pi(a\mid s)\) maps the state \(s_t\) to an action \(a_t\); the environment uses \(P(s'\mid s,a)\) and \(r(s,a)\) to return the next state \(s_{t+1}\) and the reward \(r_t\).
03
Multi-task RL

From one MDP to \(K\) related MDPs

Multi-task RL trains one task-aware actor and critic across many related tasks. The task identity is part of the input, but parameters are shared.

\[ \mathcal M_i=(\mathcal S,\mathcal A,P_i,r_i,\mu_i,\gamma),\qquad \max_\theta\; \frac1K\sum_{i=1}^K J_i(\pi_\theta) \]
  • Common structure: shared state/action representation.
  • Heterogeneity: task-specific dynamics, rewards, and learning speed.
  • Question: whose gradient controls the shared estimator?
Diagram: tasks \(\mathcal M_1,\ldots,\mathcal M_5\) feed one actor \(\pi_{\theta^a}\) and one critic \(V_{\theta^c}\) through a shared update; every minibatch merges task gradients.
04
Meta-World MT50

Why the worst-\(k\) tasks matter

A high mean can hide tasks that never become reliable. In a pipeline of skills, the weakest stages set the usable system reliability.

\[ \begin{aligned} \Pr(\hbox{pipeline succeeds})&\approx \prod_{j=1}^{m} p_j\\ &\Longrightarrow\ \hbox{small }p_j\hbox{ dominates} \end{aligned} \]
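A small worked example of the product rule above, assuming independent stages. The stage success rates are made-up illustrative numbers, not Meta-World results.

```python
import math


def pipeline_success(stage_probs):
    """Approximate pipeline success as the product of per-stage success rates.

    Assumes the stages succeed or fail independently.
    """
    return math.prod(stage_probs)


# Hypothetical stage success rates, for illustration only.
strong = [0.95] * 5
with_weak_stage = [0.95] * 4 + [0.20]

print(f"all stages at 0.95:  {pipeline_success(strong):.3f}")
print(f"one stage at 0.20:   {pipeline_success(with_weak_stage):.3f}")
print(f"mean stage rate with the weak stage: {sum(with_weak_stage) / 5:.2f}")
```

The mean stage rate still looks healthy when one stage is weak, while the product collapses. That gap is the reason to report worst-\(k\) success alongside the mean.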
Figure: official Meta-World broad-suite animation, a grid of task thumbnails labelled by task name (including Reach, Push, Pick, Door, and Drawer).

Official Meta-World broad-suite GIF. MT50 trains one task-aware policy on all 50 tasks; the official site does not provide a separate MT50 GIF.

05
Original literature context

Soft Actor-Critic became the default MTRL backbone

Soft Actor-Critic (Haarnoja et al., 2018)

  • Maximum-entropy actor-critic objective.
  • Off-policy replay reuses past transitions.
  • Target critics and replay stabilize TD targets.
  • Strong sample efficiency made SAC natural for robot manipulation benchmarks.

The follow-up MT50 line was SAC-style

  • Meta-World benchmark: Yu et al. (2020).
  • Soft Modularization: Yang et al. (2020).
  • CARE: Sodhani et al. (2021).
  • PaCo (2022), MOORE (2023), and ARS-family baselines (ARS / ARS-LN / ARS-LoRA, 2025) continue the SAC-centered comparison set.

Proximal Policy Optimization (PPO) is a strong on-policy workhorse, but it was not the historical center of MT50 method design. We use it to expose what breaks in shared multi-task updates.

06
Soft Actor-Critic formulation

SAC optimizes reward plus entropy

\[ J(\pi)=\mathbb E_\pi\!\left[\sum_{t\ge0}\gamma^t \{r(s_t,a_t)+\alpha\,\mathcal H(\pi(\cdot\mid s_t))\}\right] \]
\[ V^\pi(s)=\mathbb E_{a\sim\pi}\!\left[ Q^\pi(s,a)-\alpha\log \pi(a\mid s)\right] \]
\[ Q^\pi(s,a)=r(s,a)+\gamma\,\mathbb E_{s'\sim P}V^\pi(s') \]

What this buys in MTRL

  • Entropy keeps exploration alive across tasks.
  • Replay mixes old and new data across the task set.
  • Soft values make target changes less abrupt.
Diagram: \(Q\) is the action-value critic; \(V\) is the soft state value.

07
Critic and actor losses

SAC trains from replayed Bellman targets

\[ y=r+\gamma\;\mathbb E_{a'\sim\pi}\!\left[ Q_{\bar\theta}(s',a')-\alpha\log\pi(a'\mid s')\right] \]
\[ L_Q(\theta)=\mathbb E_{(s,a,r,s')\sim\mathcal D} \left[(Q_\theta(s,a)-y)^2\right] \]
\[ L_\pi(\phi)=\mathbb E_{s\sim\mathcal D,a\sim\pi_\phi} \left[\alpha\log\pi_\phi(a\mid s)-Q_\theta(s,a)\right] \]
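A minimal sketch of the soft Bellman target \(y\) and the critic loss for a single transition. It assumes a discrete action set so the expectation over \(a'\) becomes an exact sum; the slides use the general form, and SAC on Meta-World samples continuous actions instead. The function names are illustrative.

```python
import math


def soft_td_target(reward, gamma, alpha, next_probs, next_q_target):
    """y = r + gamma * E_{a'~pi}[Q_target(s', a') - alpha * log pi(a'|s')].

    next_probs[k] is pi(a_k | s'); next_q_target[k] is the target critic's
    value for action a_k in s'. Discrete actions are an assumption here.
    """
    soft_value = sum(
        p * (q - alpha * math.log(p))
        for p, q in zip(next_probs, next_q_target)
        if p > 0.0  # p * log(p) tends to 0 as p tends to 0
    )
    return reward + gamma * soft_value


def critic_loss(q_pred, y):
    """Squared TD error for one replayed transition."""
    return (q_pred - y) ** 2


y = soft_td_target(reward=1.0, gamma=0.9, alpha=0.2,
                   next_probs=[0.5, 0.5], next_q_target=[1.0, 3.0])
print(y, critic_loss(q_pred=2.0, y=y))
```

With \(\alpha=0\) the target reduces to the ordinary expected-value TD target; a positive \(\alpha\) adds the policy entropy at \(s'\) to the bootstrap value.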
Diagram: the policy interacts with the environment and fills a replay buffer, which feeds the \(Q\) update and the actor \(\pi\) update. Off-policy reuse; target networks smooth TD targets.
08
What grew around SAC

Multi-task SAC variants: a compact map

| Family | Typical change | Examples and references |
| --- | --- | --- |
| SAC backbone | Maximum-entropy off-policy actor-critic | Soft Actor-Critic: Haarnoja et al. (2018) |
| Shared / multi-head | Shared trunk with task ID or task-specific heads | SAC-MT, MT-MH-SAC; Meta-World / Meta-World+ protocols |
| Gradient surgery | Modify cross-task gradient aggregation | PCGrad: Yu et al. (2020); CAGrad: Liu et al. (2021); FairGrad: Ban et al. (2024) |
| Modules / experts | Specialize representations or compose sub-networks | Soft Modularization: Yang et al. (2020); CARE: Sodhani et al. (2021); PaCo: Sun et al. (2022); MOORE: Hendawy et al. (2023) |
| ARS-family | Modern strong actor-regularized SAC-style baselines | ARS / ARS-LN / ARS-LoRA: Cho et al. (2025) |

We now diagnose PPO directly and make a paradigm shift for multi-task RL: fix the critic-side update before adding more architecture.

09
Proximal Policy Optimization

PPO: collect, estimate, clip, fit value

\[ \textbf{PPO objective:}\quad L(\theta)=L^a(\theta^a)+c_vL^c(\theta^c)-c_eH(\theta^a) \]
\[ \begin{aligned} &\textbf{Actor surrogate:}\\ &L^a(\theta^a)=-\mathbb E_t\!\left[ \min\{\rho_t\hat A_t,\right.\\ &\left.\operatorname{clip}(\rho_t,1-\epsilon,1+\epsilon)\hat A_t\}\right] \end{aligned} \]
\[ \begin{aligned} \textbf{Critic loss:}\quad L^c(\theta^c) &=\mathbb E_t\!\left[ (V_{\theta^c}(s_t)-\hat V_t^{\mathrm{targ}})^2 \right] \end{aligned} \]
\[ \begin{aligned} \textbf{Entropy bonus:}\quad H(\theta^a) &=\mathbb E_t\!\left[ \mathcal H(\pi_{\theta^a}(\cdot\mid s_t)) \right] \end{aligned} \]
\[ \begin{aligned} \textbf{Ratio:}\quad \rho_t(\theta^a) &= \frac{\pi_{\theta^a}(a_t\mid s_t)} {\pi_{\theta^a_{\mathrm{old}}}(a_t\mid s_t)} \end{aligned} \]
\[ \begin{aligned} \textbf{GAE:}\quad \delta_t&=r_t+\gamma V_{\mathrm{old}}(s_{t+1})-V_{\mathrm{old}}(s_t),\\ \hat A_t&=\sum_{\ell\ge0}(\gamma\lambda)^\ell\delta_{t+\ell} \end{aligned} \]
\[ \begin{aligned} \textbf{Value target:}\quad \hat V_t^{\mathrm{targ}} &=\hat A_t+V_{\mathrm{old}}(s_t) \end{aligned} \]
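The GAE recursion, value target, and clipped surrogate above can be sketched in a few lines. This is an illustrative single-trajectory version that assumes no episode termination inside the segment (no done flags) and bootstraps from `last_value`; the default values of `gamma`, `lam`, and `eps` are common choices, not settings taken from the paper.

```python
def gae(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Return (advantages, value_targets) for one trajectory segment.

    values[t] is V_old(s_t); last_value is V_old(s_T), used to bootstrap
    the final step. Episode boundaries are not handled in this sketch.
    """
    T = len(rewards)
    advantages = [0.0] * T
    running = 0.0
    for t in reversed(range(T)):
        next_value = last_value if t == T - 1 else values[t + 1]
        delta = rewards[t] + gamma * next_value - values[t]
        # A_t = delta_t + (gamma * lam) * A_{t+1}
        running = delta + gamma * lam * running
        advantages[t] = running
    # Value target: A_t + V_old(s_t)
    targets = [a + v for a, v in zip(advantages, values)]
    return advantages, targets


def clipped_actor_loss(ratios, advantages, eps=0.2):
    """Negative mean of min(rho * A, clip(rho, 1 - eps, 1 + eps) * A)."""
    terms = []
    for rho, adv in zip(ratios, advantages):
        clipped = min(max(rho, 1.0 - eps), 1.0 + eps)
        terms.append(min(rho * adv, clipped * adv))
    return -sum(terms) / len(terms)


adv, targ = gae(rewards=[1.0, 1.0], values=[0.0, 0.0], last_value=0.0,
                gamma=1.0, lam=1.0)
print(adv, targ)
print(clipped_actor_loss([1.5, 0.5], adv))
```

The backward loop computes the infinite sum \(\sum_\ell(\gamma\lambda)^\ell\delta_{t+\ell}\) truncated at the end of the segment, which is how it is usually evaluated on finite rollouts.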
10
What changes with many tasks

Multi-task PPO averages per-task losses

\[ L(\theta)=\frac1K\sum_{i=1}^{K}L_i(\theta),\quad L_i=L_i^a(\theta^a)+c_vL_i^c(\theta^c)-c_eH_i(\theta^a) \]
\[ \rho_t^{(i)}= \frac{\pi_{\theta^a}(a_t\mid s_t,i)} {\pi_{\theta^a_{\mathrm{old}}}(a_t\mid s_t,i)},\quad \hat A_t^{(i)}=\sum_{\ell\ge0}(\gamma\lambda)^\ell\delta_{t+\ell}^{(i)} \]
\[ g_i^a=\nabla_{\theta^a}L_i^a,\qquad g_i^c=\nabla_{\theta^c}L_i^c \]

Important implementation detail

  • Rollouts are stratified by task.
  • Advantages are normalized within each task slice before the actor surrogate.
  • The actor and critic share information only through the common parameters.
  • Every minibatch still needs one actor direction and one critic direction.
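As an illustration of the second bullet, here is a minimal sketch of advantage normalization within each task slice of a minibatch. The function name, the flat-list layout, and the `eps` constant are assumptions for this example, not the paper's code.

```python
import math


def normalize_per_task(advantages, task_ids, eps=1e-8):
    """Standardize advantages separately within each task's slice."""
    by_task = {}
    for idx, task in enumerate(task_ids):
        by_task.setdefault(task, []).append(idx)

    out = [0.0] * len(advantages)
    for idxs in by_task.values():
        vals = [advantages[i] for i in idxs]
        mean = sum(vals) / len(vals)
        std = math.sqrt(sum((v - mean) ** 2 for v in vals) / len(vals))
        for i in idxs:
            out[i] = (advantages[i] - mean) / (std + eps)
    return out


# Task 0 has advantages about 100 times larger than task 1.
raw = [100.0, 200.0, 1.0, 2.0]
print(normalize_per_task(raw, task_ids=[0, 0, 1, 1]))
```

After normalization both tasks contribute advantages on the same scale, which is one reason the actor gradients are much less spread out than the critic gradients, whose targets are not standardized by this step.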
11
The statistical object

Multi-task PPO is gradient aggregation

Diagram labels: loud task, tail task, mean direction, balanced direction.

With shared parameters, the optimizer never applies \(K\) separate updates. It applies one aggregate actor direction and one aggregate critic direction.

  • Large norms make a task loud.
  • Negative cosines create direction conflict.
  • Nearly co-linear gradients reduce task-specific directions.
  • The aggregate direction decides which tasks get update budget.
12
Diagnostic bridge

The task-gradient Gram matrix

\[ g_i^a:=\nabla_{\theta^a}L_i^a,\qquad g_i^c:=\nabla_{\theta^c}L_i^c \]
\[ G_{ij}^{\bullet}=\langle g_i^\bullet,g_j^\bullet\rangle,\quad \bullet\in\{a,c\},\qquad d^\bullet=\sum_i w_i g_i^\bullet \]
  • \(G_{ii}\) records per-task gradient scale.
  • \(G_{ij}/\sqrt{G_{ii}G_{jj}}\) records alignment or conflict.
  • The Gram matrix is enough to score any weighted aggregate direction without storing full gradients.
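A small sketch of the last bullet: once \(G\) is known, the norm of any weighted aggregate and its alignment with each task follow from \(G\) and \(w\) alone. The helper names are illustrative, and the two toy gradients are made up.

```python
import math


def gram(grads):
    """G[i][j] = <g_i, g_j> for a list of flat gradient vectors."""
    return [[sum(a * b for a, b in zip(gi, gj)) for gj in grads] for gi in grads]


def cosine(G, i, j):
    """Pairwise cosine G_ij / sqrt(G_ii * G_jj)."""
    return G[i][j] / math.sqrt(G[i][i] * G[j][j])


def aggregate_norm(G, w):
    """||sum_i w_i g_i|| = sqrt(w^T G w), using only the Gram matrix."""
    K = len(w)
    return math.sqrt(sum(w[i] * G[i][j] * w[j] for i in range(K) for j in range(K)))


def cosine_with_aggregate(G, w, i):
    """cos(g_i, d) with d = sum_j w_j g_j, using <g_i, d> = (G w)_i."""
    inner = sum(G[i][j] * w[j] for j in range(len(w)))
    return inner / (math.sqrt(G[i][i]) * aggregate_norm(G, w))


# Toy case: one loud task and one quiet task with orthogonal gradients.
G = gram([[10.0, 0.0], [0.0, 1.0]])
mean_w = [0.5, 0.5]
print(cosine(G, 0, 1))
print(cosine_with_aggregate(G, mean_w, 0), cosine_with_aggregate(G, mean_w, 1))
```

After `gram` has been called, the full gradient vectors are not needed again, so a \(K\times K\) matrix is all that has to be kept around to compare candidate weightings.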

Geometry summary

Diagonal: \(\log_{10}G_{ii}\), task loudness

Off diagonal: cosine, conflict to alignment

13
Actor and critic are asymmetric

The critic is where PPO breaks

Figure: actor versus critic gradient norm spread.

  • 497x: critic gradient norm spread.
  • 4.1x: actor spread after task-wise advantage normalization.

The intervention target is not PPO as a whole. It is the shared critic update.

14
Three critic pathologies

The failure is specific, not generic conflict

D1. Scale spread

Easy tasks are loud

large \(G^c_{ii}\)

Mean critic aggregation inherits high-gradient task scale.

D2. Co-linear collapse

Directions compress

high off-diagonal cosine

Critic features lose task-specific directions.

D3. Unfair aggregation

Budget goes to dominant tasks

Diagram labels: mean, desired, task 1, task 2.

A better aggregator must use gradient geometry, not only loss values.

15
Surgery 1: target scale

PopArt normalizes value targets before gradients form

PopArt standardizes each task's return targets, while reparameterizing the affine head so raw predictions do not jump.

\[ \begin{aligned} V_{\theta^c}(s,i)&=\sigma_i z_{\theta^c}(s,i)+\mu_i,\\ y_i^{\mathrm{norm}}&=\frac{\hat V_i^{\mathrm{targ}}-\mu_i}{\sigma_i} \end{aligned} \]
\[ g_i^{c,\mathrm{PopArt}}=\sigma_i^{-2}g_i^{c,\mathrm{raw}} \]
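A minimal sketch of the output-preserving step, assuming a scalar feature \(h\) and a scalar affine head \(z=wh+b\) for one task. How \((\mu_i,\sigma_i)\) are estimated (for example by a running average) is left out, and the class and method names are illustrative.

```python
class PopArtHead:
    """Scalar affine head z = w * h + b with statistics (mu, sigma).

    The unnormalized prediction is V = sigma * z + mu.
    """

    def __init__(self, w=1.0, b=0.0, mu=0.0, sigma=1.0):
        self.w, self.b, self.mu, self.sigma = w, b, mu, sigma

    def normalized(self, h):
        return self.w * h + self.b

    def value(self, h):
        return self.sigma * self.normalized(h) + self.mu

    def update_stats(self, new_mu, new_sigma):
        # Choose (w', b') so that
        #   new_sigma * (w' h + b') + new_mu == sigma * (w h + b) + mu
        # for every h, i.e. raw predictions do not jump.
        self.w = self.w * self.sigma / new_sigma
        self.b = (self.sigma * self.b + self.mu - new_mu) / new_sigma
        self.mu, self.sigma = new_mu, new_sigma

    def normalize_target(self, target):
        """y_norm = (target - mu) / sigma."""
        return (target - self.mu) / self.sigma


head = PopArtHead(w=0.7, b=-0.2, mu=1.0, sigma=2.0)
before = head.value(3.0)
head.update_stats(new_mu=50.0, new_sigma=10.0)
print(before, head.value(3.0), head.normalize_target(60.0))
```

The critic is then regressed on `normalize_target(...)` in normalized space, so every task's loss is measured in units of its own \(\sigma_i\).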
Figure 3 panels a and b showing vanilla and PopArt Gram matrices

Paper Fig. 3, panels (a) vanilla and (b) +PopArt: early critic Gram diagnostics.

16
Surgery 2: representation conditioning

Layer Normalization (LN) conditions the critic features

LN-c applies pre-activation LayerNorm in the critic's hidden linear layers. It is a feature-conditioning fix, not another reward-scale normalization.

\[ \mathrm{LN}(h)=\gamma\odot \frac{h-\bar h}{\sqrt{\operatorname{Var}(h)+\varepsilon}}+\beta \]
  • Stabilizes hidden activation scale.
  • Reduces co-linear collapse in critic gradients.
  • In early MT10 diagnostics, mean off-diagonal \(|\cos|\) drops from 0.34 to 0.20.
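For reference, the LN formula above written out for one hidden vector. This is a plain-Python sketch; `gain` and `bias` stand for the learned \(\gamma\) and \(\beta\) of the formula, and the default `eps` is a common choice rather than a value from the paper.

```python
import math


def layer_norm(h, gain=None, bias=None, eps=1e-5):
    """LN(h) = gain * (h - mean(h)) / sqrt(var(h) + eps) + bias."""
    n = len(h)
    gain = gain if gain is not None else [1.0] * n
    bias = bias if bias is not None else [0.0] * n
    mean = sum(h) / n
    var = sum((x - mean) ** 2 for x in h) / n
    inv_std = 1.0 / math.sqrt(var + eps)
    return [g * (x - mean) * inv_std + b for x, g, b in zip(h, gain, bias)]


small = layer_norm([1.0, 2.0, 3.0, 4.0])
large = layer_norm([10.0, 20.0, 30.0, 40.0])
print(small)
print(large)
```

Rescaling the pre-activations by a constant leaves the output almost unchanged, which is the sense in which LN stabilizes hidden activation scale. In a pre-activation layout the nonlinearity is applied to this output.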
Figure 3 panels c and d showing LN effects

Paper Fig. 3, panels (c) +LN and (d) +PopArt+LN: cleaner diagonal and less off-diagonal collapse.

17
Surgery 3: fair critic aggregation

FairGrad with \(\alpha=1\): equalize contribution, not loss

\[ d^c=\sum_i w_i g_i^c,\qquad x_i=(G^c w)_i=(g_i^c)^\top d^c \]
  • Mean aggregation gives more update budget to large-norm critic tasks.
  • FairGrad increases weights for low-contribution tasks.
  • The Gram matrix \(G^c\) is the right summary because \(x_i\) and \(\|d^c\|\) are computed from pairwise inner products.
\[ \begin{aligned} \hbox{orthogonal toy: }&\|g_1\|=10,\ \|g_2\|=1\\ &\Rightarrow\ w_1=1/10,\ w_2=1\quad(\alpha=1) \end{aligned} \]
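Diagram (orthogonal toy example from Section 4.3): \(g_1\) with \(\|g_1\|=10\) and \(g_2\) with \(\|g_2\|=1\). The mean direction has cosine \(\approx 0.995\) with \(g_1\). FairGrad (FG) with \(\alpha=1\) sets \(w_1=1/10,\ w_2=1\), so the weighted norms \(w_i\|g_i\|\) are equal.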
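A hedged sketch of how the \(\alpha=1\) weights could be computed from the Gram matrix alone. It solves \(G^c w=w^{-1}\) (elementwise inverse) by cyclic coordinate updates, each of which has a closed form. This solver is a choice made for the illustration; it is not claimed to be the one used by FairGrad or TOPPO.

```python
import math


def fairgrad_weights_alpha1(G, sweeps=200):
    """Solve G w = 1 / w (elementwise) for w > 0 by cyclic coordinate updates.

    With the other weights fixed, coordinate i satisfies
        G_ii * w_i**2 + c_i * w_i - 1 = 0,  c_i = sum_{j != i} G_ij * w_j,
    and we take its positive root.
    """
    K = len(G)
    w = [1.0 / math.sqrt(G[i][i]) for i in range(K)]
    for _ in range(sweeps):
        for i in range(K):
            c = sum(G[i][j] * w[j] for j in range(K) if j != i)
            w[i] = (-c + math.sqrt(c * c + 4.0 * G[i][i])) / (2.0 * G[i][i])
    return w


# Orthogonal toy: ||g_1|| = 10, ||g_2|| = 1, so G = diag(100, 1).
w_toy = fairgrad_weights_alpha1([[100.0, 0.0], [0.0, 1.0]])
print(w_toy)

# A non-orthogonal Gram matrix (made-up numbers).
G = [[4.0, 1.0], [1.0, 2.0]]
w = fairgrad_weights_alpha1(G)
x = [sum(G[i][j] * w[j] for j in range(2)) for i in range(2)]
print(w, [wi * xi for wi, xi in zip(w, x)])
```

At the solution \(w_i x_i=1\) for every task, so each task's weighted contribution to the aggregate is the same. That is the sense of "equalize contribution, not loss".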
18
Theoretical result

FairGrad \(\alpha=1\): fixed norm and scale invariance

Theorem

Let \(G=[\langle g_i,g_j\rangle]\succ0\). If \(w>0\) solves the following system, where \(w^{-1}\) denotes the elementwise inverse,

\[ G w=w^{-1},\qquad d=\sum_{i=1}^{K}w_i g_i , \]

then the aggregate has fixed norm

\[ \|d\|^2=w^\top G w=K . \]

For any task rescaling \(\tilde g_i=c_i g_i\), \(c_i>0\), the weights transform as \(\tilde w_i=w_i/c_i\), so \(\sum_i \tilde w_i\tilde g_i=d\).
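Both claims follow in one line each from the defining equation. A short derivation, writing \(C=\operatorname{diag}(c_1,\ldots,c_K)\) for the rescaling (the symbol \(C\) is introduced here for convenience):

\[ \|d\|^2=w^\top G w=\sum_{i=1}^{K}w_i\,(Gw)_i=\sum_{i=1}^{K}w_i\cdot\frac{1}{w_i}=K \]
\[ \tilde G=CGC,\quad \tilde w=C^{-1}w\ \Longrightarrow\ \tilde G\tilde w=CG\,w=C\,w^{-1}=(C^{-1}w)^{-1}=\tilde w^{-1} \]
\[ \sum_{i=1}^{K}\tilde w_i\tilde g_i=\sum_{i=1}^{K}\frac{w_i}{c_i}\,c_i g_i=\sum_{i=1}^{K}w_i g_i=d \]

So multiplying any task's critic loss by a positive constant changes that task's weight but leaves the aggregate direction and its norm untouched.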

Figure 3 panel a showing critic Gram matrix

Paper Fig. 3 panel (a): the raw critic Gram matrix shows the scale problem that FairGrad is designed to neutralize.

19
Algorithm 1

TOPPO: a PPO collect-and-update phase

Require tasks \(\{\mathcal M_i\}_{i=1}^{K}\), policy and critic parameters \(\theta=(\theta^a,\theta^c)\), LN-c critic, PopArt statistics \((\mu_i,\sigma_i)\), repeat count \(R\), and max gradient norm \(\tau\).
01 collect

Roll out old policy

Collect \(\mathcal B=\cup_i\mathcal B_i\) with \(\pi_{\theta^a_{\mathrm{old}}}\). Compute \(\hat A_t^{(i)}\) and \(\hat V_t^{(i),\mathrm{targ}}\).

02 normalize targets

PopArt

Update \((\mu_i,\sigma_i)\), renormalize the affine head, and use \((\hat V_t^{(i),\mathrm{targ}}-\mu_i)/\sigma_i\).

03 minibatches

PPO epochs

For \(r=1,\ldots,R\), form stratified minibatches and normalize advantages within each task slice.

04 task gradients

Separate actor and critic

For every task \(i\), compute \(g_i^a\) from the actor surrogate and \(g_i^{c,\mathrm{PopArt}}\) from the normalized critic loss.

05 critic surgery

FG-c

Aggregate the critic with \(d^c=\sum_i w_i g_i^{c,\mathrm{PopArt}}\), where FairGrad chooses weights from the critic Gram matrix.

06 actor + step

PCGrad-a and Adam

Set \(d^a=\mathrm{PCGrad}(\{g_i^a\})\), clip \((d^a,d^c)\) by \(\tau\), and take one Adam step on \(\theta\).
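Step 06 can be sketched as follows. This follows the published PCGrad projection rule (project each task gradient off the gradients it conflicts with, in random order, then sum), on flat Python lists. Summing rather than averaging the projected gradients, the fixed `seed`, and clipping each direction by its own norm are assumptions of this example; the slide does not say whether \(d^a\) and \(d^c\) are clipped jointly or separately.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def pcgrad(grads, seed=0):
    """Project away pairwise conflicts, then sum the projected gradients."""
    rng = random.Random(seed)
    projected = []
    for i, g in enumerate(grads):
        g_pc = list(g)
        others = [j for j in range(len(grads)) if j != i]
        rng.shuffle(others)  # random task order
        for j in others:
            inner = dot(g_pc, grads[j])
            if inner < 0.0:  # conflict: remove the component along g_j
                scale = inner / dot(grads[j], grads[j])
                g_pc = [x - scale * y for x, y in zip(g_pc, grads[j])]
        projected.append(g_pc)
    dim = len(grads[0])
    return [sum(p[k] for p in projected) for k in range(dim)]


def clip_by_norm(d, max_norm):
    """Rescale d so that its Euclidean norm is at most max_norm."""
    norm = math.sqrt(dot(d, d))
    if norm <= max_norm or norm == 0.0:
        return list(d)
    return [x * max_norm / norm for x in d]


actor_grads = [[1.0, 0.0], [-1.0, 1.0]]  # made-up conflicting gradients
d_a = pcgrad(actor_grads)
print(d_a, clip_by_norm(d_a, max_norm=1.0))
```

In a real training loop the clipped \(d^a\) and \(d^c\) would be written into the parameter gradients before the optimizer step; that wiring is framework-specific and omitted here.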

20
Now show the MT50 result

TOPPO recovers mean and tail performance

Figure: TOPPO MT50 mean and worst-\(k\) result.

  • 90.9%: MT50 mean success.
  • 56.5%: worst-10 tail success.
  • 717K: parameters.

21
Headline success and worst-\(k\) tail

MT50 mean and tail success

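| Algorithm | Params | Worst-5 | Worst-10 | Worst-20 | All / Mean |
| --- | --- | --- | --- | --- | --- |
| SAC-based and SAC-family baselines | | | | | |
| SAC-MT | 2597K | 0.0 | 0.0 | 0.0 | 47.6 |
| MT-MH-SAC | 2754K | 0.0 | 0.0 | 0.0 | 47.2 |
| Soft Modularization | 3534K | 0.0 | 0.0 | 1.8 | 54.1 |
| PCGrad-SAC | 2597K | 0.0 | 0.0 | 0.0 | 51.9 |
| PaCo | -- | 0.0 | 0.0 | 4.6 | 55.6 |
| MOORE | 7403K | 0.0 | 0.0 | 15.9 | 64.2 |
| ARS | 2597K | -- | 9.1 | 26.3 | 65.9 |
| ARS-LN (400) | 2611K | -- | 21.0 | 49.1 | 78.3 |
| ARS-LoRA | 16123K | -- | 29.3 | 58.4 | 83.2 |
| PPO-based methods | | | | | |
| Vanilla MT-PPO | 716K | 0.0 | 6.6 | 46.3 | 78.5 |
| +LN-c +PopArt +FG-c | 717K | 22.3 | 52.4 | 75.1 | 90.1 |
| TOPPO | 717K | 24.2 | 56.5 | 77.2 | 90.9 |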

Values are final-checkpoint success rates (%). The full paper table also reports MT10 and standard deviations where available.
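A sketch of how a worst-\(k\) column can be computed from per-task success rates. It assumes worst-\(k\) means the average over the \(k\) lowest-scoring tasks, which the slide does not spell out, and the per-task numbers below are invented for illustration.

```python
def worst_k_mean(success_rates, k):
    """Mean success over the k lowest-scoring tasks (assumed definition)."""
    if not 1 <= k <= len(success_rates):
        raise ValueError("k must be between 1 and the number of tasks")
    lowest = sorted(success_rates)[:k]
    return sum(lowest) / k


# Hypothetical per-task success rates (%), not results from the paper.
rates = [100.0, 0.0, 50.0, 100.0, 90.0, 10.0]
print(worst_k_mean(rates, 2), worst_k_mean(rates, len(rates)))
```

With \(k\) equal to the number of tasks the metric reduces to the ordinary mean, so the "All / Mean" column is the largest-\(k\) end of the same family.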

22
Mechanism evidence after the table

Ablation study

Figure: MT50 ablation ladder.

  • 6.6: vanilla worst-10.
  • 42.3: after LN-c.
  • 56.5: TOPPO worst-10.

The ladder separates the critic-side surgeries. LN-c gives the largest tail jump, while PopArt and FG-c stabilize target scale and gradient budget; the combined path matches the diagnosis before the final leaderboard comparison.

23
Thank You

TOPPO Optimizes Proximal Policy Optimization for multi-task reinforcement learning with critic balancing.

Li, Y., Lin, G., Qu, A., & Miao, R. (2026). TOPPO: Rethinking PPO for Multi-Task Reinforcement Learning with Critic Balancing. arXiv. https://arxiv.org/abs/2605.11473

Contact

rui.miao@utdallas.edu

rui-miao.github.io