TOPPO / 20 min

Multi-task reinforcement learning

TOPPO: Rethinking PPO for Multi-Task Reinforcement Learning with Critic Balancing

Rui Miao
Department of Mathematical Sciences
Texas AI Research Institute
University of Texas at Dallas

Meta-World (MT50, https://meta-world.github.io/)

02

Background

Markov decision processes

The minimum RL vocabulary

An MDP is a sequential data-generating process. The policy changes the data distribution; the critic estimates future reward.

\[ \mathcal M=(\mathcal S,\mathcal A,P,r,\mu,\gamma),\quad s'\sim P(\cdot\mid s,a),\quad r=r(s,a) \]

\[ Q^\pi(s,a)=\mathbb E\!\left[\sum_{t\ge0}\gamma^t r_t\mid s_0=s,a_0=a\right],\quad V^\pi(s)=\mathbb E_{a\sim\pi}Q^\pi(s,a) \]

\[ A^\pi(s,a)=Q^\pi(s,a)-V^\pi(s) \]
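To make the three definitions concrete, the sketch below evaluates a fixed policy exactly on a small tabular MDP by solving the Bellman equations. The function name `evaluate_policy` and the array layout are illustrative assumptions, not part of the talk.

```python
import numpy as np

def evaluate_policy(P, r, pi, gamma):
    """Exact V, Q and A for a tabular MDP under a fixed policy.

    P[s, a, t] : probability of moving from state s to state t under action a
    r[s, a]    : expected reward
    pi[s, a]   : policy probabilities
    """
    n_states = P.shape[0]
    P_pi = np.einsum("sa,sat->st", pi, P)      # state-to-state transitions under pi
    r_pi = np.sum(pi * r, axis=1)              # expected one-step reward under pi
    V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
    Q = r + gamma * (P @ V)                    # Q(s,a) = r(s,a) + gamma * E[V(s')]
    A = Q - V[:, None]                         # A(s,a) = Q(s,a) - V(s)
    return V, Q, A

# Made-up two-state, two-action MDP used only for illustration.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])
r = np.array([[0.0, 1.0], [2.0, 0.0]])
pi = np.full((2, 2), 0.5)
V, Q, A = evaluate_policy(P, r, pi, gamma=0.9)
```

Averaging `Q` over the policy recovers `V`, so the policy-weighted advantage is zero in every state.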

03

MTRL

Multi-task RL

From one MDP to \(K\) related MDPs

Multi-task RL trains one task-aware actor and critic across many related tasks. The task identity is part of the input, but parameters are shared.

\[ \mathcal M_i=(\mathcal S,\mathcal A,P_i,r_i,\mu_i,\gamma),\qquad \max_\theta\; \frac1K\sum_{i=1}^K J_i(\pi_\theta) \]

Common structure: shared state/action representation.
Heterogeneity: task-specific dynamics, rewards, and learning speed.
Question: whose gradient controls the shared estimator?

04

Benchmark

Meta-World MT50

Why the worst-\(k\) tasks matter

A high mean can hide tasks that never become reliable. In a pipeline of skills, the weakest stages set the usable system reliability.

\[ \begin{aligned} \Pr(\hbox{pipeline succeeds})&\approx \prod_{j=1}^{m} p_j\\ &\Longrightarrow\ \hbox{small }p_j\hbox{ dominates} \end{aligned} \]
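A small numeric illustration of the product rule: two skill pipelines with the same mean success rate can differ sharply in end-to-end reliability. The numbers are made up, stage independence is assumed, and `worst_k_mean` uses an assumed definition of the worst-\(k\) metric (mean of the \(k\) lowest per-task success rates).

```python
import numpy as np

def pipeline_success(p):
    """Success probability of a chain of stages, assuming independent stages."""
    return float(np.prod(p))

def worst_k_mean(success, k):
    """Mean of the k lowest per-task success rates (assumed definition)."""
    return float(np.mean(np.sort(success)[:k]))

balanced = np.array([0.8, 0.8, 0.8, 0.8])
lopsided = np.array([1.0, 1.0, 1.0, 0.2])    # same mean, one weak stage

mean_balanced, mean_lopsided = balanced.mean(), lopsided.mean()
chain_balanced = pipeline_success(balanced)
chain_lopsided = pipeline_success(lopsided)
tail_lopsided = worst_k_mean(lopsided, k=1)
```

Both pipelines have mean 0.8, but the lopsided chain can succeed no more often than its weakest stage.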

Official Meta-World broad-suite animation

Reach, Push, Pick, Door, Drawer, Button, Peg, Window, Box, Coffee, Dial, Faucet, Hammer, Handle, Lever, Plate, Sweep, Soccer, Shelf, Stick, Basket, Bin, Lock, Unlock, Wall, Pull, Insert, Unplug, Assembly, Disassemble, Close, Open, Press, Slide, Turn, Reach-wall, Push-wall, Pick-wall, Push-back, Shelf-place, Button-wall, Coffee-push, Coffee-pull, Door-close, Drawer-close, Window-close, Faucet-close, Handle-side, Plate-side, Stick-pull

Official Meta-World broad-suite GIF. MT50 trains one task-aware policy on all 50 tasks; the official site does not provide a separate MT50 GIF.

05

Background

Original literature context

Soft Actor-Critic became the default MTRL backbone

Soft Actor-Critic (Haarnoja et al., 2018)

Maximum-entropy actor-critic objective.
Off-policy replay reuses past transitions.
Target critics and replay stabilize TD targets.
Strong sample efficiency made SAC natural for robot manipulation benchmarks.

The follow-up MT50 line was SAC-style

Meta-World benchmark: Yu et al. (2020).
Soft Modularization: Yang et al. (2020).
CARE: Sodhani et al. (2021).
PaCo (2022), MOORE (2023), and ARS-family baselines (ARS / ARS-LN / ARS-LoRA, 2025) continue the SAC-centered comparison set.

Proximal Policy Optimization (PPO) is a strong on-policy workhorse, but it was not the historical center of MT50 method design. We use it to expose what breaks in shared multi-task updates.

06

SAC

Soft Actor-Critic formulation

SAC optimizes reward plus entropy

\[ J(\pi)=\mathbb E_\pi\!\left[\sum_{t\ge0}\gamma^t \{r(s_t,a_t)+\alpha\,\mathcal H(\pi(\cdot\mid s_t))\}\right] \]

\[ V^\pi(s)=\mathbb E_{a\sim\pi}\!\left[ Q^\pi(s,a)-\alpha\log \pi(a\mid s)\right] \]

\[ Q^\pi(s,a)=r(s,a)+\gamma\,\mathbb E_{s'\sim P}V^\pi(s') \]

What this buys in MTRL

Entropy keeps exploration alive across tasks.
Replay mixes old and new data across the task set.
Soft values make target changes less abrupt.

Q

action-value critic

V

soft state value

07

SAC

Critic and actor losses

SAC trains from replayed Bellman targets

\[ y=r+\gamma\;\mathbb E_{a'\sim\pi}\!\left[ Q_{\bar\theta}(s',a')-\alpha\log\pi(a'\mid s')\right] \]

\[ L_Q(\theta)=\mathbb E_{(s,a,r,s')\sim\mathcal D} \left[(Q_\theta(s,a)-y)^2\right] \]

\[ L_\pi(\phi)=\mathbb E_{s\sim\mathcal D,a\sim\pi_\phi} \left[\alpha\log\pi_\phi(a\mid s)-Q_\theta(s,a)\right] \]
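The three formulas can be sketched for a discrete action set, where the expectation over \(a'\sim\pi\) is an exact sum. This is an illustration under that assumption. The function names are hypothetical, policy probabilities must be strictly positive, and terminal-state handling is omitted, as in the formulas above.

```python
import numpy as np

def soft_value(q, pi, alpha):
    """E_{a~pi}[Q(s,a) - alpha * log pi(a|s)] for a discrete action set."""
    return np.sum(pi * (q - alpha * np.log(pi)), axis=1)

def sac_losses(q_sa, q_all, pi, r, q_next_target, pi_next, alpha, gamma):
    """Bellman target, critic loss and actor loss for one replay minibatch.

    q_sa          : (B,)   Q_theta(s, a) at the replayed actions
    q_all, pi     : (B, A) Q_theta(s, .) and pi_phi(. | s)
    q_next_target : (B, A) target-critic values at s'
    pi_next       : (B, A) pi_phi(. | s')
    """
    y = r + gamma * soft_value(q_next_target, pi_next, alpha)
    critic_loss = np.mean((q_sa - y) ** 2)
    actor_loss = np.mean(np.sum(pi * (alpha * np.log(pi) - q_all), axis=1))
    return y, critic_loss, actor_loss

uniform = np.array([[0.5, 0.5]])
ones = np.array([[1.0, 1.0]])
y, critic_loss, actor_loss = sac_losses(
    q_sa=np.array([0.0]), q_all=ones, pi=uniform, r=np.array([1.0]),
    q_next_target=ones, pi_next=uniform, alpha=0.5, gamma=0.9)
```

In practice `y` is treated as a constant when differentiating the critic loss; the sketch only evaluates the losses.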

08

Literature

What grew around SAC

Multi-task SAC variants: a compact map

Family	Typical change	Examples and references
SAC backbone	Maximum-entropy off-policy actor-critic	Soft Actor-Critic: Haarnoja et al. (2018)
Shared / multi-head	Shared trunk with task ID or task-specific heads	SAC-MT, MT-MH-SAC; Meta-World / Meta-World+ protocols
Gradient surgery	Modify cross-task gradient aggregation	PCGrad: Yu et al. (2020); CAGrad: Liu et al. (2021); FairGrad: Ban et al. (2024)
Modules / experts	Specialize representations or compose sub-networks	Soft Modularization: Yang et al. (2020); CARE: Sodhani et al. (2021); PaCo: Sun et al. (2022); MOORE: Hendawy et al. (2023)
ARS-family	Modern strong actor-regularized SAC-style baselines	ARS / ARS-LN / ARS-LoRA: Cho et al. (2025)

We now diagnose PPO directly and make a paradigm shift for multi-task RL: fix the critic-side update before adding more architecture.

09

PPO

Proximal Policy Optimization

PPO: collect, estimate, clip, fit value

\[ \textbf{PPO objective:}\quad L(\theta)=L^a(\theta^a)+c_vL^c(\theta^c)-c_eH(\theta^a) \]

\[ \textbf{Actor surrogate:}\quad L^a(\theta^a)=-\mathbb E_t\!\left[ \min\{\rho_t\hat A_t,\ \operatorname{clip}(\rho_t,1-\epsilon,1+\epsilon)\hat A_t\}\right] \]

\[ \begin{aligned} \textbf{Critic loss:}\quad L^c(\theta^c) &=\mathbb E_t\!\left[ (V_{\theta^c}(s_t)-\hat V_t^{\mathrm{targ}})^2 \right] \end{aligned} \]

\[ \begin{aligned} \textbf{Entropy bonus:}\quad H(\theta^a) &=\mathbb E_t\!\left[ \mathcal H(\pi_{\theta^a}(\cdot\mid s_t)) \right] \end{aligned} \]

\[ \begin{aligned} \textbf{Ratio:}\quad \rho_t(\theta^a) &= \frac{\pi_{\theta^a}(a_t\mid s_t)} {\pi_{\theta^a_{\mathrm{old}}}(a_t\mid s_t)} \end{aligned} \]

\[ \begin{aligned} \textbf{GAE:}\quad \delta_t&=r_t+\gamma V_{\mathrm{old}}(s_{t+1})-V_{\mathrm{old}}(s_t),\\ \hat A_t&=\sum_{\ell\ge0}(\gamma\lambda)^\ell\delta_{t+\ell} \end{aligned} \]

\[ \begin{aligned} \textbf{Value target:}\quad \hat V_t^{\mathrm{targ}} &=\hat A_t+V_{\mathrm{old}}(s_t) \end{aligned} \]
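A minimal NumPy sketch of the GAE recursion, the value target, and the clipped actor surrogate. It assumes a single rollout segment with no episode termination inside it and a bootstrap value `last_value` for the state after the final step. The function names are illustrative.

```python
import numpy as np

def gae(rewards, values, last_value, gamma, lam):
    """Return (advantages, value targets) for one rollout segment."""
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)
    next_values = np.append(values[1:], last_value)
    deltas = rewards + gamma * next_values - values     # TD residuals delta_t
    adv = np.zeros_like(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = deltas[t] + gamma * lam * running     # A_t = delta_t + gamma*lam*A_{t+1}
        adv[t] = running
    return adv, adv + values                            # V_targ = A_t + V_old(s_t)

def clipped_actor_loss(logp_new, logp_old, adv, eps):
    """Negative clipped surrogate, averaged over the batch."""
    ratio = np.exp(np.asarray(logp_new) - np.asarray(logp_old))
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    return -float(np.mean(np.minimum(ratio * adv, clipped * adv)))

adv, targets = gae([1.0, 1.0, 1.0], [0.0, 0.0, 0.0], 0.0, gamma=1.0, lam=1.0)
```

With \(\gamma=\lambda=1\) and a zero critic, the advantages reduce to the remaining undiscounted return, which is a quick sanity check on the recursion.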

10

MT-PPO

What changes with many tasks

Multi-task PPO averages per-task losses

\[ L(\theta)=\frac1K\sum_{i=1}^{K}L_i(\theta),\quad L_i=L_i^a(\theta^a)+c_vL_i^c(\theta^c)-c_eH_i(\theta^a) \]

\[ \rho_t^{(i)}= \frac{\pi_{\theta^a}(a_t\mid s_t,i)} {\pi_{\theta^a_{\mathrm{old}}}(a_t\mid s_t,i)},\quad \hat A_t^{(i)}=\sum_{\ell\ge0}(\gamma\lambda)^\ell\delta_{t+\ell}^{(i)} \]

\[ g_i^a=\nabla_{\theta^a}L_i^a,\qquad g_i^c=\nabla_{\theta^c}L_i^c \]

Important implementation detail

Rollouts are stratified by task.
Advantages are normalized within each task slice before the actor surrogate.
Tasks share information only through the common actor and critic parameters.
Every minibatch still needs one actor direction and one critic direction.
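Task-wise advantage normalization can be sketched as follows. The function name and the stabilizing constant `eps` are assumptions for illustration.

```python
import numpy as np

def normalize_advantages_per_task(adv, task_ids, eps=1e-8):
    """Standardize advantages separately inside each task slice of a minibatch."""
    adv = np.asarray(adv, dtype=float)
    task_ids = np.asarray(task_ids)
    out = np.empty_like(adv)
    for task in np.unique(task_ids):
        mask = task_ids == task
        out[mask] = (adv[mask] - adv[mask].mean()) / (adv[mask].std() + eps)
    return out

# Two tasks whose raw advantages differ by three orders of magnitude.
normalized = normalize_advantages_per_task([100.0, 300.0, 0.1, 0.3], [0, 0, 1, 1])
```

After normalization both task slices contribute advantages on the same scale, so the actor surrogate does not inherit the reward scale of any single task.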

11

Shared update

The statistical object

Multi-task PPO is gradient aggregation

With shared parameters, the optimizer never applies \(K\) separate updates. It applies one aggregate actor direction and one aggregate critic direction.

Large norms make a task loud.
Negative cosines create direction conflict.
Nearly co-linear gradients leave few task-specific directions.
The aggregate direction decides which tasks get update budget.

12

Diagnostic

Diagnostic bridge

The task-gradient Gram matrix

\[ g_i^a:=\nabla_{\theta^a}L_i^a,\qquad g_i^c:=\nabla_{\theta^c}L_i^c \]

\[ G_{ij}^{\bullet}=\langle g_i^\bullet,g_j^\bullet\rangle,\quad \bullet\in\{a,c\},\qquad d^\bullet=\sum_i w_i g_i^\bullet \]

\(G_{ii}\) records per-task gradient scale.
\(G_{ij}/\sqrt{G_{ii}G_{jj}}\) records alignment or conflict.
Once \(G\) is formed, any weighted aggregate can be scored from it alone: \(\langle g_i,d\rangle=(Gw)_i\) and \(\|d\|^2=w^\top Gw\), with no further access to the full gradients.
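A short sketch of the diagnostic, assuming the per-task gradients have already been flattened into the rows of an array. The function names and the toy gradients are illustrative.

```python
import numpy as np

def gram_diagnostics(grads):
    """grads: (K, P) array with one flattened gradient per task."""
    G = grads @ grads.T                        # G_ij = <g_i, g_j>
    norms = np.sqrt(np.diag(G))
    cosine = G / np.outer(norms, norms)        # G_ij / sqrt(G_ii * G_jj)
    return G, cosine

def score_direction(G, w):
    """Per-task inner products <g_i, d> and ||d||^2 for d = sum_i w_i g_i."""
    x = G @ w
    return x, float(w @ x)

grads = np.array([[10.0, 0.0, 0.0],            # loud task
                  [0.0, 1.0, 0.0],             # quiet task
                  [-1.0, 1.0, 0.0]])           # conflicts with the loud task
G, cosine = gram_diagnostics(grads)
w = np.full(3, 1.0 / 3.0)                      # plain mean aggregation
x, d_sq_norm = score_direction(G, w)
```

Here `x[i] < 0` means the aggregate direction moves against task `i`, which is the kind of signal a Gram-based aggregator can act on.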

Geometry summary

Diagonal: \(\log_{10}G_{ii}\), task loudness

Off diagonal: cosine, conflict to alignment

13

Diagnosis

Actor and critic are asymmetric

The critic is where PPO breaks

497x

critic gradient norm spread

4.1x

actor spread after task-wise advantage normalization

The intervention target is not PPO as a whole. It is the shared critic update.

14

Diagnosis

Three critic pathologies

The failure is specific, not generic conflict

D1. Scale spread

Easy tasks are loud

Mean critic aggregation inherits the scale of the highest-gradient tasks.

D2. Co-linear collapse

Directions compress

Critic features lose task-specific directions.

D3. Unfair aggregation

Budget goes to dominant tasks

A better aggregator must use gradient geometry, not only loss values.

15

Method

Surgery 1: target scale

PopArt normalizes value targets before gradients form

PopArt standardizes each task's return targets and, whenever the statistics \((\mu_i,\sigma_i)\) are updated, reparameterizes the affine head so the unnormalized predictions do not jump.

\[ \begin{aligned} V_{\theta^c}(s,i)&=\sigma_i z_{\theta^c}(s,i)+\mu_i,\\ y_i^{\mathrm{norm}}&=\frac{\hat V_i^{\mathrm{targ}}-\mu_i}{\sigma_i} \end{aligned} \]

\[ g_i^{c,\mathrm{PopArt}}=\sigma_i^{-2}g_i^{c,\mathrm{raw}} \]
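The output-preserving update can be sketched for a single task's value head. Everything below is an illustration under stated assumptions: the class name, the moving-average rate `beta`, and the variance floor are not taken from the talk, and a multi-task critic would keep one set of statistics per task.

```python
import numpy as np

class PopArtHead:
    """Value head V(h) = sigma * (w @ h + b) + mu for one task."""

    def __init__(self, dim, beta=0.01):
        self.w = np.zeros(dim)
        self.b = 0.0
        self.mu = 0.0
        self.nu = 1.0            # running second moment of the targets
        self.sigma = 1.0
        self.beta = beta         # assumed moving-average rate

    def value(self, h):
        return self.sigma * (self.w @ h + self.b) + self.mu

    def update_stats(self, targets):
        targets = np.asarray(targets, dtype=float)
        old_mu, old_sigma = self.mu, self.sigma
        self.mu = (1 - self.beta) * self.mu + self.beta * targets.mean()
        self.nu = (1 - self.beta) * self.nu + self.beta * np.mean(targets ** 2)
        self.sigma = float(np.sqrt(max(self.nu - self.mu ** 2, 1e-8)))
        # Rescale the affine head so value(h) is unchanged by the new statistics.
        self.w = self.w * old_sigma / self.sigma
        self.b = (old_sigma * self.b + old_mu - self.mu) / self.sigma

    def normalize(self, targets):
        return (np.asarray(targets, dtype=float) - self.mu) / self.sigma

head = PopArtHead(dim=2)
head.w, head.b = np.array([0.5, -1.0]), 0.3
h = np.array([1.0, 2.0])
before = head.value(h)
head.update_stats([100.0, 200.0, 300.0])
after = head.value(h)
```

The statistics move toward the new targets while `value(h)` stays fixed, so the critic regresses onto normalized targets without a discontinuity in its raw predictions.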

Figure 3 panels a and b showing vanilla and PopArt Gram matrices

Paper Fig. 3, panels (a) vanilla and (b) +PopArt: early critic Gram diagnostics.

16

Method

Surgery 2: representation conditioning

Layer Normalization (LN) conditions the critic features

LN-c applies pre-activation LayerNorm in the critic's hidden linear layers. It is a feature-conditioning fix, not another reward-scale normalization.

\[ \mathrm{LN}(h)=\gamma\odot \frac{h-\bar h}{\sqrt{\operatorname{Var}(h)+\varepsilon}}+\beta \]

Stabilizes hidden activation scale.
Reduces co-linear collapse in critic gradients.
In early MT10 diagnostics, mean off-diagonal \(|\cos|\) drops from 0.34 to 0.20.
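For reference, the LN formula above in NumPy, together with one pre-activation critic layer. The ReLU activation and the function names are assumptions. The learned scale is called `gain` in the code to avoid confusion with the discount factor.

```python
import numpy as np

def layer_norm(h, gain, bias, eps=1e-5):
    """Normalize each row of h over its feature axis, then apply gain and bias."""
    mean = h.mean(axis=-1, keepdims=True)
    var = h.var(axis=-1, keepdims=True)
    return gain * (h - mean) / np.sqrt(var + eps) + bias

def critic_hidden_layer(x, W, b, gain, bias):
    """Pre-activation LN: linear map, then LayerNorm, then ReLU (assumed)."""
    return np.maximum(layer_norm(x @ W + b, gain, bias), 0.0)

h = np.array([[1.0, 2.0, 3.0, 4.0],
              [100.0, 200.0, 300.0, 400.0]])   # same pattern, 100x the scale
out = layer_norm(h, gain=np.ones(4), bias=np.zeros(4))
```

Both rows map to nearly the same normalized features, which is the sense in which LN stabilizes the hidden activation scale.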

Figure 3 panels c and d showing LN effects

Paper Fig. 3, panels (c) +LN and (d) +PopArt+LN: cleaner diagonal and less off-diagonal collapse.

17

Method intuition

Surgery 3: fair critic aggregation

FairGrad with \(\alpha=1\): equalize contribution, not loss

\[ d^c=\sum_i w_i g_i^c,\qquad x_i=(G^c w)_i=(g_i^c)^\top d^c \]

Mean aggregation gives more update budget to large-norm critic tasks.
FairGrad increases weights for low-contribution tasks.
The Gram matrix \(G^c\) is the right summary because \(x_i\) and \(\|d^c\|\) are computed from pairwise inner products.

\[ \begin{aligned} \hbox{orthogonal toy: }&\|g_1\|=10,\ \|g_2\|=1\\ &\Rightarrow\ w_1=1/10,\ w_2=1\quad(\alpha=1) \end{aligned} \]
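A minimal sketch of computing \(\alpha=1\) weights from a Gram matrix by solving \(Gw=w^{-1}\) (elementwise inverse). The solver is an assumption of this sketch, not the one used in the paper or in the FairGrad reference code. It applies damped Newton steps to the convex potential \(\tfrac12 w^\top Gw-\sum_i\log w_i\), whose stationarity condition is exactly that equation.

```python
import numpy as np

def fairgrad_alpha1_weights(G, iters=100, tol=1e-10):
    """Solve G w = 1/w (elementwise) for w > 0, given a positive definite G."""
    G = np.asarray(G, dtype=float)
    w = 1.0 / np.sqrt(np.diag(G))              # exact when gradients are orthogonal
    for _ in range(iters):
        grad = G @ w - 1.0 / w                 # gradient of the potential
        hess = G + np.diag(1.0 / w ** 2)       # Hessian of the potential
        step = np.linalg.solve(hess, grad)
        decrement = np.sqrt(max(float(grad @ step), 0.0))
        if decrement < tol:
            break
        w = w - step / (1.0 + decrement)       # damped step keeps w strictly positive
    return w

G_toy = np.diag([100.0, 1.0])                  # orthogonal toy: ||g1|| = 10, ||g2|| = 1
w_toy = fairgrad_alpha1_weights(G_toy)

G_corr = np.array([[4.0, 1.0], [1.0, 1.0]])    # correlated pair, made-up numbers
w_corr = fairgrad_alpha1_weights(G_corr)
```

On the orthogonal toy the weights reduce to \(1/\|g_i\|\), so each task contributes a unit-length component to the aggregate.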

18

Theory

Theoretical result

FairGrad \(\alpha=1\): fixed norm and scale invariance

Theorem

Let \(G=[\langle g_i,g_j\rangle]\succ0\). If \(w>0\) solves

\[ G w=w^{-1},\qquad d=\sum_{i=1}^{K}w_i g_i , \]

then the aggregate has fixed norm

\[ \|d\|^2=w^\top G w=K . \]

For any task rescaling \(\tilde g_i=c_i g_i\), \(c_i>0\), the weights transform as \(\tilde w_i=w_i/c_i\), so \(\sum_i \tilde w_i\tilde g_i=d\).
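Both claims follow in a few lines from the defining equation. The derivation below is a reconstruction sketch rather than the paper's proof; \(w^{-1}\) is the elementwise inverse and \(C=\operatorname{diag}(c_1,\ldots,c_K)\) is notation introduced here.

\[ \|d\|^2=w^\top G w=\sum_{i=1}^{K} w_i\,(Gw)_i=\sum_{i=1}^{K} w_i\cdot\frac{1}{w_i}=K \]

\[ \tilde G=CGC,\qquad \tilde w=C^{-1}w\ \Longrightarrow\ \tilde G\tilde w=CGw=Cw^{-1}=(C^{-1}w)^{-1}=\tilde w^{-1} \]

\[ \sum_{i=1}^{K}\tilde w_i\tilde g_i=\sum_{i=1}^{K}\frac{w_i}{c_i}\,c_i g_i=\sum_{i=1}^{K}w_i g_i=d \]

So \(\tilde w\) satisfies the same equation for the rescaled gradients, and the aggregate direction is unchanged.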

Figure 3 panel a showing critic Gram matrix

Paper Fig. 3 panel (a): the raw critic Gram matrix shows the scale problem that FairGrad is designed to neutralize.

19

TOPPO

Algorithm 1

TOPPO: a PPO collect-and-update phase

Require tasks \(\{\mathcal M_i\}_{i=1}^{K}\), policy and critic parameters \(\theta=(\theta^a,\theta^c)\), LN-c critic, PopArt statistics \((\mu_i,\sigma_i)\), repeat count \(R\), and max gradient norm \(\tau\).

01 collect

Roll out old policy

Collect \(\mathcal B=\cup_i\mathcal B_i\) with \(\pi_{\theta^a_{\mathrm{old}}}\). Compute \(\hat A_t^{(i)}\) and \(\hat V_t^{(i),\mathrm{targ}}\).

02 normalize targets

PopArt

Update \((\mu_i,\sigma_i)\), renormalize the affine head, and use \((\hat V_t^{(i),\mathrm{targ}}-\mu_i)/\sigma_i\).

03 minibatches

PPO epochs

For \(r=1,\ldots,R\), form stratified minibatches and normalize advantages within each task slice.

04 task gradients

Separate actor and critic

For every task \(i\), compute \(g_i^a\) from the actor surrogate and \(g_i^{c,\mathrm{PopArt}}\) from the normalized critic loss.

05 critic surgery

FG-c

Aggregate the critic with \(d^c=\sum_i w_i g_i^{c,\mathrm{PopArt}}\), where FairGrad chooses weights from the critic Gram matrix.

06 actor + step

PCGrad-a and Adam

Set \(d^a=\mathrm{PCGrad}(\{g_i^a\})\), clip \((d^a,d^c)\) by \(\tau\), and take one Adam step on \(\theta\).
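Step 06 can be sketched in NumPy as a PCGrad projection followed by norm clipping. Two details are assumptions of this sketch: the projected actor gradients are averaged rather than summed, and \((d^a,d^c)\) are clipped jointly by their combined norm. The function names are illustrative, and the Adam step is omitted.

```python
import numpy as np

def pcgrad(grads, rng):
    """Project away pairwise conflicts, then average the projected gradients."""
    grads = np.asarray(grads, dtype=float)
    projected = grads.copy()
    K = len(grads)
    for i in range(K):
        for j in rng.permutation(K):           # random order over the other tasks
            if j == i:
                continue
            dot = projected[i] @ grads[j]
            if dot < 0:                        # conflict: remove the component along g_j
                projected[i] -= dot / (grads[j] @ grads[j]) * grads[j]
    return projected.mean(axis=0)

def clip_by_global_norm(vectors, max_norm):
    """Scale all vectors by one factor so their joint norm is at most max_norm."""
    total = np.sqrt(sum(float(v @ v) for v in vectors))
    scale = min(1.0, max_norm / (total + 1e-12))
    return [v * scale for v in vectors]

rng = np.random.default_rng(0)
d_a = pcgrad([[1.0, 0.0], [-1.0, 1.0]], rng)   # two conflicting actor gradients
d_a_clipped, d_c_clipped = clip_by_global_norm(
    [np.array([3.0, 0.0]), np.array([0.0, 4.0])], max_norm=1.0)
```

After projection neither task gradient has a negative inner product with the other task's original gradient, so the averaged direction no longer cancels along the conflicting axis.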

20

Result

Now show the MT50 result

TOPPO recovers mean and tail performance

90.9%

MT50 mean success

56.5%

worst-10 tail success

717K

parameters

21

Benchmark table

Headline success and worst-\(k\) tail

MT50 mean and tail success

Algorithm	Params	Worst-5	Worst-10	Worst-20	All / Mean
SAC-based and SAC-family baselines
SAC-MT	2597K	0.0	0.0	0.0	47.6
MT-MH-SAC	2754K	0.0	0.0	0.0	47.2
Soft Modularization	3534K	0.0	0.0	1.8	54.1
PCGrad-SAC	2597K	0.0	0.0	0.0	51.9
PaCo	--	0.0	0.0	4.6	55.6
MOORE	7403K	0.0	0.0	15.9	64.2
ARS	2597K	--	9.1	26.3	65.9
ARS-LN (400)	2611K	--	21.0	49.1	78.3
ARS-LoRA	16123K	--	29.3	58.4	83.2
PPO-based methods
Vanilla MT-PPO	716K	0.0	6.6	46.3	78.5
+LN-c +PopArt +FG-c	717K	22.3	52.4	75.1	90.1
TOPPO	717K	24.2	56.5	77.2	90.9

Values are final-checkpoint success rates (%). The full paper table also reports MT10 and standard deviations where available.

22

Evidence

Mechanism evidence after the table

Ablation study

6.6

vanilla worst-10

42.3

after LN-c

56.5

TOPPO worst-10

The ladder separates the critic-side surgeries. LN-c gives the largest tail jump (worst-10 rises from 6.6 to 42.3), while PopArt and FG-c stabilize target scale and gradient budget. The combined path is consistent with the critic-side diagnosis.

23

Thank you

Thank You

TOPPO Optimizes Proximal Policy Optimization for multi-task reinforcement learning with critic balancing.

Li, Y., Lin, G., Qu, A., & Miao, R.† (2026). TOPPO: Rethinking PPO for Multi-Task Reinforcement Learning with Critic Balancing. arXiv. https://arxiv.org/abs/2605.11473

Contact

rui.miao@utdallas.edu

rui-miao.github.io