Offline RL with selective outcome recording

Off-Policy Evaluation for Missingness-Aware Policies in MDPs with Rewards Missing Not at Random

Ziheng Wei, Annie Qu, Rui Miao†

University of Michigan | UC Santa Barbara | UT Dallas


The practical failure mode

The reward we most need may be the one not recorded.

Selection bias can survive infinite data.


Healthcare

A patient misses a follow-up questionnaire because symptoms worsened. Labs and vitals keep arriving, but the quality-of-life reward is absent.


Business

A high-value purchase crosses devices or triggers manual review. The campaign sees the click, but attribution loses the conversion reward.


Offline RL

The batch data are all we have. If missingness depends on the reward, ordinary fitted Q evaluation (FQE) learns the value of the recorded world, not the real one.


MDP basics

An MDP is a compact language for sequential decisions.

State, action, transition, reward.

Finite horizon

$$\mathcal{M}=(\mathcal S,\mathcal A,\mathcal P,r,T)$$

Dynamics

The environment samples $S_{t+1}$ from $P_t(\cdot\mid S_t,A_t)$.

Reward

The signal $R_t$ scores the transition; in this paper that signal may be missing not at random.


MDP basics

A policy induces a value through Bellman recursion.

This is the object OPE estimates.

Policy

$$\pi_t(a\mid s)=P(A_t=a\mid S_t=s)$$

Bellman equations

$$Q_t^\pi(s,a)=\mathbb E[R_t+V_{t+1}^\pi(S_{t+1})\mid S_t=s,A_t=a]$$

$$V_t^\pi(s)=\sum_a\pi_t(a\mid s)Q_t^\pi(s,a)$$
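The recursion above can be sketched for a small tabular MDP. The array layout (`P[t, s, a, s2]`, `r[t, s, a]`, `pi[t, s, a]`) and the function name `evaluate_policy` are illustrative assumptions, not part of the talk.

```python
import numpy as np

def evaluate_policy(P, r, pi):
    """Finite-horizon policy evaluation by backward induction.

    P[t, s, a, s2]: transition probabilities (assumed layout).
    r[t, s, a]:     mean rewards.
    pi[t, s, a]:    policy probabilities pi_t(a | s).
    Returns V with shape (T + 1, n_states); the last row is V_{T+1} = 0.
    """
    T, n_states, _ = r.shape
    V = np.zeros((T + 1, n_states))
    for t in reversed(range(T)):
        Q = r[t] + P[t] @ V[t + 1]       # Q_t(s,a) = r_t(s,a) + E[V_{t+1}(S') | s,a]
        V[t] = (pi[t] * Q).sum(axis=1)   # V_t(s) = sum_a pi_t(a|s) Q_t(s,a)
    return V
```

With full knowledge of `P` and `r` this is exact; OPE has to approximate the same recursion from logged transitions.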


Why OPE is the right battleground

We want to evaluate before we intervene.

OPE

Observed batch data

$$\mathcal{D} \sim \pi^b$$

Historical trajectories collected under a behavior policy: standard care, old marketing rules, or the current product policy.

Target policy

$$V(\pi) = \mathbb{E}_{\pi}\!\left[\sum_{t=1}^{T} R_t\right]$$

The new policy may react to whether the previous outcome was observed. OPE estimates its value from logged data, without risky online exploration.

Standard OPE assumes rewards are available or ignorable after conditioning. MNAR rewards violate that assumption at the reward-modeling step.


The missingness hierarchy

MNAR means the observation process looks at the outcome.

Let $O_t=1$ mean $R_t$ is recorded.

MCAR (missing completely at random)

Observation is unrelated to rewards and covariates.

MAR (missing at random)

Observation depends only on observed variables such as state and action.

MNAR (missing not at random)

Observation depends on the possibly unobserved reward itself.

$$\mathbb{E}[R_t\mid S_t,A_t,O_t=1] \ne \mathbb{E}[R_t\mid S_t,A_t]$$
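A minimal numerical check of the inequality above, inside one fixed $(s,a)$ cell. The binary reward, its probabilities, and the propensity values are made-up numbers chosen for illustration; `observed_mean` is a hypothetical helper.

```python
import numpy as np

def observed_mean(r_values, r_probs, propensity):
    """E[R | O=1] for a discrete reward, computed with Bayes' rule."""
    r_values = np.asarray(r_values, dtype=float)
    # Unnormalized P(R=r, O=1) = P(R=r) * P(O=1 | R=r)
    w = np.asarray(r_probs, dtype=float) * np.asarray(propensity, dtype=float)
    return float((r_values * w).sum() / w.sum())

r_values, r_probs = [0.0, 1.0], [0.5, 0.5]
full_mean = float(np.dot(r_values, r_probs))              # E[R] = 0.5
# Propensity constant in r: MCAR, or MAR within a fixed (s, a) cell.
flat_mean = observed_mean(r_values, r_probs, [0.6, 0.6])  # 0.5, no shift
# Propensity increasing in r: MNAR, high rewards are recorded more often.
mnar_mean = observed_mean(r_values, r_probs, [0.3, 0.9])  # 0.75, shifted upward
```

The gap between `mnar_mean` and `full_mean` does not depend on sample size, which is why more data alone cannot remove it.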


Healthcare example

The chart continues after the reward disappears.

A reward can be missing while future state is observed.

visit t

State $S_t$

Viral load, SOFA score, vitals, treatment history.

decision

Action $A_t$

Adjust dosage, fluids, vasopressors, follow-up intensity.

outcome window

Reward $R_t$

Quality-of-life score or change in SOFA.

more severe -> less likely recorded

Clinical reality: patients with worsening symptoms skip surveys or follow-ups, yet later labs, vitals, and encounters remain in the EHR.

Statistical opportunity: the next clinical state can carry information about the unrecorded reward without directly causing observation.


Business example

The most valuable conversions are also the easiest to lose.

Missing attribution is not neutral.

Logged journey

What the dataset learns

The campaign appears less profitable when high-value rewards are systematically under-attributed.

What we need to recover

$$\bar r_t(s,a)=\mathbb{E}[R_t\mid S_t=s,A_t=a]$$


Where standard FQE breaks

Conditioning on observation changes the regression target.

The error is bias, not just noise.

Ordinary FQE target on recorded rewards

$$\widehat Q_t \approx \mathbb{E}\!\left[R_t+\widehat V_{t+1}(S_{t+1})\mid S_t,A_t,O_t=1\right]$$

MNAR selection shift

$$\mathbb{E}[R_t\mid S_t,A_t,O_t=1]-\mathbb{E}[R_t\mid S_t,A_t]\ne 0$$

Bellman recursion propagates this one-step distortion over time. More data can estimate the wrong conditional mean very precisely.
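A short derivation (ours, not taken from the slides) makes the size of the one-step shift explicit. Write $e_t(s,a,r)=P(O_t=1\mid S_t=s,A_t=a,R_t=r)$ for the observation propensity and abbreviate $e_t=e_t(S_t,A_t,R_t)$. Bayes' rule gives

$$
\begin{aligned}
\mathbb{E}[R_t\mid S_t,A_t,O_t=1]-\mathbb{E}[R_t\mid S_t,A_t]
&=\frac{\mathbb{E}[e_tR_t\mid S_t,A_t]}{\mathbb{E}[e_t\mid S_t,A_t]}-\mathbb{E}[R_t\mid S_t,A_t]\\
&=\frac{\operatorname{Cov}(R_t,e_t\mid S_t,A_t)}{\mathbb{E}[e_t\mid S_t,A_t]}
\end{aligned}
$$

The shift vanishes when $e_t$ does not vary with $R_t$ given $(S_t,A_t)$, which is the MAR case, and is nonzero whenever the propensity is correlated with the reward.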


Formal setup

Finite-horizon MDP with reward-dependent observation.

Target policies may react to missingness.

Observed trajectory

$$\tau_i=\{S_{t,i},A_{t,i},O_{t,i},R^{obs}_{t,i},S_{t+1,i}\}_{t=1}^{T}$$

$$R_t^{obs}=O_tR_t$$

MNAR propensity

$$e_t(s,a,r)=P(O_t=1\mid S_t=s,A_t=a,R_t=r)$$

Missingness-aware target policy

$$\pi_t(a\mid s,o_-),\quad \widetilde S_t=(S_t,O_{t-1}),\quad V(\pi)=\mathbb{E}[V_1^{\pi}(S_1,0)]$$
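A minimal sketch of the augmented state in code. The convention $O_0=0$ follows from $V(\pi)=\mathbb{E}[V_1^{\pi}(S_1,0)]$; the helper names and the two-action `example_policy` with its probabilities are hypothetical and only illustrate the interface $\pi_t(a\mid s,o_-)$.

```python
import numpy as np

def previous_observation(O):
    """Shift O one stage to the right so column t holds O_{t-1}, with O_0 = 0."""
    O = np.asarray(O)
    first = np.zeros((O.shape[0], 1), dtype=O.dtype)
    return np.concatenate([first, O[:, :-1]], axis=1)

def example_policy(s, o_prev):
    """Hypothetical pi_t(. | s, o_prev) over two actions.

    Illustration only: it ignores s and favors action 0 when the
    previous reward went unrecorded.
    """
    p1 = 0.8 if o_prev == 1 else 0.3
    return np.array([1.0 - p1, p1])

O = np.array([[1, 0, 1],
              [0, 1, 1]])
O_prev = previous_observation(O)   # [[0, 1, 0], [0, 0, 1]]
```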


Data-generating structure

The target policy can use missingness, but the missingness is generated by the reward.

Black: MDP. Red: MNAR. Blue: target policy.


Identification insight

Use the next state as an endogenous shadow variable.

No extra proxy measurement is required.

exclusion

$S_{t+1}$ carries no extra information about observation after conditioning on $R_t,S_t,A_t$.

$$S_{t+1}\perp O_t\mid R_t,S_t,A_t$$

relevance

$S_{t+1}$ remains informative about the reward on the observed subset.

$$S_{t+1}\not\perp R_t\mid S_t,A_t,O_t=1$$


Bridge function

Turn a shadow-variable restriction into a recoverable reward.

The bridge solves a conditional moment equation.

Bridge condition

$$\mathbb{E}\!\left[b_t(S_t,A_t,S_{t+1})\mid R_t,S_t,A_t\right]=R_t$$

Recovered reward mean

$$\mathbb{E}\!\left[b_t(s,a,S_{t+1})\mid S_t=s,A_t=a\right]=\mathbb{E}[R_t\mid S_t=s,A_t=a]$$

Why observed rewards are enough to learn it

Recall A1

No future dependence: $O_t$ depends on current $S_t,A_t,R_t$, not future states.

Recall A2

Exclusion: $S_{t+1}\perp O_t\mid R_t,S_t,A_t$.

$$P(S_{t+1}\mid R_t,S_t,A_t,O_t=1)=P(S_{t+1}\mid R_t,S_t,A_t)$$
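A toy discrete instance of the bridge condition, for one fixed $(s,a)$ with a binary reward and a binary next state. All probabilities are made-up numbers. In this two-by-two case the bridge exists and is unique exactly when the matrix `K` is invertible, which is our stand-in here for the relevance and completeness requirements.

```python
import numpy as np

p_r = np.array([0.6, 0.4])       # P(R=0), P(R=1) given (s, a)
r_vals = np.array([0.0, 1.0])
K = np.array([[0.8, 0.2],        # K[r, s2] = P(S'=s2 | R=r, s, a)
              [0.3, 0.7]])
e = np.array([0.3, 0.9])         # MNAR propensity P(O=1 | R=r, s, a)

# Bridge condition: sum_{s2} K[r, s2] * b[s2] = r for every r.
# By exclusion, K is the same on the O=1 subset, so it is learnable
# from transitions whose reward was recorded.
b = np.linalg.solve(K, r_vals)

p_next = p_r @ K                              # law of S' given (s, a)
bridge_mean = float(p_next @ b)               # E[b(S') | s, a]
true_mean = float(p_r @ r_vals)               # E[R | s, a]
naive_mean = float((p_r * e) @ r_vals / (p_r * e).sum())  # E[R | s, a, O=1]
```

Here `bridge_mean` matches `true_mean` (0.4), while `naive_mean` is 2/3 because high rewards are over-represented among recorded ones.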


Identification theorem

Under positivity and completeness, the policy value is identified.

The missingness model is not explicitly estimated.

Step 1

Learn $b_t$

Fit the bridge from $\{i:O_{t,i}=1\}$ using $S_{t+1}$ as the shadow variable.

Step 2

Replace missing rewards

$$R^{\mathrm{rec}}_t=R^{\mathrm{obs}}_t+(1-O_t)b_t(S_t,A_t,S_{t+1})$$

Step 3

Run Bellman recursion

Evaluate $\pi_t(a\mid s,o_-)$ on the augmented state $(S_t,O_{t-1})$.

Key identity for FQE: the Bellman regression uses $R^{\mathrm{rec}}_t$, with $\mathbb{E}[R^{\mathrm{rec}}_t\mid S_t,A_t]=\mathbb{E}[R_t\mid S_t,A_t]=\bar r_t(S_t,A_t)$.
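The key identity can be checked in three lines. This is our sketch of the argument, using $R^{\mathrm{obs}}_t=O_tR_t$ and writing $e_t=P(O_t=1\mid R_t,S_t,A_t)$:

$$
\begin{aligned}
\mathbb{E}[R^{\mathrm{rec}}_t\mid R_t,S_t,A_t]
&=\mathbb{E}[O_t\mid R_t,S_t,A_t]\,R_t+\mathbb{E}[(1-O_t)\,b_t(S_t,A_t,S_{t+1})\mid R_t,S_t,A_t]\\
&=e_tR_t+(1-e_t)\,\mathbb{E}[b_t(S_t,A_t,S_{t+1})\mid R_t,S_t,A_t]\\
&=e_tR_t+(1-e_t)R_t=R_t
\end{aligned}
$$

The second line uses exclusion, $S_{t+1}\perp O_t\mid R_t,S_t,A_t$, to factor the expectation; the third uses the bridge condition. Averaging over $R_t$ given $(S_t,A_t)$ then yields $\mathbb{E}[R^{\mathrm{rec}}_t\mid S_t,A_t]=\mathbb{E}[R_t\mid S_t,A_t]$. The propensity $e_t$ cancels, which is why it never has to be estimated.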


Estimation

Bridge fitting is a conditional moment problem.

Min-max avoids double sampling.

Population moment on observed rewards

$$\mathbb{E}\!\left[b_t(S_t,A_t,S_{t+1})-R_t\mid R_t,S_t,A_t,O_t=1\right]=0$$

Sample saddle-point objective

$$ \begin{aligned} \min_{b_t\in\mathcal{B}^{(t)}}\sup_{g_t\in\mathcal{G}^{(t)}}\;& \frac{1}{n_t}\sum_{i\in\mathcal I_t^{obs}} \{(b_t(S_{t,i},A_{t,i},S_{t+1,i})-R_{t,i})g_t(R_{t,i},S_{t,i},A_{t,i})\}\\ &+\lambda_b\mathcal P_B(b_t)-\lambda_g\mathcal P_G(g_t) \end{aligned} $$

1. Only observed rewards are used to fit the bridge.
2. The learned bridge predicts at both observed and missing transitions.
3. FQE then operates on recovered rewards.
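A simplified sketch of the saddle-point fit, under assumptions of our own: both function classes are linear, $b=\Phi\theta$ and $g=\Psi w$, and the critic penalty is $\lambda_g\|w\|^2$. The inner supremum then has the closed form $\|m(\theta)\|^2/(4\lambda_g)$ with $m(\theta)=\Psi^\top(\Phi\theta-R)/n$, so the outer problem is least squares. The name `fit_linear_bridge`, the `ridge` parameter (which absorbs the penalty constants), and the simulated toy data are hypothetical; the talk's function classes may be richer.

```python
import numpy as np

def fit_linear_bridge(Phi, Psi, R, ridge=0.0):
    """Bridge coefficients from observed-reward rows only.

    Phi: features of (s, a, s_next); Psi: critic features of (r, s, a).
    Solves min_theta ||Psi.T @ (Phi @ theta - R) / n||^2 + ridge * ||theta||^2.
    """
    n, d = Phi.shape
    M = Psi.T @ Phi / n
    v = Psi.T @ R / n
    A = np.vstack([M, np.sqrt(ridge) * np.eye(d)])
    y = np.concatenate([v, np.zeros(d)])
    return np.linalg.lstsq(A, y, rcond=None)[0]

# Toy data for one fixed (s, a): binary reward, binary next state, MNAR recording.
rng = np.random.default_rng(0)
n = 200_000
R = rng.binomial(1, 0.4, size=n)
S_next = rng.binomial(1, np.where(R == 1, 0.7, 0.2))
O = rng.binomial(1, np.where(R == 1, 0.9, 0.3))   # higher reward, more often recorded

obs = O == 1
Phi = np.eye(2)[S_next[obs]]    # one-hot next state
Psi = np.eye(2)[R[obs]]         # one-hot reward
theta = fit_linear_bridge(Phi, Psi, R[obs].astype(float))

R_rec = O * R + (1 - O) * theta[S_next]   # recorded reward, else bridge prediction
naive_estimate = R[obs].mean()            # biased upward under this MNAR design
recovered_estimate = R_rec.mean()         # targets E[R] = 0.4
```

Note that the fit touches only rows with `O == 1`, while the prediction `theta[S_next]` is applied to rows with missing rewards.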


Proximal FQE

Backward induction with bridge-imputed rewards.

Same FQE rhythm, corrected reward target.

loop

Work backward over stages $t=T,T-1,\ldots,1$; initialize $\widehat{V}_{T+1}^{\pi}\equiv 0$.

bridge

Fit the bridge function $\widehat b_t$ using observed rewards and the decoupled min-max objective.

impute reward

Recovered reward: $\widetilde R^{\,rec}_{t,i}=R^{obs}_{t,i}+(1-O_{t,i})\widehat b_t(S_{t,i},A_{t,i},S_{t+1,i})$.

Bellman target

Bellman target: $y_{t,i}=\widetilde R^{\,rec}_{t,i}+\widehat{V}_{t+1}^{\pi}(S_{t+1,i},O_{t,i})$.

fit Q

Q regression: regress $y_{t,i}$ on $(S_{t,i},A_{t,i})$ to obtain $\widehat Q_t$.

update V

Value update: $\widehat{V}_t^{\pi}(s,o_-)=\sum_a\pi_t(a\mid s,o_-)\widehat Q_t(s,a)$; average $\widehat{V}_1^\pi(S_{1,i},0)$.
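The loop above can be sketched for tabular states and actions, where the Q regression reduces to cell means. This is an illustration under our own assumptions, not the authors' implementation: the bridges are taken as already fitted callables, actions are coded as indices, `R_obs` stores 0 where the reward is missing, and `policy[t, s, o_prev, a]` is an assumed array layout for $\pi_t(a\mid s,o_-)$.

```python
import numpy as np

def prox_fqe_tabular(S, A, O, R_obs, S_next, bridges, policy, n_states, n_actions):
    """Backward induction with bridge-imputed rewards (tabular sketch).

    S, A, O, R_obs, S_next: arrays of shape (n, T).
    bridges[t](s, a, s_next) -> imputed rewards for stage t.
    Returns the average of V_1(S_1, 0).
    """
    S, A, O, S_next = (np.asarray(x, dtype=int) for x in (S, A, O, S_next))
    R_obs = np.asarray(R_obs, dtype=float)
    n, T = S.shape
    V_next = np.zeros((n_states, 2))      # V_{T+1} = 0, indexed by (state, previous O)
    for t in reversed(range(T)):
        s, a, o, s2 = S[:, t], A[:, t], O[:, t], S_next[:, t]
        r_rec = R_obs[:, t] + (1 - o) * bridges[t](s, a, s2)   # recovered reward
        y = r_rec + V_next[s2, o]                              # Bellman target
        # Tabular Q regression: average y within each (s, a) cell.
        sums = np.zeros((n_states, n_actions))
        counts = np.zeros((n_states, n_actions))
        np.add.at(sums, (s, a), y)
        np.add.at(counts, (s, a), 1.0)
        # Cells never visited in the data are left at 0.
        Q = np.divide(sums, counts, out=np.zeros_like(sums), where=counts > 0)
        V_next = np.einsum("soa,sa->so", policy[t], Q)         # value update
    return float(V_next[S[:, 0], 0].mean())
```

A tiny usage example with one stage and a uniform policy:

```python
value = prox_fqe_tabular(
    S=[[0], [0], [1], [1]], A=[[0], [1], [0], [1]], O=[[1], [0], [1], [0]],
    R_obs=[[1.0], [0.0], [3.0], [0.0]], S_next=[[0], [1], [0], [1]],
    bridges=[lambda s, a, s2: 2.0 * s2],
    policy=np.full((1, 2, 2, 2), 0.5), n_states=2, n_actions=2,
)
```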


Theory, in one slide

The rates look like nonparametric FQE plus the price of an inverse problem.

A sketch, not a proof.

Bridge estimation

$$\|\widehat b_t-b_t^*\|_2 \lesssim \tau_t\delta_t$$

$\tau_t$ measures ill-posedness of the conditional expectation operator.

Policy value estimation

$$ \begin{aligned} |\widehat{V}(\pi)-V(\pi)| &\lesssim K\tau_{\max}T^2\sqrt{\log(T/\zeta)}\\ &\quad\times n^{-\frac{\alpha_{\min}}{2\alpha_{\min}+1}}\log n \end{aligned} $$

$n$: more trajectories reduce error

$T$: Bellman propagation costs horizon length

$K$: policy mismatch control

$\tau$: shadow-variable inverse stability
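To read the bound numerically, the right-hand side can be evaluated with the hidden constant set to 1. This is only a rough guide to scaling, not a sharp prediction. The function name `rate_factor` and the default values are our assumptions, and we read $\zeta$ as a failure probability, which the slides do not state.

```python
import math

def rate_factor(n, T, alpha_min, K=1.0, tau_max=1.0, zeta=0.05):
    """Right-hand side of the value bound, up to the unspecified constant."""
    horizon = T**2 * math.sqrt(math.log(T / zeta))
    sample = n ** (-alpha_min / (2 * alpha_min + 1)) * math.log(n)
    return K * tau_max * horizon * sample

# Doubling the horizon multiplies the factor by a bit more than 4.
horizon_ratio = rate_factor(10_000, 10, 1.0) / rate_factor(10_000, 5, 1.0)
# Going from 1e3 to 1e6 trajectories shrinks it by roughly 5x when alpha_min = 1.
sample_ratio = rate_factor(10**3, 10, 1.0) / rate_factor(10**6, 10, 1.0)
```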


What should the experiments verify?

The bound gives three empirical predictions.

Simulation and MIMIC check the same story.

Bound being checked empirically

$$ |\widehat{V}(\pi)-V(\pi)| \lesssim K\tau_{\max}T^2\sqrt{\log(T/\zeta)}\, n^{-\frac{\alpha_{\min}}{2\alpha_{\min}+1}}\log n $$

prediction 1

Larger $n$ should help ProxFQE.

The bridge estimator improves with sample size, so MSE should decay rather than plateau.

prediction 2

Longer horizons are harder.

Every step compounds estimation error through Bellman recursion, visible as increasing MSE in $T$.

prediction 3

MNAR-blind methods keep bias.

Naive regression, ordinary imputation, and unstable IPW should not reliably recover the oracle target.


Simulation design

A controlled MNAR-MDP where the reward drives observation.

50 seeds, five OPE methods.

[Figure: Simulated reward, action, and missingness overview]

2D

State $S_t=(S_{t,1},S_{t,2})$, binary action $A_t\in\{-1,+1\}$.

MNAR

$P(O_t=1\mid S_t,A_t,R_t)=\operatorname{expit}(c_0-0.1A_t+0.2(1,-2)^\top S_t+2.5R_t)$.

policy

The target policy depends on $S_t$ and previous missingness $O_{t-1}$.
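The observation mechanism of the simulation can be written down directly from the propensity formula. The intercept `c0` is left as a free argument because the slides do not give its value; the function names and the way rewards are passed in are our assumptions, and the reward and transition models of the simulation are not reproduced here.

```python
import numpy as np

def expit(x):
    """Logistic function 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + np.exp(-x))

def observation_prob(S, A, R, c0):
    """P(O_t = 1 | S_t, A_t, R_t) with the coefficients from the design.

    S has shape (n, 2); A takes values in {-1, +1}; R is the true reward.
    """
    S = np.asarray(S, dtype=float)
    A = np.asarray(A, dtype=float)
    R = np.asarray(R, dtype=float)
    lin = c0 - 0.1 * A + 0.2 * (S @ np.array([1.0, -2.0])) + 2.5 * R
    return expit(lin)

def draw_observation(S, A, R, c0, rng):
    """Sample O_t and the recorded reward R_obs = O_t * R_t."""
    O = rng.binomial(1, observation_prob(S, A, R, c0))
    return O, O * np.asarray(R, dtype=float)
```

Because the coefficient on $R_t$ is positive, transitions with larger rewards are more likely to be recorded, so the recorded rewards over-represent good outcomes.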


Simulation result: sample size

ProxFQE is the method whose error actually decays with data.

MSE is shown on a $\log_2$ scale.

n up

Across 20% to 80% missingness, the teal ProxFQE curve keeps moving down.

bias floors

Naive, IPW, imputation, and SCOPE often flatten because they target the observed reward process.

theory link

This is the empirical footprint of the $n^{-\alpha_{\min}/(2\alpha_{\min}+1)}$ term.


Simulation result: horizon

Longer horizons hurt everyone, but the bridge correction remains stable.

The Bellman recursion is doing real work.

T up

MSE grows as the number of stages increases, matching the finite-horizon bound.

variance

IPW can blow up when horizon length and poor overlap compound.

MNAR cost

The method pays for bridge estimation, but avoids persistent selection bias.


MIMIC-III sepsis application

A clinical stress test with high-dimensional states and discrete treatments.

Oracle FQE uses fully observed rewards as reference.

patients: 13,943 ICU stays

horizon: $T=10$, 4-hour windows

state: 48 clinical features

actions: 25 fluid x vasopressor bins

reward: SOFA improvement


MIMIC-III result

ProxFQE stays closest to the fully observed oracle.

SCOPE is omitted from the figure because its estimates were degenerate.

[Figure: Sepsis OPE values and absolute bias versus oracle]

20%

Absolute bias: ProxFQE 0.05 vs Naive 1.59 and Impute 3.27.

80%

Absolute bias: ProxFQE 2.66 vs Naive 9.44 and Impute 17.70.

clinical read

If rewards from healthier patients are more likely to be recorded, MNAR-blind methods overestimate policy value.


Conclusion

Takeaways for statisticians and ML researchers.

What to remember after the talk.

01

MNAR rewards are an OPE problem, not only a missing-data nuisance.

Reward-dependent observation changes the Bellman targets and can mis-rank policies.

02

MDP structure supplies a natural shadow variable.

The next state is already in logged transitions and can identify the full-data reward mean.

03

Bridge + FQE gives a practical estimator with theory.

Simulation and sepsis experiments support the predicted sample-size, horizon, and MNAR-bias behavior.


Thank you

Questions?

Off-policy evaluation with reward MNAR is identifiable when the next state can serve as a valid shadow variable.

website

rui-miao.github.io

email

rui.miao[@]utdallas[DOT]edu

code

github.com/naivlab/ShadOPE

paper

OPE for Missingness-Aware Policies in MDPs with Rewards MNAR
ICML '26, to be released in July

Rui Miao | UT Dallas
