Offline RL with selective outcome recording

Off-Policy Evaluation for Missingness-Aware Policies in MDPs with Rewards Missing Not at Random

Ziheng Wei, Annie Qu, Rui Miao
University of Michigan | UC Santa Barbara | UT Dallas
The practical failure mode

The reward we most need may be the one not recorded.

Selection bias can survive infinite data.

Healthcare

A patient misses a follow-up questionnaire because symptoms worsened. Labs and vitals keep arriving, but the quality-of-life reward is absent.


Business

A high-value purchase crosses devices or triggers manual review. The campaign sees the click, but attribution loses the conversion reward.


Offline RL

The batch data are all we have. If missingness depends on reward, ordinary FQE learns the value of the recorded world, not the real one.

MDP basics

An MDP is a compact language for sequential decisions.

State, action, transition, reward.
[Diagram: state \(S_t\) → action \(A_t\) → next state \(S_{t+1}\) via the transition, with reward \(R_t\) or \(r_t(s,a,s')\); repeat for \(t=1,\ldots,T\).]

Finite horizon

$$\mathcal{M}=(\mathcal S,\mathcal A,\mathcal P,r,T)$$

Dynamics

The environment samples \(S_{t+1}\) from \(P_t(\cdot\mid S_t,A_t)\).

Reward

The signal \(R_t\) scores the transition; in this paper that signal may be missing not at random.

MDP basics

A policy induces a value through Bellman recursion.

This is the object OPE estimates.
[Diagram: \(S_t\) → \(A_t\) (policy) → \(S_{t+1}\) (transition). Policy value: expected future reward under the target policy.]

Policy

$$\pi_t(a\mid s)=P(A_t=a\mid S_t=s)$$

Bellman equations

$$Q_t^\pi(s,a)=\mathbb E[R_t+V_{t+1}^\pi(S_{t+1})\mid S_t=s,A_t=a]$$
$$V_t^\pi(s)=\sum_a\pi_t(a\mid s)Q_t^\pi(s,a)$$
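
The Bellman recursion above can be sketched as backward induction on a small tabular MDP. This is a minimal illustration, assuming a stationary transition array `P`, an expected-reward table `r`, and a stage-wise policy `pi`; these names and the toy numbers are hypothetical and not taken from the paper.

```python
import numpy as np

def policy_value(P, r, pi, T):
    """Finite-horizon policy evaluation by backward induction.

    P:  (S, A, S) transition probabilities P(s' | s, a), assumed stationary here.
    r:  (S, A) expected one-step reward.
    pi: (T, S, A) target policy pi_t(a | s).
    Returns V_1 as a length-S array.
    """
    V = np.zeros(P.shape[0])                 # V_{T+1} = 0
    for t in reversed(range(T)):
        Q = r + P @ V                        # Q_t(s,a) = E[R_t + V_{t+1}(S_{t+1}) | s, a]
        V = (pi[t] * Q).sum(axis=1)          # V_t(s) = sum_a pi_t(a|s) Q_t(s,a)
    return V

# Toy two-state, two-action example (hypothetical numbers).
P = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[1.0, 0.0], [0.0, 1.0]]])     # action 0 -> state 0, action 1 -> state 1
r = np.array([[0.0, 1.0],
              [0.0, 2.0]])
pi = np.full((2, 2, 2), 0.5)                 # uniform policy at both stages
V1 = policy_value(P, r, pi, T=2)
```

In this toy case the last stage gives \(V_2=(0.5,\,1.0)\) and the first stage gives \(V_1=(1.25,\,1.75)\), which can be checked by hand from the two equations.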
Why OPE is the right battleground

We want to evaluate before we intervene.

OPE

Observed batch data

$$\mathcal{D} \sim \pi^b$$

Historical trajectories collected under a behavior policy: standard care, old marketing rules, or the current product policy.

Target policy

$$V(\pi) = \mathbb{E}_{\pi}\!\left[\sum_{t=1}^{T} R_t\right]$$

The new policy may react to whether the previous outcome was observed, and OPE estimates its value without risky online exploration.

Standard OPE assumes rewards are fully observed, or that their missingness is ignorable after conditioning on observed variables. MNAR rewards violate that assumption at the reward-modeling step.

The missingness hierarchy

MNAR means the observation process looks at the outcome.

Let \(O_t=1\) mean \(R_t\) is recorded.

MCAR

reward R

Observation is unrelated to rewards and covariates.

MAR

state/action

Observation depends on observed variables such as state and action.

MNAR

reward R

Observation depends on the possibly unobserved reward itself.

$$\mathbb{E}[R_t\mid S_t,A_t,O_t=1] \ne \mathbb{E}[R_t\mid S_t,A_t]$$
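
A worked illustration with hypothetical numbers (not from the paper): fix \((s,a)\), let \(R_t\in\{0,1\}\) with \(P(R_t=1\mid s,a)=0.5\), and suppose \(P(O_t=1\mid s,a,R_t=1)=0.9\) while \(P(O_t=1\mid s,a,R_t=0)=0.3\). Bayes' rule then gives

$$\mathbb{E}[R_t\mid s,a,O_t=1]=\frac{0.5\times 0.9}{0.5\times 0.9+0.5\times 0.3}=0.75\ne 0.5=\mathbb{E}[R_t\mid s,a]$$

If the observation probability did not vary with \(R_t\), as under MCAR or MAR, the two factors would cancel and the conditional means would coincide.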
Healthcare example

The chart continues after the reward disappears.

A reward can be missing while future state is observed.
visit t
State \(S_t\)

Viral load, SOFA score, vitals, treatment history.

decision
Action \(A_t\)

Adjust dosage, fluids, vasopressors, follow-up intensity.

outcome window
Reward \(R_t\)

Quality-of-life score or change in SOFA.

more severe -> less likely recorded

Clinical reality: patients with worsening symptoms skip surveys or follow-ups, yet later labs, vitals, and encounters remain in the EHR.

Statistical opportunity: the next clinical state can carry information about the unrecorded reward without directly causing observation.

Business example

The most valuable conversions are also the easiest to lose.

Missing attribution is not neutral.

Logged journey

[Diagram: mobile click → desktop purchase → reward unlinked; the future state still records downstream behavior.]

What the dataset learns

The campaign appears less profitable when high-value rewards are systematically under-attributed.

What we need to recover

$$\bar r_t(s,a)=\mathbb{E}[R_t\mid S_t=s,A_t=a]$$
Where standard FQE breaks

Conditioning on observation changes the regression target.

The error is bias, not just noise.

Ordinary FQE target on recorded rewards

$$\widehat Q_t \approx \mathbb{E}\!\left[R_t+\widehat V_{t+1}(S_{t+1})\mid S_t,A_t,O_t=1\right]$$

MNAR selection shift

$$\mathbb{E}[R_t\mid S_t,A_t,O_t=1]-\mathbb{E}[R_t\mid S_t,A_t]\ne 0$$

Bellman recursion propagates this one-step distortion over time. More data can estimate the wrong conditional mean very precisely.
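
The claim that more data estimates the wrong mean more precisely can be checked in a small Monte Carlo sketch. The mechanism below is hypothetical: a binary reward with true mean 0.5 that is recorded with probability 0.9 when \(R=1\) and 0.3 when \(R=0\), so the recorded-reward mean is 0.75 by Bayes' rule.

```python
import numpy as np

def naive_mean_recorded(n, rng):
    """Mean of recorded rewards under a hypothetical MNAR mechanism."""
    r = rng.binomial(1, 0.5, size=n)         # true E[R] = 0.5
    p_obs = np.where(r == 1, 0.9, 0.3)       # observation depends on R itself
    o = rng.random(n) < p_obs                # O = 1 means the reward is recorded
    return r[o].mean()

rng = np.random.default_rng(0)
estimates = {n: naive_mean_recorded(n, rng) for n in (10**3, 10**5, 10**6)}
```

As `n` grows, the estimates should concentrate near 0.75 rather than 0.5: the sampling noise shrinks while the selection shift stays fixed.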

Formal setup

Finite-horizon MDP with reward-dependent observation.

Target policies may react to missingness.

Observed trajectory

$$\tau_i=\{S_{t,i},A_{t,i},O_{t,i},R^{obs}_{t,i},S_{t+1,i}\}_{t=1}^{T}$$
$$R_t^{obs}=O_tR_t$$

MNAR propensity

$$e_t(s,a,r)=P(O_t=1\mid S_t=s,A_t=a,R_t=r)$$

Missingness-aware target policy

$$\pi_t(a\mid s,o_-),\quad \widetilde S_t=(S_t,O_{t-1}),\quad V(\pi)=\mathbb{E}[V_1^{\pi}(S_1,0)]$$
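
A minimal sketch of the bookkeeping this setup implies, assuming scalar states stored in an array `S` and indicators in `O` (both names hypothetical). It uses the convention \(O_0=0\), which matches \(V(\pi)=\mathbb{E}[V_1^{\pi}(S_1,0)]\).

```python
import numpy as np

def augment_with_previous_missingness(S, O):
    """Build the augmented state (S_t, O_{t-1}) for t = 1..T.

    S: (n, T) array of states (scalar states for simplicity).
    O: (n, T) array of 0/1 observation indicators.
    Returns an (n, T, 2) array whose last axis is (S_t, O_{t-1}), with O_0 = 0.
    """
    first = np.zeros((O.shape[0], 1), dtype=O.dtype)
    O_prev = np.concatenate([first, O[:, :-1]], axis=1)
    return np.stack([S, O_prev], axis=-1)

def observed_reward(R, O):
    """R_obs = O * R: a missing reward is stored as 0 and flagged by O = 0."""
    return O * R

# One trajectory with T = 3 (hypothetical values).
S = np.array([[10, 11, 12]])
O = np.array([[1, 0, 1]])
aug = augment_with_previous_missingness(S, O)
r_obs = observed_reward(np.array([[2.0, 3.0, 4.0]]), O)
```

Here `aug` pairs each state with the previous indicator, so the second stage sees \(O_1=1\) and the third sees \(O_2=0\).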
Data-generating structure

The target policy can use missingness, but the missingness is generated by the reward.

Black: MDP. Red: MNAR. Blue: target policy.
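[Diagram: nodes \(S_t, A_t, S_{t+1}, A_{t+1}, S_{t+2}\), with rewards \(R_t, R_{t+1}\) and observation indicators \(O_t, O_{t+1}\).] Behavior policy: actions depend on the current state. Target policy: \(O_t\) may guide the next action.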
Identification insight

Use the next state as an endogenous shadow variable.

No extra proxy measurement is required.
exclusion

\(S_{t+1}\) carries no extra information about observation after conditioning on \(R_t,S_t,A_t\).

$$S_{t+1}\perp O_t\mid R_t,S_t,A_t$$
relevance

\(S_{t+1}\) remains informative about the reward on the observed subset.

$$S_{t+1}\not\perp R_t\mid S_t,A_t,O_t=1$$
Bridge function

Turn a shadow-variable restriction into a recoverable reward.

The bridge solves a conditional moment equation.

Bridge condition

$$\mathbb{E}\!\left[b_t(S_t,A_t,S_{t+1})\mid R_t,S_t,A_t\right]=R_t$$

Recovered reward mean

$$\mathbb{E}\!\left[b_t(s,a,S_{t+1})\mid S_t=s,A_t=a\right]=\mathbb{E}[R_t\mid S_t=s,A_t=a]$$

Why observed rewards are enough to learn it

Recall A1

No future dependence: \(O_t\) depends on current \(S_t,A_t,R_t\), not future states.

Recall A2

Exclusion: \(S_{t+1}\perp O_t\mid R_t,S_t,A_t\).

$$P(S_{t+1}\mid R_t,S_t,A_t,O_t=1)=P(S_{t+1}\mid R_t,S_t,A_t)$$
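
In a fully discrete case the bridge condition is a small linear system, which makes the idea concrete. The sketch below fixes \((s,a)\) and assumes a binary reward and a binary next state; the probabilities are hypothetical. By exclusion, the conditional law of \(S_{t+1}\) given \(R_t\) is the same on the recorded subset, so the matrix could be estimated from rows with \(O_t=1\) only.

```python
import numpy as np

# Hypothetical conditional law of the next state given the reward, at a fixed (s, a).
# Row r, column s': P(S' = s' | R = r). By exclusion, the same law holds on O = 1.
P_next_given_r = np.array([[0.8, 0.2],
                           [0.3, 0.7]])
r_values = np.array([0.0, 1.0])

# Bridge condition: sum_{s'} b(s') P(s' | r) = r for each value r.
b = np.linalg.solve(P_next_given_r, r_values)

# Recovered reward mean: average b(S') over the marginal law of S'.
p_r = np.array([0.5, 0.5])                   # P(R = r); used here only to check the identity
p_next = p_r @ P_next_given_r                # P(S' = s')
recovered_mean = p_next @ b
```

The system is solvable because the two rows differ, which is the discrete analogue of relevance (and of completeness in general). Here `b` is \((-0.4,\,1.6)\) and `recovered_mean` equals the full-data mean 0.5.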
Identification theorem

Under positivity and completeness, the policy value is identified.

The missingness model is not explicitly estimated.
Step 1

Learn \(b_t\)

Fit the bridge from \(\{i:O_{t,i}=1\}\) using \(S_{t+1}\) as the shadow variable.

Step 2

Replace missing rewards

$$R^{\mathrm{rec}}_t=R^{\mathrm{obs}}_t+(1-O_t)b_t(S_t,A_t,S_{t+1})$$
Step 3

Run Bellman recursion

Evaluate \(\pi_t(a\mid s,o_-)\) on the augmented state \((S_t,O_{t-1})\).

Key identity for FQE: the Bellman regression uses \(R^{\mathrm{rec}}_t\), with \(\mathbb{E}[R^{\mathrm{rec}}_t\mid S_t,A_t]=\mathbb{E}[R_t\mid S_t,A_t]=\bar r_t(S_t,A_t)\).
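
One way to see the key identity, using only \(R^{\mathrm{obs}}_t=O_tR_t\), the exclusion restriction, and the bridge condition (a sketch of the argument, not the paper's proof):

$$\begin{aligned}
\mathbb{E}[R^{\mathrm{rec}}_t\mid S_t,A_t]
&=\mathbb{E}\big[O_tR_t+(1-O_t)\,\mathbb{E}[b_t(S_t,A_t,S_{t+1})\mid R_t,S_t,A_t,O_t]\,\big|\,S_t,A_t\big]\\
&=\mathbb{E}\big[O_tR_t+(1-O_t)\,\mathbb{E}[b_t(S_t,A_t,S_{t+1})\mid R_t,S_t,A_t]\,\big|\,S_t,A_t\big]\\
&=\mathbb{E}[O_tR_t+(1-O_t)R_t\mid S_t,A_t]=\mathbb{E}[R_t\mid S_t,A_t]
\end{aligned}$$

The first line is the tower property, the second uses \(S_{t+1}\perp O_t\mid R_t,S_t,A_t\), and the third applies the bridge condition.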

Estimation

Bridge fitting is a conditional moment problem.

Min-max avoids double sampling.

Population moment on observed rewards

$$\mathbb{E}\!\left[b_t(S_t,A_t,S_{t+1})-R_t\mid R_t,S_t,A_t,O_t=1\right]=0$$

Sample saddle-point objective

$$ \begin{aligned} \min_{b_t\in\mathcal{B}^{(t)}}\sup_{g_t\in\mathcal{G}^{(t)}}\;& \frac{1}{n_t}\sum_{i\in\mathcal I_t^{obs}} \{(b_t(S_{t,i},A_{t,i},S_{t+1,i})-R_{t,i})g_t(R_{t,i},S_{t,i},A_{t,i})\}\\ &+\lambda_b\mathcal P_B(b_t)-\lambda_g\mathcal P_G(g_t) \end{aligned} $$
1. Only observed rewards are used to fit the bridge.
2. The learned bridge predicts at both observed and missing transitions.
3. FQE then operates on recovered rewards.
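
For intuition, the saddle-point objective has a closed form when both classes are linear with ridge penalties. This is a simplifying assumption for illustration: the function classes and penalties used in the paper are not specified on these slides, and the feature maps, the helper names, and the toy data-generating mechanism below are hypothetical.

```python
import numpy as np

def fit_linear_bridge(Phi, Psi, R, lam_b=1e-8, lam_g=1.0):
    """Closed-form saddle point for linear b = Phi @ theta and g = Psi @ w.

    Phi: (n, p) features of (S_t, A_t, S_{t+1}) on recorded-reward rows.
    Psi: (n, q) features of (R_t, S_t, A_t) on the same rows.
    R:   (n,) recorded rewards.
    With penalties lam_b * ||theta||^2 and lam_g * ||w||^2, the inner sup over w
    equals ||A theta - c||^2 / (4 lam_g), so theta solves a ridge-type system.
    """
    n = len(R)
    A = Psi.T @ Phi / n
    c = Psi.T @ R / n
    p = Phi.shape[1]
    return np.linalg.solve(A.T @ A + 4 * lam_g * lam_b * np.eye(p), A.T @ c)

def one_hot(x):
    """One-hot encode a 0/1 integer array."""
    return np.eye(2)[x]

# Toy MNAR data at a single (s, a): binary reward, binary next state.
rng = np.random.default_rng(0)
n = 200_000
r = rng.binomial(1, 0.5, size=n)                      # true mean 0.5
s_next = rng.binomial(1, np.where(r == 1, 0.7, 0.2))  # next state depends on reward
o = rng.random(n) < np.where(r == 1, 0.9, 0.3)        # recording depends on reward

# Step 1: fit the bridge on recorded rewards only.
theta = fit_linear_bridge(one_hot(s_next[o]), one_hot(r[o]), r[o].astype(float))
# Step 2: keep recorded rewards, impute b(S') where the reward is missing.
r_rec = np.where(o, r, one_hot(s_next) @ theta)
naive, recovered = r[o].mean(), r_rec.mean()
```

In this toy setting the naive recorded mean should sit near 0.75, while the mean of the recovered rewards should sit near the true value 0.5.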

Proximal FQE

Backward induction with bridge-imputed rewards.

Same FQE rhythm, corrected reward target.
1. Loop: work backward over stages \(t=T,T-1,\ldots,1\); initialize \(\widehat{V}_{T+1}^{\pi}\equiv 0\).
2. Bridge: fit the bridge function \(\widehat b_t\) using observed rewards and the decoupled min-max objective.
3. Impute reward: \(\widetilde R^{\,rec}_{t,i}=R^{obs}_{t,i}+(1-O_{t,i})\widehat b_t(S_{t,i},A_{t,i},S_{t+1,i})\).
4. Bellman target: \(y_{t,i}=\widetilde R^{\,rec}_{t,i}+\widehat{V}_{t+1}^{\pi}(S_{t+1,i},O_{t,i})\).
5. Fit Q: regress \(y_{t,i}\) on \((S_{t,i},A_{t,i})\) to obtain \(\widehat Q_t\).
6. Update V: \(\widehat{V}_t^{\pi}(s,o_-)=\sum_a\pi_t(a\mid s,o_-)\widehat Q_t(s,a)\); finally, average \(\widehat{V}_1^\pi(S_{1,i},0)\) over trajectories.
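
The backward pass can be sketched for a tabular problem as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the bridge table `b_hat` is taken as already fitted, the Q regression is replaced by cell means, unvisited \((s,a)\) cells are set to 0 for simplicity, and all array names and toy values are hypothetical.

```python
import numpy as np

def prox_fqe_tabular(S, A, O, R_obs, b_hat, pi, n_states, n_actions):
    """Backward induction with bridge-imputed rewards (tabular sketch).

    S: (n, T+1) int states; A, O: (n, T) int actions and 0/1 indicators.
    R_obs: (n, T) float, equal to O * R.
    b_hat: (T, n_states, n_actions, n_states) bridge values b_t(s, a, s').
    pi: (T, n_states, 2, n_actions) target policy pi_t(a | s, o_prev).
    """
    n, T = A.shape
    V_next = np.zeros((n_states, 2))                     # V_{T+1} = 0
    for t in range(T - 1, -1, -1):
        s, a, s_next, o = S[:, t], A[:, t], S[:, t + 1], O[:, t]
        r_rec = R_obs[:, t] + (1 - o) * b_hat[t, s, a, s_next]   # impute reward
        y = r_rec + V_next[s_next, o]                            # Bellman target
        sums = np.zeros((n_states, n_actions))
        counts = np.zeros((n_states, n_actions))
        np.add.at(sums, (s, a), y)
        np.add.at(counts, (s, a), 1.0)
        # Q regression as cell means; unvisited cells stay at 0 (simplification).
        Q = np.divide(sums, counts, out=np.zeros_like(sums), where=counts > 0)
        V_next = np.einsum("soa,sa->so", pi[t], Q)       # V_t(s, o_prev)
    return V_next[S[:, 0], 0].mean()                     # average V_1(S_1, 0)

# Tiny one-stage example: four transitions from state 0, two with missing rewards.
S = np.array([[0, 0], [0, 1], [0, 0], [0, 1]])
A = np.array([[0], [0], [1], [1]])
O = np.array([[1], [0], [1], [0]])
R_obs = np.array([[1.0], [0.0], [3.0], [0.0]])
b_hat = np.zeros((1, 2, 2, 2))
b_hat[0, 0, 0, 1] = 2.0                                  # imputes the missing reward for action 0
b_hat[0, 0, 1, 1] = 5.0                                  # imputes the missing reward for action 1
pi = np.full((1, 2, 2, 2), 0.5)                          # uniform target policy
value = prox_fqe_tabular(S, A, O, R_obs, b_hat, pi, n_states=2, n_actions=2)
```

In the toy example the recovered rewards are 1, 2, 3, 5, so the cell means are 1.5 and 4.0 and the uniform policy has value 2.75.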
Theory, in one slide

The rates look like nonparametric FQE plus the price of an inverse problem.

A sketch, not a proof.

Bridge estimation

$$\|\widehat b_t-b_t^*\|_2 \lesssim \tau_t\delta_t$$

\(\tau_t\) measures ill-posedness of the conditional expectation operator.

Policy value estimation

$$ \begin{aligned} |\widehat{V}(\pi)-V(\pi)| &\lesssim K\tau_{\max}T^2\sqrt{\log(T/\zeta)}\\ &\quad\times n^{-\frac{\alpha_{\min}}{2\alpha_{\min}+1}}\log n \end{aligned} $$
\(n\): more trajectories reduce error
\(T\): Bellman propagation costs horizon length
\(K\): policy mismatch control
\(\tau\): shadow-variable inverse stability
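
To get a feel for the \(n\)-dependence, the factor \(n^{-\alpha_{\min}/(2\alpha_{\min}+1)}\log n\) can be evaluated directly. The smoothness value \(\alpha=2\) below is a hypothetical choice for illustration; constants and the other terms of the bound are ignored.

```python
import math

def rate_factor(n, alpha):
    """n-dependent factor n^{-alpha/(2 alpha + 1)} * log(n) from the value bound."""
    return n ** (-alpha / (2 * alpha + 1)) * math.log(n)

# Hypothetical smoothness alpha = 2 gives the exponent 2/5.
factors = {n: rate_factor(n, alpha=2.0) for n in (10**3, 10**4, 10**5)}
```

With this exponent, each tenfold increase in \(n\) shrinks the polynomial part by \(10^{-0.4}\approx 0.4\), partly offset by the growing \(\log n\) term, so the factor roughly halves per decade of data.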
What should the experiments verify?

The bound gives three empirical predictions.

Simulation and MIMIC check the same story.
Bound being checked empirically
$$ |\widehat{V}(\pi)-V(\pi)| \lesssim K\tau_{\max}T^2\sqrt{\log(T/\zeta)}\, n^{-\frac{\alpha_{\min}}{2\alpha_{\min}+1}}\log n $$
prediction 1

More \(n\) should help ProxFQE.

The bridge estimator improves with sample size, so MSE should decay rather than plateau.

prediction 2

Longer horizons are harder.

Every step compounds estimation error through Bellman recursion, visible as increasing MSE in \(T\).

prediction 3

MNAR-blind methods keep bias.

Naive regression, ordinary imputation, and unstable IPW should not reliably recover the oracle target.

Simulation design

A controlled MNAR-MDP where the reward drives observation.

50 seeds, five OPE methods.
[Figure: simulated reward, action, and missingness overview.]
2D

State \(S_t=(S_{t,1},S_{t,2})\), binary action \(A_t\in\{-1,+1\}\).

MNAR

\(P(O_t=1\mid S_t,A_t,R_t)=\operatorname{expit}(c_0-0.1A_t+0.2(1,-2)^\top S_t+2.5R_t)\).

policy

The target policy depends on \(S_t\) and previous missingness \(O_{t-1}\).
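
The observation model can be sketched directly from the formula on this slide. The slide does not state how \(c_0\) is chosen, so the bisection below, which tunes \(c_0\) to hit a target missing rate, is an assumption; the state, action, and reward distributions are also hypothetical placeholders.

```python
import numpy as np

def expit(x):
    """Logistic function."""
    return 1.0 / (1.0 + np.exp(-x))

def obs_prob(S, A, R, c0):
    """P(O_t = 1 | S_t, A_t, R_t) from the slide; S has shape (n, 2)."""
    return expit(c0 - 0.1 * A + 0.2 * (S @ np.array([1.0, -2.0])) + 2.5 * R)

def calibrate_c0(S, A, R, target_missing, lo=-20.0, hi=20.0, iters=60):
    """Bisection on c0; the average missing rate is decreasing in c0."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if 1.0 - obs_prob(S, A, R, mid).mean() > target_missing:
            lo = mid                         # too much missingness: raise c0
        else:
            hi = mid
    return 0.5 * (lo + hi)

rng = np.random.default_rng(0)
n = 10_000
S = rng.normal(size=(n, 2))                  # hypothetical state law
A = rng.choice([-1, 1], size=n)              # hypothetical behavior policy
R = rng.normal(size=n)                       # hypothetical reward law
c0 = calibrate_c0(S, A, R, target_missing=0.4)
missing_rate = 1.0 - obs_prob(S, A, R, c0).mean()
```

The coefficient 2.5 on \(R_t\) is what makes the mechanism MNAR: holding state and action fixed, larger rewards are more likely to be recorded.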

Simulation result: sample size

ProxFQE is the method whose error actually decays with data.

MSE shown on log2 scale.
[Figure: MSE versus sample size, by missingness rate.]
n up

Across 20% to 80% missingness, the teal ProxFQE curve keeps moving down.

bias floors

Naive, IPW, imputation, and SCOPE often flatten because they target the observed reward process.

theory link

This is the empirical footprint of the \(n^{-\alpha/(2\alpha+1)}\) term.

Simulation result: horizon

Longer horizons hurt everyone, but the bridge correction remains stable.

The Bellman recursion is doing real work.
[Figure: MSE versus horizon, by missingness rate.]
T up

MSE grows as the number of stages increases, matching the finite-horizon bound.

variance

IPW can blow up when horizon length and poor overlap compound.

MNAR cost

The method pays for bridge estimation, but avoids persistent selection bias.

MIMIC-III sepsis application

A clinical stress test with high-dimensional states and discrete treatments.

Oracle FQE uses fully observed rewards as reference.
patients: 13,943 ICU stays
horizon: T=10, 4-hour windows
state: 48 clinical features
actions: 25 fluid × vasopressor bins
reward: SOFA improvement
Pipeline diagram (arrows labeled: policy, logged data):
training: Double DQN learns a candidate policy from complete rewards
target policy: dose reduction, a conservative treatment rule that can use \(O_{t-1}\)
evaluation: MNAR OPE under 20%, 40%, 60%, 80% synthetic reward missingness
MIMIC-III result

ProxFQE stays closest to the fully observed oracle.

SCOPE omitted in the figure due to degenerate estimates.
[Figure: sepsis OPE values and absolute bias versus the oracle.]
20%

Absolute bias: ProxFQE 0.05 vs Naive 1.59 and Impute 3.27.

80%

Absolute bias: ProxFQE 2.66 vs Naive 9.44 and Impute 17.70.

clinical read

If rewards reflecting better health are more likely to be recorded, MNAR-blind methods overestimate policy value.

Conclusion

Takeaways for statisticians and ML researchers.

What to remember after the talk.
01

MNAR rewards are an OPE problem, not only a missing-data nuisance.

Reward-dependent observation changes the Bellman targets and can mis-rank policies.

02

MDP structure supplies a natural shadow variable.

The next state is already in logged transitions and can identify the full-data reward mean.

03

Bridge + FQE gives a practical estimator with theory.

Simulation and sepsis experiments support the predicted sample-size, horizon, and MNAR-bias behavior.

Thank you

Questions?

Off-policy evaluation with reward MNAR is identifiable when the next state can serve as a valid shadow variable.

email

rui.miao[@]utdallas[DOT]edu

paper

OPE for Missingness-Aware Policies in MDPs with Rewards MNAR
ICML '26, to be released in July

Rui Miao | UT Dallas