Off-Policy Evaluation for Missingness-Aware Policies in MDPs with Rewards Missing Not at Random
The reward we most need may be the one not recorded.
Healthcare
A patient misses a follow-up questionnaire because symptoms worsened. Labs and vitals keep arriving, but the quality-of-life reward is absent.
Business
A high-value purchase crosses devices or triggers manual review. The campaign sees the click, but attribution loses the conversion reward.
Offline RL
The batch data are all we have. If missingness depends on reward, ordinary FQE learns the value of the recorded world, not the real one.
An MDP is a compact language for sequential decisions.
Finite horizon
Dynamics
The environment samples \(S_{t+1}\) from \(P_t(\cdot\mid S_t,A_t)\).
Reward
The signal \(R_t\) scores the transition; in this paper that signal may be missing not at random.
A policy induces a value through Bellman recursion.
Policy
Bellman equations
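A minimal sketch of finite-horizon policy evaluation by backward induction, assuming a small tabular MDP with a known model. The function name `evaluate_policy` and the array layout are illustrative choices, not notation from the slides.

```python
import numpy as np

def evaluate_policy(P, r, pi):
    """Exact finite-horizon policy evaluation by backward induction.

    P:  (T, nS, nA, nS) transition probabilities P_t(s' | s, a)
    r:  (T, nS, nA) mean rewards
    pi: (T, nS, nA) policy probabilities pi_t(a | s)
    Returns V of shape (T + 1, nS) with V[T] = 0.
    """
    T, nS, nA = r.shape
    V = np.zeros((T + 1, nS))
    for t in reversed(range(T)):
        # Q_t(s, a) = r_t(s, a) + sum_{s'} P_t(s' | s, a) V_{t+1}(s')
        Q = r[t] + P[t] @ V[t + 1]
        # V_t(s) = sum_a pi_t(a | s) Q_t(s, a)
        V[t] = (pi[t] * Q).sum(axis=1)
    return V

# Hypothetical toy problem: every transition moves to state 1,
# and the policy always takes action 1.
T, nS, nA = 2, 2, 2
P = np.zeros((T, nS, nA, nS))
P[..., 1] = 1.0
r = np.tile(np.array([[0.0, 1.0], [2.0, 3.0]]), (T, 1, 1))
pi = np.zeros((T, nS, nA))
pi[..., 1] = 1.0

V = evaluate_policy(P, r, pi)
```

In this toy problem the last-stage values are `[1, 3]` and the first-stage values are `[4, 6]`, since every path collects the state-1 reward of 3 at the final stage.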
We want to evaluate before we intervene.
Observed batch data
Historical trajectories collected under a behavior policy: standard care, old marketing rules, or the current product policy.
Target policy
The new policy may react to whether the previous outcome was observed, without needing risky online exploration.
Standard OPE assumes rewards are available or ignorable after conditioning. MNAR rewards violate that assumption at the reward-modeling step.
MNAR means the observation process looks at the outcome.
MCAR
Observation is unrelated to rewards and covariates.
MAR
Observation depends on observed variables such as state and action.
MNAR
Observation depends on the possibly unobserved reward itself.
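The three mechanisms can be contrasted in a small simulation. This is a sketch under assumed settings: the linear reward model, the logistic observation probabilities, and all coefficients are hypothetical. It regresses the reward on the state using recorded rows only, which is what an MNAR-blind reward model would do.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

def expit(x):
    return 1.0 / (1.0 + np.exp(-x))

S = rng.standard_normal(n)
R = S + rng.standard_normal(n)          # true regression: intercept 0, slope 1

masks = {
    "MCAR": rng.random(n) < 0.5,                # ignores S and R
    "MAR": rng.random(n) < expit(S),            # depends on the observed state only
    "MNAR": rng.random(n) < expit(2.5 * R),     # depends on the reward itself
}

intercepts = {}
for name, O in masks.items():
    # Least-squares fit of R on S using recorded rewards only.
    slope, intercept = np.polyfit(S[O], R[O], 1)
    intercepts[name] = intercept
    print(f"{name}: intercept={intercept:.3f}, slope={slope:.3f}")
```

Under MCAR and MAR the recorded-data regression still targets the true conditional mean. Under MNAR the fitted intercept moves away from zero, because conditioning on the state no longer removes the selection.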
The chart continues after the reward disappears.
State: viral load, SOFA score, vitals, treatment history.
Action: adjust dosage, fluids, vasopressors, follow-up intensity.
Reward: quality-of-life score or change in SOFA.
more severe -> less likely recorded
Clinical reality: patients with worsening symptoms skip surveys or follow-ups, yet later labs, vitals, and encounters remain in the EHR.
Statistical opportunity: the next clinical state can carry information about the unrecorded reward without directly causing observation.
The most valuable conversions are also the easiest to lose.
Logged journey
What the dataset learns
The campaign appears less profitable when high-value rewards are systematically under-attributed.
What we need to recover
Conditioning on observation changes the regression target.
Ordinary FQE target on recorded rewards
MNAR selection shift
Bellman recursion propagates this one-step distortion over time. More data can estimate the wrong conditional mean very precisely.
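One way to write the one-step distortion, in notation assumed here for illustration: let \(e_t(r,s,a)=\mathbb{P}(O_t=1\mid R_t=r,S_t=s,A_t=a)\) denote the observation propensity and \(\bar r_t(S_t,A_t)=\mathbb{E}[R_t\mid S_t,A_t]\). Bayes' rule then gives

\[
\mathbb{E}[R_t\mid S_t,A_t,O_t=1]
=\frac{\mathbb{E}[R_t\,e_t(R_t,S_t,A_t)\mid S_t,A_t]}{\mathbb{E}[e_t(R_t,S_t,A_t)\mid S_t,A_t]}
=\bar r_t(S_t,A_t)+\frac{\operatorname{Cov}\bigl(R_t,\,e_t(R_t,S_t,A_t)\mid S_t,A_t\bigr)}{\mathbb{P}(O_t=1\mid S_t,A_t)}.
\]

When \(e_t\) does not depend on \(R_t\) (MCAR or MAR), the covariance term is zero and the recorded-reward regression is unbiased. When larger rewards are more likely to be recorded, the covariance is positive and the target is shifted upward, regardless of sample size.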
Finite-horizon MDP with reward-dependent observation.
Observed trajectory
MNAR propensity
Missingness-aware target policy
The target policy can use missingness, but the missingness is generated by the reward.
Use the next state as an endogenous shadow variable.
\(S_{t+1}\) carries no extra information about observation after conditioning on \(R_t,S_t,A_t\).
\(S_{t+1}\) remains informative about the reward on the observed subset.
Turn a shadow-variable restriction into a recoverable reward.
Bridge condition
Recovered reward mean
Why observed rewards are enough to learn it
No future dependence: \(O_t\) depends on current \(S_t,A_t,R_t\), not future states.
Exclusion: \(S_{t+1}\perp O_t\mid R_t,S_t,A_t\).
Under positivity and completeness, the policy value is identified.
Learn \(b_t\)
Fit the bridge from \(\{i:O_{t,i}=1\}\) using \(S_{t+1}\) as the shadow variable.
Replace missing rewards
Run Bellman recursion
Evaluate \(\pi_t(a\mid s,o_-)\) on the augmented state \((S_t,O_{t-1})\).
Key identity for FQE: the Bellman regression uses \(R^{\mathrm{rec}}_t\), with \(\mathbb{E}[R^{\mathrm{rec}}_t\mid S_t,A_t]=\mathbb{E}[R_t\mid S_t,A_t]=\bar r_t(S_t,A_t)\).
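The identity can be checked numerically in a toy setting. The sketch below makes several assumptions that are not stated in this text: the bridge is taken to be a function of the next state satisfying \(\mathbb{E}[b(S_{t+1})\mid R_t,O_t=1]=R_t\) within a single \((s,a)\) cell, the recovered reward is taken to be \(R^{\mathrm{rec}}_t=O_tR_t+(1-O_t)\,b(S_{t+1})\), and the reward and next state are both binary so the bridge solves a 2x2 linear system. All probabilities are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Hypothetical single (s, a) cell with binary reward and binary next state.
R = (rng.random(n) < 0.6).astype(int)                        # true mean 0.6
S_next = (rng.random(n) < np.where(R == 1, 0.8, 0.3)).astype(int)
# MNAR: observation depends on R only, so S_next is excluded given R.
O = rng.random(n) < np.where(R == 1, 0.9, 0.4)

# Fit on observed rows only: M[r, s'] = P(S_next = s' | R = r, O = 1).
M = np.array([[np.mean(S_next[O & (R == r)] == s) for s in (0, 1)]
              for r in (0, 1)])

# Assumed bridge condition: sum_{s'} M[r, s'] * b[s'] = r for r in {0, 1}.
b = np.linalg.solve(M, np.array([0.0, 1.0]))

# Keep observed rewards; use the bridge where the reward is missing.
# (R is never read on rows with O == False.)
R_rec = np.where(O, R, b[S_next])

naive_mean = R[O].mean()          # mean of recorded rewards only
recovered_mean = R_rec.mean()     # mean of recovered rewards
print(naive_mean, recovered_mean)
```

In this design the recorded-reward mean overshoots the true mean of 0.6, while the recovered-reward mean lands close to it. The individual bridge values need not lie in the reward range; only their conditional average matters.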
Bridge fitting is a conditional moment problem.
Population moment on observed rewards
Sample saddle-point objective
Only observed rewards are used to fit the bridge.
The learned bridge predicts at both observed and missing transitions.
FQE then operates on recovered rewards.
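A sketch of bridge fitting as a moment problem, under stated assumptions. The moment is taken to be \(\mathbb{E}[O_t\,\{b(S_{t+1})-R_t\}\,f(R_t)]=0\) for test functions \(f\), and both the bridge and the test functions are restricted to linear classes with matching dimension. In that just-identified case the moment equations can be solved directly, which stands in here for the saddle-point objective; it is not the estimator from the slides. State and action are dropped, and the data-generating process is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

R = rng.standard_normal(n)                            # true mean 0
S_next = 2.0 * R + rng.standard_normal(n)             # shadow variable: noisy function of R
O = rng.random(n) < 1.0 / (1.0 + np.exp(-2.5 * R))    # MNAR: depends on R only

Phi = np.column_stack([np.ones(n), S_next])           # bridge basis phi(S_next)
Psi = np.column_stack([np.ones(n), R])                # test-function basis psi(R)

# Sample moment on observed rows only:
#   mean[ psi(R) * (phi(S_next)^T theta - R) ] = 0
G = Psi[O].T @ Phi[O] / O.sum()
g = Psi[O].T @ R[O] / O.sum()
theta = np.linalg.solve(G, g)

# The fitted bridge predicts at observed and missing transitions alike.
b_hat = Phi @ theta
R_rec = np.where(O, R, b_hat)

print("theta:", theta)
print("recorded mean:", R[O].mean(), "recovered mean:", R_rec.mean())
```

Here \(b(S_{t+1})=S_{t+1}/2\) satisfies the assumed bridge condition exactly, so the fitted slope should be near 0.5. Richer function classes would need regularization, which is where the ill-posedness of the inverse problem enters.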
Backward induction with bridge-imputed rewards.
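A tabular sketch of the backward-induction step, assuming the recovered rewards have already been computed. The function `tabular_fqe` is a hypothetical helper: it uses cell means as the regression step and assumes the state index already encodes any augmentation such as \((S_t,O_{t-1})\). The demo feeds it fully observed rewards from a made-up MDP to check the recursion itself.

```python
import numpy as np

def tabular_fqe(S, A, R_rec, S_next, pi, n_states, n_actions):
    """Backward-induction FQE with cell means as the regression step.

    S, A, S_next: integer arrays of shape (n, T); R_rec: float array (n, T).
    pi: target policy probabilities of shape (T, n_states, n_actions).
    Returns the estimated value averaged over logged initial states.
    """
    n, T = S.shape
    V_next = np.zeros(n_states)                       # value after the last stage is 0
    for t in reversed(range(T)):
        target = R_rec[:, t] + V_next[S_next[:, t]]   # Bellman regression target
        Q = np.zeros((n_states, n_actions))
        for s in range(n_states):
            for a in range(n_actions):
                cell = (S[:, t] == s) & (A[:, t] == a)
                if cell.any():
                    Q[s, a] = target[cell].mean()
        V_next = (pi[t] * Q).sum(axis=1)              # V_t(s) = sum_a pi_t(a|s) Q_t(s,a)
    return V_next[S[:, 0]].mean()

# Hypothetical two-state, two-action MDP logged under a uniform behavior policy.
rng = np.random.default_rng(0)
n, T, nS, nA = 20_000, 3, 2, 2
P1 = np.array([[0.2, 0.7], [0.5, 0.9]])               # P(next state = 1 | s, a)
r = np.array([[0.0, 1.0], [0.5, 2.0]])                # mean reward r(s, a)
pi = np.tile(np.array([[0.1, 0.9], [0.8, 0.2]]), (T, 1, 1))

S = np.zeros((n, T), dtype=int)
A = np.zeros((n, T), dtype=int)
S_next = np.zeros((n, T), dtype=int)
R = np.zeros((n, T))
s = rng.integers(0, nS, size=n)
for t in range(T):
    a = rng.integers(0, nA, size=n)
    s_next = (rng.random(n) < P1[s, a]).astype(int)
    S[:, t], A[:, t], S_next[:, t] = s, a, s_next
    R[:, t] = r[s, a] + 0.1 * rng.standard_normal(n)
    s = s_next

v_hat = tabular_fqe(S, A, R, S_next, pi, nS, nA)
print("estimated value:", v_hat)
```

With missing rewards, `R` in the final call would be replaced by the recovered reward array; the recursion itself is unchanged.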
The rates look like nonparametric FQE plus the price of an inverse problem.
Bridge estimation
\(\tau_t\) measures ill-posedness of the conditional expectation operator.
Policy value estimation
The bound gives three empirical predictions.
More \(n\) should help ProxFQE.
The bridge estimator improves with sample size, so MSE should decay rather than plateau.
Longer horizons are harder.
Every step compounds estimation error through Bellman recursion, visible as increasing MSE in \(T\).
MNAR-blind methods keep bias.
Naive regression, ordinary imputation, and unstable IPW should not reliably recover the oracle target.
A controlled MNAR-MDP where the reward drives observation.
State \(S_t=(S_{t,1},S_{t,2})\), binary action \(A_t\in\{-1,+1\}\).
\(P(O_t=1\mid S_t,A_t,R_t)=\operatorname{expit}(c_0-0.1A_t+0.2(1,-2)^\top S_t+2.5R_t)\).
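The observation model above can be coded directly. In the sketch below, `obs_prob` follows the stated formula, while `calibrate_c0` is a hypothetical helper that picks the intercept \(c_0\) by bisection to hit a target missing rate; the slides do not say how \(c_0\) is chosen. The state, action, and reward draws are placeholders, not the simulation's actual dynamics.

```python
import numpy as np

def expit(x):
    return 1.0 / (1.0 + np.exp(-x))

def obs_prob(S, A, R, c0):
    """Observation probability given state (n, 2), action (n,), reward (n,)."""
    return expit(c0 - 0.1 * A + 0.2 * (S @ np.array([1.0, -2.0])) + 2.5 * R)

def calibrate_c0(S, A, R, target_missing, lo=-20.0, hi=20.0, iters=60):
    """Bisection on c0 so the average missing probability matches the target."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        missing = 1.0 - obs_prob(S, A, R, mid).mean()
        if missing > target_missing:
            lo = mid          # too much missingness: raise c0
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Placeholder draws (assumed distributions, for illustration only).
rng = np.random.default_rng(0)
n = 50_000
S = rng.standard_normal((n, 2))
A = rng.choice([-1, 1], size=n)
R = rng.standard_normal(n)

c0 = calibrate_c0(S, A, R, target_missing=0.5)
O = rng.random(n) < obs_prob(S, A, R, c0)
print("c0:", c0, "empirical missing rate:", 1.0 - O.mean())
```

The coefficient 2.5 on \(R_t\) is what makes the mechanism MNAR: setting it to zero would leave a MAR mechanism that depends only on state and action.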
The target policy depends on \(S_t\) and previous missingness \(O_{t-1}\).
ProxFQE is the method whose error actually decays with data.
Across 20% to 80% missingness, the teal ProxFQE curve keeps moving down.
Naive, IPW, imputation, and SCOPE often flatten because they target the observed reward process.
This is the empirical footprint of the \(n^{-\alpha/(2\alpha+1)}\) term.
Longer horizons hurt everyone, but the bridge correction remains stable.
MSE grows as the number of stages increases, matching the finite-horizon bound.
IPW can blow up when horizon length and poor overlap compound.
The method pays for bridge estimation, but avoids persistent selection bias.
A clinical stress test with high-dimensional states and discrete treatments.
ProxFQE stays closest to the fully observed oracle.
Absolute bias: ProxFQE 0.05 vs Naive 1.59 and Impute 3.27.
Absolute bias: ProxFQE 2.66 vs Naive 9.44 and Impute 17.70.
If healthier rewards are recorded more often, MNAR-blind methods overestimate policy value.
Takeaways for statisticians and ML researchers.
MNAR rewards are an OPE problem, not only a missing-data nuisance.
Reward-dependent observation changes the Bellman targets and can mis-rank policies.
MDP structure supplies a natural shadow variable.
The next state is already in logged transitions and can identify the full-data reward mean.
Bridge + FQE gives a practical estimator with theory.
Simulation and sepsis experiments support the predicted sample-size, horizon, and MNAR-bias behavior.
Questions?
With MNAR rewards, the policy value is identifiable when the next state serves as a valid shadow variable.
rui.miao[@]utdallas[DOT]edu
OPE for Missingness-Aware Policies in MDPs with Rewards MNAR
ICML '26, to be released in July