Statistical Inference with M-Estimators on Adaptively Collected Data
1. Summary
This paper considers M-estimation for contextual bandits under adaptively collected data. Using particular adaptive weights, the resulting M-estimators can be used to construct asymptotically valid confidence regions for a variety of inferential targets.
2. Settings
We assume that the data we have after running a contextual bandit algorithm is comprised of contexts $\{X_t\}_{t=1}^T$, actions $\{A_t\}_{t=1}^T$, and primary outcomes $\{Y_t\}_{t=1}^T$. The horizon $T$ is deterministic and known. We use potential outcome notation and let $\{Y_t(a)\}_{a \in \mathcal{A}}$ denote the potential outcomes of the primary outcome and let $Y_t := Y_t(A_t)$ be the observed outcome. We assume a stochastic contextual bandit environment in which $\{X_t, Y_t(a) : a \in \mathcal{A}\} \overset{i.i.d.}{\sim} \mathcal{P}$ for $t \in [1 \colon T]$. We define the history $\mathcal{H}_t := \{(X_{t'}, A_{t'}, Y_{t'})\}_{t'=1}^{t}$ for $t \geq 1$ and $\mathcal{H}_0 := \emptyset$. Actions are selected according to policies $\{\pi_t\}_{t=1}^T$, which define action selection probabilities $\pi_t(a \mid x) := P(A_t = a \mid \mathcal{H}_{t-1}, X_t = x)$.
We assume that $\theta^*(\mathcal{P})$ is a conditionally maximizing value of the criterion $m_\theta(X_t, A_t, Y_t)$, i.e., for all $\theta \in \Theta$,
$$\mathbb{E}_{\mathcal{P}}\bigl[m_{\theta^*(\mathcal{P})}(X_t, A_t, Y_t) \mid X_t, A_t\bigr] \;\geq\; \mathbb{E}_{\mathcal{P}}\bigl[m_{\theta}(X_t, A_t, Y_t) \mid X_t, A_t\bigr].$$
The M-estimator is defined as
$$\hat{\theta}_T := \operatorname*{argmax}_{\theta \in \Theta} \; \frac{1}{T} \sum_{t=1}^{T} m_\theta(X_t, A_t, Y_t).$$
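As a concrete instance (my illustration of the condition, not a formula copied from the paper): for least squares with a feature map $\phi$, taking
$$m_\theta(X_t, A_t, Y_t) := -\tfrac{1}{2}\bigl(Y_t - \phi(X_t, A_t)^\top \theta\bigr)^2$$
makes the conditional maximization hold with $\theta^*(\mathcal{P})$ the parameter of a correctly specified conditional mean model $\mathbb{E}_{\mathcal{P}}[Y_t \mid X_t, A_t] = \phi(X_t, A_t)^\top \theta^*(\mathcal{P})$, since the conditional expectation of the negative squared error is maximized at the conditional mean.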
3. Adaptively Weighted M-Estimators
Our proposed estimator for $\theta^*(\mathcal{P})$, denoted $\hat{\theta}^{AW}_T$, is the maximizer of a weighted version of the M-estimation criterion,
$$\hat{\theta}^{AW}_T := \operatorname*{argmax}_{\theta \in \Theta} \; \frac{1}{T} \sum_{t=1}^{T} W_t \, m_\theta(X_t, A_t, Y_t), \qquad \text{where} \quad W_t := \sqrt{\frac{\pi^{sta}_t(A_t \mid X_t)}{\pi_t(A_t \mid X_t)}}.$$
Here $\{\pi^{sta}_t\}_{t=1}^T$ are pre-specified stabilizing policies that do not depend on the data $\mathcal{H}_T$. A default choice for the stabilizing policy when the action space is of size $|\mathcal{A}|$ is just $\pi^{sta}_t(a \mid x) = 1/|\mathcal{A}|$ for all $a$ and $x$.
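To make the estimator concrete, here is a minimal sketch of the weighting scheme (my own illustration, not the authors' code). The toy model, the $\varepsilon$-greedy logging algorithm, and all parameter values are my assumptions; for the least-squares criterion $m_\theta = -\tfrac{1}{2}(y - \theta_a)^2$ over per-arm means, the weighted maximizer has a closed form, a $W$-weighted mean per arm:

```python
import numpy as np

rng = np.random.default_rng(0)
T, K = 5000, 2                      # horizon and number of arms (assumed)
theta_true = np.array([0.0, 0.3])   # true arm means (assumed toy model)
pi_sta = np.full(K, 1.0 / K)        # default uniform stabilizing policy

# Simulate adaptively collected data with an epsilon-greedy bandit
# (any logging policy works as long as pi_t(A_t | X_t) is recorded).
eps = 0.1
A = np.empty(T, dtype=int)
Y = np.empty(T)
pi_logged = np.empty(T)             # pi_t(A_t | X_t), recorded at logging time
sums, counts = np.zeros(K), np.zeros(K)
for t in range(T):
    means = np.divide(sums, counts, out=np.zeros(K), where=counts > 0)
    probs = np.full(K, eps / K)
    probs[np.argmax(means)] += 1.0 - eps
    a = rng.choice(K, p=probs)
    y = theta_true[a] + rng.normal()
    A[t], Y[t], pi_logged[t] = a, y, probs[a]
    sums[a] += y
    counts[a] += 1.0

# Adaptively weighted M-estimator with square-root importance weights
# W_t = sqrt(pi_sta(A_t | X_t) / pi_t(A_t | X_t)).
W = np.sqrt(pi_sta[A] / pi_logged)
theta_hat = np.array([np.sum(W[A == a] * Y[A == a]) / np.sum(W[A == a])
                      for a in range(K)])
print(theta_hat)   # close to theta_true for large T
```

Note that the square root is the point: it stabilizes the conditional variance of the weighted score given the history, rather than fully importance-weighting toward the stabilizing policy.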
This weight is very similar to the weights in Hadad et al. (2021), where $\pi_t(A_t \mid X_t)$ serves as the propensity score and $\pi^{sta}_t(A_t \mid X_t)$ is essentially a constant.
To construct uniformly valid confidence regions for $\theta^*(\mathcal{P})$ we prove that $\hat{\theta}^{AW}_T$ is uniformly asymptotically normal in the following sense:
$$\sqrt{T}\,\Sigma_{\mathcal{P}}^{-1/2}\,\ddot{M}_{\mathcal{P}}\bigl(\hat{\theta}^{AW}_T - \theta^*(\mathcal{P})\bigr) \xrightarrow{D} \mathcal{N}(0, I_d) \quad \text{uniformly over environments } \mathcal{P},$$
where
$$\ddot{M}_{\mathcal{P}} := -\,\mathbb{E}_{\mathcal{P}}\!\left[\sum_{a \in \mathcal{A}} \pi^{sta}_t(a \mid X_t)\,\ddot{m}_{\theta^*(\mathcal{P})}(X_t, a, Y_t(a))\right] \quad \text{and} \quad \Sigma_{\mathcal{P}} := \mathbb{E}_{\mathcal{P}}\!\left[\sum_{a \in \mathcal{A}} \pi^{sta}_t(a \mid X_t)\,\dot{m}_{\theta^*(\mathcal{P})}(X_t, a, Y_t(a))^{\otimes 2}\right].$$
We define $\dot{m}_\theta := \frac{\partial}{\partial \theta} m_\theta$. Similarly we define respectively $\ddot{m}_\theta$ and $\dddot{m}_\theta$ as the second and third partial derivatives of $m_\theta$ with respect to $\theta$. For any vector $v$ we define $v^{\otimes 2} := v v^\top$.
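One natural way to operationalize this result is a plug-in "sandwich" variance estimate, replacing the population quantities with $W_t^2$-weighted empirical averages evaluated at $\hat{\theta}^{AW}_T$ (the $W_t^2$ reweighting makes the averages track the stabilizing policy rather than the adaptive one). The sketch below is again my own illustration on the same assumed toy model, not the authors' code; it regenerates the simulated data and forms per-arm Wald intervals:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
T, K, eps = 5000, 2, 0.1
theta_true = np.array([0.0, 0.3])   # assumed toy arm means, as before
pi_sta = np.full(K, 1.0 / K)

# Regenerate the epsilon-greedy data from the previous sketch.
A = np.empty(T, dtype=int); Y = np.empty(T); pi_logged = np.empty(T)
sums, counts = np.zeros(K), np.zeros(K)
for t in range(T):
    means = np.divide(sums, counts, out=np.zeros(K), where=counts > 0)
    probs = np.full(K, eps / K); probs[np.argmax(means)] += 1.0 - eps
    a = rng.choice(K, p=probs)
    A[t] = a; Y[t] = theta_true[a] + rng.normal(); pi_logged[t] = probs[a]
    sums[a] += Y[t]; counts[a] += 1.0

W = np.sqrt(pi_sta[A] / pi_logged)
theta_hat = np.array([np.sum(W[A == a] * Y[A == a]) / np.sum(W[A == a])
                      for a in range(K)])

# Plug-in sandwich pieces. For m_theta = -(y - theta_a)^2 / 2:
#   score  mdot  = (y - theta_a) e_a,   Hessian  mddot = -e_a e_a^T.
Sigma_hat, M_hat = np.zeros((K, K)), np.zeros((K, K))
for t in range(T):
    e = np.zeros(K); e[A[t]] = 1.0
    mdot = (Y[t] - theta_hat[A[t]]) * e
    Sigma_hat += W[t] ** 2 * np.outer(mdot, mdot)
    M_hat += W[t] ** 2 * np.outer(e, e)   # minus the Hessian
Sigma_hat /= T
M_hat /= T

M_inv = np.linalg.inv(M_hat)
V = M_inv @ Sigma_hat @ M_inv   # covariance estimate for sqrt(T)(theta_hat - theta*)
half = stats.norm.ppf(0.975) * np.sqrt(np.diag(V) / T)
for a in range(K):
    print(f"arm {a}: {theta_hat[a]:.3f} +/- {half[a]:.3f}")
```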
Notice that the basic conditions and intuitions behind these results can be traced back to the classical asymptotic theory for M-estimators (van der Vaart, Asymptotic Statistics, Chapter 5).