Confidence Intervals for Policy Evaluation in Adaptive Experiments
1. Summary
For adaptively collected data in a potential outcomes model, the naive sample average is a biased estimator, and the inverse-probability weighting (IPW) estimator, although unbiased, has a non-normal asymptotic distribution. This paper proposes a test statistic that is asymptotically unbiased and asymptotically normal.
2. Details
Denote our data as $\{(W_t, Y_t)\}_{t=1}^{T}$. The random variables $W_t \in \{1, \dots, K\}$ are called the arms, treatments or interventions. Arms are categorical. The reward or outcome $Y_t$ represents the individual's response to the treatment. The treatment assignment probabilities $e_t(w) = P[W_t = w \mid H^{t-1}]$, also called propensity scores, are time-varying and decided via some known algorithm; here $H^{t-1}$ denotes the history of arms and rewards observed before time $t$. Given this setup, we are concerned with the problem of estimating and testing pre-specified hypotheses about the value of an arm, denoted by $Q(w) = E[Y_t(w)]$, as well as differences between two such values, denoted by $\Delta(w, w') = Q(w) - Q(w')$.
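To make the setup concrete, here is a minimal sketch of how such data could be generated. It assumes an epsilon-greedy assignment rule with Gaussian rewards; neither is taken from the paper, and the function name `run_epsilon_greedy` and its parameters are hypothetical. The point is only that the assignment probabilities are known at every step and change with the history.

```python
import random


def run_epsilon_greedy(true_means, T, epsilon=0.1, seed=0):
    """Simulate an adaptive experiment (illustrative assumption: epsilon-greedy).

    Returns the arms W_t, the rewards Y_t, and the full propensity vectors e_t(.).
    """
    rng = random.Random(seed)
    K = len(true_means)
    counts = [0] * K
    sums = [0.0] * K
    arms, rewards, propensities = [], [], []
    for _ in range(T):
        # Sample means from the history so far (0.0 for arms never pulled).
        means = [sums[k] / counts[k] if counts[k] else 0.0 for k in range(K)]
        best = max(range(K), key=lambda k: means[k])
        # Known assignment probabilities: explore uniformly with probability
        # epsilon, otherwise pull the arm that currently looks best.
        e_t = [epsilon / K + (1.0 - epsilon if k == best else 0.0) for k in range(K)]
        w_t = rng.choices(range(K), weights=e_t)[0]
        y_t = rng.gauss(true_means[w_t], 1.0)  # assumed unit-variance Gaussian reward
        counts[w_t] += 1
        sums[w_t] += y_t
        arms.append(w_t)
        rewards.append(y_t)
        propensities.append(e_t)
    return arms, rewards, propensities
```

Because every propensity is at least `epsilon / K` under this rule, the inverse weights stay bounded; algorithms whose propensities shrink toward zero are the harder case discussed in this summary.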
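The IPW estimator is defined as $\hat{Q}_T^{IPW}(w) = \frac{1}{T} \sum_{t=1}^{T} \frac{1\{W_t = w\}}{e_t(w)} Y_t$, and the augmented inverse propensity weighted (AIPW) estimator is defined as $\hat{Q}_T^{AIPW}(w) = \frac{1}{T} \sum_{t=1}^{T} \hat{\Gamma}_t(w)$, where $\hat{\Gamma}_t(w) = \frac{1\{W_t = w\}}{e_t(w)} Y_t + \left(1 - \frac{1\{W_t = w\}}{e_t(w)}\right) \hat{m}_t(w)$. The symbol $\hat{m}_t(w)$ denotes an estimator of the conditional mean function $m(w) = E[Y_t(w)]$ based on the history $H^{t-1}$, but it need not be a good one: it could be biased or even inconsistent (this is the sense in which the estimator is doubly robust). The second term of $\hat{\Gamma}_t(w)$ acts as a control variate: adding it preserves unbiasedness but can reduce variance, as it has mean zero conditional on $H^{t-1}$ and, if $\hat{m}_t(w)$ is a reasonable estimator of $m(w)$, is negatively correlated with the first term.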
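The two estimators can be sketched in a few lines of plain Python. Each period contributes the inverse-propensity-weighted reward plus a correction term built from the plug-in mean estimate; setting the plug-in to zero recovers IPW. The function and argument names below are illustrative assumptions, not the paper's code.

```python
def aipw_scores(arms, rewards, propensities, m_hat, w):
    """Per-period AIPW scores for arm w.

    arms[t]         : arm pulled at time t
    rewards[t]      : reward observed at time t
    propensities[t] : probability that arm w was assigned at time t
    m_hat[t]        : plug-in estimate of arm w's mean, built from data before t
    """
    scores = []
    for w_t, y_t, e_t, m_t in zip(arms, rewards, propensities, m_hat):
        ratio = (1.0 if w_t == w else 0.0) / e_t
        # First term: IPW reward. Second term: mean-zero control variate.
        scores.append(ratio * y_t + (1.0 - ratio) * m_t)
    return scores


def aipw_estimate(arms, rewards, propensities, m_hat, w):
    scores = aipw_scores(arms, rewards, propensities, m_hat, w)
    return sum(scores) / len(scores)


def ipw_estimate(arms, rewards, propensities, w):
    # IPW is the special case of AIPW with the plug-in estimate set to zero.
    zeros = [0.0] * len(arms)
    return aipw_estimate(arms, rewards, propensities, zeros, w)
```

For instance, with arms `[1, 0, 1, 1]`, rewards `[2.0, 1.0, 4.0, 3.0]`, propensities `[0.5, 0.5, 0.8, 0.8]` for arm 1 and plug-in values `[0.0, 2.0, 2.0, 3.0]`, the IPW estimate is 3.1875 and the AIPW estimate is 3.375.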
Inverse-probability weighting fixes the bias problem but results in a non-normal asymptotic distribution. In fact, when the probability of assignment to the arm of interest tends to zero, the inverse probability weights increase, which in turn causes the tails of the distribution to become heavier.
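This paper considers the adaptively-weighted AIPW estimator $\hat{Q}_T^{h}(w) = \frac{\sum_{t=1}^{T} h_t(w) \hat{\Gamma}_t(w)}{\sum_{t=1}^{T} h_t(w)}$, where $\hat{\Gamma}_t(w) = \frac{1\{W_t = w\}}{e_t(w)} Y_t + \left(1 - \frac{1\{W_t = w\}}{e_t(w)}\right) \hat{m}_t(w)$ is the AIPW score and $h_t(w)$ is a sequence of evaluation weights. Under some assumptions, the following result holds.
Suppose that either $\hat{m}_t(w)$ is consistent or $e_t(w)$ has a limit $e_\infty(w)$, i.e., either $\hat{m}_t(w) \to m(w)$ or $e_t(w) \to e_\infty(w)$ as $t \to \infty$. Then (basically, by the martingale central limit theorem) $\hat{Q}_T^{h}(w)$ is consistent for $Q(w)$ and $\frac{\hat{Q}_T^{h}(w) - Q(w)}{\hat{V}_T^{h}(w)^{1/2}} \to N(0, 1)$ in distribution, where $\hat{V}_T^{h}(w) = \frac{\sum_{t=1}^{T} h_t^2(w) \left(\hat{\Gamma}_t(w) - \hat{Q}_T^{h}(w)\right)^2}{\left(\sum_{t=1}^{T} h_t(w)\right)^2}$. A simple choice of $h_t(w)$ could be $\sqrt{e_t(w)/T}$, and more elaborate schemes can be constructed to yield a variance-optimal estimator.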
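A minimal sketch of the adaptively-weighted estimate and a normal-approximation confidence interval follows. It assumes the per-period AIPW scores are already available, studentizes with the weighted sum of squared deviations divided by the squared sum of weights, and uses square-root-propensity weights as the simple choice. The function names and the 1.96 critical value are illustrative assumptions rather than anything prescribed by the paper.

```python
import math


def constant_allocation_weights(propensities):
    """Simple evaluation weights: square root of (propensity / T). Assumed choice."""
    T = len(propensities)
    return [math.sqrt(e / T) for e in propensities]


def adaptively_weighted_estimate(scores, weights):
    """Return (estimate, standard_error) for weighted AIPW scores."""
    total = sum(weights)
    estimate = sum(h * g for h, g in zip(weights, scores)) / total
    # Plug-in variance: weighted squared deviations over the squared weight sum.
    variance = sum(h * h * (g - estimate) ** 2 for h, g in zip(weights, scores)) / total ** 2
    return estimate, math.sqrt(variance)


def normal_confidence_interval(estimate, std_error, z=1.96):
    # Two-sided interval from the asymptotic normal approximation.
    return estimate - z * std_error, estimate + z * std_error
```

With uniform weights this reduces to the unweighted AIPW average. The square-root weighting shrinks the contribution of periods in which the arm of interest was unlikely to be assigned, which is where the inverse-propensity scores are most variable.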