Trust Region Policy Optimization
1. Abstract
The theory justifies optimizing a surrogate objective with a penalty on the KL divergence. However, the large penalty coefficient $C$ recommended by the theory leads to prohibitively small steps, so we would like to decrease this coefficient. Empirically, it is hard to robustly choose the penalty coefficient, so we use a hard constraint instead of a penalty, with parameter $\delta$ (the bound on the KL divergence).
The constraint on the maximum KL divergence over states, $D_{\mathrm{KL}}^{\max}$, is hard for numerical optimization and estimation, so instead we constrain the average KL divergence $\bar{D}_{\mathrm{KL}}$ over states visited by the old policy.
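For reference, the two formulations can be written roughly as follows (a sketch in the usual TRPO notation, with $L_{\theta_{\text{old}}}(\theta)$ denoting the surrogate objective):

$$
\text{penalized:}\quad \max_{\theta}\; \Big[ L_{\theta_{\text{old}}}(\theta) \;-\; C\, D_{\mathrm{KL}}^{\max}(\theta_{\text{old}}, \theta) \Big]
$$

$$
\text{constrained:}\quad \max_{\theta}\; L_{\theta_{\text{old}}}(\theta)
\quad \text{subject to} \quad \bar{D}_{\mathrm{KL}}(\theta_{\text{old}}, \theta) \le \delta
$$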
Our theory ignores estimation error for the advantage function. Kakade and Langford (2002) consider this error in their derivation, and the same arguments would hold in the setting of this paper, but we omit them for simplicity.
2. Comparison between algorithms
- Traditional gradient descent updates $\theta \leftarrow \theta + \alpha\, \nabla_\theta J(\theta)$, taking a step of fixed size in parameter space. The problem: some parameters change the action probabilities a lot more than others (see the numerical sketch after this list)!
- We can rescale the gradient so this doesn't happen, using trust region optimization: take the largest step whose KL divergence from the current policy is at most $\delta$.
Notice that, to second order around $\theta_{\text{old}}$,
$$ D_{\mathrm{KL}}\big(\pi_{\theta_{\text{old}}} \,\|\, \pi_\theta\big) \approx \tfrac{1}{2}\, (\theta - \theta_{\text{old}})^\top F\, (\theta - \theta_{\text{old}}), $$
where $F = \mathbb{E}\big[\nabla_\theta \log \pi_\theta(a \mid s)\, \nabla_\theta \log \pi_\theta(a \mid s)^\top\big]$ is the Fisher-information matrix.
- Natural policy gradient: pick the update direction $F^{-1} g$, where $g$ is the policy gradient, i.e. $\theta \leftarrow \theta_{\text{old}} + \alpha\, F^{-1} g$ (see the derivation sketch after this list).
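To make the first bullet concrete, here is a tiny numerical sketch (the softmax policy and its two parameters are hypothetical, chosen only for illustration): equal-size steps on different parameters change the action probabilities by very different amounts.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

# Hypothetical policy over 3 actions with two parameters of very different
# sensitivity: parameter `a` is scaled by 10 inside the logits, `b` is not.
def policy(a, b):
    return softmax(np.array([10.0 * a, b, 0.0]))

base = policy(0.0, 0.0)
step = 0.1
print("same step on a:", np.abs(policy(step, 0.0) - base).sum())  # large probability change
print("same step on b:", np.abs(policy(0.0, step) - base).sum())  # small probability change
# Plain gradient descent measures both steps with the same Euclidean metric;
# measuring steps by KL divergence (trust region) corrects this mismatch.
```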
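For the last bullet, a short derivation sketch under the quadratic KL approximation above (writing $\Delta = \theta - \theta_{\text{old}}$ and linearizing the objective) shows why the natural gradient direction appears:

$$
\max_{\Delta}\; g^\top \Delta
\quad \text{s.t.} \quad \tfrac{1}{2}\, \Delta^\top F\, \Delta \le \delta
\;\;\Longrightarrow\;\;
\Delta^\ast = \sqrt{\tfrac{2\delta}{\,g^\top F^{-1} g\,}}\; F^{-1} g \;\propto\; F^{-1} g .
$$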
3. Computation
The Fisher-information matrix can be estimated using samples; the TRPO update rule then amounts to optimizing a linear objective under a quadratic-form constraint on the parameter change (a sample-based Fisher estimate is sketched below).
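A minimal sample-based estimate of the Fisher matrix, assuming a hypothetical helper `grad_log_pi(s, a)` that returns the score vector $\nabla_\theta \log \pi_\theta(a \mid s)$ as a 1-D array:

```python
import numpy as np

def estimate_fisher(states, actions, grad_log_pi):
    """Average outer product of score vectors over sampled (state, action) pairs."""
    grads = np.stack([grad_log_pi(s, a) for s, a in zip(states, actions)])  # shape (N, d)
    return grads.T @ grads / len(grads)  # F ~= (1/N) * sum_i g_i g_i^T, shape (d, d)
```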
- We first solve the linear system $F x = g$ to get the update direction $x \approx F^{-1} g$;
this can be done using the conjugate gradient method (see the sketch after this list).
- Having computed the search direction, we next need to compute the maximal step length that satisfies the KL divergence constraint. The solution has an analytic form (given after the sketch below).
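A sketch of the conjugate gradient solve, assuming a hypothetical callable `fisher_vector_product(v)` that returns $F v$, so that $F$ never has to be formed explicitly:

```python
import numpy as np

def conjugate_gradient(fisher_vector_product, g, iters=10, tol=1e-10):
    """Approximately solve F x = g using only matrix-vector products with F."""
    x = np.zeros_like(g)
    r = g.copy()                      # residual r = g - F x (x starts at zero)
    p = r.copy()                      # current search direction
    rr = r @ r
    for _ in range(iters):
        Fp = fisher_vector_product(p)
        alpha = rr / (p @ Fp)         # step length along p
        x += alpha * p
        r -= alpha * Fp
        rr_new = r @ r
        if rr_new < tol:
            break
        p = r + (rr_new / rr) * p     # new conjugate direction
        rr = rr_new
    return x                          # x ~= F^{-1} g, the update direction
```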
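As a sketch of that analytic form: writing the step as $\beta s$ with $s \approx F^{-1} g$ from the solve above, the quadratic approximation of the KL constraint, $\tfrac{1}{2}\beta^2\, s^\top F s \le \delta$, gives the maximal step

$$
\beta_{\max} = \sqrt{\frac{2\delta}{\,s^\top F s\,}},
\qquad
\theta_{\text{new}} = \theta_{\text{old}} + \beta_{\max}\, s ,
$$

where $s^\top F s$ can be computed with one more Fisher-vector product, e.g. `s @ fisher_vector_product(s)` with the hypothetical helper above.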