Near-optimal Sample Complexity Bounds for Robust Learning of Gaussian Mixtures via Compression Schemes
1. Summary
This paper proves that $\tilde{\Theta}(kd^2/\varepsilon^2)$ samples are necessary and sufficient for learning a mixture of $k$ Gaussians in $\mathbb{R}^d$, up to error $\varepsilon$ in total variation distance. The result is only near-optimal because the $\tilde{\Theta}(\cdot)$ notation hides factors that are polylogarithmic in $k$, $d$, and $1/\varepsilon$. The upper bound is shown using a novel technique for distribution learning based on a notion of compression. Any class of distributions that admits such a compression scheme can be learned with few samples. Furthermore, if a class of distributions has such a compression scheme, then so do the classes of products and mixtures of those distributions.
2. Definitions
Definition. Let $p$ and $q$ be two probability distributions defined over a domain $\mathcal{X}$. Their total variation distance is defined by
$$d_{TV}(p, q) = \sup_{A \subseteq \mathcal{X}} |p(A) - q(A)| = \frac{1}{2} \| p - q \|_1,$$
where $\|p - q\|_1$ is the $L_1$ norm of $p - q$.
Definition. A distribution $\hat{p}$ is an $\varepsilon$-approximation for $p$ if $d_{TV}(\hat{p}, p) \le \varepsilon$; we also say $\hat{p}$ is $\varepsilon$-close to $p$. More generally, a distribution $\hat{p}$ is a $C$-approximation for $p$ with respect to a class $\mathcal{F}$ if
$$d_{TV}(\hat{p}, p) \le C \cdot \inf_{q \in \mathcal{F}} d_{TV}(q, p) + \varepsilon.$$
We say $\mathcal{F}$ can be $C$-learned if there is an algorithm that, given enough samples from $p$, outputs such a $C$-approximation with probability at least $1 - \delta$.
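As a quick concrete illustration (my own, not taken from the paper), the following Python sketch computes the total variation distance between two distributions on a small finite domain via the identity $d_{TV}(p, q) = \frac{1}{2}\|p - q\|_1$ and checks $\varepsilon$-closeness; the example distributions and the tolerance are arbitrary.

```python
import numpy as np

def total_variation(p, q):
    """TV distance between two distributions on a finite domain,
    given as probability vectors: d_TV(p, q) = 0.5 * ||p - q||_1."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return 0.5 * np.abs(p - q).sum()

def is_eps_approximation(p_hat, p, eps):
    """p_hat is an eps-approximation of p iff d_TV(p_hat, p) <= eps."""
    return total_variation(p_hat, p) <= eps

# Toy example on a 3-element domain (numbers chosen arbitrarily).
p = [0.5, 0.3, 0.2]
p_hat = [0.45, 0.35, 0.2]
print(total_variation(p, p_hat))            # 0.05
print(is_eps_approximation(p_hat, p, 0.1))  # True
```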
3. Compression Scheme
Let $\mathcal{F}$ be a class of distributions over a domain $Z$.
Definition. A distribution decoder for $\mathcal{F}$ is a deterministic function $\mathcal{J} : \bigcup_{n=0}^{\infty} Z^n \times \bigcup_{n=0}^{\infty} \{0,1\}^n \to \mathcal{F}$, which takes a finite sequence of elements of $Z$ and a finite sequence of bits, and outputs a member of $\mathcal{F}$.
Definition. Let $\tau, t, m : (0,1) \to \mathbb{Z}_{\geq 0}$ be functions, and let $r \geq 0$. We say $\mathcal{F}$ admits $(\tau, t, m)$ $r$-robust compression if there exists a decoder $\mathcal{J}$ for $\mathcal{F}$ such that for any distribution $f \in \mathcal{F}$ and for any distribution $g$ on $Z$ with $d_{TV}(f, g) \leq r$, the following holds:
For any $\varepsilon \in (0,1)$, if a sample $S$ of $m(\varepsilon)$ points is drawn i.i.d. from $g$, then with probability at least $2/3$, there exists a sequence $L$ of at most $\tau(\varepsilon)$ elements of $S$, and a sequence $B$ of at most $t(\varepsilon)$ bits, such that $d_{TV}(\mathcal{J}(L, B), f) \leq \varepsilon$.
Note that $g$ can be seen as a corrupted version of the real distribution $f$, and the data is drawn from this corrupted distribution. Also, $L$ and $B$ are sequences rather than sets, so they may contain repeated elements. Essentially, the definition asserts that with probability at least $2/3$, there is a (short) sequence $L$ of sample points and some (small number of) additional bits $B$, from which $f$ can be approximately reconstructed by the decoder.
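To make the definition concrete, here is a minimal toy scheme (my own illustration, not the paper's construction) for the simple class of unit-variance one-dimensional Gaussians $\{N(\mu, 1) : \mu \in \mathbb{R}\}$. One can check that with $m(\varepsilon) = \lceil 3/\varepsilon \rceil$ samples, the sample point closest to $\mu$ lies within $\varepsilon$ of $\mu$ with probability at least $2/3$, and two unit-variance Gaussians whose means differ by at most $\varepsilon$ are $\varepsilon$-close in total variation; so this class admits a $(1, 0, \lceil 3/\varepsilon \rceil)$ non-robust compression scheme.

```python
import math
import numpy as np

rng = np.random.default_rng(0)

# Toy (non-robust) compression scheme for the class {N(mu, 1) : mu in R}.
# Parameters: tau(eps) = 1 point, t(eps) = 0 bits, m(eps) = ceil(3/eps) samples.

def decoder(points, bits):
    """Distribution decoder J: maps a single sample point (and no bits)
    to a member of the class, namely N(points[0], 1)."""
    return ("gaussian", points[0], 1.0)   # (family, mean, variance)

def encode(sample, mu):
    """Exhibits the short sequence promised by the definition.  Note that the
    'encoder' may use the true parameter mu: the definition only asserts that
    a good pair (L, B) *exists*; the learner never needs to run this step."""
    closest = min(sample, key=lambda x: abs(x - mu))
    return [closest], []                  # L has one element, B is empty

# Sanity check: with m(eps) samples from the true N(mu, 1), some point lies
# within eps of mu with probability >= 2/3, and then
# d_TV(N(closest, 1), N(mu, 1)) <= |closest - mu| <= eps.
mu, eps = 4.2, 0.1
sample = rng.normal(mu, 1.0, size=math.ceil(3 / eps))
L, B = encode(sample, mu)
print(decoder(L, B))   # e.g. ('gaussian', 4.19..., 1.0)
```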
Theorem. There exists a deterministic algorithm that, given candidate distributions $f_1, \dots, f_M$, a parameter $\varepsilon > 0$, and $O\!\left(\frac{\log(M/\delta)}{\varepsilon^2}\right)$ i.i.d. samples from an unknown distribution $g$, outputs an index $j \in [M]$ such that
$$d_{TV}(f_j, g) \leq 3 \min_{i \in [M]} d_{TV}(f_i, g) + 4\varepsilon$$
with probability at least $1 - \delta$.
Notice that although the learner cannot know the true sequences $L$ and $B$ used in the compression, it does know the bounds $\tau(\varepsilon)$ and $t(\varepsilon)$ on their lengths. It can therefore enumerate every candidate sequence of at most $\tau(\varepsilon)$ sample points together with every string of at most $t(\varepsilon)$ bits, decode each pair, and use the theorem above to select a near-optimal distribution from this finite set of candidates.
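The selection procedure behind a guarantee of this type is the classical minimum distance estimate (a Scheffé-style tournament). The Python sketch below is my own illustration, not code from the paper: it implements that rule for candidates that are unit-variance one-dimensional Gaussians, where each Scheffé set $A_{ij} = \{x : f_i(x) > f_j(x)\}$ is a half-line whose candidate mass can be computed from the normal CDF and whose empirical mass is a fraction of the samples.

```python
import numpy as np
from scipy.stats import norm

def minimum_distance_select(means, samples):
    """Minimum distance estimate specialized to candidates N(mean_i, 1):
    pick the candidate whose mass on every Scheffe set A_ij is closest to
    the empirical mass of that set.  Returns the selected index."""
    samples = np.asarray(samples, dtype=float)
    deltas = []
    for i, mi in enumerate(means):
        worst = 0.0
        for j, mj in enumerate(means):
            if i == j or mi == mj:
                continue
            c = 0.5 * (mi + mj)        # boundary of A_ij = {x : f_i(x) > f_j(x)}
            if mi < mj:                # A_ij = (-inf, c)
                cand_mass = norm.cdf(c - mi)
                emp_mass = np.mean(samples < c)
            else:                      # A_ij = (c, +inf)
                cand_mass = 1.0 - norm.cdf(c - mi)
                emp_mass = np.mean(samples > c)
            worst = max(worst, abs(cand_mass - emp_mass))
        deltas.append(worst)
    return int(np.argmin(deltas))

# Toy usage: the unknown distribution is N(1.0, 1); the candidates are made up.
rng = np.random.default_rng(1)
samples = rng.normal(1.0, 1.0, size=2000)
print(minimum_distance_select([-2.0, 0.0, 0.9, 3.0], samples))  # expected: 2
```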
Theorem (compressibility implies learnability). Suppose $\mathcal{F}$ admits $(\tau, t, m)$ $r$-robust compression. Let $\tau'(\varepsilon) := \tau(\varepsilon) + t(\varepsilon) + \log_2 m(\varepsilon)$. Then $\mathcal{F}$ can be $\max\{3/r, 2\}$-learned in the agnostic setting using
$$\tilde{O}\!\left( m(\varepsilon) + \frac{\tau'(\varepsilon)}{\varepsilon^2} \right) \cdot \log\frac{1}{\delta}$$
samples, where $\tilde{O}$ hides factors that are polylogarithmic in its argument (and constant-factor rescalings of $\varepsilon$).
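Concretely, the learner implied by this theorem draws $m(\varepsilon)$ samples, decodes every possible pair of a short sample sequence and a short bit string into a candidate distribution, and then runs a selection procedure such as the one sketched above on the resulting finite list. The number of candidates is at most roughly $m(\varepsilon)^{\tau(\varepsilon)} \cdot 2^{t(\varepsilon)}$, whose logarithm is what the $\tau'(\varepsilon)/\varepsilon^2$ term reflects. A schematic sketch of the enumeration step, reusing the toy decoder above (the helper name is mine):

```python
from itertools import product

def enumerate_candidates(sample, tau, t, decoder):
    """Decode every sequence of tau sample points and every string of t bits
    into a candidate distribution.  The candidate count is
    len(sample)**tau * 2**t, so its logarithm is tau * log(len(sample)) + t."""
    return [decoder(list(points), list(bits))
            for points in product(sample, repeat=tau)   # sequences; repeats allowed
            for bits in product([0, 1], repeat=t)]

# With the toy scheme above (tau = 1, t = 0) the candidate list is simply
# {N(x, 1) : x in sample}, and the selection rule then picks one of them:
# candidates = enumerate_candidates(sample, tau=1, t=0, decoder=decoder)
```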
Lemma. If $\mathcal{F}$ admits $r$-robust compression, then the class of $k$-fold products of distributions in $\mathcal{F}$ admits $r$-robust compression (with correspondingly larger size functions $\tau$, $t$, $m$).
If $\mathcal{F}$ admits non-robust compression (the case $r = 0$), then $\mathcal{F}$ can be learned in the realizable setting using the same number of samples.
Lemma. If $\mathcal{F}$ admits (non-robust) compression, then $k$-mix$(\mathcal{F})$, the class of $k$-mixtures of distributions in $\mathcal{F}$, admits (non-robust) compression.
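The idea behind the mixture lemma is that an encoding of a $k$-mixture can consist of the encodings of its $k$ components (produced by the base class's scheme) plus a few bits recording each mixing weight to small additive precision. A minimal decoder sketch in that spirit, written for the toy unit-variance class above (the fixed-width weight encoding and the helper names are my own choices):

```python
def mixture_decoder(points, bits, k, base_decoder, tau=1, weight_bits=8):
    """Decoder for k-mix(F) built from a decoder for F: the first k*tau points
    encode the k components (tau points each), and the bits encode each mixing
    weight with weight_bits bits of precision."""
    components, weights = [], []
    for i in range(k):
        chunk = points[i * tau:(i + 1) * tau]
        components.append(base_decoder(chunk, []))
        wbits = bits[i * weight_bits:(i + 1) * weight_bits]
        w = sum(b << j for j, b in enumerate(wbits)) / (2 ** weight_bits - 1)
        weights.append(w)
    total = sum(weights) or 1.0
    weights = [w / total for w in weights]   # renormalize the quantized weights
    return list(zip(weights, components))

# e.g. a 2-mixture of the toy class: 2 encoded points plus 16 weight bits.
# mixture_decoder([0.1, 5.2], [1]*8 + [1] + [0]*7, k=2, base_decoder=decoder)
```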
4. Sample Complexity Bounds for Gaussian Mixtures
Lemma. For any positive integer $d$, the class of $d$-dimensional Gaussians admits a robust compression scheme whose encoding size $\tau(\varepsilon) + t(\varepsilon)$ is $\tilde{O}(d^2)$.
With the theorems and lemmas above, we can finally obtain the following main result.
Theorem. The class of $k$-mixtures of $d$-dimensional Gaussians can be learned in the realizable setting, and can be 9-learned in the agnostic setting, using $\tilde{O}(k d^2 / \varepsilon^2)$ samples.
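To see how the pieces fit together at the level of $\tilde{O}$ bookkeeping (writing $\tau_{\mathrm{mix}}, t_{\mathrm{mix}}, m_{\mathrm{mix}}$ for the size functions of the resulting mixture scheme; the subscripts are my notation): the mixture lemma gives a compression scheme for $k$-mixtures whose encoding size is about $k$ times that of a single Gaussian, and the learnability theorem converts encoding size into sample complexity,
$$\tau_{\mathrm{mix}}(\varepsilon) + t_{\mathrm{mix}}(\varepsilon) = \tilde{O}(k d^2) \quad \Longrightarrow \quad \tilde{O}\!\left( m_{\mathrm{mix}}(\varepsilon) + \frac{\tau_{\mathrm{mix}}(\varepsilon) + t_{\mathrm{mix}}(\varepsilon)}{\varepsilon^2} \right) = \tilde{O}\!\left( \frac{k d^2}{\varepsilon^2} \right),$$
provided the sample requirement $m_{\mathrm{mix}}(\varepsilon)$ stays within the same budget. The matching lower bound (the "necessary" direction mentioned in the summary) is proved separately in the paper, not via compression.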