Near-optimal Sample Complexity Bounds for Robust Learning of Gaussian Mixtures via Compression Schemes

1. Summary

This paper proves that $\tilde{\Theta}(kd^2/\epsilon^2)$ samples are necessary and sufficient for learning a mixture of $k$ Gaussians in $\mathbb{R}^d$, up to error $\epsilon$ in total variation distance. The bound is only near-optimal because the $\tilde{\Theta}$ notation hides a polylogarithmic factor in $k$, $d$, and $1/\epsilon$. The upper bound is shown using a novel technique for distribution learning based on a notion of sample compression: any class of distributions that admits such a compression scheme can be learned with few samples. Furthermore, if a class of distributions admits such a compression scheme, then so do the classes of products and mixtures of those distributions.

2. Definition

Definition. Let $f$ and $g$ be two probability distributions defined over a domain $Z$; their total variation distance is defined by

$$d_{TV}(f, g) = \sup_{B \subseteq Z} |f(B) - g(B)| = \frac{1}{2}\,\|f - g\|_1,$$

where the supremum is over measurable subsets of $Z$ and $\|f - g\|_1$ is the $L_1$ norm of $f - g$.

Definition. A distribution $\hat{f}$ is an $\epsilon$-approximation for $f$ if $d_{TV}(\hat{f}, f) \le \epsilon$. We also say $\hat{f}$ is $\epsilon$-close to $f$. A distribution $\hat{f}$ is a $C$-approximation for $f$ with respect to a class $\mathcal{F}$ (and accuracy $\epsilon$) if

$$d_{TV}(\hat{f}, f) \le C \cdot \inf_{g \in \mathcal{F}} d_{TV}(g, f) + \epsilon.$$
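As a concrete illustration of these definitions (not from the paper), the following minimal Python sketch computes the total variation distance of two finitely supported distributions, given as probability vectors, and checks the $\epsilon$-approximation condition.

```python
import numpy as np

def total_variation(p, q):
    """TV distance between two distributions on a common finite support:
    half the L1 norm of the difference, i.e. the largest discrepancy in
    probability assigned to any event."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return 0.5 * np.abs(p - q).sum()

def is_eps_approximation(f_hat, f, eps):
    """Check whether f_hat is an eps-approximation of f, i.e. d_TV <= eps."""
    return total_variation(f_hat, f) <= eps

# Example: a fair coin vs. a slightly biased coin.
f, f_hat = [0.5, 0.5], [0.55, 0.45]
print(total_variation(f, f_hat))            # 0.05
print(is_eps_approximation(f_hat, f, 0.1))  # True
```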

3. Compression Scheme

Let $\mathcal{F}$ be a class of distributions over a domain $Z$.

Definition. A distribution decoder for $\mathcal{F}$ is a deterministic function $\mathcal{J} : \bigcup_{n \ge 0} Z^n \times \bigcup_{n \ge 0} \{0,1\}^n \to \mathcal{F}$, which takes a finite sequence of elements of $Z$ and a finite sequence of bits, and outputs a member of $\mathcal{F}$.

Definition. Let $\tau, t, m : (0,1) \to \mathbb{Z}_{\ge 0}$ be functions, and let $r \ge 0$. We say $\mathcal{F}$ admits $(\tau, t, m)$ $r$-robust compression if there exists a decoder $\mathcal{J}$ for $\mathcal{F}$ such that for any distribution $f \in \mathcal{F}$ and for any distribution $q$ on $Z$ with $d_{TV}(f, q) \le r$, the following holds:

For any $\epsilon \in (0,1)$, if a sample $S$ of size $m(\epsilon)$ is drawn i.i.d. from $q$, then with probability at least $2/3$, there exists a sequence $L$ of at most $\tau(\epsilon)$ elements of $S$, and a sequence $B$ of at most $t(\epsilon)$ bits, such that $d_{TV}(\mathcal{J}(L, B), f) \le \epsilon$.

Note that $q$ can be seen as a corrupted version of the real distribution $f$, and the data is drawn from this corrupted distribution. Also, $L$ and $B$ are sequences and may therefore contain repeated elements. Essentially, the definition asserts that with probability at least $2/3$, there is a short sequence of sample points and a small number of additional bits from which $f$ can be approximately reconstructed, as the toy sketch below illustrates.
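To make the definition concrete, here is a hypothetical toy example (not the paper's scheme): the class $\{N(\mu, 1) : \mu \in \mathbb{R}\}$ of unit-variance Gaussians admits a non-robust compression with $\tau(\epsilon) = 1$, $t(\epsilon) = O(\log(1/\epsilon))$, and a small constant sample size. The encoder may use its knowledge of $\mu$ (the guarantee is only existential); only the decoder has to be a fixed deterministic function.

```python
import numpy as np

# Toy (non-robust) compression scheme for {N(mu, 1) : mu in R}; purely
# illustrative, not the scheme used in the paper.  The "encoder" below is
# only an existence argument: it may use its knowledge of mu to pick a good
# sample point and good bits.  Only the decoder must be a fixed function.

def encode(sample, mu, eps):
    """Pick the sample point x closest to mu (the sequence L, of length 1),
    then encode the residual mu - x, clipped to [-3, 3], with enough bits
    that the quantization error is at most eps (the bit string B)."""
    n_bits = int(np.ceil(np.log2(6.0 / eps)))   # grid of resolution <= eps
    x = sample[np.argmin(np.abs(sample - mu))]
    residual = np.clip(mu - x, -3.0, 3.0)
    code = int(round((residual + 3.0) / 6.0 * (2**n_bits - 1)))
    return [x], format(code, f"0{n_bits}b")

def decode(points, bits):
    """Deterministic decoder J: output the mean of the reconstructed
    Gaussian N(mu_hat, 1) from one sample point and the bit string."""
    residual = int(bits, 2) / (2**len(bits) - 1) * 6.0 - 3.0
    return points[0] + residual

# With a handful of samples, some point lands within distance 3 of mu with
# probability far above 2/3, so the decoded mean is within eps of mu
# (quantization error only), hence the decoded Gaussian is O(eps)-close in TV.
rng = np.random.default_rng(0)
mu, eps = 2.7, 0.01
points, bits = encode(rng.normal(mu, 1.0, size=10), mu, eps)
print(abs(decode(points, bits) - mu) <= eps)  # True
```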

Theorem. There exists a deterministic algorithm that, given candidate distributions $f_1, \dots, f_M$, a parameter $\epsilon > 0$, and $\frac{\log(3M^2/\delta)}{2\epsilon^2}$ i.i.d. samples from an unknown distribution $g$, outputs an index $j \in [M]$ such that

$$d_{TV}(f_j, g) \le 3 \min_{i \in [M]} d_{TV}(f_i, g) + 4\epsilon$$

with probability at least $1 - \delta$.
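The algorithm behind this theorem is the classical minimum-distance (Yatracos-class) estimate of Devroye and Lugosi. The following Python sketch, for distributions on a finite support and not taken from the paper, shows the idea: compare each candidate with the empirical distribution on all sets of the form $A_{jk} = \{x : f_j(x) > f_k(x)\}$.

```python
import numpy as np

def minimum_distance_select(candidates, samples, support_size):
    """Minimum-distance estimate over a finite support {0, ..., support_size-1}:
    return the index of the candidate whose probabilities agree best with the
    empirical distribution on every Yatracos set A_jk = {x : f_j(x) > f_k(x)}."""
    candidates = [np.asarray(f, dtype=float) for f in candidates]
    emp = np.bincount(samples, minlength=support_size) / len(samples)
    M = len(candidates)
    yatracos = [candidates[j] > candidates[k]
                for j in range(M) for k in range(M) if j != k]
    scores = [max(abs(f[A].sum() - emp[A].sum()) for A in yatracos)
              for f in candidates]
    return int(np.argmin(scores))

# Example: samples come from a distribution close to the second candidate.
rng = np.random.default_rng(1)
candidates = [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3], [0.1, 0.1, 0.8]]
samples = rng.choice(3, size=2000, p=[0.25, 0.45, 0.30])
print(minimum_distance_select(candidates, samples, 3))  # expected: 1
```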

Notice that although the learner does not know the true sequences $L$ and $B$, it does know their maximum lengths $\tau(\epsilon)$ and $t(\epsilon)$ as functions of $\epsilon$. It can therefore enumerate all possible choices, obtaining at most roughly $m(\epsilon)^{\tau(\epsilon)} \cdot 2^{t(\epsilon)}$ candidate distributions, and then use the above theorem to select a good one from this finite set of candidates.
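A minimal sketch of this enumeration step (the toy decoder here is hypothetical and purely for illustration):

```python
import itertools

def candidate_decodings(decoder, sample, tau, t):
    """Enumerate every candidate the decoder could output from a sequence of
    tau sample points (order matters, repeats allowed) and t bits.  There are
    at most len(sample)**tau * 2**t such candidates; by the compression
    guarantee, one of them is eps-close to the target with probability >= 2/3,
    so the minimum-distance selection above picks a good hypothesis."""
    for points in itertools.product(sample, repeat=tau):
        for bits in itertools.product("01", repeat=t):
            yield decoder(list(points), "".join(bits))

# Illustration with a toy decoder that just averages the chosen points.
toy_decoder = lambda points, bits: sum(points) / len(points)
sample = [0.1, 2.3, 1.7]
print(len(list(candidate_decodings(toy_decoder, sample, tau=2, t=1))))  # 3**2 * 2 = 18
```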

Theorem (compressibility implies learnability). Suppose $\mathcal{F}$ admits $(\tau, t, m)$ $r$-robust compression. Let $\tau'(\epsilon) := \tau(\epsilon) + t(\epsilon) + \log_2 m(\epsilon)$. Then $\mathcal{F}$ can be $\max\{3, 2/r\}$-learned in the agnostic setting using

$$\tilde{O}\!\left( m\!\left(\tfrac{\epsilon}{6}\right) + \frac{\tau'(\epsilon/6)}{\epsilon^2} \right)$$

samples.

Lemma. If $\mathcal{F}$ admits $(\tau, t, m)$ $r$-robust compression, then the class $\mathcal{F}^d$ of products of $d$ distributions from $\mathcal{F}$ admits $\big(d\,\tau(\epsilon/d),\; d\,t(\epsilon/d),\; O(m(\epsilon/d)\log d)\big)$ $r$-robust compression: each coordinate is compressed to accuracy $\epsilon/d$, and the extra $O(\log d)$ factor in the sample size boosts the per-coordinate success probability enough for a union bound over all $d$ coordinates.

Moreover, if $\mathcal{F}$ admits $(\tau, t, m)$ non-robust compression (i.e., $r = 0$), then $\mathcal{F}$ can be learned in the realizable setting using the same number of samples.

Lemma. If $\mathcal{F}$ admits $(\tau, t, m)$ robust (non-robust) compression, then $k\text{-mix}(\mathcal{F})$ admits robust (non-robust) compression as well: roughly, $\tau$ and $t$ grow by a factor of $k$, an extra $O(k \log(k/\epsilon))$ bits are spent encoding approximate mixture weights, and the sample size $m$ grows by a factor of $\tilde{O}(k/\epsilon)$ so that even low-weight components receive enough samples.
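A hypothetical toy sketch of this idea, again with unit-variance Gaussian components rather than the paper's scheme: each component is compressed from sample points it generated (here, just its nearest sample point), and a few extra bits per component record its weight to precision about $\epsilon/k$.

```python
import numpy as np

def encode_mixture(points_per_component, mus, weights, eps):
    """Toy encoder for a k-mixture of N(mu_i, 1): one sample point per
    component (the point nearest its mean) plus ceil(log2(k/eps)) bits per
    component storing the weight on a grid of resolution about eps/k."""
    k = len(mus)
    wbits = int(np.ceil(np.log2(k / eps)))
    L, B = [], ""
    for pts, mu, w in zip(points_per_component, mus, weights):
        L.append(pts[np.argmin(np.abs(pts - mu))])
        B += format(int(round(w * (2**wbits - 1))), f"0{wbits}b")
    return L, B

def decode_mixture(L, B):
    """Toy decoder: read back the k quantized weights, renormalize them, and
    pair each with the decoded component mean."""
    k = len(L)
    wbits = len(B) // k
    weights = np.array([int(B[i*wbits:(i+1)*wbits], 2) / (2**wbits - 1)
                        for i in range(k)])
    return list(zip(weights / weights.sum(), L))    # [(w_i, mean_i), ...]

rng = np.random.default_rng(2)
mus, weights, eps = [-4.0, 0.0, 5.0], [0.2, 0.5, 0.3], 0.1
pts = [rng.normal(mu, 1.0, size=30) for mu in mus]
print(decode_mixture(*encode_mixture(pts, mus, weights, eps)))
```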

4. Sample Complexity Bounds of Gaussians Mixtures

Lemma. For any positive integer $d$, the class of $d$-dimensional Gaussians admits an $r$-robust compression scheme for a constant $r > 0$, in which the compressed representation consists of $\tilde{O}(d)$ sample points and $\tilde{O}(d^2)$ additional bits (with $\tilde{O}$ hiding factors polylogarithmic in $d$ and $1/\epsilon$).

Combining the theorems and lemmas above, we obtain the following main result.

Theorem. The class of $k$-mixtures of $d$-dimensional Gaussians can be learned in the realizable setting, and can be 9-learned in the agnostic setting, using $\tilde{O}(kd^2/\epsilon^2)$ samples.
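A rough accounting of how the pieces combine, hiding all polylogarithmic factors and assuming the parameter sizes sketched above: a single Gaussian compresses into $\tilde{O}(d^2)$ points and bits, so by the mixture lemma a $k$-mixture compresses into $\tau'(\epsilon) = \tilde{O}(kd^2)$ points and bits, and the compressibility-implies-learnability theorem then gives a sample complexity of

$$\tilde{O}\!\left( m(\epsilon) + \frac{\tau'(\epsilon)}{\epsilon^2} \right) = \tilde{O}\!\left( \frac{k d^2}{\epsilon^2} \right),$$

matching the lower bound up to polylogarithmic factors.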
