Good Semi-supervised Learning that Requires a Bad GAN
1. Summary
This paper shows that, given the discriminator objective, good semi-supervised learning indeed requires a bad generator, and it proposes a definition of the preferred generator, called the complement generator.
2. Problem Setup
Given a labeled set $\mathcal{L} = \{(x, y)\}$, let $\{1, 2, \dots, K\}$ be the label space for classification. Let $D$ and $G$ denote the discriminator and generator, and let $P_D$ and $p_G$ denote their corresponding distributions. Consider the standard discriminator objective function of GAN-based semi-supervised learning:
$$\max_D \;\; \mathbb{E}_{x, y \sim \mathcal{L}} \log P_D(y \mid x, y \le K) \;+\; \mathbb{E}_{x \sim p} \log P_D(y \le K \mid x) \;+\; \mathbb{E}_{x \sim p_G} \log P_D(K+1 \mid x),$$
where $p$ is the true data distribution. The probability distribution $P_D$ is over $K+1$ classes, where the first $K$ classes are the true classes and the $(K+1)$-th class is the fake class. The objective function consists of three terms. The first term maximizes the log conditional probability of the label for labeled data, which is the standard cost in the supervised learning setting. The second term maximizes the log probability of belonging to one of the first $K$ classes for unlabeled data. The third term maximizes the log probability of the $(K+1)$-th (fake) class for generated data.
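For concreteness, here is a minimal PyTorch-style sketch of this three-term objective written as a loss to minimize; the function and tensor names are illustrative placeholders (not the authors' code), and the logits are assumed to have shape (batch, K+1) with index K as the fake class.

```python
import torch
import torch.nn.functional as F

def ssl_gan_d_loss(logits_lab, y_lab, logits_unl, logits_fake, K):
    """Negative of the three-term discriminator objective, to be minimized by SGD."""
    # Term 1: E log P_D(y | x, y <= K) for labeled data, i.e. the usual
    # supervised cross-entropy over the first K logits.
    term1 = -F.cross_entropy(logits_lab[:, :K], y_lab)
    # Term 2: E log P_D(y <= K | x) for unlabeled data.
    term2 = (torch.logsumexp(logits_unl[:, :K], dim=1)
             - torch.logsumexp(logits_unl, dim=1)).mean()
    # Term 3: E log P_D(K+1 | x) for generated data (index K is the fake class).
    term3 = (logits_fake[:, K] - torch.logsumexp(logits_fake, dim=1)).mean()
    return -(term1 + term2 + term3)
```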
Let $f(x)$ be a nonlinear vector-valued function, and let $w_k$ be the weight vector for class $k$. The discriminator can be defined as
$$P_D(k \mid x) = \frac{\exp\!\left(w_k^\top f(x)\right)}{\sum_{k'=1}^{K+1} \exp\!\left(w_{k'}^\top f(x)\right)}.$$
Since this is a form of over-parameterization, $w_{K+1}$ is fixed as a zero vector.
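A small sketch of this parameterization, with the fake-class weight $w_{K+1}$ implemented as a hard-coded zero logit (the module and argument names are illustrative, not taken from the paper's implementation):

```python
import torch
import torch.nn as nn

class SSLDiscriminator(nn.Module):
    """Outputs K+1 logits w_k^T f(x), with w_{K+1} fixed to the zero vector."""
    def __init__(self, feature_net: nn.Module, feat_dim: int, K: int):
        super().__init__()
        self.f = feature_net                          # nonlinear feature map f(x)
        self.w = nn.Linear(feat_dim, K, bias=False)   # weights for the K true classes

    def forward(self, x):
        feat = self.f(x)
        true_logits = self.w(feat)                                     # w_k^T f(x), k <= K
        fake_logit = torch.zeros(feat.size(0), 1, device=feat.device)  # w_{K+1} = 0
        return torch.cat([true_logits, fake_logit], dim=1)             # shape (batch, K+1)
```

These logits can be fed directly into the loss sketch above, since the softmax over them gives exactly $P_D(k \mid x)$.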
3. Perfect generator
Theorem. If the generator distribution matches the true data distribution, i.e. $p_G = p$, and the discriminator has infinite capacity, then for any optimal solution $D = (w, f)$ of the following supervised objective,
$$\max_D \; \mathbb{E}_{x, y \sim \mathcal{L}} \log P_D(y \mid x, y \le K),$$
there exists $D^* = (w^*, f^*)$ such that $D^*$ maximizes the semi-supervised objective above and that for all $x$, $P_{D^*}(y \mid x, y \le K) = P_D(y \mid x, y \le K)$.
This theorem states that, given a perfect generator, there is no benefit to introducing the unlabeled data: the discriminator's classification over the true classes cannot be improved by the unsupervised parts of the objective.
4. Complement generator
The function $f$ maps data points in the input space to the feature space. Let $p_k(f)$ be the density of the data points of class $k$ in the feature space. Given a threshold $\epsilon_k$, let $F_k$ be a subset of the data support where $p_k > \epsilon_k$, i.e., $F_k = \{f(x) : p_k(f(x)) > \epsilon_k\}$. We assume that given $\{\epsilon_k\}_{k=1}^K$, the $F_k$'s are disjoint with a margin.
Suppose $\cup_{k=1}^K F_k$ is bounded by a convex set $B$. If the support $F_G$ of a generator $G$ in the feature space is a relative complement set in $B$, i.e., $F_G = B - \cup_{k=1}^K F_k$, we call $G$ a complement generator.
The paper then states that when $G$ is a complement generator, under mild assumptions, a near-optimal $D$ learns correct decision boundaries in each high-density subset $F_k$ (defined by $\epsilon_k$) of the data support in the feature space. Intuitively, the generator produces complement samples, so the logits of the true classes are forced to be low in the complement region. As a result, the discriminator places its class boundaries in low-density areas.
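As a rough sketch of this mechanism (my paraphrase in the notation above, not a verbatim statement from the paper): since $w_{K+1} = 0$, the unlabeled-data and generated-data terms of the objective roughly push
$$\max_{k \le K} w_k^\top f(x) > 0 \;\; \text{for } f(x) \in \cup_k F_k, \qquad \max_{k \le K} w_k^\top f(x) < 0 \;\; \text{for } f(x) \in F_G = B - \cup_k F_k,$$
so the boundary $\max_{k \le K} w_k^\top f(x) = 0$ is driven into the low-density complement region separating the $F_k$'s.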
5. Pros and cons of feature matching
By matching first-order moments of features via SGD, feature matching performs some kind of distribution matching, though in a rather weak manner. Loosely speaking, feature matching has the effect of generating samples close to the data manifold. But due to its weak power in distribution matching, feature matching will inevitably generate samples outside the manifold, especially as data complexity increases. Consequently, the generator density is usually lower than the true data density within the manifold and higher outside. Hence, an optimal discriminator can still distinguish between true and generated samples in many cases.
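To make the "first-order moment" point concrete, here is a minimal PyTorch sketch of the feature-matching generator loss (Salimans et al., 2016); `feature_extractor` stands for an intermediate layer $f(\cdot)$ of the discriminator and `G` for the generator, and both names are placeholders rather than the paper's actual code:

```python
import torch

def feature_matching_loss(feature_extractor, G, real_x, z):
    real_feat = feature_extractor(real_x).mean(dim=0)   # empirical E_x[f(x)]
    fake_feat = feature_extractor(G(z)).mean(dim=0)     # empirical E_z[f(G(z))]
    # Only the first-order moments (feature means) are matched, which is why the
    # resulting distribution matching is weak: many different generator densities
    # share the same feature mean.
    return torch.mean((real_feat - fake_feat) ** 2)
```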