Tuesday, July 9, 2013

A Generalization Bound for Dropouts

I have been saying for some time that I plan to start blogging and this is a first post. This will be a technical blog covering topics that I deem relevant to artificial intelligence.  This first post is somewhat conservative (it reports on a theorem).  Later posts will be less conservative and may even mention the singularity.

This post concerns the learning theory of dropouts in deep neural networks (DNNs).  Back-propagation learning is stochastic gradient descent on the parameters of a neural network. In dropout learning one selects a training point (an input-output pair), randomly deletes half (or some fraction) of the units in the network, and then performs a back-propagation update (a gradient descent update) on the weights of the remaning units.


There is now evidence that dropout learning is important for DNNs as it seems to be significant in the their recent success in speech and vision. Hinton motivates dropouts with the observation that dropout training is optimizing the performance of a random draw form large family of neural networks.  This happens to be the fundamental idea behind PAC-Bayesian generalization bounds.  I have just posted a tutorial on PAC-Bayesian theory on arxiv and have included a generalization bound for dropouts expressed in terms of a sum of the training error observed in dropout training (the training error of the dropped-out networks) plus an L_2 regularization term.


If the dropout rate is alpha (we drop a fraction alpha of parameters) then the regularization cost is reduced by a factor of (1-alpha) relative to standard PAC-Bayesian L2 bounds.  Presumably as alpha approaches 1 the training loss of the dropped-out networks becomes large.  The PAC-Bayesian analysis also motivates adding noise to the network parameters during training as described in the tutorial.

While this is an interesting bound, I am not sure that it is capturing the real phenomenon of dropouts.  The tutorial also includes a general "training-variance" bound analogous to the bias-variance analysis for square loss but with the bias term replaced by training loss and the variance term replaced by an expected kl-divergence E_S[ KL(Q(S), \bar{Q}) ] where Q(S) is the posterior distribution on sample S and \bar{Q} is the average (expected) posterior distribution.  The training-variance bound provably dominates the PAC-Bayesian L2 regularization bounds.  Unfortunately, the training-variance bound is difficult to interpret or convert into an algorithm.  It seems to suggest variance-reduction methods such as bagging.  Variance reduction may be the real issue in dropouts.

Except for the new dropout bound, the PAC-Bayes tutorial is a presentation of existing material. There is a particular effort to make the results in Catoni's 2007 book more accessible.


  1. This is very interesting. Thank you for making this connection. What I find interesting is that it seems to be a case where traditional Bayesian interpretations do not seem to apply naturally. Indeed, if we interpret the action of sampling from selected subsets of the network as sampling from a posterior, it means that the posterior (and presumably the prior) would have point masses at sparse weight vectors (where a whole set of output weights of a particular hidden unit are set to 0, and independently so for each hidden unit). The posterior would also presumably have a major mode corresponding to the usual parameter configuration (with a variance decreasing as the amount of training data increases). As the amount of training data increases, one would however expect that the correct posterior would give less and less probability (exponentially so) on the sparse parameter vectors (corresponding to hiding some of the hidden units) and more and more on the maximum likelihood solution. However, experiments with dropout suggests that even with large datasets, it is better to keep the dropout rate near 50%. The exponential rate at which the optimal dropout rate should approach 0 (not dropping anything) would depend on how well a network with randomly mutilated hidden units performs on the training data, compared to one with out dropout.

    Maybe I am wrong and the dropout rate should indeed be reduced exponentially fast as the amount of data increases. I would be curious to know, of course, if someone actually does the experiment. The comment also applies to some extent to the PAC-Bayesian approach, which suggests that the dropout rate should be decreased as the amount of data increases. I suspect that although there might be some truth to this, there probably are other effects at play that make it better to stay in the regime around 50% dropout.

    1. The PAC-Bayesian theorem allows the "posterior" to be defined --- it is not determined by the prior and the data as it would be in a Bayesian setting. In the proof of the dropout bound the dropout rate in the posterior is defined to match that in the prior.

      But Yoshua's point is still interesting in the PAC-Bayesian setting. One would expect that a tighter bound could be produced by allowing the dropout rate in the posterior to be different from that in the prior. One might expect the optimal posterior dropout rate to be lower as the amount of data increases.

      However, even in a Bayesian setting the posterior dropout rate might remain high for most parameters. Consider a linear predictor operating on sparse features (consider one unit in a DNN). If we assume a Zipf-like distribution on feature activation then no matter how much data we have there will be features that have just enough data to train weights. The Bayesian dropout rates on these "rare" activations might remain high. As the sample size increases different features become "rare". But we then always have rare features for which posterior dropout rate is high. Perhaps the dropout rate of a unit should depend on the unit activation frequency of that unit or the L2 norm of all the output weights of that unit.

  2. It is a wonderful idea to make Catoni's ideas more accessible. This could benefit greatly the ML community.

    I'm also looking forward to see what you have to say about the Singularity ;).

  3. The reference to Zipf-like distribution on features reminds me of Zipfian corruptions for robust POS tagging by Anders Sogaard.

  4. Thanks for this tutorial, very usefull!!!

    I agree with S├ębastien that it would be nice to make PAC-Bayes theory more accessible. I recently read this paper
    It seems that (most) researchers in the Bayesian community are not aware of the PAC-Baysian approach. There would probably be a mutual benefit if it was the case.