Saturday, September 21, 2013

The Free Lunch Theorem

Different learning theorists have different approaches to teaching learning theory. Shai Shalev-Shwartz and Shai Ben-David have written a book,  Understanding Machine Learning: From Theory to Algorithms, which gives a significant amount of space to the no free lunch theorem (preliminary notes). Here I will state a "free lunch theorem" and argue that the free lunch theorem is a good (better) departure point for learning theory.

I will state the free lunch theorem in terms of C++ code.  One can specify a predictor as a C++ program.  The predictor can be a linear classifier, a polynomial, a decision tree, or whatever.  I will assume access to an arbitrarily large (but finite) standard library so that programs can be written extremely compactly if you know the right library function to call.  The library can include every form of function used as a predictor by machine learning practitioners and be designed in such a way as to allow predictors (of all kinds) to be written as compactly as possible.  The library can also include all of the programming abstractions in use by programmers today so that new programs can be written as compactly as possible.  For any code library L, and program f written using L, I will write | f |_L for the length of the code for f (in bits).  Here we are not counting the length of the code for the library functions, just the number of bits in the name of the library function when the function is called.  The theorem states, in essence, that, as long as the library L is written prior to seeing the training data, any predictor f written in C++ using L will perform well provided that it performs well on the training data and that | f |_L is small compared to the size of the training data.  For example, a polynomial predictor with n numerical parameters and which performs well on the training data will perform well on new data provided that the number of training points is large compared with bn where b is the number of bits used for each numerical parameter.  (Note the lack of any reference to VC dimension.)

To state the free lunch theorem precisely we assume a data distribution D on input-output pairs and a sample S of m such pairs drawn IID from this distribution.  For simplicity we will assume binary classification so that the output is always either 1 or -1.  For any predictor f mapping input to outputs we define the generalization error rate of f, denoted err(f,D), to be the fraction of time that f makes a mistake when drawing input-output pairs form the data distribution D.  We write err(f,S) for the fraction of times that f makes a mistake on the sample.  The theorem states that with probability at least 1-delta over the draw of the sample S (of m training pairs) the following holds simultaneously for all C++ programs f that can be written with library L and all values of the parameter lambda.

(1)   err(f,D) <= (1 - 1/(2lambda)) [ err(f,S)  +  [lambda(ln 2)/m] | f |_L  + [lambda/m] ln(1/delta) ]

This is a special case of theorem 1 of the PAC-Bayesian tutorial referenced in my post on a generalization bound for dropouts. For fixed lambda, this bound expresses a linear trade-off between the accuracy on the training data err(f,S) and the simplicity of the rule as measured by | f |_L.  It is a form of Occam's razor. An important point is that we can use an arbitrarily large code library when writing f. Another important point is that it is meaningless to make lambda very large. The only reason for making lambda large is that this reduces the leading factor in the bound. However, the leading factor rapidly approaches 1 as lambda becomes large.  Hence the weight on | f |_L fundamentally goes as 1/m rather than 1/sqrt{m}. (Some people will argue with this interpretation, but in any case the theorem is true as stated.)

The free lunch theorem says that we do not have to fix any particular form of predictor before we see the training data --- we can use an enormous library of essentially all classes of predictors known to learning theory, and in fact all predictors that can be written in C++.  This includes any predictor that could possibly be shipped to a customer.

So what about the no free lunch theorem.  As stated by Shai Shalev-Shwartz in his notes chapter 6, the theorem states, in essence,  that for any learning algorithm there exists a data distribution for which a low error rate prediction rule exists but where the learning algorithm fails to find such a predictor. When applied in the setting of the above free lunch theorem (1), the no free lunch theorem states that there exists a data distribution such that the only good predictors f are such that | f |_L > m.   In this case the true input-output relation has no compact representation.  Structureless functions exist.  We are not going to learn structureless functions.

The pedagogical role of the no free lunch theorem is to justify the introduction of restricted classes of predictors --- one assumes-away the possibility that the true input-output relation is structureless.  I will admit that one does often want to consider restricted classes, such as linear predictors over a fixed (but very high dimensional) feature map.  This is an important class of predictors and it is useful to try to understand it.  But the no free lunch theorem is often used to justify restrictions to finite capacity classes.  This leads to bad theory in my opinion.  But that is a subject for another post.

In summary, the free lunch theorem (1) seems like a better departure point for the study of learning theory than the observation that structureless functions exist.

2 comments:

  1. This is very interesting and made me think quite a bit. Thank You. I only understood (I think) what you were saying after your post on "Chomsky versus Hinton".

    However I might be missing something so I am going to rephrase the main idea of the above in a language that I am comparatively more comfortable with. I would really appreciate it if you could correct me if I am wrong.

    I don't know any Learning Theory so I don't see the no-free-lunch in terms of "for every learner there exists a distribution in which it fails" but rather as a simple corollary of the following incompleteness theorem in Kolmogorov Complexity.

    Theorem (Chaitin): There exists a constant L (which only depends on the particular axiomatic system and the choice of description language) such that there does not exist a string s for which the statement the "The Kolmogorov Complexity of s is greater than L" (as formalized in S) can be proven within the axiomatic system S.

    Stated informally: "There's a number L such that we can't prove the Kolmogorov complexity of any specific string of bits is more than L. "

    In particular you can think of any learning machine as a compression scheme of some fixed Kolmogorov Complexity (consider the most optimal program for that machine, the programming language is irrelevant as long as it is Turing Complete. Consider the description length of it in bits. That is its Kolmogorov Complexity). For any such learning machine you can construct a data string which has Kolmogorov Complexity greater than what this learning machine has. i.e. this data string is random from the point of view of this learning machine. In short IF the data string has some Kolmogorov Complexity K and your learning machine has Kolmogorov Complexity k and k < K then you will never be able to find any structure in your data string i.e. can never show that it has a KC greater than k. Which actually means that the data string given to our learner is structureless (or said differently the function is structureless for our encoding scheme). Thus no free lunch is just a simple diagonalization argument. That is - one can always construct a string that is Kolmogorov Random or structureless w.r.t our encoding scheme (i.e. will have a KC more than the KC of our program).

    From your Hinton post I got the following (something that I do infact believe to be the case, perhaps a platonist viewpoint): That there is no reason to believe that structureless functions exist or stated differently - it is less informative to observe that structureless functions exist. That is - IF the data is generated by a computational process then there is no reason to believe that it is structureless (or we are only talking of computable functions). And if this is the case, then you would always have a machine in your library of learning machines that will be able to find some upper bound on the Kolmogorov Complexity of the data string (much less than the machine itself). This search might be done using Levin's universal search, for example.

    I have not looked at the paper. But just at the post here but then your free lunch theorem says - if a function in our "universal" library does well on the training set, then with high probability it will do better on the test set. Viewed a bit differently this can be used to argue that a "universal learning" algorithm exists.

    In general the argument is that using no free lunch as a justification for restricted concept classes is bad theory. From what I understood I completely agree. This is ofcourse predicated on if I understood you correctly.

    ReplyDelete
    Replies
    1. In my post I made the implicit assumption that we are only discussing C++ programs which define an input-output function (a predictor). I am restricting my class to the C++ that halt. Furthermore, I took the complexity (bit length) of the program to be the number of bytes in the source code rather than its Kolmogorov complexity. To improve the bound we could use a standard compression algorithm, such as Gzip, to compress the source code and use the compressed bit length for the complexity of the predictor. Here there are no issues with the undecidability of Kolmogorov complexity. The free lunch theorem, like the no free lunch theorem, is information-theoretic --- it ignores computational issues.

      Delete