Sunday, August 3, 2014

This blog has moved.  There is a new post on metaphor and mathematics on the new blog.

Saturday, September 28, 2013

Chomsky vs. Hinton

This is a continuation of my "free lunch" post.  I want to provide more context.  I also feel bad about phrasing the discussion in the context of a particular textbook.  The no free lunch theorem is widely quoted and seems to be a standard fixture of the machine learning community.

 I would like to rephrase the issue here as a debate between Chomsky and Hinton.  I will start with caricatures of their positions:

Chomsky:  Generalizing to unseen inputs --- for example judging grammatically on sentences never seen before --- is impossible without some a-priori knowledge of the set of allowed predictors (the set of possible human languages).  Hence we must be born with knowledge of the space of possible human languages --- some form of universal grammar must exist.

Hinton:  Neural networks are a Turing-complete model of computation.  Furthermore, there exists some (perhaps yet to be discovered) universal algorithm for learning networks which can account for leaning in arbitrary domains, including language.

This debate centers on an empirical question --- what algorithms or knowledge is provided by the human genome?

It is important here to distinguish information-theoretic issues from computational complexity issues.

Information-Theoretic Issues: Chomsky takes the position that on purely information theoretic grounds universal grammar must exist. Chomsky's argument is essentially an invocation of the no free lunch theorem.  But the free lunch theorem shows, I think, that there exist information-theoretically adequate universal priors on the set of all computable functions.  On information-theoretic grounds I think Hinton's position is much more defendable than Chomsky's.

Computational Issues: It is clearly silly to consider enumerating all C++ programs as part of a learning algorithm.  But training deep neural networks is not silly --- it has lead to a 30% reduction in word error rate in speech recognition.  Chris Manning is working on deep neural network approaches to learning grammar.

Computational issues do provide good motivations for certain learning algorithms such as the convex optimization used in training an SVM.  But a computational efficiency motivation for a restricted predictor class is different from claiming that for information-theoretic reasons we must be given some restriction to a proper subclass of the computable functions.

There are many Turing complete models of computation and the choice of model does seem to matter. We seem to gain efficiency in programming when we become familiar with the C++ standard library. Somehow the library provides useful general purpose constructs.

We seem to be capable of general purpose thought.  Thought seems related to language. Chomsky might be right that some form of linguistic/cognitive architecture is provided by the human genome. But the adaptive advantage provided by the details of an architecture for general-purpose thought seem likely to be related to computational issues rather than an information-theoretic requirement for a learning bias.

Saturday, September 21, 2013

The Free Lunch Theorem

Different learning theorists have different approaches to teaching learning theory. Shai Shalev-Shwartz and Shai Ben-David have written a book,  Understanding Machine Learning: From Theory to Algorithms, which gives a significant amount of space to the no free lunch theorem (preliminary notes). Here I will state a "free lunch theorem" and argue that the free lunch theorem is a good (better) departure point for learning theory.

I will state the free lunch theorem in terms of C++ code.  One can specify a predictor as a C++ program.  The predictor can be a linear classifier, a polynomial, a decision tree, or whatever.  I will assume access to an arbitrarily large (but finite) standard library so that programs can be written extremely compactly if you know the right library function to call.  The library can include every form of function used as a predictor by machine learning practitioners and be designed in such a way as to allow predictors (of all kinds) to be written as compactly as possible.  The library can also include all of the programming abstractions in use by programmers today so that new programs can be written as compactly as possible.  For any code library L, and program f written using L, I will write | f |_L for the length of the code for f (in bits).  Here we are not counting the length of the code for the library functions, just the number of bits in the name of the library function when the function is called.  The theorem states, in essence, that, as long as the library L is written prior to seeing the training data, any predictor f written in C++ using L will perform well provided that it performs well on the training data and that | f |_L is small compared to the size of the training data.  For example, a polynomial predictor with n numerical parameters and which performs well on the training data will perform well on new data provided that the number of training points is large compared with bn where b is the number of bits used for each numerical parameter.  (Note the lack of any reference to VC dimension.)

To state the free lunch theorem precisely we assume a data distribution D on input-output pairs and a sample S of m such pairs drawn IID from this distribution.  For simplicity we will assume binary classification so that the output is always either 1 or -1.  For any predictor f mapping input to outputs we define the generalization error rate of f, denoted err(f,D), to be the fraction of time that f makes a mistake when drawing input-output pairs form the data distribution D.  We write err(f,S) for the fraction of times that f makes a mistake on the sample.  The theorem states that with probability at least 1-delta over the draw of the sample S (of m training pairs) the following holds simultaneously for all C++ programs f that can be written with library L and all values of the parameter lambda.

(1)   err(f,D) <= (1 - 1/(2lambda)) [ err(f,S)  +  [lambda(ln 2)/m] | f |_L  + [lambda/m] ln(1/delta) ]

This is a special case of theorem 1 of the PAC-Bayesian tutorial referenced in my post on a generalization bound for dropouts. For fixed lambda, this bound expresses a linear trade-off between the accuracy on the training data err(f,S) and the simplicity of the rule as measured by | f |_L.  It is a form of Occam's razor. An important point is that we can use an arbitrarily large code library when writing f. Another important point is that it is meaningless to make lambda very large. The only reason for making lambda large is that this reduces the leading factor in the bound. However, the leading factor rapidly approaches 1 as lambda becomes large.  Hence the weight on | f |_L fundamentally goes as 1/m rather than 1/sqrt{m}. (Some people will argue with this interpretation, but in any case the theorem is true as stated.)

The free lunch theorem says that we do not have to fix any particular form of predictor before we see the training data --- we can use an enormous library of essentially all classes of predictors known to learning theory, and in fact all predictors that can be written in C++.  This includes any predictor that could possibly be shipped to a customer.

So what about the no free lunch theorem.  As stated by Shai Shalev-Shwartz in his notes chapter 6, the theorem states, in essence,  that for any learning algorithm there exists a data distribution for which a low error rate prediction rule exists but where the learning algorithm fails to find such a predictor. When applied in the setting of the above free lunch theorem (1), the no free lunch theorem states that there exists a data distribution such that the only good predictors f are such that | f |_L > m.   In this case the true input-output relation has no compact representation.  Structureless functions exist.  We are not going to learn structureless functions.

The pedagogical role of the no free lunch theorem is to justify the introduction of restricted classes of predictors --- one assumes-away the possibility that the true input-output relation is structureless.  I will admit that one does often want to consider restricted classes, such as linear predictors over a fixed (but very high dimensional) feature map.  This is an important class of predictors and it is useful to try to understand it.  But the no free lunch theorem is often used to justify restrictions to finite capacity classes.  This leads to bad theory in my opinion.  But that is a subject for another post.

In summary, the free lunch theorem (1) seems like a better departure point for the study of learning theory than the observation that structureless functions exist.

Saturday, September 14, 2013


The concept of a "situation" seems central to artificial intelligence and the semantics of natural language. The situation calculus (McCarthy and Hayes, 1969) underlies the notion of STRIPS planning used in the AI planning community.  The notion of "possible world" underlies the Kripke semantics of modal logic which is used in theories of the natural language semantics.  Situations (or possible worlds) play a central role in type-logical semantics.  "Situation Semantics" has been proposed as an alternative to possible world semantics where situations contain partial rather complete information. Situations seem closely related to the events of Davidsonian event semantics.  The notion of situation also seems closely related to the notion of state in reinforcement learning.

Before going any further, the first sentence of the lead story at at the moment of this writing is:

With the announcement on Saturday that the U.S. and Russia have reached an agreement on securing Syria's chemical weapons stockpile, the American threat of U.S. military action was effectively taken off the table.

A search of with keywords "situation in Syria" finds the following title of a Sept 10 Washington Post article.

Syria situation further strains Obama’s relationship with the antiwar movement.

But what is "The Syria situation".  Is the (actual) Syria situation a natural example of the general concept of a situation?  Should the first sentence above be considered to be a statement about, or true of, the Syria situation?  It seems to be true that the Syria situation evolves over time --- there is the current Syria situation, a history of the Syria situation, and possible future outcomes of the Syria situation.

This may seem hopelessly complex.  But perhaps the complexity arises from a feeling that we need to define a system of concepts and rules that specify exactly what situations are.  Perhaps formal definitions and rules are not necessary.  Perhaps there are just soft (weighted) lexically-driven syntactic entailment rules where nothing is formally defined.

Quine took up the slogan "to be is to be the value of a variable".  I would modify this and say "to be is to be the referent of a mention".  If we think in terms of mention equivalence classes, as in mention-mention models of coreference, then the slogan becomes "to be is to be a mention equivalence class". But in a database model of reality it seems more natural to work with a mention-entity model and use the term "reference" rather than coreference.  Of course Santa Clause and Unicorns do not exist in reality simply because we mention them.  However, they do exist in certain stories --- in the situations described in certain works of fiction.  Does a fictional situation "exist" as a "possible world"?   I think the most intellectually coherent answer is yes --- fictional worlds and the entities in them exist in much the same sense that the vector space R^100 exists, they are things we can think about.  I am not too concerned about contradictory entities such as a round square --- the simple assumption that all mentions refer to some entity, where some entities may not be part of our physical reality, seems likely to work well for practical NLP systems.

Perhaps what really defines the notion of a situation is the fact that the same proposition --- the same proposition entity or fluent --- can have different truth values in different situations.  For example, the sentence "David is hungry" (for a fixed referent of "David") is true in some situations but not in others. At a syntactic level propositions are interpreted relative to situations as in "When he arrived at the party last night, David was hungry".  Here the phrase "when he arrived at the party last night" seems to be acting as a situation mention and "David was hungry" is being interpreted in the situation that is the referent of the situation mention.  I would extend the slogan "to be is to be the referent of a mention" with a second slogan "to be a situation is to be the referent of a situation mention".  Somewhat more explicitly, a situation mention is a sentential modifier that intuitively gives the situation in which the sentence is to be interpreted.  For example, "Israel is only a minor player in the Syria situation".

This is an informal blog post --- there are no theorems here.  However, I hope that it provides food for thought.  My over-arching theme is that syntax plus reference (to database entities) plus, perhaps, lexically-driven syntactic paraphrase rules, covers a lot of "semantics".

I promise that a significant fraction of future blog posts will discuss theorems.

Friday, August 30, 2013

Tarski and Mentalese

Blogging gives me a chance to unload many years of unpublishable thinking.  One of the things that I have thought long and hard about is semantics in the sense of Tarski's definition of truth and its relationship to Platonist vs. formalist philosophies of mathematics.  Here I will try to reconcile Platonist and formalist philosophies and relate both to the the notion of semantics for natural language.

We are faced with two positions on the nature of mathematics both of which have powerful arguments.

PlatonismGödel and Tarski each took a Platonic approach to meta-mathematics (mathematical logic). This is perhaps clearest when discussing the natural numbers.  The formulas of natural numbers can be defined by the following grammar

term ::= 0 | 1| x (a variable) | (term1 + term2) | (term1 * term2)

formula ::=  (term1 = term2) | (not formula) | (formula1 or formula2) | (formula1 and formula2)
 | (forall x formula) | (exists x formula)

If we let rho denote an assignment of a natural number to each variable then Tarski defined the
meaning or value V[term]rho and V[formula]rho recursively by equations such as

V[x]rho = rho(x)

V[term1 + term2]rho = V[term1]rho + V[term2]rho


V[forall x formula]rho = True if for all natural numbers n we have V[formula]rho[x:=n] = True and False otherwise.

Here we have a clear distinction between syntax and semantics.  The syntax consists of the formal expressions and the semantics is the relationship between the formal expressions and the actual natural numbers.

These definitions of semantic values are Platonic in the sense that we, the mathematicians, are taking the natural numbers to actually exist and are taking English statements such as "for every natural number n we have ..." to be meaningful.  Of course Platonism extends beyond natural numbers --- working mathematicians adopt a Platonic stance toward all of the objects of mathematics.

Platonism seems essential to the clear-headed thinking of Gödel and Tarski both of whom proved fundamental (true) theorems in mathematical logic such as the statement that no finite axiom system for the numbers can be both sound and complete.  Such theorems cannot even be stated without reference to the actual, Platonic, natural numbers.  But neither can any normal arithmetic statement, such as Fermatt's last theorem.  Fermatt's last theorem is, after all, about the actual numbers.

Formalism:  I firmly believe that my brain performs classical (possibly stochastic) computation (I do not believe that my brain exploits quantum computing in any significant way).  Hence, whatever access I have to the actual natural numbers must be explainable by a computational process taking place in my brain.  The best model I know for what such a computation might look like is the manipulation of symbolic expressions.  It seems to be true that the axioms of Zermelo-Fraenkel set theory with the axiom of choice (ZFC) characterize the set of theorems that Platonic mathematicians accept as true.  So at some level the formalist position --- that mathematics is ultimately just the manipulation of formal expressions --- must be true.

Synthesis: Many years ago I was discussing Tarskian semantics with Marvin Minsky and he remarked that it all seemed silly to him --- Tarski is just saying that the symbol string "snow is white" means that snow is white.  Paraphrasing Minsky --- Tarskian semantics is just a way of removing the quotation marks.  Once the quotation marks are removed we become Platonists --- with the quotation marks removed we are talking in the normal (Platonic) way.

This can be made a little clearer, I think, by considering a mathematician robot who, by construction, thinks by formal inference in a formal rule system.  Now, I argue, one can still image both Platonist and formalist Robots.  When thinking about arithmetic the Platonist robot manipulates formulas which are "about" numbers such as 

There do not exist natural numbers x,y,z,w with w greater than 2 such that x^w+y^w = z^w.

I have written an English sentence, but there is no reason an English sentence can't be a formula (see my previous post on Wittgenstein's public language hypothesis).  The formalist robot thinks about manipulating formulas.  The formalist robot manipulates formulas such as

If P is derivable and P -> Q is derivable then Q is derivable.

Again this is an English sentence just like the Platonic formula except that this sentence (this formula) is about formulas rather than about numbers.  In early treatments of logic formulas were taken to be symbol strings rather than abstract syntax trees.   A statement of Tarskian semantics is simply another formula (in English) defining a value function which maps terms to numbers and formulas to truth values.

Bottom Line: My position is that Platonic reasoning is simply formal reasoning taking place in a native mentalese.   Following Wittgenstein, I think that mentalese is closely related to the surface form of natural language.

When I think about the vector space R^100 there is no magical or mystical connection between my brain and an actual 100 dimensional vector space (Roger Penrose seems to think otherwise).  When I think about my mother there is no magical or mystical connection between my brain and my mother (many people think otherwise). Tarskian denotational semantics is a process of erasing quotation marks --- of mapping external formal symbol strings into the language of mentalese.  This said, the Platonic treatment of formal logic is essential and central to the understanding of logic and Marvin should not be calling it silly.  (Marvin, if you read this please forgive me for my 30 year old possibly misremembered quotation).

When we talk about semantics from an NLP perspective we should be talking about a translation into some kind of mentalese and we should be striving for an understanding of that mentalese.

Thursday, August 15, 2013

Wittgenstein and Mentalese

I first want to say that I edited my last post to replace the phrase "vector types" with "vector categories". I think the NLP community is more comfortable with the term category, as in "syntactic category".  The phrase "vector categories" may have a better change of going viral in the NLP community.

This post is on Wittgenstein and Mentalese.   For those not familiar with phrase, "mentalese"refers to a hypothetical language of thought --- some form of symbolic representation of propositions and beliefs manipulated by the computational process of thought.  Mentalese seems related to the concept of an interlingua in machine translation or logical forms in linguistics.  One interpretation of "semantics" is the relationship between the surface form of language and internal mentalese expressions.  Before going any further, the first sentence of the lead story at as of this writing is:

President Obama said Thursday his government "strongly condemns" violence in Egypt, and he is canceling U.S.-Egyptian military exercises that had been scheduled for next month.

What should we take the internal representation of this sentence to be?  I have argued in previous posts that semantics should be viewed as a relationship between sentences and a database model of reality.  The above sentence presents a new fact (I will assume that it is true) that Obama said what he is claimed to have said.  Reading this sentence might cause me to update my model of reality.  I have argued in previous posts that entities can simply be symbolic tokens such as entity-3687 which occur in relations  such as

entity-3687 is an organization
entity-3687 is named "the Muslim Brotherhood".

We might generate a database representation of the above sentence as follows where I will write entity tokens as symbols starting with *.

*E is an event
*E is the referent of " *Obama said *P "
*E happened on *D
*P states *G strongly condemns *V
*P states *Obama is canceling  *Exc

The reader may already be familiar with the entities *Obama, the day Thursday *D, Obama's government *G, the current violence in Egypt *V, and the scheduled exercises *Exc.  If these entities are well established then perhaps the above relations are all that need to be added to the database. Note that in this case the reference resolution drops most of the verbatim wording of the sentence.

I mention Wittgenstein here because of his emphasis on "public language" as opposed to "private language".  I like the emphasis on public language because it seems plausible to me that the relations of the database model of reality are actually just fragments of public language (of, say, English) with certain phrases replaced by entity tokens.  In this view mentalese is essentially just public language --- there is no mysterious mentalese.  This seems consistent with "shallow" machine translation and the apparent lack of a need for an interlingua.  Translation can be viewed as paraphrase --- a direct substitution of one surface form for another.  Thought itself may take place in a similar manner --- as a direct manipulation of surface forms.

I heard Chris Manning take the position at a panel on language understanding that dependency parses were perhaps an adequate representation of semantics.  I would add reference resolution --- a dependency parse with resolved references seems to go a long way toward meaning.

Of course there  is a large literature on tense, aspect, modality, counterfactuals, and quantification.   Rigorous mathematical thought must involve some representation of mathematically rigorous statements.  I don't know how to reconcile these things with the idea that mentalese might be just fragments of public language.  However I find it plausible that some reconciliation exists --- for example, that precise mathematical thought can be carried out in fragments of English.  I certainly take the public language hypothesis seriously.

Wednesday, August 7, 2013

vector semantics or vector categories?

The basic idea in vector semantics is that the "meaning" of a word (or phrase or sentence) can be expressed as a vector.  For a long time I thought that this was just silly.  However, I have recently become more sympathetic to the work done in this area.  It seems to me, however, that what is really being done is "vector categories" rather than "vector semantics".

Before going any further I will look up a sentence from  The first sentence of the lead story at the moment of this writing is:

With his decision to cancel next month's planned summit with Russia's President Vladimir Putin, President Obama took a principled stance that has lawmakers on both sides of the aisle cheering him for finally standing up to an international bully.

I believe that reference is fundamental to semantics.  I think reference should be viewed as a mapping from a mention (a phrase in a sentence) to an entity in a database model of reality.  In the above sentence the phrases "Obama", "Putin", "next month's summit", "lawmakers"and "his decision" all refer to entities that the author seems to expect readers are already familiar with. The phrases "took a stance" and "stood up to" seem to be co-referential with "his decision" and, at the same time, assert properties of, or descriptions of, the entity (the action).  The reference interpretation of semantics is emphasized in a recent post on Google's research blog.

Vector semantics seems to be in direct conflict with modeling reality by a database and taking reference --- the mapping from mentions to database entities --- as core to semantics.  However, if we interpret the vectors as vector categories  rather than as a denotational meaning (a referent) then these vectors seem quite useful.

My first exposure to vector semantics was in the context of latent semantic indexing (LSI) which assigns a vector to each word by doing singular value decomposition on a word-document matrix. The vector assigned to a word in this way can perhaps be viewed as a representation of a distribution over topics.  Indeed we might expect "was not cancerous" and "was cancerous" to have the same topic distribution --- they both will be common in certain medical diagnosis documents --- even though they have opposite meanings.

A more interesting interpretation of vector semantics (in my opinion) was pointed out to me by
Michael Collins.  He has been doing work on using spectral methods for learning of latent syntactic subcategories.  Here we assume that each standard syntactic category, such as NP, has some latent set of subcategories --- perhaps "person" "place" or "thing".  We can then assign a vector to a phrase where the vector can be interpreted as assigning a score for each possible latent subcategory.

Vector representations for the purpose of scoring "grammaticality" naturally yield factored representations analogous to formal models of English Grammar based on feature structures.
The idea is that, for the purpose of determining grammaticality, the syntactic category of a phrase has features such as gender, person, and number.  Feature structures can be viewed as a simple kind of vector with discrete values along each coordinate.  This interpretation of vector semantics is especially suggested by the observation that certain directions in the vector space can be interpreted as gender as in the equation Phi(queen) = Phi(king) - Phi(man) + Phi(woman) where Phi(w) is the vector associated with word w [paper].

Perhaps even more interestingly, we can abandon any direct interpretation of the vectors associated with phrases other than that they be useful in parsing and then discriminatively train parsers which compute latent vectors at each node.  The vectors can be even be computed by deep neural networks trained for this purpose [paper].  While the precise meaning of the vectors remains obscure, the fact that they are trained to produce accurate parsing would seem to imply that they encode information relevant to the selectivity of words for arguments --- what things do people "give" or "eat".  Many parsers use bilexical statistics --- actual verb-object pair statistics --- but abstracting the words to vectors should allow greater generalization to unseen pairs.

It has also been proposed that vector semantics can usefully encode relations analogous to the relations of a database [paper].  Although vectors, like databases and bit strings, are information-theoretically universal (a single real number can hold an infinite number of bits), I do not believe that vectors should replace databases as a model of the facts of reality.  However, it does seem possible that vectors have an important role to play in uncertain inference such as that modeled by Markov logic Networks.  A Markov logic network is essentially a weighted SAT formula where the sum of the weight of violated clauses is interpreted as an energy (cost) for any given truth assignment to the Boolean variables.  Each Boolean variable typically has some internal structure, such as R(b,c) where R is a relation and b and c are entities.  We can assign entities vector categories and assign R a matrix for combining the categories of its arguments.  We can then compute an energy R(b,c) where this energy intuitively represents a negative log prior probability of R(b,c).  In this way vector categories could play an important role in some variant of Markov logic and could even be trained so as to produce correct inferences.

The bottom line is that the term "vector categories" seems more appropriate than"vector semantics". Semantics remains, for me, the relationship between language and a database model of reality.