Thoughts on Artificial Intelligence - comments
Blog by David McAllester (comment feed retrieved 2014-05-03)

Comment, 2014-04-09:

"Neural networks are a Turing-complete model of computation. Furthermore, there exists some (perhaps yet to be discovered) universal algorithm for learning networks which can account for learning in arbitrary domains, including language."

One of these algorithms might have been discovered. So far, from my testing and work, it seems to meet the criteria for creating language, visual processing, creativity, emotions (with dopamine), and much more.

Our cortex is composed of hundreds of thousands of cortical cell columns, and except for the motor strip they all have the same six-level micrograph appearance. The EEG over these columns is the same wherever they are: alpha, beta, theta. Cyberneticists believe there is a single active principle doing the same process throughout the cortex, according to a Bayesian analysis for predictive values.

http://anilkseth.wordpress.com/2014/02/03/all-watched-over-by-search-engines-of-loving-grace/#comments

This might well be the Comparison Process, which so far has met the criteria. It has given endless new insights into understanding the brain, language, and much else. You might find it interesting: Google "Le chanson sans fin" on wordpress.com. The system works, one simple comparison process, and it explains motivations, emotions, and much more, simply from a brief look at what makes humor work.

Herb Wiggins (ret.); Diplomate, Am.
Board of Psychiatry/Neurology

Comment by David McAllester, 2013-10-08:

In my post I made the implicit assumption that we are only discussing C++ programs which define an input-output function (a predictor); I am restricting the class to C++ programs that halt. Furthermore, I took the complexity (bit length) of a program to be the number of bytes in its source code rather than its Kolmogorov complexity. To improve the bound we could use a standard compression algorithm, such as gzip, to compress the source code and use the compressed bit length as the complexity of the predictor. Here there are no issues with the undecidability of Kolmogorov complexity. The free lunch theorem, like the no free lunch theorem, is information-theoretic: it ignores computational issues.

Comment by Shubhendu Trivedi, 2013-10-05:

This is very interesting and made me think quite a bit. Thank you. I only understood (I think) what you were saying after your post on "Chomsky versus Hinton".

However, I might be missing something, so I am going to rephrase the main idea of the above in a language that I am comparatively more comfortable with. I would really appreciate it if you could correct me if I am wrong.

I don't know any learning theory, so I don't see the no-free-lunch theorem as "for every learner there exists a distribution on which it fails" but rather as a simple corollary of the following incompleteness theorem in Kolmogorov complexity.
Theorem (Chaitin): There exists a constant L (which depends only on the particular axiomatic system S and the choice of description language) such that there is no string s for which the statement "the Kolmogorov complexity of s is greater than L" (as formalized in S) can be proven within S.

Stated informally: there is a number L such that we cannot prove that the Kolmogorov complexity of any specific string of bits exceeds L.

In particular, you can think of any learning machine as a compression scheme of some fixed Kolmogorov complexity: take the most optimal program for that machine (the programming language is irrelevant as long as it is Turing complete) and consider its description length in bits; that is its Kolmogorov complexity. For any such learning machine you can construct a data string whose Kolmogorov complexity is greater than that of the machine, i.e. a data string that is random from the machine's point of view. In short, if the data string has Kolmogorov complexity K and your learning machine has Kolmogorov complexity k with k < K, then you will never be able to find any structure in the data string, i.e. you can never show that it has Kolmogorov complexity greater than k. This means the data string given to our learner is structureless (or, said differently, the function is structureless for our encoding scheme). Thus no free lunch is just a simple diagonalization argument: one can always construct a string that is Kolmogorov random, i.e. structureless, with respect to our encoding scheme (it has Kolmogorov complexity greater than that of our program).

From your Hinton post I got the following (something that I do in fact believe to be the case, perhaps a Platonist viewpoint): that there is no reason to believe that structureless functions exist, or stated differently, that it is less informative to observe that structureless functions exist.
That is: if the data is generated by a computational process, then there is no reason to believe that it is structureless (or, equivalently, we are only talking about computable functions). And if this is the case, then you would always have a machine in your library of learning machines that will be able to find some upper bound on the Kolmogorov complexity of the data string (much less than that of the machine itself). This search might be done using Levin's universal search, for example.

I have not looked at the paper, only at the post here, but then your free lunch theorem says: if a function in our "universal" library does well on the training set, then with high probability it will do well on the test set. Viewed a bit differently, this can be used to argue that a "universal learning" algorithm exists.

In general the argument is that using no free lunch as a justification for restricted concept classes is bad theory. From what I understood, I completely agree. This is of course predicated on whether I understood you correctly.

Shubhendu Trivedi
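The compressed-code idea discussed above (gzip the predictor's source and use the compressed bit length as its complexity) can be sketched with a standard Occam-style generalization bound. This is an illustration only: the `predictor_source` string, the error values, and the sample size are made-up inputs, and the constants are the textbook Occam/Hoeffding ones rather than those of the free lunch theorem in the paper.

```python
import gzip
import math

def compressed_bits(source_code: str) -> int:
    """Bit length of the gzip-compressed source, used as a computable
    stand-in for the predictor's description length (Kolmogorov
    complexity itself is uncomputable)."""
    return 8 * len(gzip.compress(source_code.encode("utf-8")))

def occam_bound(train_err: float, bits: int, n: int, delta: float = 0.05) -> float:
    """Occam-style high-probability bound: assigning a predictor a code of
    `bits` bits and union-bounding Hoeffding over all programs gives, with
    probability at least 1 - delta,
        err <= train_err + sqrt((bits * ln 2 + ln(1/delta)) / (2 n)).
    """
    return train_err + math.sqrt((bits * math.log(2) + math.log(1 / delta)) / (2 * n))

# Hypothetical predictor and training-set statistics, for illustration.
predictor_source = "double predict(double x) { return x > 0.5 ? 1.0 : 0.0; }"
b = compressed_bits(predictor_source)
print(b, occam_bound(train_err=0.10, bits=b, n=50000))
```

The point of the sketch is that the bound depends only on the compressed length, so no undecidable quantity is ever needed.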
Comment by Pierre Alquier, 2013-09-18:

Thanks for this tutorial, very useful!

I agree with Sébastien that it would be nice to make PAC-Bayes theory more accessible. I recently read this paper:
http://arxiv.org/pdf/1306.6430v1.pdf
It seems that (most) researchers in the Bayesian community are not aware of the PAC-Bayesian approach. There would probably be a mutual benefit if they were.

Pierre Alquier (http://alquier.ensae.net/)

Comment, 2013-09-15:

I don't know anything about NLP. What do you mean by "database"?

When I think of "database," I think of things like MySQL. It seems to me, at least from introspection, that we have a sort of lossy, associative memory. Similar memories seem to blend into each other (I sometimes begin walking to where I parked my car the day before instead of the place I last parked it), and thinking about one thing seems to automatically bring to mind related things (when I think of bacon, thoughts of eggs and all sorts of things related to "breakfast" enter my mind).

More on topic: when I see the word "Syria," facts and events that I've read about come to the forefront of my mind. I associate negative things or complex problems with the word "situation," so when I read "Syria situation," the set of thoughts at the forefront of my mind is further refined to the conflict and politics related to Syria. Reading the words "Obama" and "antiwar movement" further refines and augments the set of things I'm thinking about.
So what I guess I'm trying to say is that maybe a weighted set of thoughts brought to consciousness by the words read, together with an associative memory, is what defines the "Syria situation"?

Comment by David McAllester, 2013-09-09:

I wrote the post on Tarski and mentalese largely in response to this comment, but I think I should also respond more directly. There is considerable work these days on cross-document and even cross-lingual coreference. The current trend in coreference is to work with mention-entity models rather than mention-mention models. In a mention-entity model of coreference across thousands of documents, the "entities" act like semantic referents. This is especially true if we take the entities directly from a database such as Freebase, as is done in http://googleresearch.blogspot.com/2013/07/11-billion-clues-in-800-million.html

I expect that when Chris says that a dependency parse is largely adequate, he means that it is adequate for applications which would seem to require semantics, such as translation and perhaps even entailment. But for me "semantics" should include reference resolution in the sense of the above URL, with Freebase replaced by a mentalese database.

Comment, 2013-09-05:

I guess "free will" is hidden in chaotic systems. For a deterministic chaotic equation system x(t), the positions x(t) are not computationally predictable, since a little error will result in a big difference.
Zhiyong Wang, from Dalian

Comment by lingpipe (lingpipe-blog.com), 2013-08-29:

I think "formal semantics" set up generations of linguists to confuse syntax with semantics. Semantics is about meaning and how language connects to the world; syntax is about syntactic relations. Most linguists really just mean rich syntax when they talk about "semantics".

Dependency and other parses are syntax. If I say "It ran", a dependency parse will tell me that "it" is the subject of the verb "ran" and, if I'm lucky, that "ran" is "run" in the past tense. It doesn't tell me what "It" refers to or which sense of "ran" is intended (tear as in stockings, drip as paint or an ice cream, locomotion as with an animal, etc.).

Any idea what Chris meant when he said that dependency parses were adequate? They're certainly not adequate for understanding what someone means by a sentence.

I also think of anaphora as primarily syntactic. If I say "John ran. And he jumped." and you tell me that "he" corefers with "John", that's still not semantics, because the language isn't being linked to the world. Now if I told you who or what "John" referred to, that would be semantics. Or if I told you what it meant to jump.

Comment, 2013-08-02:

It would be interesting to have your thoughts on Wittgenstein.
I basically agree, in the sense that I don't believe there is any difference as such between using "language" to communicate about "real" objects in the "physical world" and using mathematics, which is just a specialized linguistic activity, to deal with, and even create, objects that are purportedly more abstract.

Quoting Yuri Manin from "Mathematics as Metaphor": "What are we studying when we do mathematics? A possible answer is this: we are studying ideas that can be handled as if they were real things."

ST.

Comment by jld, 2013-07-27:

Not new to me; as I said long ago:
"the only ...Not new to me, as I said long ago:<br />"the only purpose of objects is to serve as carriers of properties in our discourse"<br /><a href="http://www.kevembuangga.com/blog/comment.php?comment.news.10" rel="nofollow">Objects as epistemological artifacts</a>jldhttp://www.blogger.com/profile/17072092162516537909noreply@blogger.comtag:blogger.com,1999:blog-8258536764898384280.post-22646744844876649332013-07-10T16:23:24.411-07:002013-07-10T16:23:24.411-07:00The reference to Zipf-like distribution on feature...The reference to Zipf-like distribution on features reminds me of <a href="http://aclweb.org/anthology/N/N13/N13-1077.pdf" rel="nofollow">Zipfian corruptions for robust POS tagging</a> by Anders Sogaard.halhttp://www.blogger.com/profile/02162908373916390369noreply@blogger.comtag:blogger.com,1999:blog-8258536764898384280.post-13932760591439983052013-07-10T14:17:26.506-07:002013-07-10T14:17:26.506-07:00The PAC-Bayesian theorem allows the "posterio...The PAC-Bayesian theorem allows the "posterior" to be defined --- it is not determined by the prior and the data as it would be in a Bayesian setting. In the proof of the dropout bound the dropout rate in the posterior is defined to match that in the prior.<br /><br />But Yoshua's point is still interesting in the PAC-Bayesian setting. One would expect that a tighter bound could be produced by allowing the dropout rate in the posterior to be different from that in the prior. One might expect the optimal posterior dropout rate to be lower as the amount of data increases.<br /><br />However, even in a Bayesian setting the posterior dropout rate might remain high for most parameters. Consider a linear predictor operating on sparse features (consider one unit in a DNN). If we assume a Zipf-like distribution on feature activation then no matter how much data we have there will be features that have just enough data to train weights. The Bayesian dropout rates on these "rare" activations might remain high. 
As the sample size increases, different features become "rare", but we then always have rare features for which the posterior dropout rate is high. Perhaps the dropout rate of a unit should depend on that unit's activation frequency, or on the L2 norm of all of its output weights.

Comment by Sebastien Bubeck, 2013-07-10:

It is a wonderful idea to make Catoni's ideas more accessible. This could greatly benefit the ML community.

I'm also looking forward to seeing what you have to say about the Singularity ;).

Comment by Yoshua Bengio, 2013-07-10:

This is very interesting; thank you for making this connection. What I find interesting is that it seems to be a case where traditional Bayesian interpretations do not apply naturally. Indeed, if we interpret the action of sampling from selected subsets of the network as sampling from a posterior, it means that the posterior (and presumably the prior) would have point masses at sparse weight vectors (where the whole set of output weights of a particular hidden unit is set to 0, independently for each hidden unit). The posterior would also presumably have a major mode corresponding to the usual parameter configuration (with a variance decreasing as the amount of training data increases).
As the amount of training data increases, one would however expect the correct posterior to put less and less probability (exponentially so) on the sparse parameter vectors (corresponding to hiding some of the hidden units) and more and more on the maximum likelihood solution. However, experiments with dropout suggest that even with large datasets it is better to keep the dropout rate near 50%. The exponential rate at which the optimal dropout rate should approach 0 (not dropping anything) would depend on how well a network with randomly mutilated hidden units performs on the training data, compared to one without dropout.

Maybe I am wrong and the dropout rate should indeed be reduced exponentially fast as the amount of data increases. I would be curious to know, of course, if someone actually does the experiment. The comment also applies to some extent to the PAC-Bayesian approach, which suggests that the dropout rate should be decreased as the amount of data increases. I suspect that although there might be some truth to this, there are probably other effects at play that make it better to stay in the regime around 50% dropout.

Yoshua Bengio
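For readers following the dropout discussion above, here is a minimal numpy sketch of the mechanism being debated: inverted dropout on the inputs of a single linear unit. The unit size, the 50% rate, and the sample counts are illustrative assumptions, not taken from the thread or from any of the papers mentioned.

```python
import numpy as np

rng = np.random.default_rng(0)

def unit_forward(x, w, rate=0.5, train=True):
    """Pre-activation of one linear unit with inverted dropout on its inputs.

    During training each input is zeroed independently with probability
    `rate`, and the survivors are rescaled by 1/(1 - rate), so the expected
    pre-activation equals the deterministic (no-dropout) forward pass.
    """
    if train:
        mask = (rng.random(x.shape) >= rate) / (1.0 - rate)
        x = x * mask
    return x @ w

# For a fixed input, averaging over many dropout masks recovers the
# deterministic pre-activation: the rescaling keeps the unit unbiased.
x = rng.normal(size=50)
w = rng.normal(size=50)
clean = unit_forward(x, w, train=False)
noisy = np.array([unit_forward(x, w, rate=0.5, train=True) for _ in range(20000)])
print(clean, noisy.mean())  # the two values should be close
```

The experiment Bengio proposes would amount to training such units at several dropout rates while growing the dataset, and checking whether the best rate drifts toward 0 or stays near 50%.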