The biggest reason I immediately switched to Google when I first tried it was that it only matched documents containing exactly the words I searched for, nothing more.
It's fine (and good) if search engines add more intelligence like this, but I'll always need a way to search for exact phrases. The default behavior of Google is much "fuzzier" than it was five years ago, so I'm surprised they don't already do something like this (or do they?).
That's a good point. It's sometimes difficult trying to search for programming topics now. "Did you mean UIView?" No, Google, I really did mean NSView.
Yes but that's a different message ("Showing results for..."), and there's a link to force a search for the original term. For me it's right at least 50% of the time that it does this (probably much more, but I can confidently say >50%), so it's a net win for me personally.
You're right; I remembered that one wrong. It usually does show "Showing results for ... UIView, click here to see results for ... NSView". Sometimes, though, a query gets changed without the option to search for what I typed, and I have to quote it.
The interviewer wasn't terrible. If the interview were being shown to a primarily geek audience, your argument would make sense. Given that the audience of that video could be a mix of geeks and non-geeks, I think it was appropriate. After all, if he starts talking in terms of Markov chains, how many people in his audience are going to understand that?
I'm not asking the interviewer to talk about Markov chains, or even know much about them. I'm asking that he be able to form a coherent question on the subject. You would expect an interviewer asking about politics to at least be able to ask reasonable questions about politics and not say something like: "so how long have you been interested in politicizing, and voting, and things like that?". A good interviewer can tailor the interview to the audience; a bad one can hardly hold his own (this guy was an example of that).
I think he wasn't very good. He was obviously underprepared, or he would have had a couple of nicely phrased questions in his head. For a media professional, he was just too inarticulate.
Dude, it's his job to form coherent, well-structured sentences in ad hoc interviews; it's in the job description. My impression was that this guy doesn't take his job seriously, doesn't know jack about the field, or isn't cut out for that line of work.
Why call it anything? Simply responding with a clearly articulated, contrary opinion is more than enough. I agree with your opinion on the matter, however.
You're probably right. I could have left out the ignorance part. But in the last couple of weeks, this is the fourth instance where I've read someone jumping to a conclusion and making a rash judgement about something they have no experience with. I refrained from commenting the earlier three times, but this time it just really irked me.
If I have to see another story with 'XYZ YEAR OLD BUILDS XYZ', I'm going to go thermonuclear war on HN. Adolescents have been exposed to high speed internet and technology before they can even remember. What do you expect is going to happen, honestly?
While the internet was probably a good source of information, I'd put it down to mentorship as the decisive factor. In all of these cases, the successful teenagers seem to be lucky enough to either have a good private mentor, be enrolled in a mentorship program, or attend a high school famous for the encouragement and resources it provides to its enterprising students.
In contrast many of us would have been told to "aim for a simpler project" if we proposed something like that to our computer science teachers.
Your post summarizes Malcolm Gladwell's book Outliers on extremely successful people quite well -
it's mostly being born to a) rich parents who can afford to pay for extra-curricular activities and good schools, or who have the time to "improve" their kids, or b) being around schools or universities that can recognise the talent and give talented kids the input and resources they need. Both a) and b) are based on pure luck.
Rich parents, exclusive school (with their own computer! unheard of at the time) and local companies that let Gates and other kids screw around on their computers. Luck!
I both agree and disagree with your post. I do agree that people's discoveries and successes are very much based on luck (i.e. external factors), but I don't think personal accomplishment, progression, and overall contribution is luck based. Sure, becoming a Mark Zuckerberg takes luck, but the only difference between him and other people (assuming sufficient time in a certain subject) is that the general public liked and found a use for Zuckerberg's idea more. Essentially, success is luck-based, but genius is not.
I do think that without hard work and talent Bill Gates and all these other successful people wouldn't be where they are now, but without luck and the right circumstances, all the hard work in the world won't get you anywhere; your genius won't help you.
Look at William S. Burroughs' life: an extremely talented writer from a rich family who still "dropped out of society" and later befriended Kerouac and Ginsberg, who were the big factors (reasons?) in him becoming a great writer from age 40. Without these people we wouldn't know the name; his genius alone wouldn't have helped him.
I don't like this term "genius", as there is no such thing. We need to move on from the idea of "being a genius", a modern phenomenon originating from the older phrase to "have a genius", where a genius is Latin for an attendant spirit (the root of "genie"). When we call someone a genius, we're saying that they are a mythical all-knowing being, which seems to absolve us from having to think about why a person is the way they are, and why we are not. It is the intellectual equivalent of burying our heads in the sand when we see someone brilliant, and it ensures that we'll never really understand or appreciate success.
The term "genius" bothers me too, because calling someone a genius seems to be less about the object (i.e. the person we are talking about) and more about the speaker's feelings toward that object (e.g. godlike awe).
It's a good point. Had I had all the resources I have now, I imagine I would have made all sorts of things at that age.
Even without them I was building things like Enigma machines, games and such in QBasic.
This isn't about belittling this guy's achievements, but about realizing that, with the free software movement and the internet in general, the barrier to entry has dropped so much that stories like this are naturally going to become increasingly common.
I absolutely agree with your sentiments on the matter. When you're a child you're essentially a sponge; I wonder what teenagers like him will be producing within the next two decades.
It's one of those ideas that, after you hear it, you think: yeah, that makes total sense, why isn't this already being done? Very cool project to be working on while still in high school.
Who says it isn't already being done? I'm pretty damn sure Google and Bing do some sort of topic modeling (LDA, LSA or something else) that basically does the same thing - though the exact method is not the same.
Watching the TED talk, he seemed to be comparing his search results to that of some academic search engine, which are generally very dumb and purely keyword based. Google does a whole lot more than that. Kudos to the guy for figuring this all out on his own, but I don't think this research is as original as portrayed by the media here.
It's not novel work. It's a toy implementation of standard technology. It's impressive for a young person to understand and synthesize the components, but it's an engineering practice project, not an innovation.
Armchair analysis of his algorithm after watching his TED talk: a version of LSA that uses PageRank instead of a straight SVD to calculate rankings.
LSA[1] has been around since the 80s and is used in many applications from GRE testing to Apple's junk mail filtering[2]. It's used a lot since the patent expired, it's relatively good and can be computed quickly. Of course, a lot of text-retrieval research has happened in the past few decades, one of my favorites being LDA[3] which relies on a much more sound statistical basis than finding lower-dimensional representations of term-document vectors. Unfortunately LDA's model is not directly computable and answers must be determined via Monte-Carlo methods.
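In case anyone hasn't seen LSA in action, the core of it fits in a few lines. Here's a minimal sketch with numpy (the corpus, vocabulary, and number of latent dimensions are all invented for illustration): build a term-document count matrix, take a truncated SVD, and compare documents in the reduced space.

```python
import numpy as np

# Toy corpus: each document is a bag of words (invented for illustration).
docs = [
    "cat feline pet",
    "dog canine pet",
    "stock market finance",
    "finance market trading",
]
vocab = sorted({w for d in docs for w in d.split()})
w2i = {w: i for i, w in enumerate(vocab)}

# Term-document count matrix A (terms x documents).
A = np.zeros((len(vocab), len(docs)))
for j, d in enumerate(docs):
    for w in d.split():
        A[w2i[w], j] += 1

# Truncated SVD: keep k latent "topic" dimensions.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
doc_vecs = (np.diag(s[:k]) @ Vt[:k]).T  # documents in the latent space

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# The two animal documents land close together in the latent space,
# and far from the finance documents.
print(cos(doc_vecs[0], doc_vecs[1]))  # close to 1.0
print(cos(doc_vecs[0], doc_vecs[2]))  # close to 0.0
```

Real systems also apply tf-idf or entropy weighting to A before the SVD and use far more dimensions, but the mechanics are the same.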
As for 'independence,' his terminology gets a little confused here. At first I thought he was talking about the 'bag-of-words' assumption that most large-scale language models have. These effectively ignore grammar (other than stemming) in order to efficiently determine the 'gist' of a document without its intricacies. However, his videos imply he is talking about word-sense disambiguation[4], which is certainly known about and was the crux of LSA in the first place. If he is talking about lifting the bag-of-words assumption, there has been some interesting work going on, such as [5] (disclaimer: I am a coauthor on that paper).
If you're interested in this stuff, I highly recommend trying out the LSA demo server at [6] (it can get swamped sometimes so don't kill it) and David Blei's LDA implementation at [7]. The LDA-C inputs and parameters are a little obtuse when you first look at it, and I don't have my notes on how to use it at the moment but if you play around with it it should make sense.
This kid is crazy smart, and I hope he gets exposed to a lot of really cool research since he can obviously pull off a lot at a young age. Best of luck to him.
From what I have understood by watching his TED talk, the algorithm he designed seems very similar to papers published in the NLP field recently [1] [2] [3]. I find it quite elegant, and I agree that the kid is really smart.
The idea behind those algorithms is to build a graph on words, using WordNet or a corpus such as Wikipedia. Edges are added between words which are semantically close, or which often appear in the same documents. Then, to compare two words (or two bags of words), you compute the limiting distributions of the two random walks starting at each of the words. Those random walks mostly explore nodes close to their starting nodes, so similar words end up with similar limiting distributions.
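A toy sketch of that idea (the graph, edge weights, and restart probability are all invented; real systems derive the graph from corpus statistics):

```python
import numpy as np

# Toy co-occurrence graph over 5 words (symmetric edge weights, invented).
words = ["jaguar", "car", "engine", "cat", "feline"]
W = np.array([
    [0, 1, 0, 1, 0],   # "jaguar" co-occurs with "car" and "cat" (ambiguous)
    [1, 0, 1, 0, 0],   # "car" - "engine"
    [0, 1, 0, 0, 0],
    [1, 0, 0, 0, 1],   # "cat" - "feline"
    [0, 0, 0, 1, 0],
], float)
P = W / W.sum(axis=1, keepdims=True)  # row-stochastic transition matrix

def walk_distribution(start, restart=0.3, steps=50):
    """Limiting distribution of a random walk with restart from `start`."""
    e = np.zeros(len(words)); e[start] = 1.0
    d = e.copy()
    for _ in range(steps):
        d = restart * e + (1 - restart) * d @ P
    return d

d_cat = walk_distribution(words.index("cat"))
d_feline = walk_distribution(words.index("feline"))
d_engine = walk_distribution(words.index("engine"))

def overlap(a, b):
    """Crude similarity between two distributions: shared probability mass."""
    return float(np.minimum(a, b).sum())

# "cat" and "feline" share a neighbourhood, so their walks look alike;
# "cat" and "engine" only connect through the ambiguous "jaguar".
print(overlap(d_cat, d_feline), overlap(d_cat, d_engine))
```

The restart term keeps the walk anchored near its starting word, which is what makes the limiting distribution a usable "signature" of that word's neighbourhood.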
You are correct. My algorithm is conceptually rather similar to a number of ones recently published. The work closest to mine in the literature is [1], by Lafferty and Zhai at CMU in 2001.
That said, my method differs somewhat from these in that it explicitly treats unlinked documents as distributions over a graph of words, and in the theoretical framework (based on a theoretical process for document generation) employed to derive it.
At their scale I doubt they are using LSA or anything of similar complexity. My guess would be something like tf-idf coupled with anything from naive Bayes to logic programming for entity recognition.
Final output would likely use some hand-curated rules on how to combine the above with information from curated databases like DBpedia.
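tf-idf itself is only a few lines. A toy sketch (corpus invented; real engines use one of several smoothed idf variants, but the shape is the same): a term is weighted up if it's frequent in this document and rare across the collection.

```python
import math
from collections import Counter

# Toy corpus, invented for illustration.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "stocks fell on the market news",
]
tokenized = [d.split() for d in docs]
N = len(tokenized)

# Document frequency: in how many documents does each term appear?
df = Counter()
for toks in tokenized:
    df.update(set(toks))

def tfidf(doc_tokens):
    """Term frequency x inverse document frequency for one document."""
    tf = Counter(doc_tokens)
    return {t: (tf[t] / len(doc_tokens)) * math.log(N / df[t]) for t in tf}

scores = tfidf(tokenized[0])
# "the" appears in every document, so its idf (and tf-idf) is zero;
# rare terms like "mat" score highest.
print(sorted(scores.items(), key=lambda kv: -kv[1]))
```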
There is this thing called random projection, based on the observation that in high dimensions, most directions are nearly orthogonal to each other. This is a wonderful observation because it allows us to write stupendously simple (and fast) algorithms that give good results for mathematically sound reasons.
I strongly suggest anyone who is interested in LSA but wants something scalable, simpler to write and faster to look into Random Indexing.
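To make the idea concrete, here's a quick numpy sketch (dimensions picked arbitrarily): project two correlated high-dimensional vectors through a random Gaussian matrix and check that their cosine similarity barely changes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two correlated high-dimensional vectors (dimensions invented).
d, k = 10_000, 1_024          # original and reduced dimensionality
x = rng.normal(size=d)
y = x + 0.5 * rng.normal(size=d)

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Random projection: a k x d matrix of iid Gaussians. Because nearly all
# directions in high dimensions are close to orthogonal, this roughly
# preserves angles (and, by Johnson-Lindenstrauss, distances).
R = rng.normal(size=(k, d)) / np.sqrt(k)
cos_full, cos_proj = cos(x, y), cos(R @ x, R @ y)

# ~10x dimensionality reduction, yet the cosine moves very little.
print(cos_full, cos_proj)
```

Random Indexing applies the same trick incrementally with sparse ternary vectors, which is why it scales so well compared to recomputing an SVD.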
While all these techniques are related and LSI is a good model for what he is doing, I think his thought process is spiritually closer to PageRank. Although I don't think he merely adapted the page rank algorithm.
Based on his emphasis on Markov chains and his diagrams, I think his algorithm works by building a Markov chain from word co-occurrences. Then, when the user searches, it uses the words with the highest incoming edges to the keywords to disambiguate word meaning and cluster tweets.
I would argue based on his justification by random walks that given enough time and study he would have ended up with LDA and not LSA.
With respect to LDA, there is a new paper on arxiv,
Two SVDs Suffice: Spectral decompositions for probabilistic topic modeling and latent Dirichlet allocation [1].
The results look really similar to what used to be Google Wonder Wheel. You started with one word, and it would show you a graph of other related words/phrases.
It's not really appropriate to this particular story -- high school student vs VC-funded startup -- but what really bugs me about all "the next Google" reporting is that they ignore the biggest gotcha of search: corpus size matters.
The secret sauce of a successful search engine isn't the core algorithm which handles 90% of the work 90% of the time, it's the millions of tweaks which keep the millions of pathological cases from making every query useless. On a small enough corpus, even grep does a great job. But once you are trying to search the world, it stops being an insight problem; at scale there are so many corner cases that you can't ignore them.
In this day and age, building "X is better than well-known product Y" means nothing. You could build a car with five times the gas mileage of any car around today, and I bet you'd be hard pressed to make a dime for the first few years while competing with the likes of Ford, Toyota, or General Motors. People don't care about things being better when it comes to the web; they stick with what they know. And if you're a Google user like me, invested in Gmail, Google Docs, Google Analytics, Google AdWords, etc., then you're in too deep to switch to any other search engine.
I remember Cuil when it launched. They were onto something great, made some pretty bold claims, and in many ways were better than Google at search, and look what happened to them: nobody cared, and they died in a huge internet tire fire. Sure, the issues with poor results didn't help, and people wanted them to succeed, but let's be honest here: Cuil would have ended up like Bing (a few users but nothing to gloat about). People are too lazy to switch to anything new, especially when it comes to search; it takes time to woo a user away from another product that still does the job perfectly.
Having said that, this kid is 17 and he's done a f*cking amazing job. How many people can say that at the age of 17 they built anything remotely this cool? I'm sure some can, but not many. If he keeps on this path, he'll be achieving bigger things in his 20s and 30s, and it'll be well deserved. The interviewer was pretty bad, though; he had no clue whatsoever, which was an insult to the kid being interviewed, who deserves at least an interviewer with a remotely above-average IQ.
Cuil didn't die because people were "too lazy to switch to something new." It died because among the few who did know what Cuil was, opinions ranged from "disappointment" to "laughingstock". Remember "Cuil theory", in which degrees of disconnection from reality were measured in Cuils?
Cuil was ambitious, and that was laudable and did get it some attention. But unfortunately it was nowhere near good enough for the level of hype they built. Many of these "Better than the big guy" products have the same problem, albeit to a lesser degree — they are better by some very specific metric, but that metric isn't the one that most people associate with the product's value. In Cuil's case, relevant and accurate results were what people wanted, but Cuil was worse on those axes than ancient search engines like Mamma.com.
> Cuil didn't die because people were "too lazy to switch to something new." It died because among the few who did know what Cuil was, opinions ranged from "disappointment" to "laughingstock".
Cuil died the day it was announced. They went live with great PR fanfare and a corrupted index; everyone marginally interested in it saw it when it was broken, which is how it got its reputation as a laughingstock.
Let's not forget the Cpedia pivot, where they turned their search engine into a sort of Dadaist Wikipedia. That was probably an even bigger mistake than launching a broken search engine (and more ambitious, too).
Cuil is an interesting example, but I have a very different recollection of it.
Cuil may have been on to something great, but their technology didn't produce results. To say "nobody cared" is completely false. They launched to pretty serious fanfare. The technology press was, for a brief period, infatuated with the idea of ex-Googlers killing Google.
It was the technology that failed. The results were a mess and very often completely irrelevant. Cuil was well positioned to compete head-to-head with Google, but they failed to execute. I gave them a shot and they sent me right back into Google's arms (I did the same with DDG as well fwiw).
I think search, more than anything else, is an area that gives you the best chance of gaining a toe-hold against a monolithic company like Google. Although you'll be acquired the moment you gain any real traction:)
A few years ago, I worked at a startup peripherally related to the search space, and we did some usability studies on the Google search page. What we found then, more or less, is that people generally feel there is no problem with Google's search results, and if you pressed them to find something that Google could not surface easily, instead of thinking Google let them down, they essentially blamed themselves for not being able to find it. It's hard to sell a solution to a problem people don't think they have.
With that kind of thinking, it's really hard to build a product to go head to head with Google that is so much better it can change people's habits directly. From my perspective, you either need to somehow change the equation for some sort of query attribute (e.g. Google may not be as timely as Twitter, so I'll switch to it when something like an earthquake happens) and slowly eat away at the Google advantage, or you need to build a product that first targets the thought leaders (something like building a better code search or social search) and then make Google uncool to use.
Maybe it's difficult to topple the giants, but you should not pick Cuil as an example. Cuil was awful and it died because it put unrelated information together. It certainly did not "still [do] the job perfectly".
Cuil may have been an initial failure, but I seem to recall numerous tweaks being made to the results a few months after launch that fixed most of those issues and made it a decent search engine. They had the index and the algorithm; the priority of results just needed to be tweaked a bit.
If you want a better example of a good search engine with decent results that has failed to compete with Google, it would be Bing. Try this blind search tool, which returns results from both Google and Bing: http://blindsearch.fejus.com/. Bing often wins in terms of the most relevant results, yet it has failed to steal away any of Google's user base in the long run. Isn't that more than enough proof that people don't care whether another search engine returns better results? I think it is.
There are people out there who think Google is the Internet; how's that for brand recognition? People who don't even own computers or have internet access have most likely heard of Google at some stage in their lives.
Internet Explorer isn't a search engine, and, to undercut your statement a little, a lot of people still use Internet Explorer. Chrome is the most popular browser at the moment, and to be honest I'd be shocked if it weren't, considering how heavily Google markets it (the front page of Google, YouTube, etc.).
If Google were to advertise a worthy adversary to their search engine on the home page, I'm almost willing to bet that the competitor would score quite a few new users.
Hi, this is Nicholas, a long time lurker on HN and the person in the video. I saw this thread during my morning commute to work (and was very surprised, to say the least!) and wanted to register to mention a few important details that the news articles always omit. Hopefully this helps correct a few misconceptions!
To begin, I'd like to flatly deny that I "built a better search engine." I did my (very academic) work in information retrieval and developed a new algorithm that seems to give significantly better search results (when compared to other academic search techniques, more on this later) on short documents like Twitter tweets. Specifically, my algorithm uses random walks (modelled as Markov chains) on graphs of terms representing documents to perform a type of semantic smoothing known as document expansion, where a statistical model of a document's meaning (usually based on the words that appear in the document) is expanded to include related words. My system is in no way, shape, or form a "search engine" or even comparable with something like Google---rather, it is an algorithm that could help improve search results in a real, commercial search engine.
My work is not, by far, the first to attempt document expansion. A number of related techniques, including pseudo-relevance feedback expansion, translation models, some forms of latent semantic indexing, and some of those mentioned by exg, already exist. However, to my knowledge, the knowledge of my science fair judges (some of whom are active IR researchers), and the knowledge of my research mentor (also more on this later), my work is a novel method (not a synthesis of existing methods) that seems to work quite well in comparison to other, similar algorithms on collections of small documents like tweets.
The last point is certainly important: it is simply impossible to compare my algorithm to something like Google, for several reasons. First, I'm not a software engineer or a large company; it is downright impossible for me to craft a combination of algorithms like that found in Google to get comparable results. No commercial search engine would be so foolish as to use only a single algorithm (essentially a single feature, from an ML perspective). Instead, they use hundreds or thousands. Second, it is essentially impossible to compare search engines with any level of scientific rigour. I evaluated my system using a standard corpus of data published by NIST as part of TREC (the Text REtrieval Conference), consisting not only of 16+ million tweets, but also of sample queries and the correct, human-determined results for these queries. However, to achieve statistically comparable results, many variables have to be controlled in a way that is impossible with a large, complex search engine. Instead, the academic approach compares individual algorithms one-on-one and postulates that these can be combined to give better search results in aggregate.
Specifically, my research showed that my system achieved above-median scores on the official evaluation metrics of the 2011 Microblog corpus when compared to research groups that published last November. Furthermore, my system did the best of all of the "single algorithm" systems, including those that used other document expansion techniques like I described above.
Most of my work was spent on the development of the algorithm, proofs of its convergence and asymptotic complexity, a theoretical framework, and a statistical analysis of my results. Notably absent from this list is engineering. My project is not, by any means, "a toy engineering project" as some commenters have suggested. Actually, the engineering in my project is quite poor, as that area is not one I've had much exposure to.
To briefly address my research mentor: my parents had nothing to do with my project other than providing emotional support when I was stressed. I had a research mentor at a university who I found after I did very well at the 2011 Canada-Wide Science Fair. He provided me with important computational and data resources (such as the corpus I used), but did not develop my algorithm, proofs, or code, which were my own work.
Given the recent attention to my project (and Jack Andraka's project on cancer detection), I'd like to point out a general trend in news articles about science fair projects. In general, the media has a tendency to focus on the potential applications of a project and ignore the science in it, leading to (seemingly fair) criticism. Using me as an example, the talk about "toy" projects and "synthesis" is fair given how my work is portrayed in the media. Somehow, "novel IR algorithm based on Markov chain-based document expansion," even with careful (and thorough!) explanation, gets turned into "Teen builds a better search engine." Similarly, the project of a great friend (and roommate) of mine on drug combinations to treat cystic fibrosis was completely shredded on Reddit when it got significant media attention last year. In his project, he never once claimed or tried to claim that he had done anything with immediate (or even near) medical applications. Instead, he discussed his work to identify molecules that bind to different sites on the damaged protein and can work synergistically as drugs. The media spin-machine quickly turned this into "Teen cures cystic fibrosis" and other such nonsense. Even Jack's project (I know both him and his project), which is unusually "real world," has been overspun by the media. It's just what happens. Heck, people even make fun of it at upper-level science fairs, but it still happens.
Finally, thank you for the encouraging words! To finish with a shameless plug: while fairs like ISEF tend to be very well funded (because of the positive publicity), many regional and state (in the US) or national (outside of the US) youth science organizations struggle to find funding (and even volunteers) to run the fairs that send people to ISEF. If you ever find yourself in a position where you can help (financially, with your time, whatever), I'd strongly encourage it. Given the impact that science fairs have had on my life, I know that I certainly will.
Wonderful work, Nicholas. I'm adding you to my list of good role models for my little boys. The 17-yr-old sports stars get most of the press, because their accomplishments are easily seen. You are the equivalent of a 17-yr-old basketball star with ten years of training behind him, but you play on a court that is nearly invisible (mathematical terrains and state spaces).
You are the kind of role model I want for my boys. Now, I have my work cut out for me explaining to them why. ;-)
So much of the coverage of your research was brief video that having just this small-but-precise description is an immense help in understanding.
Many also need to hear your other message, about the distorting effects of both popular-media-attention, and of transplanting-results-outside-their-original-context. So much online discussion is knocking down decontextualized caricatures of real work... resulting in a lot of unnecessary waste and negativity.
I was instantly reminded of David Evans (from UVA and Udacity fame: http://www.cs.virginia.edu/~evans/) listening to this kid, both physically and in his manner of speech.
I really expect to... eeeh... see more of him in the future.
I know very little about search, but what I love is how he looked at the graph element, exploring relationships between "entities".
The relationships between entities in our world holds so much information and yet in most databases it's reduced to a join between tables. Mapping the relationship and capturing the "hidden" information and therefore making it available for use unlocks amazing potential.
This kid would definitely make a great marketing person. There's absolutely nothing new in what he's done but he's presenting it really well, thumbs up to his parents who have probably done quite a bit of the work ;)
When I was 18 (10 years ago) I did loads of research into that stuff and knew just as much about information retrieval, vector-space tf-idf models, latent semantic indexing, WordNet analysis, etc. At the time it was fairly cutting-edge research. This stuff isn't anything new now; back then I was actually forced to decipher research papers instead of reading popular books on the subject. It was fairly obvious to me even then that none of these techniques worked well for general web search. I did end up building a system that clustered Google search results (in real time) into DMOZ categories, letting you refine your search results by clicking a category (which was actually useful and worked quite well in case you were searching for something ambiguous like "jaguar").
None of these techniques are new to anyone working in information retrieval. Just looking at co-occurrence of words in tweets and expanding the query with some related terms (weighted appropriately) would probably achieve what he has done (weekend project for an average dev).
I'd call this kid really smart if he'd actually figure out how to improve general web search, or could think of a useful application at least. Talking about existing research and making it look like your own isn't great form in my opinion. Coming up with your own definition for a "word" just makes you look stuck up. Much better off acknowledging work of other researchers and quoting them, although that would never generate as much press I guess.
Sorry if the rant is quite negative, I'm just getting a bit fed up with all the marketing surrounding "young geniuses" and teenage entrepreneurs these days. If I wanted to read that stuff I'd get the local paper.
Most of your feedback is perfectly fair. My algorithm was not meant for general search, nor am I a "young genius" or anything of the sort. Similarly, techniques like tf-idf, latent semantic analysis, document modelling, and even pseudo-relevance feedback expansion are no longer "cutting edge" techniques.
However, your blanket characterization that "there's absolutely nothing new" in my work and that I just talked "about existing research" while "making it look like [my] own" is somewhat offensive. Based on a fairly extensive review of the literature, the algorithm that I developed is novel and seems to outperform a number of these standard techniques on short documents like tweets. As for "just looking at co-occurrence," that's essentially a type of pseudo-relevance feedback expansion and is, of course, well known and easy to implement.
Please realize the difference between a news interview published online and the paper I submitted to the fair when assessing the novelty of my work and when suggesting that I made no reference to others' work.
Regarding my "definition for a 'word'", I apologize for appearing pretentious. I was asked to speak on the conference's theme of "redefinition" in relation to my research and did the best I could.
Finally, I think it's kind of strange that you automatically assumed that my parents even worked on my project. Neither of them is even familiar with the details of the project. My work was my own, and your assumption of what is tantamount to broad academic fraud and plagiarism assumed particularly bad faith. In any case, thanks for your honest feedback; I try to be careful about how I come across, and I'll be even more mindful in the future.
First time I read Bulgakov's "Master & Margarita" was when I was 7. Enjoyed it very much.
But it was only two years later, when I read it again, that I was left with no questions. You see, the first time I read it, many things in it were more grown-up than my understanding of such matters could be at that age.
And unless it's "Harry Potter and the Methods of Rationality", I'm not that impressed anyway.
I didn't. His accomplishments sound pretty awesome! My reply above was directed more at the article and its lack of links to the guy's work. After watching the interview I just wanted to try some searches, but there was no link.