Google may be the dominant search engine, but it’s far from ideal. One major problem: How do you search for things you don’t know exist?
Using Google’s own experimental algorithms, a graduate student may have built a solution: a search engine that lets you add and subtract search terms for far more intuitive results.
The new search engine, ThisPlusThat.Me, looks for context clues among the terms. For instance, entering the arithmetic search “Paris – France + Italy” returns “Rome” as the top result. Search the same thing in Google and you’ll get directions between Paris and Italy, restaurants in France and Italy, and a depressing Yahoo Answers page asking whether Italy is in Paris (or vice versa). “Rome,” on the other hand, is the association you, a human, would make (I want This, without That, but including Those), and the engine makes that decision based on each answer’s semantic value compared to your search.
Until now, search has been stuck in a paradigm of literal matching, unable to break into conceptual associations or guess what you mean when you search. There’s a reason Amazon and Netflix have scored points for their item suggestions: They’re thinking the way you think.
The engine, created by astrophysics PhD candidate Christopher Moody, uses Google’s own open-source word2vec research to take the terms you searched for and rank the results by relevance, just like a normal search, except the rankings are based on “vector distances” that track much more closely with human intuition. So in the above example, other results could have been, say, Napoleon or wine: both have ties to the search terms, but within the context of City – Country + Other Country, Rome is the vector with the closest “distance.”
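The add-and-subtract trick can be sketched in a few lines. The vectors below are hand-made toy stand-ins, not real word2vec embeddings, but the mechanics are the same: sum the “plus” vectors, subtract the “minus” ones, and rank every other word by cosine similarity to the result.

```python
import numpy as np

# Toy word vectors: hand-made stand-ins for real word2vec embeddings,
# chosen so that "country" and "capital city" directions are consistent.
vectors = {
    "Paris":    np.array([1.0, 1.0, 0.0]),
    "France":   np.array([1.0, 0.0, 0.0]),
    "Italy":    np.array([0.0, 0.0, 1.0]),
    "Rome":     np.array([0.0, 1.0, 1.0]),
    "Napoleon": np.array([0.9, 0.3, 0.1]),
    "wine":     np.array([0.5, 0.2, 0.6]),
}

def analogy(positive, negative):
    """Add the 'positive' vectors, subtract the 'negative' ones, then
    rank the remaining words by cosine similarity to the result."""
    query = sum(vectors[w] for w in positive) - sum(vectors[w] for w in negative)

    def cosine(v):
        return float(np.dot(query, v) / (np.linalg.norm(query) * np.linalg.norm(v)))

    candidates = [w for w in vectors if w not in positive and w not in negative]
    return sorted(candidates, key=lambda w: cosine(vectors[w]), reverse=True)

print(analogy(["Paris", "Italy"], ["France"]))  # "Rome" ranks first
```

With these toy vectors, Paris + Italy – France lands exactly on Rome’s vector, while Napoleon and wine score lower, which is the same ranking behavior the engine produces at scale.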
All the word2vec algorithm needs is an appropriate corpus of data to build its word relations on. Moody used Wikipedia as his vocabulary and relational base: an obvious advantage in size, with the added benefit of “canonicalizing” terms (is it Paris the city, or Paris from the Trojan War? In Wikipedia, the first is “Paris” and the second “Paris_(mythology)”). But millions of search-and-replaces across Wikipedia’s 42 GB of text were compute-intensive, so Moody used Hadoop’s map functions to fan those search-and-replaces out to several nodes.
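The map step is easy to picture: each line of corpus text goes in, each known surface phrase gets swapped for its canonical Wikipedia title, and the rewritten line comes out. The table and phrases below are hypothetical, a minimal sketch of the idea rather than Moody’s actual pipeline.

```python
import re

# A tiny, hypothetical slice of the canonicalization table: surface
# phrases mapped to the Wikipedia article titles that disambiguate them.
CANONICAL = {
    "Paris, France": "Paris",
    "the Trojan prince Paris": "Paris_(mythology)",
}

# Match longest phrases first so "Paris, France" wins over a bare "Paris" entry.
_pattern = re.compile(
    "|".join(re.escape(k) for k in sorted(CANONICAL, key=len, reverse=True))
)

def map_canonicalize(line):
    """The 'map' step: rewrite one line of corpus text, replacing each
    known surface phrase with its canonical Wikipedia title. Hadoop
    fans lines like this out across many worker nodes."""
    return _pattern.sub(lambda m: CANONICAL[m.group(0)], line)

print(map_canonicalize("He left Paris, France to meet the Trojan prince Paris."))
# -> "He left Paris to meet the Paris_(mythology)."
```

Because each line is rewritten independently, the work parallelizes trivially, which is exactly why a map-style job fits the problem.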
The trained model yields an 8 GB table of word vectors, and each search query has to find the terms in that table with the closest vector distances. Moody tried out a few data search systems before settling on numexpr, a fast numerical expression evaluator for Python, to do that lookup.
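The lookup itself is a brute-force nearest-neighbor scan: compute every row’s distance to the query vector and take the minimum. The sketch below uses plain NumPy on a small random table (numexpr’s job is to evaluate exactly this kind of elementwise expression faster, compiled and multithreaded); the sizes and names here are illustrative, not from the engine.

```python
import numpy as np

# A stand-in for the vector table: 100,000 rows of 300-d float32 vectors
# (the real table is ~8 GB; this one fits comfortably in memory).
rng = np.random.default_rng(0)
table = rng.standard_normal((100_000, 300)).astype(np.float32)

# A query vector that sits very close to row 42.
query = table[42] + rng.standard_normal(300).astype(np.float32) * 0.01

def closest(table, query):
    """Return the index of the row with the smallest squared Euclidean
    distance to the query. numexpr would evaluate the same expression,
    sum((table - query)**2, axis=1), without large temporaries."""
    dists = np.sum((table - query) ** 2, axis=1)
    return int(np.argmin(dists))

print(closest(table, query))  # -> 42
```

The scan touches every row, so at 8 GB the win from numexpr is real: it streams the arithmetic through the CPU cache across all cores instead of materializing intermediate arrays the size of the table.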