New AI Based Yandex Search Algorithm Palekh

5th January 2017

Yandex recently announced Palekh, a new search algorithm that improves how Yandex understands the meaning behind a search query by using deep neural networks as one ranking factor among many. The new algorithm helps Yandex improve its search results across the board, but especially for long-tail search queries.

As most State of Digital readers know, long-tail search queries are queries that the search engine processes very rarely. The rarity of a query correlates with its length: typically, the shorter the query, the more common it is, and the longer the query, the rarer it is. Long-tail queries are often conversational and describe something in detail when a user doesn’t know the exact word or phrase but tries to explain it to the search engine. For instance, a user might describe a movie without knowing its title: “a movie about a guy growing potatoes on some planet.”

These long-tail queries challenge search engines to fully understand the query intent in order to offer the most relevant search results. Search engines find it easier to return good results when the words in a query resemble the words in the documents. Long-tail queries are harder because their words and synonyms are less likely to match those in relevant documents, and there is far less data about such rare queries.

Long-tail queries are best matched to search results by similarity of meaning rather than similarity of words. Yandex therefore decided to incorporate advanced artificial intelligence to improve how it matches queries and results, by better understanding the intent behind the query as opposed to similarities between the words themselves.

As a company that specializes in machine learning, Yandex has historically built machine learning into 70% of its products and services, starting with search. Most recently with Palekh, the Yandex search team taught its neural networks to see the connections between a query and a document even if they don’t contain common words.

The new algorithm is named after the Russian city of Palekh, whose coat of arms features a firebird with a long tail. Yandex has named all of its search algorithms after Russian cities and chose Palekh because the firebird’s long tail symbolizes the algorithm’s impact on long-tail queries.

This post explains the machine learning dynamics behind Yandex’s latest search algorithm, Palekh, and what differentiates it from other uses of deep neural networks for web search ranking.

What is machine learning? What are neural networks?

Machine learning is exactly what it sounds like: a machine teaching itself by finding patterns in input data. As Yandex puts it, “a machine that can learn is a machine that can make its own decisions based on input algorithms, empirical data and experience.” Once a goal is set, models are trained to reach that goal based on learning samples. The machine teaches itself to make rules that improve over time as it processes more data. Millions of factors can contribute to the algorithm’s results, far more than a human could process or program by hand.

Neural networks are a machine learning method modeled after neurons in the human brain, and they aim to solve problems the way a human brain would. Neural networks operate on real numbers and can be trained to find relationships in a set of data by processing input data and recognizing patterns. They can be trained to analyze images, sound, or text and are applied to tasks like image recognition, text translation, or web search ranking.

How did Yandex teach its neural networks to better understand queries?

Yandex trained its neural networks with a semantic mapping model that reduces information to numbers, groups them based on content meaning, projects the groups on a semantic map, and then finds matches between the groups based on their proximity on the map. Generally, semantic mapping finds connections between two different entities by placing them in the same semantic space and judging how closely related they are by how near they sit to one another. In the case of web page ranking, the two entities being checked for connections are search queries and documents, here represented by the headings of crawled pages.

Before anything happened with the mapping, the search team first needed to train the algorithm by giving it examples of pairs of queries and relevant webpage headings. This training set provided the neural networks with a base understanding of the connections the Yandex search team wanted it to make.

Since computers work better with numbers than with words, Yandex then converted billions of search queries and its crawled pages into numbers. These numbers then needed to be organized so that they carried meaning. An arbitrary set of words has no real concept or meaning; only very specific sets of words make sense together, and there are millions of possible contexts. The algorithm finds the small subsets of words that do carry meaning, but this still leaves millions of possibilities, so the numbers must be compressed. Using a method called dimensionality reduction, a matrix projects the long list of words onto just 300 numbers, which together form a 300-dimensional vector. Two texts can use completely different words, but if their vectors end up close together, their meanings are similar. The same is done for the headings of crawled pages.
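
To make the projection step more concrete, here is a minimal sketch in Python. It is not Yandex’s implementation: the vocabulary, the three-dimensional target space (standing in for 300 dimensions), and the random matrix (standing in for a learned one) are all assumptions made for illustration.

```python
import random

# Hypothetical toy vocabulary; a real system would cover millions of terms.
VOCAB = ["movie", "guy", "growing", "potatoes", "planet", "martian", "film"]
DIM = 3  # stands in for the 300 dimensions mentioned in the article


def bag_of_words(text):
    """Count how often each vocabulary word appears in the text."""
    tokens = text.lower().split()
    return [tokens.count(word) for word in VOCAB]


def make_projection(rows, cols, seed=0):
    """Build a random matrix as a stand-in for a learned projection (assumption)."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def project(counts, matrix):
    """Multiply the matrix by the count vector to get a short dense vector."""
    return [sum(w * c for w, c in zip(row, counts)) for row in matrix]


PROJECTION = make_projection(DIM, len(VOCAB))
query_counts = bag_of_words("a movie about a guy growing potatoes on some planet")
query_vec = project(query_counts, PROJECTION)
print(len(query_vec))  # one number per target dimension
```

In a trained model the matrix values would be learned from the query and heading pairs, so that texts with related meanings land near each other. With a random matrix, as here, only the shape of the computation is meaningful.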

These semantic vectors are then used to find matches based on their proximity. For each query and title, the algorithm checks how close the title’s projection is to the query’s projection on the map. Just as two words can look similar to the search engine, so can two vectors.

To simplify the explanation, let’s assume we are dealing with a two-dimensional space, so the numbers can be treated as points on a coordinate plane. A given query and a webpage heading are then mapped on the coordinate plane. The distance between the point for the query and the point for the webpage heading can then be measured to decide how relevant the document is to the query. The closer the two points, the more relevant the document is to the query.
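
The two-dimensional picture can be sketched in a few lines of Python. The coordinates and headings below are invented for illustration, and straight-line (Euclidean) distance is an assumption; the article does not say which distance measure Yandex uses.

```python
import math

# Hypothetical 2-D coordinates, invented purely for illustration.
query_point = (1.0, 2.0)
heading_points = {
    "The Martian (2015 film)": (1.5, 2.5),
    "How to grow potatoes at home": (4.0, 1.0),
    "Planets of the solar system": (6.0, 7.0),
}


def rank_by_distance(query, headings):
    """Return headings sorted from nearest to farthest from the query point."""
    return sorted(headings, key=lambda title: math.dist(query, headings[title]))


ranking = rank_by_distance(query_point, heading_points)
for title in ranking:
    print(title, round(math.dist(query_point, heading_points[title]), 2))
```

The nearest heading is treated as the most relevant. In a real 300-dimensional space the same idea applies, only with longer coordinate lists.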

Why is this especially useful for long-tail queries?

By placing the query in the same semantic space as a webpage title, the search engine can tell that the query and the title belong together even if they don’t share similar words. Previously, algorithms were largely limited to finding similarities based on synonyms and concepts, for instance matching shoes with boots, or distinguishing the brand Kayak from an actual kayak. However, as humans we know that long-tail queries may not include words that match similar words or concepts. By using neural networks, the search engine can find similarities that go beyond words to meanings. Because long-tail queries typically demand results based on meaning, and there is less data for these rare queries, semantic mapping fills the gap.

What makes Yandex’s approach different from others?

Yandex also trains its neural networks on several other targets, including long click prediction, CTR, and “click or not click” models. Instead of using only its single best neural network model, Yandex uses five. When the Yandex search team compared the options, it found that combining all of its models gave much more accurate search results. Using all of its previous ranking factors plus its best neural network model, Yandex saw a 1% improvement on long-tail queries. Using all of its previous ranking factors plus five neural network models, that improvement doubled to a 2% increase in accuracy for long-tail queries.
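
One way to picture neural network outputs acting as ranking factors among others is to append them to the existing factors and score the combined list. The sketch below is only an illustration: the weighted sum, the factor values, and the weights are all hypothetical, and the article does not describe how Yandex actually combines its factors.

```python
def combined_score(classic_factors, neural_scores, weights):
    """Weighted sum over classic factors followed by neural-model outputs."""
    features = list(classic_factors) + list(neural_scores)
    if len(features) != len(weights):
        raise ValueError("need exactly one weight per feature")
    return sum(w * f for w, f in zip(weights, features))


# Hypothetical values: two classic factors plus five neural-model outputs.
classic = [0.8, 0.3]
neural = [0.9, 0.7, 0.6, 0.8, 0.5]
weights = [0.5, 0.5, 0.2, 0.2, 0.2, 0.2, 0.2]

score = combined_score(classic, neural, weights)
print(round(score, 2))
```

Each extra model simply contributes one more number to the list, which is why adding models does not require replacing the factors that were already there.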

What does Yandex plan to do with this in the future?

Yandex taught its neural networks to see the headlines of documents but the search team is currently working to also check the text content. In doing so, the Yandex search engine will be able to provide even more accurate results after learning in more detail whether the content of its crawled pages is relevant to a given query. To date, other search engines with similar technology only check headlines.

Yandex is also working to implement the model with more of its crawled pages. Currently, the model looks at hundreds of documents that are already filtered to Yandex’s top search results. The Yandex search team is working to optimize the model at an earlier stage of search so it will eventually cover billions of documents. The more documents Yandex can include, the more accurate the search results will be.
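
The current setup, where the model only sees documents that an earlier stage has already shortlisted, can be sketched as a two-stage ranking. Everything here is hypothetical: the function name, the `cheap` and `semantic` scores, and the toy documents are assumptions for illustration, not Yandex’s pipeline.

```python
def rerank_top_candidates(candidates, cheap_score, semantic_score, top_n):
    """Shortlist top_n candidates by a cheap score, then reorder the
    shortlist by a costlier semantic score."""
    shortlisted = sorted(candidates, key=cheap_score, reverse=True)[:top_n]
    return sorted(shortlisted, key=semantic_score, reverse=True)


# Hypothetical documents with made-up scores.
docs = [
    {"title": "A", "cheap": 0.9, "semantic": 0.2},
    {"title": "B", "cheap": 0.8, "semantic": 0.7},
    {"title": "C", "cheap": 0.1, "semantic": 0.99},
    {"title": "D", "cheap": 0.7, "semantic": 0.5},
]

results = rerank_top_candidates(
    docs,
    cheap_score=lambda d: d["cheap"],
    semantic_score=lambda d: d["semantic"],
    top_n=3,
)
titles = [d["title"] for d in results]
print(titles)
```

In this toy data, document C has the highest semantic score but never reaches the second stage because the cheap first stage filters it out. That is the kind of miss that applying the model earlier, over more documents, would be expected to reduce.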

In addition to the overall improvement of accuracy of Yandex search results, this will generally help Yandex better understand conversational queries in the future.

What does this mean for SEO?

As Yandex improves its ability to handle conversational queries, SEO specialists and online marketers will also need to adapt. As always with SEO, multiple ranking factors matter and it’s hard to tell which factors matter the most. However, good quality content for the user has always been the main focus of the Yandex search team, and Palekh will not change that. SEO specialists should still consider what the user needs rather than focusing on individual keywords or practicing keyword stuffing. As long as webmasters provide content that helps Yandex users, Yandex machine learning will recognize it.

Yandex users can trust that Yandex’s advanced machine learning technology will continue to provide them with more and more relevant search results the more data it processes. Since the Yandex search team has successfully trained Palekh, users can expect to enter much more complicated queries into the Yandex search box.

Written By
Melissa McDonald is the International Marketing Manager at Yandex, Russia's leading search engine, which offers free English to Russian translation and optimization for advertisers. Melissa also regularly blogs for RussianSearchMarketing.com, a news and information resource for digital advertising in Russia.