Building Your Search Universe

19th December 2017

The Rapidly Expanding Martech Space

We have an astounding amount of data available to us as digital marketers. There are now over 5,000 marketing platforms, tools and software available – a massive 43% YOY increase in 2017. Focusing on SEO alone, there are now over 100 dedicated tools.

Martech landscape over 7 years


Given the vast amount of choice, we rightly spend time researching tools and keeping up to date with new feature releases, but settling on your stack is only the beginning.

Once you’re happy with the tools at your disposal, how can you bring their outputs together in one place and avoid the Excel hell of manually manipulating this data in spreadsheets?

Introducing Your Search Universe

At DeepCrawl, we’re pioneering a concept we call the Search Universe, to help you form a data synergy between your existing SEO tools by combining every URL with metrics and extracted data.

Joining these data sources together to build out your Search Universe will give you a fuller understanding of your site, help you identify opportunities more easily and assist you in your SEO projects and tasks.

Building Your Search Universe

While every website’s Search Universe is going to look different, depending on the size, configuration, type and purpose of that individual site, each will have a common set of sources (e.g. backlinks, log files, SERP metrics etc.) which can all be anchored to the URL.
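As a rough illustration of what anchoring every source to the URL can look like, the sketch below merges per-URL records from several tools into a single record per URL. The source names, field names and sample values are invented for this example; real exports will differ, and URLs usually need normalising before they can be matched.

```python
from collections import defaultdict

def build_universe(**sources):
    """Merge {url: {metric: value}} mappings from several tools into one
    record per URL, prefixing each metric with the name of its source."""
    universe = defaultdict(dict)
    for source_name, rows in sources.items():
        for url, metrics in rows.items():
            for metric, value in metrics.items():
                universe[url][f"{source_name}.{metric}"] = value
    return dict(universe)

# Hypothetical sample data, keyed by URL path.
crawl = {"/shoes": {"status": 200, "indexable": True}}
backlinks = {"/shoes": {"linking_domains": 12}, "/old-sale": {"linking_domains": 3}}
logs = {"/shoes": {"googlebot_hits": 40}}

universe = build_universe(crawl=crawl, backlinks=backlinks, logs=logs)
```

A URL that appears in only one source, such as `/old-sale` here, still gets a record, which is often where the interesting gaps show up.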

Let’s take a look at a few of the key sources that you will want to populate your Search Universe with.

Crawl Data

Crawl data sits at the centre of every Search Universe and is a crucial component of understanding how your site will be seen by search engines.


Backlink Data

On their own, backlink tools highlight your most linked pages. Layering pages receiving links with crawl data means you’ll understand how effectively they’re performing for your site.

Log Files

Adding log file data to your Search Universe shows you how Google and other search engine bots crawl your site. Log files combined with crawl data confirm whether Google is paying attention to the right pages, or not!

Organic Search Data

Tools like Search Console add information about which pages are appearing in search results. Pairing this with crawl data lets you identify pages that shouldn’t be receiving clicks or impressions, and see even more data than you can within Search Console itself.

Web Analytics Data

Platforms like Google Analytics and Omniture show you which pages visitors see on your site and the level of engagement. Connecting Analytics tools with crawl data reveals traffic going to broken pages, and more!

Let’s put some meat on the bones of this concept.

For the rest of this post we’ll cover some common SEO activities and consider how the insights you gain from individual tools can be enhanced by joining together different data sources from your various tools.

How Can Building Your Search Universe Assist You With Common SEO Tasks?

Optimising Your Crawl Budget

Search engines only have limited resources which they will dedicate to crawling your site. This means it is important to optimise the site’s crawl budget and ensure search engines are focusing on the most important pages, particularly for sites with a large number of URLs or a lot of page churn.

As part of the crawl budget optimisation process you will want to identify the low value pages on your site that consume significant crawl budget, and try to reduce how often they are crawled.

Log Files Are Key

Log files are a key source you will want to call upon when optimising a site’s crawl budget, as they let you see which pages search engine bots are requesting on your site. Despite being an often overlooked tool in an SEO’s arsenal, 68% of our clients in a recent DeepCrawl survey said they had access to a log file analyser.

Log file poll results

While looking at log files alone will provide you with valuable information about what pages search engines request and how frequently they do so, this data source won’t provide you with the full picture on its own.

Calling in Crawl Data

To prune your site and optimise crawl budget you will need to bring in information about the relative importance of your site’s pages. Using a web crawling platform, like DeepCrawl, will allow you to find out whether a page is indexable and also gauge the relative importance of a URL by looking at DeepRank, our measurement of internal link weight, which is similar to Google’s PageRank.

Adding SERP Data Into the Mix For the Full Picture

On top of this you can throw data from Google Search Console into the mix to not only verify if a URL is indexable, but whether or not it is ranking and getting clicks from searchers.

Bringing together data from log file analysers, web crawling platforms and Google Search Console puts you in a significantly better position to identify low value pages receiving considerable attention from search engines and then decide on how to deal with these pages in order to optimise the site’s crawl budget.
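To make the join concrete, here is a minimal sketch that flags URLs which bots request often but which have low internal link weight and no organic clicks. The `deeprank` field name, the thresholds and the sample figures are assumptions for illustration only, not values taken from any tool.

```python
def low_value_crawl_hogs(bot_hits, crawl, clicks, min_hits=50, max_rank=2.0):
    """Return (url, hits) pairs for URLs that bots request often but that
    have low internal link weight and no organic clicks."""
    hogs = []
    for url, hits in bot_hits.items():
        rank = crawl.get(url, {}).get("deeprank", 0.0)
        if hits >= min_hits and rank <= max_rank and clicks.get(url, 0) == 0:
            hogs.append((url, hits))
    # Most requested first, so the biggest budget drains surface at the top.
    return sorted(hogs, key=lambda item: item[1], reverse=True)

# Hypothetical inputs: bot requests from logs, link weight from a crawl,
# clicks from Search Console.
bot_hits = {"/search?page=97": 310, "/shoes": 120, "/tag/misc": 75}
crawl = {
    "/search?page=97": {"deeprank": 0.4},
    "/shoes": {"deeprank": 8.1},
    "/tag/misc": {"deeprank": 1.2},
}
clicks = {"/shoes": 540}

result = low_value_crawl_hogs(bot_hits, crawl, clicks)
```

The thresholds are the part to tune: what counts as "considerable attention" depends on how large the site is and how often it is crawled overall.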

In instances like these, your options are to remove the pages, nofollow the links pointing to them, noindex them (which can reduce how often search engines crawl them) or disallow them in robots.txt.

On the flip side, you will also want to use log file and Search Console data to identify important indexable pages that are receiving search impressions but aren’t being crawled regularly.

See How Search Engine Bots Crawl Your Site With Different User Agents

Our crawler can split out desktop and mobile requests from Googlebot, so you can understand your mobile set up and see how your site’s crawl budget is being used by different user agents.

If you have a separate mobile site you can see if search engines are crawling your dedicated mobile URLs with a mobile user agent.
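If you want to try the same split on raw log lines yourself, a simple approach is to look at the user agent string. The sketch below is an illustration rather than a description of how our crawler does it. It relies on Googlebot's smartphone crawler including a `Mobile` token in its user agent, and the sample requests are made up. User agents can be spoofed, so a production version would also verify the requesting IP, for example with a reverse DNS lookup.

```python
from collections import Counter

def googlebot_device(user_agent):
    """Return 'mobile', 'desktop', or None for a non-Googlebot user agent."""
    if "Googlebot" not in user_agent:
        return None
    # The smartphone crawler sends an Android/Mobile browser string
    # alongside the Googlebot token; the desktop crawler does not.
    return "mobile" if "Mobile" in user_agent else "desktop"

# Hypothetical (path, user agent) pairs pulled from a log file.
requests = [
    ("/m/shoes", "Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) "
                 "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.96 "
                 "Mobile Safari/537.36 (compatible; Googlebot/2.1; "
                 "+http://www.google.com/bot.html)"),
    ("/shoes", "Mozilla/5.0 (compatible; Googlebot/2.1; "
               "+http://www.google.com/bot.html)"),
    ("/shoes", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/63.0 Safari/537.36"),
]

counts = Counter(googlebot_device(ua) for _, ua in requests)
```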

Combining Log Files & Site Performance Issues

Ideally you want search engines to request your most important pages the most, but you need to make sure that these pages are returning 200 status codes.

Adding log files into your Search Universe will allow you to see if search engines are requesting pages that return non-200 status codes, or pages whose status codes vary from crawl to crawl.

You might, for example, see that pages requested by search engines are changing between 200 and 503 codes. This would suggest the site is experiencing server performance issues that need to be resolved.
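A quick way to spot this pattern is to collect the distinct status codes returned for each URL and keep only the URLs with more than one. This is a minimal sketch that assumes the log has already been parsed into `(url, status)` pairs; the entries shown are invented.

```python
from collections import defaultdict

def unstable_urls(log_entries):
    """Map each URL whose bot requests returned more than one distinct
    status code to the sorted list of codes seen."""
    seen = defaultdict(set)
    for url, status in log_entries:
        seen[url].add(status)
    return {url: sorted(codes) for url, codes in seen.items() if len(codes) > 1}

# Hypothetical parsed log entries for search engine bot requests.
log_entries = [
    ("/shoes", 200),
    ("/shoes", 503),
    ("/shoes", 200),
    ("/about", 200),
    ("/gone", 404),
]

flapping = unstable_urls(log_entries)
```

A page that consistently returns a 404, like `/gone` here, is a different kind of problem and is deliberately left out of this report.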

Improving the Quality of Your Site

Ensuring visitors are met with unique, relevant and useful content when they come to your site is another top priority for search marketers, especially as Google penalises sites with a high volume of low quality pages. Improving site quality can be achieved, in part, by pruning thin pages with low engagement and low search impressions.

Start with DeepCrawl’s Thin Pages Report

You can quickly identify pages with thin content by running a crawl with DeepCrawl and navigating to the Thin Pages report. This report will give you a full list of all URLs that are smaller than the customisable thin page threshold (3,072 bytes), as well as the total number of pages that meet this criterion.

Adding in Analytics and Search Console

This list can then be combined with impressions and click metrics from Search Console to find low quality indexable pages. You can also add engagement metrics from Google Analytics to filter these pages further by bounce rate and average time on page.

Joining these three data sources empowers you to make the best decisions about how you should treat low value indexable pages, whether that be improving the content, removing the page, or merging it with other existing pages.
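The filter described above might be sketched like this. The thresholds, field layout and sample numbers are assumptions for the example, and treating pages with no analytics sessions as poorly engaged is a choice you may want to make differently.

```python
def thin_page_candidates(thin_urls, impressions, bounce_rate,
                         max_impressions=10, min_bounce=0.8):
    """Return thin URLs with few search impressions and poor engagement."""
    candidates = []
    for url in thin_urls:
        rate = bounce_rate.get(url)  # None when analytics recorded no sessions
        few_impressions = impressions.get(url, 0) <= max_impressions
        poor_engagement = rate is None or rate >= min_bounce
        if few_impressions and poor_engagement:
            candidates.append(url)
    return candidates

# Hypothetical inputs: thin URLs from a crawl, impressions from Search
# Console, bounce rate (0 to 1) from an analytics export.
thin_urls = ["/tag/a", "/tag/b", "/tag/c", "/faq"]
impressions = {"/tag/a": 2, "/faq": 900}
bounce_rate = {"/tag/a": 0.93, "/tag/b": 0.35, "/faq": 0.9}

candidates = thin_page_candidates(thin_urls, impressions, bounce_rate)
```

The output is a list of candidates for review, not a list of pages to delete: a thin page such as `/faq` that earns plenty of impressions is better improved than removed.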

Section-Specific Metrics with DeepCrawl’s Site Explorer

For good measure you may also want to look at DeepCrawl’s Site Explorer report which provides site quality metrics averaged across the different sections of your site. By looking at the average DeepRank, fetch time, backlinks etc. for specific site sections you can find which parts are of lower quality and prioritise which areas you want to dedicate time to improving.
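If you are working from a raw export, you can approximate this kind of section-level view by grouping URLs on their first path segment and averaging a metric. This is only a sketch of the idea, not how Site Explorer calculates its figures, and the `fetch_time` values are made up.

```python
from collections import defaultdict
from statistics import mean

def section_of(url_path):
    """Use the first path segment as the section: '/blog/post-1' -> 'blog'."""
    segments = [part for part in url_path.split("/") if part]
    return segments[0] if segments else "(root)"

def section_averages(pages, metric):
    """Average one metric across the pages in each site section."""
    by_section = defaultdict(list)
    for path, metrics in pages.items():
        by_section[section_of(path)].append(metrics[metric])
    return {section: mean(values) for section, values in by_section.items()}

# Hypothetical crawl export: fetch time in seconds per URL path.
pages = {
    "/blog/a": {"fetch_time": 0.5},
    "/blog/b": {"fetch_time": 1.5},
    "/shop/x": {"fetch_time": 2.0},
    "/": {"fetch_time": 0.25},
}

averages = section_averages(pages, "fetch_time")
```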

Identifying low quality pages on your site is a great example of how you can bring together different data sources from your Search Universe to leverage better insights into your website’s performance.

Further Ways You Can Combine Google Analytics With Crawl Data

Insights from consumer analytics platforms, like Google Analytics, will show you what visitors see on your site and what they engage with. Here are a couple of examples of how crawl data can combine with an external data source to leverage better insights.

Soft 404 Pages With Traffic

It isn’t unusual for eCommerce sites to have category pages that don’t contain any products but are still ranked in search. With Analytics data you will be able to see pages with traffic, and with DeepCrawl you can run a custom extraction to find pages with no products. Empty category pages provide a poor user experience, so you should consider redirecting them to a relevant, higher level category which does have products available.

Other Pages Receiving Traffic When They Shouldn’t Be

Layering crawl and Analytics data also enables you to discover broken, redirecting and non-indexable pages receiving traffic. All three of these scenarios will need to be investigated, understood and possibly rectified to provide your visitors a better user experience and to improve the quality of your site.

Optimising Backlink Authority Flow

Links are still a key signal that Google use to rank websites. According to a recent survey, DeepCrawl users have access to a fairly even spread of backlink tools: 38% have access to Ahrefs, 26% Majestic and 21% Moz Pro.

Regardless of which link analysis tool you use, they are all limited in isolation because they can’t tell you what happens to link authority once it lands on a page.

Only when you layer information from link analysis tools with crawl data can you find out if the link equity that those links bring is being spread effectively throughout your site.

Here are just a few practical ways you can layer backlink and crawl data to find pages that are in need of cleanup work:

  • Backlinks to pages with a meta nofollow tag
  • Broken pages with backlinks
  • Non-indexable pages with backlinks
  • Backlinks to duplicate pages
  • Orphaned pages with backlinks
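The checks in this list can be expressed as a small labelling function that is applied to every crawled page with at least one backlink. The field names (`meta_nofollow`, `duplicate`, `internal_links_in` and so on) are hypothetical stand-ins for whatever your crawl export provides.

```python
def backlink_issues(page):
    """Return every cleanup label that applies to a crawled page."""
    issues = []
    if page.get("meta_nofollow"):
        issues.append("meta nofollow")
    if page.get("status", 200) >= 400:
        issues.append("broken")
    if not page.get("indexable", True):
        issues.append("non-indexable")
    if page.get("duplicate"):
        issues.append("duplicate")
    if page.get("internal_links_in", 1) == 0:
        issues.append("orphaned")
    return issues

# Hypothetical crawl data and backlink counts, both keyed by URL path.
crawl = {
    "/guide": {"status": 200, "indexable": True, "meta_nofollow": True, "internal_links_in": 5},
    "/gone": {"status": 404, "indexable": False, "internal_links_in": 3},
    "/lonely": {"status": 200, "indexable": True, "internal_links_in": 0},
    "/ok": {"status": 200, "indexable": True, "internal_links_in": 9},
}
backlinks = {"/guide": 14, "/gone": 6, "/lonely": 2, "/ok": 30}

flagged = {}
for url, link_count in backlinks.items():
    issues = backlink_issues(crawl.get(url, {}))
    if link_count > 0 and issues:
        flagged[url] = issues
```

A page can carry more than one label, as `/gone` does here, which is useful when deciding what to fix first.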

You can check out this Search Engine Land article to find out how to clean up pages that fall into the above categories.

Auditing Your Sitemaps

Making sure your sitemaps are kept up to date is another essential activity which requires the involvement of multiple data sources in order to get a holistic understanding.

By combining web crawl data with the URLs included in a site’s sitemap, you can identify several key insights that wouldn’t be possible by examining the latter alone.

Don’t Forget About the Orphans

The combination of these two sources will reveal orphan pages on your site: pages which appear in the sitemap but have no internal links pointing to them. Once you’ve identified these pages you can review them on a page-by-page basis and decide if they are giving your visitors value. If they are driving value, then it is worth adding internal links to these pages.

If the orphan pages are redundant then you will want to consider either 404’ing them or redirecting them to relevant pages if they have backlinks.

Further uses of bringing together web crawl data with your sitemap include identifying non-indexable pages that don’t need to be included in your sitemap, disallowed URLs, broken sitemap links, mobile alternates and indexable pages that are missing from your sitemap.
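Several of these checks come down to set operations between the URLs in the sitemap and the URLs found by following internal links. Here is a minimal sketch, assuming both lists have already been normalised to the same URL format; the sample URLs are invented.

```python
def audit_sitemap(sitemap_urls, crawled):
    """Compare sitemap URLs with pages found by following internal links.

    `crawled` maps each linked URL to True if the page is indexable.
    """
    sitemap_urls = set(sitemap_urls)
    linked = set(crawled)
    indexable = {url for url, is_indexable in crawled.items() if is_indexable}
    return {
        # In the sitemap, but no internal link leads to them.
        "orphans": sitemap_urls - linked,
        # Listed in the sitemap even though they cannot be indexed.
        "non_indexable_in_sitemap": sitemap_urls & (linked - indexable),
        # Indexable pages the sitemap does not mention.
        "missing_from_sitemap": indexable - sitemap_urls,
    }

# Hypothetical inputs.
sitemap_urls = ["/shoes", "/landing-2016", "/cart"]
crawled = {"/shoes": True, "/cart": False, "/boots": True}

audit = audit_sitemap(sitemap_urls, crawled)
```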

Google Search Console is another part of your Search Universe you will want to call upon when auditing your sitemap. While knowing which pages in your sitemap are receiving impressions in search won’t present you with an obvious opportunity for a quick fix, you will be able to flag pages with little or no impressions and consider ways in which you can improve these pages so as to improve their ranking, impressions and clicks.

Further Ways You Can Combine Google Search Console With Crawl Data

SERP metrics from Google Search Console’s Search Analytics are a powerful data source which can give you insight into organic performance. Here are a couple of extra ways you can utilise Search Console in combination with crawl data.

Disallowed Pages with Clicks

Search Console will show you which pages have clicks, and crawl data can show you if these pages are indexable. By joining these two sources together you will be able to identify scenarios such as pages disallowed in a site’s robots.txt file that are still receiving clicks from organic search. This scenario may occur if you still have internal links pointing to a disallowed page. You should decide whether you want the page to be indexed or not, and then either remove the disallow rule from the robots.txt file or consider removing the internal links to the page.
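For a small site you could run this check with Python's standard `urllib.robotparser` module, as sketched below. The robots.txt rules, URLs and click counts are invented for the example. Note that the standard library parser uses simple path prefix matching, so its results may differ from Google's handling of wildcard rules.

```python
from urllib.robotparser import RobotFileParser

def disallowed_with_clicks(robots_txt, clicks, user_agent="Googlebot"):
    """Return {url: clicks} for URLs that receive organic clicks even
    though the given robots.txt rules block them for `user_agent`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {
        url: count
        for url, count in clicks.items()
        if count > 0 and not parser.can_fetch(user_agent, url)
    }

# Hypothetical robots.txt content and Search Console click counts.
robots_txt = """\
User-agent: *
Disallow: /checkout/
"""
clicks = {
    "https://www.example.com/checkout/basket": 18,
    "https://www.example.com/shoes": 240,
    "https://www.example.com/checkout/login": 0,
}

blocked = disallowed_with_clicks(robots_txt, clicks)
```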

Pages with Image Search Traffic with Broken Images

Search Console also provides performance metrics for images on your site. By looking at this with crawl data you can highlight broken images that are still receiving impressions in image search and fix them. Image search is an often overlooked area of search, and is especially important for sites in the eCommerce space with many driving significant amounts of revenue through this channel.

Scaling Up…

Identifying these issues in Excel is fairly straightforward for sites with a relatively small number of URLs, but it takes time and isn’t manageable for many of today’s sites, which have millions of URLs.

To effectively manage sites of this size you need to have a solution in place which is scalable, robust and automated.

At DeepCrawl, we firmly believe we have the only crawling solution that is able to ingest all of the data sources talked about above and organise this information in a way that saves you time and gives you a full view of your Search Universe.

So what are you waiting for? Start building your Search Universe today.


Written By
Sam Marsden is SEO & Content Manager at DeepCrawl and writer on all things SEO. DeepCrawl is the world’s most comprehensive website crawler, providing clients with a complete overview of their websites’ technical health.