Google answers a question about crawl budget

Someone on Reddit posted a question about their “crawl budget” issue and asked if a large number of 301 redirects to 410 error responses was causing Googlebot to use up its crawl budget. Google’s John Mueller offered a rationale for why the Redditor may be experiencing a lackluster crawl pattern and clarified a point about crawl budgets in general.

Crawl budget

It’s a commonly accepted idea that Google has a crawl budget, a concept SEOs came up with to explain why some sites don’t get crawled enough. The idea is that each site is allotted a set number of crawls, a cap on how much crawling that site gets.

It’s important to understand the background of the crawl budget idea because it helps clarify what it actually is. Google has long insisted that there is no such thing as a crawl budget at Google, although the way Google crawls a site can give the impression that there is a crawl limit.

Matt Cutts, at the time a senior Google engineer, alluded to this fact about crawl budget in a 2010 interview.

Matt answered a question about a Google crawl budget by first explaining that there is no such thing as a crawl budget as understood by SEOs:

“The first is that there really isn’t an index limit. A lot of people thought that a domain would only have a certain number of pages indexed, and that’s not really how it works.

There’s also no hard limit on our crawl.”

In 2017, Google published a crawl budget explainer that brought together numerous crawl-related facts resembling what the SEO community called a crawl budget. The new explanation is more precise than the vague catch-all phrase “crawl budget” (Google’s crawl budget document is summarized here by Search Engine Journal).

The short list of main points about crawl budget is:

- Crawl rate is the number of URLs Google can crawl, based on the server’s ability to serve the requested URLs.
- A shared server, for example, can host tens of thousands of websites, resulting in hundreds of thousands, if not millions, of URLs, so Google has to pace its crawling according to each server’s ability to fulfill page requests.
- Pages that are essentially duplicates of others (such as faceted navigation) and other low-value pages can waste server resources, limiting the number of pages a server can give Googlebot to crawl.
- Lightweight pages are easier to crawl in greater volume.
- Soft 404 pages can cause Google to focus on those low-value pages instead of the pages that matter (see the sketch after this list).
- Internal linking patterns can help influence which pages get crawled.
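To make the soft 404 point concrete, here is a minimal sketch, assuming a Python Flask app with hypothetical routes (none of this comes from Google’s documentation). A soft 404 returns an HTTP 200 status together with a “not found” message, so Googlebot keeps treating the URL as a normal live page; returning a real 404 (or 410) status tells it to stop spending crawls there.

```python
# Minimal Flask sketch (hypothetical routes) contrasting a "soft 404"
# with a real error status.
from flask import Flask, abort

app = Flask(__name__)

@app.route("/soft-404-example")
def soft_404():
    # Anti-pattern: the body says "not found" but the status code is 200,
    # so crawlers treat this as a normal, indexable, crawl-worthy page.
    return "Sorry, this page could not be found.", 200

@app.route("/hard-404-example")
def hard_404():
    # Better: the status code itself tells Googlebot the page does not exist.
    abort(404)

if __name__ == "__main__":
    app.run()  # local test server only
```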

Reddit question about crawl budget

The Redditor wanted to know whether the seemingly low-value pages they were creating were affecting Google’s crawl budget. In short, a request for the insecure (HTTP) URL of a page that no longer exists is redirected to the secure (HTTPS) version of that missing page, which returns a 410 error response (meaning the page is gone permanently).

It’s a legitimate question.

Here’s what they asked:

“I’m trying to get Googlebot to forget and stop crawling some very old non-HTTPS URLs, which are still being crawled after six years. I placed a 410 response, on the HTTPS side, on those old URLs.

So Googlebot is finding a 301 redirect (HTTP to HTTPS) and then a 410.

old HTTP URL -301-> HTTPS URL (responds 410)

Two questions. Is G**** happy with this 301+410?

I’m having “crawl budget” issues and don’t know if these two responses are draining Googlebot.

Is 410 effective? I mean, should I return the 410 directly, without a 301 first?”
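To illustrate the chain the Redditor describes, here is a hedged verification sketch in Python using the requests library; the URL is a placeholder, not the Redditor’s actual site. It requests the old HTTP URL without following redirects, checks for the 301 and its HTTPS Location header, and then checks that the HTTPS URL answers with a 410.

```python
# Verification sketch; the URL below is a hypothetical placeholder.
# It follows the chain described above: old HTTP URL -> 301 -> HTTPS URL -> 410.
import requests

OLD_URL = "http://example.com/old-page"  # hypothetical legacy non-HTTPS URL

# Step 1: request the old URL without auto-following redirects.
first = requests.get(OLD_URL, allow_redirects=False, timeout=10)
print(first.status_code)              # expected: 301
print(first.headers.get("Location"))  # expected: the HTTPS version of the URL

# Step 2: request the HTTPS URL the redirect points to.
second = requests.get(first.headers["Location"], allow_redirects=False, timeout=10)
print(second.status_code)             # expected: 410 (gone permanently)
```

The Redditor’s second question, whether to return the 410 directly without the 301 hop, would show up in this check as a 410 on the very first request.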

Google’s John Mueller responded:

“301s are fine, a 301/410 mix is fine.

Crawl budget is really only an issue for massive sites. If you’re seeing issues with it and your site isn’t really massive, Google probably doesn’t see much value in crawling more. It’s not a technical problem.”

Reasons for not crawling enough

Mueller responded that Google “probably” doesn’t see the value in crawling more web pages. That suggests the pages could use a review to identify why Google might decide they are not worth crawling.

Some popular SEO tactics tend to create low-value web pages that lack originality. For example, a popular SEO practice is to review the top-ranking web pages to understand which factors explain why those pages rank, and then use that information to improve your own pages by replicating what already works in the search results.

This sounds logical, but it does not create anything of value. If you think of it as a binary choice between One and Zero, where Zero is what is already in the search results and One represents something original and different, the popular SEO tactic of emulating what already ranks is doomed to produce another Zero: a website that offers nothing more than what is already in the SERPs.

Obviously, there are technical issues that can affect the crawl rate, such as server health and other factors.

But in terms of what is meant by crawl budget, Google has long maintained that it is a consideration for massive sites, not for small- to medium-sized websites.

Read the Reddit discussion:

Is G**** happy with 301+410 responses for the same URL?

Featured image by Shutterstock/ViDI Studio
