How to improve crawling and indexing efficiency at the enterprise level

Enterprise SEO plays by a different set of rules.

Strategies that may work for small or niche websites don’t always work at scale.

So what exactly can go wrong when enterprise SEO gets too big?

In this article, I will share three real-life examples. Then you’ll learn a potential antidote for managing SEO more efficiently at scale.

Facing the indexing dilemma

Small sites tend to grow one page at a time, using keywords as building blocks of an SEO strategy.

Large sites often take more sophisticated approaches, relying heavily on systems, rules and automation.

Aligning SEO with business goals is critical. Measuring SEO success purely by keyword rankings or traffic encourages over-indexing, which has negative consequences.

There is no magic formula for determining the optimal number of indexed URLs. Google does not set an upper limit.

A good starting point, however, is to consider the overall health of your SEO funnel. If a site:

- Pushes tens or hundreds of millions (or even billions) of URLs into Google
- Ranks for only a few million keywords
- Receives visits to only a few thousand pages
- Converts on a mere fraction of those (if any)

… then it’s a good indication that you have some serious SEO health issues to address.
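As a rough illustration, that funnel check can be expressed in a few lines of code. The stage names and figures below are entirely hypothetical; what matters is the drop-off ratio from one stage to the next.

```python
# Hypothetical funnel snapshot for a large site (illustrative numbers only).
funnel = {
    "urls_pushed_to_google": 300_000_000,  # URLs submitted or made indexable
    "keywords_ranking": 4_000_000,         # keywords with any ranking
    "pages_receiving_visits": 50_000,      # pages with organic visits
    "pages_converting": 2_000,             # pages producing conversions
}

previous = None
for stage, count in funnel.items():
    ratio = f"({count / previous:.4%} of previous stage)" if previous else ""
    print(f"{stage:>28}: {count:>12,} {ratio}")
    previous = count
```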

Fixing any site hygiene issues now should prevent even bigger SEO problems later.

Let’s look at three real enterprise SEO examples that illustrate why this is so important.

Case 1: Consequences of over-indexing low-quality content

Google has limited resources for crawling and processing the web. It prioritizes content that it considers valuable to users.

Google may crawl, but not index, pages it deems thin, duplicate, or low-quality.

If it’s only a few pages, no problem. But if it’s widespread, Google may ignore entire page types or most of the site’s content.

In one case, an e-commerce marketplace found that tens of millions of its listing pages were affected by selective crawling and indexing.

After crawling millions of thin and near-duplicate listings without indexing them, Google eventually pulled back from the site almost entirely, leaving many pages stuck in “Discovered – currently not indexed” limbo.

This marketplace relied heavily on search engines to surface new listings to users. New content was no longer being discovered, which was a major business challenge.

Some immediate steps were taken, including improving internal linking and deploying dynamic XML sitemaps. Ultimately, these attempts were futile.

The real solution required controlling the volume and quality of indexable content.
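A minimal sketch of what “controlling the volume and quality of indexable content” can look like in practice is a dynamically generated XML sitemap that only includes listings meeting basic quality thresholds. The quality fields and cutoffs below are illustrative assumptions, not the marketplace’s actual rules:

```python
from xml.etree.ElementTree import Element, SubElement, tostring

# Hypothetical listing records; in practice these come from the catalog database.
listings = [
    {"url": "https://example.com/listing/1", "word_count": 450, "images": 4, "duplicate_score": 0.12},
    {"url": "https://example.com/listing/2", "word_count": 40,  "images": 0, "duplicate_score": 0.91},
]

def is_index_worthy(listing):
    """Illustrative quality gate: enough unique content to justify indexing."""
    return (
        listing["word_count"] >= 150
        and listing["images"] >= 1
        and listing["duplicate_score"] < 0.8
    )

urlset = Element("urlset", xmlns="http://www.sitemaps.org/schemas/sitemap/0.9")
for listing in filter(is_index_worthy, listings):
    url = SubElement(urlset, "url")
    SubElement(url, "loc").text = listing["url"]

print(tostring(urlset, encoding="unicode"))
```

The same gate can also drive meta robots tags, so listings that fail the check are noindexed rather than merely left out of the sitemap.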

Case 2: Unforeseen consequences when crawling stops

When Google stops crawling certain pages, unwanted content can linger in its index even after those pages are changed, redirected or deleted.

Many websites use redirects instead of 404 errors for removed content to maintain authority. This tactic can squeeze additional traffic from ghost pages for months, if not years.
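As a hedged sketch of how this tactic is often implemented: removed listings with inbound links or residual traffic get a 301 to the closest relevant page, while the rest return a 410 so crawlers can drop them quickly. The Flask app, route and thresholds below are purely illustrative, not any particular site’s stack:

```python
from flask import Flask, redirect, abort

app = Flask(__name__)

# Hypothetical lookup table; in practice this comes from analytics and the catalog.
REMOVED_LISTINGS = {"12345": {"category_url": "/c/handmade-lamps", "inbound_links": 18}}

@app.route("/listing/<listing_id>")
def listing(listing_id):
    removed = REMOVED_LISTINGS.get(listing_id)
    if removed is None:
        abort(404)  # never existed (live listings are handled elsewhere)
    if removed["inbound_links"] > 0:
        # Preserve authority: send users and crawlers to the closest relevant page.
        return redirect(removed["category_url"], code=301)
    # No equity worth keeping: tell crawlers the page is permanently gone.
    return "Gone", 410
```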

However, this can sometimes go horribly wrong.

For example, a well-known global marketplace for handmade products accidentally exposed sellers’ private information (e.g., names, addresses, email addresses, phone numbers) on localized versions of its listing pages. Some of these pages were indexed and cached by Google, displaying personally identifiable information (PII) in search results and compromising users’ security and privacy.

Because Google wasn’t recrawling these pages, deleting or updating them didn’t remove them from the index. Even months after deletion, the cached content and user PII persisted in Google’s index.

In a situation like this, it was the marketplace’s responsibility to correct the mistakes and work directly with Google to remove sensitive content from Search.

Case 3: The risks of over-indexing search results pages

Uncontrolled indexing of large volumes of thin, low-quality pages can backfire, but what about indexing search result pages?

Google does not endorse indexing internal search results, and many experienced SEOs would advise against this tactic. However, many large sites have relied heavily on internal search as the primary driver of SEO, often with substantial returns.

If the metrics of user engagement, page experience, and content quality are high enough, Google may turn a blind eye. In fact, there is enough evidence to suggest that Google might even prefer a high-quality internal search results page to a thin listing page.

However, this strategy can also go wrong.

I once saw a local auction site lose rankings for a significant portion of its search pages, and over a third of its SEO traffic, overnight.

The 80/20 rule applies here: a small share of top terms accounts for the majority of SEO visits to indexed search pages. However, it’s often the long tail that makes up the lion’s share of URL volume and boasts some of the highest conversion rates.

As a result, few of the sites that use this tactic impose limits or strict rules on the indexing of search pages.

This raises two main problems:

- Any search query can generate a valid page, which means an infinite number of pages can be generated automatically.
- All of them are indexable in Google.

In the case of a classified ad marketplace that monetized its search pages with third-party ads, this vulnerability was well exploited through a form of ad arbitrage:

- A large number of search URLs were generated for shady, adult and outright illegal terms.
- Although these auto-generated pages returned no actual inventory, they served third-party ads and were optimized, through the page template and metadata, to rank for the targeted search queries.
- Backlinks to these pages were created on low-quality forums so that they would be discovered and crawled by bots.
- Users landing on these pages from Google clicked the third-party ads and proceeded to the low-quality sites that were the intended destination.

By the time the scheme was discovered, the site’s overall reputation had been damaged. It was also hit with several penalties and suffered massive drops in SEO performance.

Adopt managed indexing

How could these problems have been avoided?

One of the best ways for large business sites to thrive in SEO is to scale down through managed indexing.

For a site with tens or hundreds of millions of pages, it’s crucial to move beyond a keyword-focused approach to one focused on data, rules and automation.

Data-driven indexing

A major advantage of large sites is the wealth of internal search data available to them.

Instead of relying on external tools, they can use this data to understand regional and seasonal search trends and demand at a granular level.

This data, when mapped to your existing content inventory, can provide strong guidance on what content to index, as well as when and where to index it.
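Here is a minimal sketch, under assumed field names and thresholds, of mapping internal search demand onto an existing page inventory to decide what deserves to be indexable. In reality, the demand table would come from your own search logs and be far more granular (region, season, language):

```python
import csv
from collections import defaultdict

# Hypothetical internal search log export: columns query, region, searches.
demand = defaultdict(int)
with open("internal_search_log.csv", newline="") as f:
    for row in csv.DictReader(f):
        demand[(row["query"], row["region"])] += int(row["searches"])

# Hypothetical mapping of existing search/category pages to the query they target.
inventory = {"/s/vintage-lamps?region=de": ("vintage lamps", "de")}

MIN_MONTHLY_SEARCHES = 200  # illustrative cutoff

for url, key in inventory.items():
    decision = "index" if demand.get(key, 0) >= MIN_MONTHLY_SEARCHES else "noindex"
    print(url, decision)
```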

Deduplicate and consolidate

A small number of authoritative, high-ranking URLs is much more valuable than a large volume of pages spread across the top 100.

It pays to consolidate similar pages using canonicals, leveraging rules and automation to do so. Some pages can be consolidated based on similarity scores, others grouped together if they rank collectively for similar queries.

The key here is experimentation. Adjust logic and review thresholds over time.
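As a rough sketch of that rule-driven consolidation, the snippet below scores page similarity with a simple token-level Jaccard measure and points near-duplicates at a canonical target once a threshold is crossed. Both the similarity signal and the 0.8 cutoff are illustrative stand-ins for whatever your experiments settle on:

```python
def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard similarity as a simple, illustrative signal."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

# Hypothetical page inventory: URL -> title (could also be attributes or ranking queries).
pages = {
    "/l/red-ceramic-vase-30cm": "Red ceramic vase 30 cm handmade",
    "/l/red-ceramic-vase-handmade": "Handmade red ceramic vase 30 cm",
    "/l/blue-glass-bowl": "Blue glass bowl",
}

SIMILARITY_THRESHOLD = 0.8  # review and adjust over time
canonical_of = {}

urls = list(pages)
for i, url_a in enumerate(urls):
    for url_b in urls[i + 1:]:
        if jaccard(pages[url_a], pages[url_b]) >= SIMILARITY_THRESHOLD:
            # Point the later URL at the earlier one; real logic would pick the stronger page.
            canonical_of[url_b] = url_a

print(canonical_of)
```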

Clean thin and empty content pages

When present in large volumes, thin and empty pages can cause significant damage to site hygiene and performance.

If it’s too difficult to enrich them with valuable content or consolidate them, they shouldn’t be indexed, or even exist at all.
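One hedged way to enforce this at scale is to attach a noindex directive automatically whenever a page falls below a minimum content threshold, instead of relying on anyone to flag thin pages by hand. The thresholds and function below are illustrative assumptions:

```python
MIN_ITEMS = 3                 # illustrative threshold for "enough content to be useful"
MIN_DESCRIPTION_CHARS = 120   # illustrative minimum for unique descriptive text

def robots_header(item_count: int, description: str) -> dict:
    """Return the X-Robots-Tag header to attach to a page response.

    Thin or empty pages get noindex so they never enter (or eventually drop out of)
    the index; healthy pages stay indexable by default.
    """
    if item_count < MIN_ITEMS or len(description) < MIN_DESCRIPTION_CHARS:
        return {"X-Robots-Tag": "noindex, follow"}
    return {}

# Example: an empty category page gets noindexed, a well-stocked one does not.
print(robots_header(item_count=0, description=""))
print(robots_header(item_count=12, description="A" * 300))
```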

Reduce infinite spaces with robots.txt

Fifteen years after Google first wrote about “infinite spaces”, the problem of over-indexing filters, sorting and other combinations of parameters continues to plague many e-commerce sites.

In extreme cases, crawlers can crash servers while trying to work their way through these links. Fortunately, this can be easily fixed using robots.txt.
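An illustrative robots.txt excerpt showing the idea: block the filter, sort and other parameter combinations that create infinite spaces while keeping canonical category paths crawlable. The parameter names are placeholders; yours will differ:

```
User-agent: *
# Block filter/sort parameter combinations that create infinite spaces
Disallow: /*?*sort=
Disallow: /*&sort=
Disallow: /*?*filter=
Disallow: /*&filter=
Disallow: /*?*price_min=
# Keep canonical category pages crawlable
Allow: /category/
```

Keep in mind that robots.txt prevents crawling, not indexing; parameter URLs that are already indexed may need a noindex or removal before being blocked.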

Client-side rendering

It may be an option to use client-side rendering for certain page components that you don’t want search engines to index. Consider it carefully.

Better yet, these components should be inaccessible to logged out users.

The stakes increase dramatically as the scale increases

Although SEO is often perceived as a “free” source of traffic, this is somewhat misleading. It costs money to host and publish content.

The costs may be negligible per URL, but once the scale reaches hundreds of millions or billions of pages, the pennies start to add up to real numbers.
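For a purely illustrative order of magnitude: at a hypothetical all-in cost of $0.01 per URL per year for crawling, rendering, storage and serving, 500 million low-value URLs would burn roughly $5 million annually, before counting any ranking or crawl-budget benefits from trimming them.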

While SEO ROI is tricky to measure, a penny saved is a penny earned, and cost savings through managed crawling and indexing should be a factor when weighing indexing strategies for large sites.

A pragmatic approach to SEO, with well-managed crawling and indexing, guided by data, rules and automation, can protect large websites from costly mistakes.

The views expressed in this article are those of the guest author and not necessarily Search Engine Land.
