In a LinkedIn post, Google analyst Gary Illyes reiterated long-standing guidance for website owners: use robots.txt to prevent web crawlers from accessing URLs that trigger actions such as adding items to carts or wish lists.
Illyes highlighted a common complaint: servers overloaded by unnecessary crawler traffic, which often stems from search engine bots crawling URLs intended for user actions.
He wrote:
“Looking at what we’re crawling from complaint sites, too often these are action URLs like ‘add to cart’ and ‘add to wishlist’. They’re useless to crawlers, and you probably don’t want them crawled.”
To avoid this wasted server load, Illyes advised using robots.txt to block crawler access to URLs with parameters like “?add_to_cart” or “?add_to_wishlist.”
As an example, he suggested:
“If you have URLs like:
https://example.com/product/scented-candle-v1?add_to_cart
and
https://example.com/product/scented-candle-v1?add_to_wishlist
You should probably add a disallow rule for them in your robots.txt file.”
While using the HTTP POST method can also prevent these URLs from being crawled, Illyes noted that crawlers can still make POST requests, so robots.txt remains the recommended safeguard.
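For illustration, a minimal robots.txt sketch along the lines of Illyes’ example could look like this. The exact Disallow patterns are an assumption based on the URLs quoted above, and the * wildcard relies on Google’s documented support for wildcard matching in robots.txt rules:

```
User-agent: *
# Keep crawlers away from action URLs that only add items to carts or wishlists
Disallow: /*?add_to_cart
Disallow: /*?add_to_wishlist
```

If the action parameter can also appear after other query parameters (for example, ?variant=2&add_to_cart), a broader pattern such as Disallow: /*add_to_cart would be needed, so any rule is worth testing before it goes live.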
Reinforcing decades of best practices
Thread contributor Alan Perkins noted that this guidance echoes web standards introduced in the 1990s for the same reasons.
Perkins quoted the 1994 document “A Standard for Robot Exclusion”:
“In 1993 and 1994 there have been occasions when robots have visited WWW servers where they were not welcome for various reasons … robots traversed parts of WWW servers that were not suitable, for example, very deep virtual trees, duplicated information, temporary information or cgi-scripts with side effects (such as voting).”
The robots.txt standard, which proposes rules that well-behaved crawlers are expected to follow, emerged as a “consensus” solution among web stakeholders in 1994.
Obedience and exceptions
Illyes stated that Google’s crawlers fully obey robots.txt rules, with rare, well-documented exceptions for scenarios involving “contractual or user-triggered fetches.”
This adherence to the robots.txt protocol has been a mainstay of Google’s web crawling policies.
Why SEJ cares
While the advice may seem rudimentary, the resurgence of this decades-old best practice underscores its relevance.
By leveraging the robots.txt standard, sites can stop bandwidth-hungry crawlers from wasting resources on unproductive requests.
How this can help you
Whether you have a small blog or a major e-commerce platform, following Google’s advice on leveraging robots.txt to block crawler access to action URLs can help in several ways:
Reduced server load: You can reduce unnecessary server requests and bandwidth usage by preventing crawlers from reaching URLs that invoke actions such as adding items to carts or wishlists.
Improved crawler efficiency: Giving more explicit rules in the robots.txt file about which URLs crawlers should avoid can result in more efficient crawling of the pages/content you want to index and rank.
Better user experience: With server resources focused on actual user actions instead of wasted crawler visits, end users are likely to experience faster load times and smoother functionality.
Stay aligned with standards: Implementing this guidance keeps your site aligned with the widely adopted robots.txt protocol, which has been an industry best practice for decades.
Revising robots.txt directives could be a simple but impactful step for websites looking to exert more control over crawler activity.
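One way to test revised directives before deploying them is a short offline check. The sketch below is an illustration rather than part of Illyes’ advice: it assumes the third-party Protego library (a Python robots.txt parser that supports the wildcard syntax used above) and a hypothetical crawler name, and it confirms that the example action URLs are blocked while the plain product page stays crawlable:

```python
# Sketch: verify that the assumed disallow rules block action URLs but not product pages.
# Requires the third-party Protego parser: pip install protego
from protego import Protego

ROBOTS_TXT = """
User-agent: *
Disallow: /*?add_to_cart
Disallow: /*?add_to_wishlist
"""

rules = Protego.parse(ROBOTS_TXT)

# The example action URLs should be disallowed for crawlers ...
print(rules.can_fetch("https://example.com/product/scented-candle-v1?add_to_cart", "ExampleBot"))      # expected: False
print(rules.can_fetch("https://example.com/product/scented-candle-v1?add_to_wishlist", "ExampleBot"))  # expected: False

# ... while the product page itself should remain crawlable.
print(rules.can_fetch("https://example.com/product/scented-candle-v1", "ExampleBot"))                  # expected: True
```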
Illyes’ messaging indicates that the old robots.txt rules are still relevant in our modern web environment.
Featured Image: BestForBest/Shutterstock