During the US holiday, several posts circulated about an alleged leak of data related to Google rankings. The early posts focused on “confirming” beliefs long held by Rand Fishkin, but paid little attention to the context of the information and what it actually means.
Context Matters: Document AI Warehouse
The leaked document is related to a public Google Cloud platform called Document AI Warehouse, which is used to analyze, organize, search, and store data. The public documentation is titled Document AI Warehouse overview. A post on Facebook states that the “leaked” data is the “internal version” of the publicly visible Document AI Warehouse documentation. That is the context of this data.
Screenshot: Document AI Warehouse
@DavidGQuaid tweeted:
“I think it’s clear that it’s an external API for building a document store as the name suggests”
This seems to throw cold water on the idea that the “leaked” data represents internal information from Google Search.
As far as we know at this point, the “leaked data” shares a similarity with what is on the public Document AI Warehouse page.
Internal search data leak?
The original post on SparkToro does not say that the data comes from Google Search. It says that the person who sent the data to Rand Fishkin is the one who made that claim.
One of the things I admire about Rand Fishkin is that he is meticulously precise in his writing, especially when it comes to caveats. Rand specifically points out that it is the person who provided the data who claims that the data comes from Google Search. There is no proof, just a claim.
He writes:
“I received an email from someone claiming to have access to a massive leak of API documentation from Google’s search division.”
Fishkin himself does not claim that former Googlers confirmed the data came from Google Search. He writes that the person who emailed the data made that claim.
“The email further claimed that these leaked documents were confirmed as authentic by former Google employees and that these former employees and others had shared additional and private information about Google’s search operations.”
Fishkin writes about a subsequent video meeting in which the leaker revealed that his contact with ex-Googlers took place when he met them at a search industry event. Again, we have to take the leaker’s word about the ex-Googlers, and that what they said came after carefully reviewing the data rather than being a casual comment.
Fishkin writes that he contacted three ex-Googlers about it. What is notable is that these ex-Googlers did not explicitly confirm that the data is internal to Google Search. They only confirmed that the data resembles Google’s internal information, not that it originates from Google Search.
Fishkin writes what the ex-Googlers told him:
“I did not have access to this code when I worked there. But this certainly looks legit.”
“It has all the features of an internal Google API.”
“It’s a Java-based API. And someone spent a lot of time adhering to Google’s internal standards for documentation and naming.”
“I’d need more time to be sure, but this matches the internal documentation I’m aware of.”
“Nothing I’ve seen in a brief review suggests that this is not legitimate.”
Saying something comes from Google Search and saying it comes from Google are two different things.
Keep an open mind
It is important to keep an open mind about the data because there are many things that are not confirmed. For example, it is not known whether this is an internal document of the search team. That’s why it’s probably not a good idea to take any of this data as actionable SEO advice.
Furthermore, it is not advisable to analyze the data to specifically confirm long-held beliefs. This is how one gets caught up in confirmation bias.
A definition of confirmation bias:
“Confirmation bias is the tendency to search for, interpret, favor, and remember information in a way that confirms or supports prior beliefs or values.”
Confirmation bias will lead a person to deny things that are empirically true. For example, there’s the decades-old idea that Google automatically prevents a new site from ranking, a theory called the Sandbox. Yet every day, people report that their new sites and new pages rank almost immediately in the top ten of Google Search.
But if you are a die-hard believer in the Sandbox, then actual observable experience like this will be dismissed, no matter how many people observe the opposite.
Brenda Malone, technical SEO strategist and freelance web developer (LinkedIn profile), sent me a message about the Sandbox theory:
“I personally know from real experience that the Sandbox theory is wrong. I just indexed a personal blog with two posts in two days. There is no way a small two post site could have been indexed under the Sandbox theory.”
The bottom line here is that even if the documentation turns out to be from Google Search, the wrong way to analyze the data is to go looking for confirmation of long-held beliefs.
What is the Google data leak?
There are five things to keep in mind about the leaked data:
1. The context of the leaked information is unknown. Is it related to Google Search? Is it for other purposes?
2. The purpose of the data. Was the information used for actual search results? Or was it used for internal data management or manipulation?
3. The former Googlers did not confirm that the data is specific to Google Search. They only confirmed that it appears to be from Google.
4. Keep an open mind. If you go in search of vindication of long-held beliefs, guess what? You will find it everywhere. This is called confirmation bias.
5. Evidence suggests that the data is linked to an external API for building a document store.
What others are saying about the “leaked” documents
Ryan Jones, who not only has deep experience in SEO but also has a formidable understanding of IT, shared some insightful observations about the so-called data leak.
Ryan tweeted:
“We don’t know if this is for production or testing. I’m guessing it’s mainly to test potential changes.
We don’t know what is used for web or other verticals. Some things can only be used for a google home or news etc.
We don’t know what is an input to an ML algo and what is used to train it. I assume that clicks are not a direct input, but are used to train a model on how to predict clickability. (Out of trend impulses)
I also assume that some of these fields only apply to the training datasets and not to all sites.
Am I saying Google didn’t lie? Not entirely. But let’s examine this leak dispassionately and not with any preconceived bias.”
@DavidGQuaid tweeted:
“We also don’t know if this is for Google Search or Google Cloud Document Retrieval
The APIs seem to pick and choose, that’s not how I expect the algorithm to run; what if an engineer wants to skip all these quality checks, it looks like I want to build a content warehouse application for my business knowledge base.”
Is the “leaked” data related to Google search?
At this time there is no firm evidence that this “leaked” data is actually from Google Search. There is an overwhelming amount of ambiguity about what the purpose of the data is. It should be noted that there are indications that this data is only “an external API to create a document store as the name suggests” and is in no way related to the ranking of websites in Google Search.
The conclusion that this data doesn’t come from Google Search isn’t definitive at this point, but that’s where the wind of evidence seems to be blowing.
Featured image by Shutterstock/Jaaak