Google published a research paper on a new type of dataset for training a language model to retrieve sentences that precisely answer a question within an open dialogue.
We don’t know whether Google is using this dataset. But the researchers say models trained on it outperform models trained on other datasets.
Many research papers, such as the one published for LaMDA, do not mention specific contexts in which the research might be used.
For example, the LaMDA research paper (PDF) vaguely concludes:
“LaMDA is a step closer to practical and safe open-ended dialogue systems, which in turn can unlock a wide range of useful applications.”
This research paper states that the problem being solved is how to create a dataset for training a machine to select a sentence from a web page as a turn in an open dialogue.
Why this dataset is important
What makes this research paper interesting is that the researchers conclude it could be used to ground generative AI, such as that seen in Google’s new generative search experience.
Since the research paper was presented at an information retrieval conference (Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval), it’s fairly safe to assume that this algorithm is related to information retrieval, which means search.
One last thing to note is that the research on this new type of dataset was presented in 2022, but it seems to have gone unnoticed… until now.
What Google set out to achieve with the new dataset
The researchers explain what they are focusing on:
“In this paper we focus on open dialogues: two parties take turns conversing on any number of topics, with no restrictions on topic changes or the type of discussion on each topic.
Also, the dialogue is not grounded in a specific document, in contrast to the setting used in some previous work…
The task at hand is to retrieve sentences from some document corpus that contain useful information for generating (either automatically or by humans) the next turn of the dialogue.
We note that dialogue turns can be questions, queries, arguments, statements, etc.”
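To make the task concrete, here is a minimal Python sketch of the retrieval problem as the paper frames it. The function name, the tiny corpus, and the term-overlap scoring are all illustrative stand-ins, not anything from the paper:

```python
from typing import List

def retrieve_candidates(dialogue_turns: List[str],
                        corpus_sentences: List[str],
                        top_k: int = 50) -> List[str]:
    """Return the top_k corpus sentences most likely to contain useful
    information for generating the next turn of the dialogue."""
    context = set(" ".join(dialogue_turns).lower().split())

    def score(sentence: str) -> int:
        # Naive term-overlap score; a real system would use BM25 or a neural ranker.
        return len(context & set(sentence.lower().split()))

    return sorted(corpus_sentences, key=score, reverse=True)[:top_k]

# The last turn here is a question, but the task also covers statements.
dialogue = [
    "I started keeping backyard chickens.",
    "Nice! Which came first, the chicken or the egg?",
]
corpus = [
    "The egg was laid by a bird that was not a chicken.",
    "Chickens are domesticated birds.",
    "Paris is the capital of France.",
]
print(retrieve_candidates(dialogue, corpus, top_k=2))
```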
A new type of dataset for training language models
The problem the researchers are solving is how to retrieve a sentence from a web page as an answer to an open-ended question, a type of question that needs more than a yes or no answer.
The research paper explains that what’s missing to make this capability happen on a machine is an adequate set of conversational data.
They explain that existing datasets are used for two purposes:
1. To evaluate dialogue responses from a generative AI, but not to train it to retrieve information relevant to that response.
2. For use by a search or question-answering engine, focused on a single question-and-answer passage.
They explain the shortcomings of existing datasets:
“…in most of these datasets, the returned search results are not viewed as part of the dialogue.
…in both conversational passage retrieval and conversational question answering (QA) datasets, there is a user who asks questions or issues queries reflecting explicit intents with information needs, as opposed to natural dialogues where intents may be represented only implicitly, for example, in affirmative statements.
In summary, existing conversational datasets do not combine natural conversations between humans with relevance annotations for sentences retrieved from a large document corpus.
So we built this dataset…”
How the new dataset was created
The researchers created a dataset that can be used to train an algorithm to retrieve the sentence that correctly answers a turn in an open dialogue.
The dataset consists of Reddit conversations that were matched to Wikipedia answers, plus human annotations (relevance ratings) of those dialogue-sentence pairs.
The Reddit data was downloaded from Pushshift.io, an archive of Reddit conversations (see Pushshift’s frequently asked questions).
The research paper explains:
“To address a broader scope of this task where any type of dialogue can be used, we built a dataset that includes open dialogues from Reddit, candidate sentences from Wikipedia for each dialogue, and human annotations for the sentences.
The dataset includes 846 dialogues created from Reddit threads.
For each dialogue, 50 sentences were retrieved from Wikipedia using an unsupervised seed retrieval method.
These sentences were judged by crowd workers for relevance, that is, whether they contained useful information for generating the next turn in the dialogue.”
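The passage above doesn’t spell out the unsupervised seed retrieval method, but BM25 is a common unsupervised choice for this kind of first-pass retrieval. The sketch below uses the rank_bm25 Python package purely as an illustrative stand-in, not as the paper’s actual method:

```python
# pip install rank-bm25
from rank_bm25 import BM25Okapi

# Hypothetical mini-corpus standing in for Wikipedia sentences.
corpus = [
    "The egg was laid by a bird that was not a chicken.",
    "Domestic chickens have been around for about 10,000 years.",
    "Paris is the capital of France.",
]
tokenized = [sentence.lower().split() for sentence in corpus]
bm25 = BM25Okapi(tokenized)

query = "which came first the chicken or the egg".split()
scores = bm25.get_scores(query)
# Keep the top candidates (the paper keeps 50 per dialogue).
top = sorted(zip(corpus, scores), key=lambda x: x[1], reverse=True)[:2]
print(top)
```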
The dataset they created is available on GitHub.
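Based on the description above, each record pairs a Reddit dialogue with candidate Wikipedia sentences and crowd relevance judgments. The shape below is an illustrative guess at the structure, not the actual schema published on GitHub:

```python
# Illustrative record shape; field names are assumptions, not the GitHub schema.
record = {
    "dialogue_id": "reddit-0001",
    "turns": [
        "Which came first, the chicken or the egg?",
    ],
    "candidates": [
        {
            "sentence": "The egg was laid by a bird that was not a chicken.",
            "source": "wikipedia",
            "relevance": 1,  # crowd judgment: useful for the next turn
        },
        {
            "sentence": "Domestic chickens have been around for about 10,000 years.",
            "source": "wikipedia",
            "relevance": 0,  # judged not useful for generating the next turn
        },
        # ...50 candidates per dialogue in the actual dataset
    ],
}
```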
Example dialogue question:
“Which came first, the chicken or the egg?”
An example of an irrelevant response:
“Domestic chickens have been around for about 10,000 years. Eggs have been around for hundreds of millions of years.”
An example of a relevant web page sentence that could be used in an answer:
“In Neil deGrasse Tyson’s simpler words:
‘What came first: the chicken or the egg? The egg laid by a bird that was not a chicken’”.
Retrieval methodology
For the retrieval part, they cite previous research on language models and other methods, and opt for a weak supervision approach.
They explain:
“Fine-tuning retrieval models requires relevance labels for training examples in a target task.
These are sometimes scarce or unavailable.
One approach to address this is to automatically generate labels and train a weakly supervised model on these annotations.
…We follow the weak supervision paradigm in training our model, with a novel weak annotator of Reddit dialogues for retrieval in a dialogue context.”
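To illustrate the idea of a weak annotator, here is a minimal sketch in which the actual next turn of a Reddit thread serves as a noisy relevance signal for candidate sentences. The Jaccard-overlap heuristic is an assumption made for illustration, not the paper’s actual annotator:

```python
def weak_label(candidate: str, actual_next_turn: str, threshold: float = 0.2) -> int:
    """Assign a noisy relevance label with no human annotation:
    a candidate counts as relevant if it overlaps enough with the
    turn that actually followed in the Reddit thread."""
    cand = set(candidate.lower().split())
    nxt = set(actual_next_turn.lower().split())
    if not cand or not nxt:
        return 0
    overlap = len(cand & nxt) / len(cand | nxt)  # Jaccard similarity
    return 1 if overlap >= threshold else 0

print(weak_label(
    "The egg was laid by a bird that was not a chicken.",
    "Technically the egg, laid by a bird that wasn't quite a chicken.",
))
```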
Is the dataset successful?
Google and other organizations publish many research papers that demonstrate varying levels of success.
Some research concludes with limited success, moving the state of the art only slightly, if at all.
The research papers that are most interesting to me are those that are clearly successful and exceed the current state of the art.
That is the case with the development of this dataset for training a language model to retrieve sentences that serve precisely as a turn in an open dialogue.
They show that a BERT model fine-tuned on this dataset becomes even more powerful.
They write:
“In fact, while RANKBERTMS outperforms all non-fine-tuned models, the RANKBERTMS→R model, which was further fine-tuned with our weakly supervised training set, outperforms it.
This method achieves the highest performance, with all performance gains over other methods being statistically significant.
This finding also demonstrates the effectiveness of our weak annotator and weakly supervised training set, showing that performance can be improved without manual annotation for training.”
Elsewhere, researchers report:
“We show that a neural ranker fine-tuned on our weakly supervised training set outperforms all other tested models, including a neural ranker fine-tuned on the MS Marco passage retrieval dataset.”
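For context on what a neural ranker fine-tuned on MS Marco looks like in practice, here is a small sketch using a publicly available cross-encoder from the sentence-transformers library. It is a stand-in for the baseline being compared against, not the paper’s model, and the further fine-tuning on the weakly supervised set (the paper’s contribution) is omitted:

```python
# pip install sentence-transformers
from sentence_transformers import CrossEncoder

# A public MS MARCO cross-encoder standing in for the baseline ranker;
# the paper's gains come from further fine-tuning on its weak labels.
model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

dialogue = "Which came first, the chicken or the egg?"
candidates = [
    "The egg was laid by a bird that was not a chicken.",
    "Paris is the capital of France.",
]
# Score each (dialogue, candidate sentence) pair and rank by score.
scores = model.predict([(dialogue, c) for c in candidates])
ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
for sentence, score in ranked:
    print(f"{score:.3f}  {sentence}")
```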
They also write that, however successful this approach is, they are interested in improving the state of the art even further.
The research paper concludes:
“In future work, we would like to devise BERT-based retrieval models that are trained with weak supervision only, using a pre-trained BERT, without the need for large annotated training sets such as MS Marco.
We would also like to ground generative language models with our retrieval models and study the conversations that emerge from this grounding.”
Could this approach be used?
Google rarely confirms when specific research is being used. There are some cases, such as with BERT, where Google has confirmed it is in use.
But in general, the standard answer is that just because Google publishes a research paper or patent doesn’t mean it’s using it in its search algorithm.
That said, the research paper, which dates to mid-2022, indicated that a future direction was to study how generative language models (like Bard and Google’s generative search experience) can be grounded with it.
A generative AI chat experience can produce made-up output, which is technically known as hallucinating.
Grounding means anchoring the AI chat output with facts, usually from online sources, to help prevent hallucinations.
Bing uses a system called Bing Orchestrator that checks web pages to ground GPT’s output in facts.
Grounding the AI output helps keep it fact-based, something this dataset may be able to do, in addition to supporting the selection of sentences from web pages as part of an answer.
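As a rough illustration of grounding, here is a minimal sketch that builds a prompt anchoring a model’s next turn in retrieved sentences. The prompt format is hypothetical; production systems like Bing Orchestrator are far more involved:

```python
from typing import List

def grounded_prompt(dialogue_turns: List[str], retrieved: List[str]) -> str:
    """Build a prompt that anchors the model's next turn in retrieved facts.
    The format below is illustrative; real systems vary widely."""
    facts = "\n".join(f"- {s}" for s in retrieved)
    history = "\n".join(dialogue_turns)
    return (
        "Use only the facts below when answering.\n"
        f"Facts:\n{facts}\n\n"
        f"Dialogue so far:\n{history}\n"
        "Next turn:"
    )

prompt = grounded_prompt(
    ["Which came first, the chicken or the egg?"],
    ["The egg was laid by a bird that was not a chicken."],
)
print(prompt)  # This string would be passed to a generative model.
```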

Read the research paper:
Abstract web page: A dataset for sentence retrieval for open dialogs
Actual research paper: A dataset for sentence retrieval for open dialogs