Propose a new meta tag for LLM/AI

While Google is opening the discussion on giving credit and complying with copyright when training large language models (LLMs) for generative AI products, the focus has been on the robots.txt file.

However, in my opinion, this is the wrong tool to look at.

My former colleague Pierre Far wrote an excellent article on crawlers, search engines and the sleaze of generative AI companies, in which he highlighted some of the immense challenges facing the online publishing industry today. Similar to his article, I will keep this proposal at a high level, as developments in this field are extremely fast.

Why not use robots.txt?

There are a few reasons why using robots.txt is the wrong starting point for the discussion of how to respect publishers’ copyrights.

Not all LLMs use crawlers that identify themselves

The burden is on the website operator to identify and block the individual crawlers that may use and/or sell their data for generative AI products. This creates a lot of extra (and unnecessary) work, especially for smaller publishers.

This also assumes that the publisher has edit access to their robots.txt file, which is not always the case with hosted solutions.

This is not a sustainable solution as the number of crawlers continues to grow

The usable file size of a robots.txt file is limited to 500 KiB, according to the newly proposed robots.txt standard.

This means that a large publisher may run into the limits of their robots.txt file if they need to block many LLM crawlers and/or fine-grained URL patterns in addition to other robots.
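For illustration, blocking AI crawlers via robots.txt means enumerating each one by its user-agent token. The tokens below are representative examples; every newly discovered crawler requires yet another entry:

```
# Each AI crawler must be blocked individually by its user-agent token.
User-agent: CCBot
Disallow: /

User-agent: GPTBot
Disallow: /

# ...and so on for every crawler the publisher discovers,
# on top of any existing rules for search engine robots.
```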

An “all or nothing” approach is unacceptable

For larger crawlers like Googlebot and Bingbot, no distinction can be made between the data used for search engine results pages (traditionally governed by an implicit “agreement” between the publisher and the search engine, in the form of attribution linking to the original source) and the data used for generative AI products.

Blocking Googlebot or Bingbot for their generative AI products also blocks any potential visibility in their respective search results. This is an unacceptable situation where the publisher is forced to choose between “all or nothing”.

Robots.txt is about managing crawling, while the copyright discussion is about how data is used

The latter deals with the indexing/processing phase. As such, robots.txt isn’t really relevant to this discussion; it is a last resort if nothing else works and really shouldn’t be the starting point for this particular discussion.

Robots.txt files work fine for crawlers and do not need to be changed for LLMs. Yes, LLM crawlers need to be identified, but what we really need to talk about is the indexing/processing of crawled data.

Reinventing the wheel

Fortunately, the web already has a well-established solution that can be used to manage the use of data in terms of copyright: it’s called Creative Commons.

Most Creative Commons licenses would be fine for LLMs. To illustrate:

CC0 allows LLMs to distribute, remix, adapt and build upon the material in any medium or format without conditions.

CC BY allows LLMs to distribute, remix, adapt and build upon the material in any medium or format, as long as the creator is credited. The license allows commercial use, but credit must be given to the creator.

CC BY-SA allows LLMs to distribute, remix, adapt and build upon the material in any medium or format, as long as the creator is credited. License permits commercial use. If LLMs remix, adapt or build on the material, they must license the modified material on identical terms.

CC BY-NC permits LLMs to distribute, remix, adapt and build upon the material in any medium or format for non-commercial purposes only as long as the creator is credited.

CC BY-NC-SA permits LLMs to distribute, remix, adapt and build upon the material in any medium or format for non-commercial purposes only as long as the creator is credited. If LLMs remix, adapt or build on the material, they must license the modified material on identical terms.

CC BY-ND permits LLMs to copy and distribute the material in any medium or format in non-adapted form only as long as the creator is credited. The license allows commercial use and credit must be given to the creator, but no derivation or adaptation of the work is permitted.

CC BY-NC-ND permits LLMs to copy and distribute the material in any medium or format only in non-adapted form, for non-commercial purposes only, and provided credit is given to the creator and no derivation or adaptation of the work is permitted.

It is unlikely that the last two licenses can be used for LLMs.
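The license summaries above can be sketched as a simple lookup that an LLM pipeline might use to gate crawled content. This is an illustrative mapping only, not legal advice, and the function name and structure are my own assumptions:

```python
# Illustrative mapping of the CC licenses discussed above to the two
# permissions that matter most for LLM training: commercial use and
# the right to make derivatives. A sketch, not legal advice.
CC_RULES = {
    "CC0":         {"commercial": True,  "derivatives": True},
    "CC-BY":       {"commercial": True,  "derivatives": True},
    "CC-BY-SA":    {"commercial": True,  "derivatives": True},
    "CC-BY-NC":    {"commercial": False, "derivatives": True},
    "CC-BY-NC-SA": {"commercial": False, "derivatives": True},
    "CC-BY-ND":    {"commercial": True,  "derivatives": False},
    "CC-BY-NC-ND": {"commercial": False, "derivatives": False},
}

def usable_for_training(license_code: str, commercial: bool) -> bool:
    """Can content under this license feed an LLM (a derivative use)?"""
    rules = CC_RULES.get(license_code)
    if rules is None:
        return False  # no known license: assume all rights reserved
    if commercial and not rules["commercial"]:
        return False
    return rules["derivatives"]  # training/remixing requires derivative rights
```

Note how the last two licenses fail for any training use, while the NC variants fail only for commercial products — matching the distinctions drawn above.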

However, the first five licenses mean that LLMs must consider how they use the crawled/obtained data and ensure that they comply with the requirements set out for using publisher data, such as attribution when sharing products based on that data.

This would put the burden on the “few” LLMs of the world rather than the “many” publishers.

The first three licenses also support “traditional” use of the data, for example in search engine results where attribution/credit is given via a link to the original website, while the fourth and fifth licenses also support open source LLM research and development.

Side note: Be aware that the software companies building LLMs often use open source software themselves, where they face the same copyright licensing challenges regarding the software libraries and operating systems they use in order to avoid copyright violations at the code level. So why reinvent the wheel when we can use a similar system for the data this code processes?

Once the publisher has identified an appropriate license, that license must still be communicated. Again, this is where robots.txt seems to be the wrong approach.

Just because a page should be blocked from being crawled by search engines doesn’t mean it can’t be used or isn’t useful for LLMs. These are two different use cases.

Therefore, to separate these use cases and allow for a more refined but also easier approach for publishers, I recommend that we use a meta tag.

Meta tags are pieces of code that can be inserted at the page level, within a theme, or within the content (I know, this isn’t technically correct, but HTML is forgiving enough, and it can be used as a last resort when a publisher has limited access to the codebase). They do not require the publisher to have additional access rights beyond being able to edit the HTML of the published content.

Using a meta tag doesn’t stop crawling, just as meta noindex doesn’t. However, it allows the publisher to communicate the usage rights of the published data.

And although there are existing copyright tags that could be used – notably from Dublin Core, rights-standard (an abandoned proposal), copyright-meta (which focuses on the owner’s name rather than the license) and other attempts – the current implementation of these on some websites may conflict with what we are trying to achieve here.

So a new meta tag may be necessary, although I’m happy to reuse an existing or old one such as rights-standard as well. For this discussion, I propose the following new meta tag:

<meta name="usage-rights" content="CC-BY-SA" />

Also, I recommend that this meta tag also be supported when used in HTTP headers, just as noindex is supported with X-Robots-Tag, to help LLM crawlers better manage their crawl resources (they only need to check the HTTP headers to validate usage rights):

X-Robots-Tag: usage-rights: CC-BY-SA

This can be used in combination with other meta tags. In the example below, the page should not be used for search results, but may be used for commercial LLMs as long as credit is given to the source:

X-Robots-Tag: usage-rights: CC-BY, noindex

Note: The name “usage-rights” in the meta tag is a proposal and can be changed.

No foolproof solution

Of course, there are bad crawlers and bad actors building their own LLMs and generative AI products.

The proposed meta tag solution will not prevent the content from being used in this way, but neither will the robots.txt file.

It is important to recognize that both methods depend on acknowledgment and compliance by the companies using the data for their AI products.

Conclusion

Hopefully this article illustrates why using robots.txt to manage data usage by LLMs is, in my opinion, the wrong approach and starting point for dealing with usage and author rights in this new era of LLMs and generative AI products.

This meta tag implementation would allow publishers to specify copyright information at the page level using Creative Commons, without preventing the page from being crawled or indexed for other purposes (such as search engine results). It also allows copyright claims to be made for various uses, including LLMs, generative AI products, and potential future AI products.

The views expressed in this article are those of the guest author and not necessarily Search Engine Land. Staff authors are listed here.

