Several artificial intelligence companies are circumventing a common web standard used by publishers to block the extraction of their content for use in generative AI systems, content licensing startup TollBit has told publishers.
The letter to publishers comes amid a public dispute between AI search startup Perplexity and media outlet Forbes involving the same web standard, and amid a broader debate between tech companies and the media over the value of content in the age of generative AI.
The business media publisher publicly accused Perplexity of plagiarizing its investigative stories in AI-generated summaries without citing Forbes or seeking its permission.
A Wired investigation published this week found that Perplexity was likely circumventing efforts to block its web crawler through the Robots Exclusion Protocol, or "robots.txt," a widely accepted standard meant to specify which parts of a site may be crawled.
The News Media Alliance, a trade group representing more than 2,200 U.S.-based publishers, expressed concern about the impact that ignoring "do not crawl" signals could have on its members.
"Without the ability to opt out of mass extraction, we cannot monetize our valuable content and pay journalists. This could seriously damage our industry," said Danielle Coffey, president of the group.
TollBit, an early-stage startup, is positioning itself as an intermediary between content-hungry AI companies and publishers willing to strike licensing deals with them.
The company tracks AI traffic to publishers' websites and uses analytics to help both parties agree on the rates to be paid for the use of different types of content.
For example, publishers can choose to set higher rates for "premium content, such as the latest news or exclusive insights," the company says on its website.
It says it had 50 websites live as of May, although it has not named them.
According to TollBit's letter, Perplexity is not the only offender that appears to be ignoring robots.txt.
TollBit said its analysis indicates that "numerous" AI agents are circumventing the protocol, a standard tool used by publishers to indicate which parts of their site can be crawled.
"What this means in practical terms is that AI agents from multiple sources (not just one company) are choosing to bypass the robots.txt protocol to retrieve content from sites," TollBit wrote. "The more publisher logs we ingest, the more this pattern emerges."
The robots.txt protocol was created in the mid-1990s as a way to avoid overloading websites with web crawlers. While there is no clear legal enforcement mechanism, historically there has been widespread compliance on the web and some groups, including the News Media Alliance, say there may still be legal recourse for publishers.
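Because compliance is voluntary, the check happens on the crawler's side: a well-behaved crawler reads the site's robots.txt and skips any path it is told not to fetch. A minimal sketch of that check, using Python's standard `urllib.robotparser` module, is shown here. The robots.txt rules, the `ExampleAIBot` and `SomeOtherBot` user agents, and the `publisher.example` domain are hypothetical and serve only as illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt a publisher might serve: one named AI crawler is
# blocked from the whole site, and every other crawler is blocked only
# from /private/.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if the given rules permit `user_agent` to fetch `url`."""
    parser = RobotFileParser()
    # parse() takes the file's lines; a real crawler would first download
    # them from the site's /robots.txt.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("ExampleAIBot", "SomeOtherBot"):
        for path in ("/news/story.html", "/private/draft.html"):
            url = "https://publisher.example" + path
            print(f"{agent} {path} allowed={is_allowed(agent, url)}")
```

Nothing in the protocol itself stops a crawler from skipping this check and requesting the page anyway, which is the behavior TollBit's letter describes.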
More recently, robots.txt has become a key tool that publishers have used to block tech companies from ingesting their content for free for use in generative AI systems that can mimic human creativity and instantly summarize articles.
AI companies use content both to train their algorithms and to generate real-time summaries of information.
Some publishers, including the New York Times, have sued AI companies for copyright infringement for such uses. Others are signing licensing agreements with AI companies willing to pay for content, although the parties often disagree on the value of the materials. Many AI developers argue that they have not violated any laws by accessing them for free.
Publishers have been warning about news summaries in particular since Google launched a product last year that uses AI to create summaries in response to some search queries.
If publishers want to prevent their content from being used by Google's AI to help generate those summaries, they must use the same tool that would also prevent them from appearing in Google's search results, making them virtually invisible on the web.
Collaboration: Grupo Auge | Reuters (International).