Publishers are complaining that Google, because of its dominance in internet search, has them “between a rock and a hard place” over the use of their copyrighted output to power its artificial intelligence models.
This situation potentially gives Google an enormous advantage over its rivals, they claim, as businesses fear that blocking Google’s AI search “crawlers” would mean that they lose out on valuable traffic.
While most publishers, such as media organisations, have blocked OpenAI’s web crawler, a bot that sucks in their content to feed ChatGPT with information, they fear that barring Google’s equivalent, which supplies its Bard chatbot, would in the long term make their information harder to find and access on Google’s traditional search engine.
“We don’t want to do anything that results in a situation where we get less traffic in a world where Google combines AI and search,” one said, “so we’ve turned off the OpenAI crawler but we haven’t turned off the Google one. They have us between a rock and a hard place.”
Towards the end of last year, Google said it would split its crawlers, so that publishers could choose whether to have their information scraped, or extracted, for its AI systems or merely for its search engine. However, Google has a new iteration of search, called search generative experience, or SGE, a hybrid of generative AI paragraphs and traditional results: this is what publishers fear will erase them from results pages should they block Google’s crawlers. Copyright-holders say they have little leverage in this dispute, because they rely so heavily on Google to drive traffic to their websites.
Owen Meredith, chief executive of the News Media Association, the trade body in Britain, said: “Individual publishers inevitably will take a commercial view on whether to opt out or not, based on their individual business model. The challenge for many publishers will be the interdependency and gatekeeper role of a small number of Big Tech platforms across every part of their business, from discoverability to advertising to operating systems. Publishers may feel exposed about how Big Tech could react if they decide to opt out.”
Google says it is very aware of the importance of generative AI returning traffic to content-makers and argues that the new function will present more possibilities for people searching for information. “As we develop LLM [large-language model]-powered features, we’ll continue to prioritise experiences that send valuable traffic to the news ecosystem,” a spokeswoman said.
“Our intent is for search generative experiences to serve as a jumping-off point for people to explore web content and, in fact, we are showing more links with SGE in search and links to a wider range of sources on the results page, creating new opportunities for content to be discovered.”
Generative AI burst into the public consciousness with the launch of ChatGPT in November 2022. Since then a handful of players, including OpenAI, backed by Microsoft, Google and Meta, have dominated the market, boasting the resources and computing power needed to build the large-language models that underpin the engines that can create everything from text to images in a human-like way.
Creative industries and technology companies worldwide are clashing over the rights to the content used to create AI. In Britain, the Intellectual Property Office was given the task of bringing representatives of the two sides together to find an agreement on the issue. The talks failed and the problem was handed back to the Department for Science, Innovation and Technology, with a response expected imminently.
In addition, there are several test cases under way in the courts that will shape the future of the debate. In the United States, The New York Times has sued OpenAI for alleged copyright infringement, claiming that the technology company used the newspaper’s information to train its artificial intelligence models without permission or compensation, undermining the paper’s business model and threatening independent journalism. In Britain, Getty Images is suing Stability AI, a London-based start-up, in the High Court over copyright, claiming that the latter had “unlawfully” scraped millions of images from its site to train its picture-creation machine.
Crawlers and the race to keep up with them
Crawlers are used primarily by search engines to systematically scan and index, or “crawl”, the content of websites across the internet (Adam MacVeigh and Katie Prescott write).
You could imagine a crawler as the librarian in an impossibly large library. They catalogue each book (webpage) and have information on the key themes and chapters (subpages within those webpages). When asked about a topic, they can search their catalogue and surface relevant pages or subpages.
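To make the librarian analogy concrete, here is a minimal sketch of the crawl-and-index loop in Python, using only the standard library. The start address and the ten-page limit are illustrative, and a real crawler would also respect robots.txt rules and rate limits.

```python
# Toy crawler: fetch a page, record its title in a small "catalogue",
# then follow its links. The start URL is illustrative.
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class PageParser(HTMLParser):
    """Collects the page title and any links found on the page."""
    def __init__(self):
        super().__init__()
        self.links, self.title, self._in_title = [], "", False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data

def crawl(start_url: str, limit: int = 10) -> dict:
    """Breadth-first crawl: returns a tiny index of url -> page title."""
    index, queue, seen = {}, [start_url], {start_url}
    while queue and len(index) < limit:
        url = queue.pop(0)
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
        except OSError:
            continue  # skip pages that fail to load
        parser = PageParser()
        parser.feed(html)
        index[url] = parser.title.strip()
        for link in parser.links:
            absolute = urljoin(url, link)
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return index

print(crawl("https://example.com"))
```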
A search engine’s index is constantly updated with new webpages and information. Newer and more popular pages might be easier to find (ranked higher in search results), while older or less relevant ones might be further back in the stacks (lower in search results).
It’s very hard to say how many websites there really are, but most sources agree that only a small portion of them are actually indexed for search; the vast majority are not.
Some steps can be taken to prevent crawling, but if a human can navigate to a page from your homepage, then a crawler can, too. There’s no ironclad method to stop this.
Sites can issue guidance to crawlers about whether their information may be crawled, and they have increasingly been doing so to tell AI companies that their content cannot be used within large-language models. Most reputable companies respect these rules put up by websites.
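In practice this guidance is usually published in a robots.txt file at the root of a website. The sketch below uses Python’s standard urllib.robotparser to show how a well-behaved crawler might check that file before fetching a page; the site address is illustrative, and the user-agent tokens (“GPTBot” for OpenAI’s crawler, “Google-Extended” for Google’s AI opt-out) are the widely reported names rather than ones confirmed by this article.

```python
# Check a site's robots.txt before crawling (site URL is illustrative).
from urllib.robotparser import RobotFileParser

robots = RobotFileParser("https://example.com/robots.txt")
robots.read()  # fetch and parse the file

page = "https://example.com/news/some-article"
for agent in ("Googlebot", "GPTBot", "Google-Extended"):
    verdict = "allowed" if robots.can_fetch(agent, page) else "blocked"
    print(f"{agent} is {verdict} from {page}")
```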
If a user repeatedly violates these guidelines, search engines may block them based on their internet protocol (IP) address, a unique number assigned to each internet-connected device: an IP that sends multiple guideline-breaking requests can be detected and temporarily blocked.
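A minimal sketch of the kind of per-IP rate limiting this describes; the window and request threshold are illustrative figures, not values used by any particular search engine.

```python
# Illustrative per-IP rate limiter: block an address that makes
# too many requests inside a sliding time window.
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 60   # look-back window (illustrative value)
MAX_REQUESTS = 100    # requests allowed per window (illustrative value)

_recent = defaultdict(deque)  # ip -> timestamps of recent requests

def allow_request(ip: str, now: float = None) -> bool:
    """Return False if this IP has exceeded the limit and should be blocked."""
    now = time.time() if now is None else now
    history = _recent[ip]
    # Drop timestamps that have fallen out of the window.
    while history and now - history[0] > WINDOW_SECONDS:
        history.popleft()
    if len(history) >= MAX_REQUESTS:
        return False
    history.append(now)
    return True
```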
Additionally, devices have a unique fingerprint, derived from a combination of hardware, device settings and software, which helps to identify if someone is making multiple requests while attempting to disguise their identity by changing their IP address through a VPN or by deleting cookies. This tracking method is used to prevent users from circumventing blocks by merely changing their IP addresses.
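A toy illustration of the fingerprinting idea: hash a few request attributes that tend to stay the same even when the IP address changes. Real systems combine far more signals, including the hardware and device settings described above; the header fields chosen here are assumptions made for the sketch.

```python
# Illustrative fingerprint: hash request attributes that persist
# across IP changes, producing a stable identifier.
import hashlib

def fingerprint(headers: dict) -> str:
    parts = [
        headers.get("User-Agent", ""),
        headers.get("Accept-Language", ""),
        headers.get("Accept-Encoding", ""),
        # Real systems also mix in hardware and browser signals
        # gathered client-side; omitted here for brevity.
    ]
    return hashlib.sha256("|".join(parts).encode("utf-8")).hexdigest()

# Two requests from different IPs but identical headers yield the same ID.
request_headers = {"User-Agent": "ExampleBot/1.0",
                   "Accept-Language": "en-GB",
                   "Accept-Encoding": "gzip"}
print(fingerprint(request_headers))
```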
However, there are services that can cycle both IP and fingerprint to avoid detection. There’s a constant technological arms race between scrapers and content creators.
Ultimately, if someone wants your content, there will be a way to obtain it.