Around 245 global news organisations across nine countries are attempting to block the Internet Archive’s crawlers. These automated software bots capture and archive content from web pages, which is then displayed in the Internet Archive’s public-facing interface, the Wayback Machine.

The Archive holds over one trillion web pages dating all the way back to 1996, making it one of the biggest collective public information resources in the world. This includes past articles from major news organisations such as CNN, The New York Times, The Guardian, and USA Today.

These web pages serve a variety of purposes: historians use them as primary sources, for example, and they can show how an article was changed after publication.

Several news organisations are pushing to block the crawlers because AI companies are using the contents of the Archive to train Large Language Models (LLMs) without offering fair payment or acquiring permission.

More than 20 major news organisations already block ia_archiverbot, the main web crawler the Internet Archive uses for the Wayback Machine, according to an analysis by AI-detection company Originality AI.

More broadly, 241 global news sites block at least one of the Archive’s four crawling bots. A large share of these sites are owned by USA Today Co, the US’s biggest newspaper publisher. This means that hundreds of local publications have been practically removed from the historical record.
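As a rough illustration of how such a block can be detected, the sketch below checks a site's robots.txt rules against a list of crawler names. It assumes the block is expressed in robots.txt; the sample rules, the `blocked_agents` helper and the `news.example` address are made up for illustration, and only `ia_archiverbot` comes from the analysis cited above.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, not taken from any real news site.
SAMPLE_ROBOTS_TXT = """\
User-agent: ia_archiverbot
Disallow: /

User-agent: *
Allow: /
"""


def blocked_agents(robots_txt: str, agents: list[str], url: str) -> list[str]:
    """Return the crawler names that the given rules forbid from fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [agent for agent in agents if not parser.can_fetch(agent, url)]


# "examplebot" is a placeholder for any crawler without a rule of its own.
result = blocked_agents(
    SAMPLE_ROBOTS_TXT,
    ["ia_archiverbot", "examplebot"],
    "https://news.example/2024/story",
)
print(result)
```

A robots.txt rule is a request rather than an enforcement mechanism, so a check like this only shows what a publisher has asked crawlers to do.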

The risks of archival content being used to train AI

Archival news content provides massive quantities of high-quality text and images for training large-scale AI models to write in a more human-like way. This content is available through URLs and an API (application programming interface), which allows different software systems to communicate with each other and request data, acting as a bridge between them.

This makes it even easier for AI companies to access archived data and train models.
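To show what this kind of programmatic access looks like, here is a minimal sketch that builds a request to the Wayback Machine's public availability endpoint and reads the closest snapshot from a response. The helper names and the sample response are illustrative assumptions; the sketch looks up a single page and is not a description of how any AI company collects data.

```python
import json
import urllib.request
from typing import Optional
from urllib.parse import urlencode

AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def build_availability_url(page_url: str, timestamp: str) -> str:
    """Build a lookup for the snapshot closest to timestamp (YYYYMMDD)."""
    query = urlencode({"url": page_url, "timestamp": timestamp})
    return f"{AVAILABILITY_ENDPOINT}?{query}"


def closest_snapshot(payload: dict) -> Optional[str]:
    """Return the closest snapshot's URL, or None if nothing is archived."""
    closest = payload.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None


def fetch_closest_snapshot(page_url: str, timestamp: str) -> Optional[str]:
    """Perform the lookup over the network (defined here but not called)."""
    request_url = build_availability_url(page_url, timestamp)
    with urllib.request.urlopen(request_url, timeout=10) as response:
        return closest_snapshot(json.load(response))


# Illustrative response in the shape the endpoint returns; values are made up.
SAMPLE_RESPONSE = {
    "archived_snapshots": {
        "closest": {
            "available": True,
            "status": "200",
            "timestamp": "20200101000000",
            "url": "http://web.archive.org/web/20200101000000/http://example.com/",
        }
    }
}
print(closest_snapshot(SAMPLE_RESPONSE))
```

Calling `fetch_closest_snapshot("example.com", "20200101")` would perform a live lookup; the parsing step is kept separate so it can be checked without network access.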

Another advantage for AI companies is that content in the Internet Archive is already structured, attributed and dated.

Much of the Internet Archive’s data has already been found in key AI-training datasets. This leaves news organisations exposed, and some are already suing AI companies such as Perplexity and OpenAI over alleged copyright violations.

“The issue is that Times content on the Internet Archive is being used by AI companies in violation of copyright law to directly compete with us,” Graham James, a spokesperson from The New York Times newspaper, said, as cited by The Next Web.

“The Times invests an enormous amount of resources in producing original journalism, and that work should not be used without our permission.”

Other organisations, such as The Guardian, have taken a more conservative approach by limiting, rather than completely blocking, the Archive’s access.

Internet Archive maintains that it is “collateral damage”

The Wayback Machine’s director, Mark Graham, has maintained that the Archive is merely “collateral damage” and that the real culprits are the AI companies that access past content through the Archive’s interfaces.

However, the Archive has taken measures of its own to limit this. These include preventing large downloads of some site materials and limiting automated extraction in certain cases.

Graham highlighted that the Archive functions as a key method of preservation. Without it, articles that are not archived can be edited without authorisation or accountability. Such edits can range from changing or removing quotes and amending mistakes to redirecting claims and official statements.

Currently, these changes are tracked by the Wayback Machine.
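As a simple illustration of how two archived versions of an article can be compared, the sketch below lists the lines that differ between an earlier and a later capture. The article text is invented and the `changed_lines` helper is hypothetical; it is not a description of the Wayback Machine's own comparison tooling.

```python
import difflib


def changed_lines(earlier: str, later: str) -> list[str]:
    """Return removed ("- ") and added ("+ ") lines between two versions."""
    diff = difflib.ndiff(earlier.splitlines(), later.splitlines())
    # ndiff also emits "? " hint lines and unchanged lines; keep only edits.
    return [line for line in diff if line.startswith(("- ", "+ "))]


# Invented example text standing in for two captures of the same article.
EARLIER = "The minister said taxes would rise.\nThe vote is on Monday."
LATER = "The minister said taxes might rise.\nThe vote is on Monday."

for line in changed_lines(EARLIER, LATER):
    print(line)
```

Here the altered quote appears once as a removed line and once as an added line, while the unchanged sentence is left out.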

This has led some news organisations to work with the Internet Archive on acceptable compromises or workarounds that involve limiting access rather than imposing hard blocks.

Meanwhile, non-profit digital rights advocacy group Fight for the Future has launched a petition, already signed by 100 current journalists, to protest against this blocking. The group argues that the issue is especially pressing at a time when public records and history are increasingly contested.
