Why news publishers are blocking AI from accessing internet archives

Around 245 global news organisations across nine countries are attempting to block the Internet Archive’s crawlers. These automated software bots capture and archive content from web pages, which is then displayed in the Wayback Machine, the Internet Archive’s public-facing interface.

The Archive holds over one trillion web pages dating all the way back to 1996, making it one of the biggest collective public information resources in the world. This includes past articles from major news organisations such as CNN, The New York Times, The Guardian, and USA Today.

These web pages are used for a variety of purposes, for example as primary sources for historians or to show how an article was changed after publication.

Several news organisations are pushing to block the crawlers because AI companies are using the contents of the Archive to train Large Language Models (LLMs) without acquiring permission or offering fair payment.

More than 20 major news organisations already block ia_archiverbot, the main web crawler the Internet Archive uses for the Wayback Machine, according to an analysis by AI-detection company Originality AI.

More broadly, 241 global news sites block at least one of the Archive’s four crawling bots. A major share of these sites is owned by USA Today Co, the US’s biggest newspaper publisher. This means that hundreds of local publications have effectively been removed from the historical record.
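Blocks of this kind are commonly expressed in a site's robots.txt file. The sketch below is only an illustration of how such a rule is evaluated, using Python's standard urllib.robotparser. The robots.txt content, the news.example address and the second bot name are hypothetical; the user-agent string ia_archiverbot is simply the name cited above and has not been checked against the Archive's own documentation.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an imaginary news site. The crawler name is
# the one quoted in the article; everything else here is made up.
ROBOTS_TXT = """\
User-agent: ia_archiverbot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical article address used only for the check.
ARTICLE_URL = "https://news.example/2024/story.html"

print(is_allowed(ROBOTS_TXT, "ia_archiverbot", ARTICLE_URL))
print(is_allowed(ROBOTS_TXT, "SomeOtherBot", ARTICLE_URL))
```

A rule like this only asks a crawler to stay away. It works when the crawler chooses to honour it, which is why some publishers describe separate measures for limiting access.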

The risks of archival content being used to train AI

Archival news content provides massive quantities of high-quality text and images that can be used to train large-scale AI models to produce more human-like writing. This content is available through URLs and API interfaces, which allow different software systems to communicate with each other and request data, acting as a bridge between them.

This makes it even easier for AI companies to access archived data and train models.
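To make the idea of programmatic access concrete, here is a minimal sketch of a lookup against the Wayback Machine's availability endpoint. The article does not name any specific API, so the endpoint address and the response fields used below are assumptions based on the Archive's publicly documented availability service, and the sample response is illustrative rather than real data. The helper names are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint: the Archive's public availability service.
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def build_query(page_url: str, timestamp: str = "") -> str:
    """Build the lookup address for a page, optionally near a given date."""
    params = {"url": page_url}
    if timestamp:
        params["timestamp"] = timestamp  # e.g. "20060101" (YYYYMMDD)
    return f"{AVAILABILITY_ENDPOINT}?{urlencode(params)}"


def closest_snapshot(response: dict):
    """Return (snapshot_url, timestamp) from a response, or None if absent."""
    closest = response.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest["url"], closest["timestamp"]
    return None


def lookup(page_url: str, timestamp: str = ""):
    """Perform the request over the network (not called in this sketch)."""
    with urlopen(build_query(page_url, timestamp), timeout=10) as resp:
        return closest_snapshot(json.load(resp))


# Illustrative response shape only; not a real capture record.
SAMPLE_RESPONSE = {
    "archived_snapshots": {
        "closest": {
            "available": True,
            "url": "http://web.archive.org/web/20060101000000/http://example.com/",
            "timestamp": "20060101000000",
            "status": "200",
        }
    }
}

print(build_query("example.com", "20060101"))
print(closest_snapshot(SAMPLE_RESPONSE))
```

A single request like this returns one capture. Collecting text at training-data scale would mean repeating such requests across many pages, which is the kind of bulk access the Archive says it has started to limit.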

Another advantage for AI developers is that content in the Internet Archive is already structured, attributed and dated.

Much of the Internet Archive’s data has already been found in key AI-training datasets. This is a major concern for news organisations, which are already suing AI companies such as Perplexity and OpenAI over alleged copyright violations.

“The issue is that Times content on the Internet Archive is being used by AI companies in violation of copyright law to directly compete with us,” Graham James, a spokesperson from The New York Times newspaper, said, as cited by The Next Web.

“The Times invests an enormous amount of resources in producing original journalism, and that work should not be used without our permission.”

Other organisations, such as The Guardian, have taken a more conservative approach by limiting, rather than completely blocking, the Archive’s access.

Internet Archive maintains that it is “collateral damage”

The Wayback Machine’s director, Mark Graham, has maintained that the Archive is merely “collateral damage” and that the real culprits are the AI companies that access past content through the Archive’s interfaces.

However, the Archive has taken measures of its own to limit this. These include preventing large downloads of some site materials and limiting automated extraction in certain cases.

Graham highlighted that the Archive functions as a key method of preservation. Without it, articles that are not archived can be edited without authorisation or accountability. Such edits can range from changing or removing quotes to amending mistakes or redirecting claims and official statements.

Currently, these changes are tracked by the Wayback Machine.
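As a rough illustration of how an edit becomes visible once two captures of the same page exist, the sketch below compares two versions of an article with Python's standard difflib module. It is not a description of how the Wayback Machine works internally, and both article texts are invented for the example.

```python
import difflib


def changed_lines(earlier: str, later: str) -> list[str]:
    """List lines removed ('- ') or added ('+ ') between two captures."""
    diff = difflib.ndiff(earlier.splitlines(), later.splitlines())
    return [line for line in diff if line.startswith(("- ", "+ "))]


# Invented article text standing in for two archived captures.
EARLIER = (
    'The minister said: "We will not raise taxes."\n'
    "The vote is due on Monday."
)
LATER = (
    "The minister said taxes were under review.\n"
    "The vote is due on Monday."
)

for line in changed_lines(EARLIER, LATER):
    print(line)
```

The quote removed from the later version shows up as a line starting with "- ", and its replacement as a line starting with "+ ". Without an earlier capture there is nothing to compare against, which is the accountability gap Graham describes.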

This has led some news organisations to work with the Internet Archive on acceptable compromises or workarounds that involve limiting access rather than imposing hard blocks.

Separately, the non-profit digital rights advocacy group Fight for the Future has launched a petition, already signed by 100 current journalists, to protest against this blocking. The petition comes at a time when public records and history are increasingly contested.
