FROM IA_ARCHIVER TO OPENAI: THE PASTS AND FUTURES OF AUTOMATED DATA SCRAPERS

Authors

  • Katherine Mackinnon, University of Copenhagen
  • Emily Maemura

DOI:

https://doi.org/10.5210/spir.v2024i0.13995

Keywords:

data scrapers, protocol, robots.txt, web archiving, critical data studies

Abstract

Data scraping practices have recently come under scrutiny, as datasets scraped from the web’s social spaces form the basis of new generative AI tools like Google’s Gemini, Microsoft’s Copilot, and OpenAI’s ChatGPT. These scraping and crawling practices rest on a conception of the internet as a mountain of data that is sitting, waiting, available to be acted upon, extracted, and put to use. In this paper, we examine the robots.txt exclusion protocol, which has been used to govern the behavior of crawlers and is often taken as a proxy for consent in widespread data scraping and web archiving. By addressing the underlying assumptions of the protocol, we aim to counter a recent narrative that “the basic social contract of the web is falling apart” (Pierce, 2024), and instead argue that data-extractive infrastructures have been at work throughout the past 30 years of the web. Positioning this work within the field of critical data studies, we aim to find new ways for web archives and modes of collection to become unbound from the “capitalist logics of data extraction” upon which they are currently built (Theilen et al., 2021).
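Because the argument turns on how robots.txt operates as a purely advisory consent mechanism, a minimal sketch may help illustrate the protocol. The example below uses Python's standard-library robotparser; the user agents ia_archiver (the crawler long associated with the Internet Archive, named in the title) and GPTBot (OpenAI's crawler) are real identifiers, but the rules and URL shown are hypothetical and the snippet is not drawn from the paper.

```python
# Minimal sketch of how a crawler consults robots.txt before fetching a page.
# The example rules below are illustrative, not taken from the paper or any real site.
from urllib import robotparser

# "ia_archiver" is the user agent historically tied to the Internet Archive's
# crawling pipeline; "GPTBot" is OpenAI's crawler. The directives are hypothetical.
EXAMPLE_ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow:

User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /private/
"""

parser = robotparser.RobotFileParser()
parser.parse(EXAMPLE_ROBOTS_TXT.splitlines())

# A well-behaved crawler calls can_fetch() before requesting a URL; nothing in
# the protocol enforces compliance -- it is a convention, not a technical barrier.
for agent in ("ia_archiver", "GPTBot"):
    allowed = parser.can_fetch(agent, "https://example.org/private/page.html")
    print(f"{agent}: fetch allowed = {allowed}")
```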

Published

2025-01-02

How to Cite

Mackinnon, . K., & Maemura, E. (2025). FROM IA_ARCHIVER TO OPENAI: THE PASTS AND FUTURES OF AUTOMATED DATA SCRAPERS. AoIR Selected Papers of Internet Research. https://doi.org/10.5210/spir.v2024i0.13995

Issue

Section

Papers