From the dawn of modern print journalism through the beginning of the web era, newspapers represented an archival medium. Once rendered into print, a news article was immutable and could safely be referenced in perpetuity without fear that a few days later it would say something very different. As libraries and other institutions collected and archived newspapers, their contents were also safely preserved for continued access by future generations. Multiple libraries held independent copies of each article, ensuring that even if some copies were lost or modified, others survived. In the web era, by contrast, journalism has largely been transformed into live blogging, with articles rewritten wholesale or simply deleted. As online journalism has rapidly risen to become a dominant distribution format over the past quarter century, what does its ephemeral nature mean for the archiving and preservation of our societal record?
In Fall 2014 my open data GDELT Project joined the Internet Archive’s “No More 404” program, providing the Archive with a live list, updated every 15 minutes, of the URLs of all online news articles GDELT monitors worldwide. A year and a half later the Archive was crawling and archiving a large fraction of those URLs each day, creating perhaps the largest initiative to archive the world’s online journalism across all countries and 65 languages. By the end of 2017 this collaboration had preserved more than 5.4 billion distinct URLs, totaling 221 terabytes of at-risk journalism, in perpetuity.
A small pilot experiment in Fall 2015 showed that roughly 1.5-2% of all online news articles monitored by GDELT returned a 404 error when fetched again two weeks later. Over just six months in 2015 an estimated 7-14 million news articles monitored by GDELT were lost forever, representing up to twice the total output of the New York Times over half a century.
What might it look like to more systematically assess the longevity of online news, recrawling every single monitored news article after 24 hours and after one week? That was the vision behind GDELT’s open Global Difference Graph, which launched at the end of August last year.
Over the last four months it has recrawled 88 million online news articles spanning all countries and 65 languages. Using Google’s BigQuery platform, summarizing this massive change dataset takes a single line of SQL and less than 6 seconds, quantifying at planetary scale the lifespan of an online news article today.
In total, 0.68% of articles were no longer accessible after 24 hours, rising to 1.5% after a week. These figures count HTTP 404, 410 and 451 response codes but exclude connection timeouts, since those can be transient. A further 2% of articles redirected to a different URL after 24 hours, rising to 2.59% within a week. Combined, deletions and redirects affect 2.71% of articles after 24 hours and 4.12% within a week. In all, 63% of URL-level changes occur within the first 24 hours of an article’s life (42% of deletions and 75% of redirections).
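To make the counting rules above concrete, here is a minimal Python sketch of how deletion and redirect rates could be tallied from a batch of recrawl results. It is an illustration only, not GDELT's actual pipeline or its BigQuery query: the `Recrawl` record layout, the function names and the choice to keep timed-out fetches in the denominator are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Iterable, Optional

# Response codes the article counts as "no longer accessible".
GONE_STATUSES = {404, 410, 451}


@dataclass
class Recrawl:
    """One refetch of a previously seen article (hypothetical record layout)."""
    original_url: str
    status: Optional[int]     # final HTTP status, or None on a connection timeout
    final_url: Optional[str]  # URL after following redirects, or None on a timeout


def classify(r: Recrawl) -> str:
    """Label a recrawl as 'timeout', 'deleted', 'redirected' or 'unchanged'."""
    if r.status is None:
        return "timeout"      # possibly transient, so never counted as a deletion
    if r.status in GONE_STATUSES:
        return "deleted"
    if r.final_url is not None and r.final_url != r.original_url:
        return "redirected"
    return "unchanged"


def change_rates(recrawls: Iterable[Recrawl]) -> dict:
    """Return deletion, redirect and combined rates as percentages.

    Assumption: timed-out fetches stay in the denominator; the article
    does not say how they are treated there.
    """
    counts = {"timeout": 0, "deleted": 0, "redirected": 0, "unchanged": 0}
    for r in recrawls:
        counts[classify(r)] += 1
    total = sum(counts.values())
    if total == 0:
        return {"deleted_pct": 0.0, "redirected_pct": 0.0, "combined_pct": 0.0}
    deleted = 100.0 * counts["deleted"] / total
    redirected = 100.0 * counts["redirected"] / total
    return {
        "deleted_pct": deleted,
        "redirected_pct": redirected,
        "combined_pct": deleted + redirected,
    }
```

Running `change_rates` once over the 24-hour recrawls and once over the one-week recrawls would yield the two sets of percentages discussed above. Because each article falls into exactly one category, the combined rate is simply the sum of the deletion and redirect rates.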
For those articles that did return valid…