Facebook announced its first large research dataset last year, consisting of “a petabyte of data with almost all public URLs Facebook users globally have clicked on, when, and by what types of people.” Despite its petabyte stature, the actual number of rows was estimated to be relatively small. Compared to news media, social media isn’t necessarily that much larger. Filtering out retweets, we find that Twitter is just 16 times larger than the Google Books NGrams source collection, while the Internet Archive’s public domain books collection is around 54 times smaller. Counting all trillion tweets sent 2006-present and assuming all of them were the maximum 140 characters, the Twitter archive would be just 47 times larger than global online news output 2014-present as monitored by GDELT. Using the more realistic average tweet length, Twitter would be just 25 times larger and removing retweets it would be just 16 times larger. Comparing the two over the same four-year period, we find that Twitter was around 15 times larger than news, but just 8 times larger if retweets are removed. Thus, if one had access to the complete Twitter firehose 2014-present, the total volume of text would likely be only around 8 times larger than the total volume of online news content over the same time period. The total Decahose output 2014-present is just 1.5 times larger than news. In short, Twitter is certainly a large dataset, but in terms of the actual textual tweet contents that most analyses focus on, we see that a trillion tweets don’t actually work out to that much text due to their tiny size. Just as importantly, we see that traditional data sources like news media are actually just as large as the social archives we work with, reminding us of the immense untapped data sources beyond the glittering novelty of social media.
Social media has become synonymous with “big data” thanks to its widespread availability and stature as a driver of the global conversation. Its massive size, high update speed and range of content modalities are frequently cited as a textbook example of just what constitutes “big data” in today’s data-drenched world. However, if we look a bit closer, is social media really that much larger than traditional data sources like journalism?
We hold up social media platforms today as the epitome of “big data.” However, the lack of external visibility into those platforms means that nearly all of our assessments are based on the hand-picked statistics those companies choose to report to the public, and on the myriad ways those figures, such as “active users,” are continually revised to present the rosiest possible image of the growth of social media as a whole.
Much of our reverence for social platforms comes from the belief that their servers hold an unimaginably large archive of global human behavior. But is that archive really that much larger than the mediums that preceded it, such as traditional journalism?
Facebook announced its first large research dataset last year, consisting of “a petabyte of data with almost all public URLs Facebook users globally have clicked on, when, and by what types of people.” Despite its petabyte stature, the actual number of rows was estimated to be relatively small. In all, the dataset was projected to contain just 30 billion rows when it was announced, growing at a rate of just 2 million unique URLs across 300 million posts per week, once completed.
To many researchers, 30 billion rows sounds like an extraordinary amount of data that they couldn’t possibly analyze in their lifetime. By modern standards, however, 30 billion records is a fairly tiny dataset and the petabyte as a benchmark of “big data” is long passé.
In fact, my own open data GDELT Project has compiled a database of more than 85 billion outlinks from worldwide news outlet homepages since March 2018, making it 2.8 times larger than Facebook’s dataset in just half the time.
Compared to news media, social media isn’t necessarily that much larger. It is merely that we have historically lacked the tools to treat news media as big data. In contrast, social media has aggressively marketed itself as “big data” from the start, with data formats and API mechanisms designed to maximize its accessibility to modern analytics.
In its 13 short years, Twitter has become the de facto face of the big data revolution when it comes to understanding global society. Its hundreds of billions of tweets give it “volume,” its hundreds of millions of tweets a day give it “velocity” and its mix of text, imagery and video offers “variety.”
Just how big is Twitter anyway?
The company itself no longer publishes regular reports of how many tweets are sent per day or how many have been sent since its founding, and it did not immediately respond to a request for comment on the total number of tweets sent in its history. However, extrapolating from previous studies, we can reasonably estimate that, if trends have held, slightly over one trillion tweets have been sent since the service’s founding 13 years ago.
At first glance a trillion tweets sounds like an incredibly large number, especially given that each of those trillion tweets consists of a JSON record with a number of fields.
However, tweets are extremely small, historically maxing out at just 140 characters of text. This means that while there are a lot of tweets, each of those tweets says very little.
In reality, few tweets come anywhere near Twitter’s historical 140-character limit. The average English tweet is around 34 characters while the average Japanese tweet is 15 characters, reflecting the varying information conveyed by a single character in each language.
Moreover, while raw Twitter data can be quite large (a month of the Decahose was 2.8TB in 2012), just 4% of a Twitter record is the tweet text itself. The remaining 96% is a combination of all of the metadata Twitter provides about each tweet and JSON’s highly inefficient storage format.
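To make that proportion concrete, here is a rough back-of-the-envelope sketch using only the two figures quoted above. It assumes decimal units (1TB = 10^12 bytes) and assumes the 4% text share applies evenly across the whole month; the function name is purely illustrative.

```python
# Figures quoted in the text. Decimal units (1 TB = 10**12 bytes) are an assumption.
DECAHOSE_MONTH_BYTES = 2.8e12  # one month of raw Decahose data in 2012
TEXT_SHARE = 0.04              # share of each record that is the tweet text itself


def text_bytes(total_bytes, text_share):
    """Estimate how many bytes of a raw feed are actual tweet text."""
    return total_bytes * text_share


text_only = text_bytes(DECAHOSE_MONTH_BYTES, TEXT_SHARE)
print(f"Roughly {text_only / 1e9:.0f} GB of tweet text per month")
```

Under those assumptions, a 2.8TB month of raw data shrinks to something on the order of a hundred gigabytes of actual text.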
Since most Twitter analyses focus on the text of each tweet, this means the actual volume of data that must be processed to conduct common social analytics is quite small.
Assuming that all one trillion tweets were the maximum 140 characters long, and counting one byte per character, that would yield just 140TB of text (the actual number would be slightly higher once UTF-8 encoding is accounted for).
In 2012 the average tweet length Twitter-wide was 74 bytes (bytes, unlike characters, account for the additional length of UTF-8 encoding of non-ASCII text), which would mean those trillion tweets would consume just 74TB of text: a large, but hardly unmanageable collection.
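The distinction between characters and bytes is easy to see in a few lines of Python. This is only an illustration: the two sample strings are made up for the example and are not drawn from any Twitter dataset.

```python
def chars_and_bytes(text):
    """Return (character count, UTF-8 byte count) for a string."""
    return len(text), len(text.encode("utf-8"))


# Hypothetical sample strings, chosen only to illustrate the encoding difference.
english = "good morning"  # plain ASCII: one byte per character
japanese = "おはよう"       # hiragana: three bytes per character in UTF-8

for label, sample in (("English", english), ("Japanese", japanese)):
    n_chars, n_bytes = chars_and_bytes(sample)
    print(f"{label}: {n_chars} characters, {n_bytes} bytes")
```

The Japanese greeting uses a third as many characters as the English one yet occupies the same number of bytes, which is why a byte-based average is the more honest measure of storage.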
If we extrapolate from the 2012-2014 Twitter trends to estimate that somewhere in the neighborhood of 35% of all trillion tweets have been retweets (assuming no major changes in retweet behavior), then using that 74-byte average length would yield just 48TB of unique text.
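The three estimates above can be reproduced with simple arithmetic. The sketch below uses only the numbers quoted in the text; it assumes decimal terabytes (10^12 bytes) and, for the upper bound, one byte per character. The helper name is illustrative.

```python
# Inputs quoted in the text. All of them are estimates, not measured totals.
TOTAL_TWEETS = 1_000_000_000_000  # roughly one trillion tweets
MAX_CHARS = 140                   # historical limit, treated here as 1 byte per character
AVG_BYTES_2012 = 74               # average tweet length in bytes, 2012
RETWEET_SHARE = 0.35              # estimated share of tweets that are retweets

TB = 10**12  # decimal terabyte (an assumption)


def text_volume_tb(tweets, bytes_per_tweet, retweet_share=0.0):
    """Estimate total tweet text in TB, optionally excluding retweets."""
    return tweets * bytes_per_tweet * (1 - retweet_share) / TB


upper_bound = text_volume_tb(TOTAL_TWEETS, MAX_CHARS)
realistic = text_volume_tb(TOTAL_TWEETS, AVG_BYTES_2012)
unique_text = text_volume_tb(TOTAL_TWEETS, AVG_BYTES_2012, RETWEET_SHARE)

print(f"All tweets at 140 characters: {upper_bound:.0f} TB")
print(f"At the 74-byte average:       {realistic:.0f} TB")
print(f"Excluding retweets:           {unique_text:.0f} TB")
```

Changing any one input, such as the retweet share, shows how sensitive the final figure is to these assumptions.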