The field of processing large quantities of data in commercial settings has grown enormously over the last 20 years, and there are now many aspects to this type of work. Traditional “big-data software” (by this I mean things like Hadoop, Spark, HDFS, distributed object stores, etc.) and its architecture play a significant, but by no means the only, part in it, so it is worth thinking back to what these architectures aim(ed) to solve.

Dean & Ghemawat’s MapReduce

A great starting point is Dean & Ghemawat’s “MapReduce: Simplified Data Processing on Large Clusters”, published in 2004. Here is what strikes me as important about the problems they were trying to solve (a minimal sketch of the programming model itself follows the list):

  1. Data were local to a large number (1800) of computers

  2. The specific processing of the data is not the primary reason for the arrangement of the computers and data (e.g., they ran the tests in their paper on a weekend afternoon when the cluster was not busy, indicating that it was busy with other jobs at other times)

  3. “Network bandwidth is a scarce resource”

  4. Failures and stragglers (machines or tasks which work, but slowly) are a big factor in the architecture

  5. A desire to replace existing software which “works” but is difficult to maintain
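
For concreteness, here is a minimal single-process sketch of the programming model the paper introduced, using word counting (the paper’s own example); the real system distributes the map and reduce calls across a cluster and deals with partitioning, failures and stragglers.

    from collections import defaultdict

    # map: emit (key, value) pairs for one input record
    def map_fn(document):
        for word in document.split():
            yield word, 1

    # reduce: combine all the values emitted for one key
    def reduce_fn(word, counts):
        return sum(counts)

    def run_mapreduce(documents):
        # "shuffle": group intermediate values by key
        groups = defaultdict(list)
        for doc in documents:
            for key, value in map_fn(doc):
                groups[key].append(value)
        # apply the reduce function to each group
        return {key: reduce_fn(key, values) for key, values in groups.items()}

    print(run_mapreduce(["the quick brown fox", "the lazy dog"]))
    # {'the': 2, 'quick': 1, 'brown': 1, 'fox': 1, 'lazy': 1, 'dog': 1}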

There is one interesting ratio which allowed them to make choices that may look surprising today:

  1. The ratio of single-thread performance (a 2 GHz Xeon processor) vs disk bandwidth (a good disk is given as 30 MB/s, while a fragmented one could be as low as 1 MB/s!)
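
A quick back-of-envelope calculation (using only the figures quoted above) shows how much compute headroom a single core had per byte read from disk:

    cpu_hz = 2e9             # ~2 GHz Xeon core, as quoted in the paper
    disk_good = 30e6         # ~30 MB/s from a good disk
    disk_fragmented = 1e6    # ~1 MB/s when badly fragmented

    print(cpu_hz / disk_good)        # ~67 CPU cycles available per byte read
    print(cpu_hz / disk_fragmented)  # ~2000 cycles per byte on a fragmented disk

With tens to thousands of cycles available per byte, it made sense to spend CPU freely (on scheduling, replication, re-execution) and to move the computation to where the data sit.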

What has changed since

Big is a relative term

In their paper Dean & Ghemawat give an example of processing a 1 TB data set and achieving an aggregate throughput (on 1800 machines) of 30 GB/s.

At the time of writing (not far from 20 years later) such data can be trivially stored in the RAM of a single server (e.g., an AWS x1.16xlarge has 1 TB of RAM and costs between $2 and $5 per hour). However, if the data need to be stored durably, a single PCIe slot filled with multiple NVMe drives can deliver the same total I/O throughput (30 GB/s) and a capacity greater than 20 TB, in a simple desktop computer.
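
A rough comparison of what a 1 TB scan looks like then and now, using only the numbers above:

    tb = 1e12

    # 2004: 1800 machines with ~30 GB/s aggregate throughput
    cluster_aggregate = 30e9
    print(tb / cluster_aggregate)          # ~33 s for the whole cluster
    print(cluster_aggregate / 1800 / 1e6)  # ~17 MB/s per machine, i.e. roughly one disk each

    # today: a single box
    nvme = 30e9  # one PCIe slot of NVMe drives
    ram = 85e9   # memory bandwidth of the x1.16xlarge's processor (see next section)
    print(tb / nvme)  # ~33 s from local NVMe on one machine
    print(tb / ram)   # ~12 s if the data are already in RAM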

Ratio of single-thread performance vs I/O bandwidth

The AWS x1.16xlarge mentioned earlier runs an Intel Xeon E7-8880 v3, which has a base CPU frequency of 2.1 GHz. This is practically the same as in Dean & Ghemawat! On the other hand, if we are storing data in RAM, this processor has 85 GB/s of memory bandwidth, almost 3000 times higher than their per-disk I/O bandwidth. So in this extreme example the ratio of single-thread performance to data flow rate has decreased by a factor of about 3000, which drastically changes the architectural drivers.

This change in ratio is less extreme in other examples: it is possible to get processors with frequencies up to 5 GHz, and per-node storage bandwidth is somewhat lower than RAM bandwidth, e.g., 30 GB/s for local NVMe drives or perhaps 20 GB/s for very high-efficiency network-attached storage. That still leaves the change in the ratio of data I/O to single-thread performance at a factor of 250 or more.
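
Expressed as bytes of I/O available per CPU cycle, the same back-of-envelope arithmetic gives roughly those factors:

    def bytes_per_cycle(io_bytes_per_s, cpu_hz):
        return io_bytes_per_s / cpu_hz

    then = bytes_per_cycle(30e6, 2e9)       # 2004: one good disk per ~2 GHz core
    ram_now = bytes_per_cycle(85e9, 2.1e9)  # RAM bandwidth on the x1.16xlarge
    nvme_now = bytes_per_cycle(20e9, 5e9)   # fast storage with a 5 GHz core

    print(ram_now / then)   # ~2700x more data per cycle (the extreme in-RAM case)
    print(nvme_now / then)  # ~270x (the more conservative storage case)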

Specialisation

It is now quite easy to access highly specialised systems through cloud providers. In fact, Google is now itself a cloud provider, and even designs its own specialised microprocessor architectures.

If the data can be moved, then it is often possible to specialise the hardware system rather than the software. For example, it is possible to use a system with 24 TB of RAM, or a POSIX-like filesystem with 1 TB/s of throughput (AWS Lustre).

This means that some problems which were best solved by big-data-architected software could now be better solved by suitably architected hardware.

Processing sensor information rather than human-generated information

In around 2004, much of commercial big data was about processing human-generated information: web pages people write, links they click on, songs they choose to listen to, etc. Humans interacting with a computer do not generate much information! The information input rate is maybe 1 byte/second (the data rate is higher).

But over time the focus has shifted dramatically to processing sensor information: voice, images (the outputs of CCDs), videos, radar/lidar scans, etc. These have quite different characteristics, and vastly larger volumes, than human-generated information.
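
As a rough illustration of the gap (the exact figures here are assumptions; only the orders of magnitude matter):

    # a fast typist: ~80 words/minute, ~5 characters/word, ~1 byte/character
    typing_bytes_per_s = 80 * 5 / 60
    print(typing_bytes_per_s)  # ~7 bytes/s of raw keystrokes (far less actual information)

    # one uncompressed 1080p colour camera at 30 frames/s, 3 bytes/pixel
    camera_bytes_per_s = 1920 * 1080 * 3 * 30
    print(camera_bytes_per_s / 1e6)  # ~187 MB/s from a single sensor

    print(camera_bytes_per_s / typing_bytes_per_s)  # tens of millions of times more data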

Budgets

In 2003, the year before the MapReduce paper, Google spent 93 million USD (not a typo!) on R&D and had 188 million USD of property and equipment. In 2021, Alphabet Inc. spent 31 billion USD on R&D and had 98 billion USD of property and equipment.

Topics for another day

What is big data today? How has big data software architecture evolved to account for the above changes?