The Rule of Data
In 1979, American historian Elizabeth Lewisohn Eisenstein (1923–2016) published a book titled The Printing Press as an Agent of Change. In it, she proposes that the invention of the printing press by German-born Johannes Gutenberg (1400–1468) in the fifteenth century created the necessary conditions for the Renaissance and the Scientific Revolution, which, together with the First Industrial Revolution, could be considered the origins of modern society. According to Eisenstein, approximately eight million books were printed between 1453 and 1503, more than all the written material produced in the nearly 5,000 years of civilization up until then. Today, data is being created at a far greater speed: it is estimated that in the early 2020s each one of us produced almost two megabytes of data per second. And according to a World Economic Forum post using data from multiple sources, by 2025 we will see the creation of 463 exabytes of data every single day.
One of the main challenges of big data techniques is how to turn massive amounts of data (stored in many different formats) into useful information. This is generally done by analyzing the correlations among thousands of variables, without knowing beforehand which ones will turn out to be relevant. The algorithms then derive their recommendations from these correlations. However, it is important to note that correlation does not imply causation: simply because two variables move in a similar fashion does not necessarily mean that one “explains” the other. In 2015, Tyler Vigen published Spurious Correlations, in which he presents several examples of variables that are highly correlated but clearly have no cause-and-effect relationship. Consider the divorce rate in the state of Maine and the per-capita consumption of margarine between 2000 and 2009: the two series show a correlation of 99.26%, yet neither one plausibly causes the other.
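To make the idea of "correlation" concrete, here is a minimal sketch of how a Pearson correlation coefficient can be computed. The function name pearson and the two series are assumptions made for illustration only: the numbers are invented and are not Vigen's data. The point is simply that two unrelated quantities that both happen to decline over the same period can score very close to 1.

```python
from math import sqrt


def pearson(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length (at least 2 values)")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the mean (unnormalized covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (unnormalized variances).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


# Invented, illustrative series: two quantities that both trend downward
# over six periods, with no assumed causal link between them.
series_a = [10, 9, 8, 7, 6, 5]
series_b = [20.5, 18.1, 16.4, 13.9, 12.2, 10.1]

r = pearson(series_a, series_b)
print(f"correlation: {r:.4f}")
```

For these invented series the coefficient comes out above 0.99, even though nothing in the calculation says anything about cause and effect. The formula only measures how closely the two series move together.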
The smart use of data can also help maintain the infrastructure of a large city. In their 2013 book Big Data: The Essential Guide to Work, Life and Learning in the Age of Insight, Viktor Mayer-Schönberger and Kenneth Cukier describe how New York’s gas and power company, Consolidated Edison (Con Ed), used technology to reduce the risk of manhole explosions.
The company’s objective was to predict which manholes were going to show problems so that corrective measures could be taken. Manhattan has more than 50,000 manholes and more than 150,000 km (93,000 mi) of cabling, so determining exactly which ones should be prioritized for inspection is a complex task. Researchers from Columbia University, led by Cynthia Rudin (now professor of computer science, electrical engineering, and statistics at Duke University in North Carolina), tabulated data that maintenance teams had been collecting since the end of the nineteenth century and correlated it with incidents. Their model used more than 100 variables to make its predictions, and testing indicated that it correctly predicted over 40% of the manholes that would go on to present problems.
Another public institution drawing on the information that users constantly make available is the United States Department of Education, which in 2012 published a report entitled Enhancing Teaching and Learning Through Educational Data Mining and Learning Analytics. With the increasing popularity of online courses, it is possible to monitor students’ behavior and performance, assist in their development, and provide input that helps course providers adjust their content.
The benefits of applying Big Data, however, extend well beyond services traditionally run by governments, such as security, infrastructure, and education. Next time, we will discuss how various business sectors use this technology. See you then.