As with many emerging technologies, the term “Big Data” is being tossed around with little real understanding of what it is, what it means, to whom it matters, and why.
One of the things I marvel at, over and over, is the sheer genius of what modern-day, edge-of-the-envelope marketing schemes can accomplish. For example, the term Internet of Things (also referred to as the Cloud of Things, the Internet of Everything, or a term I hope catches on, the Internet of Anything, or IoX) has become a sexy, interesting, exciting camouflage layer over the dull, boring, and totally unsexy M2M industry. The same is about to happen with analytics. It is getting a new suit, a shave, and a haircut, and being called “Big Data.” Today, big data is one of the most talked-about topics around the world.
There is a lot of noise being made about the IoX and big data, from retail and medicine to defense, homeland security, travel, and logistics. And that only scratches the surface. The IoX is of interest to so many vendors because of its potential as a market to sell into. It is not just one vertical market, it is many, and big data represents a tremendous opportunity for any number of industries.
Once the IoX really exists, the sheer volume of data-collection devices will make the amount of data in the virtual universe astronomical: 40 zettabytes, conservatively, by the year 2020. The number of sensors acquiring this data is just as astronomical. No one wants to venture a guess as to the exact number of sensors, but the figures being tossed around for IoX devices run between 50 and 200 billion, and most devices are brimming with sensors. Smartphones alone integrate a half-dozen or more: accelerometers, compasses, GPS receivers, light and sound sensors, altimeters, and more. If one wants a prototypical IoX device, this is it.
Smartphones are envisioned as intelligent listening stations that can monitor our health, where we are and how fast we are travelling, our touch, the velocity of our car, the magnitude of earthquakes, and countless other things that weren’t even on the radar screen a few years ago. And smartphones are only one of a myriad of intelligent IoX devices.
Extrapolating from that, if there are only five sensors per intelligent device, and if the 200 billion figure is anywhere near reality, the count works out to a trillion sensors, and eventually more. With all those sensors collecting all that data, one can understand why there needs to be a revolution in analytics.
Big Data versus Traditional Analytics
What makes big data a bit different from traditional analytics is how the data are looked at and what the expected results are, and that is a credible distinction. With the amount of data being generated, traditional analytics doesn’t have the right tools, nor can it process the data efficiently, even with next-generation supercomputers like Titan and Tianhe-2. The massive amount of data that needs to be analyzed will choke present analytical methodologies, mainly because the analysis needs to be real-time and transparent.
Under the big data umbrella, virtually every app needs to be an analytical app. Anyone doing any kind of data analysis has to figure out how best to filter the huge amounts of data coming from the IoX, social media, and wearable devices, and then deliver exactly the right information to the right person at the right time.
For statisticians, big data challenges some basic paradigms. One example is the “large p, small n” problem (here “p” is the number of variables, not a p-value). Traditional statistical analysis generally works with a small number of variables measured on a large number of data points; that is, the number of variables, p, is small and the number of data points, n, is large. A typical example might be in sales, where a refrigerator comes with, say, a handful of options: color, ice maker, door accoutrements, drawers, size, number of doors, and so on. That is still a decent number of variables, but compared with the number of consumers being polled about what they want, p is small relative to the sample size n.
Big data often looks at the problem from the opposite direction; the situation is reversed. Take a cancer research study, for example. Using genomics, a researcher might collect data on 100 patients with cancer to determine which genes confer a risk for that cancer. The challenge is that there are roughly 20,000 genes in the human genome and even more gene variants. Genome-wide association studies typically look at a half million single nucleotide polymorphisms (SNPs), the locations on the genome where variation can occur. The number of variables (p = 500,000) is vastly greater than the sample size (n = 100).
This reversal is the paradigm shift. When p is larger than n, the number of parameters is huge relative to the information about them in the data, and a plethora of irrelevant parameters will show up as statistically significant. In classical statistical analysis, if the data contain something that has only a one-in-a-million chance of occurring by accident, there is high confidence that it is statistically relevant. But if you analyze the data from half a million places, as big data does, that one-in-a-million coincidence will show up far more often. The trick is to separate real relevance from chance randomness.
This is what statisticians call the “look everywhere” effect, and it is one of the issues that plagues big data, because data-driven analysis yields so many more, and wider-ranging, results than the traditional hypothesis-driven approach.
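To see the effect concretely, here is a minimal sketch in Python (assuming NumPy and SciPy are available), scaled down from the genomics example: 100 patients, 20,000 purely random “SNPs,” and no real association anywhere in the data. The counts, variable names, and the 0.05 threshold are illustrative choices, not taken from any actual study.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_patients = 100     # sample size (n)
n_snps = 20_000      # number of variables (p), scaled down from 500,000

# Purely random genotypes and a random case/control label:
# by construction there is no real association anywhere in this data.
genotypes = rng.integers(0, 3, size=(n_patients, n_snps))  # 0/1/2 allele counts
disease = rng.integers(0, 2, size=n_patients)              # 0 = control, 1 = case

p_values = np.empty(n_snps)
for j in range(n_snps):
    cases = genotypes[disease == 1, j]
    controls = genotypes[disease == 0, j]
    p_values[j] = stats.ttest_ind(cases, controls, equal_var=False).pvalue

print("SNPs 'significant' at p < 0.05:", int(np.sum(p_values < 0.05)))
```

Even though every association here is pure noise, on the order of a thousand variables clear the classical 0.05 threshold, which is the look-everywhere effect in miniature.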
A number of solutions have been developed to tame this effect. In reality, most data sets, no matter how massive, contain only a few strong relationships; the rest is just noise. So by filtering out the significant parameters, the rest can be considered irrelevant. If the one-in-a-million data points fall outside the significant set, they are simply chance and can be discarded.
How to do it is fairly simple, and a standard mathematical approach to a variety of analytics: set some of the parameters to zero. This works well but requires many passes over the data. By varying which parameters are set to zero and rerunning the analysis, the “thimbleful” of meaningful parameters is eventually uncovered.
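As a rough illustration of that zero-out-and-refit idea, the sketch below (plain NumPy, synthetic data, and an arbitrary cutoff of ten variables) scores every variable against the outcome, keeps only the strongest handful, and sets the remaining coefficients to zero before refitting.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 2000                    # far more variables than samples
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:5] = 4.0                      # only five variables truly matter
y = X @ beta + rng.standard_normal(n)

# Score each variable by its marginal covariance with the outcome,
# keep the k strongest, and force every other coefficient to zero.
scores = np.abs(X.T @ (y - y.mean())) / n
k = 10
keep = np.argsort(scores)[-k:]

beta_hat = np.zeros(p)
coef, *_ = np.linalg.lstsq(X[:, keep], y, rcond=None)
beta_hat[keep] = coef

print("variables kept:", sorted(keep.tolist()))
```

In practice the cutoff itself would be varied and the analysis rerun many times, which is exactly where the heavy iteration and the computational cost come from.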
The problem is that this is computationally intensive and would take a tremendous amount of time with classical statistical hardware and software. Fortunately, technology has come to the rescue. Because of advances in both hardware and software, the approach is now feasible. The methodologies have been around for a while, but only lately have the applications and the hardware advanced to where they can be applied at this scale.
One of these advances is L1-minimization, or the LASSO, invented by Robert Tibshirani. One place it works well is image processing, where it enables the extraction of a sharply focused image from a lot of blurry or noisy data. There are others, such as the false discovery rate (FDR) procedure proposed by Yoav Benjamini and Yosi Hochberg, which accepts up front that a certain percentage of the reported discoveries will be false and controls that proportion, so that subsequent analysis can concentrate on the findings most likely to be real.
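For a feel for how the LASSO behaves, here is a small sketch using scikit-learn’s Lasso estimator on synthetic data; the penalty value alpha and the data sizes are arbitrary choices for illustration, not a recipe.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
n, p = 100, 2000
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:5] = 3.0                       # the handful of genuinely relevant variables
y = X @ beta + rng.standard_normal(n)

# The L1 penalty drives most coefficients to exactly zero, so a sparse
# model falls out of the fit even though p is far larger than n.
model = Lasso(alpha=0.5, max_iter=10_000)
model.fit(X, y)

print("non-zero coefficients:", np.flatnonzero(model.coef_))
```

The FDR procedure plays a complementary role for p-value-based analyses: it ranks the p-values and admits only as many discoveries as the chosen false-discovery budget allows.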
The Third Dimension
Most statistical analysis, up to now, has worked in two dimensions, n and p. Big data adds a third: time (t). Big data analysis within the IoX will happen in real time, and that adds orders of complexity. Data will have to be analyzed on the fly and decisions made on the fly. And these data will be of whole new types: images, sounds, signals, time-relative measurements, and measurements over effectively unbounded spaces. Such data is not only enormous but complex, and may require analysis in geometric or topological terms, or in three dimensions.
One of the more interesting applications of this new dimension is web analytics. The pressure on web companies to deliver meaningful results to clients so they can sell their services is a relentless driver. Such companies benefit greatly from accurately predicting user reactions in order to produce specific user behaviors (i.e., clicking on a client-sponsored advertisement).
This is a perfect big data analysis case. The n will be huge (a million clicks, for example). The p will likely be large as well (thousands of variables or more: which ad, where, how often, and so on). Since n is much larger than p, classical analysis could in theory be used, except for the time factor. In many cases, the algorithms may have only milliseconds to respond to a click, with another click right behind the first one, and so on. These algorithms therefore have to adapt constantly to the input variables coming from the user (rotating ads, for example).
An elegant solution to this challenge, used across the web, is massive parallel processing across banks of computers, or the cloud, as we are beginning to call it. The interesting aspect is that this approach combines the holy grail of computing, speed, with the holy grail of statistics, sound analysis. In the end, such a solution actually works fairly well. Rather than delivering the exactly correct answer every time but taking too long, it delivers the right answer most of the time, quickly.
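The streaming half of that trade-off can be sketched with a simple online click-through model: each impression is scored, served, and then used to nudge the model before the next impression arrives. The feature layout, learning rate, and simulated click stream below are all invented for illustration; in production the same per-event updates would be sharded across the banks of machines described above.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features = 1000                 # hashed features: ad id, placement, time of day, ...
weights = np.zeros(n_features)    # the model that must adapt click by click
lr = 0.05                         # step size for each single-event update

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hidden "true" behavior, used only to simulate a stream of impressions.
true_w = rng.standard_normal(n_features) * 0.3

for _ in range(50_000):
    x = (rng.random(n_features) < 0.01).astype(float)  # sparse features of one impression
    p_hat = sigmoid(x @ weights)                       # score it in microseconds
    clicked = rng.random() < sigmoid(x @ true_w)       # observe the user's reaction
    weights += lr * (clicked - p_hat) * x              # one gradient step, then move on
```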
Privacy: the Sore Thumb
Readers of this site are well aware of the looming security issues of the IoX. The depth and breadth of recent breaches remind us only too well of how vulnerable our data is. There is a wide range of approaches to protecting big data, and traditional means of data security don’t always work efficiently, so various new approaches are being developed.
Oil pipeline monitoring is one of the more widely deployed applications. Along the pipeline, every so many feet (the spacing varies with the pipeline), there is a flow sensor that measures a number of parameters about the oil flow: pressure, density, flow rate, and so on. The sheer volume of data from all of these sensors is staggering, and another metric, security, also becomes part of the data. Because of the magnitude of this “big data,” securing the data itself becomes tricky. Security means overhead, and with such voluminous data to begin with, that overhead can bog down traditional M2M data collection methodologies.
Protecting this voluminous amount of big data, and in real time, will take some novel solutions. In the traditional sense, anonymizing n and p doesn’t scale well as these variables increase, so the solutions discussed above become more applicable. And network-like data pose a special challenge for privacy, because so much of the information has to do with relationships between individuals, which change on the fly and are dynamic in content. Overall, there are many challenges with security but not many solutions at present.
However, there are some bright spots on the horizon. One developing technology is “differential privacy,” a methodology that adds carefully calibrated statistical noise to released results so that no individual record can be singled out, with a tunable privacy budget that lets the user purchase just as much protection for their data as they need, at a corresponding cost in accuracy. But by and large, trying to secure big data is still in its infancy.
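The core building block of differential privacy, the Laplace mechanism, is easy to sketch: add noise scaled to a privacy parameter epsilon to any aggregate result before releasing it. The query and the epsilon values below are hypothetical; the point is that epsilon is the dial by which one buys more or less protection.

```python
import numpy as np

rng = np.random.default_rng()

def private_count(true_count, epsilon):
    """Release a count under the Laplace mechanism.

    For a counting query (sensitivity 1), adding Laplace noise with
    scale 1/epsilon gives epsilon-differential privacy: a smaller
    epsilon means stronger privacy and a noisier answer.
    """
    return true_count + rng.laplace(loc=0.0, scale=1.0 / epsilon)

# Hypothetical query: how many pipeline sensors reported out-of-range pressure today?
true_count = 12_345
print(private_count(true_count, epsilon=0.1))   # strong privacy, noisy result
print(private_count(true_count, epsilon=2.0))   # weaker privacy, close to the truth
```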
Summary
There is little doubt that big data will be the backbone of information in the IoX. Big data is new to a lot of sectors; not the data, nor the collection, but the analysis. New types of data are also coming on the scene, and new methodologies will be required to analyze them.
One of the largest challenges will be mining meaningful statistics in real time and from multiple vectors simultaneously. Doing so will require a merging of scientific, analytical, computational, and mathematical practices. New approaches will be required, as will different perspectives on what is being analyzed.
Statistical analysis is a powerful tool that can, with some degree of certainty, glimpse into the future. With big data, the IoX, and next-generation statistics, we will be better able to understand and direct the effects of logistics, medicine, weather, infrastructure, economics, environments, finances…the list goes on and on. Statistics and analytics will have the power to save and improve lives, increase reliability, lower costs, and improve an almost unlimited set of things and processes, which will be critical in the world of the IoX.