Sometimes we come across customers who talk about their Big Data and what they do with it or expect to get out of it, their “Big Data this” or their “Big Data that”, but, except in a handful of cases, their data is not that big.
Granted, they do handle hundreds of thousands or even millions of records from transactions with tens of thousands of customers, drawn from a portfolio of several hundred thousand articles and collected over several years (we’ll cover data expiration in another post). There is no doubt that advanced data science techniques can mine great value out of these data, but it is not Big Data.
Big Data is characterized by its Vs, which in most texts number three (some mention up to eight!): high-volume data, generated at high velocity, and of great variety (unstructured data, free text from social networks, video streams, numeric data from IoT sensors…).
On this planet we generate roughly 2.5 exabytes of data each day (in bytes, that is 25 followed by 17 zeros), but not all of it is valuable data, since it includes every video uploaded to YouTube, every tweet and retweet, every picture shot by the thousands of cell phones at an event, every reading captured by an IoT device or by a sensor in a car, and every coordinate sent by the GPS in an Uber to its central servers, or by our cell phones to Google (without us knowing, by the way).
An SME with €60 million in yearly sales and an average line value of €20, which stores what was sold, to whom, where, and the date and time of the operation, generates less than 3 GB of data per year, the equivalent of roughly one hour of HD video or 8 minutes of 4K video. So there is no real volume here, even if we double what we store with each transaction.
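The arithmetic behind that figure can be sketched in a few lines. The sales and line-value numbers come from the example above; the size of each stored line is not given in the text, so the 1,000 bytes used here is an assumption chosen as a generous upper bound, and the function names are purely illustrative.

```python
# Back-of-the-envelope estimate of the yearly data volume of the SME above.
YEARLY_SALES_EUR = 60_000_000    # from the text
AVG_LINE_VALUE_EUR = 20          # from the text
BYTES_PER_LINE = 1_000           # ASSUMPTION: generous upper bound per stored line
DAILY_WORLD_BYTES = 25 * 10**17  # 2.5 exabytes, the daily figure quoted earlier


def yearly_lines(sales_eur: int, avg_line_eur: int) -> int:
    """Number of transaction lines needed to reach the yearly sales figure."""
    return sales_eur // avg_line_eur


def yearly_bytes(lines: int, bytes_per_line: int) -> int:
    """Raw storage needed for one year of transaction lines."""
    return lines * bytes_per_line


lines = yearly_lines(YEARLY_SALES_EUR, AVG_LINE_VALUE_EUR)
size_bytes = yearly_bytes(lines, BYTES_PER_LINE)
size_gb = size_bytes / 1e9

print(f"{lines:,} lines per year")      # 3,000,000 lines per year
print(f"{size_gb:.1f} GB per year")     # 3.0 GB per year
print(f"The world generates {DAILY_WORLD_BYTES / size_bytes:,.0f} times that every day")
```

With any realistic record size below the assumed 1,000 bytes, the yearly total stays under 3 GB, which is the point of the example.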
Those few GB of data are generated over a whole year. Compare that with the volume of data that must be processed, not by a fully autonomous car, but just by the lane-keeping assist of any modern car, and it is clear that we cannot talk of high velocity either.
And the data handled by most companies may be untidy, with missing values and gross errors, but it is still structured data after all, so there is no variety either.
Even a little business activity generates a volume of data big enough to require massive data processing with data science techniques to extract all its value (an Excel worksheet tops out at just over a million rows), but unless we are dealing with volume, velocity, and variety, it is not appropriate to speak of Big Data.
Mind you, there’s nothing wrong with that; it is just that technical people will appreciate it if you speak precisely, and if you come across a ruthless consultant, you will not be asking to be sold something you don’t need on the grounds that it is what you asked for.
Image by CharlesAPhillips63, CC BY-SA 4.0, via Wikimedia Commons