When is big data really big data?

When does big data make sense?

Big data technologies are not automatically the right solution just because a lot of data is involved. Several criteria determine the appropriate type of data processing.

by Peter Welker

From a technical point of view, the buzzword “Big Data” refers to data that requires special techniques for processing and storage because of its sheer quantity (volume), the required throughput and maximum acceptable delay (velocity), and its complexity and variability (variety). (1) Traditional technology such as relational databases eventually reaches architectural limits here. The same applies to programming paradigms for processing the data that are not consistently designed for parallelization.

At the beginning of the big data era around 2010, you could sense it at many IT conferences and in numerous publications: the new Hadoop and NoSQL solutions were primarily intended to revolutionize, and ultimately replace, traditional analytical applications such as the data warehouse.
Investments in classic analytical applications have by no means suffered as a result. (2) Relational databases and classic data integration tools are not as limited as the advocates of the new technologies like to proclaim. Loading, transforming and preparing several hundred million data records per day in optimized form is no problem today, even on inexpensive hardware. Nevertheless: even if the largest relational DWH databases reach the double-digit petabyte range, (3) in most cases the limit lies in the double- to triple-digit terabyte range, often because of license and support costs for the required commercial software rather than because the limit of technical performance has been reached.
Rather, it is new requirements that are implemented using new techniques. To clarify the decision criteria for or against big data technology, we consider three use cases.

Process data warehouse

Data warehouses by no means only prepare financial data. A transmission system operator in the energy sector loads 60 million measurement and application values into a relational DWH every day and keeps them there for ten years, for example to analyze defective components or to check the reliability of consumption forecasts. For this purpose, the data must be available in a form optimized for time series and other analyses no later than 20 minutes after it is generated (latency). The most important evaluations based on this data each take less than five seconds and can therefore be run interactively.
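A quick back-of-envelope calculation makes these figures tangible. The daily record count and retention period are taken from the example above; the per-record size is purely an assumption for illustration:

```python
# Rough throughput and volume estimate for the process data warehouse example.
# The daily record count and retention come from the text; the per-record size
# is an assumption made only for this sketch.
records_per_day = 60_000_000
seconds_per_day = 24 * 60 * 60
retention_years = 10

avg_rate = records_per_day / seconds_per_day
print(f"Average ingest rate: {avg_rate:,.0f} records/s")          # ~694 records/s

bytes_per_record = 20  # assumed compressed size of one measurement row
total_tb = records_per_day * 365 * retention_years * bytes_per_record / 1e12
print(f"Approx. volume after {retention_years} years: {total_tb:.1f} TB")  # ~4.4 TB
```

At an average of roughly 700 records per second and single-digit terabytes of raw volume, the ingest itself is clearly manageable for a relational database, which is consistent with the assessment that follows.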

At first glance, several criteria speak in favor of big data technology. Surprisingly, the amount of data is not one of them: despite the large number of records, the volume remains in the lower terabyte range. Wide-column NoSQL databases are, however, better suited to time series analysis than their relational counterparts. Nevertheless, given the established know-how of its staff and the stability and proven availability of relational tools, the operator opted for the classic solution with high investment protection and comparatively low costs. And it works. If required, and with more powerful hardware, the application also scales by a factor of ten.

Measurement data landscape

It would be a mistake to base a technology decision on purely technical criteria. In a second, very similar case, an automotive supplier fills dozens of spatially separated but similar relational databases with measurement data in order to analyze its production processes. The total volume is now well into the triple-digit terabyte range. All information is prepared within an hour, again primarily for time series analysis.

In the course of merging these databases, and with increasingly operational monitoring, near-real-time analyses must now be made possible. The latency therefore has to be reduced to a maximum of five minutes. New questions arise: how do you replicate data from different locations into a central system as quickly as possible when hundreds of transformation processes are required and just as many users are analyzing the data at the same time?

When should I use big data technologies?

On the one hand, important indicators for the use of big data technologies can be derived from the three “Vs”: volume, velocity and variety. If you answer one or more of the following questions with yes for a project, a more detailed technology analysis is warranted at the very least:

Processing latency
How soon after data is created must patterns be recognized and actions derived from them? Are the requirements in the range of minutes or seconds, or even less?
Data volume
How large is the amount of data that has to be kept in total? Does it reach well into the terabyte range or beyond?
Scalability
Does the processing have to be “elastic”? That is: are strong load fluctuations expected, and should the chosen solution scale up by a factor of 10, 100 or 1,000? Can the use of cloud services help?
Flexibility
How diverse is the available data? Do I already know what I want to do with it later?

On the other hand, non-technical criteria naturally also play an important role: the know-how of the employees, for example, or the willingness of a department or company to try new approaches on real problems. Good expectation management and a sufficient implementation budget can also be decisive: initial questions do not always yield revolutionary answers, and big data applications are not simpler or shorter (and therefore cheaper) than conventional projects.

With the required latency, relational databases and classic data integration alone eventually become a bottleneck. Approaches from the Internet of Things, a natural domain for big data applications, are better suited: at each location, new (sensor) data is prepared within seconds and held for a while in the main memory of local computers, similar to a gateway, for initial analyses. At the same time, the data streams flow into an event hub in the cloud. All further transformation processes consume data from this hub and store their results in a large NoSQL database cluster, which is also cloud-based.
Numerous big data techniques are used here: stream analytics and transformation solutions such as Spark, event hubs such as Kafka, and NoSQL databases such as Cassandra (a small sketch of such a pipeline follows after the list below). They were originally launched by Facebook, LinkedIn and UC Berkeley’s AMPLab:

  • In contrast to traditional tools, they are consistently designed for scalability. Most solutions run without significant overhead on clusters with hundreds or thousands of computers.
  • Many tools are being further developed as open source programs by large developer communities.
  • Most of the products are also backed by companies that offer enterprise-level support and commercial variants with an extended range of functions.
  • They were developed as open source software and handed over to the Apache Software Foundation. Purely commercial alternatives now exist as well, for example in the Microsoft Azure cloud (Event Hubs, Stream Analytics).
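The following sketch illustrates the streaming path described above: sensor events arriving in a Kafka topic are aggregated with Spark Structured Streaming and written to a Cassandra table. All topic, keyspace, table and column names are invented for this example, and the sketch assumes the Kafka source and the DataStax Cassandra connector are available on the cluster; it is not the supplier’s actual implementation.

```python
# Minimal sketch: Kafka -> Spark Structured Streaming -> Cassandra.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("sensor-stream").getOrCreate()

# Assumed structure of one sensor event (JSON in the Kafka message value).
schema = (StructType()
          .add("sensor_id", StringType())
          .add("ts", TimestampType())
          .add("value", DoubleType()))

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker1:9092")  # assumed broker address
       .option("subscribe", "sensor-events")               # assumed topic name
       .load())

events = (raw.selectExpr("CAST(value AS STRING) AS json")
          .select(F.from_json("json", schema).alias("e"))
          .select("e.*"))

# Example transformation: one-minute averages per sensor.
agg = (events
       .withWatermark("ts", "5 minutes")
       .groupBy(F.window("ts", "1 minute"), "sensor_id")
       .agg(F.avg("value").alias("avg_value"))
       .select(F.col("window.start").alias("minute"), "sensor_id", "avg_value"))

def write_to_cassandra(batch_df, batch_id):
    # Requires the spark-cassandra-connector on the classpath.
    (batch_df.write
     .format("org.apache.spark.sql.cassandra")
     .mode("append")
     .options(keyspace="measurements", table="sensor_minute_avg")  # assumed names
     .save())

query = (agg.writeStream
         .outputMode("update")
         .foreachBatch(write_to_cassandra)
         .start())
query.awaitTermination()
```

Writing each micro-batch via foreachBatch is one common way to combine a streaming aggregation with the Cassandra connector; depending on the connector version, a direct streaming sink may also be an option.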

Movement data

In addition to data volume and throughput or latency, another criterion for using new technologies is the variety of the data. While relational databases deal with highly structured data, this is not the case for big data applications: texts, videos, images, geodata, or XML and JSON files from web click streams and application logs are also relevant.
In our third case, a mobile network operator uses geodata and GSM data from the radio network, as well as numerous other sources, for analytical applications. Since almost everyone now carries a connected mobile phone, the operator can locate a subscriber quite precisely, depending on network coverage and the type of radio traffic. In addition, end devices and in some cases the subscribers themselves can be identified. The resulting data volumes are very large and require extensive interpretation, but then detailed movement profiles can be generated.

For data protection reasons, of course, subscriber data may only be processed anonymously, so that no conclusions can be drawn about individuals. Nevertheless, the possibilities remain diverse: with these methods one can make reliable statements about current road traffic flow, visualize commuter flows to identify typical catchment areas, or distinguish transit traffic from regional traffic.

The resulting data volumes soon exceed the petabyte mark and are quite diverse. Historical data is important, but some applications, such as traffic observation, also require real-time values. In addition, storage should remain independent of the data structures (variety). While the structure of data generally has to be known in advance for storage in relational databases, this is not the case for storage in file systems such as Hadoop: there it is sufficient to know the structure only when the data is read. This also enables the speculative storage of data for which no purpose has yet been defined but which will most likely matter one day.
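A small sketch illustrates this schema-on-read principle: raw JSON files (for example click-stream or log events) are simply stored in HDFS, and their structure is only interpreted at read time with Spark. Paths and field names are invented for illustration.

```python
# Schema-on-read sketch: no schema was defined when the files were written;
# Spark infers one only now, when the data is read and analyzed.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("schema-on-read").getOrCreate()

# Assumed HDFS path to raw, unmodified JSON event files.
events = spark.read.json("hdfs:///raw/clickstream/2016/11/*.json")
events.printSchema()  # shows whatever structure the files turn out to have

# Ad-hoc analysis against the inferred structure,
# e.g. hourly event counts for one device class.
(events
 .withColumn("ts", F.to_timestamp("timestamp"))      # assumed field
 .filter(F.col("device_type") == "mobile")           # assumed field
 .groupBy(F.date_trunc("hour", "ts").alias("hour"))
 .count()
 .orderBy("hour")
 .show())
```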

So all indicators point toward the new technologies. Consequently, Hadoop is used here for the long-term storage of structured and unstructured data, and Spark for its transformation and analysis. In addition to the cost-effective storage of mass data, this ensures scalability for future requirements.

Sources:

(1) https://blogs.gartner.com/doug-laney/files/2012/01/ad949-3D-Data-Management-Controlling-Data-Volume-Velocity-and-Variety.pdf, accessed on November 2nd, 2016
(2) https://tdwi.org/Articles/2015/03/31/Dimensional-Data-Warehouses-Get-Job-Done.aspx, accessed on November 2nd, 2016
(3) https://www.guinnessworldrecords.com/world-records/largest-data-warehouse, accessed on November 2nd, 2016

The text is available under the CC BY-SA 3.0 DE license.
License terms: https://creativecommons.org/licenses/by-sa/3.0/de/

