What is active data warehousing

Hadoop and the future of the data warehouse

How important is Hadoop in the latest generation of data warehouses?

Bigger, faster, cheaper: that is the promise of distributed data processing infrastructures such as Hadoop or MongoDB. Typical applications are

  • sentiment analysis,
  • recommendation engines or
  • real-time personalization in e-commerce.

How can Hadoop save costs and create value in the traditional business intelligence (BI) and data warehouse (DWH) environment?

Myths about Hadoop
First, it is important to know what Hadoop really is and what it is not. When people talk about Hadoop, they usually mean a whole portfolio of solutions. The core consists of the distributed file system HDFS, the resource manager YARN and the processing framework MapReduce, which ran without YARN before Hadoop 2. On top of this core there is a bewildering variety of tools such as

  • Hive: a module for SQL-like access to structured data
  • Pig: a high-level language for creating distributed computing jobs
  • HBase: a column store in the style of Google BigTable
  • Mahout: a toolkit for data mining

And there are many more. Technology consultants can help you shed light on this jungle and find the right modules for you.
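To make the processing model more tangible, here is a minimal sketch of the map, shuffle and reduce steps that MapReduce distributes across a cluster. It is plain Python running in a single process, not Hadoop code; all function names are chosen for illustration only.

```python
from collections import defaultdict
from typing import Iterable, Iterator


def map_phase(line: str) -> Iterator[tuple[str, int]]:
    """Emit one (word, 1) pair per word, as a mapper would for its input split."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[tuple[str, int]]) -> dict[str, list[int]]:
    """Group all values by key; in a cluster this step moves data between nodes."""
    grouped: dict[str, list[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: list[int]) -> tuple[str, int]:
    """Combine all values of one key into a single result."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> dict[str, int]:
    """Run the three phases one after the other on a small in-memory input."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["big data big cluster", "data warehouse"])
```

On a real cluster, many mappers and reducers run at the same time on different nodes, each on the part of the data stored locally. Tools such as Hive and Pig generate jobs of this kind so that users do not have to write them by hand.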

Hadoop is, so to speak, the Linux of data processing. It is not a tool, but provides a whole platform around which an ecosystem of tools has evolved. Distributors such as Hortonworks, Cloudera or MapR open up this platform for use in companies.

The highlight of Hadoop
What is the key to Hadoop's success? The answer is simple: data storage on a cluster of commodity hardware scales with the demands of the application. As soon as

  • particularly large amounts of data or
  • unstructured data

have to be processed, the solution shows its advantages. This is especially true if a later increase in data volume or complexity is essential to the business model.

Case 1: Hadoop as a staging area in the ETL process
In the staging step of an ETL process, large amounts of data have to be stored temporarily and processed quickly. The requirements for a staging area differ from those for an operational database. Nevertheless, current data warehouses often use a classic relational database for staging. This is expensive, and in the worst case additional costs for complex extraction steps or bottlenecks in processing capacity inhibit the productivity and creativity of the analysts. Hadoop provides cheaper storage space, so that large amounts of data can be processed and complex questions become answerable.

In particular, you can enrich your own data with external factors such as press coverage, weather or trends.
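As an illustration of such an enrichment, the following sketch joins internal sales figures with external weather observations in the staging area. The data, the field names and the `enrich` function are hypothetical and only serve to show the principle; on a cluster the same join would typically be expressed in Hive or Pig.

```python
from datetime import date

# Hypothetical internal data: daily revenue per store.
sales = [
    {"day": date(2014, 7, 1), "store": "Berlin", "revenue": 1200.0},
    {"day": date(2014, 7, 2), "store": "Berlin", "revenue": 800.0},
]

# Hypothetical external data: weather observations per day and city.
weather = {
    (date(2014, 7, 1), "Berlin"): {"temp_c": 28.5, "rain": False},
    (date(2014, 7, 2), "Berlin"): {"temp_c": 17.0, "rain": True},
}


def enrich(sales_rows, weather_by_key):
    """Left-join sales rows with weather; rows without a match get None values."""
    enriched = []
    for row in sales_rows:
        obs = weather_by_key.get((row["day"], row["store"]), {})
        enriched.append({**row, "temp_c": obs.get("temp_c"), "rain": obs.get("rain")})
    return enriched


enriched_sales = enrich(sales, weather)
```

Each enriched row now carries the weather of the day next to the revenue, so that a later analysis can ask whether rain or temperature influences sales.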

Case 2: Hadoop as an ELT worker
Using Hadoop as an ELT worker is similar to using it as a cost-effective staging area. Here, however, the additional leverage comes from the improved processing capacity: the calculations are carried out as an ELT process, quasi in-database, on the Hadoop nodes instead of in the ETL process. This makes computationally intensive analyses feasible that were previously not manageable, for example questions about complicated correlations and links in the data.
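The difference between ETL and ELT can be sketched in a few lines: the raw data is loaded unchanged first, and the transformation then runs as SQL inside the engine that already holds the data. In this sketch Python's built-in sqlite3 module merely stands in for a SQL engine on the cluster such as Hive; the table and column names are assumptions made for the example.

```python
import sqlite3

# sqlite3 stands in for the SQL engine on the cluster (an assumption of this sketch).
conn = sqlite3.connect(":memory:")

# E + L: load the raw records unchanged into a staging table.
conn.execute("CREATE TABLE raw_orders (customer TEXT, amount_cents INTEGER)")
raw_rows = [("alice", 1250), ("bob", 400), ("alice", 750)]
conn.executemany("INSERT INTO raw_orders VALUES (?, ?)", raw_rows)

# T: transform inside the engine, where the data already lives.
conn.execute(
    """
    CREATE TABLE customer_revenue AS
    SELECT customer, SUM(amount_cents) / 100.0 AS revenue_eur
    FROM raw_orders
    GROUP BY customer
    """
)

result = conn.execute(
    "SELECT customer, revenue_eur FROM customer_revenue ORDER BY customer"
).fetchall()
```

Because the aggregation runs where the data is stored, no intermediate extract has to be moved through a separate ETL server. On a Hadoop cluster the same statement would be executed in parallel on the nodes.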

Case 3: Hadoop preprocessing unstructured data
Hadoop plays to its strengths when it comes to analyzing unstructured data. The reason is the distributed infrastructure, which enables parallel processing on multiple nodes. Complexity can be managed through a cost-effective hardware scale-out (more servers) instead of an expensive hardware scale-up (larger servers). The software does not have to be scaled up either, for example through higher costs for volume licenses.

Analyzing unstructured data is often necessary, especially when it comes to enriching business data. Examples are the preparation of the company's own content such as documentation or e-mail communication, as well as portal logs, press feeds or machine data. The Hadoop cluster enables this data to be preprocessed and made available for use in the data warehouse.
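What such preprocessing can look like is sketched below for log data: raw text lines are turned into structured rows, partition by partition and in parallel. The log layout, the regular expression and the function name are assumptions for this example, and a thread pool merely stands in for the nodes of a cluster.

```python
import re
from concurrent.futures import ThreadPoolExecutor

# Assumed log layout for this sketch: "<ISO timestamp> <LEVEL> <free text>".
LOG_PATTERN = re.compile(r"^(?P<ts>\S+) (?P<level>[A-Z]+) (?P<message>.*)$")


def parse_partition(lines):
    """Turn the raw lines of one partition into structured rows; skip unparsable lines."""
    rows = []
    for line in lines:
        match = LOG_PATTERN.match(line.strip())
        if match:
            rows.append(match.groupdict())
    return rows


# Each inner list plays the role of a data block stored on a different node.
partitions = [
    ["2014-07-01T10:00:00 INFO user login", "garbage line"],
    ["2014-07-01T10:00:05 ERROR payment failed"],
]

# Threads stand in for cluster nodes that work on their partitions in parallel.
with ThreadPoolExecutor(max_workers=2) as pool:
    structured = [row for rows in pool.map(parse_partition, partitions) for row in rows]
```

The resulting rows have fixed fields (`ts`, `level`, `message`) and can be loaded into the data warehouse like any other structured source. Adding more partitions and more workers increases throughput without changing the parsing logic, which is the scale-out idea described above.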

Case 4: Hadoop as a database for fine-grained data
Use cases with a particularly high volume of data have so far been out of reach for classic database technology because data arriving at such rates simply cannot be stored quickly enough, let alone analyzed. Examples are the searchable storage of many documents and the storage of server or machine logs.

In these cases, Apache HBase shines. It is a schema-free wide-column store, a form of key-value store, from the Hadoop environment that follows the Google BigTable paradigm. The great advantage of this solution is that enormous amounts of data can be saved in a redundant and fail-safe manner in a short time. The storage layout also enables high-performance analysis of the data afterwards. If real-time analyses are required, the data can be processed live at the same time by a stream processor such as Apache Storm or Apache S4.
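The storage idea behind a wide-column store can be illustrated with a toy model: rows are kept sorted by their key, each row may hold any set of columns, and a well-chosen row key turns a typical query into a cheap range scan. The class below is an in-memory sketch in plain Python and not the HBase API; the key layout (host name plus reversed timestamp) is one common design choice, assumed here for log data.

```python
import bisect

MAX_TS = 10**13 - 1  # largest 13-digit number, used to reverse the sort order of timestamps


def row_key(host: str, ts_millis: int) -> str:
    """Build a key that sorts by host first and by newest entry within a host."""
    return f"{host}#{MAX_TS - ts_millis:013d}"


class ToyWideColumnStore:
    """Rows sorted by key; each row holds an arbitrary set of columns (no fixed schema)."""

    def __init__(self):
        self._keys = []  # sorted list of row keys
        self._rows = {}  # row key -> {column name: value}

    def put(self, key, columns):
        """Insert a new row or add columns to an existing one."""
        if key not in self._rows:
            bisect.insort(self._keys, key)
            self._rows[key] = {}
        self._rows[key].update(columns)

    def scan_prefix(self, prefix):
        """Return all rows whose key starts with prefix, in key order."""
        start = bisect.bisect_left(self._keys, prefix)
        result = []
        for key in self._keys[start:]:
            if not key.startswith(prefix):
                break
            result.append((key, self._rows[key]))
        return result


store = ToyWideColumnStore()
store.put(row_key("web01", 1_000), {"log:level": "INFO"})
store.put(row_key("web01", 2_000), {"log:level": "ERROR", "log:code": "500"})
store.put(row_key("web02", 1_500), {"log:level": "WARN"})

newest_first = [cols["log:level"] for _, cols in store.scan_prefix("web01#")]
```

Because all entries of one host are stored next to each other and the newest comes first, a question such as "the latest log lines of web01" only touches a small, contiguous part of the data. Note also that the two rows for web01 carry different columns, which is what "schema-free" means in practice.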

Case 5: Hadoop as a long-term active archive for raw data
Classic database solutions have one problem: they are expensive when it comes to large amounts of data. Many companies therefore archive their old data offline, either on tape or on disk. In the worst case, the data is even deleted.

Those who act in this way are wasting great potential. One way to realize this potential is to use Hadoop as an active archive. Thanks to commodity hardware, the costs remain manageable and the data can be accessed at any time. The possible applications have enormous business value:

  • Batch analyses over the entire data set since the beginning of data recording
  • ETL processing of subsets of the data in data marts
  • Interactive exploration of the raw data

The exploratory analysis of raw data is still a niche application that requires in-depth technical know-how. All other analyses do not require any additional qualification from the business user. The data is then available in the usual form, only in much more detail and with a long-term view of the past. In the final expansion stage of such a data store, there is the vision of a universal enterprise data hub, a kind of bus on which all of the company's information converges, is automatically processed and is stored for the future. All of this is already possible with today's technology.

Hadoop does not eliminate the need for data warehouses or relational databases. Its role is currently more in the technical and business-oriented preparation of the data so that it is then available in the data warehouse. It therefore supplements the DWH and enhances it for the business user.

Contact: Dipl.-Phys. Johannes Knauf, Consultant BI, Ancud IT-Beratung GmbH