What are the real-time HDFS problems?

Hadoop is a framework written in Java that is designed for the distributed management and processing of large amounts of data in a cluster environment. The system is currently the recognized standard for big data analysis. Thanks to open interfaces, it can easily be extended with additional tools in modular fashion; some of these perform specialized tasks, while others simply represent alternatives to the standard Hadoop components. In the following, without claiming to be exhaustive, the most important tools for the essential functional areas of Hadoop are presented.

File systems: Hadoop Distributed File System (HDFS) and the alternatives

With the "Hadoop Distributed File System" (HDFS), the analysis framework already contains a file system that is specially designed for the distributed management of very large amounts of data. An HDFS cluster essentially consists of one or more nodes that manage the metadata (NameNodes) and the "DataNodes" on which the actual files are redundantly and distributed as data blocks of fixed length. Each DataNode usually corresponds to a server. In order to perform arithmetic operations, the client contacts the NameNode for the metadata, but otherwise exchanges data directly with the DataNodes. The biggest advantages of HDFS are:

+ Integrated high availability: HDFS clusters are inherently redundant and fail-safe. The user can therefore concentrate on the actual task of data analysis.

+ High-performance processing of large amounts of data: HDFS can manage several hundred million files without performance problems.

+ Easy portability: data in HDFS can be moved from one Hadoop distribution to another without difficulty.

+ Low cost: HDFS does not require expensive storage area networks (SANs) but uses standard hard drives in commodity servers.
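How this division of labour looks in practice is shown by the following minimal sketch using the standard HDFS Java API; the NameNode address namenode:8020 and the file paths are placeholder assumptions:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.nio.charset.StandardCharsets;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // The client asks the NameNode for metadata only; the block data
            // is exchanged directly with the DataNodes.
            conf.set("fs.defaultFS", "hdfs://namenode:8020");
            FileSystem fs = FileSystem.get(conf);

            // Write a file; HDFS splits it into fixed-length blocks and
            // stores the blocks redundantly on several DataNodes.
            Path file = new Path("/demo/hello.txt");
            try (FSDataOutputStream out = fs.create(file)) {
                out.write("Hello HDFS".getBytes(StandardCharsets.UTF_8));
            }

            // Read the file back.
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
                System.out.println(in.readLine());
            }
        }
    }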

However, the hierarchical structure also has disadvantages. Since all clients exchange information via the NameNode, the performance of the entire cluster can suffer when a large number of users access metadata in parallel. If only one NameNode is used, as was the case in older Hadoop distributions, it also forms a single point of failure: if the node fails or has to be shut down for maintenance, the entire cluster becomes unreachable. For this reason, the high availability (HA) feature for HDFS was introduced with version 2 of Hadoop. It offers the option of defining a second metadata node in active/passive mode as a hot standby.

HDFS alternative 1: CassandraFS (CFS)

Apache Cassandra is a NoSQL database (see also the section "Databases") whose main strengths are scalability and speed. With "CassandraFS" (CFS), the company DataStax has designed a file system that enables big data analyses on the basis of Cassandra. In contrast to the master-slave approach of HDFS, CFS is a non-hierarchical peer-to-peer system. For reliability and redundancy, CassandraFS relies on the replication mechanisms built into Cassandra (illustrated in the sketch after the list). Advantages compared to HDFS are:

+ Simpler structure: instead of having to define and set up cluster nodes with different roles, CFS requires no configuration steps beyond the Cassandra database itself.

+ Higher availability: according to the vendor DataStax, the reliability of the replication and redundancy mechanisms integrated in Cassandra is even higher than with HDFS, making data loss practically impossible.

+ Support for multiple data centers: with CFS, database clusters can be operated across multiple data centers. If necessary, the administrator can use keyspaces and the job tracker to determine which data resides at which location and where which analyses are carried out.
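As an illustration of the multi-data-center replication mentioned above, here is a minimal sketch using the DataStax Java driver; the contact point, keyspace name and data center names ("dc1", "dc2") are assumptions:

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.Session;

    public class CfsReplicationExample {
        public static void main(String[] args) {
            // Any node can act as coordinator in the peer-to-peer cluster.
            try (Cluster cluster = Cluster.builder()
                    .addContactPoint("127.0.0.1").build();
                 Session session = cluster.connect()) {
                // NetworkTopologyStrategy replicates per data center:
                // three copies in "dc1", two in "dc2".
                session.execute(
                    "CREATE KEYSPACE IF NOT EXISTS demo WITH replication = " +
                    "{'class': 'NetworkTopologyStrategy', 'dc1': 3, 'dc2': 2}");
            }
        }
    }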

HDFS alternative 2: OrangeFS and other parallel file systems

OrangeFS is a further development of PVFS (Parallel Virtual File System). It is open source and belongs to the Hadoop Compatible File Systems (HCFS), which can be used with Hadoop in place of HDFS (see the sketch after the list). In a test at Clemson University, clients accessing a storage cluster with OrangeFS completed classic MapReduce routines 25 percent faster than when using an HDFS cluster. Parallel file systems, which also include Lustre and CephFS, offer the following advantages:

+ Parallel access from the client to metadata and user data.

+ Fewer bandwidth and latency problems.
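Because OrangeFS is HCFS-compatible, it can be addressed through Hadoop's generic FileSystem API. The following sketch assumes that the OrangeFS Hadoop client is on the classpath and registers the ofs:// URI scheme; scheme, host and port here are placeholder assumptions:

    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HcfsExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // The URI scheme selects the file system implementation behind
            // Hadoop's generic FileSystem interface.
            FileSystem fs = FileSystem.get(
                URI.create("ofs://orangefs-host:3334/"), conf);
            for (FileStatus status : fs.listStatus(new Path("/"))) {
                System.out.println(status.getPath());
            }
        }
    }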

HDFS alternative 3: SwiftFS and other file systems for object storage

More and more Hadoop instances are operated not in a local server cluster but in a public or private cloud. Object-based storage systems are frequently used there, such as Amazon S3, Microsoft Azure Blob Storage or, in OpenStack environments, Swift. SwiftFS and similar file systems allow Hadoop operations to be performed directly on object storage (see the sketch after the list). This has the following advantages:

+ Compute and storage can be separated. For example, you can leave the data in the cloud for later analyses but shut down the compute machines and restart them only when needed, which saves costs.

+ Computing power and storage requirements can be scaled independently of one another.

+ Several computing clusters can access the same data.

+ Processes such as ETL (extract, transform, load) can access the data without even having to start Hadoop.
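The same FileSystem API can be pointed at an object store. A minimal sketch using the s3a connector that ships with Hadoop; the bucket name, the input path and the credential handling via environment variables are placeholder assumptions:

    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ObjectStoreExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Credentials for the object store, here assumed to be set as
            // environment variables.
            conf.set("fs.s3a.access.key", System.getenv("AWS_ACCESS_KEY_ID"));
            conf.set("fs.s3a.secret.key", System.getenv("AWS_SECRET_ACCESS_KEY"));

            // The compute cluster can be shut down at any time; the data
            // stays in the bucket. "my-bucket" is a placeholder.
            FileSystem fs = FileSystem.get(URI.create("s3a://my-bucket/"), conf);
            for (FileStatus status : fs.listStatus(new Path("/input"))) {
                System.out.println(status.getPath() + " (" + status.getLen() + " bytes)");
            }
        }
    }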

Databases for Hadoop: Cassandra, HBase, MongoDB & Co.

There are a number of so-called NoSQL data storage and management systems that can be used with Hadoop. These databases are intended to overcome the limitations of classic relational SQL systems, which usually run into difficulties with very large amounts of data in the petabyte range because they do not scale arbitrarily. The column-oriented database HBase, an open-source implementation of Google's Bigtable, is used very often (a short client sketch follows the list). HBase offers the following advantages:

+ Distributed, scalable and fault-tolerant.

+ Builds on the Hadoop stack.

+ Real-time processing for big data analytics.
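A minimal sketch of the HBase Java client API, showing the row-key-based write and read access typical of real-time use; table, column family and qualifier names are placeholders:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseExample {
        public static void main(String[] args) throws Exception {
            // Reads hbase-site.xml from the classpath.
            Configuration conf = HBaseConfiguration.create();
            try (Connection connection = ConnectionFactory.createConnection(conf);
                 Table table = connection.getTable(TableName.valueOf("events"))) {

                // Write one cell: row key, column family "d", qualifier "msg".
                Put put = new Put(Bytes.toBytes("row-1"));
                put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("msg"),
                    Bytes.toBytes("hello"));
                table.put(put);

                // Random-access read by row key.
                Result result = table.get(new Get(Bytes.toBytes("row-1")));
                System.out.println(Bytes.toString(
                    result.getValue(Bytes.toBytes("d"), Bytes.toBytes("msg"))));
            }
        }
    }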

HBase is, however, complex and plays to its strengths above all when real-time analyses are to be carried out on a subset of the data. An alternative that is easier to deploy is Cassandra, the database commercially backed by DataStax that was already mentioned in the section on file systems. For performance reasons, Cassandra refrains from immediately distributing newly written data to all servers in a cluster. This can lead to temporarily divergent states on different replicas; consistency is only established afterwards (eventual consistency; see the sketch after the list). The advantages of Cassandra at a glance:

+ Easy to set up and maintain.

+ Highly available.

+ Very good scalability.
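In Cassandra, eventual consistency is tunable per request. The following sketch with the DataStax Java driver assumes a keyspace demo with a table users (both placeholders) and shows how a client trades speed against consistency:

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.ConsistencyLevel;
    import com.datastax.driver.core.Session;
    import com.datastax.driver.core.SimpleStatement;

    public class CassandraConsistencyExample {
        public static void main(String[] args) {
            try (Cluster cluster = Cluster.builder()
                    .addContactPoint("127.0.0.1").build();
                 Session session = cluster.connect("demo")) {

                // ONE: a single replica acknowledges the write (fast).
                SimpleStatement write = new SimpleStatement(
                    "INSERT INTO users (id, name) VALUES (1, 'alice')");
                write.setConsistencyLevel(ConsistencyLevel.ONE);
                session.execute(write);

                // QUORUM: a majority of replicas must answer (consistent).
                SimpleStatement read = new SimpleStatement(
                    "SELECT name FROM users WHERE id = 1");
                read.setConsistencyLevel(ConsistencyLevel.QUORUM);
                System.out.println(session.execute(read).one().getString("name"));
            }
        }
    }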

The document-oriented database MongoDB can also be connected to Hadoop via a connector and used as a data source or as a store for query results. With strong consistency guarantees, a simple query language and secondary indexes, MongoDB comes closest to a typical SQL database in terms of functionality while remaining highly scalable and available as a NoSQL database (see the sketch after the list). The main advantages of MongoDB:

+ Index support for high performance.

+ Automatic partitioning (auto-sharding) for high scalability.

+ Master-slave replication model simplifies integration into applications.
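A minimal sketch with the MongoDB Java driver illustrating the secondary indexes mentioned above; connection string, database, collection and field names are placeholders:

    import com.mongodb.client.MongoClient;
    import com.mongodb.client.MongoClients;
    import com.mongodb.client.MongoCollection;
    import com.mongodb.client.model.Indexes;

    import org.bson.Document;

    public class MongoExample {
        public static void main(String[] args) {
            try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
                MongoCollection<Document> coll =
                    client.getDatabase("demo").getCollection("sensors");

                coll.insertOne(new Document("sensorId", 42).append("value", 17.3));

                // Secondary index: queries on sensorId no longer have to
                // scan the whole collection.
                coll.createIndex(Indexes.ascending("sensorId"));

                Document doc = coll.find(new Document("sensorId", 42)).first();
                System.out.println(doc.toJson());
            }
        }
    }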

Other databases in the Hadoop cosmos include Accumulo, a further development of the Bigtable approach on which HBase is also based; the cloud-based NoSQL database Amazon DynamoDB; the in-memory database Redis; and the graph-oriented Neo4j, to name just a few.

SQL queries on Hadoop clusters: Hive, Impala, Phoenix

Various tools make it easier for users to define, execute and manage SQL queries on Hadoop clusters. One of the best known is the data warehouse system Hive. It allows SQL-like queries on distributed data via HiveQL, can impose structure on data, and executes analyses with Tez, Spark or MapReduce (a JDBC sketch follows the list). The main advantages of Hive are:

+ Easily extensible via user-defined functions (UDFs).

+ Can handle almost any data format.

+ Hive queries can be extended with libraries such as Apache Mahout for complex analyses.
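Hive queries can be issued from Java via the standard HiveServer2 JDBC driver. A minimal sketch; host, credentials and the logs table are assumptions:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveJdbcExample {
        public static void main(String[] args) throws Exception {
            // Registers the HiveServer2 JDBC driver (hive-jdbc on the classpath).
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://hiveserver:10000/default", "hive", "");
                 Statement stmt = conn.createStatement();
                 // HiveQL looks like SQL; under the hood the query is compiled
                 // into Tez, Spark or MapReduce jobs.
                 ResultSet rs = stmt.executeQuery(
                     "SELECT category, COUNT(*) FROM logs GROUP BY category")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1) + ": " + rs.getLong(2));
                }
            }
        }
    }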

An alternative to Hive is Impala, developed by Cloudera. According to the manufacturer, Impala is faster than Hive and, thanks to massively parallel processing (MPP), can handle larger amounts of data than the Hive query engine. However, it requires more memory and is at a particular disadvantage when extensive operations such as joins are to be carried out on the data. Phoenix, originally developed by Salesforce.com, offers another way to run SQL queries on Hadoop clusters: it forms an SQL layer on top of an HBase database. Unlike the batch-oriented Hive, Phoenix accesses the data via native HBase APIs which, depending on the amount of data, speeds up analyses considerably.
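Phoenix is likewise addressed via JDBC. A minimal sketch; the ZooKeeper quorum in the URL and the table are placeholders:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class PhoenixExample {
        public static void main(String[] args) throws Exception {
            // Phoenix connects via the ZooKeeper quorum of the HBase cluster.
            try (Connection conn = DriverManager.getConnection(
                     "jdbc:phoenix:zk-host:2181");
                 Statement stmt = conn.createStatement()) {
                stmt.executeUpdate("CREATE TABLE IF NOT EXISTS metrics " +
                    "(host VARCHAR NOT NULL PRIMARY KEY, cpu DECIMAL)");
                // UPSERT is Phoenix's combined insert/update statement.
                stmt.executeUpdate("UPSERT INTO metrics VALUES ('node1', 0.75)");
                conn.commit(); // Phoenix does not auto-commit by default

                try (ResultSet rs = stmt.executeQuery("SELECT * FROM metrics")) {
                    while (rs.next()) {
                        System.out.println(rs.getString(1) + " " + rs.getBigDecimal(2));
                    }
                }
            }
        }
    }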

Analysis platforms: Pig, Scalding, Scoobi

With the scripting platform Pig, complex MapReduce transformations can be carried out on a Hadoop cluster. For this, Pig uses its own abstract programming language, Pig Latin. The platform translates Pig Latin scripts into MapReduce jobs, which are then applied to the data via YARN, Hadoop's own job and cluster management framework. Pig programs are particularly easy to parallelize, which speeds up and simplifies queries on very large data sets (see the sketch after the list). The main advantages of Pig:

+ Easy programming of complex, parallel analysis tasks.

+ Automatic code optimization.

+ Extensible with user-defined functions.
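Pig Latin scripts can also be embedded in Java via the PigServer class. A minimal sketch counting requests per IP address; the input and output paths are placeholders:

    import org.apache.pig.ExecType;
    import org.apache.pig.PigServer;

    public class PigExample {
        public static void main(String[] args) throws Exception {
            PigServer pig = new PigServer(ExecType.MAPREDUCE);
            // Count requests per IP address in a web server log.
            pig.registerQuery("logs = LOAD '/data/access.log' " +
                "USING PigStorage(' ') AS (ip:chararray, url:chararray);");
            pig.registerQuery("per_ip = GROUP logs BY ip;");
            pig.registerQuery("counts = FOREACH per_ip GENERATE group, COUNT(logs);");
            // store() triggers the actual MapReduce execution via YARN.
            pig.store("counts", "/out/ip_counts");
            pig.shutdown();
        }
    }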

Some alternatives to Pig are based on Scala ("Scalable Language"), a scalable, object-oriented and functional programming language; they include Scalding, developed by Twitter, and Scoobi from the Australian research center NICTA (National ICT Australia).

Hadoop administration and workflow management: Ambari, Oozie & Co.

Several tools make Hadoop management easier. Ambari, for example, offers a web interface for installing, managing and monitoring Hadoop clusters, and various third-party programs can connect to it via RESTful APIs (see the sketch after the list). Ambari runs as a server on one node of the cluster and from there installs agents on the hosts it is supposed to manage. The main advantages of Ambari are:

+ Enables automated cluster installation.

+ Central administration.

+ Easy extension of the Hadoop ecosystem.
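A hedged sketch of a read-only call against Ambari's REST API using only the JDK's HTTP client; server address, cluster name and credentials are placeholders:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.nio.charset.StandardCharsets;
    import java.util.Base64;

    public class AmbariRestExample {
        public static void main(String[] args) throws Exception {
            URL url = new URL(
                "http://ambari-host:8080/api/v1/clusters/mycluster/services");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            // Ambari uses HTTP basic authentication.
            String auth = Base64.getEncoder()
                .encodeToString("admin:admin".getBytes(StandardCharsets.UTF_8));
            conn.setRequestProperty("Authorization", "Basic " + auth);

            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
                String line;
                while ((line = in.readLine()) != null) {
                    System.out.println(line); // JSON description of the services
                }
            }
        }
    }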

Hue, developed by Cloudera, likewise simplifies Hadoop management via a browser-based user interface. It provides access to the files stored in HDFS and allows queries via Hive, Pig or Impala, for example.

For workflow management in Hadoop, Oozie or Azkaban, among others, can be used. Both tools combine a series of MapReduce, Pig, Java or script actions into an executable job. Oozie uses an XML file for this and defines the workflow as a directed acyclic graph (DAG), while Azkaban uses property files that describe a topological order. The advantages of Oozie in detail (a client sketch follows the list):

+ Supports variables and functions.

+ Workflows can contain decision branches.

+ Job workflows can be triggered by time or by the arrival of input data.
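A workflow defined in XML and stored in HDFS can be submitted programmatically with the Oozie Java client. A minimal sketch; the Oozie server URL and the HDFS paths are assumptions:

    import java.util.Properties;

    import org.apache.oozie.client.OozieClient;

    public class OozieExample {
        public static void main(String[] args) throws Exception {
            OozieClient oozie = new OozieClient("http://oozie-host:11000/oozie");

            // The application path points to a directory containing the
            // workflow.xml that describes the DAG.
            Properties conf = oozie.createConfiguration();
            conf.setProperty(OozieClient.APP_PATH, "hdfs://namenode:8020/apps/demo-wf");
            conf.setProperty("nameNode", "hdfs://namenode:8020");
            conf.setProperty("queueName", "default");

            String jobId = oozie.run(conf); // submits and starts the workflow
            System.out.println("Started workflow " + jobId);
            System.out.println("Status: " + oozie.getJobInfo(jobId).getStatus());
        }
    }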

The advantages of Azkaban in comparison:

+ Resources can be controlled and explicitly locked.

+ Can be run standalone or as a server.

+ All states of a running workflow are kept in memory.

Other workflow managers for Hadoop include Luigi, developed by Spotify, and Airflow, an Airbnb development that only recently received incubator status at Apache.

Data management and messaging: Flume, Sqoop, Kafka

A number of tools make it possible to collect, filter and transport data in Hadoop and make it available again after analysis. Flume, for example, is a distributed service that can collect and move large amounts of log data (see the sketch after the list). Its main advantages are:

+ Simple and flexible architecture.

+ Robust and fault-tolerant.

+ Simple data model.
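Applications can hand events to a Flume agent via its RPC client API. A minimal sketch that assumes an Avro source listening on flume-host:41414:

    import java.nio.charset.StandardCharsets;

    import org.apache.flume.Event;
    import org.apache.flume.api.RpcClient;
    import org.apache.flume.api.RpcClientFactory;
    import org.apache.flume.event.EventBuilder;

    public class FlumeExample {
        public static void main(String[] args) throws Exception {
            // Connects to the Avro source of a Flume agent.
            RpcClient client = RpcClientFactory.getDefaultInstance("flume-host", 41414);
            try {
                Event event = EventBuilder.withBody(
                    "user=42 action=login", StandardCharsets.UTF_8);
                client.append(event); // throws EventDeliveryException on failure
            } finally {
                client.close();
            }
        }
    }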

If data is to be transferred between HDFS and a relational database, Sqoop is the tool of choice. Sqoop can import data from a structured database into HDFS or a NoSQL system such as HBase, and export information in the opposite direction (see the sketch after the list). Benefits include:

+ Parallel data transfer and fault tolerance during import and export.

+ Largely automates data transfer between Hadoop and databases or mainframes.
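Sqoop is normally driven from the command line, but it can also be invoked programmatically. A hedged sketch using Sqoop.runTool of Sqoop 1; the JDBC URL, credentials, table and target directory are placeholders:

    import org.apache.sqoop.Sqoop;

    public class SqoopExample {
        public static void main(String[] args) {
            // Mirrors the usual command-line invocation of "sqoop import".
            String[] importArgs = {
                "import",
                "--connect", "jdbc:mysql://dbhost/sales",
                "--username", "etl",
                "--password-file", "/user/etl/.password",
                "--table", "orders",
                "--target-dir", "/data/orders",
                "--num-mappers", "4" // transfer in four parallel map tasks
            };
            int exitCode = Sqoop.runTool(importArgs);
            System.out.println("Sqoop finished with exit code " + exitCode);
        }
    }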

Information such as log files, website transactions or geodata can be collected and distributed in Hadoop with Kafka. The messaging system, written in Scala, follows a publish/subscribe model: producers publish feeds of messages to topics, and consumers subscribe to exactly the feeds they need. Instead of having to search through large amounts of data, users can thus interactively access exactly the information they currently need (see the producer sketch after the list). The main advantages of the system:

+ High data throughput.

+ Supports online and offline processing.

+ Reduced network load by grouping messages into message sets.
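A minimal sketch of a Kafka producer publishing to a topic; broker address, topic name and message content are placeholders:

    import java.util.Properties;

    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.Producer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class KafkaExample {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "kafka-host:9092");
            props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

            try (Producer<String, String> producer = new KafkaProducer<>(props)) {
                // Publish a log line to the "weblogs" topic; any number of
                // consumers can subscribe to the topic independently.
                producer.send(new ProducerRecord<>("weblogs", "host1",
                    "GET /index.html 200"));
            }
        }
    }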