What is the future of HDInsight

Azure HDInsight 4.0 overview

Azure HDInsight is one of the most popular services among enterprise customers for Apache Hadoop and Apache Spark. HDInsight 4.0 is a cloud distribution of Apache Hadoop components. This article describes the most recent Azure HDInsight release and how to upgrade to it.

What's new in HDInsight 4.0

Apache Hive 3.0 and Low-Latency Analytical Processing (LLAP)

Apache Hive Low-Latency Analytical Processing (LLAP) uses persistent query servers and in-memory caching to deliver fast SQL query results on data in remote cloud storage. Hive LLAP uses a set of persistent daemons that execute fragments of Hive queries. Query execution under LLAP is similar to Hive without LLAP, except that worker tasks run inside LLAP daemons rather than in containers.

Hive LLAP offers the following advantages, among others:

  • Deep SQL analytics without sacrificing performance or flexibility, including complex joins, subqueries, windowing functions, sorting, user-defined functions, and complex aggregations.

  • Interactive queries against data in the same storage where the data is prepared, without the need to move data from storage to another engine for analytical processing.

  • Query result caching, which lets previously computed query results be reused. This saves the time and resources that would otherwise be spent running the cluster tasks the query requires.
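
The result-caching idea in the last point can be illustrated with a small sketch. This is a toy model of the concept only, not how Hive LLAP implements its cache; the class name `QueryResultCache` and the `run_query` callable are hypothetical.

```python
import re


class QueryResultCache:
    """Toy cache that reuses results for textually equivalent queries."""

    def __init__(self, run_query):
        self._run_query = run_query  # hypothetical callable that executes a query
        self._results = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _normalize(sql):
        # Collapse whitespace and drop a trailing semicolon so that trivially
        # different spellings of the same query share one cache entry.
        return re.sub(r"\s+", " ", sql).strip().rstrip(";").strip()

    def execute(self, sql):
        key = self._normalize(sql)
        if key in self._results:
            self.hits += 1
            return self._results[key]
        self.misses += 1
        result = self._run_query(sql)
        self._results[key] = result
        return result

    def invalidate(self):
        # A real engine has to discard cached results when the underlying
        # tables change; this sketch simply clears everything.
        self._results.clear()
```

Only the first execution of a query reaches `run_query`; repeats are served from memory until `invalidate()` is called.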

Hive dynamic materialized views

Hive now supports dynamic materialized views, that is, the pre-computation of relevant summaries. These views speed up query processing in data warehouses. Materialized views can be stored natively in Hive and can seamlessly use LLAP acceleration.
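
As a rough illustration, the sketch below builds the HiveQL statements that create and refresh a materialized view. The `CREATE MATERIALIZED VIEW ... AS` and `ALTER MATERIALIZED VIEW ... REBUILD` forms are Hive 3 DDL; the helper names and the view, table, and column names are hypothetical.

```python
import re

# Accepts "name" or "db.name"; a deliberately strict check for this sketch.
_IDENTIFIER = r"[A-Za-z_][A-Za-z0-9_]*(\.[A-Za-z_][A-Za-z0-9_]*)?"


def _check_name(name):
    if not re.fullmatch(_IDENTIFIER, name):
        raise ValueError(f"unexpected identifier: {name!r}")
    return name


def create_materialized_view_sql(view_name, select_sql):
    """Return HiveQL that pre-computes a summary as a materialized view."""
    body = select_sql.strip().rstrip(";")
    return f"CREATE MATERIALIZED VIEW {_check_name(view_name)} AS\n{body};"


def rebuild_materialized_view_sql(view_name):
    """Return HiveQL that refreshes the view after its source data changes."""
    return f"ALTER MATERIALIZED VIEW {_check_name(view_name)} REBUILD;"


# Hypothetical summary over a hypothetical `sales` table.
print(create_materialized_view_sql(
    "mv_sales_by_region",
    "SELECT region, SUM(amount) AS total FROM sales GROUP BY region",
))
```

The generated statements can then be submitted through Beeline or any other Hive client.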

Hive transactional tables

HDI 4.0 includes Apache Hive 3. Hive 3 requires ACID (atomicity, consistency, isolation, durability) compliance for transactional tables stored in the Hive warehouse. ACID-compliant tables and their data are accessed and controlled by Hive. Data in CRUD (create, retrieve, update, delete) tables must be in the ORC (Optimized Row Columnar) file format. Insert-only tables, however, support all file formats.

Note

ACID (transactional) support works only for managed tables, not for external tables. Hive external tables are designed so that external parties can read and write table data without Hive altering the underlying data. With ACID tables, Hive may alter the underlying data through compactions and transactions.
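
The format rules above can be captured in a small sketch that builds the table DDL. The `transactional` and `transactional_properties` table properties are Hive settings; the helper name and the table and column names are hypothetical.

```python
def create_managed_table_sql(name, columns, file_format="ORC", insert_only=False):
    """Build DDL for a managed transactional table.

    Full CRUD tables must be ORC; insert-only tables may use any file format.
    """
    fmt = file_format.upper()
    if not insert_only and fmt != "ORC":
        raise ValueError("full CRUD (ACID) tables must be stored as ORC")

    props = {"transactional": "true"}
    if insert_only:
        props["transactional_properties"] = "insert_only"

    cols = ", ".join(f"{col} {typ}" for col, typ in columns)
    prop_sql = ", ".join(f"'{k}'='{v}'" for k, v in props.items())
    # No EXTERNAL keyword: ACID support applies to managed tables only.
    return f"CREATE TABLE {name} ({cols}) STORED AS {fmt} TBLPROPERTIES ({prop_sql});"


# Hypothetical tables: one full CRUD table and one insert-only table.
print(create_managed_table_sql("orders", [("id", "INT"), ("amount", "DECIMAL(10,2)")]))
print(create_managed_table_sql("events", [("msg", "STRING")],
                               file_format="TEXTFILE", insert_only=True))
```

Rejecting non-ORC formats for CRUD tables in the helper surfaces the restriction before the statement ever reaches Hive.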

The advantages of ACID tables include:

  • ACID v2 has performance improvements in storage format and execution engine.

  • ACID is enabled by default to allow full support for data updates.

  • Improved ACID capabilities allow you to update and delete at the row level.

  • No performance overhead.

  • No bucketing required.

  • Spark can read and write Hive ACID tables through the Hive Warehouse Connector.

Learn more about Apache Hive 3.

Apache Spark

Apache Spark gains updatable tables and ACID transactions through the Hive Warehouse Connector. With the Hive Warehouse Connector, you can register Hive transactional tables as external tables in Spark and access their full transactional functionality. Previous versions supported only table partition manipulation. The Hive Warehouse Connector also supports streaming DataFrames, which stream reads and writes between Spark and transactional and streaming Hive tables.

Spark executors can connect directly to Hive LLAP daemons to retrieve and update data in a transactional manner, so that Hive remains in control of the data.

Apache Spark on HDInsight 4.0 supports the following scenarios:

  • Train machine learning models on the same transactional table that is used for reporting
  • Use ACID transactions to safely add columns from Spark ML to a Hive table
  • Run a Spark streaming job on the change feed from a Hive streaming table
  • Create ORC files directly from a Spark Structured Streaming job

You no longer have to worry about accidentally accessing Hive transactional tables directly from Spark, which could cause inconsistent results, duplicate data, or data corruption. In HDInsight 4.0, Spark tables and Hive tables are kept in separate metastores. Use the Hive Warehouse Connector to explicitly register Hive transactional tables as Spark external tables.
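
A minimal sketch of reading a Hive transactional table from PySpark through the connector might look like the following. It assumes the connector's Python package `pyspark_llap` is available on the cluster and that the session object exposes `executeQuery`; verify both against the connector version you run. The helper names and the table and view names are hypothetical, and exposing the result as a temporary view is only one simple way to make the table queryable from Spark SQL.

```python
def build_hive_session(spark):
    """Create a Hive Warehouse Connector session from an existing SparkSession."""
    # Imported lazily: pyspark_llap ships with the connector (assumption),
    # so it is normally present only on the cluster.
    from pyspark_llap import HiveWarehouseSession
    return HiveWarehouseSession.session(spark).build()


def register_as_spark_view(hive, hive_table, view_name):
    """Read a Hive table through the connector and expose it to Spark SQL."""
    df = hive.executeQuery(f"SELECT * FROM {hive_table}")
    df.createOrReplaceTempView(view_name)
    return df


# On a cluster, usage would be roughly:
#   hive = build_hive_session(spark)
#   register_as_spark_view(hive, "sales.orders", "orders_v")
#   spark.sql("SELECT COUNT(*) FROM orders_v").show()
```

Because every read goes through the connector session, Hive stays in control of the transactional data.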

Learn more about Apache Spark.

Apache Oozie

Apache Oozie 4.3.1 is included in HDI 4.0 with the following changes:

  • Oozie no longer runs Hive actions. The Hive command-line interface has been removed and replaced with BeeLine.

  • You can exclude unwanted dependencies from the shared library by adding an exclusion pattern to your job.properties file.
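
To show how such an exclusion pattern could behave, the sketch below parses a job.properties fragment and filters a list of shared-library jars. The property name `oozie.action.sharelib.for.spark.exclude`, the example pattern, the jar paths, and the full-match semantics are all assumptions made for illustration; check the Oozie documentation for the exact property and matching rules.

```python
import re

# Hypothetical job.properties content; the exclude property name is an assumption.
JOB_PROPERTIES = """\
# workflow settings
oozie.use.system.libpath=true
oozie.action.sharelib.for.spark.exclude=.*jackson.*
"""


def parse_properties(text):
    """Parse simple key=value lines, skipping blanks and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def apply_exclusion(jars, pattern):
    """Drop every jar whose whole path matches the pattern (assumed semantics)."""
    rx = re.compile(pattern)
    return [jar for jar in jars if not rx.fullmatch(jar)]


props = parse_properties(JOB_PROPERTIES)
jars = [
    "share/lib/spark/jackson-core.jar",   # hypothetical paths
    "share/lib/spark/spark-core.jar",
]
print(apply_exclusion(jars, props["oozie.action.sharelib.for.spark.exclude"]))
```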

Learn more about Apache Oozie.

Upgrading to HDInsight 4.0

Thoroughly test your components before deploying the latest version in a production environment. HDInsight 4.0 is available so that you can begin the upgrade process. However, HDInsight 3.6 remains the default option to prevent accidental mishaps.

There is no supported upgrade path from previous versions of HDInsight to HDInsight 4.0. Because the metastore and blob data formats have changed, version 4.0 is not compatible with previous versions. It is important to keep the new HDInsight 4.0 environment separate from your current production environment. If you deploy HDInsight 4.0 to your current environment, your metastore will be permanently upgraded.

Limitations

  • HDInsight 4.0 does not support MapReduce for Apache Hive. Use Apache Tez instead. Learn more about Apache Tez.
  • HDInsight 4.0 does not support Apache Storm.
  • HDInsight 4.0 does not support the "ML Services" cluster type.
  • Hive View is only available for HDInsight 4.0 clusters with a version number equal to or greater than 4.1. You can find this version number under "Ambari Admin -> Versions".
  • Shell interpreter in Apache Zeppelin is not supported in Spark and Interactive Query clusters.
  • You cannot disable LLAP on a Spark-LLAP cluster. You can only turn LLAP off.
  • Azure Data Lake Storage Gen2 cannot store Jupyter notebooks in a Spark cluster.
  • Apache Pig runs on Tez by default. However, you can change it to MapReduce.
  • The Spark SQL Ranger integration for row and column security is deprecated.
  • Spark 2.4 and Kafka 2.1 are available in HDInsight 4.0, so Spark 2.3 and Kafka 1.1 are no longer supported. Using Spark 2.4 and Kafka 2.1 or later is recommended in HDInsight 4.0.
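
When planning a migration, some of these limitations can be turned into a simple pre-flight check. The sketch below encodes only the unsupported components and minimum versions listed above; the function name, the input shape (component name mapped to a version string or `None`), and the message wording are hypothetical.

```python
# Components the list above says HDInsight 4.0 does not support.
UNSUPPORTED = {
    "storm": "Apache Storm is not supported in HDInsight 4.0",
    "ml services": "the ML Services cluster type is not supported in HDInsight 4.0",
}

# Minimum versions taken from the list above.
MIN_VERSIONS = {"spark": "2.4", "kafka": "2.1"}


def version_tuple(version):
    """Turn '2.4' into (2, 4) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def check_workload(components):
    """Return a list of issues for a {component: version-or-None} mapping."""
    issues = []
    for name, version in components.items():
        key = name.lower()
        if key in UNSUPPORTED:
            issues.append(f"{name}: {UNSUPPORTED[key]}")
        elif key in MIN_VERSIONS and version is not None:
            if version_tuple(version) < version_tuple(MIN_VERSIONS[key]):
                issues.append(f"{name} {version}: use {MIN_VERSIONS[key]} or later")
    return issues


# Hypothetical workload inventory.
for issue in check_workload({"Spark": "2.3", "Kafka": "2.1", "Storm": None}):
    print(issue)
```

An empty result means only that none of the encoded limitations apply; the remaining items in the list still need a manual review.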

Next Steps