Pre-requisites for Hadoop to make inroads into the Enterprise Data Warehouse space is to have the following three items in place:
- Subsecond response times for SQL queries (often refered to as interactive or real time queries). Performance similar to existing MPP RDBMS such as Teradata.
- Support for a rich SQL feature set
- Support for Update and Delete DML operations.
As a deep dive to the SQL engine supported for Hadoop, here is what we analyzed and recommendation for the best SQL engine to go for.
Initially developed by Facebook, Hive is the original SQL framework on Hadoop. The motivation to develop Hive was to provide an abstraction layer on top of Map Reduce (M/R) to make it easier for analysts and data scientists to query data on the Hadoop File System. Rather than write hundreds of lines of Java code to get answers to relatively simple questions the objective was to offer SQL, the natural choice of the data analyst. While this approach works well in a batch oriented environment it does not perform well for interactive workloads in near real time. The problem with the original M/R framework was that it works in stages and at each stage the data is set down to disk and then again read from disk in the next phase. In addition the various stages can not be parallelized. This is highly inefficent and the rationale for the Apache Tez project. Similar to M/R, Tez is a Hive execution engine developed by Hortonworks (also committers from Facebook, Microsoft, and Yahoo).
Hive on Apache Tez
Tez is part of the Stinger initiative led by Hortonworks to make Hive enterprise ready and suitable for realtime SQL queries. The two main objectives of the initiative were to increase performance and offer a rich set of SQL features such as analytic functions, query optimization, and standard data types such as timestamp etc. Tez is the underlying engine that creates more efficient execution plans in comparison to Map Reduce. The Tez design is based on research done by Microsoft on parallel and distributed computing. The two main objectives were delivered as part of the recent Hive 0.13 release. The roadmap for release 0.14 includes DML functionality such as Updates and Inserts for lookup tables.
Hive on Spark
Recently, Cloudera together with MapR, Intel, and Databricks spearheaded a new initiative to add a third execution engine to the mix. They propose to add Spark as a third Hive execution engine. Developers then will be able to choose between Map Reduce, Tez, and Spark as their execution engine for Hive. Based on the design document the three engines will be fully interchangeable and compatible. Cloudera see Spark as the next generation distributed processing engine, which has various advantages over the Map Reduce paradigm, e.g. intermediate resultsets can be cached in memory. Going forward, Spark will underpin many of the components in the Cloudera platform. The rationale for Hive on Spark then is to make Spark available to the vast amount of Hive users and establish Hive on the Spark framework. It will also allow users to run faster Hive queries without having to install Tez. Contrary to Hortonworks, Cloudera don’t see Hive on Spark (or Hive on Tez) to be suitable as a realtime SQL query engine. Their flagship product for interactive SQL queries is Impala, while Databricks see Spark SQL as the tool of choice for realtime queries.
Impala is a massively parallel SQL query engine. It is based on Google Dremel and Google Big Query.
. Alternatively, you can add the Cloudera repository and download it from there. Impala has various per-requisites in terms of the libraries and respective versions it supports, e.g. it relies (as of this post’s writing) on Hive 0.12 and Hadoop 2.3. So if you want to install it on the latest Hortonworks distro, you are out of luck.
Facebook was and is a heavy user of Hive. However, for some of their workloads they required low latency response times in an interactive fashion. This is behind the rationale of Presto.
Spark SQL and Shark
Similar to Impala, Apache Drill is another MPP SQL query engine inspired by the Google Dremel paper. Apache Drill is mainly supported by MapR. At the moment it is in alpha release. Together with Spark SQL It is at the moment of this writing the least mature SQL solution on Hadoop. As outlined by MapR Apache Drill will be available Q2 2014.
- This benchmark by Cloudera compares Impala to Shark (disk and memory), Hive Tez (0.13), and Presto. Unsurprisingly, Cloudera Impala scores best here :-).
- This benchmark by Cloudera compares Impala to Hive (0.12 and not 0.13) and to an unnamed MPP RDBMS. Surprise surprise, Cloudera Impala scores best here :-).
- This benchmark by InfiniDB compares InfinDB to Presto, Impala, Hive on M/R. For all workloads InfinDB is the performance winner.
- This benchmark by AMPLab at UC Berkley compares Redshift to Hive on M/R, Hive on Tez, Impala, and Shark. Performance winner for most workloads is Amazon Redshift
- This benchmark by Hortonworks compares performance between Hive M/R (0.10) and Hive Tez (0.13). Interestingly there is no comparison to other Hadoop SQL engines. You can draw or own conclusions why this is.
- This benchmark by Gruter compares Hive M/R to Impala and Apache Tajo.
Conclusion and Recommendation
- As of this writing the most mature product with the richest feature set is Apache Hive (running on Tez). Crucially it offers analytic functions, support for the widest set of file formats, and ACID support (full support in release 0.14)
- As of the current release, Impala lacks important SQL features. However, this is about to change in Impala 2.0.
- Once it has matured Hive on Spark should be a very good alternative to Hive on Tez.
- While Hortonworks claims that Hive can be used for interactive queries, Cloudera questions this. The various benchmarks are not conclusive. As always you should test yourself if Hive is suitable for realtime queries for your workload and use case.
- All of the different solutions follow a similar approach in that they all first create a logical query plan in a Directed Acyclic Graph (DAG). This is then translated into a physical execution plan and the various components and operators of the explain plan are then executed in a distributed fashion.
- There are various benchmarks out there, which suggest that Impala is the fastest for various workloads. However, I wouldn’t trust any of these too much and would suggest for you to perform your own benchmarks for your specific workload.
- Spark SQL looks very promising for use cases where you want to use SQL to run machine learning algorithms (similar to in database analytics, e.g. in Oracle). As an alternative you could look at using HiveMall. It also looks promising for interactive SQL.
- Performance benchmarks suggest that none of the Hadoop SQL execution frameworks currently match the performance of an MPP RDBMS such as InfiniDB, Amazon Redshift, or Teradata. One Cloudera benchmark suggests otherwise. However, this benchmark is criticised for not implementing the full set of the TPC-DS benchmark and various other items and as a result is somewhat questionable. This does not come as a surprise really as decades of experience have gone into these relational engines.
source : Sonrio