The last 12 months or so has seen a lot of ‘Big Data’ POCs being initiated in our world. The most frequent objective from both the customer side and the Hadoop vendors that we talk to seems to be ‘data warehouse application migration' or 'offloading’. So what do we mean by ‘offloading’, why is it such a common objective and can it be achieved using Hadoop?
Data warehouse application migration (offloading) means moving an existing workload elsewhere. ‘Re-platforming’ or ‘re-homing’ is another way to view it. Good old-fashioned application migration is ultimately what it boils down to.
Data warehouse application migration can consist of the ETL workload being moved elsewhere, or user queries/applications being satisfied outside of the enterprise data warehouse (EDW). Maybe even both.
Re-homing ETL workloads implies that the EDW is/was being used as a data transformation engine in an ‘ELT(L)” style ETL architecture. The re-homing of ETL workloads onto a Hadoop cluster seems to be the main use case we’ve encountered thus far in cases where Hadoop is used in conjunction with an existing EDW. No big surprises there.
Offloading business applications and/or ad hoc queries is of far more interest, and more likely to be the main Big Data POC objective, based on our recent experiences. So what would an alternative to the EDW - by which most folks currently mean Hadoop, and more specifically Hive - have to deliver in order for data warehouse application migration to succeed? Glad you asked…
There are several key EDW capabilities that are probably taken for granted:
• cost-based optimiser (CBO)
• workload management (WM)
In a nutshell, the EDW allows us to run SQL queries of any complexity. The SQL is usually, but not always, turned into an efficient plan by the cost-based optimiser (CBO). The competition for CPU and IO resources is managed by the workload management (WM) sub-system to ensure optimal system throughput. So far so good.
For successful data warehouse application migration to Hadoop, we are going to need good SQL coverage, a robust cost-based query optimiser (CBO) and a workload management (WM) capability to keep it all ticking along. Three core capabilities - that's all, we're not greedy!
Taking SQL first, for most folks doing data warehouse type queries this is delivered in Hadoop via Hive. Support for basic operations such as UPDATE, INSERT and DELETE is delivered in Hive from version 0.13 (Hive 13) onwards - ‘adding ACID to Hive’. Given how recently basic SQL operations such as these have been added to Hive, those that expect a seamless migration of SQL applications from their existing EDW are likely to be somewhat disappointed.
The Hive cost based optimiser (CBO) can be described as rudimentary at best at present. It is being addressed by the Optiq project. It’s a long journey from where we are now to a cost-based optimiser that can be relied on to almost always come up with an efficient plan. As Teradata’s CTO Stephen Brobst put it: “With Hadoop, you are the optimiser. Good luck with that”. We couldn’t agree more.
Of the 3 main EDW capabilities that we take for granted, workload management (WM) is usually the one that gets developed last. Without workload management the system throughput will frustrate and disappoint most folks. There’s nothing worse than a short job sitting waiting for hours behind a long running job that happened to be submitted first. Initiatives such as Yarn and Tez are designed to speed things up in general within the Hadoop cluster, but that’s not the same as managing the conflicting demands from different users/departments/applications.
So, what chance have we currently got of moving non-ETL applications off the production EDW platform and onto Hadoop, and specifically Hive? That’s the $64k question.
At one end of the spectrum, carefully selected applications could be migrated so long as some SQL development is performed to make the applications Hive-friendly. So far so good. Just don't expect existing SQL code to run 'as is'.
At the other end of the spectrum, could the entire EDW workload be moved to Hive? As it stands today, not a chance. No sir. That would be a ‘career limiting move’. Sorry to disappoint the CFO and all that.
To achieve data warehouse application migration most folks will need another platform that supports those features they probably take for granted at present: SQL, CBO and WM (see above).
SQL, CBO and WM are currently best delivered by a DBMS designed specifically for data warehousing, More specifically we'd recommend a massively parallel processing (MPP) DBMS to ensure linear scalability. No doubt a lot of folks exploring EDW application migration as part of a Big Data POC are already MPP users. No surprises there.
Moving to a different MPP platform to enable relatively pain-free data warehouse application migration will either entail:
moving to a different platform from the existing EDW supplier
changing EDW suppliers
Moving from one EDW supplier to another is not always as painful as most folks assume. For an MPP-to-MPP move, thankfully it's ‘just SQL’.
Finally, what motivation is there to want to carry out data warehouse application migration ? As ever, cost reduction of course!