MPP & Cloud - Nothing New Under the Sun
The 'shared nothing' or 'massively parallel processing' (MPP) computing architecture is hardly new. The likes of Teradata, Netezza and Pivotal Greenplum have been offering MPP technology going back as far as the 1980s (well, Teradata has; the others came along in the early 2000s).
Similarly, cloud computing is hardly a new idea. Amazon's AWS and Microsoft's Azure cloud platforms should need no introduction.
OK, so what's the point? Combining 'shared nothing' MPP database systems and cloud seems like an obvious next move and, on the face of it, looks simple enough. But how 'doable' is MPP in the public cloud?
Established MPP Vendors & Cloud
As intimated earlier, the main data warehouse MPP vendors are Teradata, Netezza and Pivotal Greenplum.
So, do the established MPP vendors offer cloud as a deployment option? Before we move on, let's be clear that we mean public cloud when we say cloud.
Based on educated guess-work, there is no Teradata public cloud offering for several reasons. The main barrier is likely to be technical. Teradata's proprietary inter-node 'bynet' interconnect is simply that - proprietary. It is also covered by many patents and adds significant value to the Teradata stack. It would be 'non-trivial' to replicate the bynet interconnect in a public cloud environment.
Technical barriers aside, Teradata rightly pride themselves on their industry-leading combination of performance, resilience and workload management. All bets are off in a public cloud environment where these Teradata strengths are concerned.
Like Teradata, the Netezza stack also contains proprietary hardware in the form of the FPGA. Let's not dwell on the migration to IBM blades in this blog post. Suffice to conclude that, like Teradata, the Netezza architecture does not easily transfer to a public cloud environment, if at all. Which is probably why Netezza isn't offered for public cloud deployment.
Unlike Teradata and Netezza, Pivotal's Greenplum DBMS software is available as precisely that - software only. As a consequence, there are no proprietary hardware barriers to deploying Pivotal's Greenplum 'shared nothing' MPP database software in a public cloud environment.
As sceptical as ever, here at VLDB we like to prove these things to ourselves. So, how about a 1.5 billion row scan on a 4-node development Greenplum MPP cluster in the cloud:
$ psql -c 'select count(*) from t1;'
count
------------
1,498,073,726
(1 row)
Yes, we could have just made that up, but why would we bother? It's a real customer table. You'll just have to trust us!
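For readers new to 'shared nothing', the reason a count like this parallelises so well can be sketched in a few lines of Python. This is only an illustration, not Greenplum's implementation: the names `distribute` and `mpp_count` are made up for this post, `crc32` merely stands in for the database's own hash function, and we assume one segment per node for simplicity.

```python
from zlib import crc32


def distribute(rows, n_segments):
    """Hash-distribute (key, payload) rows across segments by their key.

    crc32 is only a stand-in for whatever hash the real database uses.
    """
    segments = [[] for _ in range(n_segments)]
    for key, payload in rows:
        seg = crc32(str(key).encode()) % n_segments
        segments[seg].append((key, payload))
    return segments


def mpp_count(segments):
    """Each segment counts only its own rows; the master sums the partials.

    On a real cluster the per-segment counts run in parallel on separate
    nodes. Here they run one after another, purely for illustration.
    """
    partial_counts = [len(seg) for seg in segments]
    return sum(partial_counts), partial_counts


# A toy 100,000 row table spread over a hypothetical 4-segment cluster.
rows = [(i, f"row-{i}") for i in range(100_000)]
segments = distribute(rows, n_segments=4)
total, partials = mpp_count(segments)
print(total, partials)
```

No rows move between segments for a count; only four small numbers travel to the master. Joins and aggregations on a column other than the distribution key are a different story, which is where the interconnect comes in.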
Does Cloud Change Everything?
In yet another excellent blog post published yesterday, Rob Klopp suggests that:
"Cloud changes everything and it will significantly change database systems architecture."
While we agree that 'cloud changes everything', we don't agree that cloud computing will necessarily have a significant impact on database systems architecture, as we demonstrated above. Well, not all database systems anyway.
That said, when attempting to combine MPP databases and cloud computing, database systems architecture is likely to have to change if:
a) the database system is a poor fit for public cloud deployment
b) the cloud platform is a poor choice to support 'shared nothing' MPP database systems or
c) where both a) and b) are true - the worst of both worlds.
MPP Database Software on AWS & Azure
Assuming you have cloud-friendly MPP database software, surely you can roll your own MPP database system on a widely-adopted public cloud platform such as AWS or Azure? Well, in theory, 'yes'. In reality, it's never that simple.
Team VLDB kicked a lot of tyres when it came to finding a suitable public cloud platform for MPP database system deployment. A taster of some of the MPP & public cloud issues that need to be overcome:
CPU - can enough CPUs be deployed across the cluster to harness the power of MPP?
RAM - can enough RAM per CPU be deployed (databases like a lot of RAM)?
Interconnect - can the SMP nodes be linked into an MPP cluster via a private network? Can redundant private networks be configured for resilience/performance? Is there sufficient inter-node bandwidth to support high-speed data re-distribution within the MPP cluster?
Disk IO - is there sufficient read & write IO bandwidth? Is the read & write IO bandwidth sufficiently predictable?
In our opinion, interconnect bandwidth and disk IO bandwidth are the two main blockers to successful MPP deployment via the public cloud.
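To put rough numbers on those two blockers, here is a back-of-the-envelope sketch in Python. Every figure in it is hypothetical (table size, node count, link speeds and throughput samples are invented for illustration, not measurements from any cloud platform), and the function names are our own. It assumes the table is evenly spread across the nodes and that each node's link is the only bottleneck.

```python
from statistics import mean, pstdev


def redistribution_seconds(total_gb, n_nodes, node_bandwidth_gbit):
    """Rough time to hash-redistribute an evenly spread table.

    Each node holds total/n gigabytes, keeps roughly 1/n of it and sends
    the remaining (n-1)/n to the other nodes over its own link.
    """
    sent_per_node_gb = (total_gb / n_nodes) * (n_nodes - 1) / n_nodes
    node_bandwidth_gbyte = node_bandwidth_gbit / 8  # gigabits -> gigabytes
    return sent_per_node_gb / node_bandwidth_gbyte


def io_variability(samples_mb_s):
    """Coefficient of variation of throughput samples.

    0.0 means perfectly steady IO; larger values mean less predictable IO.
    """
    return pstdev(samples_mb_s) / mean(samples_mb_s)


# Hypothetical 400 GB table on a 4-node cluster.
slow = redistribution_seconds(400, 4, node_bandwidth_gbit=1)
fast = redistribution_seconds(400, 4, node_bandwidth_gbit=10)
print(slow, fast)  # 600.0 60.0

# Invented disk read samples in MB/s: a dedicated disk vs a shared one.
steady = [200, 200, 200, 200]
noisy = [50, 350, 100, 300]
print(io_variability(steady), io_variability(noisy))
```

Both sample sets average 200 MB/s, yet a query plan that assumes 200 MB/s will behave very differently on the second one. In an MPP system the slowest node sets the pace for the whole query, which is why predictability matters as much as raw bandwidth.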
More general public cloud issues concern things like physical CPU sharing with other cloud users (the 'noisy neighbour' problem), non-persistent storage, shell access, root privileges, proprietary setup/configuration/monitoring APIs, etc.
It will come as little surprise to learn that team VLDB concluded that MPP was not a good fit for a 'general purpose' public cloud offering from the likes of Amazon (AWS) or Microsoft (Azure). Several other 'big name' cloud platforms failed to even support clustering SMP nodes into an MPP architecture. Imagine that!
So, how do we run billion+ row MPP queries in the cloud?
Watch this space ;-)