MPP & Redshift Musings

Teradata AWS Access

What On Earth is MPP?

In computing, massively parallel refers to the use of a large number of processors (or separate computers) to perform a set of coordinated computations in parallel (simultaneously). Source: Wikipedia.

Teradata & MPP Beginnings

Once upon a time, in a land far, far away…well, OK, California in the late 1970’s/early 1980’s to be precise…the MPP database world started to stir in earnest.

Following on from research at Caltech and discussions with Citibank’s technology group, Teradata was incorporated in a garage in Brentwood, CA in 1979. Teradata’s eponymous flagship product was, and still is, a massively parallel processing (MPP) database system.

Teradata had to write their own parallel DBMS & operating system. They also used weedy 32-bit x86 chips to compete with IBM's 'Big Iron' mainframes for database processing. Quite an achievement, to say the least.

The first Teradata beta system was shipped to Wells Fargo in 1983, with an IPO following in 1987.

It is interesting to note, for those of us in the UK, that there were Teradata systems up and running over here as early as 1986/87, partly down to the efforts of folks such as our very own ‘Jim the phone’.

Early Teradata adopters in the UK included BT (who later moved off Teradata when NCR/Teradata was acquired by AT&T), Royal Insurance, Great Universal Stores (GUS) and Littlewoods. GUS and Littlewoods combined to become ShopDirect, who still run Teradata 30 years later.

Yours truly ran his first Teradata query at Royal Insurance in Liverpool from ITEQ on an IBM 3090 back in ’89. It killed the split screen in MVS, but ho-hum, better than writing COBOL against IMS DB/DC to answer basic questions. We were taught SQL by none other than Brian Marshall, who went on to write the first Teradata SQL & performance books. I still have the original dog-eared Teradata reference cards, or ‘cheat sheets’ as we called them:

Teradata ITEQ/BTEQ & Utilities Reference Cards.

From the 1980’s through to the early 2000’s Teradata had a pretty clear run at the high-end analytic DBMS market. There was no serious competition, no matter how hard the others tried. All those big name banks, telecoms companies and retailers couldn’t be wrong, surely?

MPP Upstarts - Netezza, Datallegro & Greenplum

Teradata’s first real competition in the commercial MPP database space came in the form of Netezza in the early 2000’s.

Like Teradata, Netezza consisted of dedicated hardware & software all designed to work in harmony as an engineered MPP database ‘appliance’. Unlike Teradata, Netezza was able to take advantage of open source DBMS software in the form of PostgreSQL, and open source OS software in the form of Linux.

We discovered Netezza by accident in 2002/03 after landing on a PDF on their web site following a Google search. “Netezza is Teradata V1” was our initial response. Apart from the FPGAs, we were pretty close.

A few phone calls, a trip to Boston and a training session later, and we’re up and running as Netezza partners.

Following a successful Netezza project at the fixed line telecoms division of John Caudwell’s Phones4U empire, yours truly was a guest speaker at the inaugural Netezza user conference in 2005.

Following an IPO in 2007, Netezza was bought by IBM in 2010 where it remains to this day, somewhere in IBM’s kitbag. Poor old Netezza.

Also in the early 2000’s, a bunch of mainly Brits at Datallegro were trying to build an MPP appliance out of Ingres. They were bought by Microsoft in 2008. Why Microsoft needed to buy a startup running Ingres on Linux is anyone’s guess.

The last of the new MPP players we had dealings with all those years ago are none other than Greenplum.

Through the Netezza partner ecosystem we actually knew the Greenplum team when they were still called ‘Metapa’, probably around 2002/03. Yours truly was at a dinner with Luke & Scott in Seattle as they wooed none other than Teradata luminary Charles ‘Chuck’ McDevitt (RIP) to head up the Greenplum architecture team. With Chuck on board they were virtually guaranteed to succeed.

Greenplum was always of particular interest to us because, unlike Teradata & Netezza, it was and still is a software-only offering. You get to choose your own platform, which can be your favourite servers/storage or even that newfangled ‘cloud’ thing.

Not only have we got Greenplum to work happily on every single platform we ever tried, Greenplum is also the only open source MPP database.

Other Parallel PostgreSQL Players

In addition to early parallel PostgreSQL players such as Netezza and Greenplum (Datallegro used Ingres, not PostgreSQL), a whole raft of ‘me too’ players cropped up in 2005:

  • Vertica - acquired by HP in 2011

  • Aster - acquired by Teradata in 2011

  • Paraccel - acquired by Actian in 2013

It is interesting to note that, unlike Netezza, not one of the new parallel PostgreSQL players put even the smallest dent in Teradata’s core MPP data warehouse market. At least Netezza shook Teradata from their slumber which, in turn, gave the Teradata appliance to the world.

That said, perhaps Paraccel will have the biggest impact on the MPP database market which, for so long, has been dominated by Teradata, the main player in the space for 30 years. How so? Read on…

Redshift

Amazon launched the Redshift ‘Data Warehouse Solution’ on AWS in early 2013. Yours truly even wrote some initial thoughts at the time, as you do.

As has been well documented, Redshift is the AWS implementation of Paraccel, which was one of the crop of ‘me too’ parallel PostgreSQL players that appeared in 2005.

Redshift adoption has been rapid, and it is reported to be the fastest growing service in AWS.

So, despite Teradata having been around for 30 years, the awareness of MPP and adoption of MPP as a scalable database architecture has been ignited by Redshift, which started out as Paraccel, which is built out of PostgreSQL.

Redshift Observations

Our previous post on Redshift in 2013 made mention of the single column only distribution key, and the dependence on the leader node for final aggregation processing.

There is a workaround for the single column only distribution key restriction, for sure. MPP folks will no doubt be mystified as to why such a restriction exists. However, it neatly removes the possibility of a 14 column distribution key (primary index) that we encountered at a Teradata gig a few years ago (no names). Cloud, meet silver lining.
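
The workaround is not spelled out above, so the following is only a guess at its general shape: collapse the columns you would have liked in a multi-column key into one derived column, and distribute on that. This is a minimal Python sketch rather than Redshift code. The names `composite_key` and `slice_for`, the MD5 hash and the store/day data are all made up for illustration, and the modulo mapping merely stands in for whatever hash the database uses internally.

```python
import hashlib
from collections import Counter

SEPARATOR = "\x1f"  # unit separator, assumed not to appear in the data


def composite_key(*columns) -> str:
    """Collapse several column values into one derived key value."""
    # The separator stops ("ab", "c") and ("a", "bc") producing the same key.
    joined = SEPARATOR.join(str(c) for c in columns)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()


def slice_for(key: str, n_slices: int) -> int:
    """Stand-in for the database's own hash-to-slice mapping (an assumption)."""
    return int(key, 16) % n_slices


N_SLICES = 8
rows = [(store, day) for store in ("LIV", "MAN", "LDS") for day in range(365)]

# Distribute on one low-cardinality column vs. on the derived composite column.
by_store = Counter(slice_for(composite_key(store), N_SLICES) for store, _ in rows)
by_store_day = Counter(
    slice_for(composite_key(store, day), N_SLICES) for store, day in rows
)

print("slices used, store only: ", len(by_store))
print("slices used, store + day:", len(by_store_day))
```

The catch with any derived key of this kind is that both sides of a join need the same derived column, built the same way, if the join is to stay local to each slice.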

The dependence on a leader node for final aggregation processing is much more of an issue. Aggregation queries that return a high number of groups will simply choke on the final aggregation step. Just ask Exadata users. There’s also not much you can do to remediate this issue if you can’t over-provision CPU/RAM at the leader node, which is the case with Redshift.

More recent Redshift observations from our own experiences, and from trusted contacts (you know who you are) include the following:

  • lack of node OS access – security auditing, hardening & OS customisation not possible.

  • non-persistent storage – if you stop or lose the EC2 cluster you lose the database data & will need to re-load from S3 or a database snapshot.

  • poor concurrency – default is 5 concurrent operations. Although this can be increased, performance drops off quickly as concurrency increases.

  • poor workload management – no short query ‘fast path’. Redshift is essentially a 'free for all'.

  • 1MB block size & columnar only storage – very large table space overhead of 1 MB per column for every table x number of segments.

  • limited tuning options - no partitions or indexes. If the sort keys don't help it's a full table scan every time.

  • automatic database software updates – this might appeal to the 'zero admin is good' crowd, but enterprise customers will baulk at the notion of a zero-choice upgrade that could break existing applications.
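
To put a rough number on the 1 MB block size point in the list above, here is a back-of-envelope sketch in Python. It assumes nothing beyond the rule of thumb as stated: at least one 1 MB block per column, per segment, for every populated table. The function name and the example figures (20 columns, 32 segments, 500 tables) are invented for illustration, and any hidden system columns the database may add are ignored.

```python
BLOCK_MB = 1  # block size quoted in the list above


def min_table_footprint_mb(columns: int, segments: int, block_mb: int = BLOCK_MB) -> int:
    """Smallest space a populated table can occupy: one block per column per segment."""
    return columns * segments * block_mb


# Hypothetical example: a 20 column lookup table on a cluster with 32 segments.
per_table_mb = min_table_footprint_mb(columns=20, segments=32)
print(per_table_mb, "MB for one small table")

# The same overhead repeated across 500 small tables.
print(500 * per_table_mb / 1024, "GB across 500 small tables")
```

The arithmetic only describes a floor. Large tables soon swamp it, so the overhead bites hardest on schemas with lots of small tables.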

Lack of OS access and non-persistent storage will no doubt be show-stoppers for some enterprise folks.

The SQL Server crowd will probably not care and just marvel at the performance offered by SQL queries that run against all CPU cores in the cluster. Unless, of course, they didn’t read the data distribution notes.
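
Why the data distribution notes matter can be shown with a toy simulation. This is a sketch under stated assumptions, not a model of Redshift itself: rows are assigned to slices by hashing the distribution key, and a query only finishes when the busiest slice finishes. The hash, the 16 slices and the country/order data are all made up for illustration.

```python
import hashlib
from collections import Counter


def slice_for(value, n_slices: int) -> int:
    """Illustrative hash-to-slice mapping, not the database's actual function."""
    digest = hashlib.md5(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_slices


def skew_factor(key_values, n_slices: int) -> float:
    """Rows on the busiest slice divided by the ideal even share (1.0 is perfect)."""
    counts = Counter(slice_for(v, n_slices) for v in key_values)
    ideal = len(key_values) / n_slices
    return max(counts.values()) / ideal


N_SLICES = 16

# A unique key spreads rows evenly over the slices.
order_ids = list(range(100_000))

# A low-cardinality key dominated by one value piles rows onto one slice.
countries = ["UK"] * 90_000 + ["FR"] * 5_000 + ["DE"] * 5_000

print("skew on order_id:", round(skew_factor(order_ids, N_SLICES), 2))
print("skew on country: ", round(skew_factor(countries, N_SLICES), 2))
```

A skew factor near 1.0 means every slice does a similar share of the work. With the skewed key, one slice holds at least 90% of the rows and the other cores mostly sit idle, which is the opposite of what the marvelling was about.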

Meet The New Boss, Same as The Old Boss

No matter that Redshift could be improved, what we can applaud is that Amazon has opened many folks’ eyes to the benefits of a scalable ‘old-skool’ relational database management system (RDBMS), and one with an MPP architecture to boot. For this we can only be thankful.

The rate of Redshift uptake speaks to the usefulness of a scalable RDBMS. The architecture is being socialised by Amazon (high volume/low margin) in a way that it never was by Teradata (low volume/high margin).

All of the MPP database vendors, Amazon included, owe a debt of gratitude to Teradata, who proved the architecture 30 years ago. That Teradata got so much so right so long ago never ceases to amaze yours truly.

As for Redshift, the old adage remains true: there really is ‘nowt new under the sun’.