Amazon Redshift Initial Thoughts


As reported on El Reg and elsewhere, Amazon announced the general availability of ‘Redshift’ a few days ago: http://aws.amazon.com/redshift/

What is Amazon Redshift then?

Redshift is a licensed version of the ParAccel Analytic Database (PADB) that Amazon has deployed on the Amazon Web Services (AWS) public cloud infrastructure.

The main Redshift features are as follows:

PostgreSQL derived relational database management system (RDBMS)
massively parallel processing (MPP) scale-out architecture
supports petabyte-scale datasets
columnar storage

Perhaps the most newsworthy point is Redshift’s claimed $1,000/TB/year cost with no upfront capital expenditure.
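To put the headline figure in context, here is a back-of-envelope sketch in Python. It assumes the claimed $1,000/TB/year rate applies linearly to whatever volume is stored; the function names are ours, and real pricing will depend on node type and commitment term, so treat the numbers as illustrative only.

```python
# Claimed headline rate from the announcement (an assumption, not a quote
# from a price list): 1,000 US dollars per terabyte per year.
CLAIMED_USD_PER_TB_YEAR = 1_000


def yearly_cost_usd(terabytes, usd_per_tb_year=CLAIMED_USD_PER_TB_YEAR):
    """Yearly cost if the claimed rate scales linearly with data volume."""
    return terabytes * usd_per_tb_year


def monthly_cost_usd(terabytes, usd_per_tb_year=CLAIMED_USD_PER_TB_YEAR):
    """Same figure expressed per month."""
    return yearly_cost_usd(terabytes, usd_per_tb_year) / 12


if __name__ == "__main__":
    for tb in (2, 50, 1000):
        print(tb, "TB:", yearly_cost_usd(tb), "USD/year,",
              round(monthly_cost_usd(tb), 2), "USD/month")
```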

So, Amazon’s Redshift is a public-cloud based, SQL-compliant, scalable MPP database that looks cheap-as-chips…what’s not to like?

Well, funny you should ask…after a quick read through the Redshift documentation the ‘gotchas’ we’ve spotted so far are as follows:

Data loading is only supported from Amazon sources such as S3. External data not already in AWS must first be uploaded to AWS before it can be loaded into Redshift. It looks like Amazon have decided not to licence the PADB high-speed loader, which might be why external data has to go through a two-step process: existing load strategies are used to get the data into AWS, and from there it is loaded locally into Redshift.
The distribution key is a single column only. This may turn out to be highly restrictive unless it can be mitigated by column storage and/or fine-grained control over sort sequences within tables. Teradata enforces no effective limit on the number of columns used to hash the data, whereas Netezza allows up to 4, which is a sensible upper limit if one is to be imposed.
An aggregation server is used for final aggregation processing. For large clusters and/or queries with large answer sets this will almost certainly be a performance bottleneck. MPP and ‘shared nothing’ are not the same thing when a single, shared server is used for final aggregation of intermediate results following local aggregation across the nodes.
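To show why the single-column distribution key in the list above might bite, here is a minimal Python sketch of hash distribution. Everything in it is an assumption made for illustration: the four slices, the synthetic table in which one customer owns 90% of the rows, and the use of CRC32 as a stand-in hash (we don't know what hash function Redshift actually uses).

```python
import zlib

NUM_SLICES = 4  # hypothetical number of data slices in the cluster


def slice_for(key_values, num_slices=NUM_SLICES):
    """Map a tuple of key column values to a slice via a stand-in hash."""
    encoded = "|".join(str(v) for v in key_values).encode("utf-8")
    return zlib.crc32(encoded) % num_slices


def rows_per_slice(rows, key_columns, num_slices=NUM_SLICES):
    """Count how many rows land on each slice for the given key columns."""
    counts = [0] * num_slices
    for row in rows:
        counts[slice_for([row[c] for c in key_columns], num_slices)] += 1
    return counts


def skew(counts):
    """Busiest slice relative to a perfectly even spread (1.0 = no skew)."""
    return max(counts) / (sum(counts) / len(counts))


# Synthetic table: customer 1 owns nine rows out of every ten.
rows = [{"customer_id": 1 if i % 10 else i, "order_id": i}
        for i in range(10_000)]

single = rows_per_slice(rows, ["customer_id"])
composite = rows_per_slice(rows, ["customer_id", "order_id"])

if __name__ == "__main__":
    print("single-column key:", single, "skew", round(skew(single), 2))
    print("two-column key:   ", composite, "skew", round(skew(composite), 2))
```

With a single-column key, every row for the dominant customer hashes to the same slice, so that slice does most of the work. A multi-column key, as Teradata and Netezza allow, spreads those rows out, at the cost of losing co-location on customer_id alone.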

So, as much as we welcome affordable, cloud-based, MPP databases based on PostgreSQL, we can’t help but feel that the above ‘gotchas’ will bite some folks where it hurts. The aggregation server especially is something we’ve seen as a real show-stopper out in the field over several years.
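The aggregation concern can be sketched as a two-phase GROUP BY in Python. This is a toy model under our own assumptions, not a measurement of Redshift: each node aggregates its own rows in parallel, then one final server merges every partial result. The function names and the round-robin data layout are hypothetical.

```python
from collections import defaultdict


def local_aggregate(rows):
    """Phase 1: runs on every node in parallel over that node's rows."""
    partial = defaultdict(int)
    for key, value in rows:
        partial[key] += value
    return dict(partial)


def final_aggregate(partials):
    """Phase 2: runs on the single aggregation server.

    Returns the merged totals plus the number of partial rows the
    server had to process, which is the serial part of the query.
    """
    total = defaultdict(int)
    merged_rows = 0
    for partial in partials:
        for key, value in partial.items():
            total[key] += value
            merged_rows += 1
    return dict(total), merged_rows


def run(num_nodes, num_groups, rows_per_node):
    """Simulate a query whose grouping key is spread across all nodes."""
    partials = []
    for node in range(num_nodes):
        rows = [((node + i) % num_groups, 1) for i in range(rows_per_node)]
        partials.append(local_aggregate(rows))
    return final_aggregate(partials)


if __name__ == "__main__":
    for nodes in (4, 8, 16):
        _, merged = run(nodes, num_groups=100, rows_per_node=1000)
        print(nodes, "nodes -> final server merges", merged, "partial rows")
```

In this model the work done per node stays constant as nodes are added, but the work done by the final server grows with nodes times groups. That serial step is what stops the cluster scaling linearly for queries with large answer sets.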

Shared-nothing MPP databases that claim linear scalability simply can’t be architected around an aggregation server…as DATAllegro and Exadata have discovered.
