MapReduce, MPP and Big Data Discussion

There's an interesting article over at ZDNet discussing the relationship between the 'old skool' MPP approach to Big Data and the 'new-fangled' MapReduce approach.

We felt compelled to comment, naturally:

"The central tenet that MapReduce and MPP have a lot in common is correct. From the user's perspective, the main difference is that MapReduce supports procedural languages (Java etc) and MPP systems are typically SQL-only databases. Both run by default on clusters of SMP nodes in a 'shared nothing' architecture.

Teradata has been owned by NCR and by AT&T, which bought and then sold NCR, so it has not been independent for most of its almost 30 years. The MPP usual suspects are indeed Teradata, IBM Netezza and EMC Greenplum, with Microsoft's PDW yet to really make much of an appearance in the field.

MPP systems do not have to be 'used on expensive, specialized hardware'.

Teradata uses Dell SMP servers and LSI storage, although the BYNET interconnect is proprietary. The software version of Greenplum offers a choice of SMP node, storage, OS and filesystem, and can be scaled to as many nodes as you choose, all on COTS hardware. Netezza now uses IBM blades after migrating away from home-grown hardware, although the blade that hosts the FPGA is proprietary.

Not all MPP products are supplied in fixed appliance form. Teradata's non-appliance 'enterprise' offerings can be grown incrementally. This has been the case for decades. Only the relatively new 'appliance' offerings - a response to the competition Netezza brought to the MPP market - are non-expandable, and that's a product positioning choice not a technology limitation.

The software-only version of Greenplum is similarly unhindered. You can scale Greenplum on as many commodity nodes as you wish. The DBMS licence is charged per TB of input data, not per node, so you are bounded only by how much data you wish to process, not by the cost of the tin or the DBMS. Seems very reasonable.

SQL is not just 'easier and more productive', it's also understood by millions of users and developers worldwide. It is a standard after all. SQL is also generated by every query tool out there in the enterprise. For this reason alone it will never be displaced.

SQL and MPP are very much being implemented on commodity hardware, and MapReduce is being implemented in data warehouse environments. Greenplum already has MapReduce from MapR built into its offering, and Teradata has them combined in its Aster stack. This evolution will continue at pace.

The next stage in this MPP + MapReduce evolution is a scalable cloud offering that can be spun up on demand on an arbitrary number of nodes with usable and stable inter-node and intra-node IO bandwidth."
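
To make the comment's first point a little more concrete (MapReduce is driven from procedural code, MPP databases from SQL), here is a minimal single-machine sketch of the same aggregation written both ways. It is only an illustration: the `SALES` rows, the function names and the use of Python's built-in `sqlite3` as a stand-in for an MPP database are all our own assumptions, and nothing here is distributed across nodes as it would be in Hadoop, Teradata or Greenplum.

```python
import sqlite3
from collections import defaultdict

# Hypothetical sample data: (region, sale amount) rows.
SALES = [("north", 10), ("south", 5), ("north", 7), ("east", 3), ("south", 1)]


def map_phase(rows):
    """Map: emit one (key, value) pair per input row."""
    for region, amount in rows:
        yield region, amount


def shuffle(pairs):
    """Shuffle: group values by key, as a framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: collapse each key's list of values to a single total."""
    return {key: sum(values) for key, values in groups.items()}


def totals_mapreduce(rows):
    """Procedural style: the developer writes the map and reduce steps."""
    return reduce_phase(shuffle(map_phase(rows)))


def totals_sql(rows):
    """Declarative style: the same aggregation stated as one SQL query.

    sqlite3 is only a local stand-in for an MPP database here.
    """
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
        conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
        cur = conn.execute(
            "SELECT region, SUM(amount) FROM sales GROUP BY region"
        )
        return dict(cur.fetchall())
    finally:
        conn.close()


if __name__ == "__main__":
    print(totals_mapreduce(SALES))
    print(totals_sql(SALES))
```

Both functions return the same per-region totals. The difference is who does the work of expressing the grouping: in the first case the developer codes each phase by hand, in the second the database plans it from the `GROUP BY`.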