It's All About Data Distribution.
As experts in Massively Parallel Processing (MPP), here at VLDB Solutions we talk regularly about ‘Primary Indexes’ (PI) or ‘Distribution Keys’ (DK). They are integral to the architecture of both Teradata and Greenplum respectively, and the correct identification and employment of them is ‘key’ to the maximised performance of both Massively Parallel Processing (MPP) systems. But how do they work?
Before we examine each in detail, it is first important to understand how data is stored and accessed on a MPP system, and how the distribution of data helps a system achieve true parallelism.
Within a MPP system, data is partitioned across multiple servers (referred to as AMPs in Teradata, and Segments in Greenplum). These servers ‘share nothing’ – they each process only their own share of the data required by a query, and do not process data located on other servers in the system. If data is not partitioned across all of the available servers, then those servers without data sit idle during the processing of a workload, and the full power of MPP has not been harnessed. The data is considered ‘skewed’, and therefore the query will be skewed too.
Primary Indexes and Distribution Keys are, as the name suggests, the key by which data is distributed across the servers. They are designated at a table level within the database, turning a column, or a selection of columns, into the key for each row of data.
If the Primary Index or Distribution Key for each row of data within the table is unique, then the rows are distributed across the servers in a ‘round robin’ manner. When the data is queried, each server has an equal share of the workload, and the system’s full power has been harnessed; the system has achieved true parallelism.
If the Primary Index or Distribution Key is not unique, rows with duplicate Primary Indexes or Distribution Keys are grouped on the same server. In a table where there are many matching key values, this can lead to skewed data and therefore skewed performance.
For example, if a MPP system had 10 available servers, but a table with a Primary Index / Distribution Key was created where that column only contained two different values (1/0, True / False, Yes / No, etc.) then that data would only be distributed to two servers, as the rows with matching value sets would be stored together. When that data is queried, only those servers with data can process the workload; the remaining eight servers remain idle.
Unique and Non-Unique Primary Indexes
As mentioned earlier, data distribution in a Teradata system is governed by a table’s Primary Index. Primary Indexes can be designated as either Unique or Non-Unique, depending on the column, or selection of columns, that has been chosen as the Primary Index. If a Primary Index is determined as a Unique Primary Index (UPI), then duplicate values are no longer allowed within the chosen column / selection of columns, and will not be loaded to the system. However, those rows that are loaded will be distributed evenly across the AMPs, and the system will achieve parallelism when that table data is queried. If a Primary Index is determined as Non-Unique Primary Index (NUPI), then duplicate values will be allowed within the column / selection of columns, but they will be grouped on the same AMP when the data is distributed.
Distribution Keys and Distributed Randomly
The Distribution Key is how data is distributed on a Greenplum system. Unlike Teradata, the key is not declared as unique or non-unique; it merely is, or is not. Again, as with Teradata, table data with a unique Distribution Key is distributed ‘round robin’ across the Segments; and duplicate Distribution Key values are grouped together on the same segment.
However, Greenplum tables can also be designated as ‘distributed randomly’. In this case, column data is not used for distribution, and each row is distributed to the Segments in the same ‘round robin’ manner as when using a unique Distribution Key.
How to Choose Primary Indexes and Distribution Keys
As should now be clear, correctly identifying which column, or selection of columns, to use as a Primary Index or Distribution Key is integral to the performance of a MPP system. In a Relational Database Management System (RDBMS), a table’s Primary Key is often a natural candidate to become the Primary Index / Distribution Key – a column of values uniquely identifying each row could easily be used as a UPI for best distribution.
However, there will be times when a table does not have a single column of unique values, but where ‘round robin’ distribution is still desired – ‘reference’ or ‘dimension’ tables, for instance. On a Greenplum system, this could be achieved by distributing randomly; but on a Teradata system, it would be necessary to identify a selection of columns, where the combination of data within those columns would be unique on a row-by-row basis.