Teradata Tips - The Primary Index (PI)
Why are we producing Teradata tips? Surely this has been done to death?
Well, a few of us Teradata types at VLDB were recently bemoaning the fact that we see the same poor behaviours year after year, decade after decade, site after site. This is partly an education issue, and partly a policing issue – we can’t be expected to interject every time somebody does something we don’t approve of can we?
Why does this stuff matter? Well, poor Teradata behaviour is a performance killer, plain and simple.
So, in true ‘Idiots Guides’ style, we decided to try and come up with some ‘Teradata Tips’ for folks to follow.
So, without further ado, here's #1 in the series:
Hopefully most Teradata users are aware that the primary index (PI) is used to distribute the data across the processors (AMPs) in the system. It has always been thus. Sensible data distribution in any massively parallel processing (MPP) system like Teradata is critical...or even more important!
Rather than duplicate what has been written elsewhere about the Teradata primary index, we thought it more useful to share our own easy-to-follow thoughts on the Teradata PI.
1 - ALWAYS explicitly code a PI when a table is created, even though you don't have to - this way you will get what you intended.
2 - DO NOT let the system choose a PI - it may not be what you intended (see above).
3 - ALWAYS code the PI column(s) first in the column list when a table is created - this means the PI might get used as a PI when the table is used as input into a 'create table as' (CTAS) operation which might save the day. Also, if your table is used as a template by somebody that violates #1 above, this might save their day, and then they'll owe you a favour.
4 - ALWAYS code NOT NULL for all PI columns - the optimiser has less work to do when there is no need to allow for nulls in a PI column, plus you'd never intentionally choose a nullable column for a PI, right? Of course not!
5 - DO NOT change the data type of PI columns when copying table definitions (DDL) from the production schema. You may decide you know better then the DBA or data modeller, it has been known. However, if you change that pesky INTEGER column to a CHAR and it's part of a PI, without knowing it you've probably created a table with an entirely different distribution. That's right, data distribution is affected by data type - be warned.
6 - DO make the PI unique (UPI) rather than non-unique (NUPI) if possible - this acts as a safety check on the PI values being loaded and can catch unintentional duplicate PI values...with no coding to do, which is nice.
7 - as a guide, DO NOT use more than 4 columns in a PI unless there is a very good reason to do so...which is rarely the case. We've seen as many as 15 columns used in a PI, but let's save that story for another day.
8 - DO NOT create tables without a PI (NOPI tables) to avoid thinking about a sensible PI - NOPI tables are probably not a good thing for end users in most cases.So there you go - 8 Teradata primary index (PI) 'tipettes' for the price of 1. This should send a clear signal that getting the Teradata PI right is *very* important.
Other MPP systems like Netezza or Greenplum, or even Hadoop, may call the PI something like a distribution key or hash key, but the same principle remains: data distribution is *very* important for parallel processing systems.
For those readers that are interested, there's more on the Teradata PI here with an article on Teradata indexes: http://blog.vldbsolutions.com/teradata-indexes/