Lies, Damned Lies and Teradata Statistics

Teradata Statistics

Teradata statistics collection is a topic that many will feel has been done to death. Although that's true, there still seems to be a lot of misunderstanding about the benefit, or otherwise, that Teradata statistics deliver, especially when set against the cost of collecting and maintaining them.

First of all, let's consider why Teradata statistics are important in a world in which data is physically distributed in some way, as is the case with Massively Parallel Processing (MPP) systems.

In a classic general-purpose Symmetric Multi-Processor (SMP) DBMS scenario (think Oracle or SQL Server), statistics may help the query optimiser to understand that table A is a large fact table and that table B is a small dimension table. This kind of information is obviously useful when it comes to generating a sensible query plan.

In the MPP world things are a bit more complicated. The optimiser benefits not just from understanding the relative table sizes, but also from understanding the distribution of each table across the units of parallelism. Knowing that one or more of the tables is not evenly distributed, i.e. 'skewed', can have a significant impact on the query plan. Choices such as table re-distribution or duplication that are available to MPP optimisers simply don't exist in the classic SMP world.
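
As an illustration, a quick way to see how a table's rows are spread across the AMPs (Teradata's units of parallelism) is to count rows per AMP using the standard hashing functions. This is only a sketch; the database, table and column names (sales_db.big_fact, hashed on its primary index column order_id) are made up for the example:

    /* Rows per AMP for a hypothetical table with primary index order_id */
    SELECT HASHAMP(HASHBUCKET(HASHROW(order_id))) AS amp_no,
           COUNT(*)                               AS row_cnt
    FROM   sales_db.big_fact
    GROUP  BY 1
    ORDER  BY 2 DESC;
    /* A large gap between the biggest and smallest counts suggests skew */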

Big MPP systems such as Teradata, Netezza and Greenplum are deployed for various reasons: processing horsepower, scalability, capacity and so on. These capabilities are not an end in themselves. The end is the ability to run big, gnarly queries against big, gnarly datasets.

Given that this requirement is delivered via set-based SQL, query optimisation is of prime importance. Through the use of SQL, we tell the system what we want (the question) but not how to do it (the execution plan). Turning the question into a plan is the query optimiser's job. And it's not always easy. A good plan may run in seconds/minutes, whereas a bad plan that aims to answer the same question may run for hours, or even days in extreme cases.
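
To see the plan the optimiser has chosen, Teradata lets us prefix any query with EXPLAIN; the plan text comes back instead of the result set. A minimal sketch, using hypothetical table names:

    EXPLAIN
    SELECT d.store_name,
           SUM(f.sales_amt) AS total_sales
    FROM   sales_db.big_fact f
    JOIN   sales_db.store_dim d
      ON   f.store_id = d.store_id
    GROUP  BY d.store_name;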

So, as a general guide, MPP databases tend to benefit from statistics more than might be apparent, and in some cases a lack of up-to-date Teradata statistics can lead to very bad plans. That's not in dispute.

OK, so we should just collect Teradata statistics on all the things we're told to collect Teradata statistics on then? You would think so, but I'd like to counter this with two thoughts.

Firstly, collecting Teradata statistics is a CPU-intensive operation, and CPU is an expensive and finite resource. On Teradata systems specifically, collecting statistics can be a very large consumer of CPU.
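
If DBQL query logging is enabled, that CPU cost can be measured rather than guessed at. The sketch below sums AMP CPU seconds by statement type from DBC.QryLogV; the column names and the exact 'Collect Statistics' statement-type value should be checked against your own release, so treat this as an assumption to verify rather than a definitive query:

    SELECT StatementType,
           SUM(AMPCPUTime) AS amp_cpu_secs
    FROM   DBC.QryLogV
    WHERE  StartTime >= CURRENT_TIMESTAMP - INTERVAL '30' DAY
    GROUP  BY 1
    ORDER  BY 2 DESC;
    /* Statistics collection often sits surprisingly high in this list */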

Secondly, no amount of Teradata statistics can improve on an already good plan. That's the key point that's lost on a lot of folks. The Teradata optimiser, at least, generates a good plan in a high proportion of cases in which no statistics are present. This is something we've taken for granted for decades, often without realising it.

Random sampling and evenly distributed data are often all that's needed for the optimiser to do its job and generate an efficient plan. It matters not a jot that the timings and/or row counts are less accurate, or that there is 'low confidence' relating to some of the steps - a good plan is still a good plan.

Teradata statistics should be seen as 'expensive medicine'. The expense is the CPU resource required to collect and maintain the statistics. The medicine analogy holds because there needs to be an illness (a problem) to treat: only a bad plan is an illness worth the cost of the expensive medicine (CPU cycles).
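
For completeness, the 'medicine' itself is just a simple statement, and the dose can be moderated: a full scan of the column, or a cheaper sampled collection. The names here are hypothetical and the exact syntax variants differ slightly between Teradata releases, so this is a sketch rather than a prescription:

    /* Full collection: reads every qualifying row, hence the CPU bill */
    COLLECT STATISTICS ON sales_db.big_fact COLUMN (store_id);

    /* Sampled collection: a fraction of the rows, a fraction of the cost */
    COLLECT STATISTICS USING SAMPLE ON sales_db.big_fact COLUMN (txn_date);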

And therein lies part of the problem. The explain plan needs to be analysed, understood and categorised as 'good' or 'bad' before a sensible decision can be made as to whether more Teradata statistics would help.

It's all too easy to look at the timings and/or row counts and decide that a plan is 'bad'. It's often the wrong conclusion. The numbers in explain plans should be taken with a pinch of salt, to say the least, no matter what the technology vendors would have us believe. The only real truth in the plan is the sequence of events and the join/aggregate/sort/scan techniques used to implement each operation. The numbers are an informed guess. I've been largely ignoring them for 20 years.

As an aside, predicted elapsed times expressed in terms of wall clock time are always subject to the same issue: how can we know the future system state when we run the query in anger? We can't. Simple really.

So, how do we decide if a plan is good or bad? It boils down to two simple choices – learn how to manually analyse explain plan text, or buy a tool to do the job.

For my sins, I've been analysing Teradata explain plans with a pen and paper since the late 1980s and they haven't changed much. They're pretty easy to follow and written in plain English:

“1) First, we…

2) Next, we…”.

The key is knowing what to look for, which is outside the scope of this article.

For those who would rather use a tool to do the job, there are several choices. First of all, there is Teradata's own 'Visual Explain' (VE), which does what it says on the tin: it presents a graphical representation of the explain plan text.

A far more useful tool is ‘Prism’ from Ward Analytics. Prism will identify issues with explain plans and offer suggested fixes. Back of the net!

In summary:

•    Teradata statistics are ‘expensive medicine’ and should be viewed and used accordingly

•    a good plan is a good plan with or without Teradata statistics

•    the sequence of events, parallelisation of tasks and physical operations used make a plan good or bad, not the predicted numbers

•    the numbers in explain plans – predicted row counts and elapsed/CPU times – should not be regarded as accurate

•    learn how to interpret explain plans (or buy a tool to make life easier) to decide if they're OK

Despite all of this rambling, always remember… on MPP systems it is possible to get a bad plan, with awful performance, for which the only remedy is often, you guessed it – appropriate and up-to-date statistics. In such cases the 'expensive medicine' is the correct cure.