Teradata Spool 1.0.1

vldb

Spool is work space used during the execution of a query. All queries require spool space. The amount of spool available to a query is determined by the user that submits the query. This is something that can be changed by the DBA.

Spool files are transient hash distributed tables that exist for the duration of a query. They cannot be inspected or queried, they are used by the system only.

The spool allowance for a user can be determined as follows:

select spoolspace from dbc.users where username = ‘my_userid’;

Production users typically have much larger spool limits than developers or business users.

Where Is Spool?

Spool is any unused cylinder in the system. It doesn’t exist in any particular place. Any unused space within any database can be used for spool.

Spool Limits

Each user that wants to run a query must have a spool allowance greater than 0. Spool is used during the execution of a query up to the spool limit.

Why Spool Limits Are Required?

If a query requires more spool than the spool limit the query will fail with a 'Failure 2646 No more spool space’ error. This is the main mechanism by which the system protects itself from runaway queries.In the absence of spool limits a poorly formed query could potentially use all the available space within every database in the system.This has several consequences:

  • No space available for new tables or table growth

  • No spool space available for other users

  • Significant CPU cycles required to build the runaway spool file

  • No space in DBC for journals - which would crash the system

Spool limits should be regarded as a necessary protection against runaway queries, especially in a parallel processing system such as Teradata.

Running Out Of Spool

When queries run out of spool it is rarely because the spool limit has been breached. This is a common misconception. The standard response is to increase the spool limit and try again. This does not address the underlying issue, and can actually make matters worse.

Let’s assume a user has a 100GB spool limit and that the system consists of 100 AMPs. This means that the user cannot run a query that consumes more than the allotted 100GB of spool space at execution time.

However, it is also the case that spool *must* be used evenly across the AMPs. There is an implicit per-AMP spool limit.

In the above case this means that spool cannot be more than 1GB per AMP, otherwise the query will fail. In the extreme case, even if no spool has been consumed on 99 of the 100 AMPs, if a single AMP uses more than 1GB of spool then the query will fail.

There is therefore an implicit per-AMP spool limit that consists of the spool allowance divided by the number of AMPs in the system.

Why per-AMP Spool Limits Are Required

In a perfect world all tables would be distributed evenly with 1% of the data stored on each of the AMPs in our notional 100 AMP system. However, the data is not always so compliant.

Even if the input data is evenly distributed, in some cases a spool file is still very heavily skewed.

The system has to protect itself against unconstrained spool file growth on a small number of AMPs. In the absence of a ‘per AMP’ limit the spool file would grow on a single AMP, or a small number of AMPS, until the overall spool limit is reached.

In such a scenario the query is classified as ‘skewed’ as the data is not evenly spread across the AMPs. This is the main issue that developers have to guard against relating to parallel processing architectures when compared to non-parallel architectures.

Skewed queries can have a significant impact on the entire system, not just on the skewed query. This is caused by overloading the AMP(s) in question with a disproportionate share of the workload.

In any parallel processing system, such as Teradata, the individual units of parallelism (AMPs in Teradata’s case) are not designed to be powerful in their own right – the power of the system comes from the massive degree of parallelism. As a result the AMPs are sensitive to being overloaded by skewed queries. To make matters worse, the impacted AMP(s) are impacted for all queries for all users, not just the skewed query.

It is very important that skewed queries are not allowed to overload a small number of AMPs for lengthy periods of time. The per-AMP spool limit ensures this is the case.

Increasing The Spool Limit

 The classic response to out of spool failures is to increase the spool limit and re-submit the query.

In cases where the degree of skewing is light this is a reasonable response which may allow the query to complete successfully.

However, if detailed analysis of the underlying cause of the skewing is not carried out then this response can make matters worse. It is often the case that a skewed query is highly skewed onto a single AMP. Increasing the spool limit in such cases allows the skewed query to run for longer before the (single AMP) spool limit causes the query to fail. This extends the time period over which the skewed query has a negative impact on the entire system, which was probably not the intent.

Avoiding Spool Issues

It is not possible to guarantee that jobs wont fail due to spool issues. However, steps can be taken to minimise the risk.

  • Appropriate SpoolSpace settings - the amount of spool space required by a user is highly variable. For production users the value can be 10-100x higher than a business user, for example. It is a function of the amount of data being handled by the queries submitted by each user. It is important to understand the nature of the workloads to be executed by a user so that an appropriate spool space value can be assigned.

  • Training - the key differences between massively parallel processing (MPP) databases and traditional databases centre on how the data is physically distributed when tables are populated, and re-distributed as queries are processed by the DBMS. A thorough understanding of data distribution and re-distribution is required by all developers and business users in order that the frequency of skewed queries is minimised. Developers from a non-MPP background (e.g. SQL Server/Oracle) have little or no understanding of these concepts.

  • Development/QA Environment - these are often significantly different to production environments. As a result the explain plans generated in development/QA are different to those generated by the production optimiser once the same SQL code is promoted to live. Differences in explain plans between development/QA and live can lead to unpredictable production query performance, including spool space issues caused by skewing. The Teradata System Emulation Tool (TSET) is designed to address this issue:

“Teradata SET allows you to imitate a target (production environment) system on a test system.System-level environmental data is captured from the target system, then stored in relational tables. You can then use the information from these tables, together with appropriate column and index statistics, to make the Optimizer on the test system generate query plans as if you are operating in the production environment.”

  • Development/QA Data - the data used in development/QA environments is often a sample of the production data with different data demographics. Even worse, the development data might have been created by the developer with little recourse to production data values and/or demographics. Issues with skewed queries often fail to surface in development/QA environments due to the significantly smaller data volumes. Data volumes, and demographics, that mimic production as closely as possible are the #1 means of avoiding spool issues due to skewed queries on a production platform. These issues will surface during development/QA and ‘bad’ SQL can be stopped from being promoted to live if meaningful data volumes are used during development/QA.

  • Data Profiling/Metadata - understanding production data demographics is a key part of the defence against skewed queries. This can be achieved through data profiling. The kind of insight required are degree of uniqueness, frequency of nulls, blanks or default values.

  • Explain Plans - Explain plans describe exactly how the system executes a query. Within the explain plain are words such as ‘redistribute’ and ‘duplicate’. These are often clues to run-time skewing issues. Explain plan analysis is not a skill that all developers, testers or QA analysts possess. There are various tools available to address this issue. One such tool is ‘Prism’ from Ward Analytics.

  • Statistics - the query optimiser relies on accurate statistics in order to generate efficient query plans. Statistics can be collected manually or obtained by the system at run time via dynamic sampling. In cases where there are no collected statistics on join/filter columns the system will use dynamic sampling to understand the data demographics. Unfortunately, this does not always deliver the required insight into rogue or ‘edge’ cases. A single value that occurs a very high number of times can often be sufficient to cause skewed query issues. Full collected statistics, rather than dynamic sampling, can help the optimiser in such cases.

  • SQL Coding Approach - the query optimiser will often struggle to compute an optimal explain plan due to the manner in which SQL has been developed. Extensive use of derived tables, ‘create table as’ (CTAS) and too many joins in a single statement are obvious examples.

Summary

Although job failures caused by skewed queries can’t be eradicated fully, certain steps can be taken to minimise the frequency:

  • Understanding of MPP principles of data distribution and re-distribution

  • Suitable development environment and data

  • Understanding of data demographics

  • Develop simplistic SQL

  • Understand EXPLAIN plans

So there you have it...Teradata spool 1.0.1.

Enjoy!