An article by Nick Holt about ‘SQL query splitting’ caught my attention on LinkedIn recently.

It’s a topic that’s dear to my heart. The notion that SQL queries can be split into smaller components is a key driver of how I’ve written SQL for over 30 years. For myself, and many others I’ve worked with, query splitting isn’t just possible, it’s essential.

In addition to the practical benefits discussed below, SQL query splitting also observes the guiding principle of simplicity to which I and many others subscribe.

As a reminder:

“Simplicity is the ultimate sophistication”
and
“Everything should be as simple as possible, but no simpler”

Simplicity as a guiding principle is all very well and good, but there are also strong practical reasons in support of query splitting. This is certainly the case in the Teradata world in which my beliefs, and those of many others, were forged during the 1990s.

I created my first multi-billion row table over 25 years ago. The lessons we learned back then have held true to this day. Query splitting, as it now seems to be called, is central to these beliefs.

TL;DR

SQL query splitting = good. Work tables = good. CTEs = bad. Temp tables = bad.

Also, stop being so lazy and read the article. It’s free and you just might learn something 😉

Reasons To Split A Query

Before we delve into ‘how’ to split a query, let’s start with ‘why’.

First, let’s clarify what we’re talking about when we split a query.

Let’s assume we have a simple query that joins tables A, B & C as a single step:

-- step 1
select <stuff> from A join B join C

This is pseudo-code. It’s the technique that matters, not the SQL.

As a multi-step process it might look like this:

-- step 1
select <stuff> from A join B & materialise as AB

-- step 2
select <stuff> from AB join C & materialise as result set 

The original single step query that joined all tables at once has been split into smaller processes.

This is a very simple example. A single query that joins ‘lots of tables’ could be broken down into many more, smaller steps than the example above.
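To make the technique concrete, here’s a rough sketch of the split using work tables. The table and column names are made up purely for illustration, and the exact create-table-as-select syntax varies by platform (Teradata, for example, wants the select wrapped in brackets followed by WITH DATA):

-- step 1: join A to B and persist the intermediate results as a work table
create table work_ab as
select a.customer_id, a.order_id, b.order_value
from A as a
join B as b
on b.order_id = a.order_id;

-- step 2: join the work table to C and materialise the final result set
create table work_result as
select ab.customer_id, ab.order_value, c.region
from work_ab as ab
join C as c
on c.customer_id = ab.customer_id;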

Reasons to split the SQL include the following:

1) Each Query Step Will Happen Anyway

Whether you split the query or not, the same steps are executed by the database in order to carry out your request. In the single step example, the intermediate results are created and automatically passed from one query step to the next:

  • join A to B & materialise intermediate results
  • join intermediate results from the previous step to C
  • materialise output from previous step as result set

The only real difference between the single-step process and the split process is visibility of the intermediate results. From a query execution perspective, nothing has been gained by running as a single step. All of the query plan steps need to be satisfied to return the results.

2) Testing Is Easier

So, you’ve joined dozens of tables together as a single SQL query? How very clever!

OK, then…tell me how you know your results are correct. “All the joins worked and the output looks OK” is not testing.

With query splitting the results from individual steps can be easily interrogated. This means we know we’re happy with the output from joining A and B before we join to C.
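For example, with the A-B join persisted as a work table we can run simple sanity checks before joining to C. These are just a sketch, with hypothetical column names:

-- how many rows did the A-B join produce?
select count(*) from work_ab;

-- did the join introduce duplicates?
select order_id, count(*)
from work_ab
group by order_id
having count(*) > 1;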

3) Query Optimisation Is Easier

The more tables joined together in a single query, the more difficult life becomes for the SQL query optimiser. This isn’t a database problem, it’s a simple maths challenge: the number of possible join orders grows roughly factorially with the number of tables, so a 10-table join already gives the optimiser millions of candidate orderings to consider.

I wrote about this topic 10 years ago: ‘The Query Optimiser Likes Simple SQL’.

A bad query plan can be the cause of very poor performance. The more complex the query, the higher the chance of a poor query plan.

4) Simpler Software = Easier Maintenance = Lower TCO

The real cost of software isn’t in the build phase, it’s in the maintenance phase. Some of my stuff is still out there 20-30 years after being released into the wild. No doubt most of it has been amended, some of it quite a lot. I’ve always written my SQL as simply as possible. This involves as much query splitting as possible. It looks like a noob wrote it. No really!

SQL query splitting means simpler steps, which means easier maintenance, which means lower TCO over the life of the software.

My rule of thumb is that the most junior team member should be able to deal with out-of-hours ETL support on their own. This is only possible if sensible coding standards are in play. Central to this is the notion of query simplicity achieved through splitting. Million+ line SQL code bases are not uncommon in the Teradata world, and others no doubt. Simple code is a doddle to diagnose/fix compared to so-called ‘clever’ code.

5) Re-Start From Point of Failure

Queries can fail for many reasons. Let’s not get into how/why.

If a complex single-step query fails it must be re-run from the start every time. That means even the query plan steps that worked successfully must be re-run.

This isn’t the case for a multi-step query. As results are materialised after each step, it is possible to re-start only from the failed step.

When dealing with large data volumes it is critical that restarts don’t re-run successfully completed steps. That’s a waste of machine resource and precious time.
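Using the earlier two-step example, a re-start after a step 2 failure might look like this (a sketch only, same hypothetical tables as before):

-- step 1 completed successfully, so work_ab already exists and is NOT re-run

-- step 2 failed (out of space, killed session, whatever) and is re-run on its own
create table work_result as
select ab.customer_id, ab.order_value, c.region
from work_ab as ab
join C as c
on c.customer_id = ab.customer_id;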

Why Work Tables?

There are several ways to deliver SQL query splitting. The approach I recommend is the use of work tables – these are just normal everyday common-or-garden tables.

Reasons to use work tables include the following:

1) Intermediate Results Are Persisted & Visible

Work tables are under our direct control. We can create, populate, truncate & drop as/when appropriate.
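A sketch of that lifecycle, with made-up names and types (exact syntax varies by platform):

-- create the work table with explicit DDL
create table work_ab
( customer_id integer
, order_id    integer
, order_value decimal(12,2)
);

-- populate it from the source tables
insert into work_ab
select a.customer_id, a.order_id, b.order_value
from A as a
join B as b
on b.order_id = a.order_id;

-- clear it down ready for the next run (delete all rows where truncate isn't supported)
truncate table work_ab;

-- drop it when it's no longer needed
drop table work_ab;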

Making the contents of intermediate query steps persistent via a work table has proven to be a life-saver many times over during my career. It makes it possible to see exactly what happened during the previous run of every production ETL job.

The counter-argument historically was the space required to persist intermediate results. As space is no longer as expensive as ‘back in the day’, this argument is no longer as persuasive.

I’d still argue in favour of the evidence trail we can use to defend ourselves and point at GIGO (garbage in, garbage out) when users complain that the ETL logic was wrong. How very rude!

2) Work Tables Are Shareable

Work tables are just regular tables. That means they can be shared amongst multiple processes. This is especially useful if the step used to create the work table is an expensive operation to perform. Why waste clock-cycles (and maybe ££££) repeating the same process over and over???

Remember: ‘sharing is caring’.

3) Provides Control Over Data Distribution

Since the late 1980s I’ve worked on Massively Parallel Processing (MPP) analytic systems like Teradata, Netezza, Redshift & Greenplum. All of these systems use various methods to distribute data across one or more nodes in a cluster. For most tables, data is hash distributed and controlled via distribution keys specified in the table DDL.

Controlling data distribution to avoid skew is MPP 101. This applies to intermediate results as well as ‘normal’ (non-work) tables. By persisting intermediate results to work tables we can explicitly control the distribution key, and therefore the data distribution, at each stage of the process.
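The DDL differs per platform, but the idea is the same. A sketch, with hypothetical column names:

-- Teradata: distribution is controlled via the primary index
create table work_ab
( customer_id integer
, order_value decimal(12,2)
) primary index (customer_id);

-- Redshift: distribution is controlled via the distribution key
create table work_ab
( customer_id integer
, order_value decimal(12,2)
) diststyle key distkey (customer_id);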

This can be especially useful in cases where the optimiser would otherwise ‘get things wrong’ and create behind-the-scenes intermediate results with a poor distribution strategy, often leading to excessive query elapse times. This is more common than many realise when stats are stale or not present.

To answer Nick’s request, I once got a query down from 4 days to 4 hours of elapsed time by staging intermediate results with a sensible distribution strategy. Left on its own, the optimiser chose to duplicate a table onto every shard. That was ~200 copies of a large table. Avoiding that duplication and hash distributing the intermediate results instead was all that was needed.

All is not lost for those using general purpose non-MPP databases like SQL Server, Postgres, MySQL etc. Although hashing/sharding is not ‘a thing’ for these folk, the ability to explicitly control things like indexes, partitions & sorting on work tables can be similarly beneficial.
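For example, on Postgres a work table can be given an index to support the join in the next step, something a behind-the-scenes intermediate result set never gets. A sketch, with hypothetical names:

-- index the work table to support the next join
create index ix_work_ab_customer on work_ab (customer_id);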

4) Enables Statistics Collection

An important part of query planning for all databases is metadata. The more the optimiser understands the data demographics the better. This is normally achieved via collecting statistics or ‘stats’ on tables.

In many cases a poor query plan can be improved via the collection of more stats, or simply a refresh of stale stats.

Persisting intermediate results in work tables allows stats to be collected in cases where this is beneficial to query planning.
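A sketch of what that looks like on a couple of platforms, using the hypothetical work table from earlier:

-- Teradata
collect statistics on work_ab column (customer_id);

-- Postgres / Redshift
analyze work_ab;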

Side note: more stats does not necessarily mean a better query plan. A good plan is a good plan. More stats can only improve a bad plan, not make a good plan better. Not needing to collect stats at all, because the plan is already good, is by far the best scenario.

What About CTEs?

Common Table Expressions or CTEs are talked about a lot at present. I’m not a fan. There it is. I’ve said it.

If I was tasked with writing SQL coding standards once again I’d argue for CTEs not to be allowed in production ETL code.

Reasons not to use CTEs include the following:

1) Just Because You Can…Doesn’t Mean You Should

CTEs fit into that category for me.

It’s all too easy to jump on the bandwagon when new techniques appear. I’m not against progress. I just expect to know what problem something solves before it gets added to my armoury.

CTEs don’t add anything for me, so they stay in the ‘meh’ pile.

2) ‘Readability’ Is Subjective

The oft-quoted benefit of CTEs is that they ‘aid readability’. Say what!?!?!

First of all, ‘readability’ is 100% subjective.

Secondly, how does this aid testing, performance or maintenance in any way, shape or form? I’ll help you out: it doesn’t.

3) No Other Claimed Benefits

Beyond ‘readability’ I’ve yet to encounter any other claimed benefits for CTEs. As always, I’m willing to be corrected or educated. Maybe I’ve missed something, ya never knows.

CTEs are a triumph of fashion & “Look how clever I am, I coded it all in one go” attitudes. Doh!

4) Things CTEs Can’t Do

  • expose intermediate results for testing
  • share intermediate results
  • collect stats to aid query optimisation
  • control DDL for optimal intermediate result materialisation
  • avoid re-running completed steps when jobs are re-started

Basically, CTEs can’t do all the things we can do with good old-fashioned work tables (see above).

5) CTEs Are Easily Abused

It’s not the CTE’s fault, obviously.

However, it seems that folk just can’t resist stitching a ton of CTEs together into a big seething mass of impenetrable SQL. They’re probably the same folk that abused views before they latched onto CTEs. Ho hum.

For the sake of balance, reasons to use CTEs include the following:

1) Small Data Volumes

For most folk with small/medium dataset sizes the efficiency, or otherwise, of your SQL query techniques won’t matter. How you write your code might be inefficient, but you’ll never know (or care) as the elapse times are acceptable.

2) No Elapse Time Sensitivity

Not everyone cares about SQL query elapse times. You might have ‘chunky’ data volumes, and your SQL techniques might not be super-efficient, but that might not bother anyone.

If your jobs finish during the allotted window and the users are happy, who cares that the SQL could theoretically be made more efficient and run faster?

I do, but that’s another matter!

3) Faster/Easier To Code

Yours truly has taught SQL to hundreds of students, written SQL for over 30 years, written SQL best practice guides, investigated the gnarliest SQL performance issues, ‘productionised’ dozens of End User Computing (EUC) SQL applications and answered thousands of SQL questions.

Against this background I think the main reason CTEs are popular is the speed & ease with which they can be written.

Who cares about all that stuff the picky IT folks (should) care about if I can get my pesky SQL written and executed as quickly as possible? That’s a prevalent mindset IMHO, folks.

CTEs allow us to avoid having to think about data types, having to write DDL & having to manage intermediate table contents.

Overall there’s less code to write if CTEs are used rather than work tables.
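For comparison, a CTE version of the earlier A-B-C example might look something like this (a sketch only, hypothetical column names as before). No DDL, no data types to think about, no intermediate tables to manage:

with ab as
(
  select a.customer_id, a.order_id, b.order_value
  from A as a
  join B as b
  on b.order_id = a.order_id
)
select ab.customer_id, ab.order_value, c.region
from ab
join C as c
on c.customer_id = ab.customer_id;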

What About Temp Tables?

‘Back in the day’, actually the early ‘90s, we would call tables that held intermediate results ‘temp tables’.

This reflected the contents, which were temporary: they were cleared down and replaced every time the owning process was executed.

Teradata, and no doubt others, introduced real temporary tables in the ‘90s. I couldn’t see a use for them so didn’t bother. However, to differentiate new-fangled ‘real’ temp tables from what we called temp tables, we decided to use the phrase ‘work tables’ instead.

Reasons not to use (real) temp tables include:

1) Lack of Persistence

Unlike work tables, temp tables typically only exist for the duration of a session. That’s good from a space utilisation perspective, but it means all that lovely evidence is lost from the previous ETL run.
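A sketch of the difference, using Postgres-style syntax as an example:

-- temp table: gone at the end of the session, along with all that lovely evidence
create temporary table temp_ab as
select a.customer_id, b.order_value
from A as a
join B as b on b.order_id = a.order_id;

-- work table: a normal table, still there for inspection long after the ETL run
create table work_ab as
select a.customer_id, b.order_value
from A as a
join B as b on b.order_id = a.order_id;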

2) Not Shareable

As temp tables typically belong to a user or session they’re not easily shared.
