How Should MPP RI Be Maintained?
The question “how do you maintain data integrity”, specifically “on an MPP database platform”, was recently posted by Jay Quincy Allen on LinkedIn.
The original article is here:
As well as the RI question getting my attention, Jay (perhaps foolishly) invited yours truly to comment. The more I thought about it, the more I decided the topic of RI maintenance was worth a blog post, so here goes…
Just Give Me The Skinny
OK, impatient folk, I hear ya.
TL;DR – should we care about RI? Not as much as you might think.
Disagree? Grab a cuppa and read on.
But Just Implement RI, Surely?!?!?!
As Jay points out, not all MPP platforms enjoy “the benefit of declarative referential integrity (DRI)”.
He goes on to state that users of “Redshift, Snowflake, or Synapse…have probably added automated test scripts at the end of your integration process to perform data validation.”
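For those who haven’t seen one, such a script usually boils down to an ‘orphan check’: count the child rows that have no matching parent. A minimal sketch only, using made-up sales_fact and customer_dim tables (none of this comes from Jay’s article):

```sql
-- Hypothetical post-load RI check: count fact rows whose customer key
-- has no matching row in the dimension table (an 'orphan' check).
-- Table and column names are illustrative only.
SELECT COUNT(*) AS orphan_rows
FROM   sales_fact f
WHERE  NOT EXISTS (
         SELECT 1
         FROM   customer_dim d
         WHERE  d.customer_id = f.customer_id
       );
-- An ETL framework would typically warn or fail the load when
-- orphan_rows breaches an agreed threshold.
```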
Unlike a general purpose DBMS, analytic databases don’t always have the ability to implement DRI. Omitting it makes the platform quicker and easier for the vendor to build, but it leaves open the question of how RI can be implemented.
So, if RI can’t always be implemented at the schema level via DRI, where does that leave us?
OK, Mr Teradata, Enlighten Us
Jay points out that “Teradata has been around far longer than these latest mpp databases” and asks yours truly, “What is your perspective on this issue?”
Given that I’ve been a Teradata developer since 1989, and a certified Teradata Master since 2002, the request to chime in seems entirely reasonable.
Teradata Support For RI
As the pioneering MPP analytic platform, it should be no surprise that Teradata supports various forms of RI: standard RI, batch RI and soft RI (Referential Constraints). Teradata’s RI documentation provides full details.
The summary is that Teradata supports either no RI, soft RI or hard/enforced RI.
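Purely by way of illustration, and with made-up table and column names, the flavours look roughly like this in Teradata DDL. The three statements are alternatives, not something you’d apply together, and the exact syntax should be checked against the Teradata documentation for your release:

```sql
-- Illustrative Teradata DDL only; object names are made up.

-- Standard ('hard') RI: every row is checked as it is inserted or updated.
ALTER TABLE sales_fact
  ADD FOREIGN KEY (customer_id)
  REFERENCES customer_dim (customer_id);

-- Batch RI: checked per request, with the rollback risk quoted below.
ALTER TABLE sales_fact
  ADD FOREIGN KEY (customer_id)
  REFERENCES WITH CHECK OPTION customer_dim (customer_id);

-- Soft RI: declared but never enforced; it exists to help the optimiser
-- (e.g. join elimination), not to validate the data.
ALTER TABLE sales_fact
  ADD FOREIGN KEY (customer_id)
  REFERENCES WITH NO CHECK OPTION customer_dim (customer_id);
```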
As there are no free lunches in the world, Teradata points out the following (my highlighting added for emphasis):
“There is a significant potential for performance problems when updating tables with batch RI constraints in situations where multiple update operations are involved: for example, an INSERT … SELECT or MERGE operation involving thousands or even millions of rows. This would be an expensive operation to have to roll back, cleanse, and then rerun.”
Although, as always, the adage ‘your mileage may vary’ (YMMV) is very much in play, I’d take Teradata’s warnings on the risks surrounding RI very seriously.
If I Had A Pound (or a Euro or a Dollar)…
…for every time I’ve implemented Teradata RI, in any form, I’d have precisely…wait for it…nada, zilch, zero, nowt, nothing.
That’s right, in the last 30+ years I’ve yet to implement Teradata RI at the schema level. Certainly not that I can remember, anyway.
The reasons are many and varied. Ultimately, it comes down to the ‘pain:gain’ ratio: the ‘pain’ of implementing RI is often not worth the ‘gain’. The main issue is one of context. RI is fine for small systems, and very desirable or even mandatory for operational applications.
However, MPP-based Data Warehouse systems are rarely small and should never host operational OLTP-style applications.
What works for OLTP is not necessarily desirable for large-scale analytics.
MPP Data Warehouse Traits
Whether built with Teradata or one of the new-fangled MPP platforms, MPP-based Data Warehouses tend to share common traits:
- used by larger organisations
- data sourced from multiple systems, both internal and external
- complex, low latency & high volume ETL processes
- wide variety of users and activities
- high query complexity
- high query concurrency
- long query elapsed times
Certain shared traits are relevant to RI:
- little interest in single records
- complex data model/schema – especially for adopters of industry standard logical data models
- historic data spanning multiple years or even decades – billion+ row tables are not uncommon
- data freshness as important as data quality – ‘what happened recently?’ is the #1 focus
- large fact tables with PK:FK relationships – perhaps the main blocker to easy RI implementation (see the sketch after this list)
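Those last two traits point towards a pragmatic compromise: if RI is checked at all, check only the recently loaded data rather than the full multi-year history. A rough sketch, reusing the made-up tables from earlier and assuming the fact table carries a load_date audit column:

```sql
-- Hypothetical incremental RI check: validate only rows loaded in the
-- last week, reflecting the 'what happened recently?' focus, rather
-- than scanning billions of historic rows. load_date is an assumed
-- ETL audit column; date arithmetic syntax varies by platform.
SELECT COUNT(*) AS recent_orphan_rows
FROM   sales_fact f
WHERE  f.load_date >= CURRENT_DATE - 7
  AND  NOT EXISTS (
         SELECT 1
         FROM   customer_dim d
         WHERE  d.customer_id = f.customer_id
       );
```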
What Causes Poor RI?
“Upstream” is the knee-jerk response. Sadly, the blame doesn’t always lie there. The causes of poor RI can be many and varied:
- poor change control – upstream didn’t tell us about data feed changes
- scheduling errors – jobs run out of sequence, not at all or too many times
- ETL logic errors – we coded what we intended but the requested/agreed logic was wrong
- poor QA/testing – we messed up and no-one caught it
- poor migrated data quality – bulk take-on of new data contained errors
- poor ad hoc ‘fixes’ – unintended consequences of one problem causing another elsewhere
- poor data retention logic – beware the ‘GDPR data clean-up’
- data corruption caused by system hardware/software failures
It is also possible that what we think are RI issues are actually not. Incorrect understanding or implementation of entity relationships is common, especially with a large & complex data model.
During the 1990s we used to run ad hoc RI checks at a major UK Teradata site. The number of missing rows was so large, and so difficult to explain, that we decided not to do anything unless the users complained. I’ll wager it’s largely still the same now.
Does Poor RI Even Matter?
If we confine ourselves to the world of MPP Data Warehouses, a universal truth is that recent, high-value business data constitutes very little of the overall Data Warehouse by volume, but accounts for nearly all of the business user interest.
There is strong focus on the data required to compute the recent KPIs. Most of the rest is of little/no obvious day-to-day benefit, but is there to support whatever future analytic activities are undertaken. Interest diminishes exponentially as data ages.
Retailers are focussed on recent order items, telcos on recent CDRs, banks on recent financial transactions etc. There is always a focus on the recent ‘core data’ that underpins the organisation’s raison d’être.
Given that this is the case, organisations will often tolerate lower-than-you-might-expect data quality elsewhere in the schema. This includes RI.
Ultimately, no-one really cares if low interest tables do not enjoy 100% perfect RI. I’ve yet to encounter a single client that would foot the bill for perfect RI across a complex MPP-based Data Warehouse platform. The ROI case just isn’t there.
So What’s the Best RI Strategy?
Jay’s article offers ‘automated test scripts w/ETL’ & ‘clean data on read’ as potential ways to ‘guarantee data integrity’.
Test scripts that try to check RI relationships as part of ETL are almost always too expensive to implement, sadly. This does leave the door open to a lowering of data integrity.
‘Clean data on read’ isn’t a panacea. As discussed above, the reasons for RI issues are many and varied. Aiming to fix data on the way into the Data Warehouse is not a catch-all means to ensure no RI issues, although it should certainly be considered as part of the solution.
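To make the idea concrete, one way to interpret ‘clean data on read’ is a view that simply hides the orphans at query time. A sketch only, reusing the made-up tables from earlier:

```sql
-- Hypothetical 'clean on read' pattern: expose only fact rows whose
-- dimension key has a parent, rather than enforcing RI on write.
-- Object names are illustrative only.
CREATE VIEW sales_fact_clean AS
SELECT f.*
FROM   sales_fact f
WHERE  EXISTS (
         SELECT 1
         FROM   customer_dim d
         WHERE  d.customer_id = f.customer_id
       );
```

Note that this masks the problem rather than fixing it, and adds a join to every read – another reason it can only ever be part of the solution.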
Finally, The Answer To Jay’s Question
Unlike new-fangled MPP systems, Teradata does offer various forms of Declarative Referential Integrity (DRI).
However, DRI is something I would use only if we could live with the processing overhead & the rollback penalty. This is only likely on smaller systems with big ETL processing windows and low delta record volumes.
Based on my real-world experience, no-one votes for DRI when it introduces the very real risk of ETL process failures caused by a single missing row. The pain:gain ratio is all wrong.
Ultimately, there is no practical way to ‘guarantee data integrity’.
It may be theoretically possible but, for me, the barrier is not technical.
Business users rarely care enough to want to make the investment in the investigations & fixes required to stamp out DQ problems across large analytic systems.
The strategy I’ve felt most comfortable with is to carry out DQ/RI investigations & fixes on a case-by-case basis, in response to business user requests and, implicitly, funding.
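In practice that can be as simple as recording the scale of a suspected RI problem so it can be triaged when the business asks (and pays) for a fix. A hypothetical sketch, with a made-up dq_ri_audit table and the same illustrative fact/dimension pair as earlier:

```sql
-- Hypothetical case-by-case reconciliation: log the orphan count for a
-- single relationship into a DQ audit table for later triage.
-- All object names are illustrative only.
INSERT INTO dq_ri_audit (check_name, run_ts, orphan_rows)
SELECT 'sales_fact -> customer_dim',
       CURRENT_TIMESTAMP,
       COUNT(*)
FROM   sales_fact f
WHERE  NOT EXISTS (
         SELECT 1
         FROM   customer_dim d
         WHERE  d.customer_id = f.customer_id
       );
```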
DRI certainly isn’t the answer.
Author: Paul Johnson

