Is ETL Documentation A Waste of Time?
When I started developing ETL processes (‘pipelines’) in the late 1980s, the work consisted of hand-crafted Teradata SQL scripts executed from an IBM mainframe. Big banks that use Teradata still do this today.
The ‘external to the data warehouse’ coding approach is a very common ETL design pattern. In only a couple of instances have I encountered ETL application code that consisted of database stored procedures.
So what?
Well, the main feature of remotely executed ETL code is that there is no explicit relationship – and therefore no metadata – between the code and the target data warehouse. This is not the case if stored procedures are used.
To make a change or add a new ETL process, the first task is good old-fashioned analysis. There are several ways to obtain the necessary insights.
In the early stages of a data warehouse platform it is possible to rely on domain knowledge. This approach gets riskier over time, but it still happens.
The more traditional approach is to produce ETL documentation. Perhaps surprisingly, this approach is fraught with difficulties.
I’ve encountered:
- Word docs that are essentially a code dump;
- Word, Excel, PowerPoint and Visio files all used concurrently;
- multiple, often inaccessible, document repositories;
- out-of-date documentation;
- incomprehensible documentation;
- and, you’ve guessed it, no documentation.
In the absence of stored procedures, the standard impact analysis approach is to ignore the documentation completely and search through the code library.
As an aside, this isn’t so bad on Linux systems with grep, sed, awk, pipe, cut, sort, uniq etc. Searching through tens of thousands of SQL scripts and millions of lines of code on an IBM mainframe using ISPF option 3.14 is quite another matter.
Ultimately, the relationship between the job, application code and database tables is what we’re trying to uncover.
The most difficult piece of the puzzle is the relationship between SQL scripts and database tables. Without stored procedures, and in the absence of usable documentation, this isn’t recorded reliably anywhere, remember?
Searching for table names through masses of SQL code is standard practice.
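On Linux, that search can be sketched with the standard text tools mentioned earlier. The sketch below is entirely hypothetical: the `etl_scripts` directory, the script names and the `DW.*` table names are made up for illustration, and the two tiny scripts stand in for a library of thousands.

```shell
# Hypothetical setup: two small .sql scripts standing in for a large code library.
mkdir -p etl_scripts
cat > etl_scripts/load_customers.sql <<'EOF'
INSERT INTO DW.CUSTOMER_DIM
SELECT * FROM STAGE.CUSTOMER;
EOF
cat > etl_scripts/load_orders.sql <<'EOF'
INSERT INTO DW.ORDER_FACT
SELECT o.*, c.CUSTOMER_KEY
FROM STAGE.ORDERS o
JOIN DW.CUSTOMER_DIM c ON o.CUST_ID = c.CUST_ID;
EOF

# Impact analysis: which scripts reference the CUSTOMER_DIM table?
grep -ril 'CUSTOMER_DIM' etl_scripts | sort

# A rough script-to-table map: pull out qualified DW.* names per script.
grep -roiE 'DW\.[A-Z_]+' etl_scripts | sort -u
```

The second `grep` is the crude beginning of the job-to-code-to-table relationship: it recovers, from the code itself, the metadata that remotely executed SQL never records anywhere else.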
Although the schedule, code and schema are the only sources of absolute truth about ‘what’ an application consists of, the more difficult question to answer is ‘why’.
“That’s what the documentation is for!”, I hear you cry.
Well, maybe. There are usually only a few key places where a helpful ‘why’ is required. Surely that’s what code comments are for?!
So, there you have it…why waste time on producing masses of ETL documentation when the truth within the schedule, code and schema only needs enhancing with a few ‘why’ comments here and there?
I know which I prefer, and it’s not documentation. No sir.
#etl #elt #teradata #ibm #mainframe #documentation #sql #jcl #job #schedule #scheduler #schema #datawarehouse

