Readings in Database Systems
Also known as the 'Red Book', Readings in Database Systems has been published since 1988. The 5th edition was published in 2015 after a 10 year hiatus. Quite ironic really that yours truly started his graduate IT career in 1988 on, you guessed it, database systems!
The Red Book contributors are Peter Bailis (Stanford Future Data Systems group), Joe Hellerstein (Professor of Computer Science at the University of California, Berkeley) and Michael Stonebraker (Professor at MIT). Although these chaps certainly know a thing or two about databases, it's always interesting to read the academic take on the database world.
What caught the eye in the preface was the statement that 'Cloud computing and microarchitectural trends has made distribution and parallelism nearly ubiquitous concerns'. This is against a backdrop of a massive decrease in compute/storage costs and the increase in data volumes under management with the rise of 'Big Data'.
For those of us that have been Teradata developers since the 1980's, data distribution and parallelism have been with us for almost 30 years. Good to know the rest of the world is finally catching up!
Chapter 1 : Background (Stonebraker)
The key theme here is that the Map-Reduce market has become an HDFS market, which 'seems poised to become a relational SQL market'. In a similar vein, 'New data models have ben invented, only to morph into SQL on tables'.
We've been calling this out for years, a lot of Map-Reduce/Big Data/Hadoop efforts are leading inevitably to 'SQL-on-Hadoop', which is essentially a parallel database capability...which we've had for 30 years.
Chapter 3 : Techniques Everyone Should Know (Bailis)
As discussed on a previous VLDB blog post, it's all to easy to overwhelm database optimisers such that 'modern query optimizers still struggle with large plans'. Wise words from Mr Bailis.
Most folks don't get bitten hard enough, or often enough, by bad query plans to be too bothered by the query optimisation challenge faced by all database optimisers. The fallout from a bad query plan becomes all too apparent as data volumes rise. A bad query plan against a few gigabytes of data is not the same as a bad plan when petabytes of data are in play. Testers take note!
As Teradata's Stephen Brobst memorably quipped at the Teradata User Group (TUG) in London a few years ago: 'With Hadoop you're the optimizer. Good luck with that'. Well said.
The folks at Pivotal are noted to have been addressing the query optimisation challenge with Orca. Now if only someone would make those Greenplum explain plans easier to read. Ho-hum.
Chapter 4 : New DBMS Architectures (Stonebraker)
A favourite YouTube video that always raises a chuckle in the VLDB office pokes fun at the NoSQL crowd. However, we tend to agree that the SQL and NoSQL crowds will merge over time. Beats piping data to /dev/null.
Chapter 5 : Large-Scale Dataflow Engines (Bailis)
The ongoing convergence between the 'old' (RDBMS-based data warehouses) and the 'new' (Hadoop) supports our belief that "there's nothing new under the sun".
Bailis points out that systems built from the Hadoop ecosystem have 'come to resemble traditional data warehouses'. Furthermore, 'post MapReduce systems have come to implement a growing proportion of the feature set of conventional RDBMSs'.
This is an ongoing trend that will ultimately lead to the 'Hadoop/Big Data' open source ecosystem delivering, amongst other things, MPP database capability. Yes, 30 years later than when it first came into being, but that's another story.
Chapter 6 : Weak Isolation & Distribution (Bailis)
The author points to survey results that state that only 3 out of 18 SQL & NewSQL databases provided serialisability by default, and that 8 did not offer serialisability at all. This is characterised as a 'race to the bottom' amongst vendors.
With the rise of Hadoop and all things Big Data, it is noted that 'weak isolation has become even more prevalent'. Ouch.
Chapter 7 : Query Optimisation (Hellerstein)
Ah, query optimisation, one of our favourite topics. No seriously!
Query optimisation is quite rightly portrayed as 'one of the hardest parts of a DBMS to implement well' and that query optimisers remain a 'clear differentiator for mature commercial DBMSs'.
These observations resonate well with the team at VLDB.
We took the prowess of the Teradata optimiser somewhat for granted during the 1980s/1990s/early 2000s. It wasn't until the new wave of PostgreSQL-derived MPP systems (you know who you are!) came along in the early/mid 2000s that we realised how wrong it can all go when the query optimiser struggles to come up with sensible execution plans. Sub-optimal execution plans don't really matter until you get hundreds of millions, billions and now trillions of rows.
One of the points we've been calling out for what seems like ages is that 'the "big data" community...has been far too slow to focus on query optimization'.The author goes further, stating 'that even when the community started building SQL interfaces like Hive, query optimization remained a little-discussed and poorly-implemented topic. Maybe it's because query optimizers are harder to build well than query executors'. Shots fired!
<chuckle>However, it is noted that 'there are efforts underway to start building at least a 1980s-era optimizer'. Green screens ahoy!</chuckle>
Our friends over at Pivotal are noted for their work on the Greenplum Orca optimiser.
Query optimisation *really* matters when large and/or complex datasets are in play. Unfortunately, the challenge of query optimisation is known to be 'NP hard'. Query optimisation is even harder in a distributed/clustered MPP architecture. Damn that linear scalabilty.
The low hanging fruit of noSQL/no ACID/parallelism etc. is precisely that, low hanging. Like the challenge of workload management, query optimisation is very much further up the tree.
Ever wondered why not all databases are available as an MPP version? Well, query optimisation is right at the top of the list of the challenges faced when a general purpose DBMS tries to morph into a fully-formed MPP version. It's not easy, not at all, no sir. Just ask Microsoft or Oracle.
As an aside, Teradata deserve serious plaudits for how good their SQL query optimiser is, and has been for so long. Teradata explain plans are also extremely user-friendly. Go Teradata!
Chapter 8 : Interactive Analytics (Hellerstein)
The author calls out the 'buzzword bingo' that has seen the analysis of data range from being called '"Decisison Support Systems" (DSS), to "Online Analytic Processing" (OLAP) to "Business Intelligence" (BI) to "Dashboards" and more generally just "Analytics"'. What no data science? Oh, wait, that's just analytics carried out in Silicon Valley, remember? Silly moi.
Pre-computation and sampling are offered as two methods for making a query 'run in less time than it takes to look at the data'.
The invention of the term 'OLAP' is assigned to Ted Codd, whose (in)famous promotion of Essbase is described as 'not one of Codd's more scholarly endeavours'. More shots fired.
The assertion that 'pre-computing a full data cube does not scale well' is well documented. In essence, this is sparse matrix problem.I
t is noted that 'OLAP style pre-computation has underpinned what is now a multi-billion dollar BI industry'.
On the other hand, approximate query processing (sampling) attempts by IBM, Informix and Microsoft all failed because 'database customers won't tolerate wrong answers'. What a picky bunch those database users must be.
Chapter 9 : Languages (Hellerstein)
The academics had done a great job of keeping us pesky out-in-the-field types on board until we came across the claim that 'it is rare for people to interact directly with a database via a language like SQL unless they are coding up a database-backed application'.
We take this to mean that only IT-folks building an application will ordinarily use SQL to interact with a database. If that is the correct interpretation, which seems likely, then it is 100% incorrect/inaccurate/wrong, and has been since, well, the 1980's. There are very few client sites, if any, where we haven't witnessed business users/analysts writing and submitting SQL against Teradata, Netezza, Oracle, SQL Server, DB2 or Greenplum.
We have trained hundreds of business users to write SQL. We even offer a dedicated SQL training course for business users.
More than one retail bank in the UK has hundreds of business users that submit SQL via Teradata's BTEQ client in batch mode using JCL submitted from an IBM mainframe. Think about that. Ugh, poor souls.
It boils down to a simple truism - big companies often have sufficient data that only an MPP class database will suffice, and only SQL runs in parallel against said MPP databases.
The upshot? Lots of folks at lots of big companies write SQL, which they run directly against their company's MPP database. Sometimes business users even write highly efficient Really Neat SQL (RNSQL).
Chapter 11 : A Biased Take on a Moving Target: Complex Analytics (Stonebraker)
The opening gambit here is that 'the questions asked by today's data scientists...represent a very different use case from the traditional SQL analytics run by business intelligence specialists'.
Furthermore, 'data science will completely replace business intelligence...since it represents a more sophisticated approach to mining data warehouses for new insights'.
Data science is characterised as 'predictive modeling, machine learning, regressions, ... and not SQL analytics'.
This is kinda interesting given that activities such as predictive modeling and regressions have been widely used for decades. Machine learning is a newer kid on the block, certainly, but the whole 'data science thang' smacks a bit of Emperor's new clothes. Sorry to disappoint.
The author considers it a 'sad truth' that 'Most data science platforms...operate on file data' but that '...data scientists over time wish to use DBMS technology'.
OK, so we want to bring the data science (R) to the database? How about being able to run R in parallel inside an MPP database? Sounds to good to be true? Oh ye of little faith...take a look at PivotalR - voila!
PivotalR turns the R code into SQL that runs in parallel inside the Greenplum MPP database. This should hopefully address the concern that 'most analytics codes...fail to scale to large data set sizes'.
Interestingly, Hellerstein disgarees with Stonebraker that advanced analytics (data science) will replace BI. As he points out 'SAS users are not database users'. We side with Joe on this one. Sorry Mike.
It is interesting to note that Hellerstien suggests that 'the advanced analytics community is highly biased towards open source'.
Open source Greenplum & R anyone?
Chapter 12 : A Biased Take on a Moving Target: Data Integration (Stonebraker)
The last chapter of the Red Book covers the large topic of 'data integration'. Last, but by no means least - data integration is where team VLDB earns most of its corn.
It is largely the case that data integration 'began with the major retailers in the 1990s consolidating their sales data into a data warehouse'. To back this up, the early Teradata adopters in the UK include retailers such as Littlewoods and GUS (now combined into ShopDirect), Grattan, JD Williams (N Brown), Asda and Tesco.
Following on from the retailers, 'all enterprises...organized their customer facing data into a data warehouse'. In our MPP-centric world, this covers the banks and telecoms companies, in particular.
One of the greatest truisms we've witnessed first hand is that 'an upfront global schema is incredibly difficult to construct for a broad domain'.
We'd go further and describe this challenge as 'impossible'.
This issue can be so contentious at large organisations that we've even been called in to referee the bun-fight between the enterprise modellers and the architects. You know who you are!
Data integration is described as 'fundamentally difficult' both because 'data is dirty' and 'deduplication is hard'. Wise words indeed. Thankfully for team VLDB, the truism 'where there's muck there's brass' is wholly in play when it comes to the data integration challenge.
Step 5 of the data integration/curation journey is described as consolidation/de-duplication in which 'Mike Stonebraker and M.R. Stonebraker must be consolidate into a single record'.This consolidation challenge is known by various names such as 'name & address matching', 'customer de-duplication' or 'single customer view' (SCV).We have been heavily involved with SCV processing since the 1980s. In our opinion, SCV is by far the most complex logic we ever deploy as part of the myriad ETL processes that keep the data warehouse fed and watered with deltas.
SCV processing looks simple on the surface, doesn't it? Try and build a half-decent name/address matching process in SQL and you'll soon wish you hadn't started. Either that or you'll pretend what you've built in SQL is fit for purpose whilst knowing the results are terrible.
Data is dirty, remember?
Note to self - write an SCV blog post some time.
And that's it folks - our take on the most excellent Red Book 5th edition. Well done for reading this far.