Greenplum Provisioning
If you have wrestled with Open Source Software (OSS), you may be familiar with perhaps its biggest ‘gotcha’: OSS isn’t always easy to tame.
The price tag might appeal, but the flip side is often the difficulty of provisioning, deploying, configuring, optimising and securing OSS.
Greenplum is no exception. It is far from trivial to provision.
However, once you’ve been through the Greenplum provisioning process a few hundred times, it gets a lot easier. No, really!
Object Storage For Analytics
The current vogue in the analytics world is to store & analyse data in object storage.
The aim of this blog article is to show that it is possible to provision a cloudy Greenplum instance quickly & reliably, and that the provisioned Greenplum instance can query data resident in external object storage.
Provisioning A Greenplum Cluster
The first step is to create a Greenplum cluster on your preferred cloud platform, in this case AWS.
The current Greenplum v7 installation guide runs to over 2,000 pages. “There’s no such thing as a free lunch”, after all.
Thankfully, we’ve been installing Greenplum on all manner of platforms for well over a decade, so we won’t be needing the install guide.
The prerequisites are a suitable Linux machine image and an infrastructure-as-code (IaC) script to ‘do the doing’.
Our first version of this process used the AWS CloudFormation Template (CFT) approach.
However, we’ve since re-developed the Greenplum provisioning process using Terraform.
Terraform GPDB Provisioning
It usually takes about 1 minute to provision a single-node Greenplum instance on AWS:
$ terraform validate
Success! The configuration is valid.
$ terraform init
Initializing the backend...
<snip>
Terraform has been successfully initialized!
$ terraform plan
<snip>
Plan: 19 to add, 0 to change, 0 to destroy.
<snip>
$ terraform apply
<snip>
Apply complete! Resources: 19 added, 0 changed, 0 destroyed.
Outputs:
GP_Admin_Password = "gpdbdemo"
GP_Database_Name = "greenplum"
Master_Host_IP = "<ip_address>"
The GPDB super-user password and default database name shown above are Terraform variables. The IP address is assigned by AWS.
In this case the instance consists of the following:
- 1 x r5dn.2xlarge EC2 node ($0.668/hr on demand at the time of writing)
- 8 x vCPU
- 64GB RAM
- 4 x GPDB segments (shards)
Once provisioning is complete, the instance is visible via the AWS console.
Check The New Instance
We can also SSH to the new Greenplum instance as the AWS Linux ec2-user to check all is well:
$ ssh -i <pem_file> ec2-user@<ip_address>
Once the Greenplum software has been successfully installed, the following message will be visible on login:
<snip>
Greenplum Database Cluster (GPDB) generated by VLDB.
[ec2-user@mdw ~]$
Connect Remotely via psql
The next step is to connect to the Greenplum Database (GPDB) using the Postgres psql command line client.
To connect to the ‘greenplum’ database as the GPDB super-user ‘gpadmin’, running at the IP address assigned by AWS:
$ psql -h <ip_address> -d greenplum -U gpadmin
psql (14.7 (Homebrew), server 9.4.26)
Type "help" for help.
greenplum=# \d
Did not find any relations.
greenplum=# select current_date;
date
------------
2023-05-04
greenplum=# \q
Object Storage Data
That’s GPDB provisioning taken care of. The next stage is to connect to object storage.
As this demo uses AWS, data will be stored in the AWS Simple Storage Service, or S3 for short.
The demo data is UK ‘Price Paid Data’ (PPD) provided by the Land Registry.
There are 3 datasets of different sizes:
- current month (March ‘23) – 107,948 records
- previous year (2022) – 843,730 records
- everything (1995-2023) – 28,205,961 records
The files are stored in a standard S3 bucket in us-east-1.
Accessing External Data via PXF
The ‘glue’ that allows us to query S3 data from Greenplum is the Greenplum Platform Extension Framework (PXF).
PXF allows us to create external tables in Greenplum that map to data stored in S3:
create external table uk_price_paid_mar_23_ext
(transaction_id varchar(38)
,sale_price integer
,transfer_date date
,postcode varchar(8)
,property_type char(1)
,new_indicator char(1)
,duration char(1)
,primary_name varchar(85)
,secondary_name varchar(45)
,street varchar(56)
,locality varchar(35)
,town varchar(22)
,district varchar(38)
,county varchar(35)
,category char(1)
,record_status char(1))
location ('pxf://vldb-data-us-east-1/landregistry/pp-monthly-mar-23.csv?profile=s3:csv&server=s3') format 'csv'
;
The ‘location’ clause binds the external table definition to S3 via the PXF ‘secret sauce’: the URL names the S3 bucket and object key, the PXF profile to use (s3:csv), and the named PXF server configuration (server=s3).
Similar DDL is used for the 2022 data and the 1995-2023 data, giving 3 external tables:
- uk_price_paid_mar_23_ext (March ‘23)
- uk_price_paid_2022_ext (2022)
- uk_price_paid_ext (1995-2023)
These are external tables that map directly to unaltered CSV files stored in S3.
No data has been loaded to Greenplum Database (GPDB) at this stage.
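Before running anything heavier, a quick sample query is a useful sanity check that PXF can reach the bucket and parse the CSV; a minimal sketch, using a few columns from the DDL above:
-- fetch a handful of rows via PXF; errors here usually point at
-- S3 access or the PXF server configuration
select transaction_id
,sale_price
,transfer_date
,postcode
from uk_price_paid_mar_23_ext
limit 5
;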
Querying S3 Data
A simple ‘select count(*)’ issued via psql tests data access from GPDB, through the external tables, to S3.
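For reference, each test is just a full scan via psql, with the \timing meta-command switched on to report elapsed time, e.g. for the March ‘23 table:
\timing on
-- full scan of the March ‘23 data in S3 via the PXF external table
select count(*) from uk_price_paid_mar_23_ext;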
The row counts and elapsed times are as follows:
- March ‘23 – 107,948 rows, average 1 second
- 2022 – 843,730 rows, average 2 seconds
- 1995-2023 – 28,205,961 rows, average 36 seconds
All timings are averaged across 3 runs.
Load S3 Data to GPDB
The GPDB internal database storage consists of EBS volumes attached to EC2 instances.
In theory, internal storage offers greater I/O bandwidth than external S3 storage. Let’s see, shall we?
Data is easily loaded from S3 to GPDB internal storage by running a simple ‘create table as’ (CTAS) SQL query:
create table uk_price_paid as (
select *
from uk_price_paid_ext)
distributed by (transaction_id)
;
Yes folks, that’s all it took to copy an entire table from external S3 storage to internal EBS storage: a simple CTAS query.
Note the ‘distributed by’ clause. As the data is now internal to GPDB, rather than stored in S3, the distribution of data across the database segments (‘sharding’) can, and should, be specified.
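A quick way to verify that the chosen distribution key spreads the rows evenly is to count rows per segment using Greenplum’s gp_segment_id system column:
-- count rows per Greenplum segment to check the distribution is even
select gp_segment_id
,count(*) as row_count
from uk_price_paid
group by gp_segment_id
order by gp_segment_id
;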
A CTAS was performed against all 3 external tables to create internal versions. The row counts and elapsed times are as follows:
- March ‘23 – 107,948 rows, average 1 second
- 2022 – 843,730 rows, average 2 seconds
- 1995-2023 – 28,205,961 rows, average 36 seconds
Once again, all timings are averaged across 3 runs.
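One housekeeping note: after a bulk load like this, it’s good practice to gather optimiser statistics on the newly created internal tables before querying them:
-- collect table and column statistics for the query optimiser
analyze uk_price_paid;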
Query Data Stored Internally
A simple ‘select count(*)’ via psql tests GPDB internal storage performance.
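Again, with \timing enabled in psql, e.g. for the full 1995-2023 table:
-- full scan of the internal (EBS-backed) copy of the data
select count(*) from uk_price_paid;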
The row counts and elapsed times are as follows:
- March ‘23 – 107,948 rows, average 0.5 seconds
- 2022 – 843,730 rows, average 0.6 seconds
- 1995-2023 – 28,205,961 rows, average 2 seconds
The biggest improvement is for the largest table – the 28m row 1995-2023 dataset.
A full table/file scan drops from 36 seconds when the data is stored in S3, to 2 seconds when the data is stored internally.
Unsurprisingly, internal EBS storage is significantly faster than external S3 storage.
Once again, all timings are averaged across 3 runs.
It’s A Wrap
In summary, this blog article has shown:
- provisioning GPDB instances on AWS using Terraform
- using PXF to query data in S3
- loading data from S3 to EBS storage using simple SQL
- improvement in query performance against internal storage compared to S3
So, if you want to take advantage of OSS analytic software, why not give Greenplum a try?
As always, team VLDB are here to help.
Enjoy!
Author: Paul Johnson