Greenplum Provisioning
If you have wrestled with Open Source Software (OSS), you may be familiar with perhaps its biggest ‘gotcha’: OSS isn’t always easy to tame.
The price tag might appeal, but the flip side is often the difficulty of provisioning, deploying, configuring, optimising and securing OSS.
Greenplum is no exception. It is far from trivial to provision.
However, once you’ve been through the Greenplum provisioning process a few hundred times, it gets a lot easier. No, really!
Object Storage For Analytics
The current vogue in the analytics world is to store & analyse data in object storage.
The aim of this blog article is to show that it is possible to provision a cloudy Greenplum instance quickly & reliably, and that the provisioned Greenplum instance can query data resident in external object storage.
Provisioning A Greenplum Cluster
The first step is to create a Greenplum cluster on your preferred cloud platform, in this case AWS.
The current Greenplum v7 installation guide runs to over 2,000 pages. “There’s no such thing as a free lunch”, after all.
Thankfully, we’ve been installing Greenplum on all manner of platforms for well over a decade, so we won’t be needing the install guide.
The prerequisites are a suitable Linux machine image and an infrastructure-as-code (IaC) script to ‘do the doing’.
Our first version of this process used the AWS CloudFormation Template (CFT) approach.
However, we’ve since re-developed the Greenplum provisioning process using Terraform.
Terraform GPDB Provisioning
It usually takes about 1 minute to provision a single-node Greenplum instance on AWS:
$ terraform validate
Success! The configuration is valid.
$ terraform init
Initializing the backend...
<snip>
Terraform has been successfully initialized!
$ terraform plan
<snip>
Plan: 19 to add, 0 to change, 0 to destroy.
<snip>
$ terraform apply
<snip>
Apply complete! Resources: 19 added, 0 changed, 0 destroyed.
Outputs:
GP_Admin_Password = "gpdbdemo"
GP_Database_Name = "greenplum"
Master_Host_IP = "<ip_address>"
The GPDB super-user password and default database name shown above are Terraform variables. The IP address is assigned by AWS.
In this case the instance consists of the following:
- 1 x r5dn.2xlarge EC2 node ($0.668/hr on demand at the time of writing)
- 8 x vCPU
- 64GB RAM
- 4 x GPDB segments (shards)
Once provisioning is complete, the instance is visible via the AWS console.
Check The New Instance
We can also SSH to the new Greenplum instance as the AWS Linux ec2-user to check all is well:
$ ssh -i <pem_file> ec2-user@<ip_address>
Once the Greenplum software has been successfully installed, the following message will be visible on login:
<snip>
Greenplum Database Cluster (GPDB) generated by VLDB.
[ec2-user@mdw ~]$
Connect Remotely via psql
The next step is to connect to the Greenplum Database (GPDB) using the Postgres psql command line client.
To connect to the ‘greenplum’ database as the GPDB super-user ‘gpadmin’, running at the IP address assigned by AWS:
$ psql -h <ip_address> -d greenplum -U gpadmin
psql (14.7 (Homebrew), server 9.4.26)
Type "help" for help.
greenplum=# \d
Did not find any relations.
greenplum=# select current_date;
date
------------
2023-05-04
greenplum=# \q
Object Storage Data
That’s GPDB provisioning taken care of. The next stage is to connect to object storage.
As this demo uses AWS, data will be stored in the AWS Simple Storage Service, or S3 for short.
The demo data is UK ‘Price Paid Data’ (PPD) provided by the Land Registry.
There are 3 datasets of different sizes:
- current month (March ‘23) – 107,948 records
- previous year (2022) – 843,730 records
- everything (1995-2023) – 28,205,961 records
The files are stored in a standard S3 bucket in us-east-1.
Accessing External Data via PXF
The ‘glue’ that allows us to query S3 data from Greenplum is the Greenplum Platform Extension Framework (PXF).
PXF allows us to create external tables in Greenplum that map to data stored in S3:
create external table uk_price_paid_mar_23_ext
(transaction_id varchar(38)
,sale_price integer
,transfer_date date
,postcode varchar(8)
,property_type char(1)
,new_indicator char(1)
,duration char(1)
,primary_name varchar(85)
,secondary_name varchar(45)
,street varchar(56)
,locality varchar(35)
,town varchar(22)
,district varchar(38)
,county varchar(35)
,category char(1)
,record_status char(1))
location ('pxf://vldb-data-us-east-1/landregistry/pp-monthly-mar-23.csv?profile=s3:csv&server=s3') format 'csv'
;
The ‘location’ clause binds the external table definition to S3 via the PXF ‘secret sauce’: the URL names the S3 bucket and object key, the PXF profile to use (s3:csv), and the named PXF server configuration (server=s3).
Similar DDL is used for the 2022 data and the 1995-2023 data, giving 3 external tables:
- uk_price_paid_mar_23_ext (March ‘23)
- uk_price_paid_2022_ext (2022)
- uk_price_paid_ext (1995-2023)
These are external tables that map directly to unaltered CSV files stored in S3.
No data has been loaded to Greenplum Database (GPDB) at this stage.
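Before running anything heavier, a quick sample query is a useful sanity check that PXF can reach the bucket and parse the CSV; a minimal sketch, using a few columns from the DDL above:
-- fetch a handful of rows via PXF; errors here usually point at
-- S3 access or the PXF server configuration
select transaction_id
,sale_price
,transfer_date
,postcode
from uk_price_paid_mar_23_ext
limit 5
;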
Querying S3 Data
A simple ‘select count(*)’ issued via psql tests data access from GPDB, through the external tables, to S3.
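For reference, each test is just a full scan via psql, with the \timing meta-command switched on to report elapsed time, e.g. for the March ‘23 table:
\timing on
-- full scan of the March ‘23 data in S3 via the PXF external table
select count(*) from uk_price_paid_mar_23_ext;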
The row counts and elapsed times are as follows:
- March ‘23 – 107,948 rows, average 1 second
- 2022 – 843,730 rows, average 2 seconds
- 1995-2023 – 28,205,961 rows, average 36 seconds
All timings are averaged across 3 runs.
Load S3 Data to GPDB
The GPDB internal database storage consists of EBS volumes attached to EC2 instances.
In theory, internal storage offers greater I/O bandwidth than external S3 storage. Let’s see, shall we?
Data is easily loaded from S3 to GPDB internal storage by running a simple ‘create table as’ (CTAS) SQL query:
create table uk_price_paid as (
select *
from uk_price_paid_ext)
distributed by (transaction_id)
;
Yes folks, that’s all it took to copy an entire table from external S3 storage to internal EBS storage: a simple CTAS query.
Note the ‘distributed by’ clause. As the data is now internal to GPDB, rather than stored in S3, the distribution of data across the database segments (‘sharding’) can, and should, be specified.
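A quick way to verify that the chosen distribution key spreads the rows evenly is to count rows per segment using Greenplum’s gp_segment_id system column:
-- count rows per Greenplum segment to check the distribution is even
select gp_segment_id
,count(*) as row_count
from uk_price_paid
group by gp_segment_id
order by gp_segment_id
;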
A CTAS was performed against all 3 external tables to create internal versions. The row counts and elapsed times are as follows:
- March ‘23 – 107,948 rows, average 1 second
- 2022 – 843,730 rows, average 2 seconds
- 1995-2023 – 28,205,961 rows, average 36 seconds
Once again, all timings are averaged across 3 runs.
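One housekeeping note: after a bulk load like this, it’s good practice to gather optimiser statistics on the newly created internal tables before querying them:
-- collect table and column statistics for the query optimiser
analyze uk_price_paid;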
Query Data Stored Internally
A simple ‘select count(*)’ via psql tests GPDB internal storage performance.
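Again, with \timing enabled in psql, e.g. for the full 1995-2023 table:
-- full scan of the internal (EBS-backed) copy of the data
select count(*) from uk_price_paid;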
The row counts and elapsed times are as follows:
- March ‘23 – 107,948 rows, average 0.5 seconds
- 2022 – 843,730 rows, average 0.6 seconds
- 1995-2023 – 28,205,961 rows, average 2 seconds
The biggest improvement is for the largest table – the 28m row 1995-2023 dataset.
A full table/file scan drops from 36 seconds when the data is stored in S3, to 2 seconds when the data is stored internally.
Unsurprisingly, internal EBS storage is significantly faster than external S3 storage.
Once again, all timings are averaged across 3 runs.
It’s A Wrap
In summary, this blog article has shown:
- provisioning GPDB instances on AWS using Terraform
- using PXF to query data in S3
- loading data from S3 to EBS storage using simple SQL
- improvement in query performance against internal storage compared to S3
So, if you want to take advantage of OSS analytic software, why not give Greenplum a try?
As always, team VLDB are here to help.
Enjoy!
Author: Paul Johnson