Fast Uploads to AWS Redshift with Pentaho Kettle

Although it’s blisteringly fast at large, aggregate queries, Amazon Redshift is horribly slow at insert/update transactions. Even the newer, faster instances only manage 3 or 4 writes a second. If you’re only uploading a few hundred rows at a time then Kettle’s Table output step is probably fine, but if you’re trying to add 500,000 rows to a Redshift table from a Pentaho Kettle job then you’re looking at around 36 hours to write all the rows.

The Redshift documentation suggests that for bulk uploads you should upload your data as CSV files to S3 and then use Redshift’s COPY command to load them into your table.
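As a sketch, the COPY command for a single file looks something like this — the table name, bucket, key and IAM role are placeholders of mine, not details from the post:

```sql
-- Minimal sketch of a Redshift COPY from S3; all names are placeholders.
-- The cluster needs an attached IAM role (or credentials) with read
-- access to the bucket.
COPY my_table
FROM 's3://my-bucket/exports/my_data.csv'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
CSV;
```

Because the files will be written without a header row, no IGNOREHEADER option is needed.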

It’s actually quite easy to do all of this in one go in a job in Kettle.

First, create a simple transform to set a system variable to today’s date. We’ll use this later to set file names.

The next thing to do is to build whatever transformation you need to get your data into the correct schema for your Redshift table, writing the output with the S3 File Output step. Append the system variable from the previous transform to the file name, and don’t set a header row.
The fields must be in the same order as the columns in the table you’ll be copying to.

If you’re going to be uploading more than a million or so rows, select the “Split every…rows” option and set it to 500,000. This will speed things up later: the COPY command loads files in parallel, so ten files of 500,000 rows will load faster than one file of 5,000,000 rows.
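The split files can all be loaded with a single command, because COPY treats the FROM location as a prefix and loads every matching file in parallel. A hedged sketch, with placeholder names:

```sql
-- Sketch: the prefix matches all of the split files
-- (e.g. my_data_0.csv, my_data_1.csv, ...) and Redshift
-- loads them in parallel across its slices.
COPY my_table
FROM 's3://my-bucket/exports/my_data_'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
CSV;
```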

At this point you should test that your output files import into your Redshift table correctly: run the transform on a small data set (use a LIMIT clause if you’re getting the data from a query) and inspect the file that gets written to S3. Then try manually running the COPY command to import the data into Redshift.

Once you’ve successfully copied a file to your table, you can go ahead and build the whole thing into a job, using the Execute SQL Script step to copy the files to Redshift. Use the system variable from the first step in the SQL query so that you only copy the files created during this run. Old files can be left in S3 as an archive and deleted periodically.
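The script run by the Execute SQL Script step might look something like this. The ${TODAY} variable name is my assumption for whatever the first transform set, the other names are placeholders, and variable substitution has to be enabled on the step for ${TODAY} to be expanded:

```sql
-- Sketch: copy only the files written during this run, using the date
-- variable set by the first transform. ${TODAY} is substituted by Kettle
-- when variable substitution is enabled on the Execute SQL Script step.
COPY my_table
FROM 's3://my-bucket/exports/my_data_${TODAY}_'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
CSV;
```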

New data gets added to the end of the table, so if your new data is ordered by the sort key of your table you can add a VACUUM command to the end of the SQL script; because the new rows are already in sort-key order, the VACUUM will run quickly.
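Appended to the end of the script it would look like this (placeholder table name):

```sql
-- Sketch: re-sort the table and reclaim deleted space after the load.
VACUUM my_table;
```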

In the job above you can see I’ve added success/failure emails. I’ve also added a check database connections step at the beginning to make sure the Redshift connection and any other database connections are available before doing anything else.

Connecting Google Data Studio to Amazon Redshift

It’s now possible to connect Google Data Studio to Amazon Redshift.

Google and Amazon both have complete suites of analytics tools that seamlessly connect within their own ecosystems, but getting them to work with each other can be a pain.

Amazon’s data warehouse solution – Redshift – is widely used and works brilliantly with Amazon QuickSight, but, in my opinion at least, Google Data Studio is a better data visualisation tool, and it’s free.

I was already using Redshift to power various reports in Tableau, but we needed a way to share a few basic KPIs without having to pay for a load of extra Tableau licences. Data Studio isn’t even close to being as fully featured and powerful as Tableau yet, but it’s improving constantly.

Amazon changed the Redshift Postgres connector recently, so it’s now possible to connect Data Studio to a Redshift database using the standard Data Studio PostgreSQL connector (Redshift is based on Postgres).

It’s really simple to set up, provided you remember 2 things:

  1. Your Redshift cluster must be publicly accessible
  2. The username and password you enter into Data Studio need to be for an IAM user with full access to the Redshift cluster

Here’s a screenshot of my Data Studio configuration:

[Screenshot: Data Studio setup for Redshift]

And here’s one of my Redshift cluster setup:

[Screenshot: Redshift cluster setup for Data Studio]

Once you’ve got them connected, Redshift functions like any other data source in Data Studio.