Fast Uploads to AWS Redshift with Pentaho Kettle

Although it’s blisteringly fast at large, aggregate queries Amazon Redshift is horribly slow at insert/update transactions. Even the new faster instances only get 3 or 4 writes a second, if you’re only uploading a few hundred rows at a time then Kettle’s table output step is probably ok, but if you’re trying to add a 500,000 rows to a Redshift table from a Pentaho Kettle job then you’re looking at 36 hours to write all the rows.

The Redshift documentation suggests that for bulk uploads you should upload your data as CSV files to S3 and then use Redshift’s COPY command to load to your table.

It’s actually quite easy to do all of this in one go in a job in Kettle.

First, create a simple transform to set a system variable to today’s date. We’ll use this later to set file names.

The next thing to do is to build whatever transformation you need to get your data in to the correct schema for your Redshift table using the S3 output step. Use the system variable from the previous transform to append to the file name. Don’t set a header row.
You need to set the fields in the same order as the columns in the table you’ll be copying to.

If you’re going to be uploading more than a million or so rows; select the split every…rows option and set it to 500,000. This will speed things up later; the COPY command loads files in parallel so 10 files of 500,000 rows will load faster than one file of 5,000,000 rows.

At this point you should test that your output files import to your redshift table correctly; run the transform on a small data set (use a LIMIT statement if you’re getting the date from a query) and inspect the file that gets written to S3. Then you can try manually running the COPY command to import the data to Redshift.

Once you’ve successfully copied a file to your table, you can go ahead and build the whole thing in to a job, using the Execute SQL Script step to copy the files to Redshift. You can use the system variable from the first step in the SQL query to copy only the files created during this run. Old files can just be left in S3 as an archive, deleting periodically.

New data gets added to the end of the table, so if your new data is ordered by the Sortkey of your table you can add a VACUUM command to the end of the SQL script.

In the job above you can see I’ve added success/failure emails,I’ve also added a check database connection step at the begining to make sure the Redshift connection and any other database connections are available before doing anything else.