Trying to go to bed smarter than when I woke up.
Vertica is a modern analytical database that’s designed for querying and working with large datasets in an analytical capacity. In addition to its power, it has a very easy learning curve: it’s query frontend is modeled off of PostgreSQL’s dialect.
When working with Vertica, I often find myself needing to export full tables from a PostgreSQL database, and load the data into a Vertica cluster. I’ve had to do this a couple of times now, so here are some tips on how to make this process painless and smooth:
Test loading a small dataset first
Before exporting an entire table from postgres, export a subset of it to make sure it’ll work
If there are any errors in your pipeline, you’ll most likely encounter them in the first 100 thousand rows. Then test loading the data to make sure thre are no errors. This works every time and it’s a lot faster to test loading a small subset of rows first.
Use REJECTFILE to trace the errors.
REJECTFILE will show you which rows were not inserted.
If you’re handling over a few million rows, use
COPY DIRECT. That way, the data essentially goes straight to disk.
Split the file into smaller parts and load in parallel
First, split up your file like this:
Once your file is split up, you can load it to vertica using multiple processor threads: