Intro
I vividly remember a time when I had to load a very large amount of data into Postgres and it took forever - it simply didn't want to finish, even after hours.
In case you're in a similar situation, you've come to the right place. After reading this post you'll know all your options for getting out of it. Using a worked example, I'll show you how it's possible to start at a runtime of over half an hour and end up with a version that's done in half a minute. Step by step, we'll get it ~60x faster.
Tables
If you've checked out my other posts, you'll recognise the setup; it's the same one I used here. We have meters and their readings, stored in their own tables:
create table meters
(
id uuid primary key
);
create table readings
(
id uuid primary key,
meter_id uuid,
reading double precision,
date date,
constraint fk__readings_meters foreign key (meter_id) references meters (id)
);
I'm using Postgres on my laptop (Apple MacBook Pro M1 and 32GB RAM).
We'll be inserting 15,000 meters, each with one reading per day, for 5 years - roughly 27 million readings in total.
insert into meters
select uuidv4() from generate_series(1, 15000) seq;
insert into readings(id, meter_id, date, reading)
select uuidv4(), m.id, seq, random() from generate_series('2019-02-01'::date, '2024-02-01'::date, '1 day'::interval) seq, meters m;
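If you want to follow along and measure the runs yourself, psql's built-in \timing switch is an easy way to do it:
\timing on
-- psql will now print the duration ("Time: ... ms") after every statement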
I'm letting this run, and after a (rather long) while, it's done, in 2098.31s.
It's just our starting point; we'll get it much faster. Buckle up!
UUID v7
The first thing I'm going to do is focus on the primary key. UUID v4 is not a great choice if we care about performance. I elaborated on why that is in another post, but the gist is that its randomness causes a lot of page thrashing in the B-tree indexes. The database has to do a lot of work to keep the tree balanced after every inserted tuple.
But, no worries. Recently (in version 18), Postgres got the ability to work with UUID v7! They are time-sortable, which means insertions will "affect" only an isolated part of the B-tree (if you'd visualise it, it's the right side) - much less work for the database. Let's give this a try.
insert into meters
select uuidv7() from generate_series(1, 15000) seq;
insert into readings(id, meter_id, date, reading)
select uuidv7(), m.id, seq, random() from generate_series('2019-02-01'::date, '2024-02-01'::date, '1 day'::interval) seq, meters m;
Check this out - with this one change, we've cut the time by more than half! It finished in 821.50s. So far so good.
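If you'd like to convince yourself that v7 UUIDs really are time-ordered, here's a quick check you could run (assuming Postgres 18, which provides uuidv7(), and the uuid_extract_timestamp() function):
select u, uuid_extract_timestamp(u) as ts
from (select uuidv7() as u from generate_series(1, 5)) g
order by u;
-- the embedded timestamps are non-decreasing, matching the sort order of the UUIDs themselves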
Numeric IDs
Let's try something else. The UUIDs themselves have to be generated before insertion, which naturally takes some time. Let's replace the UUID primary key with a numeric one, which doesn't have to be generated but simply read from a sequence. In addition, the corresponding data type (bigint) is half the size of a UUID (8 bytes vs 16). Let's see where this brings us.
create table meters
(
id bigint primary key
);
create table readings
(
id bigint primary key generated always as identity,
meter_id bigint,
reading double precision,
date date,
constraint fk__readings_meters foreign key (meter_id) references meters (id)
);
Here's the new script:
insert into meters
select seq from generate_series(1, 15000) seq;
insert into readings(meter_id, date, reading)
select m.id, seq, random() from generate_series('2019-02-01'::date, '2024-02-01'::date, '1 day'::interval) seq, meters m;
This gets us a little further indeed! We're at 646.73s - a bit over 10 minutes. This is great, but we've still got work to do: we still need to get it about 20 times faster than this.
For now, let's move on from adjusting the table structure, as it's efficient enough, and get our hands dirty with some configuration tuning.
Shared buffers
Postgres uses shared buffers to make reads and writes more efficient. This is an in-memory cache of 8 kB pages which get read and updated there, in order to avoid doing slower disk operations all the time.
If we don't size the shared buffers correctly, we will have a lot of so-called evictions to make room for all the data to be processed. This will make the process slower than it could be. The default setting of 128 MB is too low for our workload, so I'll increase it to 2 GB. This way all the data will fit.
alter system set shared_buffers='2GB';
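One thing to keep in mind: shared_buffers only takes effect after a server restart. Once Postgres is back up, you can confirm the new value:
show shared_buffers;
-- should now report 2GB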
As expected, this brought us closer to our goal. We're now at 595.33s. Onwards!
Full page writes
Postgres works with 8 kB pages, but the OS and the disk do not. In the event of a power failure, this can lead to pages being only partially written. This is a headache from the perspective of data recovery, because Postgres relies on pages not being corrupted in any way when it runs its recovery protocol. The solution Postgres employs is that after every checkpoint, at the first update of a page, the full page is written to the WAL instead of only the change.
Turning this off is not something I recommend for production use, but just for the sake of experimentation and on a temporary basis, let's shut it off.
alter system set full_page_writes=off;
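Unlike shared_buffers, this one doesn't require a restart; a configuration reload is enough:
select pg_reload_conf();
show full_page_writes;
-- should now report off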
Hmm, well, it got us to 590.01s. It's not that much, but we'll take it!
Constraints
Next up, we'll remove the table constraints and see what happens. From the script above, I'll remove the definition of the fk__readings_meters foreign key constraint. This is expected to give a boost because Postgres no longer has to check for the presence of the key in the other table for every tuple inserted.
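If your tables already exist with the constraint in place, the usual pattern is to drop it before the load and re-add it afterwards; validating all rows in one pass at the end is typically much cheaper than checking them one by one. A sketch:
-- before the load
alter table readings drop constraint fk__readings_meters;
-- ... run the insert script ...
-- after the load, put the constraint back (this validates all rows in one go)
alter table readings add constraint fk__readings_meters foreign key (meter_id) references meters (id);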
Quite a difference this made! We're now at 150.71s.
Indexes
What I'll do next is remove the indexes as well. This means we no longer pay the penalty of updating the index for every inserted tuple. Just as with the constraint removal above, you can always recreate the indexes once the data loading is finished - I'll sketch that below, after the table definitions.
These are my tables now.
create table meters
(
id bigint
);
create table readings
(
id bigint,
meter_id bigint,
reading double precision,
date date
);
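Once the load is done, the indexes can be rebuilt in bulk, which is far cheaper than maintaining them for every inserted row. A sketch of what that could look like (the index name below is just an example, not part of the original setup):
-- rebuild the primary key on meters after the bulk load
alter table meters add primary key (id);
-- and any secondary indexes you need on readings, for example:
create index readings_meter_id_date_idx on readings (meter_id, date);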
I've run the same import script as above, and now we're at 109.51s. Great stuff! Can we get it under 100s?
Unlogged tables
Sure thing! But we'll have to make some more concessions. For the rest of the experiment I'll be using unlogged tables, which skip the write-ahead log entirely - that saves a lot of I/O, but it also means the tables are not crash-safe. Again, not a setting to keep on for production beyond strictly the data loading procedure.
create unlogged table meters
(
id bigint
);
create unlogged table readings
(
id bigint,
meter_id bigint,
reading double precision,
date date
);
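By the way, you don't necessarily have to recreate the tables: an existing table can be flipped to unlogged for the load and back afterwards. Keep in mind that switching back to logged rewrites the table into the WAL, so it's not free:
-- switch an existing table to unlogged for the load ...
alter table readings set unlogged;
-- ... and back once the data is in
alter table readings set logged;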
I'm now at 40.65s. Believe it or not, we're not done here yet.
Copy
The copy command is the king of speed when it comes to data loading. Let's give it a go; here's how it works.
\copy readings from '<path_to_file>/readings.csv' delimiter ',';
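This of course assumes the data already sits in a file. In case you want to reproduce the experiment, one way to produce it is to write out the same generated rows with \copy in the other direction (a sketch; the column order has to match the readings table, and I'm leaving the id column empty since there are no constraints anymore):
-- write the generated rows out in readings column order (id, meter_id, reading, date)
\copy (select null::bigint, m.id, random(), seq::date from generate_series('2019-02-01'::date, '2024-02-01'::date, '1 day'::interval) seq, meters m) to '<path_to_file>/readings.csv' delimiter ','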
The load itself finishes in 35.41s. Pff, amazing!
Here are all our results in one view:
Starting point (UUID v4 keys): 2098.31s
UUID v7 keys: 821.50s
Numeric (bigint) keys: 646.73s
shared_buffers at 2 GB: 595.33s
full_page_writes off: 590.01s
Foreign key constraint removed: 150.71s
Indexes removed: 109.51s
Unlogged tables: 40.65s
copy: 35.41s
That's quite a difference from when we started out. I want to add that I've experimented with checkpoint tuning as well. It didn't yield any notable improvements for this experiment. Nonetheless, you might want to keep it in mind as it can affect performance if misconfigured.
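For reference, the knobs I mean here are things like max_wal_size and checkpoint_timeout. Spacing checkpoints further apart during a bulk load could look roughly like this (illustrative values, not a recommendation):
alter system set max_wal_size = '8GB';
alter system set checkpoint_timeout = '30min';
select pg_reload_conf();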
Thanks for reading!