
PG15 Beta 4, Primary Key Options, Sequence Limits, Configuration | Scaling Postgres 232


In this episode of Scaling Postgres, we discuss the release of Postgres 15 Beta 4, different primary key options, sequence limits and how to examine your Postgres configuration.


Transcript

All right. I hope you, your friends, family, and coworkers continue to do well. Our first piece of content is "PostgreSQL 15 Beta 4 Released!". This is from postgresql.org and basically, they've just released a new beta of Postgres 15. They list a number of changes from Beta 3, including that the SQL/JSON features that were planned for 15 have been removed. This was mentioned last week in one of the posts, and there's a suggestion in one of this week's posts that it may have been related to a performance regression. But if you want to learn more, check out this blog post.

Next piece of content- "Choosing a Postgres Primary Key". This is from supabase.com and they're talking about how you should go about choosing a primary key for your tables. Now usually the first thing that people reach for or use is integers or big integers. And Postgres has a very convenient way to set these and have them auto increment by using serial and big serial and using that as the data type when you create the table and set it as the primary key, as you can see down here. Now what that does is it creates it as an integer primary key, but it also creates a sequence that gives you auto-incrementing values. Now you can also create an auto-incrementing ID with integers using the SQL standard, which is "KEY GENERATED BY DEFAULT AS IDENTITY". And a lot of people recommended using this going forward because it is the SQL standard. But serial and big serial still work in Postgres. They also discussed potentially using random IDs for your primary key and the benefit of that is that someone has less information about your database.

They also discussed potentially using random IDs for your primary key, and the benefit of that is that someone has less information about your database. For example, if you show certain user IDs as part of your URLs, it'd be hard to guess what the next one is. Now, if you're going to use random IDs, they point out it's important to use securely random IDs, and whenever you start talking about random IDs, you lead into UUIDs. They focused on discussing two different types of UUIDs. Version 1, which is presumably the original one, is actually constructed from a timestamp based on the Gregorian calendar and a MAC address. So as you can tell, it's not fully random, because it embeds a date/time and a MAC address, but it does give you what it says: a unique identifier. Version 4 UUIDs, on the other hand, use essentially all of the bits for randomness. Generally, this is the version you've probably been using for a while if you're using UUIDs: the fully random version 4. In Postgres, I believe as of version 13, there's a built-in way to generate these, and for prior versions you could add an extension to do it.
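For reference, gen_random_uuid() is built into Postgres 13 and later; on earlier versions the same function comes from the pgcrypto extension (uuid-ossp is another option). A quick sketch:

```sql
-- Postgres 13+: built in. Earlier versions: enable pgcrypto first.
CREATE EXTENSION IF NOT EXISTS pgcrypto;

SELECT gen_random_uuid();

-- Using a random (version 4) UUID as a primary key default:
CREATE TABLE items (
    id uuid PRIMARY KEY DEFAULT gen_random_uuid(),
    created_at timestamptz NOT NULL DEFAULT now()
);
```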

Now, relatively recently there's been an explosion of different UUID types, and versions 6, 7, and 8 were published as IETF drafts in 2021. The reason is that fully random UUIDs can cause a lot of problems: they sort terribly, and because inserts land all over the index rather than at the end, indexing them as they're written generates a lot of random I/O. So a lot of organizations have developed new versions that I would call pseudorandom, primarily using time for the leading portion. And what the IETF did was establish these three additional versions: 6, 7, and 8.

Now, along with this post, I did some googling and found this additional reference- "Analyzing New Unique Identifier Formats (UUIDv6, UUIDv7, and UUIDv8)". This is from blog.devgenius.io and he gives a very good briefing on the differences. UUIDv6 is basically a simple reordering of the bits within the UUID to allow it to be sorted as an opaque sequence of bytes. Essentially, they took UUID version 1 and reordered the timestamp so that its most significant bits come first, which makes it much easier to sort compared to version 1. UUIDv7 is a new time-based UUID: the first part of it is time-based and the latter part is random, with the added benefit that the timestamp is based on the Unix epoch.

So that's much more familiar to a lot of people compared to the Gregorian-calendar-based version 6. So basically, UUIDv7 is probably your go-to if you want to use something like this in a database and 100% randomness is not important. It's a pseudorandom identifier: the first part sorts by time and the latter part is a random sequence of bits. And UUID version 8 offers a lot of flexibility in how you build it out. Going back to the original post, you can see what they mention here: UUIDv6 has 62 bits of Gregorian time and then some 48 bits of randomness, UUIDv7 has a 36-bit big-endian Unix timestamp and then variable randomness of up to 62 bits, and version 8 is more variable. So basically, if you can't find a use case for version 7 or 6, then you should use version 8.
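Postgres has no built-in UUIDv7 generator as of the versions discussed here, but to illustrate the layout, here's a minimal PL/pgSQL sketch that packs a 48-bit millisecond Unix timestamp into the front and fills the rest with random bytes from pgcrypto; treat it as a demonstration of the idea, not a vetted implementation:

```sql
CREATE EXTENSION IF NOT EXISTS pgcrypto;

CREATE OR REPLACE FUNCTION uuid_generate_v7() RETURNS uuid AS $$
DECLARE
  unix_ts_ms bytea;
  uuid_bytes bytea;
BEGIN
  -- 48-bit big-endian millisecond timestamp (drop the top 2 bytes of the int8)
  unix_ts_ms := substring(int8send((extract(epoch FROM clock_timestamp()) * 1000)::bigint) FROM 3);
  -- append 10 random bytes to fill out the 16-byte UUID
  uuid_bytes := unix_ts_ms || gen_random_bytes(10);
  -- stamp the version nibble (7) and the variant bits (10)
  uuid_bytes := set_byte(uuid_bytes, 6, (b'0111' || get_byte(uuid_bytes, 6)::bit(4))::bit(8)::int);
  uuid_bytes := set_byte(uuid_bytes, 8, (b'10' || get_byte(uuid_bytes, 8)::bit(6))::bit(8)::int);
  RETURN encode(uuid_bytes, 'hex')::uuid;
END;
$$ LANGUAGE plpgsql VOLATILE;

SELECT uuid_generate_v7();
```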

Now, to go to a different post that I think is also relevant if you want to learn more about this: a post from 2018 from 2ndQuadrant called "Sequential UUID Generators". Here, they're talking about an extension that was developed called sequential-uuids, which does a similar thing to what the standards bodies have since done with versions 6, 7, and 8: putting a sequential portion at the front of the UUID and then some randomness after it. What they found is that with completely random UUIDs, your transactions per second fall off a lot as the data set grows, because writing them to disk involves a lot of random I/O. You can see how the transactions per second drop precipitously. In addition, you're going to write a lot more WAL, and you're going to have a lot more full-page images being written to the WAL.
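Before getting to the benchmark numbers, here's a quick look at that extension's interface as described in the post: it provides two generator functions, one driven by a sequence and one driven by time (check the extension's README for the exact signatures and defaults in your version):

```sql
-- The UUID prefix advances with a sequence: a new prefix block
-- every so many generated values.
CREATE SEQUENCE uuid_seq;
SELECT uuid_sequence_nextval('uuid_seq'::regclass);

-- The UUID prefix advances with time: a new prefix block
-- every interval (60 seconds by default, per the post).
SELECT uuid_time_nextval();
```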

Back to the benchmarks: you have a lot of disk activity going on when you're using fully random UUIDs, whereas if they're more sequential, you don't create as many full-page images in the WAL. And here's another example with a medium-sized data set and then a large data set, and you can see how fully random UUIDs, and even ones with only a small sequential portion, write a lot of full-page images to the WAL. So in Postgres, using fully random IDs can cause a lot of performance problems, and you can reference this post if you want to learn more about that. But you might be saying, 'Okay, this sounds great, how would I get started using something like this?'. Well, there is a pg_idkit extension that includes the new version 6 and version 7 formats, as well as a host of other identifier formats developed primarily by other companies, and you can just click here to get the extension to use it in Postgres. The original post follows up by talking about the generation speed of the different identifiers and how much space it takes to store them.

And really, the identifiers that take longer to generate are usually smaller in size, whereas those that are fast to generate tend to be a little bit larger, but there's not a significant difference in my opinion. So in terms of what you should use: basically, serial or bigserial is probably the way to go. My personal opinion is that I would always go bigserial, because if your database is small, it can easily handle a bigserial; it's not going to take up much room. And once you have a super large database, you're going to run out of the normal integer range anyway and have to migrate. So from the start, I would advocate using bigints or bigserials.
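If you do find yourself needing to widen an existing integer key later, the basic migration is short (the accounts table here is hypothetical, and note the ALTER rewrites the table under an exclusive lock, so large production tables usually need a more careful online approach):

```sql
-- Hypothetical table with a plain integer key that's filling up:
CREATE TABLE accounts (id serial PRIMARY KEY, email text);

-- Widening the column rewrites the table; plan for downtime on anything large.
ALTER TABLE accounts ALTER COLUMN id TYPE bigint;

-- The backing sequence was created as integer, so widen it as well (Postgres 10+).
ALTER SEQUENCE accounts_id_seq AS bigint;
```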

Now, in terms of UUIDs, there are a lot of people who like to use them, and with version 7 being more sequential, I think it's potentially more appealing as a primary key. But what I tend to do is only use UUIDs if my unique identifiers are being generated outside of the database. If they're generated inside, I stick with auto-incrementing integers; if something's generated outside, I go ahead and use UUIDs. But this is a very comprehensive post about choosing a primary key, so if you want to learn more, definitely check out this one.

Next piece of content- "ERROR: NEXTVAL: REACHED MAXIMUM VALUE OF SEQUENCE". This is from cybertec-postgresql.com and I assume this is a post to deal with this error where your sequence has reached its maximum value. So basically if you choose to use a serial type or a standard integer four type, it'll create a sequence of that size as well and that can cause problems if you eventually run out of them because now you can't insert any more records because there's nothing left in the sequence. Now, a lot of times you're going to hit this limit when you hit the 2 billion mark in your table.

But if you do frequent deletes from that table, or a lot of inserts that don't complete, you can actually have far fewer rows in your table and still run out of sequence numbers, because you're consuming so many of them. So again, this is another reason to go ahead and use bigserial or bigint from the start and just not worry about it. This post does show you how to see the current value of your sequence using SELECT currval, how to set the sequence to a different value, and how to simulate this particular error, as sketched below. But if you want to learn more, you can check out this blog post.
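As a rough sketch of what's going on (the exact commands in the post may differ), you can simulate hitting the limit yourself:

```sql
-- A serial column is backed by an int4-range sequence (max 2147483647).
CREATE TABLE t (id serial PRIMARY KEY, payload text);

INSERT INTO t (payload) VALUES ('first');
SELECT currval('t_id_seq');  -- current value, after a nextval in this session

-- Jump the sequence to its maximum to simulate exhaustion...
SELECT setval('t_id_seq', 2147483647);

-- ...so the next insert fails with something like:
-- ERROR: nextval: reached maximum value of sequence "t_id_seq" (2147483647)
INSERT INTO t (payload) VALUES ('one too many');
```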

Next piece of content- "PostgreSQL Configuration". This is from proopensource.it and they're talking about all the different configuration parameters in Postgres and basically there are 346 of them. These are from version 14. And if you want to look at all these different parameters, all you have to do at a psql command prompt is type "SHOW ALL;" and give you all the different configuration settings along with the description of what each one is. Now if you want to look at a specific one, you can also specify that particular parameter. So you can say SHOW shared_buffers and I'll give you an output of what the shared buffers are currently set at.

Now, these parameters are also categorized into groups, and she has a query here that looks at the pg_settings system view and breaks out all the different parameter settings by category; a minimal version of that idea is sketched below. And finally, this post closes out by pointing to the website postgresqlCO.NF for looking up information about each parameter. Of course, you can also look at the Postgres documentation, as well as two different configurators that help you get a start on setting the many parameters of Postgres. But if you want to learn more about this, you can definitely check out this blog post.
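Grouping parameters by category out of pg_settings might look like this (the post's actual query may differ):

```sql
-- How many parameters fall into each category?
SELECT category, count(*) AS parameters
FROM pg_settings
GROUP BY category
ORDER BY category;

-- Each parameter with its current value and unit, grouped by category:
SELECT category, name, setting, unit
FROM pg_settings
ORDER BY category, name;
```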

The next piece of content is "Parallel vacuuming in Amazon RDS for PostgreSQL and Amazon Aurora PostgreSQL". This is from aws.amazon.com, and they're talking about parallel vacuum. Basically, vacuum is the process by which you vacuum up dead rows that are left from an update or delete. It doesn't delete them or update them, it actually just marks them for future cleanup, to reclaim that space, to use it for something else, or to actually potentially shrink the table if it's at the end. This post goes over how vacuum works, and it also talks about autovacuum, which basically handles this automatically based upon different parameters on whether a table is ready to be vacuumed yet.

But you can also do manual vacuums, and a manual vacuum offers parallelism; you don't get that with autovacuum. This post goes over the parameters you can set to configure parallel vacuum, as well as the commands to run it. Notably, even if you ask for a parallel vacuum, it won't necessarily run in parallel: parallelism applies to index processing, so if the table doesn't have enough qualifying indexes (you need at least two, and indexes smaller than min_parallel_index_scan_size don't count), it still won't run in parallel even though you told it to. And I should say that even though this talks about RDS and Aurora, it's applicable to Postgres in general as well.
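A quick sketch of the knobs involved (the parallel degree here is just an example, and my_table is a placeholder name):

```sql
-- Request up to 4 parallel workers for the index phase of this vacuum.
VACUUM (PARALLEL 4, VERBOSE) my_table;

-- Indexes smaller than this are skipped for parallel processing.
SHOW min_parallel_index_scan_size;   -- default 512kB

-- Workers are also capped by the maintenance worker pool.
SHOW max_parallel_maintenance_workers;
```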

Now, if you want more detail on this particular topic, you can also check out "5mins of Postgres E34: Making the most of Parallel VACUUM and why SQL/JSON was pushed out to Postgres 16". This is from pganalyze.com. The parallel vacuum post is what Lukas went over in this week's episode, so you can get more insight into the post from him. He also looks into SQL/JSON being removed from Postgres 15, hopefully to come in 16, and what he found from his investigation into the conversation is that it was actually a performance regression, so that's why it was pulled. But if you want to learn more about that, definitely check out his episode.

Next piece of content- "How to Plot an ASCII Bar Chart with SQL". This is from blog.jooq.org, and this is just an interesting little take on how to use a lot of SQL code to generate ASCII charts. So if you want to see what can be done without a graphing library, definitely check out this blog post.

The next piece of content "WAL and checkpoint tuning" was the next episode of Postgres FM, so you can listen to that episode or click here to check the YouTube video on this particular episode.

Next piece of content- "PostgreSQL with Andrew Atkinson". This is from codewithjason.com, and this is another podcast that has a guest who's talking about Postgres. And interestingly, one of the upcoming episodes of The Rubber Duck Dev Show will be having Andrew on as a guest as well, so we'll get to ask him some questions there.

And the last piece of content, the PostgreSQL person of the week is Jeevan Ladhe. If you're interested in learning more about Jeevan and his contributions to Postgres, definitely check out this blog post.
