Postgres 14 Beta 1, PG 14 Improvements, Best Primary Keys, Composite Keys | Scaling Postgres 166

In this episode of Scaling Postgres, we discuss the release of Postgres 14 Beta 1, different PG 14 improvements, choosing the best type of primary key and using composite keys.

Transcript

All right. I hope you, your friends, family, and coworkers continue to do well. Our first piece of content is "PostgreSQL 14 Beta 1 Released!". This is from postgresql.org and they're announcing that Beta 1 is now available for you to install and try out. It mentions a number of highlights covering different areas of performance. One that was particularly of interest to me is that the database now handles a large number of connections more efficiently, whether those connections are active or idle, potentially reducing the need for PgBouncer. There have also been enhancements to partitioning and to the Postgres Foreign Data Wrapper, as well as reduced bloat on tables with frequently updated indexes. There are also improvements in data types and SQL. One of the interesting things to me here is the new subscript syntax that is possible when working with JSON, but there are definitely a lot of other improvements in areas of administration, particularly with regard to vacuum and how it operates, including the new idle_session_timeout capability, as well as replication, recovery, and security improvements. So if you want to get an idea of what's coming in Postgres 14 in the fall, you can definitely check out this blog post.
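
To give a feel for that subscript syntax, here is a minimal sketch (the shipments table, details column, and timeout value are just assumptions for illustration):

-- Postgres 14 lets you use subscripts directly on jsonb values
SELECT ('{"attributes": {"color": "red"}}'::jsonb)['attributes']['color'];

-- Subscripts can also be the target of an UPDATE (hypothetical table and column)
UPDATE shipments
   SET details['status'] = '"delivered"'
 WHERE id = 1;

-- And the new idle_session_timeout setting closes sessions that stay idle too long
SET idle_session_timeout = '10min';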

The next piece of content- "An early look at Postgres 14: Performance and Monitoring Improvements". This is from pganalyze.com and they looked at a number of areas of the new Postgres 14 that's in Beta 1. The first thing they looked at was the active and idle connection scaling that was done in 14. The yellow line represents Postgres 13.3 and the black diamonds represent Postgres 14 Beta 1. You can see Postgres 14 maintains its connections better than 13.3 did, but not to the degree that was presented in the previous post. So in this previous post where they talked about the patch, you could see a much greater improvement in the TPS, maintaining consistency out to around 10,000 connections, whereas previously it was falling off much more. But we don't really see that with the test that they did here.

So I'm not really sure what's going on, whether it's because it's still Beta 1 and there's maybe some debugging enabled, or if there's something different about the configuration that was used. They don't really mention it in this post, but they do say that in their testing there was a 20% improvement in TPS at 5,000 connections and a 50% improvement in TPS at 10,000 active connections. So it's an improvement that they've demonstrated, just not as dramatic as the one people may have been hoping for, or maybe again it's just the configuration, or because it's still Beta 1. The next area they looked at is memory use with pg_backend_memory_contexts. This is a new view that gives you information about the memory contexts of your own connection, and there's also a companion function that lets you target a particular connection. You can use it to have the backend with a given PID write its memory context information out to the Postgres logs.
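
Roughly, that looks like this (the PID here is just a placeholder for a real backend process ID):

-- Inspect the memory contexts of your own backend
SELECT name, parent, total_bytes
  FROM pg_backend_memory_contexts
 ORDER BY total_bytes DESC
 LIMIT 10;

-- Ask another backend, identified by PID, to dump its memory contexts to the server log
SELECT pg_log_backend_memory_contexts(12345);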

That way, you can get an idea of how memory is being used. Next, they looked at some monitoring improvements, where there's a new pg_stat_wal view that gives you an indication of how many WAL records have been produced, how many full-page images, and how many total bytes have been written. These counters are cumulative, so you would have to reset the stats periodically to measure a given window. You could find this beneficial, with resets, for tracking your WAL generation over time. Then they mentioned the new query_id. A query ID has long been present in pg_stat_statements, but Postgres 14 adds a setting that lets you turn on computation of the query_id so it shows up in other places as well.
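
As a quick sketch of what checking those WAL statistics might look like:

-- Cumulative WAL activity since the last stats reset
SELECT wal_records, wal_fpi, wal_bytes, stats_reset
  FROM pg_stat_wal;

-- Start a fresh measurement window
SELECT pg_stat_reset_shared('wal');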

So here they're making a configuration change, and now the query_id appears in the pg_stat_activity view. That lets you track queries that are essentially the same: even though the query text varies by its parameters, the internal system treats it as the same query and it's going to be producing the same query plan. So you can do something interesting and ask, okay, out of the active queries in pg_stat_activity, which query ID is the most common? Then you can see that a particular query ID is using up most of the connections. So this definitely highlights a number of improvements coming in Postgres 14, and I encourage you to check out this blog post to learn more about these specific features.
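
Here's a minimal sketch of that idea, assuming query ID computation has been enabled (the setting can also be configured in postgresql.conf):

-- Turn on in-core query ID computation for this session
SET compute_query_id = on;

-- Which query IDs account for the most active sessions right now?
SELECT query_id, count(*)
  FROM pg_stat_activity
 WHERE state = 'active'
 GROUP BY query_id
 ORDER BY count(*) DESC;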

The next piece of content- "UUID, SERIAL OR IDENTITY COLUMNS FOR POSTGRESQL AUTO-GENERATED PRIMARY KEYS?". This is from cybertec-postgresql.com. So basically to me, this is which auto-generated primary keys are best to use. Now the first thing they talk about you can go with what's known as a natural primary key. So if you have a user's table and it has a unique username, well you could use that as the primary key or you could use the email address as the unique key for that table. The other possibility is to use a surrogate primary key or an artificial primary key. A lot of application frameworks automatically generate these keys for you as the primary key, a surrogate primary key. So usually this is what I see, this is not as common using a natural primary key, but this post talks about if you're using an artificial primary key, what's the best way to keep it updated or auto-generated. There are basically two techniques that you can use. One is using a sequence of numbers, the other is using UUIDs. So for generating a sequence of numbers, the best thing to use is a sequence in Postgres.

So a sequence is a separate database object that generates the next number in the series each time it's called. The other alternative is using UUIDs, and in Postgres 13 and higher, you can use the gen_random_uuid function when you're defining the primary key column. So for example, with a UUID column you can set the default to be this function and designate it as the primary key. Otherwise, if you want to use a sequence, you can create an INT or bigint column, default it to the next value of the sequence using the nextval function, and designate it as the primary key. But you're going to have to have that sequence in place. To make this easier, Postgres offers some convenience data types called serial and bigserial. When you specify, say, serial as the data type, it automatically creates the column as an integer, creates the sequence, and sets the column's default to the next value of that sequence.
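
As a rough illustration of those options (the table and column names are just made up):

-- UUID primary key, Postgres 13 and higher
CREATE TABLE users_uuid (
    id uuid DEFAULT gen_random_uuid() PRIMARY KEY,
    username text NOT NULL
);

-- Sequence-backed bigint primary key, spelled out by hand
CREATE SEQUENCE users_manual_id_seq;
CREATE TABLE users_manual (
    id bigint DEFAULT nextval('users_manual_id_seq') PRIMARY KEY,
    username text NOT NULL
);

-- The bigserial shorthand creates the sequence and the default for you
CREATE TABLE users_serial (
    id bigserial PRIMARY KEY,
    username text NOT NULL
);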

So it does that for you automagically. Of course, a bigserial uses a bigint instead of an INT. Now, there is a newer way in Postgres to set up auto-incrementing integers with a sequence, and that's using identity columns. So basically you give the name of the column, the data type bigint, and GENERATED ALWAYS AS IDENTITY along with PRIMARY KEY, and that essentially manages a sequence for you behind the scenes to give you an auto-incrementing column. This is basically the way to go because it's the SQL-standard way of doing it. For completeness, they do mention you can use triggers, but that's definitely the slowest approach. Then they cover a few different questions that someone might have. The first is: should I use integers or big integers for my auto-generated primary key? His argument is you should always use a bigint, because if you have a small table, the difference in size is not going to matter much since you're not storing that much data.

But if you have a big table, you have a real risk of running out of integers. So he definitely suggests always using a bigint, and I definitely agree with that. The second question is: should I use bigserial or an identity column for my auto-generated primary key? His response is that you should use an identity column unless you have to support old PostgreSQL versions. Now, I would add that if your framework automatically uses the old syntax, I tend to go with what my application framework uses. I don't want to swim against the current or make things more difficult for myself. So when the time is appropriate, definitely move to identity columns, but serial, bigserial, and things of that nature are still going to be supported for quite a long time.
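
For reference, the identity-column form looks roughly like this (the table and column names are just illustrative):

-- SQL-standard identity column, managing a sequence behind the scenes
CREATE TABLE users_identity (
    id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    username text NOT NULL
);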

The next question is: should I use bigint or UUID for an auto-generated primary key? His advice is to use a sequence unless you use database sharding or have some other reason to generate primary keys in a decentralized fashion, basically outside the single database. I agree with this as well. I always use integers for my primary keys. The only time I don't is if some other service is generating a piece of data that I need to store. So if a JavaScript client is responsible for creating a unique identifier, then I would store it as a UUID in my system. Otherwise I avoid them, because UUIDs have a number of performance disadvantages. Now, they did do some benchmarking in the post and found UUIDs slower in some cases, but there can be some huge performance implications as your data sizes increase, particularly with the generation of WAL.

There's actually a post from November 2018 from 2ndQuadrant where they show how the amount of WAL generated significantly increases, mostly due to full-page image writes. The reason this happens is the random writes. If you have a lot of data being inserted into a table with an integer key, the values just keep increasing, so new entries usually get added to the tail end of any indexes and the table as the data comes in. But if you have a UUID, those index insertions land randomly across the entirety of the index. So you're going to end up with a lot of bloat as well as a lot of full-page image writes, because a lot of different pages are touched and need to be saved to the WAL.

So they were seeing significant hits to TPS and to the size of the WAL with UUIDs compared to sequential IDs. Now, what that post also talks about is an extension that allows you to generate semi-random UUIDs: they're sequential at the beginning and random at the end. That made the performance of the system much better. So if you want to go with UUIDs and you're dealing with a lot of data, maybe you want to consider using that extension, which gives you better performance while still giving you these partly sequential, partly random UUIDs. But definitely a great post talking about the different options for auto-generated primary keys. So if you're interested, check out this post.

The next piece of content- "Database Design: Using Composite Keys". This is from endpoint.com and since we're talking about auto-generated primary keys, this is an argument basically for natural keys. In addition to using natural keys, he expands on it and says you could also use composite keys for your primary keys. So basically, more than one column is in the primary key. They give an example here of a typical company where you have departments and employees and say you need to add a new column to the department table that specifies the teams. How would you do that if you were using a surrogate key for a primary key? He's arguing that in a lot of cases using natural keys is better and when you do that, you could define a composite primary key that is composed of two different columns in that table. Now, some of the arguments I can agree with, but some I don't necessarily agree with.

You definitely can go this way and there may be some benefits to it, but for some of the benefits, the advantage isn't as large as he suggests. He talks about performance considerations and argues there are performance advantages, but these natural keys can potentially use up more space, and whenever you're storing more data, that's a hit on performance. The main reason I haven't used this, or only explore the option on a case-by-case basis, is that my application frameworks automatically generate surrogate keys. I've been able to operate my databases just fine using these surrogate keys rather than natural keys. There may be certain cases where I might reach for them, but generally I tend to follow what my application framework does, again because I don't want to swim against the current too much. But if you want to learn more about using natural keys, particularly composite ones, with your database, you can check out this post.

The next piece of content is actually a YouTube playlist called "Postgres Pulse". Now these videos were posted a while ago, but here is a convenient playlist. Each of them is about 30 minutes in length and I think there are about 12 to 14 of them. So if you are looking for some Postgres video content, maybe you want to check this out and see if there's any of this content that you want to look at that you've potentially missed.

The next piece of content- "PostgreSQL 14: Substantive Change to Fulltext Query Parsing". This is from akorotkov.github.io. Now I think what they meant here is there's a substantial change to the full-text query parsing. So basically there's been a bug in full-text search where when you try to search on something like pg_class, you aren't getting the expected results. Like here's a web search to tsquery where you're searching pg_class pg and even though this text exists, it's false so it's not finding it. So there's basically some bug in it and they basically had to resolve this bug in certain ways. As a result, you may get slightly different results from a full-text search in Postgres 14 based upon resolutions to this bug. So if you use a lot of full-text search and you want to be aware of any potential changes that may happen to the results of that, you may want to check out this post to see if it would potentially impact you. So if you want to learn more you can check this out.

Next piece of content- "Waiting for PostGIS 3.2: ST_InterpolateRaster". This is from blog.crunchydata.com and they're talking about a use case where you have temperature readings, say across the state of Washington, and you want to actually get temperature readings interpolated between where the measurements were made. If you want to do that using a Raster, there's apparently this new feature or function that enables you to do exactly that and gives you a gradient to interpolate where different temperature changes are based on where the temperatures were actually measured. So if you want to produce something like this, maybe you want to check out this blog post.

The next piece of content- "Porting Array length from Oracle PLSQL to Postgres PLPGSQL''. This is from migops.com, and they're talking about a specific case where you're porting an array length check in Oracle in a function to Postgres where you get an entirely different unexpected answer. So if you are looking to migrate Oracle to Postgres and want to watch out for some gotchas, maybe you want to check out this post.

Next piece of content- "PGO 4.7, the Postgres Operator: PVC Resizing, GCS Backups, and More". This is from blog.crunchydata.com and they're talking about a new release of their Postgres operator for working with Kubernetes. So if you are interested in that, you can check out all the new features that have been added to this release.

The last piece of content: the PostgreSQL person of the week is Magnus Hagander. So if you're interested in learning more about Magnus and his contributions to Postgres, definitely check out this blog post.
