
Postgres 14 Released, Using JSON, Not Using Indexes, Sequence Gaps | Scaling Postgres 185


In this episode of Scaling Postgres, we discuss the release of Postgres 14, how best to use JSON, why your indexes are not being used, and sequence gaps.


Transcript

All right. I hope you, your friends, family, and coworkers continue to do well. Our first piece of content is "PostgreSQL 14 Released!". This is from postgresql.org and they've released version 14. They mention a number of the different features in this announcement here. Some of the main ones they talk about are support for JSON subscripting, much like most programming languages use, in order to extract certain values out of a JSON field. Support for multirange types alongside the range data types. Performance improvements with regard to connections, with some benchmarks showing a 2x speedup, or at least transaction performance that doesn't drop by half as you increase the number of connections, up to 10,000, for example.

It maintains its performance. Some improvements to reduce bloat on indexes. They talk about pipelining of queries, which basically lets you send additional queries before receiving the results of earlier ones, so you can pipeline multiple queries and then receive the results in an asynchronous manner. Improvements to logical replication in terms of not having to wait for large transactions to complete before data is transferred, and a number of other improvements, including changes to foreign data wrappers, vacuum improvements that we'll talk a little bit more about later, as well as a number of other security, performance, and convenience features. So if you want to find out more about that, you can check out this post.
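
As a rough illustration of two of those features (this sketch is not from the announcement itself; the table and column names are made up), JSON subscripting and multirange types look roughly like this:

    -- JSON subscripting: read and write jsonb with array-style syntax
    SELECT details['color'] FROM items WHERE id = 1;
    UPDATE items SET details['price'] = '42' WHERE id = 1;

    -- Multirange types: a single value holding several, possibly disjoint ranges
    SELECT '{[1,5), [10,20)}'::int4multirange @> 12;  -- true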

I'm also going to talk about this post, which is "How We Shipped PostgreSQL 14 on Azure Within One Day of its Release". This is from citusdata.com. That's pretty impressive because sometimes it takes months for some of these hosted Postgres services to get out the next release once Postgres is released. For them to do it within one day is pretty impressive. Now, they discussed that part at the end of the post, but they also mentioned some of the features that they found favorable. Of course, Microsoft did contribute to some of these big performance features that were mentioned.

So for example, one is the connection scaling. To my knowledge, they did most of the work on improving connection scaling so that when you go up to, say, a 10,000 connection count, performance doesn't drop by almost half; it's maintained even at a high connection level. So for example, it looks like the difference between 100 connections and 10,000 connections doesn't seem to be significant, at least in this read-only pgbench workload test. There were also improvements to vacuum speed that they say give up to a 25% performance improvement, as well as crash recovery being able to run much faster, up to 2.4x faster.

Now, it looks like the recovery_init_sync_method setting is related to being able to get a much faster startup recovery, but there are some considerations involved. So I would definitely check this out if you run a large instance and want your database to come back online quickly if it happens to crash. Hopefully, that's rare or nonexistent for you, but it might be something to look into. Then, as I mentioned, the latter part of this post actually talks about how they got their extensions ready so quickly to be compatible with 14, as well as getting Citus, their scale-out extension, ready. So I definitely encourage checking out this blog post.
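
If you want to look at that setting, it's a server parameter; a minimal sketch of what they're referring to (check the documentation and the caveats in the post before relying on it) would be something like:

    # postgresql.conf (PostgreSQL 14; syncfs is Linux only)
    # Instead of fsyncing every file in the data directory before crash recovery,
    # sync whole filesystems, which can let recovery start much faster.
    recovery_init_sync_method = syncfs    # default is fsync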

The next piece of content- "PostgreSQL 14: PostgreSQL Is Still a Tweeny". This is from instaclustr.com and this is just another viewpoint on the improvements that he thinks are significant. Again, he's mentioning the connection scaling improvements; that's a big one that most people mention. They talk about a number of operations that can now do work in parallel, such as REFRESH MATERIALIZED VIEW, RETURN QUERY, and queries to foreign data wrappers, as well as some improvements to operations that use CONCURRENTLY, so you can do more actions concurrently.

Then he mentions a whole host of visibility improvements. So these are improvements to monitoring the activity that the database does. For example, there's pg_stat_progress_copy to be able to monitor COPY operations. There's pg_stat_wal to give you more information about the write-ahead log. And there's pg_stat_replication_slots, which shows you statistics about your replication slots. So there are a lot of additional statistics and observability improvements that have been added to different areas of Postgres. If you want to learn more about that, you can check out this blog post.
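
As a quick illustration, these new views can simply be queried like any other statistics view (the column details are in the Postgres 14 documentation):

    SELECT * FROM pg_stat_wal;                -- write-ahead log activity
    SELECT * FROM pg_stat_replication_slots;  -- per-slot logical replication statistics
    SELECT * FROM pg_stat_progress_copy;      -- progress of currently running COPY commands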

The next piece of content- "Upgrading PostgreSQL 9.6 to PostgreSQL 13". This is from migops.com. Now, with the release of Postgres 14, that means PostgreSQL 9.6 will become end-of-life on November 11, 2021; that's its last release. So if you're still on that version, or even a version prior, you're going to want to upgrade. Now, of course you can do a pg_dump and a pg_restore into the new database version, but if you have a large database, that will take an incredibly long time. So this post actually runs through the process of using pg_upgrade with the hard links method, which can give you a very fast cutover time, on the order of seconds or minutes, even if you have an incredibly large database. They go through a pretty good process, one that I adopt myself, where you have a primary and then a streaming standby server, and in preparation for upgrading that primary, you create additional standbys so they're ready in case you need to fail back to the prior version because of some issue. The post goes through all the different commands you would use to do this type of upgrade. So if you're interested, check out this blog post.
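
The post has the full procedure, but the core pg_upgrade step with hard links looks roughly like this (the paths and version numbers here are placeholders, so adjust them to your own layout):

    # Dry run first with --check, then re-run without it to perform the upgrade
    pg_upgrade \
      --old-bindir=/usr/pgsql-9.6/bin  --new-bindir=/usr/pgsql-13/bin \
      --old-datadir=/var/lib/pgsql/9.6/data --new-datadir=/var/lib/pgsql/13/data \
      --link --check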

Next piece of content- "JSON IN POSTGRESQL: HOW TO USE IT RIGHT". This is from cybertec-postgresql.com. This is talking about the tendency today, or as they say since the 2000s, to store everything in a large JSON document, which is not really how you want to use JSON in Postgres. He has an example here, what he calls a bad example, where, say, you have a people table with some people in it and a set of rooms, and basically you want people to be able to reserve particular rooms. To do that, you've decided to just use a single large JSONB field to store the information about who's assigned to what room. There's a whole host of things wrong with this that he mentions in the post below. The first mistake is to model regular columns as JSON attributes.

So the first thing he's talking about is the room name. You could easily make this its own column within the table, and you would want to do that. So don't put all of that information in the JSON blob. The second mistake is modeling tabular data as JSON arrays. So for example, this could easily be its own table, a reservations table, and that's exactly what he modeled here. He created a reservations table that links the person ID to the room ID, along with a range for when the room is reserved. So that's a much better way to model it. The third mistake is storing foreign keys in JSON. That means you're going to have to do this awful type of query here to link rooms to reservations, whereas you could easily do it with the data model changes he's proposed. The fourth mistake is that it makes it incredibly hard to modify the JSON data if you store it like that.

This example was just making a new room reservation, whereas with his model it is a simple insert to do a room reservation. The fifth mistake is trying to enforce constraints on JSON, which is nearly impossible to do, or at least there is no standard way to do it, whereas Postgres has constraints you can put on each column, either through data types or custom-designed constraints. An example here uses a GiST index (with the btree_gist extension) to ensure that a room is not reserved more than once; the particular technique to do that is called an exclusion constraint. So those options are available to you if you model it in the way that he suggests. The sixth mistake is complicated searches in JSON. So looking to see if a time slot is free can get quite complicated up here, whereas it's a much simpler query to do it below that.
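
A minimal sketch in the spirit of the model he proposes (the table and column names here are illustrative, not copied from the article) looks like this:

    CREATE EXTENSION IF NOT EXISTS btree_gist;  -- needed to mix = and && in one exclusion constraint

    CREATE TABLE people (id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY, name text NOT NULL);
    CREATE TABLE rooms  (id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY, name text NOT NULL);

    CREATE TABLE reservations (
        people_id bigint  NOT NULL REFERENCES people (id),
        room_id   bigint  NOT NULL REFERENCES rooms (id),
        reserved  tsrange NOT NULL,
        EXCLUDE USING gist (room_id WITH =, reserved WITH &&)  -- no overlapping reservations per room
    );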

So those were just some bad examples of issues you can run into if you try to store too much in JSON. Now he talks about a good example of it: you have your standard table, you have all the standard columns, but there are some additional attributes that don't apply to every object. So this is something in a warehouse, and everything pretty much has a price, a weight, whether it's available or not, who manufactured it, and a package size, so those are consistent for any item. Whereas maybe you have certain attributes, as he talks about, like the type of power plug, something unique to that item that is important for a customer to know. You could store those in the attributes field, and that's a good example of using JSON. But if you want to learn more, you can check out this blog post.
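
A rough sketch of that kind of table (the names are invented for illustration) is a set of regular columns plus one jsonb column for the irregular, item-specific details:

    CREATE TABLE wares (
        id           bigint  GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
        name         text    NOT NULL,
        price        numeric NOT NULL,
        weight_kg    numeric NOT NULL,
        available    boolean NOT NULL DEFAULT true,
        manufacturer text,
        package_size text,
        attributes   jsonb             -- item-specific extras, e.g. {"power_plug": "EU"}
    );

    SELECT name FROM wares WHERE attributes ->> 'power_plug' = 'EU';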

The next piece of content- "Using recursive queries to get distinct elements from a table". This is from depesz.com. So this comes from an example where someone had an 800 million row table with a set of stations, and each station had a certain number of channels. This is the query they ran when they wanted to get all the channels for each station. They would select the station and then do an array aggregation of the channel, grouped by the station, so it shows all the channels for a station. Basically, it took about five minutes to run. So he set up a similar table and looked at how the performance could be improved. Now, he added an index on the station and channel and it ran in 136 seconds, so that still takes a bit of time.

Now, he said one path you could take is to pre-cache that result, maybe as a materialized view, or by creating a separate table that you then update with triggers. That's one way you could do it, and you could get a query back in less than a millisecond. So basically you're caching it. But he came up with another solution, which is basically a skip scan. So the first thing you do is a SELECT from the table with LIMIT 1, so it's going to get the first row. Again, that takes about a millisecond. Then you query for values greater than the one you just received, to get the next channel that exists for that station.

You're going to skip everything in between and then just LIMIT 1 again. Again, that ran in about a millisecond, so you want to continue doing that for all the channels. So he did it as a recursive query, which recursively runs through and finds all the channels for each station. With that, he got the query down to 157 milliseconds with no caching at all, which is a significant improvement over 136 seconds. So it's definitely a great technique to consider if you need to do something similar.
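
His post has the exact query, but the general shape of the technique, assuming a hypothetical readings table with an index on (station, channel), is a recursive CTE that repeatedly hops to the next (station, channel) pair:

    -- Requires an index on (station, channel) so each step is a single index hop
    WITH RECURSIVE skip AS (
        (SELECT station, channel
           FROM readings
          ORDER BY station, channel
          LIMIT 1)
      UNION ALL
        SELECT n.station, n.channel
          FROM skip s
          CROSS JOIN LATERAL (
              SELECT station, channel
                FROM readings
               WHERE (station, channel) > (s.station, s.channel)
               ORDER BY station, channel
               LIMIT 1
          ) n
    )
    SELECT station, array_agg(channel ORDER BY channel) AS channels
      FROM skip
     GROUP BY station;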

Next piece of content- "WHY ISN'T POSTGRES USING MY INDEX?". This is from pgmustard.com. He says the two main reasons this happens are that, number one, Postgres can't use the index, or two, it thinks it can do the query faster without using the index. He covers some of the reasons it can't use the index: maybe a function is preventing its use. The main example would be if you're using lower() to find an email; unless you have defined a functional index that lowers the email field, the index is not going to be used. Another is a data type mismatch preventing its use; he uses an example where he's trying to search with a numeric and it just does a sequential scan instead. Or, of course, the index may simply not exist. Now, I'll add a fourth option here, and that is if you're using LIKE or ILIKE and you only have a plain B-tree index on the column, it's generally not going to be used.

Then he goes into some of the reasons it can use the index but doesn't think it should. One is the table is small, so it can just do a sequential scan. Another is that a significant proportion of the rows are being returned; if it's returning 80% of the rows, there's no reason to use an index, it's faster to just do a scan. LIMIT clauses can also cause issues here, so sometimes you need to be cautious of that, and they actually link to a Twitter thread talking about the potential dangers of using LIMIT. So something to be aware of. The other consideration is cost estimation: if your statistics are off, it may not use an index. So this is a pretty comprehensive post about why Postgres isn't using an index, and I definitely encourage you to check it out.
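
To make the function case concrete, here's a small sketch with a hypothetical users table: a plain index on email cannot serve a lower(email) predicate, but an expression index can:

    -- A plain B-tree on email does not match this predicate:
    --   WHERE lower(email) = 'someone@example.com'
    -- An expression index does:
    CREATE INDEX users_lower_email_idx ON users (lower(email));
    SELECT * FROM users WHERE lower(email) = 'someone@example.com';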

Next piece of content- "Zero-downtime Postgres schema migrations need this: lock_timeout and retries". This is from postgres.ai and he's talking about the importance of using a lock timeout when you're doing schema migrations. Otherwise, you can get things waiting behind a lock queue and not being freed up, with your database system basically going down. So anytime you want to do any sort of DDL change, you're going to want to use a short lock timeout, and depending on your implementation, you're of course going to want to retry it if the lock times out. He advocates different ways to do that, particularly using exponential backoff with jitter, a randomization, so that when it tries again, it does so after a randomized delay. So I definitely encourage doing that. He also talked about subtransactions, but I would specifically avoid those because of issues with them. But definitely good advice to follow about using lock_timeout when you're doing DDL changes.
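
In practice that looks something like the following sketch (the table name is just an example); the retry loop with backoff and jitter lives in your migration tooling or application:

    SET lock_timeout = '2s';                  -- fail fast instead of queueing behind long-running queries
    ALTER TABLE orders ADD COLUMN note text;  -- if this times out, retry later with exponential backoff + jitter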

The next piece of content- "GAPS IN SEQUENCES IN POSTGRESQL". This is from cybertec-postgresql.com. So they're talking about how gaps can happen in sequences, say in your integer primary keys. One way it happens is that if you're doing an INSERT and it fails, the sequence will just move on to the next number; it won't go back and reuse that sequence number. So you can see here we have a gap, because when it tried to INSERT this row, that particular row failed, and it's not going to reuse that sequence number, it's just going to get the next one. The other possibility is caching. You can cache sequences so that your session doesn't consult the sequence every time; it has a set of values that it can just use.

But if you disconnect and start a new session, it's going to start the sequence after the end of that cached range. So the disadvantage of caching is that if you have sessions constantly connecting and disconnecting, you're going to run through your sequence values quite quickly. The next possibility is a crash. If something crashes and is not saved to the database, it's possible that the sequence advances were not logged to the WAL, and you can get a gap of, say, up to 32 numbers. So that's another potential gap. Also, sequences can jump backward in the event of a crash if nothing is logged to the WAL. So for example, if you start a transaction with BEGIN and you call nextval several times and then it crashes, when the database comes up and you call nextval again, it'll start back over at one.

It will do that because nothing was committed to the WAL, essentially. Now, that would normally be a problem, but he says "...don't use sequence values from an uncommitted transaction outside of that transaction", because again, that could cause problems. Now, he talks about a way to build a gapless sequence, but that's basically using a separate table that tracks the numbers. It's basically creating your own sequence, and that will just be pretty slow, so I definitely wouldn't recommend it. But it is a way you could implement it if you're looking for a guaranteed gapless sequence, though you're going to get performance problems. If you want to learn more, you can check out this blog post.
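
The failed-insert case is easy to reproduce; a small illustrative example (not taken from the article) looks like this:

    CREATE TABLE gap_demo (
        id  bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
        val text UNIQUE
    );
    INSERT INTO gap_demo (val) VALUES ('a');  -- gets id 1
    INSERT INTO gap_demo (val) VALUES ('a');  -- fails on the UNIQUE constraint, but the sequence already advanced
    INSERT INTO gap_demo (val) VALUES ('b');  -- gets id 3, leaving a gap at 2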

Next post is "3 ways to auto-increment with Postgres". This is from mydbanotebook.org. These are three ways you can use sequences; they're essentially all sequences under the hood, just used in different manners. So the number one way to get something to auto-increment is just to create a sequence. It's not necessarily tied to a primary key; you can use a sequence for any purpose to give you something that auto-increments. The next way is the serial data type, which creates a column with an associated sequence and, in their example, is the NOT NULL primary key. The third way to do it is an identity column, and this is the SQL-compliant way versus the serial data type, which is Postgres-specific.

This is the SQL-standard way of generating auto-incremented IDs. I believe it still uses essentially the sequence mechanism in the background, but it doesn't show that explicitly here. So you can see here the serial column is using nextval of a particular sequence, whereas here it just says "generated by default as identity". Now, what they did mention is that this also supports "generated always as identity", and what that difference means is that it prevents someone from manually inserting an ID. So basically the ID will always be generated through the identity column. But if you want to learn more about these three different ways to generate auto-incrementing numbers, definitely check out this blog post.
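
Side by side, the three forms look roughly like this (the object names are invented for illustration):

    -- 1. A bare sequence, usable for any purpose
    CREATE SEQUENCE order_number_seq;
    SELECT nextval('order_number_seq');

    -- 2. serial: Postgres-specific shorthand that creates and attaches a sequence for you
    CREATE TABLE orders_serial (id serial PRIMARY KEY);

    -- 3. identity column: the SQL-standard form
    CREATE TABLE orders_identity (id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY);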

The next piece of content- "Devious SQL: Dynamic DDL in PostgreSQL". This is from blog.crunchydata.com and they're talking about different ways to do dynamic DDL in Postgres. Now, the example they have here is, say you have logical replication between two tables and the sequences get out of sync. So basically, you want to restart the sequence at the max ID in the table. They go through some different ways to do that, because if you try to do an ALTER SEQUENCE with a RESTART and supply the value as a subquery, it won't work; it's expecting, essentially, a literal value rather than a query. So they show these different ways to do, essentially, dynamic SQL to accomplish this. The first way is to use psql's variable substitution. So this is using the psql client, and you use \gset to set a particular variable.

Then you use that variable in the next command to alter the sequence. The next way is to use the \gexec command, also within psql, where you define a query that generates the statement and then run \gexec on it to actually alter the sequence. Another way, which works outside of psql, is to do it as part of an anonymous DO block using PL/pgSQL. You can also create your own exec function, and that's another way you can accomplish the same thing. But they do note that if you're trying to do some dynamic SQL that includes REINDEX CONCURRENTLY, that can run into issues, because the CONCURRENTLY operation needs to be run outside of a transaction block; it does its own transaction control in order to do things concurrently. So you need to be aware of that if you're trying to use dynamic SQL with it. But you can check out this blog post to learn more.
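
Two of those approaches, sketched here with placeholder object names (see the post for their exact versions):

    -- psql variable substitution: capture a value with \gset, then reuse it
    SELECT max(id) + 1 AS next_id FROM my_table \gset
    ALTER SEQUENCE my_table_id_seq RESTART WITH :next_id;

    -- anonymous DO block: build and run the DDL with PL/pgSQL's EXECUTE
    DO $$
    BEGIN
        EXECUTE format('ALTER SEQUENCE my_table_id_seq RESTART WITH %s',
                       (SELECT max(id) + 1 FROM my_table));
    END
    $$;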

Next piece of content, the PostgreSQL person of the week is Jeff Davis. So if you want to learn more about Jeff and his contributions to Postgres, you can check out this blog post.

The last piece of content, we did have another episode of The Rubber Duck Dev Show. This one discussed when to use UUIDs and when not to. So if you want to learn more about that, you can check out this episode. The upcoming episode is on whether or not to comment your code.
