Postgres 15.1, Postgres 10 Retrospective, Reducing Replication Lag, Listen & Notify | Scaling Postgres 241

In this episode of Scaling Postgres, we discuss the release of Postgres 15.1, a Postgres 10 retrospective, how to reduce replication lag and using listen & notify.

Transcript

All right. I hope you, your friends, family, and coworkers continue to do well. Our first piece of content is "PostgreSQL 15.1, 14.6, 13.9, 12.13, 11.18, and 10.23 Released". This is from postgresql.org, and this is a pretty quick set of releases after Postgres 15 came out not too long ago, in my opinion. It includes over 25 bug fixes made over the last several months that primarily affect Postgres 15. They also updated some time zone data. In addition, this is the last official release of PostgreSQL 10, so it is now considered end of life. If you're still on 10, you should definitely upgrade to one of the more recent versions.

Next piece of content- "A TRIBUTE TO POSTGRESQL 10". This is from jkatz05.com and he's talking about how, now that Postgres 10 is at end of life, we should look back and reflect on the significant release that it was. Some of the things he mentions here are that they actually started a new versioning scheme, so that it was much clearer what a major version is versus a minor version, et cetera. So they went from major versions of 9.5 and 9.6 to 10, 11, 12, and 13. The other significant thing is adding more parallelism to different areas. So much so that on a particular OLTP workload he had, when he upgraded from Postgres 9.4 to 10, he saw a 2.5x speedup for the entire workload. Now, of course, there were some changes in 9.5 and 9.6 that he skipped over, because they introduced some parallel features too, but it's definitely impressive seeing a 2.5x speed improvement from upgrading a couple of versions.

The other feature it introduced is declarative partitioning, making partitioning a lot easier to do. It also introduced logical replication with publishers and subscribers, to be able to send data logically as opposed to just copying a whole instance. He mentioned that when you're setting up replicas, you could now set a quorum for synchronous replication, so not everything had to be synchronous; you could specify a particular quorum level. It also made passwords more secure, supporting SCRAM password authentication instead of MD5. And then there were a few other things, like enhancements to the Postgres foreign data wrapper to do push down, identity columns as an alternative to serial, and support for the amcheck extension. So this is a pretty interesting look back at what Postgres 10 offered and its significance.
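
As a rough sketch of two of those Postgres 10 features (all table, column, and standby names here are hypothetical, not from the post):

```sql
-- Declarative partitioning (new in Postgres 10): declare the
-- partitioning scheme instead of wiring up inheritance triggers.
CREATE TABLE measurements (
    city_id     int  NOT NULL,
    logdate     date NOT NULL,
    temperature int
) PARTITION BY RANGE (logdate);

CREATE TABLE measurements_2017 PARTITION OF measurements
    FOR VALUES FROM ('2017-01-01') TO ('2018-01-01');

-- Quorum-based synchronous replication (also new in 10): wait for
-- ANY two of three named standbys rather than all of them.
ALTER SYSTEM SET synchronous_standby_names = 'ANY 2 (s1, s2, s3)';
```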

Next piece of content- "Postgres v15- a billion transactions later". This is from kmoppel.github.io and he's talking about a test he did comparing Postgres 10 and Postgres 15 using pgbench with 1 billion transactions. Now, out of all the testing that he did, there were some variations in runtimes and things of that nature, but the most significant thing that I noticed was the size difference. If you look at the data size growth, there was a difference of 46%, and the index size was a savings of 60%. So this just goes to show you all the index improvements that have happened over the most recent versions of Postgres, and how much they can really save you in disk space. So if you are on an older version and you want to save some disk space, just move up to the most current version to get all those benefits. But if you want to learn more, you can check out this blog post.

Next piece of content- "Reducing replication lag with IO concurrency in Postgres 15". This is from citusdata.com and if there had not been a new release of Postgres along with the Postgres 10 retrospectives, this would have been the primary post that I covered this week. I highly encourage you to read this. I found it very informative and interesting about what they're working on and kind of where they're going, and the next steps. So basically the area that they're trying to resolve is issues with replication lag. Specifically, my understanding is the stage at which the replica is trying to apply the changes to the database itself. So presumably it's received the WAL over the wire and is now applying those changes to the database files on the replica. This can cause slowdowns because that's a lot of random IO. And they've actually made some changes to Postgres 15 to address this.

And they set up a test here, with a pgbench scale factor of 1,000 and 32 clients. They set maintenance_io_concurrency to zero, because that essentially turns off the new feature they added, and turned full_page_writes off to minimize checkpointing in the process. So they started up pgbench, and you can see the lag in seconds and how it dramatically increases at a pretty steep slope. And again, they say this is the replay_lag. So I'm interpreting this as the WAL being applied to all the database files on the replica. Now, what they did is they changed maintenance_io_concurrency to 10, which is its default value. And at that setting, you see that the slope is not as steep. So then they set it up to, I believe, 32, and you can see it dramatically stopped the lag, which then precipitously dropped down. So basically, this setting reduces replication lag.

Then they stepped it down in a stepwise fashion, and they found that around 16 for this setting (again, the default is 10) was sufficient to keep up with the load that was happening and not have essentially any replication delay. Now they're using maintenance_io_concurrency, and this setting has been in Postgres since, I think, version 13. But apparently, with Postgres 15, they introduced some changes that impact the application of the WAL on a replica. And I do want to read this quote that they put here. "If trying this experiment at home, note that the current advice-based scheme" –so this is an IO advice-based scheme– "is only known to work on Linux with ext4 and xfs". So I have some clients using zfs, it won't work with that, but they're saying it is expected to work in the future. "It doesn't yet work with other operating systems"– so non-Linux– and "the true asynchronous I/O proposal being worked on for the next generation will work everywhere, via various kernel APIs".
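
If you wanted to try this setting yourself on Postgres 15, a minimal sketch might look like this (the value 16 is simply the one the post landed on; your workload may differ):

```sql
-- Check the current value; the default is 10.
SHOW maintenance_io_concurrency;

-- Raise it cluster-wide and reload the configuration;
-- no restart is required for this setting.
ALTER SYSTEM SET maintenance_io_concurrency = 16;
SELECT pg_reload_conf();
```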

So basically, this is an introduction to this feature in Postgres 15. It works as long as you're using Linux and ext4 or xfs. But in the future, you should be able to use different file systems and different operating systems as well. Now, the reason this works is that they're using a particular API called posix_fadvise(), but the post says there are many drawbacks to it, and that's kind of why you have a reduced set of operating systems and file systems it works on. But they're looking to move towards Linux's io_uring. So basically, they want to get to full asynchronous I/O if they can. But again, this post has a lot of detail. I highly encourage you to check it out and examine these features. If you're thinking of upgrading to Postgres 15 and you've had some replication lag, maybe changing this setting could improve it for you.

Next piece of content- "Listening to Postgres: How LISTEN and NOTIFY Syntax Promote High Availability at the Application Layer". This is from enterprisedb.com and they're coming up with a scenario where you want to build a social network that has members. Those members have posts, and you have followers of particular members. And if a member who has a lot of followers posts, you want to send out a notification to each follower, essentially. They're asking how you can handle this. So they inserted about a million followers of a particular character here, and then they started with just a trigger that inserts notification rows for members into a table. But the problem is that doing a single post for one user took 10 seconds, because this trigger basically has to insert a million rows along with the post. So that's not a viable option. The next option they explored is using LISTEN and NOTIFY.

So basically, you set up a Python application that sets up a listener, and then you modify the trigger so that it calls pg_notify when a particular notification needs to be sent out to followers. The application then takes it and actually does the inserts, essentially in an asynchronous fashion. But he says a disadvantage of this is that if your application goes down, you can essentially lose notifications for posts inserted in the meantime. Another alternative is basically doing polling. So on the notification table, they put an is_sent flag, and they're doing SELECT FOR UPDATE SKIP LOCKED to be able to know which notifications need to be sent out. Their application was rewritten essentially just to do polling, to say 'Hey, are there any new notifications that need to be sent?' and then go ahead and send them. So this blog post walks through different implementations of how to do these types of features using LISTEN and NOTIFY, or even polling along with SELECT FOR UPDATE SKIP LOCKED. So if you're interested in that, you can check out this blog post.
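
A minimal sketch of the two approaches, assuming hypothetical posts and notifications tables (the actual schema in the post differs):

```sql
-- Approach 1: the trigger only emits a NOTIFY; a listening
-- application does the heavy fan-out inserts asynchronously.
CREATE OR REPLACE FUNCTION notify_new_post() RETURNS trigger AS $$
BEGIN
    PERFORM pg_notify('new_post', NEW.id::text);
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER post_notify
    AFTER INSERT ON posts
    FOR EACH ROW EXECUTE FUNCTION notify_new_post();

-- Approach 2: polling. Claim a batch of unsent notifications
-- without blocking on rows another worker has already locked.
SELECT id
FROM notifications
WHERE NOT is_sent
FOR UPDATE SKIP LOCKED
LIMIT 100;
```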

Next piece of content- "Logical Replication Improvements in PostgreSQL-15". This is from amitkapila16.blogspot.com, and this is a review of all the changes that have been rolled into Postgres 15 for logical replication. He talks about allowing the replication of prepared transactions. This was already supported in logical decoding, and now standard logical replication between two different database servers can use prepared transactions. Basically, that means it can start sending data before transactions are complete; essentially, it uses a two-phase process to do this. The next one is allowing replication of all tables in a schema. So now you can just specify a schema to be replicated, as opposed to having to state each table, so that feature is available.
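
A sketch of what those two Postgres 15 features look like (publication, subscription, and connection details are made up):

```sql
-- Publish every table in a schema, present and future,
-- instead of listing each table individually.
CREATE PUBLICATION sales_pub FOR TABLES IN SCHEMA sales;

-- Subscribe with two_phase enabled so prepared transactions
-- are replicated at PREPARE time rather than at COMMIT.
CREATE SUBSCRIPTION sales_sub
    CONNECTION 'host=primary dbname=app'
    PUBLICATION sales_pub
    WITH (two_phase = true);
```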

The next one is allowing row filters for logical replication of tables, as well as column lists. Basically, these reduce the amount of data within a table that you replicate over. Also, logical replication can now run as the owner of a subscription, and there are improvements to conflict resolution. So now you can actually specify that a subscription should be disabled if it runs into an error, as opposed to it just continually producing errors, and there are different ways that you can skip over errors and resolve replication conflicts. There are also improvements to make the communication between publisher and subscriber more efficient, as well as a new pg_stat_subscription_stats view. So if you want to learn about all these changes in detail, definitely check out this blog post.
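
As a sketch, assuming a hypothetical users table whose primary key is id:

```sql
-- Row filter plus column list: only rows matching the filter, and
-- only the id and email columns, are replicated. When UPDATE and
-- DELETE are published, the filter may only reference replica
-- identity columns (here id, the primary key), and the column
-- list must include them.
CREATE PUBLICATION subset_pub
    FOR TABLE users (id, email) WHERE (id > 1000);

-- Disable the subscription on a conflict instead of retrying
-- the failing transaction forever.
CREATE SUBSCRIPTION subset_sub
    CONNECTION 'host=primary dbname=app'
    PUBLICATION subset_pub
    WITH (disable_on_error = true);
```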

As well as "5mins of Postgres E43: Logical replication in Postgres 15 & why the REPLICATION IDENTITY matters". This is from pganalyze.com and the previous post is what Lukas covers this week. He also emphasizes the importance of the REPLICA IDENTITY when you're setting up this logical replication, because column lists and row filters rely upon the REPLICA IDENTITY, and its columns need to be included if you're going to be making update and delete changes. But definitely check out his episode if you want to learn more about that.
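
For example, adjusting a table's replica identity might look like this (the table and index names are hypothetical):

```sql
-- Use a specific unique index over NOT NULL columns as the replica
-- identity; row-filter columns for UPDATE/DELETE must be covered by it.
ALTER TABLE users REPLICA IDENTITY USING INDEX users_email_key;

-- Or log the entire old row in WAL. This works with any row filter,
-- at the cost of larger WAL records.
ALTER TABLE users REPLICA IDENTITY FULL;
```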

The last post related to logical replication is "Performance impact of row filters and column lists in logical replication". This is from postgresql.fastware.com and they are doing some performance tests against Postgres 15 to see what effect filtering down the number of columns or rows sent over can have. As you can see with initial synchronization, where you have 2 million rows and need to do the initial sync to a new replica, it looks like performance changes pretty linearly with the filter. So if you're going to send all of the rows, it's essentially the same as without a filter.

And if you're only sending 25% of the rows, it takes about 25% of the time. The same thing holds with ongoing replication. It's not as dramatic as the initial sync, but you definitely get performance improvements during ongoing synchronization, and you do get transactions-per-second improvements as well, the less row data you're sending over. This also holds true for using pg_recvlogical if you are consuming those logical changes in an application. They also checked column list changes. Those weren't as dramatic, but definitely led to some improvements. So if you want to learn more about that, definitely check out this blog post.

Next piece of content- "Read Before You Upgrade: Best Practices for Choosing Your PostgreSQL Version". This is from timescale.com and they're talking about the decision of when you should upgrade. They mention a few myths, like that you should upgrade as fast as possible every time, or upgrade only when you have to, or upgrade for every minor version. But then they talk about being more thoughtful about it, and he has a specific plan that he tends to follow; he also mentions his personal procedure and typically how he does it. First, he upgrades major versions when they reach the .2 minor release. So, in other words, 15.1 was just released; he probably would not advocate upgrading to that yet. He'd probably wait for 15.2. Next, upgrade minor versions as they become available. So I'm assuming once the major version has been upgraded, go ahead and upgrade the minor versions as they come, because normally these, essentially, binary changes carry very minimal risk for the upgrade.

Next, upgrade immediately when you are more than two major versions behind. Now, I don't necessarily agree with this; I think you have a little bit more leeway. I was confused by something he stated here, because he said, quote, "...they"– meaning the Postgres community– "are committed to two versions a year". For as long as I can remember, there's been one major release every year. So I don't know why he says two versions a year. And he says, "Only five major versions are supported, so your installation will be supported in approximately three years". But my interpretation is that it's five years; Postgres 10 was released five years ago. So I'm not quite sure what he means by this, but this is one of his recommendations. Next is to upgrade when the security team tells you to. You definitely need to do that. And upgrade because you need the functionality, so there's something in the release that you could use that would make your life easier, essentially. But if you want to learn more, definitely check out this blog post.

Next piece of content- "REWRITE OR TO UNION IN POSTGRESQL QUERIES". This is from cybertec-postgresql.com and he's talking about an older post he wrote, where he was saying how, when you join to another table and use an OR comparing columns between them, the performance gets pretty bad. One potential solution is using a UNION to do the comparison instead. And someone said, 'Well, the Postgres planner should basically use the more efficient process as opposed to you having to decide the best way to do it'. But he actually examined some cases where this doesn't always hold true. So there are certain scenarios where it's not possible to get the same answer by converting an OR, when you're joining two tables, to a UNION between them. So he covers that here and explains exactly why the optimizer doesn't automatically rewrite OR to UNION. So if you want to learn more, definitely check out this blog post.
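
To make the transformation concrete, here's a hypothetical example of the kind of rewrite being discussed (table and column names are invented, not from the post):

```sql
-- A join condition with OR often defeats index usage:
SELECT p.id
FROM people p
JOIN accounts a ON p.email = a.email OR p.phone = a.phone;

-- The manual rewrite: two index-friendly joins glued with UNION.
-- UNION also removes duplicate rows, which is one reason the two
-- forms aren't always equivalent and the planner can't blindly
-- rewrite one into the other.
SELECT p.id FROM people p JOIN accounts a ON p.email = a.email
UNION
SELECT p.id FROM people p JOIN accounts a ON p.phone = a.phone;
```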

Next piece of content- "Using PostgreSQL JSON functions to navigate reviews of restaurants in India". This is from dev.to by Francesco. Now, this post talks a little bit about Aiven.io, but essentially you can just look at the Postgres-related information; it's basically a post explaining how to manipulate and work with JSON data. So he imports a set of JSON data into a database and shows you all the different JSONB functions you can use to manipulate it and pull out data in different ways. So if you're looking for a refresher on different JSON functions and different ways to pull out and manipulate JSON, definitely check out this blog post.
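
A few of the JSONB operators and functions a post like this typically covers (the reviews table and its keys are hypothetical):

```sql
SELECT review->>'restaurant'                  AS name,   -- field as text
       (review->>'rating')::numeric           AS rating, -- cast a field
       jsonb_array_elements(review->'dishes') AS dish    -- expand an array
FROM reviews
WHERE review @> '{"city": "Mumbai"}';                    -- containment test
```
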

Next piece of content- "Securing Patroni REST API End Points Part 2: Using SSL Certificates". This is from percona.com. This is a follow-up post from last week. In this post, they're covering client certificate verification, where the server is set up to verify the certificates that clients present. They show you how to do that, as well as how to handle HAProxy with certificates. So check out this blog post if you want to learn more.

Next piece of content. There was another episode of Postgres FM this week. This one was on "Database branching". So my interpretation is that you have data in a certain state and you want to branch it off and take a copy of it to do testing or experimentation on. You can listen to the episode here, or you can actually watch the YouTube video here.

Next piece of content. The PostgreSQL person of the week is Wilfried Roset. If you're interested in learning more about Wilfried and his contributions to Postgres, definitely check out this blog post.

The last piece of content. We did have another episode of The Rubber Duck Dev Show this past Wednesday evening. This one was talking about "Callbacks: Good or Evil?", specifically Active Record callbacks in Ruby on Rails. So if you're interested in a long-form developer discussion about that, we definitely welcome you to check out our show.
