Slow Queries, 4TB Upgrade, Postgres on ZFS, Storage Stampede | Scaling Postgres 212

In this episode of Scaling Postgres, we discuss dealing with slow queries, upgrading a 4TB database, running Postgres on ZFS and handling a storage stampede.


Transcript

All right. I hope you, your friends, family, and coworkers continue to do well. Our first piece of content is "Dealing With Slow Queries With PostgreSQL". This is from pgdash.io. They're talking about a number of ways that you can deal with slow queries. The particular scenario they had in mind is when you do a deployment, because they basically say at the top here quote "In every deployment, there are always a few queries that run too slow". So basically, you make a change and you've created some slow queries. So how do you track this? Well, the main way that you track this is through pg_stat_statements. So you can track it to see what statements are taking the most time in your database. But that has a few disadvantages. One is that it collects cumulative statistics. So if you just did a deployment, there's no immediate way to tell what changed unless you're taking the contents of pg_stat_statements and copying it on a periodic basis, say every hour, to some location.
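
The post leaves the snapshotting mechanics up to you, but as a rough sketch of what I mean (my own example, not theirs; the table name is made up and the column names are the ones from Postgres 13 and later), you could copy pg_stat_statements into a history table on a schedule and then diff the captures taken before and after a deploy:

```sql
-- Hypothetical snapshot table; assumes the pg_stat_statements extension is installed.
CREATE TABLE IF NOT EXISTS statements_history AS
    SELECT now() AS captured_at, queryid, query, calls, total_exec_time, mean_exec_time
    FROM pg_stat_statements
    WITH NO DATA;

-- Run this from cron or pg_cron every hour; comparing two captured_at values
-- then shows which statements' calls and total time jumped after a deploy.
INSERT INTO statements_history
SELECT now(), queryid, query, calls, total_exec_time, mean_exec_time
FROM pg_stat_statements;
```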

And then you can see a change that has happened as a result of a deployment, for example. So basically, you may need to use some other tool or technique to be able to track changes to these cumulative statistics. The other thing it doesn't include is query parameters. So those aren't included in the queries in pg_stat_statements. Now, that can have an impact because maybe you're querying a column with one value that an index is highly efficient on, so it uses the index and returns quickly, whereas for another value it's more efficient to do a sequential scan and that's what you're going to get. So it can be unclear whether this is really a slow query or not; maybe it depends on the values that you're selecting. So that's another disadvantage. Another place you can look for slow queries is in the logs themselves: you can set log_min_duration_statement so that if a statement exceeds a particular value, say 5 seconds, it will be logged to the Postgres log.
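
For reference, the relevant logging settings look roughly like this; the thresholds are just example values, and the two sampling parameters are available in Postgres 13 and later:

```sql
-- Log any statement that runs longer than 5 seconds.
ALTER SYSTEM SET log_min_duration_statement = '5s';
-- Optionally, sample statements above a lower threshold instead of logging all of them.
ALTER SYSTEM SET log_min_duration_sample = '1s';
ALTER SYSTEM SET log_statement_sample_rate = 0.1;  -- log roughly 10% of qualifying statements
SELECT pg_reload_conf();
```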

So you can examine that and see what queries are slow. There are a number of other parameters that control how long a statement has to run before it's recorded, and you can also do sampling if logging everything would produce too many statements. The other thing that people tend to do is use the auto_explain extension. So basically, if any slow query is logged, go ahead and log the EXPLAIN plan for that query as well. They talk about setting that up and setting some configuration variables to get that working; a minimal sketch of that setup follows at the end of this paragraph. Then you have the issue of still-executing queries: a query only ends up in pg_stat_statements once it's completed, so anything that is still running is not going to be there.
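
Here's that auto_explain sketch, loading it just for the current session for a quick test; for all sessions you'd put it in shared_preload_libraries or session_preload_libraries instead, and the threshold is only an example:

```sql
LOAD 'auto_explain';                        -- session-level load, for experimentation
SET auto_explain.log_min_duration = '5s';   -- log the plan of any query slower than 5 seconds
SET auto_explain.log_analyze = on;          -- include actual row counts and timings in the logged plan
```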

So you need to look at pg_stat_activity to get information on what's currently running. And you want to be really wary of sessions that are "idle in transaction", particularly for a long time, because that can cause a whole host of problems with Postgres, with vacuum not being able to complete and, eventually, a txid wraparound issue. The next issue they talk about is locks, so you can turn on log_lock_waits. You can also adjust the deadlock timeout if you need to. And they also talk about using the pg_locks system view if you want real-time information on what locks are happening in the system. So these are the different places you can go to try to figure out how to deal with slow queries in Postgres.
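
Just to illustrate that pg_stat_activity check (the 5-minute cutoff here is arbitrary):

```sql
-- Sessions that have been in a transaction for more than 5 minutes,
-- including the "idle in transaction" ones that can hold back vacuum.
SELECT pid,
       state,
       now() - xact_start  AS transaction_age,
       now() - query_start AS query_age,
       left(query, 80)     AS query
FROM pg_stat_activity
WHERE state IN ('active', 'idle in transaction')
  AND xact_start < now() - interval '5 minutes'
ORDER BY xact_start;
```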

Next piece of content- "How Retool upgraded our 4 TB main application PostgreSQL database". This is from retool.com and they're talking about an upgrade they essentially needed to do because they were on version 9.6, which was slated for end of life on November 11, 2021, and they chose 13 at the time. They chose 13 because they wanted the most amount of time before they would have to upgrade again, and also because PostgreSQL 13 had a number of enhancements they were looking forward to, such as parallel query execution, parallel vacuuming of indexes, as well as safely adding columns with default values. In terms of the upgrade strategy, the easiest of course is a pg_dump and then a pg_restore. But at four terabytes, as he mentions here, that's going to take days, and really they wanted to have just 1 hour of downtime.

Now, whenever I hear something like that, I think pg_upgrade. However, they're using Azure's hosted Postgres solution, and I guess you can't use pg_upgrade there, because they didn't mention it in this post. I've converted databases even larger than this using pg_upgrade and it takes less than a minute, two minutes max, to do the actual conversion. There's preparation that needs to be done as part of it, but the actual upgrade itself usually takes less than a minute. Instead, they decided on a logical replication solution: basically, get another database on the new version of Postgres in sync with their primary via logical replication and then transition to it. Now, they tried using Azure's database migration service, but it gave them a number of issues, to the point that some long-running transactions blocked Postgres's autovacuum and eventually caused them some downtime.

Now, this wasn't during the upgrade, but it looks like it was in preparation leading up to it, so that's unfortunate. What they ultimately decided to use is something called Warp. They also considered pglogical, but that doesn't work on the platform they were on, Azure's single-server Postgres product, and really an extension like that would be the way to go because native logical replication didn't exist prior to version 10, so from 9.6 you're going to need some sort of extension or external tool to do it. But they chose Warp, and even that looked like it had some issues. Now, they seemed very concerned about the timing, which I'm a little confused by, because my understanding of a logical replication upgrade is that once you have the two databases in sync you can take your time to make sure everything is exactly the way you want, and then the actual transition is relatively fast.

But they were also talking about skipping some tables in Warp and manually converting two massive tables separately. Now, one was a log table, but the other one seemed pretty important, so I'm not quite sure why you would leave that behind and do that conversion separately; it wasn't clear to me in the blog post. They did do a fair amount of testing in staging and some practice run-throughs, which is definitely the right thing to do for a smooth conversion. Then another issue: they had to remove foreign key constraints while doing the conversion. So it's really a bit inconvenient in places, which is why I definitely like using pg_upgrade. But if you want to learn more about their transition, or maybe you're on Azure and thinking about how to migrate to a more recent version, this is definitely a great post to review the challenges and successes they had.
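
For context, the general shape of a logical-replication upgrade when both ends are on version 10 or newer (which wasn't an option for Retool's 9.6 source, hence pglogical or Warp) looks roughly like this; the host and object names here are placeholders:

```sql
-- On the old (source) cluster: publish all tables.
CREATE PUBLICATION upgrade_pub FOR ALL TABLES;

-- On the new (target) cluster, after loading the schema with pg_dump --schema-only:
CREATE SUBSCRIPTION upgrade_sub
    CONNECTION 'host=old-primary dbname=app user=replicator'
    PUBLICATION upgrade_pub;

-- On the source, watch the replication slot until the lag is near zero, then cut over.
-- Note that sequences are not replicated and need to be reset on the new cluster.
SELECT slot_name,
       pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn) AS lag_bytes
FROM pg_replication_slots;
```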

The next piece of content- "Our Experience with PostgreSQL on ZFS". This is from lackofimagination.org. Typically, if you're using Postgres on Linux (I tend to use Ubuntu), it uses the ext4 file system, but ZFS has continued to mature. I actually have some clients that are using ZFS for their Postgres instances, primarily for the compression. In this post they mention some of their reasons for using ZFS. The first one is being able to do file system snapshots instantaneously, and they're actually using those snapshots as their backups on an hourly basis. Now, it is of course better to use a point-in-time recovery mechanism that keeps track of all the WAL, because with that you can restore to any point in time, whereas with hourly snapshots you can only restore to the top of an hour.

But they found this advantageous. The second reason, as I mentioned, is the built-in compression. In their case, they were able to get a 4.5x compression ratio, which is pretty large. So they're saving a lot of space in their database by compressing it, and it can actually result in fewer disk accesses, in exchange for some increased CPU usage to handle the compression. The third reason they mention is built-in support for secondary caching on SSD disks. They're using cloud instances, and some cloud instances, like on AWS, have disks attached to the instances themselves; they're actually using these instance volumes as a read cache, which gives them better performance by serving reads from those disks as opposed to going through the storage system. And it has other nice features: they like being able to detect bit rot, and ZFS essentially gives them their own implementation of software RAID.

And then they go into the backups, which basically use the snapshots: they keep 23 hourly and 14 daily snapshots. They also give you all the commands to set up ZFS on an Ubuntu machine. So they show you creating the main ZFS pool with the two drives they're using in RAID-0, then adding their local SSD disks as the cache, creating their volumes, setting up compression, adding a few other parameters, and then installing Postgres. The other thing they did is turn off full-page writes in Postgres when using ZFS because, as they say, quote "By design it's impossible to write partial pages on ZFS". And then they give you the backup script they use; they're doing the backup from a separate machine that accesses the primary machine over SSH, although I would probably just have this as a script on the local machine to avoid network issues. There's also a comment about some of this below the post as well. But this is a great post that shows you how to get Postgres set up on ZFS.
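
On the Postgres side, the setting they call out, plus two WAL settings that are often suggested for copy-on-write filesystems (my addition, not from the post; both exist in Postgres 12 and later), look like this:

```sql
-- ZFS never writes partial pages, so torn-page protection in the WAL can be disabled.
ALTER SYSTEM SET full_page_writes = off;
-- Commonly suggested for copy-on-write filesystems, where zero-filling and
-- recycling WAL segments provides little benefit.
ALTER SYSTEM SET wal_init_zero = off;
ALTER SYSTEM SET wal_recycle = off;
SELECT pg_reload_conf();
```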

The next piece of content- "Steady Storage Stampede". This is from enterprisedb.com and they're talking about a scenario where, say, you're moving your database to a cloud solution and you're picking kind of the default for the volumes that you're going to use. In this case, they're using AWS gp2 storage with 1,500 IOPS and 3,000 IOPS burst performance. They ran a pgbench benchmark and noticed that the transactions per second were around 2,000, but then it fell off a cliff and went up and down, being very erratic for the duration of the test. They wondered what was going on and looked at the Linux dirty buffers in RAM, and noticed they kept increasing until essentially hitting their limit, which is when the performance started to become erratic. And then even once the test was stopped, it took a fair amount of time, 215 seconds, before the buffers were fully drained and written to the disks. So the erratic performance is bad, but taking so long to drain the memory and get it onto the disks is bad as well.

So they said, let's increase the size of the dirty buffers up to 16GB, and they used vm.dirty_bytes to do this. Now you see really high TPS at first, and then it just falls off. But when you look at the Linux dirty buffers, you can see they're being filled up; it doesn't quite reach the 16GB they set, but then it takes forever to drain too. So essentially, you're just using up all of that RAM and it's not getting to the disk fast enough. So then they said, well, let's try throttling the buffers and make them smaller, using 1GB instead. Here it starts off at 2,000, but then it quickly gets saturated. The buffers get saturated at that 1GB limit, but the performance isn't as erratic and it stays in and around 1,000 before the test finishes. Okay, so this is an indication that, essentially, the disks aren't fast enough, so let's throw more IOPS at the problem. So they used 10,000 IOPS to see what they would get, and here the peak, as opposed to 2,000 before, is considerably higher.

Now that you've given it all this performance, the peak is in excess of 5,000, but eventually it does saturate the dirty buffers and falls back to around 4,000; this is with 8GB of dirty buffers, which is what you see here. Now, the thing to keep in mind is that this is a consistent test. Some of this may be hard to reproduce if you have erratic activity in your application, because this is a staged test with consistent activity. So to me, the point is that when you see this kind of erratic behavior in your TPS, where things are doing fine and then suddenly drop off, maybe take a look at what your dirty buffers are doing at the operating system level, because maybe you're saturating them. Then you may need to either decrease them to throttle things, increase them a little more, or of course increase your disk performance. Another way you could check this is just by looking at the disk IOPS of your volume; at least in AWS you can pretty easily tell if you're saturating it at some point. But if you want to learn more, definitely check out this blog post.

The next piece of content- "The Vanishing Thread and PostgreSQL TCP Connection Parameters". This is from tech.instacart.com. This was a scenario where they had a Ruby on Rails application that suddenly started showing a 15-minute timeout for certain threads. They're saying, what's going on? The timeouts we have set are 5, 10, or 15 seconds, not 15 minutes. So they considered the different parts of their architecture: they have their application, which uses Puma threads, a network load balancer, a pgbouncer, and a database. Well, the database has a statement timeout on the order of seconds, and they're also using log_min_duration_statement and nothing is getting logged anywhere near a 15-minute duration.

So it doesn't look like anything's going on with the database. With pgbouncer it's something similar: they have a query_wait_timeout, so either the query is going to be executed or the connection is canceled, so it shouldn't be happening there either. So it looks like something between the application and the network load balancer, or maybe between the load balancer and pgbouncer. When they looked at the application-to-load-balancer traffic, they noticed a lot of TCP retransmits, and they occurred prior to these 15-minute timeouts. It looks like the Linux kernel, by default, will quote "...retransmit unacknowledged packets up to 15 times with exponential backoff", so basically the default settings can lead to up to a 15-minute delay. And with one of their background workers that doesn't have an application-level timeout, they saw an error that says "connection timed out", which basically comes from the operating system.

And they were able to replicate this type of behavior using tcpdump and iptables: they made a connection, then broke it, and got the same behavior when trying to send traffic to the database. So basically, how they fixed this is by modifying the TCP connection parameters: shortening the keepalive idle time, the keepalive interval and count, as well as setting a TCP user timeout. Again, this is application specific, so you wouldn't want to just copy these values into your application, but with these parameters they were getting things canceled appropriately rather than taking 15 minutes. This post, as you can tell, is very, very long, and goes into a lot of detail about these different settings and where they set them to get the behavior they're looking for.
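
The post adjusts these on the client side, where libpq exposes them as the keepalives_idle, keepalives_interval, keepalives_count, and tcp_user_timeout connection parameters; the server has analogous settings, and purely as an illustration with made-up values they look like this:

```sql
-- Server-side equivalents of the TCP settings discussed (values are only examples).
ALTER SYSTEM SET tcp_keepalives_idle = 60;      -- seconds of idle time before the first keepalive probe
ALTER SYSTEM SET tcp_keepalives_interval = 10;  -- seconds between probes
ALTER SYSTEM SET tcp_keepalives_count = 6;      -- lost probes before the connection is considered dead
ALTER SYSTEM SET tcp_user_timeout = '30s';      -- give up on unacknowledged data after this long (Postgres 12+)
SELECT pg_reload_conf();
```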

Now, this doesn't identify what the exact problem is, but it at least mitigates it and allows these connections to be canceled. So the open questions that would finally resolve the issue are: first, why does the Ruby-PG library sometimes ignore Ruby interrupts or timeouts while waiting on the TCP layer? Why isn't Ruby canceling if it's not able to get the information it needs from the database, instead of just staying open? And secondly, what's the root cause of these spikes in retransmissions? Is there an issue with the network, or with the network load balancer? We don't have those answers right now, but this is a very interesting deep dive into using TCP connection parameters to work around an essentially unknown issue.

The next piece of content- "PostgreSQL 14 B-Tree: Reduced Bloat with Bottom-Up Deletion". This is from percona.com and they're talking about preventing page splits by doing bottom-up deletion. Basically, they show an example comparing Postgres 13 and 14: they created a table, loaded some data into it, applied two indexes, and the sizes of the tables and indexes were pretty much identical. Then they did four updates of the entire table to churn a lot of data, and then checked the sizes again on Postgres 13 and 14. The tables were exactly the same size; however, the indexes were larger in version 13 compared to version 14. So that's essentially this bottom-up deletion in action, which I kind of think of as a just-in-time vacuum. It does some cleanup in the index pages ahead of vacuum coming in to do its thing, basically trying to avoid those page splits in the indexes.
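
Their comparison is roughly along these lines (my paraphrase of the shape of the test, not the post's exact script), run once on 13 and once on 14 so the resulting index sizes can be compared:

```sql
-- Table with two extra indexes, loaded with a million rows.
CREATE TABLE bloat_test (id int PRIMARY KEY, val1 int, val2 text);
INSERT INTO bloat_test
SELECT g, g % 100, md5(g::text) FROM generate_series(1, 1000000) g;
CREATE INDEX ON bloat_test (val1);
CREATE INDEX ON bloat_test (val2);

-- Update every row, and repeat this a few times, to generate lots of dead index entries.
UPDATE bloat_test SET val1 = val1 + 1;

-- Then compare table and index sizes between the two versions.
SELECT relname, pg_size_pretty(pg_relation_size(oid))
FROM pg_class
WHERE relname LIKE 'bloat_test%';
```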

But if you want to learn more about this, it's also the content that the post "5mins of Postgres E14: HOT Updates vs Bottom-Up Index Deletion in Postgres 14" covers. There, pganalyze.com goes through this post and a few others to get into more detail about how this works. So if you're interested, definitely check out that piece of content as well.

The next piece of content is actually a YouTube video, "PostgreSQL Partitioning Tutorial", from the High-Performance Programming YouTube channel. He explains that the reason people tend to use partitioning is that it can result in some performance improvement, although the improvements aren't drastic. But the more important reason he also mentions is maintenance. When you have, say, a terabyte table, vacuuming that table is going to take a long time, particularly if it has a lot of indexes, whereas if it's partitioned, vacuum can operate on it much more easily. So it's really the maintenance reasons that are the main reasons I started moving to partitioned tables, because the performance, while better, is not blow-the-doors-off better in my experience. But he explains the different types of partitioning and also runs through the commands and shows you how to partition a table. So if you're interested in that type of content, definitely check out this video.
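
If you just want the basic shape of the declarative partitioning the video walks through, a tiny range-partitioning example (table and column names are mine) looks like this:

```sql
-- Parent table partitioned by a date range.
CREATE TABLE measurements (
    device_id   int NOT NULL,
    recorded_at date NOT NULL,
    value       numeric
) PARTITION BY RANGE (recorded_at);

-- One partition per month; vacuum and other maintenance now operate per partition.
CREATE TABLE measurements_2022_05 PARTITION OF measurements
    FOR VALUES FROM ('2022-05-01') TO ('2022-06-01');
CREATE TABLE measurements_2022_06 PARTITION OF measurements
    FOR VALUES FROM ('2022-06-01') TO ('2022-07-01');
```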

The next piece of content- "Two-phase commits for logical replication publications/subscriptions". This is from postgresql.fastware.com and they're talking about a new feature added to Postgres 15 where you can actually do a two-phase commit for your logical replication. So basically, when you set up your subscriber, you can choose to make it a two-phase commit subscriber. So if you want to learn more about that, definitely check out this blog post.
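
The feature is opted into per subscription; a minimal sketch, with placeholder connection details, would be something like this (the subscriber also needs max_prepared_transactions set above zero so the apply worker can prepare transactions):

```sql
-- On a Postgres 15 subscriber: enable two-phase commit for this subscription.
CREATE SUBSCRIPTION my_sub
    CONNECTION 'host=publisher dbname=app user=replicator'
    PUBLICATION my_pub
    WITH (two_phase = true);
```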

The next piece of content- "Database Migration: SQLite to PostgreSQL." This is from bytebase.com. Now, this isn't someone with a big application going from SQLite to PostgreSQL; this is for an embedded solution. They chose to change their embedded database from SQLite to PostgreSQL, and the post covers how they did that. So if you're interested in that, you can check out this blog post.

Next piece of content is actually a presentation and it's "Postgres in a Microservices World" by Bruce Momjian. And this is 111 slides worth of content. So if you're interested in that, you can definitely check out this presentation.

The next piece of content: the PostgreSQL person of the week is Bharath Rupireddy. If you're interested in learning more about Bharath and his contributions to Postgres, definitely check out this blog post.

And we had another episode of The Rubber Duck Dev Show this past Wednesday evening. This one was on "Where Should You Put Business Logic?". Basically, your application framework may have guidance on where to put things, but what's the best way to structure a web application? So if you're interested in that type of content, we definitely welcome you to check out our show.
