
Vacuum Speed Up, Faster Foreign Tables, Fast Queries With Union, Query Optimizer | Scaling Postgres 158


In this episode of Scaling Postgres, we discuss a vacuum speed up and faster foreign tables in Postgres 14, running faster queries with union and learning about the query optimizer.

Content Discussed

YouTube Video

Podcast Audio

Transcript



[00:00:00] In this episode of Scaling Postgres, we talk about a vacuum speedup, faster foreign tables, fast queries with union, and a query optimizer. I'm Creston Jamison, and this is Scaling Postgres episode 158.

[00:00:22] All right, I hope you, your friends, family and coworkers continue to do well. Our first piece of content is "Speeding up recovery and vacuum in Postgres 14". This is from Citusdata.com, and they're talking about an enhancement that was recently made to the upcoming Postgres 14, to be released in the fall. What they changed is the compactify_tuples function. Now, that's used in three different areas: when PostgreSQL starts up after a non-clean shutdown (basically crash recovery), during the recovery process that physical standby servers constantly run to replay WAL changes, and by vacuum. When they did some profiling, the specific area that had issues was a high CPU load coming from replaying heap2 clean WAL records, and they tracked it down to this function. What this does, basically: they show an example of a heap page in Postgres. You have the page header and an array of item pointers that point to the actual tuples of data within the page. During update and delete operations, certain space becomes unused, so there is a no-longer-needed tuple that eventually needs to be vacuumed up. Those three processes I mentioned all compact the space within a page, and after compaction, as things work today, the tuples can essentially end up out of order relative to the array of item pointers at the beginning of the page. Now, in the existing compaction process, the part of compactify_tuples that took the longest was actually the qsort it does to make sure it doesn't overwrite data it still needs during compaction. They reengineered it so that the sort isn't needed at all: it keeps the items in order so it can always write safely, and if things do come out of order, it copies some items to memory to compact the page, but either way it avoids that sort. The blog post goes into a lot more detail, but they also show some performance results. Number one, they checked crash recovery, because that seemed to give the most bang for the buck: they did a pgbench run with 10 million rows in the database, did a bunch of updates, and then forced a crash of the system, therefore requiring a recovery when it came back up. After the change, it came back up 2.4 times faster, so more than twice as fast. But crash recovery doesn't happen all that often. What does happen a lot, of course, is vacuum, and vacuum performance changed as well: with this change, vacuum runs 25% faster. Anything that makes vacuum run faster is going to be better. So to me, even though it's not over twice as fast like recovery, vacuum happens so much more frequently that this is the more important speedup, in my opinion.

[00:03:32] Now of course, this also impacts the recovery process of replicas as well, which minimizes the chance that they're going to fall behind the primary. So, definitely some great enhancements coming in Postgres 14. And if you want to find out more about the details, definitely check out this blog post.
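
If you want a rough feel for the vacuum side of this on your own hardware, a minimal sketch along the lines of their pgbench-style test might look like the following. The table and column names here are hypothetical, and you'd run the same script against a Postgres 13 server and a Postgres 14 build to compare timings:

    -- Build an update-heavy table so every page ends up with dead tuples,
    -- then time how long vacuum takes to clean it up.
    CREATE TABLE vacuum_test (id int PRIMARY KEY, payload text);
    INSERT INTO vacuum_test
    SELECT g, md5(g::text) FROM generate_series(1, 10000000) g;
    UPDATE vacuum_test SET payload = md5(payload);  -- every row gets a dead version
    \timing on                                       -- psql: show elapsed time
    VACUUM (VERBOSE) vacuum_test;                    -- compare elapsed time across versions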

[00:03:49] The next piece of content is "Faster Bulk Insertion to Foreign Tables - Introduction to PostgreSQL 14 Committed Features", and this is from postgresroad.blogspot.com. This is talking about an enhancement that's been worked on for Postgres in terms of inserting data through a foreign data wrapper. Now, this post goes into a lot of the background about why they're working on this in particular. They're mainly focusing on the eventual scale-out potential of Postgres, and getting the foreign data wrapper working efficiently is going to help Postgres scale out faster. They had someone who was testing this and seeing some poor performance, so they took a look at it, and basically the insertions were happening one row at a time. That's very inefficient, particularly if you have a network round trip to an external foreign server. So they developed a way to batch the insertions, and they show some of the timing changes they saw. For example, with a local table, without using a foreign data wrapper, the insertions they tested finished in 6.1 seconds. Before their improvement, using a foreign data wrapper, it took 125 seconds. After their improvement, the insertion into the foreign table took 11.1 seconds. So it's about twice as slow as a local table, but about eleven times faster than Postgres 13 and below can do it. Definitely a huge improvement, but it still has some impact compared to local table insertion. Now, they did mention that the foreign server they're using is actually on the same machine, so there's no network latency, and you have to take that into account. They also split the target table into eight hash partitions, so presumably multiple tables that they're pushing the data out to. With that, it was just a little bit slower at 12.5 seconds compared to the 11 seconds before, but even the local table was slower in that case. So definitely some great performance improvements when doing inserts to foreign data wrappers. The post also mentions that they're working on the ability to use multiple CPUs as well, so maybe you can parallelize the inserts across multiple different servers. So, some more interesting changes coming in Postgres 14, and if you want to find out more, definitely check out this blog post.
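
In practice, this batching surfaces in Postgres 14 as a batch_size option you can set on a postgres_fdw server or on an individual foreign table. Here's a minimal sketch; the host, database, user, and table names are hypothetical:

    CREATE EXTENSION IF NOT EXISTS postgres_fdw;

    CREATE SERVER remote_pg FOREIGN DATA WRAPPER postgres_fdw
      OPTIONS (host 'remote-host', dbname 'appdb', batch_size '100');  -- batch_size is new in Postgres 14

    CREATE USER MAPPING FOR CURRENT_USER SERVER remote_pg
      OPTIONS (user 'app', password 'secret');

    CREATE FOREIGN TABLE items_remote (id int, payload text)
      SERVER remote_pg OPTIONS (table_name 'items');

    -- Inserted rows are now shipped to the remote server 100 at a time
    -- instead of one network round trip per row.
    INSERT INTO items_remote
    SELECT g, md5(g::text) FROM generate_series(1, 100000) g;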

[00:06:14] The next piece of content is "Speeding up SQL queries by orders of magnitude using UNION". This is from Foxhound Systems, and basically they had to develop a pretty comprehensive query to return some data, and they have what they call a diamond schema. So there are multiple paths to query it, because everything is interrelated: you'll see it's essentially a circular way you could do the joins. When they pulled out one segment of the data they needed, it ran in 1.5 milliseconds, so that's super fast. But then they joined to the other set of data they needed to bring in, and they were using a fair amount of ORs. Anytime you have ORs, you can think, okay, this is probably not going to result in great performance. And with that extra data added in with the join, it finished in 3.2 seconds, so more than 2,000 times slower. Now, some of that was due to there being more data in some of the tables they joined to, but their idea was: hey, maybe we can query both of these separately and then combine them, as opposed to using the ORs. So they took the original query, which doesn't use any ORs and just pulls the exact data needed for one set of data, which they're calling the employee meal items, and then they wrote another query to pull the customer meal items. So they query them separately, not using any ORs, and then they merged the results together using a UNION ALL. The original query ran in about one millisecond, the second one doing the customer meal items ran in 102 milliseconds, and combining them both with a UNION ALL ran in 112 milliseconds. So pretty fast, and much faster than the more-than-three-second query. Of course, they have some conclusions here, but basically the way I look at this is: when you have ORs, you're essentially going to slow down your queries, and this is one technique you can use. Address each of those conditions as smaller, shorter queries and then union them, or merge them together, at the end; it might yield better performance, as this example demonstrates. So if you want to learn more about this technique, definitely check out this blog post.
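
Just to make the shape of the rewrite concrete, here's a minimal sketch with hypothetical table names. The point is that each arm of the UNION ALL is a narrow query the planner can drive off an index, whereas the OR spanning two tables forces a much more expensive plan. Note that UNION ALL keeps duplicates, so the two arms should be disjoint, or you can use plain UNION instead:

    -- Slow shape: ORs spanning two joined tables.
    SELECT oi.*
    FROM order_items oi
    JOIN meals m ON m.id = oi.meal_id
    WHERE m.employee_id = 123
       OR oi.customer_id = 456;

    -- Faster shape: run each condition as its own index-friendly query, then combine.
    SELECT oi.*
    FROM order_items oi
    JOIN meals m ON m.id = oi.meal_id
    WHERE m.employee_id = 123
    UNION ALL
    SELECT oi.*
    FROM order_items oi
    WHERE oi.customer_id = 456;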

[00:08:39] The next piece of content is "How the PostgreSQL query optimizer works". This is from Cybertec-postgresql.com, and here they go through the different techniques the optimizer uses to determine the best path for queries in Postgres. Now, this is a very comprehensive blog post, and they cover a variety of techniques used by the planner. The way he demonstrates this is by writing queries and then running EXPLAIN to see, okay, what is the optimizer actually doing. He demonstrates things like constant folding; function inlining; determining whether to run a function once per row or once for the whole query, based on whether it's volatile, stable or immutable; subselect flattening; and the different types of joins. So this is a very comprehensive blog post that I encourage you to read if you want to learn more about how the optimizer translates your SQL into the actual plan that's executed against Postgres, because it can help you get a sense of why Postgres is doing what it's doing. So if you want to learn more about that, I highly encourage you to check out this blog post.
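
As a small taste of the kind of thing the post demonstrates, here's a hedged example of constant folding and function volatility that you can run yourself; the double_it function is made up purely for illustration:

    -- Constant folding: the planner computes 2 + 2 once, so the filter becomes id = 4.
    EXPLAIN SELECT * FROM generate_series(1, 10) AS t(id) WHERE id = 2 + 2;

    -- Volatility: an IMMUTABLE SQL function on constant input can be folded (and inlined)
    -- at plan time, while a VOLATILE function would have to be evaluated per row.
    CREATE FUNCTION double_it(int) RETURNS int
      AS 'SELECT $1 * 2' LANGUAGE sql IMMUTABLE;
    EXPLAIN SELECT * FROM generate_series(1, 10) AS t(id) WHERE id = double_it(2);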

[00:09:57] The next piece of content is "How to check and resolve bloat in PostgreSQL". This is from Highgo.ca, and the first thing they cover is why bloat happens. Basically, with Postgres's multiversion concurrency control (MVCC), every update actually happens as an insert, and the old row eventually isn't used anymore and can be vacuumed up. Same thing with deletes: the row isn't actually removed, it's just marked as deleted, and eventually the space can be reclaimed. But that vacuum process doesn't free up disk space, so you can end up with bloat if that space can't be reused. Now, how do you detect it? There are a few techniques you can use, but they're not very accurate using just base Postgres. There's an extension, pgstattuple, that probably does the best job of determining how bloated your tables are. Here they did a test, and they found that for a fresh table the bloat was 0%, whereas after, I believe it was a delete process, they showed that the bloat was at 30%. Similarly, indexes can also be bloated, and they show that by checking an index. Now then the question comes: once you've identified bloat, how do you deal with it? The first option they discuss is VACUUM FULL. This essentially rewrites the whole table; the disadvantage is that you cannot read or write to it while the VACUUM FULL is going on, so that's really hard to do in production. CLUSTER is similar in that it rewrites the table, but it organizes the data on disk according to an index, so you cluster on a particular index and it orders the data that way. But much like VACUUM FULL, you cannot read or write to the table while it runs. pg_repack is a separate extension that does online rewriting of tables and indexes, so you can leave the table online while using it. The downside is that it can take a fairly long time to run depending on your table size, and you're going to use two to three times the amount of space of your table to do it. So if you're short on disk space, this unfortunately isn't really an option. Then the other option is reindexing. This just addresses indexes: you can rebuild them quite easily with REINDEX CONCURRENTLY on the most recent versions of Postgres, which essentially rewrites the index. But of course it doesn't do anything to help with table bloat. If you want to learn more, you can check out this post.
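
If you want to try the detection and cleanup steps they describe, a minimal sketch looks like this. The table and index names are hypothetical, and keep in mind that pgstattuple does a full scan, so be careful running it against very large tables:

    CREATE EXTENSION IF NOT EXISTS pgstattuple;

    -- Table bloat: look at dead_tuple_percent and free_percent.
    SELECT * FROM pgstattuple('my_table');

    -- Index bloat for a btree index: look at avg_leaf_density.
    SELECT * FROM pgstatindex('my_table_pkey');

    -- Rebuild a bloated index without blocking reads or writes (Postgres 12+).
    REINDEX INDEX CONCURRENTLY my_table_pkey;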

[00:12:20] The next piece of content is "Waiting for PostgreSQL 14 - Allow configurable LZ4 TOAST compression". By default, Postgres compresses TOAST data. TOAST is where Postgres stores data that exceeds the space available within a tuple, so it's a separate storage area, and what it stores in that TOAST area is compressed. By default, it uses Postgres's own pglz implementation of the LZ family of algorithms, but with this patch in Postgres 14 they've added the option to use LZ4. Now, you need to do a few things to enable it, particularly building Postgres with the --with-lz4 option, so you kind of need to know ahead of time that you want to use LZ4 compression. But the advantage is that it was about twice as fast as the native compression. So if you're looking for more speed when compressing your TOAST tables, you may want to check out this new compression option coming in Postgres 14. If you want to learn more, definitely check out this blog post.
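
Assuming your Postgres 14 server was built with --with-lz4, switching over is just a column option or a setting. A minimal sketch with a hypothetical table:

    -- Per-column: values TOASTed from this column are compressed with LZ4.
    CREATE TABLE docs (
        id   bigint PRIMARY KEY,
        body text COMPRESSION lz4
    );

    -- Or change the default compression for newly stored values.
    SET default_toast_compression = 'lz4';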

[00:13:20] The next piece of content is "Sharding Postgres on a single Citus node: how, why and when". This is talking about the new Citus 10, which is open source: you can now create distributed tables within a single node. And then of course the question is, well, why would you want to do that? This post lists four reasons you might. One is query parallelization for multi-shard queries. Basically, you can get query parallelization by setting up a distributed table within a single node, although Postgres already has some parallel query features, so I don't know how advantageous this is compared to those; they didn't discuss that here. The second reason is smaller indexes to create and maintain. You can definitely get that with this, but you can also get it with partitioned tables if you happen to want to partition that data.

[00:14:11] Third is smaller tables to autovacuum in parallel; again, partitioning can give you that, so I don't know how big of an advantage this is compared to partitioned tables. And then fourth is faster bulk data loads, which you could also get with partitioned tables. So for a lot of the reasons they give here, you can get the same benefit with native Postgres. That only leaves the reason that you eventually want to scale out. If you envision your database growing super fast and needing to scale out, then at that point it makes sense to go with Citus, it seems like. They talk about how you would set it up on a single node, and then, once you're ready, how to actually bring up different worker nodes to send those sharded tables to when you want to scale out. So if you want to learn more, you can check out this blog post from Citusdata.com.
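
For reference, distributing a table on a single Citus 10 node comes down to one function call. Here's a minimal sketch; the table and columns are hypothetical, and Citus itself has to be loaded via shared_preload_libraries:

    CREATE EXTENSION citus;   -- requires citus in shared_preload_libraries

    CREATE TABLE events (
        tenant_id bigint NOT NULL,
        event_id  bigint NOT NULL,
        payload   jsonb,
        PRIMARY KEY (tenant_id, event_id)
    );

    -- Shards the table on tenant_id (32 shards by default), even on a single node.
    -- When you later add worker nodes, those shards can be moved off the coordinator.
    SELECT create_distributed_table('events', 'tenant_id');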

[00:15:02] The next piece of content is actually a YouTube channel. This is EDB, and in the last week they've had a number of Postgres webinars posted, each about an hour in length. So if you want some Postgres video content, definitely check out this YouTube channel.

[00:15:20] The next piece of content is "Explaining your Postgres query performance". This is from Crunchydata.com, and they're talking about EXPLAIN and EXPLAIN ANALYZE for analyzing your queries to make them faster. Now, this is a follow-up to a post where they talked about using pg_stat_statements to identify the statements that are taking the most time in your database. Once you have those, you want to look at each statement using EXPLAIN ANALYZE to determine how to optimize those queries so they run faster and use fewer resources. So they go through the process of interpreting the output of EXPLAIN ANALYZE to help you optimize your queries. So you can check out this blog post.
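
The workflow they describe boils down to two queries. Here's a minimal sketch, assuming pg_stat_statements is installed and using the Postgres 13+ column names; the application query at the end is hypothetical:

    -- Step 1: find the statements eating the most total time.
    SELECT query, calls, total_exec_time, mean_exec_time
    FROM pg_stat_statements
    ORDER BY total_exec_time DESC
    LIMIT 10;

    -- Step 2: run the worst offenders through EXPLAIN ANALYZE (BUFFERS adds I/O detail).
    EXPLAIN (ANALYZE, BUFFERS)
    SELECT * FROM orders WHERE customer_id = 42;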

[00:15:58] The next piece of content is "Ansible collection for PostgreSQL and EDB components". This is from Enterprisedb.com. They've set up an Ansible collection they're calling EDB Postgres, with a link to it right here on GitHub. Basically, they've defined a whole bunch of Ansible roles for setting up different components of a Postgres infrastructure, and I believe they're using Red Hat Enterprise Linux for this, because they also intend to add Debian and Ubuntu support in the future, offer more backup and recovery options such as pgBackRest, and more. So if you want to use these playbooks and roles, or even get a sense of how they've chosen to set them up, you can definitely check out this open source project.

[00:16:47] The next piece of content is "No space left on device". This is from mydbanotebook.org, and this post talks about what happens when you run out of space on your database server. Basically, the first thing is that you need to add more disk or remove any unnecessary files. With regard to this, the very important point they make is that whatever you do, don't touch anything inside the data directory of Postgres, especially the directory called pg_xlog in versions prior to 10, or pg_wal in versions 10 and higher. Once you've freed up space, some other things to keep in mind: with the disk running out of space, was any WAL corrupted? They say you can check the log for certain messages to see if that's potentially the case, and they cover ways to resolve WAL corruption. Ideally you restore from backup, but failing that, you can look at the pg_resetwal tool, though you need to be very careful with it. As they say here, quote, "I strongly encourage you to read the whole documentation page of it", and I definitely, definitely agree with that. You can also check for data corruption in your data using pg_checksums, and they have a quick example of what shows up in the logs once they've relieved the space issue and the database comes back online. So you can check out this blog post if you're interested.

[00:18:06] The next piece of content is "Performance improvements in GEOS". This is from Crunchydata.com. They're talking about GEOS, which is a library that PostGIS uses for, quote, "hard computational geometry functionality" such as intersections, unions, differences, and buffers of geometry. They're showing, from version 3.6 up through the main development line, all the different performance improvements they've made to different capabilities of the GEOS library. So this is a great example of how they're really driving forward the performance of this library in order to help PostGIS in general. So if you're interested in that, you can check out this post.

[00:18:47] The next piece of content is "How to get the status of a Cloud Native PostgreSQL cluster". This is from Enterprisedb.com, and they're talking about how you can use kubectl, the Kubernetes command-line tool, to get the status of your Postgres cluster. So if you're using Kubernetes with Postgres, maybe you want to check out this blog post.

[00:19:12] And the last piece of content is that the PostgreSQL Person of the Week is Julian Ryu. So if you're interested in learning more about Julian and his contributions to Postgres, definitely check out this blog post.

[00:19:24] That does it for this episode of Scaling Postgres. You can get links to all the content mentioned in the show notes. Be sure to head over to Scalingpostgres.com, where you can sign up to receive notifications of each episode, or you can subscribe via YouTube or iTunes. Thanks.
