
Backup Compression, Postgres IO, Parquet Files, pg_stat_monitor | Scaling Postgres 214


In this episode of Scaling Postgres, we discuss parallel server-side backup compression, IO in Postgres, parquet files and the new pg_stat_monitor extension.


Transcript

All right. I hope you, your friends, family, and co-workers continue to do well. Our first piece of content is "Parallel Server-Side Backup Compression". This is from rhaas.blogspot.com and he's talking about a new feature that should be coming in Postgres 15 that does server-side compression. If you use pg_basebackup to do a backup of your database, that is essentially a client application; the server side really doesn't do much other than accept connections from pg_basebackup. The pg_basebackup client essentially backs up and stores the data the way you want to. Maybe you want it compressed, maybe you want it put into a plain format that replicates the existing file structure. So all of that configuration happens from the client's perspective; not much really happens on the server. However, this new feature enables you to do compression on the server side as it sends the data to the client, as well as doing it in parallel.

And that's what this post talks about. Now, in my use case for backing up databases, typically I run pg_basebackup on the actual server that is running the database, so essentially it's just a local connection. But I've never used the compression option in pg_basebackup because it's just too slow. What we do instead is pipe it into pigz, which is a parallel gzip. So I've been using that solution for years to back up databases with pg_basebackup. But this new feature is pretty interesting. He's using a particular data set from a UK land registry and the database is 3.8GB, so not too big. He tested two different backup formats using this server-side compression capability, the tar format and the plain format, and he tested a variety of compression methods: none, so no compression; the default gzip method; lz4; and zstd, which is actually something that I've learned Facebook developed a number of years ago.

I think maybe 2016 was when it first started. And zstd actually allows you to do things in parallel, much like the parallel gzip that pigz offers, so this might be able to do the whole thing in one go. So, of course, what's most interesting to me is the parallel compression capability. Looking at zstd with eight workers, the server takes on some of the compression responsibilities even if you're backing up from a separate machine, which I don't typically do. But if you do that, there will be load on the server because it's doing compression before it sends the data over to you. Now, looking at this, the compression ratio looks like it's about a third: it went from 3.8GB to 1.3GB once compressed. So that is a very good result. It's better than lz4, it's even better than gzip. And I think you can even adjust the compression level with zstd.

So I'm not quite sure what parameters he set for that. But in terms of the timing, you'll see that the zstd compression is almost twice as fast as no compression at all. It is 20 times faster than gzip, and it's three times faster than lz4. So this is a huge improvement from my perspective. Now, what you'll notice is that the plain format takes essentially twice as long as the tar format. The reason is that when you're using server-side compression, the server compresses everything and sends it to pg_basebackup. With the tar format, the client simply writes that data to disk. But with the plain format, it actually has to decompress everything and rewrite it out to match the file layout on the server. So it has to do a decompression step, which is why it's twice as slow as just using a tar file that accepts the data from the network and writes it to disk. But this is a fascinating feature, and it may alter how I'm backing up databases in the future. If you want to learn more, definitely check out this blog post.
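
Just to make the two approaches concrete, here is roughly what they look like on the command line. This is only a sketch: the paths and worker counts are placeholders, and the exact spelling of the Postgres 15 compression option should be verified against the pg_basebackup documentation for your version.

```bash
# Older approach: stream a tar-format base backup to stdout and compress it
# client-side in parallel with pigz (works when the cluster has a single tablespace).
pg_basebackup -D - -Ft -Xfetch | pigz -p 8 > /backups/base.tar.gz

# Postgres 15 approach: have the server compress with zstd using multiple
# workers before sending the data to the client.
pg_basebackup -D /backups/pg15 -Ft --compress=server-zstd:workers=8
```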

The next piece of content- "IO and PostgreSQL: Past, Present, Future". This is from the CMU or the Carnegie Mellon Database Group YouTube channel, and they're talking about IO Postgres and its current state. And that in and of itself is very interesting. But the second half of the talk is essentially the future where we're going. With the advent of NVMe drives and Asynchronous IO becoming more prevalent, given these new standards and designs for the technologies that have just become available in the last three or four years, asynchronous IO has become almost essential for getting great performance out of these drives.

So Andres is going through the process of making changes to Postgres to make asynchronous IO a reality. And he did show one slide where checkpointing happened about ten times faster than normal. Not everything will be that performant, and there are still issues they're dealing with, but this looks like a pretty impressive improvement once it's fully implemented. Now, this is probably not going to be present in version 15, maybe not even version 16; we'll have to see. It's going to take a fair amount of time to make the changes to support this type of asynchronous IO, but if you want to learn more, I definitely encourage you to check out this YouTube video.

The next piece of content- "Parquet and Postgres in the Data Lake". This is from crunchydata.com, and this blog post is talking about a data format called Parquet, which I actually wasn't familiar with, but they start the post referring to Retool's conversion experience to an up-to-date version of Postgres. We actually covered this in a previous episode of Scaling Postgres, where they were migrating a four-terabyte database. And if you remember that, over half of the database, I believe, was essentially log-related tables of data that really don't change that much. And he essentially represented it here in this graph where you have a lot of terabytes that are relatively static, but then you have a smaller amount of, say, gigabyte-level dynamic data that probably is changing quickly. Now, what changes here is probably new inserts into append-only tables.

But in terms of a conversion, moving a lot of data is much harder than moving a smaller amount of data. So he theorized that if you were to partition this larger data set into smaller partitioned tables, it should make it a little bit easier to work with. Now, these are essentially still static, because an append-only table is unchanging and it's only the most recent partition that's being written to. But he theorized that this should make a conversion much easier if you ever need to do one. Now, one way he proposed doing this is to place this data in a data lake: basically storing all of these tables in another data source, using these Parquet files. Now, what's interesting about these files is that there's actually a Foreign Data Wrapper for them. So you can go into Postgres, run a query, and it will actually go and read these Parquet files and return data from them.

And apparently, these files can also be consumed by R, Python, Go, and other types of applications. Now, what makes these Parquet files a little bit different is that they are language-independent and column-oriented. So as opposed to being row-oriented, they are column-oriented. They are typed and binary, which makes them more efficient, and they're also compressed. It's very easy to compress something in a column-oriented format because the data types within a column are all similar. He actually put together an example of doing this: taking some data from a CSV file, loading it into Postgres, and then converting it into Parquet files. He then queried the Parquet files, did some comparisons, and the timings weren't significantly worse.
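
To give a rough idea of what querying Parquet through a Foreign Data Wrapper looks like, here is a minimal sketch. The table definition and file path are made up, and the server and option names follow the parquet_fdw project as I understand it, so treat them as assumptions rather than the exact code from his post.

```sql
-- Hypothetical sketch: expose a Parquet file as a foreign table via parquet_fdw.
CREATE EXTENSION parquet_fdw;

CREATE SERVER parquet_srv FOREIGN DATA WRAPPER parquet_fdw;

CREATE FOREIGN TABLE audit_log_2021 (
    id         bigint,
    created_at timestamptz,
    event      text
) SERVER parquet_srv
  OPTIONS (filename '/data/lake/audit_log_2021.parquet');  -- assumed option name

-- Query it like a regular table; the FDW reads the Parquet file underneath.
SELECT count(*) FROM audit_log_2021 WHERE created_at >= '2021-06-01';
```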

Now, his test was with a smaller data set. So if you try to do this with larger data, say at the terabyte level, does it work as well? I don't know. There are also a lot of caveats because the Parquet Foreign Data Wrapper is relatively new, so how well this would work ultimately is an open question. But beyond this proposal, he mentioned you could instead store the data in a separate database. So going back to the original scenario where you're trying to migrate a large database: could you move all of this static data into a separate Postgres database, not even using Parquet files, and then use the Postgres Foreign Data Wrapper to query it if you need to? This was a pretty interesting technique for handling historical data, so if you're interested in learning more, you can check out this blog post.
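
For comparison, the separate-database alternative he mentions would just use the standard postgres_fdw extension. A minimal sketch, with made-up connection details and table names:

```sql
-- Sketch: query static, log-style tables that live in a separate Postgres database.
CREATE EXTENSION postgres_fdw;

CREATE SERVER archive_db FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'archive.internal', dbname 'archive', port '5432');

CREATE USER MAPPING FOR CURRENT_USER SERVER archive_db
    OPTIONS (user 'app', password 'secret');

-- Bring the remote table definitions in under a local schema.
CREATE SCHEMA archive;
IMPORT FOREIGN SCHEMA public LIMIT TO (audit_log_2020, audit_log_2021)
    FROM SERVER archive_db INTO archive;

SELECT count(*) FROM archive.audit_log_2020;
```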

The next piece of content- "Understand Your PostgreSQL Workloads Better with pg_stat_monitor". This is from percona.com, and they're talking about the general availability of the pg_stat_monitor, which is an extension that's been developed by percona.com. Now, this is meant to be an enhancement to pg_stat_statements. So pg_stat_statements keeps a record, if you enable it, of all the statements happening to your server. Now, it's a cumulative record of all those statements happening on the server, and they can be reset. But therein lies the problem is that it's cumulative, so it's hard to trace back in a particular frame of time if something happened to queries or to correlate it to, say, a deployment that went out. And what pg_stat_monitor does is it actually places these metrics in buckets. So as opposed to having something cumulative, it is bucketed by a particular time frame that you can define.

So if a deploy results in poorer database performance, you can more easily target when that happened and what change caused it. In addition, it adds a number of columns that aren't present in pg_stat_statements to give more insight. One of the main ones is the client_ip, so you know which server is sending which traffic. There's also the bucket and the bucket start time, things related to the query itself like the plan ID, the relation, and the application name, as well as cpu_user_time and cpu_sys_time. These are all meant to let you keep better track of what's going on with the statements running against Postgres. And then this is designed to work with another solution that stores this information long-term. This extension defaults to a maximum of ten buckets, each containing 60 seconds of data.

So that's quite short: essentially, it retains ten minutes of data. But the intention is to send the data to a longer-term storage solution elsewhere, so you're not storing all of the data being generated on the Postgres database itself. And they do have their own monitoring solution that you can tie this into, of course. But this extension is something that can help you bucket your statements, as opposed to having one cumulative set of metrics in pg_stat_statements. Now, there are some comments on the post where some people ran into issues using it, so you may want to review those before you start implementing it.
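
As a rough sketch of what using it looks like, here is the kind of query you might run. The column and setting names are taken from Percona's documentation as I recall it, so treat them as assumptions and check the version you install.

```sql
-- Assumed defaults mentioned above: ten buckets of 60 seconds each,
-- controlled by settings along the lines of:
--   pg_stat_monitor.pgsm_max_buckets = 10
--   pg_stat_monitor.pgsm_bucket_time = 60

CREATE EXTENSION pg_stat_monitor;

-- Busiest statements per time bucket, including which client sent them.
SELECT bucket,
       bucket_start_time,
       client_ip,
       application_name,
       query,
       calls
FROM pg_stat_monitor
ORDER BY bucket DESC, calls DESC
LIMIT 10;
```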

Now, related to it, timescale.com posted an article, "Point-In-Time PostgreSQL Database and Query Monitoring With pg_stat_statements". Essentially, they're doing the same thing: they want to bucket pg_stat_statements, and they're using their TimescaleDB extension to do it, creating hypertables to store the data. So it's essentially partitioned data, and as opposed to calling them buckets, they're calling them snapshots. You're taking periodic snapshots of the pg_stat_statements view and storing them in these hypertables, or partitioning schemes. Now, they use the Timescale extension to go through this process, but you could also just use partitioned tables in stock Postgres; I don't think you need the Timescale extension to implement something similar. They do have all the code here for how you could do this yourself, as opposed to using the extension mentioned in the previous article. But if you want to learn more, definitely check out this blog post.
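
If you did want to roll something similar yourself in stock Postgres, without Timescale or pg_stat_monitor, a minimal sketch might look like this. The table and partition names are made up, and you would schedule the insert with cron, pg_cron, or whatever job runner you already use.

```sql
-- Sketch: periodically snapshot pg_stat_statements into a partitioned table.
CREATE TABLE statement_snapshots (
    captured_at     timestamptz NOT NULL DEFAULT now(),
    queryid         bigint,
    query           text,
    calls           bigint,
    total_exec_time double precision
) PARTITION BY RANGE (captured_at);

CREATE TABLE statement_snapshots_2022_06
    PARTITION OF statement_snapshots
    FOR VALUES FROM ('2022-06-01') TO ('2022-07-01');

-- Run this on a schedule to take a "snapshot" of the cumulative view.
INSERT INTO statement_snapshots (queryid, query, calls, total_exec_time)
SELECT queryid, query, calls, total_exec_time
FROM pg_stat_statements;
```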

Next piece of content- "POSTGRESQL 15: USING MERGE IN SQL". This is from cybertec-postgresql.com and they're talking about the new MERGE statement coming in Postgres 15. They have an example here where they created a table and inserted some data into it using GENERATE SERIES and they have IDs one through ten with values ten to 100, so ten times what the ID value is, then they use this MERGE command. So they MERGE a particular query that they did. So the query or the set of data that they're generating, they're doing eight rows because they're doing 16 rows but skip every other one. So essentially, just getting the odd numbers. So eight rows, you can see it says MERGE 8 down here they're using the same ID and then a random number times 1,000. Now they're matching it against the table using this ON clause.

So they match on the main ID, and when the data from the query matches what's in the table, they UPDATE the row and set the value to that random number. When it's not matched, they INSERT that value into the table. As you can see, it updates every odd number up to ten, but past ten there's nothing left to match, so those remaining values just get inserted instead. Now, in addition to doing INSERTs and UPDATEs, you can also do DELETEs. They show that example here with the same query interacting with the same table: when it's matched, DELETE it. So essentially, all the odd rows are now gone. But you'll notice 14 is a new value; it still reports MERGE 8, and since there wasn't a matching row for that value, it went ahead and inserted it. This is a good blog post that covers the new MERGE command coming in Postgres 15.
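
Here's a small self-contained sketch along the same lines as their example, with my own table and column names rather than the exact ones from the article:

```sql
-- Ten rows with value = id * 10.
CREATE TABLE t_demo (id int PRIMARY KEY, value numeric);
INSERT INTO t_demo SELECT g.id, g.id * 10 FROM generate_series(1, 10) AS g(id);

-- Merge in eight rows of odd IDs (1, 3, ..., 15) with random values.
MERGE INTO t_demo AS t
USING (
    SELECT g.id, random() * 1000 AS value
    FROM generate_series(1, 15, 2) AS g(id)
) AS s
ON t.id = s.id
WHEN MATCHED THEN
    UPDATE SET value = s.value                   -- odd IDs up to 9 get updated
WHEN NOT MATCHED THEN
    INSERT (id, value) VALUES (s.id, s.value);   -- 11, 13, 15 get inserted

-- The DELETE variant from the post swaps the matched action:
--   WHEN MATCHED THEN DELETE
```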

Next piece of content- "It's beer o'clock for PostgreSQL". This is from ads' corner on andreas.scherbaum.la and he's talking about the function NOW and when you use it, it's now going to say it's beer o'clock as opposed to giving you the time. So how do you implement something like this? He says that "Functions in PostgreSQL can also be overloaded", but you can also have functions with the same name in different schemas. So he's going to use this and create a function in a schema and then alter the search_path. So what schema it's going to consult first when you're going to be executing is the command to determine which function to run. So by default, default functions like NOW are in the pg_catalog, but by default, pg_catalog is the first in the search_path.

So what he needed to do is put the public schema ahead of the pg_catalog schema in the search_path. He did that for the session with SET search_path TO public, pg_catalog. At this point, he can create his function. He created a NOW function, so it has the same name as the existing NOW, but he added a parameter so you can change your beverage, and it returns "It's", then whatever beverage you select, then "o'clock". By default, it's going to say "It's beer o'clock". So now, when you do SELECT NOW(), it'll say it's beer o'clock. Or you could do SELECT NOW('whiskey') and it's whiskey o'clock. So this is a quick little post about how you can override functions in Postgres, and it gives you insight into the search_path as well as creating functions in different schemas. But if you want to learn more, definitely check out this blog post.
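
Here's a minimal sketch of the trick. I've split it into a zero-argument function plus a one-argument variant rather than a single function with a default parameter, but the idea of putting public ahead of pg_catalog so your function shadows the built-in one is the same as in his post.

```sql
-- Make unqualified names resolve in public before pg_catalog.
SET search_path TO public, pg_catalog;

-- Shadows the built-in pg_catalog.now() for this session.
CREATE FUNCTION public.now() RETURNS text
LANGUAGE sql
AS $$ SELECT 'It''s beer o''clock' $$;

-- Variant that lets you pick the beverage.
CREATE FUNCTION public.now(beverage text) RETURNS text
LANGUAGE sql
AS $$ SELECT format('It''s %s o''clock', beverage) $$;

SELECT now();            -- It's beer o'clock
SELECT now('whiskey');   -- It's whiskey o'clock
```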

The next piece of content- "Atomic Commit with 2 Phase Commit in FDW Distributed Setup". This is from highgo.ca, and they're talking about the scenario where you have a coordinator database and you want to distribute your database across multiple nodes. So you're going to have part of the data residing here on a distributed node, one part of the data here and part of the data here. And when you want to do an update that's going to impact multiple distributor nodes, how do you coordinate that? Now, if you do it very simplistically, you send an update from the coordinator node to the distributed nodes. So you just send an update and then you send the commits. It's possible that one potentially does not commit because of a network issue or some other issue with a host. And now you have data committed here and not committed here. And that could cause a big problem with the consistency of the database.

Now, one way around that is to use a two-phase commit. Basically, you do the updates on all the nodes and then you send a PREPARE TRANSACTION. If one node doesn't prepare, it's easy: you can just roll it back, no harm, no fuss. The issue comes if all of them did prepare, so all three distributed nodes got the update and did the PREPARE TRANSACTION, but at the point that you commit, one of the prepared nodes does not commit. And again, this leaves you in an inconsistent state. How they're planning to handle this is actually sending a commit warning message and relying on a global transaction manager to keep track of these issues, because once that node comes back online, it should be possible to auto-commit: the node has the information needed to commit the transaction, it just wasn't done for some reason. So there is a way to handle this scenario. If you want to learn more about this, definitely check out this blog post.
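
The Postgres primitives underneath this are the two-phase commit commands themselves. As a bare-bones sketch of what a coordinator would run against each distributed node, with a made-up table and transaction ID and ignoring the global transaction manager piece:

```sql
-- Requires max_prepared_transactions > 0 on each node.

-- Do the work inside a normal transaction on the node.
BEGIN;
UPDATE accounts SET balance = balance - 100 WHERE id = 42;

-- Phase 1: prepare. Once this succeeds, the node guarantees it can commit
-- this transaction later, even across a crash or restart.
PREPARE TRANSACTION 'gid_20220605_0001';

-- Phase 2: after every node has prepared successfully, commit everywhere.
COMMIT PREPARED 'gid_20220605_0001';

-- If any node failed to prepare, roll back the ones that did:
--   ROLLBACK PREPARED 'gid_20220605_0001';

-- Prepared-but-unresolved transactions show up here, which is what a
-- transaction manager would check when a node comes back online.
SELECT * FROM pg_prepared_xacts;
```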

The next piece of content. The PostgreSQL person of the week is Abbas Butt. If you want to learn more about Abbas and his contributions to Postgres, definitely check out this blog post.

The last piece of content. We did have another episode of The Rubber Duck Dev Show this past Wednesday evening. This one was on "The JIRA Incident", where they were down for many days for a subset of their customers. So we discussed ways that this could have been perhaps made better and potentially avoided. So if you're interested in that type of content, definitely welcome you to check out our show.
