
Shared Buffers, Compression Algorithm, Merge Command, Postgres Contributions | Scaling Postgres 239


In this episode of Scaling Postgres, we discuss how to set shared buffers, the different WAL compression algorithm options, how to use the merge command, and how to contribute to Postgres.


Transcript

All right. I hope you, your friends, family, and coworkers continue to do well. Our first piece of content is "Harnessing Shared Buffers- Part 2". This is from enterprisedb.com and they're following up on a post they made previously on this topic, where they were looking at what you should set your shared buffers to when you're using magnetic hard disks. Now, this one explores using more enterprise-class SSDs. So they're using essentially enterprise servers with 128GB of RAM and 6 SSDs configured in RAID-0. Now, when they did a pgbench benchmark, so this is an OLTP-based workload, you can see that across a range of shared buffers there was hardly any difference until you exceed about a quarter of the memory, at which point performance starts to decline. And they attribute that to having to scan through all the shared buffers on a periodic basis. They also show flame graphs that give an indication of how frequently data in the buffers was accessed multiple times.

And you can see with pgbench that something being accessed five times is relatively rare. It's usually accessed zero times or very few times. So it was super interesting that there were essentially no differences, whereas there was quite a difference with magnetic disks. But they wanted to test what happens if you use data more frequently, in other words, if you have a smaller working set that fits in memory. They used the TPROC-C OLTP benchmark to look at that. As you can see from this flame graph, the data was reused much more, so much of the working set was kept in memory. And with this result, you can see improved performance as you add more shared memory. But again, around that 25% mark for shared buffers, which has been the go-to recommendation, is where it essentially levels off; it's linear up to that point. So basically, when you have a working set that can fit primarily in memory, it seems like a quarter of the memory may be a good estimate.
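If you want a rough sense of how often your own buffers are being reused, the pg_buffercache extension exposes the usage count of every shared buffer. This is just a small sketch for poking around on your own system, not something from the EDB post, and it assumes the contrib extension is available:

```sql
-- Requires the pg_buffercache contrib extension (superuser or pg_monitor role).
CREATE EXTENSION IF NOT EXISTS pg_buffercache;

-- How many shared buffers sit at each usage count (NULL = buffer currently unused).
-- Many buffers at high usage counts suggests a working set that benefits from
-- larger shared_buffers; mostly 0 or NULL suggests the opposite.
SELECT usagecount, count(*) AS buffers
FROM pg_buffercache
GROUP BY usagecount
ORDER BY usagecount;
```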

But if you don't reuse the data that's sitting in memory, shared buffers start to matter less and performance is basically more dependent upon the disks. Now, they also tried the data warehouse benchmark, and in Part 1 of this post, they showed that smaller shared buffers led to better performance. And that's what they saw here as well. So in this case, a lower number is better. You can see that 128 megabytes of shared buffers gave the lowest result, and as they started increasing it, performance diminished. Now, it's only by a few percentage points as they ramp it up, so it doesn't make a significant difference, but it's probably just having that extra buffer memory to scan. It makes more sense just to keep your shared buffers low and leave the rest of the memory for caching. So, definitely a super interesting result with enterprise-class SSDs. Now, their ultimate conclusion is that an "OLAP warehouse-oriented server can operate normally with a default of 128MB" of shared buffers.

Also "Continue to follow the 25% rule for OLTP systems". They also mentioned exercising caution we're exceeding 32 megabytes of RAM and never exceeding 64 megabytes of RAM, although I don't know if it's just because this machine had 128. Therefore, a more appropriate recommendation would be to exercise caution when exceeding 25% of your RAM and never exceed 50% of the RAM. But looking back on the results, I think it depends on how many times you're going to access something in memory. The more frequently you're going to access recent data, it makes sense to have the shared buffers high enough to accommodate that up to a certain level.

Otherwise, keep it as low as possible because, again, the OLAP workload is presumably scanning a wide swath of data that doesn't fit in the shared buffers, so the performance is all based on the disk system. Similarly, the pgbench flame graph shows that if data is not accessed frequently, then shared buffers really don't make a difference. It's, again, all based on disk access. So to my mind, your shared buffers setting comes down to how often you are accessing repeated data in the database, and you can maybe gauge the size of your shared buffers to the amount of data you are accessing on a regular basis. But if you want to learn more about this, check out this post.
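For reference, here's a minimal sketch of what applying the 25% guideline would look like on the kind of 128GB machine they tested, assuming you manage settings with ALTER SYSTEM rather than editing postgresql.conf directly:

```sql
-- Roughly 25% of a 128GB server; scale this to your own RAM and working set.
ALTER SYSTEM SET shared_buffers = '32GB';

-- shared_buffers only takes effect after a server restart, not a reload
-- (e.g. pg_ctl restart, or restarting via your service manager).
```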

Also, this post and the previous one are covered by "5mins of Postgres E41: Tuning shared_buffers for OLTP and data warehouse workloads & is 25% of RAM a good choice?". This is from pganalyze.com and Lukas covers both parts of this blog post and gives his conclusions as well. So if you're interested in that, definitely check out this piece of content as well.

Next piece of content- "You Can Now Pick Your Favorite Compression Algorithm For Your WALs!". This is from enterprisedb.com. They're talking about the enhancement where you can define which compression algorithm you want to use for compressing WALs if you want to enable that for your system. Before, they used to offer a Postgres-based one called pglz, but you can now use lz4 and zstd, and they show some of the results here in terms of the size of the WALs generated. This is having it entirely turned off and you can see the huge savings in disk space using any one of these compression algorithms. Then they zoomed in to look at the different compression algorithms, and I believe this is a report on disk space. They say pglz provides slightly better compression than lz4, but zstd provides better compression than pglz.

And frankly, I was expecting more of a difference between these, so it's kind of surprising that it's relatively minimal. Then they looked at the transactions per second that were possible when enabling this, and again zstd was the winner by a narrow margin, with lz4 and pglz following behind. Again, I'm super surprised by this because for different clients I've actually started using zstd to compress backups or to compress a bunch of WAL files together, and it's been a game changer in terms of how fast it works and the compression ratio it achieves, because it is able to work in parallel. Now with this use case, I guess those advantages aren't coming through, but I'm still surprised there is not much of a difference between these. But if you want to learn more, definitely check out this blog post.
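As a rough sketch of how you'd turn this on yourself, assuming your Postgres 15 build was compiled with lz4 and zstd support, the setting involved is wal_compression, which compresses full-page images written to the WAL:

```sql
-- Available values in Postgres 15: off, pglz, lz4, zstd
-- (lz4 and zstd require a server built with --with-lz4 / --with-zstd).
ALTER SYSTEM SET wal_compression = 'zstd';

-- This setting can be changed without a restart.
SELECT pg_reload_conf();

-- Confirm the active value.
SHOW wal_compression;
```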

Next piece of content- "The MERGE command". This is from enterprisedb.com and they're talking about the new MERGE command in Postgres 15 and how it essentially lets you merge data from one table into another. So how you typically write it is 'MERGE INTO target USING source ON' a particular condition. So the example they have here is two different tables where some rows exist in one table but not the other and some of the rows are identical but they have a change. So you can do a MERGE INTO target using the source and then have a condition when not matched then say INSERT VALUES, or when matched then DELETE them, or when matched, then UPDATE. So basically you define how you want to handle the merges and you can even have multiple conditions within a single MERGE command to say in this case DELETE, in this case UPDATE, et cetera. So this is a very brief post about the MERGE command. If you want to learn more, definitely check this piece of content out.

Next piece of content is actually the next episode of Postgres FM talking about "Contributing to Postgres". So they discuss all sorts of different ways that you can contribute to Postgres, not only with code but in other ways. So you can definitely listen to the episode here or click here to view it on YouTube.

Now, related to this is a post by supabase.com, which is "What is PostgreSQL commitfest and how to contribute". So if you want to contribute code, you're going to want to get involved with a commitfest, and this explains what commitfests are. They're basically periods where the community gets together to review and commit different patches to Postgres, and the post describes how this process works. So if you're interested in doing that, definitely check out this article.
And the last article related to Postgres contributions looks at Postgres 15 statistics. This is from peter.eisentraut.org, and starting from version 12, he shows different metrics across versions of Postgres, including changes listed in the release notes, the number of commits, the number of contributors listed in the release notes, total files, lines of code, and the number of files changed. And he notes that development is continuing at a pretty steady pace, even as the size of the code base, the number of features, et cetera, slowly grow. But if you want to learn more about that, definitely check out this blog post.
Next piece of content- "Cross-partition Uniqueness Guarantee with Global Unique Index". This is from highgo.ca and this continues on with their feature exploration of adding globally unique indexes to partition tables. Because right now, you can only have a unique index within a particular partition. It doesn't apply to the parent, which would apply across partitions. So they're talking about what they are proposing in terms of getting this working in Postgres. So it's essentially a feature in progress. So if you want to check that out, definitely check out this blog post.

Next piece of content- "Postgres Databases and Schemas". This is from crunchydata.com. Now this is a simpler post, but it's giving you the 30,000 foot of Postgres down to a little bit more detail. So for example, the first thing you address with Postgres is what instance is it running on or what container is it running on. So there's some sort of operating system that Postgres runs on, essentially, and that operating system is running on some type of instance. And then when Postgres is running, it's actually running as a cluster. Now that's a little bit of a misnomer, but basically, a cluster means a running database infrastructure that you can run multiple databases on.

So one Postgres cluster can have one or many databases, and then on a particular cluster, you work with a particular database. And then he explains how you can connect specifically to that database: you don't connect to the cluster, you connect to a database on the cluster. Then he discusses what happens within the database and talks about how schemas exist. Basically, these are namespace areas within the database that hold the database objects you're going to be using. Most of the time, people use the public schema, but you can also create your own schemas, and a lot of times that means creating schemas for individual users, for example. So if you want this kind of overview of the Postgres landscape, you can definitely check out this blog post.
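As a quick sketch of that hierarchy, using made-up names: a database lives in the cluster, schemas live in the database, and objects live in schemas:

```sql
-- Create a database within the running cluster.
CREATE DATABASE app_db;

-- (In psql you would then connect to it with: \c app_db)

-- Create a schema, i.e. a namespace inside the database, and put a table in it.
CREATE SCHEMA reporting;
CREATE TABLE reporting.daily_totals (day date PRIMARY KEY, total numeric);

-- Control how unqualified names are resolved for the current session.
SET search_path TO reporting, public;
```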

Next piece of content- "Easier Upgrades and Image Management for Postgres and Kubernetes this is from crunchydata.com. They're talking about some enhancements that have been done to their Crunchy Postgres operator for running Postgres and Kubernetes, and some enhancements that they've added to make upgrading easier. So that feature was already there, but they've added a new pause feature as well as easier ways to increment the versions. So if you're interested in that, you can check out this blog post.

Next piece of content- "Postgres ML is 8-40x faster than Python HTTP microservices". This is from postgresml.org. Now, I'm not familiar with machine learning, but apparently, they set up a PostgresML installation and a Python installation and did some performance tests and the results were exactly what they said, 8-40x faster. Now, this is coming from the postgresml.org website, so there may be a bit of bias in that, but if you're interested in machine learning, maybe you'll want to explore this particular extension.

Next piece of content- "Moving Objects and Geofencing with Postgres and PostGIS". This is from crunchydata.com, and they're talking about how to set this up as a Web service using a few of the tools that they offer pg_eventserv and pg_featureserv, along with Postgres and PostGIS and the OpenLayers Map API. Some of the features that they were looking to build with this example and they do have a 'Try it out' button here. It does lead you to a website where you can experiment with it, but it offers a real-time view of the state of different objects that are on the map. Live notifications when objects enter and leave a set of geofences and then querying the history of the system to see where objects have been and to summarize their state. For example, 'Truck 513 spent 50% of its time in the yard'. So if you're interested in building these types of applications or features, definitely encourage you to check out this blog post.

Next piece of content- "ROUTING WITH LINES THROUGH POLYGONS". This is from rustprooflabs.com and their post describes this. I'm not that familiar with geographical information systems, but if you're looking to achieve this, definitely check out this blog post.

Next piece of content- "POSTGIS: UPGRADE GEOS WITH UBUNTU IN 3 STEPS". This is from cybertec-postgresql.com and PostGIS apparently relies upon GEOS and GDAL libraries to install PostGIS and they go through the process of upgrading these in Ubuntu. You can check that out if you're interested.

Next piece of content- "News" Postgres 15 available in Azure Cosmos DB for PostgreSQL". This is from citusdata.com. I was not familiar with Cosmos DB, but apparently Postgres 15 runs on it now and they talk about the announcement down here.

This next post, on devblogs.microsoft.com, talks about "Distributed PostgreSQL comes to Azure Cosmos DB", so apparently they're merging the Citus capabilities into Cosmos DB. The link here says that Cosmos DB is basically a fully managed, serverless, distributed database on Microsoft's Azure platform, and it looks like they have different variants of it. So they've had a MongoDB variant and an Apache Cassandra variant, but now they're offering a relational one, PostgreSQL, leveraging the Citus extension to do it.

The next piece of content, the PostgreSQL person of the week is Étienne Bersac. If you're interested in learning more about Étienne and his contributions to Postgres, definitely check out this blog post.

The last piece of content: we did have another episode of The Rubber Duck Dev Show this past Wednesday evening. This one was a holiday party discussing "Coding Horror Stories". So we had a panel of developers talking about the worst coding horror stories they've had. So if you're interested in that type of content, we welcome you to check out our show.
