
Postgres 15 RC2, Shared Buffers, Secure Connections, PGSQL Phriday | Scaling Postgres 236


In this episode of Scaling Postgres, we discuss the Postgres 15 RC2, optimizing shared buffers, how to secure your database connections and blog posts from PGSQL Phriday.

Transcript

All right. I hope you, your friends, family, and coworkers continue to do well. We have a lot of content this week. I've been doing this for four years, and I don't think I've ever had this many articles. Now, normally I pick only the best, but there were just a lot of good articles this week, so I'm going to have to move pretty quickly.

So our first piece of content is "PostgreSQL 15 RC 2 Released!". This is from postgresql.org. There are just a few changes; unfortunately, it looks like a particular feature from RC 1 was reverted, but otherwise it's just a few fixes. They're still projected to release it on October 13th, 2022, which is about a week from now, but it may be pushed back to October 20th, 2022.

The next piece of content- "Harnessing Shared Buffers (and Reaping the Performance Benefits) - Part 1". This is from enterprisedb.com. Normally, when you hear about how to configure shared buffers, the advice is to make it a quarter of the memory and revisit it if you need to. Frequently, many people don't revisit it. Well, they actually decided to do empirical tests to see what is potentially the best setting for shared buffers today, given the current version of Postgres. Now, in this part one, they actually went back and addressed more antiquated hardware, for example, something using magnetic hard disks. They used pgbench, which primarily tests OLTP. They did another TPC evaluation using HammerDB, and then they did essentially an OLAP test as well. Now, what they found with pgbench is that the more shared buffers they gave it, up to half the memory, the more the performance continued to improve. The performance was about double going from a quarter of the memory up to half of the memory.

Similarly with HammerDB, when they were doing the OLTP test with TPC-C, they found a pretty significant increase in performance as shared buffers went up to half the memory. This looks like a fourfold improvement. And from their analysis, it looked to come down to how frequently data was in memory. So for data that is kept in memory more frequently, you're going to see better performance the larger your shared buffers. But then they tested TPC-H, which is a more OLAP-focused test, and they actually found the reverse. So the smaller the shared buffers, the better the performance was. And what they found with this is that the test rarely hit the same rows, so it was gathering data from all around the disk.

It wasn't always in memory and available. So basically, the lower your setting for shared buffers, the better the performance, pretty much. So the conclusion at this point is that when you're using magnetic disks, it depends on whether you have an active working set in memory. If you can fit it in shared buffers, setting them as high as half of the memory seems to give you the best performance. Or if you have more OLAP concerns where you have so much data that it's not going to fit in memory, then maybe lower shared buffers will give you better performance in that case. But they're also going to be doing future parts to this where they're looking at more RAM and NVMe devices or maybe SATA drives. So I'm definitely going to be looking forward to the next parts of these blog posts. But if you want to learn more, definitely check out this one.
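
Just to give a sense of what adjusting that setting looks like, here's a rough sketch; the 8GB figure is a placeholder for roughly half of a hypothetical 16GB server, not a recommendation from the article:

```sql
-- Rough sketch only: shared_buffers requires a server restart to take effect,
-- and the best value depends on your workload, as these benchmarks show.
ALTER SYSTEM SET shared_buffers = '8GB';
-- ...restart Postgres, then confirm:
SHOW shared_buffers;
```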

Next piece of content- "The Majority of PostgreSQL Servers on the Internet are Insecure". This is from innerjoin.bit.io. There is one graph that essentially tells it all. So they found from a particular service 820,000 PostgreSQL servers listening on the internet. They found that over 500,000 of them had no support for connecting via SSL, which seems like semi-insanity to me. Number one, it's a best practice not to put PostgreSQL so it's accessible on the internet. Number two, if you're doing it, you should enable SSL and also enforce certificate authentication or at least get dedicated certificates for it as well. They also found that SSL was supported on the server, but it wasn't required at all for 200,000 of those servers. And then they found another set of around 100,000 where SSL is supported, but they weren't quite sure if it was required or not.

It might have been, but there was only a very paltry number where they knew SSL was required. They do describe their testing methodology in the appendix, and they go into all the data that they found. Now, my understanding is that bit.io is a hosted service, so they offer hosted Postgres. But of course, what do you do to correct this? Well, number one, don't put Postgres out on the internet. Or if you do, you need to force or require SSL at the server and ideally at the client, as well as using dedicated certificates. Now, to require it on the server, you basically need to go to your pg_hba.conf file, like they talk about here, and only use hostssl.

People frequently just use host, which is not going to enforce SSL. You want to use hostssl, and then there are some settings in your postgresql.conf file, including turning on SSL and making dedicated certificates; there's more information about it here. They even include a link to mozilla.org that tells you how to do the configuration for Postgres to set up SSL securely. Now, in terms of the clients, they're saying to use verify-full, because if you have those certificates set up, then you can connect with verify-full and you'll hopefully avoid man-in-the-middle attacks by doing that. But this post goes into a lot more depth. If you want to learn more, definitely check it out.
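
To give a rough idea of the server side, a minimal sketch might look something like this; the file paths are placeholders, and you'd still need hostssl lines in pg_hba.conf plus certificates actually issued for your server:

```sql
-- Minimal sketch: turn on SSL and point Postgres at a certificate and key.
-- The ssl settings can be applied with a reload.
ALTER SYSTEM SET ssl = on;
ALTER SYSTEM SET ssl_cert_file = '/etc/ssl/certs/server.crt';
ALTER SYSTEM SET ssl_key_file = '/etc/ssl/private/server.key';
SELECT pg_reload_conf();
-- Clients would then connect with sslmode=verify-full to check the certificate.
```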

Next piece of content- "PGSQL Phriday #001- Two truths and a lie about PostgreSQL". This is from softwareandbooze.com and this is the start of a monthly community blog post initiative called PGSQL Phriday. Now they did initially call it PSQL Phriday, which confused people. Is this only about the command line client psql? No, this is generally about PostgreSQL. So it's PGSQL and this is the kickoff for it. And the next set of five blog posts were actually written on this topic.

The first article is from andreas.scherbaum.la, and he's talking about setting up Postgres initially and how he uses Ansible, where he's developed playbooks to set up Postgres for himself. These are what he considers the three most important areas to set up; one of them is a "lie" that he says you should do, but then he corrects the lie later. The first is changing the most important configuration parameters and then restarting the database. The second is to use pg_dump for backups. And the third is to create necessary accounts. So definitely check out this blog post if you want to learn more about those.
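
For a flavor of what a couple of those steps can look like in practice, here's a small sketch; the role names and parameter value are made up for illustration, not taken from his playbooks:

```sql
-- Change an important configuration parameter (takes effect after a restart):
ALTER SYSTEM SET max_connections = 200;

-- Create necessary accounts: an application login role and a read-only role.
CREATE ROLE app_user LOGIN PASSWORD 'change-me';
CREATE ROLE reporting NOLOGIN;
GRANT pg_read_all_data TO reporting;  -- built-in role available in Postgres 14+
```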

The second post is from mydbanobook.org, and she did it a little bit differently, where her title is "2 it depends and one absolutely always do this!" and she covers psql. She talks about the psql -E option that shows you the SQL that gets generated when you're doing backslash commands in psql. The second is combining different options in psql to achieve what you want. So they're talking about a scenario where someone wanted to export to CSV and how you can make those configuration changes to do it in psql. Although I tend to use the COPY command, I suppose there are areas where psql could be used. And then she talks about using the -W option to force a password prompt. Now, I'm not going to mention which of these is the "lie", but you can check these posts out to determine more about that.
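
For comparison, this is the kind of COPY statement I'm referring to; the table name and path are just placeholders:

```sql
-- Server-side CSV export with COPY (requires appropriate privileges);
-- in psql, \copy writes the file on the client side instead.
COPY orders TO '/tmp/orders.csv' WITH (FORMAT csv, HEADER);
```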

The next post is from andyatkinson.com, and this post is about "Query Stats, Log Tags, and N+1s". So basically, query stats is using pg_stat_statements to make an assessment of your server. Log tags annotate queries sent to the database: you have your application say, for example, what class the query came from, so you can identify which queries are being run by which objects in the case of object-oriented programming, or by which modules. And then avoid N+1 queries, where your application needs to pull back 1,000 rows and it goes and asks for them a thousand times as opposed to doing one query.
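
As a quick illustration of the query stats idea, something like this against pg_stat_statements shows where the time is going; the column names are the Postgres 13+ ones, and the extension has to be installed and preloaded:

```sql
-- Top queries by total execution time from pg_stat_statements
SELECT calls,
       round(total_exec_time::numeric, 1) AS total_ms,
       round(mean_exec_time::numeric, 2)  AS mean_ms,
       left(query, 80)                    AS query
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;
```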

The next post on this topic is from scarydba.com, and he's talking about backups and restores using pg_dump and pg_dumpall, and how to use pg_dump for individual schemas. So you can check out that blog post.
And the final post is from sqlasylum.wordpress.com, and his post is on "Truth and Lies!", where he's talking all about logical replication and the different configuration parameters you should set to get logical replication going for your database. So definitely check these blog posts out if you want to learn more about Postgres.
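
For context, the core moving parts of logical replication look roughly like this; the names and connection string are made up, and his post covers the configuration parameters in far more detail:

```sql
-- On the publisher: wal_level must be 'logical' (requires a restart).
ALTER SYSTEM SET wal_level = 'logical';
CREATE PUBLICATION orders_pub FOR TABLE orders;

-- On the subscriber:
CREATE SUBSCRIPTION orders_sub
  CONNECTION 'host=publisher.example.com dbname=app user=replicator'
  PUBLICATION orders_pub;
```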

Next piece of content- "Future Proofing SQL with Carefully Placed Errors". This is from hakibenita.com and he's talking about trying to avoid silent errors and he's focusing on a conditional where the ELSE hasn't been defined. So if you get a value you don't expect in there and you don't have an ELSE, it will kind of silently fail. So for example, he's not using an IF ELSE in this case where in SQL's case he's using a case statement and basically when it's this value, do this, when it's this value, do this. But he doesn't have an ELSE. Now you can have an ELSE and have something take that path, but what he advocates is essentially throwing an exception if a particular case does not exist. And how do you do that in SQL? And his proposal to do that is actually creating a function. So he creates a dedicated function he calls assert_never and passes in a particular element here. 
What this function does is just raise an exception. So in your ELSE, you call that function, passing it the value that's being assessed, and he casts the result to an int. Therefore, if it ever sees a value that does not match one of your WHEN branches, it'll throw this exception and alert you: hey, this is something that hasn't been handled. So this is a pretty interesting technique. And then he also follows up by talking about how you could do this if you don't want to use functions in your database. He tried a division by zero, and there are some cases where that is still going to fail, so it's not a good solution. But he then tested out casting something to an int, and that seemed to work, although you do run into issues when you're trying to cast from non-text types. But if you want to learn more about this technique, definitely check out this blog post.
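
Here's a small sketch of the idea; the table, column, and values are made up, and the details may differ slightly from the post's version:

```sql
-- A function whose only job is to blow up on unhandled values.
CREATE FUNCTION assert_never(v anyelement) RETURNS int
LANGUAGE plpgsql AS $$
BEGIN
    RAISE EXCEPTION 'Unhandled value: %', v;
END;
$$;

-- Use it in the ELSE branch so new, unmapped values fail loudly.
SELECT CASE order_status
         WHEN 'pending' THEN 1
         WHEN 'shipped' THEN 2
         ELSE assert_never(order_status)
       END AS status_code
FROM orders;
```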

Next piece of content- "VACUUM DOES NOT SHRINK MY POSTGRESQL TABLE". This is from cybertec-postgresql.com and he's talking about how when you insert and delete data, it doesn't always free up disk space where people deleted a lot of data and they expect their disk space to be reclaimed. And it doesn't. Even running vacuum doesn't necessarily do it well. This post kind of explains why. So he creates a test table inserts as a part of two transactions, three rows each, and he shows you all the hidden rows that exist that help define the visibility of these rows to other clients who are using the Postgres system. And then you can see this first set of three. This is part of one transaction and the second part of three is a part of another transaction.

The ctid is basically the physical location where the system can find that data in the heap, essentially. Then he does an update of the last row, and you can see that its ctid goes from (0,6) to (0,7). Then he updates it a few more times, and now you can see that the ctid is (0,9). So essentially, as he says here, quote, "Everything between 5 and 9 is full of dead rows which have to be removed". So he runs a VACUUM on that. And those rows do get vacuumed up, but it's not going to reclaim space. Now, Postgres knows the space is available, so it can reuse it if it needs to, but it's not going to relinquish it back to the operating system, because there's no way within a particular database file to free up a middle part of that file back to the operating system.

However, you can do it at the end of the file, which he talks about here. So he covers this exception to the vacuum rule, basically showing the table size. If you actually delete all the rows from that table and then do a VACUUM, you can see that the relation size does shrink. Now, you could also just remove half the rows; as long as they're at the end of the file, which is hard to know exactly, you should get your file shrinking as a result of that as well. But of course, if you really want to shrink the table and reclaim all available space in the file, there's VACUUM FULL. And he also mentions another solution called pg_squeeze. So if you want to learn more about vacuum and how to free up space, definitely check out this blog post.
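
If you want to see this for yourself, a tiny throwaway example along these lines (not the exact one from the post) shows the behavior:

```sql
-- Throwaway demo table
CREATE TABLE vacuum_demo (id int);
INSERT INTO vacuum_demo SELECT generate_series(1, 100000);
SELECT pg_size_pretty(pg_relation_size('vacuum_demo'));  -- a few megabytes

DELETE FROM vacuum_demo;  -- rows are now dead, but the file keeps its size
VACUUM vacuum_demo;       -- trailing empty pages can be truncated
SELECT pg_size_pretty(pg_relation_size('vacuum_demo'));  -- back to (near) zero

-- VACUUM FULL rewrites the table entirely, reclaiming all space:
VACUUM FULL vacuum_demo;
```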

Next piece of content- "Tuning a Postgres BRIN index". This is from j-carson.github.io, and they're talking about ways to use a Block Range INdex efficiently for their use case. So Block Range INdexes basically index blocks of ranges. So they're particularly good for append-only data, like time series data, just being able to store dates. And they're really good when you want to say just get aggregates. So you're looking for a whole bunch of rows and aggregating that data, not so much. If you're looking to pull up a single value, a B-tree index is going to beat it in that case. The other advantage is because they're only indexing the individual ranges, they can be super super small relative to say, a B-tree index, but they go into detail about how to assess what's the best size to make them because determining how many rows within a particular range is important to do. And they also talk about the importance of doing some things like enabling autosummarize. So if you're considering using a BRIN index, definitely check out this blog post.

Related to that, the next piece of content is "5mins of Postgres E38: When to use Brin indexes in Postgres, tuning pages_per_range and determining correlation with pg_stats". This is from pganalyze.com, and Lukas goes into more depth about when is a good time to use a BRIN index, doing things like looking at pg_stats to check the correlation between the physical layout of the data and particular columns, or which columns would potentially be good candidates for a BRIN index. So definitely check out this piece of content as well if you want to learn more about that.
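
Checking that correlation yourself is a short query against pg_stats; the table name here is just a placeholder:

```sql
-- Correlation close to 1 or -1 means the column's values closely follow
-- the physical row order, which is where BRIN indexes shine.
SELECT attname, correlation
FROM pg_stats
WHERE schemaname = 'public' AND tablename = 'events'
ORDER BY abs(correlation) DESC;
```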

Next piece of content- "PostgreSQL 15- New Features to Be Excited About". This is from percona.com. This is just a post that runs down some noteworthy features of Postgres 15. Number one is MERGE. Number two is the row and column filters and logical replication. The next one is server-side compression and client decompression and pg_base_backup. This is as opposed to the client doing the backup and pulling all the data, the server itself compresses it as it's sending to the pg_base_backup client. Next is offering the jsonlog format for logging. Number of performance enhancements in terms of parallelism and sorting algorithms, and also the new technique of storing the server stats in shared memory. So if you want to learn more about any of those, definitely check out this blog post.

Next piece of content- "Postgres WASM by Snaplet and Supabase". This is from supabase.com and WASM is Web Assembly. So this is basically running Postgres in your browser. This is something similar to what CrunchyData did. But this is actually an open-source version of this project where they collaborated with Snaplet to be able to develop this. And they talk about the numerous features that are available and kind of why they did it. And they mentioned documentation for tutorials and demos and potential uses for offline considerations like if you want to work with data offline or do data analysis offline without having to be connected to a main database. Maybe there are use cases for this. As well as testing and dev environments, but they go into a lot of depth about how they got this running. Similar to the previous CrunchyData post that I talked about on a previous episode of Scaling Postgres. So if you want to learn more about that, definitely check out this blog post.

Next piece of content- "Learning PostgreSQL Internals''. This is from cleverelephant.ca, and here he's talking about someone who wants to get started learning about the internals of Postgres, maybe become a contributor, or maybe they want to develop some extensions. Well, he goes through a process and good resources to go through to learn more about the internals.

Now, related to that, the next piece of content is "Queries in PostgreSQL: 7. Sort and merge". This is from postgrespro.com, and their blog posts always go into intense detail about the internals of Postgres, discussing how it works in depth. So I would say a lot of Postgres Pro's blog posts are excellent resources to learn more about the internals of Postgres.

Next piece of content related to that is "How to Add More Environments to the Postgres CI". This is from citusdata.com, and he's talking about how, if you're getting started with developing for Postgres, you're going to want to have the CI system set up, and this talks about how to go through that process and set it up. So check out this blog post if you want to learn more about that.

Now, if you want to learn about Postgres generally, maybe this next blog post would help you with that. This is "Learning PostgreSQL". This is from proopensource.it. They're talking about certain resources that they think are good for beginners to get started with Postgres in terms of tutorials, learning portals, and then different types of online courses and books, and of course the Postgres documentation itself. But these are all sorts of different resources to help you learn more about Postgres.

Next piece of content. There was another episode of Postgres FM this week. This one was on "102 Query optimization", and 102 is basically, quote, "...the next step of query optimization work", which covers, quote, "...the difficult topic of how to verify your changes won't make other things worse". So if you want to, you can listen to this episode or click here to watch the YouTube video.

The next piece of content. The PostgreSQL person of the week is Joseph Sciarrino. If you're interested in learning more about Joseph and his contributions to Postgres, definitely check out this blog post.

The last piece of content, we did have another episode of The Rubber Duck Dev Show this past Wednesday evening. This one was on "A Beginner's Journey with @CodeWithJulie". So in this show, we talked about developers getting started with programming and different resources that could potentially be available to them. If you're interested in that content, we definitely welcome you to check out our show.
