
Recovery Time, Lost Data, Production Ready, PG16 Highlights | Scaling Postgres 259

In this episode of Scaling Postgres, we discuss max_wal_size as it relates to recovery time, losing your data with collation changes, getting production ready and highlights coming in Postgres 16.

Transcript

All right. I hope you, your friends, family, and coworkers continue to do well. As a reminder, this show is going to be a little bit shorter than normal because we're only covering about five days' worth of content; we'll pick up and do seven days next week because of how I shifted the recordings. But our first piece of content is "A little more on max_wal_size". This is from thebuild.com. And this is a follow-up to his earlier post about max_wal_size.

And this one addresses a question a commenter had: isn't it a risk to have max_wal_size too big, because it increases the recovery time? Your recovery time is based on when the last checkpoint happened: all the WAL that's been generated since that last checkpoint needs to be reapplied to the database files before the database can be functional. And his recommendation, which is definitely typical, is to set max_wal_size high enough that the limit isn't normally what triggers a checkpoint; you want the checkpoint timeout setting to determine when checkpoints occur.
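
Just to make that concrete, a rough sketch of that setup might look like this, with the values being purely illustrative rather than recommendations, and pg_stat_bgwriter used to confirm that checkpoints are being triggered by the timeout rather than by WAL volume:

    # postgresql.conf (illustrative values only, not recommendations)
    checkpoint_timeout = 15min    # checkpoints should normally be triggered by this timeout
    max_wal_size = 16GB           # sized large enough that this limit is rarely hit in between

    -- then verify in SQL that timed checkpoints dominate requested ones
    SELECT checkpoints_timed, checkpoints_req FROM pg_stat_bgwriter;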

But depending on how much WAL you're generating, that could mean a very large max_wal_size and an increased recovery time. So for example, I had a client that had a lot of WAL activity and a lot of full-page image writes happening. So much so that the WAL being generated every day was many times the actual size of the database. And to get acceptable performance, we had to increase the checkpoint timeout to 25 minutes. But of course, that's a downside, because if that system crashes, you now have potentially 25 minutes of downtime, or however long it takes, for those WAL records to be applied before the database can be brought up.

So it is definitely a downside to have a large max_wal_size and you'll have to take that into consideration. So as he says here quote "If crash recovery speed is a concern, it might make sense not to go crazy with max_wal_size and cap it at a lower level". But that means you're going to be checkpointing more often, of course. But if you want to learn more about it, definitely check out this blog post.

Next piece of content- "Did Postgres Lose My Data?". This is from ardentperf.com. And this post, as you can see, is a FILED UNDER post and is talking about Collation, Corruption, Database, GLIBC, ICU, etc. So this is all about collations and potentially corrupted data. But this is a pretend scenario where Jeremy picked a specific language that had a coalition change. So basically, he set up an EC2 instance with Postgres 15 and he used ICU collations. So he's not using the GLIBC versions, but he's using the ICU correlations.

He sets up his table with some data and then he says 'Hey, I want to add a replica'. So he adds a hot standby and starts replicating the data. But then he discovers there are no records in that hot standby. So what happened? Well, basically what happened is that the primary server is on Ubuntu 20.04 and the replica was on 22.04. So it was a different version of the operating system. The Postgres versions were the same, but not the operating system. What that generally means is there are going to be different library versions; for example, the GLIBC library may have changed.

But in this case, it looks like the ICU collations that he was using had changed. And if you look in the Postgres logs, it says "WARNING: database 'research_texts' has a collation version mismatch". It shows the collation version recorded for the database and then what the operating system currently provides, and as you can see, they're quite different here. Now as a consequence, when you try to query this, data is not going to be there and unique indexes aren't going to be maintained as unique.
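
If you want to check for this yourself, Postgres 15 records the collation version each database was created with and can report what the provider currently returns, so a quick comparison looks something like this:

    -- compare the collation version recorded for each database
    -- with what the operating system / ICU library currently reports
    SELECT datname,
           datcollversion AS recorded_version,
           pg_database_collation_actual_version(oid) AS current_version
    FROM pg_database;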

So you could have data seemingly disappearing, and this is all from collation changes. What you want to do is keep those versions consistent. So make sure your replica is the same version as the primary, and this even follows through to upgrades: when you're upgrading, you should keep those versions consistent. Now, it's not necessarily the operating system version that needs to be consistent, but whatever you're using for collations. So maybe you want to keep your GLIBC at a specific version.

And we'll look at another post in a second where you'll see someone doing that, or maybe you keep your ICU collation versions consistent. If you don't plan to do that and you're doing an upgrade, basically you need to re-index every index. But one interesting thing he did is offer a library for looking at collation changes between GLIBC and ICU versions to see how many changes have been made across them. So this looks like a super interesting resource if you want to check what has potentially changed between different versions.
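
And if you do find yourself on a new collation version, the remedy he mentions, reindexing everything and then recording the new version, would look roughly like this while connected to the affected database (using the database name from the example above):

    -- rebuild every index that may depend on the old sort order
    REINDEX DATABASE research_texts;

    -- then record that the currently installed collation version is the expected one
    ALTER DATABASE research_texts REFRESH COLLATION VERSION;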

But if you want more information about this, there is also the episode of "Five minutes of Postgres" that pganalyze.com did called "Collations in Postgres". He covers this issue and discusses some other resources to consider, including how RDS actually maintains a specific GLIBC version for all of their versions of Postgres so that they can avoid this collation problem when you're migrating between versions. But if you want to learn more about that, you can definitely check out these two pieces of content.

Next piece of content- "Is your Postgres ready for production?". This is from crunchydata.com. And they're talking about some checklists you should have before you get ready to put your database into production. The number one thing to have, of course, is backups. And I would also add doing a test restore of those backups because you want to make sure your data comes back. That's the purpose of the backup and they cover different ways you can do the backups. The next area to consider is high availability. So if your database system goes down what are you going to do?

You can restore from the backup but generally, that takes a while. He says here quote "A rule of thumb is you should just plan for one hour of downtime per 200GB of database size". But the way to avoid that is to have a streaming replica. So this is a physical streaming replica of your data, and if your primary machine should have a malfunction, you simply promote that replica to be the new primary. So he talks about that. He also talks about the importance of logging, especially logging slow queries and then being able to look at their explain plans as well.
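
As a rough sketch of that logging piece, and assuming auto_explain as one common way to also capture the plans of slow statements (the thresholds here are just placeholders):

    # postgresql.conf (illustrative values only)
    log_min_duration_statement = 1000          # log any statement that runs longer than 1 second
    shared_preload_libraries = 'auto_explain'  # auto_explain can log the plans of slow statements
    auto_explain.log_min_duration = '1s'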

All of this is about understanding what queries are running in the system. I would also advocate turning on pg_stat_statements, because that gives you great insight into what queries are running. And he also recommends potentially using pgaudit if you need to audit certain queries as well. The other thing he mentioned is protecting your database by using statement_timeout. Now, my recommendation for this is generally not to apply it globally, but per user that's connecting.
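
A minimal sketch of turning on pg_stat_statements, with the query at the end being just one way to look at the results:

    # postgresql.conf: add to the comma-separated list from the previous sketch
    shared_preload_libraries = 'auto_explain, pg_stat_statements'

    -- then, in the database:
    CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

    -- top five statements by total execution time
    SELECT query, calls, total_exec_time
    FROM pg_stat_statements
    ORDER BY total_exec_time DESC
    LIMIT 5;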

The reason is that when it's global, a lot of times if you're doing DBA duties with the postgres user, things time out when you don't expect them to. So generally, I like to set statement_timeout per user that connects to the database, or per role whose member users connect to the database (a sketch follows below). And lastly, he covers connection pooling. Now, I don't think you need this to go live, but as your application gains more users, you will usually see the benefit of adding connection pooling at some point. But if you want to learn more, you can check out this blog post.
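
Circling back to statement_timeout, that per-role approach might look like this, with app_user being a placeholder role name:

    -- give the application role a timeout without affecting DBA maintenance sessions
    ALTER ROLE app_user SET statement_timeout = '30s';

    -- the postgres role used for DBA work keeps the default (no timeout)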

Next piece of content- "Postgres 16 highlight- File inclusions in pg_hba.conf and pg_ident.conf". This is from paquier.xyz. And I really liked this enhancement because this is similar to what exists for postgresql.conf. And generally, I don't touch that file; I store all my configuration in the conf.d directory when I do my configuration because it's just easier to do that with Ansible. Well, I like the ability to not have to individually change these two files. I can have a file located somewhere and just have an include statement to be able to include that configuration as a part of those files. But if you want to learn more about this enhancement coming in Postgres 16, you can check out this blog post.

Next piece of content- "PostgreSQL 16: GENERIC_PLAN option for EXPLAIN and SHELL exit codes for PSQL". This is from dbi-services.com. And they're talking about two enhancements. One is the ability to look immediately at a generic plan when using EXPLAIN. So usually for generic plans, you have to do what he describes here. You need to first create a prepared statement and execute that six times because otherwise you're going to get the custom_plan. And then the final time you run it, you'll get the generic_plan or you switch the plan_cache_mode to force_generic plan.

But with this new enhancement, all you have to do is run EXPLAIN with the GENERIC_PLAN option and it gives you the generic plan. So that's great. The next feature is for psql to actually get the exit codes of operating system commands you run within psql. So he shows an example here, and you can output the result with the SHELL_ERROR or SHELL_EXIT_CODE variables. So two great enhancements coming to Postgres 16.
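
And just to make them concrete, a quick sketch of both, again with an illustrative query and shell command:

    -- new in Postgres 16: parameter placeholders are allowed directly, no PREPARE needed
    EXPLAIN (GENERIC_PLAN) SELECT * FROM orders WHERE customer_id = $1;

    -- and in psql, after running a shell command with \!,
    -- SHELL_EXIT_CODE holds its exit code and SHELL_ERROR is true/false
    \! ls /some/missing/path
    \echo :SHELL_EXIT_CODE
    \echo :SHELL_ERROR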

Next piece of content- "NEW OLD 'DEBUG_PARALLEL_QUERY' SETTING IN POSTGRESQL 16". This is from cybertec-postgresql.com. And he's talking about a change in a runtime option that is going from force parallel mode to debug parallel query. So that seems odd. Well apparently, force parallel mode is as he says here quote "meant to be used to allow us to exercise the parallel query infrastructure to ensure it's working as we expect".

But people were interpreting it as turning on all the parallel features in Postgres, which is clearly not what it does. So they're renaming this option to "debug_parallel_query", which is a much better description of what the feature does (a quick example is below). So if you want to learn more about that, you can check out this blog post.
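
Just as an illustration, the renamed setting is used the same way the old one was, something like:

    -- Postgres 16: the renamed setting (formerly force_parallel_mode)
    SET debug_parallel_query = on;   -- valid values: off, on, regress
    -- plans are now forced under a Gather node where it's safe, for testing purposes
    EXPLAIN (COSTS OFF) SELECT count(*) FROM pg_class;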

Next piece of content- "Using postgres_fdw.parameter_name for advanced monitoring with multiple PostgreSQL servers". This is from postgresql.fastware.com. And there's an enhancement for Postgres foreign data wrappers. When you're using the Postgres foreign data wrappers to speak to a foreign Postgres server, you can now set the application name that the foreign server sees. So before, all it said was postgres_fdw, as you can see here in terms of the application name.

Well now, you can actually give it a custom name and use escape sequences to help build that application name. So for example, you can include the application name on the local server, the session ID, and the cluster name. He has an example here where he builds the postgres_fdw application name from the cluster name plus whatever the connecting user's name is.
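
A sketch of what that might look like, assuming the %C (cluster name) and %u (local user name) escape sequences:

    -- make foreign connections identify themselves as "<local cluster_name>_<local user name>"
    SET postgres_fdw.application_name TO '%C_%u';

    -- on the foreign server, those connections then show up in:
    SELECT application_name, usename, client_addr FROM pg_stat_activity;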

And then when you run it, you can see which server is connecting via the foreign data wrapper. You can see these connections coming from the analyzer server with a particular user, and another user on the same server connecting as well. So it gives you more information about what is connecting to the foreign server. And if you want to learn more about that, definitely check out this blog post.

Next piece of content- "The Cognitive Load of Licensing". This is from softwareandbooze.com. And he's talking about one of the advantages of Postgres, of course, is that it's open source, but another big thing to consider is that because of that, you don't have to have discussions on licensing and what you can and cannot do with it. So for example, he says here quote "I can't tell you how often I used to sit in planning meetings with questions like these- Can we do online (concurrent) index builds? Why isn't the server using all the memory? Can we have a read-only warm standby?..."

And questions such as these. And he says the answer often was quote "Our current license doesn't support that". So you don't have to think about things like that with Postgres, you can use all the features. Unfortunately, the answer may then become, well our current version doesn't support that, which means you do have to keep up with upgrades if you want all the newest and latest features. But definitely an interesting perspective and if you want to learn more, you can check out this blog post.

Next piece of content, the PostgreSQL person of the week is Chris Ellis. If you want to learn more about Chris and his contributions to Postgres, definitely check out this blog post.

And the last piece of content, we did have another episode of the Rubber Duck Dev Show this past Thursday afternoon. This one was "Hosting After Heroku With Chris Oliver". So after Heroku got rid of all of their free hosting plans and went to only paid ones, we discussed with Chris what the landscape looks like in terms of looking to host your web applications. So if that content is of interest to you, we welcome you to check out our show.
