
How Much Observability Is Needed? | Scaling Postgres 294


In this episode of Scaling Postgres, we discuss how much observability is needed, avoiding locks during database migrations, a huge blogging event all about Postgres events, and FIPS mode.

Content Discussed

YouTube Video

Podcast Audio

Transcript

How easy is it to do Postgres performance optimization? Personally, finding where the problem lies is not too difficult. I mean, generally, I just look at pg_stat_statements and it's pretty obvious what's causing at least a database-wide problem. The hard part is actually figuring out how to fix it. I mean, sometimes it's really easy, but other times it can take quite a bit of time to think about the best way to handle a particular problem. Our first post talks a lot about building in observability so you can figure out what's going on in Postgres. But frankly, my number one metric is how the CPU is doing, and based on that, I can generally track down what the problems are. But I hope you, your friends, family, and coworkers continue to do well.

So our first piece of content is "Optimizing performance of Spark's Postgres database". This is from bajtos.net, and they had an application they had just started, and they quote "...quickly discovered that our database was having a hard time keeping up with the increased load". So the first thing they wanted to do was set up more observability. They were on the Fly.io platform and essentially asked, 'Okay, how many queries per second are happening, and what's the overall query duration?'. Then as they researched it a little more, they also wanted to know the fetches, the INSERTS, UPDATES, and DELETES, the proportion of index scans over total scans, CPU load, and memory use. Now, like I was alluding to, most of this is nice to have for me personally. If I have the CPU load and I can distinguish between what's I/O wait and what's system or user time, I can get a good sense of how hard the database is working and, roughly, where some problems might reside.

Now, that's not highly specific, but at that point you can look at pg_stat_statements, see which statements are taking the most time, and from there do your query optimization. But this blog post talks about how he got all these different metrics, and you'll notice a lot of them come from pg_stat_database, which is a view that exists on every Postgres installation. For query throughput, again, that's also in pg_stat_database. For his query duration, there might be a way to do that in pg_stat_statements (I'd have to think about that), but he used a metric that Fly.io apparently collects, which gives the maximum transaction duration from pg_stat_activity. The rows loaded versus rows returned come from pg_stat_database, and index versus sequential scans come from pg_stat_user_tables. Looking at this, he says they had a low fetch-to-returned ratio. That's not something I generally look at or pay too much attention to.
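Just to make that concrete, here is roughly what querying those views looks like. This is my own sketch, not code from the post, and the column names are from recent Postgres versions:

-- Rows returned vs. fetched plus write activity, from pg_stat_database
SELECT datname,
       tup_returned,                         -- rows read by scans
       tup_fetched,                          -- rows actually fetched
       tup_inserted, tup_updated, tup_deleted
FROM pg_stat_database
WHERE datname = current_database();

-- Index scans vs. sequential scans per table, from pg_stat_user_tables
SELECT relname,
       seq_scan,
       idx_scan,
       round(coalesce(idx_scan, 0)::numeric
             / nullif(coalesce(idx_scan, 0) + seq_scan, 0), 2) AS idx_scan_ratio
FROM pg_stat_user_tables
ORDER BY seq_scan DESC
LIMIT 10;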

Back in the post, max duration has peaks that are gradually becoming higher, and the CPU load has spikes reaching 15. Okay, so the database is doing a lot of work on something. Then he looks at finding missing indexes, which meant installing and enabling the pg_stat_statements extension, and he had to set up some way to collect and store these metrics. The post goes into a lot of different approaches, although it's quite easy just to query pg_stat_statements directly. But actually fixing these little queries, he said, was the easy part. I guess if it's simply missing some indexes, yes, that would be easy to do. And again, this is what I generally do: find the slowest queries reported in pg_stat_statements and then figure out which columns need to be indexed. So I was looking at all these things he was trying to measure, and my thought was, okay, if you have high CPU, make sure pg_stat_statements is on there, track what's going on, and then address the slowest queries.
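If you want to do that directly, a minimal sketch looks like this (my example, not from the post; on versions before Postgres 13 the columns are total_time and mean_time):

-- pg_stat_statements must be preloaded, which requires a restart, e.g. in postgresql.conf:
--   shared_preload_libraries = 'pg_stat_statements'
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

-- Top 10 statements by total execution time
SELECT round(total_exec_time::numeric, 1) AS total_ms,
       calls,
       round(mean_exec_time::numeric, 1)  AS mean_ms,
       left(query, 80)                    AS query
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;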

He also mentioned that they did some things at a caching layer above the database, which resulted in better application performance. But of course, you're storing cached values somewhere and not hitting the database, so that can be very performant. As a result of these changes, he saw the database CPU load drop from 15 to 0.4, which is good, and the max query duration went down. So you should be able to see that in pg_stat_statements as well. He also reduced a lot of aggregate queries that were summarizing large amounts of data by creating periodic aggregates they could query against instead of the raw data, and that sped things up as well. Anyway, the post has this quote: "Setting up observability for Postgres performance requires a bit of work," so it can be, "but it gives you valuable insights into the performance of your database".
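The post doesn't spell out exactly how those periodic aggregates were built, but a materialized view refreshed on a schedule is one common way to do it. As a rough sketch, assuming a hypothetical events table:

-- Pre-aggregate daily counts instead of scanning the raw rows every time
CREATE MATERIALIZED VIEW daily_event_counts AS
SELECT date_trunc('day', created_at) AS day,
       count(*)                      AS event_count
FROM events
GROUP BY 1;

-- A unique index is required to refresh CONCURRENTLY
CREATE UNIQUE INDEX ON daily_event_counts (day);

-- Refresh periodically, e.g. from cron or pg_cron
REFRESH MATERIALIZED VIEW CONCURRENTLY daily_event_counts;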

Setting up observability definitely gives you insights, but again, most of the time for me, if I know what the CPU is doing, how busy it is, and how much time is being spent in I/O wait, and I have access to pg_stat_statements, I can generally detect when something's going on and then investigate and resolve the issue. But I'd love to know what you think. How important are all these other metrics for Postgres to you? Do you take action based on them? Do you look at the ratio of fetched to returned rows, and does that lead you to take certain actions? Anyway, let me know in the comments.

Next piece of content- "How to avoid deadlocks in PostgreSQL migrations". This is from the Dovetail Engineering section of medium.com, and they had an issue where they were running a migration within a transaction and they were altering a row-level security policy on one table and then altering another table at the same time. But they had some others doing a user request through a query on both of these tables and they were running into deadlocks when they were trying to run this migration. So basically, a deadlock is when you have a sequence of locks where they cannot be resolved. One lock is waiting for another, but what it's waiting on has been locked by the other process. And basically, Postgres chooses one of those transactions to be terminated as a result of it. So it's a deadlock. Now, their proposal for fixing this migration is actually to do them in separate transactions.

Now, their proposal for fixing this migration is to do the changes in separate transactions. So for example, first alter the row-level security policy of one table in one transaction, and then do the other alter statement in a separate transaction. They also give some information about how you can see what gets locked when you run a particular DDL statement: they start a transaction, run the alter command they want to do, and then check the pg_locks view to see what impact it has (a sketch of that is below). Now, they're not recommending you do this for every migration, but they're showing it here to demonstrate where you can see how things are locked. But the first recommendation is basically don't do multiple DDL statements within a transaction. Now, of course, that runs counter to one of the benefits of Postgres, which is the fact that you can do that. So you can do a series of DDL changes, and if something needs to roll back, you can easily roll back and not have partial DDL changes committed.
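Roughly, that pg_locks inspection looks like this (my sketch; the table name and DDL are placeholders, not the statements from the post):

BEGIN;
ALTER TABLE documents ENABLE ROW LEVEL SECURITY;  -- or whatever DDL the migration runs

-- See which locks this transaction is holding and at what level
SELECT c.relname, l.mode, l.granted
FROM pg_locks l
JOIN pg_class c ON c.oid = l.relation
WHERE l.pid = pg_backend_pid();

ROLLBACK;  -- this was only an inspection, not the real migration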

So splitting up the transactions is one proposal to try to avoid deadlocks. The other is to take your locks in a consistent order: if you always acquire locks in the same order, you can avoid deadlocks, so that's another possibility. The other thing they mentioned is linting your database migrations. There are tools that look for operations that can lock tables for a long time and potentially bring your database down. I use Ruby on Rails, and there is the strong_migrations gem that helps you avoid some problems when you're doing migrations for your database. They also mentioned that you have some Postgres configuration parameters you can adjust, such as the deadlock timeout or the maximum number of locks per transaction.

But they said they actually haven't changed any of those yet. The number one thing I did not see them talk about is the lock_timeout parameter. I would say that's the number one tool in your toolkit to try to avoid deadlocks or other things that can lock up your application or prevent queries from running. So basically, when you do a DDL transaction, you set a lock_timeout for that session. Maybe you set it to 900 milliseconds, or maybe a few seconds, depending on what you're trying to do. If that DDL statement cannot acquire the lock it needs within that period of time, it errors out and rolls back.

So no harm, no foul, and queries can continue running without an issue. Now, if you're having an issue with deadlocks when you're trying to run a migration, I would think the lock_timeout parameter could work. So maybe you set it at 900 milliseconds, which is below the default deadlock_timeout of, I believe, 1 second. That way, if the migration can't acquire its locks within 900 milliseconds, maybe because something is locked in a way that would eventually result in a deadlock, it rolls back and other queries are not impacted. So that's my number one tool when doing migrations to avoid problems such as these.
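A minimal sketch of that approach (the value, table, and column names here are just illustrative):

-- In the migration session only; if the ALTER can't get its lock within 900 ms,
-- it errors out and rolls back instead of sitting in the lock queue.
SET lock_timeout = '900ms';

BEGIN;
ALTER TABLE documents ADD COLUMN archived_at timestamptz;
COMMIT;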
This post is also covered in "Five minutes of Postgres" at pganalyze.com, called "Avoiding deadlocks in Postgres migrations". Lukas covers this post and also shows the impact of lock queues. Once you have something locked up in this fashion, even if it's not a deadlock, you can see other requests start to queue up behind it, which can cause big problems for your application. He also mentions the strong_migrations gem that Andrew Kane wrote. So definitely check out these two pieces of content if you want to learn more.

Next piece of content- "PGSQL PHRIDAY #014: WRAPPING UP!". This is from cybertec-postgresql.com and PGSQL Phriday is the monthly blogging event. This one was on Postgres events, not events within Postgres, but events about Postgres that people go to online or offline. And this had an enormous amount of participation. There were twelve blog posts written about this topic, and this blog post gives his own summary of them and links back to them. But we'll just quickly look at them in the tabs here.

So the first one is from Frank on dev.to in the AWS Heroes section, and he recounted what happened with regard to PostgreSQL at the AWS re:Invent event.
The next one was actually from Postgres FM, and this was the weekly episode they did on events. So you can listen to their episode here or watch the YouTube video down here. They talked a lot about Nikolay's experience actually running some events, and both of them shared their experiences as well as advice for attending an event or wanting to be a speaker at one.
The next one is from Grant at scarydba.com, and he has the perspective of being a newbie to the Postgres world. He felt it's very welcoming, but compared to the Microsoft SQL Server community, the big events at least feel very top-down in structure.

Next one is from Jeremy at ardentperf.com, and he recounted attending the PASS Data Community Summit and the Seattle Postgres Users Group.
Next is from pgmustard.com, and Michael, who's also from Postgres FM, gave some additional words of advice and shared his experience in written form as well.
Next is Ryan from softwareandbooze.com, talking about his experience at the PASS Data Community Summit.
Next is Claire from citusdata.com, giving an illustrated guide to the same PASS Data Community Summit and talking a lot about the different talks presented at that event.
Next is Ryan from blog.rustprooflabs.com, and he gives a more general take on the benefits of networking and also talks about virtual versus in-person events and the benefits of each.
Another is Andreas from andreas.scherbaum.la. He actually leveraged the interviews he has done at postgresql.life and counted how many times the different conferences were mentioned, so that's some interesting data.
Next is Henrietta at hdombrovskaya.wordpress.com, talking all about her experience with events.
Then we had Rajiv at rajivharlalka.tech, who gave a short synopsis of his experience.

Lastly, Andrew at andyatkinson.com shared his experience of attending Postgres conferences as well as the benefits of each of them. So definitely check out these blog posts if you want to learn more.
Next piece of content- "PostgreSQL and FIPS mode". This is from peter.eisentraut.org and FIPS is a reference to the Federal Information Processing Standard. Mostly, you hear about this with regard to security, particularly with regard to OpenSSL. And he says the one that's probably the most relevant is FIPS 140-2, which I'll have to take his word for it, because I'm not quite familiar with this. And he says "...out of the box, OpenSSL does not satisfy the requirements of FIPS 140-2, but some versions can be configured to operate in a mode that satisfies FIPS 140-2". Now, he talks about this with regard to OpenSSL and then how it impacts PostgreSQL as well, and he goes through a lot of detail, but he has a very great summary at the bottom here.

So basically, in Postgres 14 and higher, MD5 authentication won't work if you've enabled FIPS mode. In PostgreSQL 15 and higher, pgcrypto requires OpenSSL and is subject to FIPS mode restrictions. Then in Postgres 17 and higher, all PostgreSQL test suites pass with OpenSSL in FIPS mode. So basically this blog post is a result of him trying to get all the tests to pass with OpenSSL in FIPS mode. So if this is something of interest to you, I highly encourage you to check out his blog post.
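One practical note from me, not from the post: if FIPS mode rules out MD5 authentication, the usual move is to switch to SCRAM, roughly like this (the role name is just an example):

-- Use SCRAM instead of MD5 password hashing (the default since Postgres 14)
SET password_encryption = 'scram-sha-256';
ALTER ROLE app_user WITH PASSWORD 'new-password';  -- re-hashes the password using SCRAM

-- pg_hba.conf then needs "scram-sha-256" instead of "md5" on the relevant lines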

Next piece of content- "Fun with Postgres ASCII Map and Cardinal Directions". This is from crunchydata.com and this is the Day 23 challenge from the 2022 Advent of Code. Again, I won't go through this further because it would spoil it.

The last piece of content is "Advent of Code in PostgreSQL: Tips and Tricks from 2022". So if you're looking to use Postgres to solve some of these challenges in 2023, here are some tips you can go through to do that. So definitely check out this blog post if you want to learn more.
