
pg_stat_statements, Transaction ID Wraparound, Consultant Knowledge, CitusCon | Scaling Postgres 265


In this episode of Scaling Postgres, we discuss different ways to use pg_stat_statements, how to handle transaction ID wraparound, consultant knowledge and all the videos of CitusCon.


Transcript

All right. I hope you, your friends, family, and coworkers continue to do well. Our first piece of content is "pg_stat_statements". So this is an episode of Postgres FM, and they've been doing a very consistent job of putting out episodes every week, so I highly encourage you to check out their weekly show. But the reason it's first this week is that it's also the topic of this month's PGSQL Phriday.

So in this episode, as it says here, they cover why everyone should use it, but also some of its downsides. You can listen to the episode or watch the YouTube video. The next piece of content related to pg_stat_statements this month is "QUERIES FOR PG_STAT_STATEMENTS". This is from pgmustard.com.

And this is great because it shows different queries for different questions you want to answer. So if you're looking to reduce your workload, say your database server is at 80% CPU utilization at any given time and you want to bring that down, you would use queries sorted by the total time those statements are taking within the database. And if you're able to optimize those, you will slowly bring down things like CPU utilization or maybe even disk usage.
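This isn't pgmustard's exact query, but a minimal sketch of that first approach, assuming the Postgres 13+ column name total_exec_time (older versions call it total_time):

    -- Top statements by total execution time (their share of overall load).
    SELECT queryid,
           calls,
           round(total_exec_time::numeric, 1) AS total_ms,
           round((100 * total_exec_time / sum(total_exec_time) OVER ())::numeric, 1) AS pct_of_load,
           left(query, 80) AS query
    FROM pg_stat_statements
    ORDER BY total_exec_time DESC
    LIMIT 10;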

The next type of query helps you speed up your slowest queries, so it sorts the data in pg_stat_statements by the mean time to execute. This shows you your slowest statements and how you could speed those up. The third approach, if you need to reduce your IO, is basically focused on buffers. So he's outputting a lot of buffer-related information and then sorting by the total buffers being used by given statements.
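Again as rough sketches of the idea rather than the post's exact queries:

    -- Slowest statements on average (mean time per call), Postgres 13+ column names.
    SELECT queryid,
           calls,
           round(mean_exec_time::numeric, 1) AS mean_ms,
           left(query, 80) AS query
    FROM pg_stat_statements
    ORDER BY mean_exec_time DESC
    LIMIT 10;

    -- Statements touching the most shared buffers, a rough proxy for I/O pressure.
    SELECT queryid,
           calls,
           shared_blks_hit + shared_blks_read + shared_blks_dirtied + shared_blks_written
             AS total_shared_blks,
           left(query, 80) AS query
    FROM pg_stat_statements
    ORDER BY total_shared_blks DESC
    LIMIT 10;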

And then finally, if you need to tune your JIT settings and you're on Postgres 15, they've added a lot of different columns that let you look at JIT timings. The query then orders by a derived jit_total_time_percent, pulling in all of these columns, so you can find out what's spending a lot of time on just-in-time compilation. So this is a great post and I definitely encourage you to check it out.
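The JIT columns referenced below are the real Postgres 15 additions to pg_stat_statements; the percentage is my own rough approximation of the derived value in their query:

    -- Statements spending the most time in JIT compilation (columns added in Postgres 15).
    SELECT queryid,
           jit_functions,
           round((jit_generation_time + jit_inlining_time
                + jit_optimization_time + jit_emission_time)::numeric, 1) AS jit_total_ms,
           round((100 * (jit_generation_time + jit_inlining_time
                       + jit_optimization_time + jit_emission_time)
                  / nullif(total_exec_time, 0))::numeric, 1) AS jit_total_time_percent,
           left(query, 80) AS query
    FROM pg_stat_statements
    WHERE jit_functions > 0
    ORDER BY jit_total_time_percent DESC NULLS LAST
    LIMIT 10;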

The next post in the series of PGSQL Phriday is from andreas.scherbaum.la. And he's taking a different tack, looking at it from the perspective of his PostgreSQL Person of the Week series, where he interviews someone weekly. One of the questions he asks is: what is your favorite PostgreSQL extension? And the vast majority of answers are pg_stat_statements.

So it's the number one extension that everybody loves using. You can see the second one is PostGIS, then auto_explain, then pg_trgm, but pg_stat_statements is far and away the number one extension. And below this chart, he has some direct quotes from people on why they love pg_stat_statements.

The next piece of content is from e7e6.github.io. And he's looking at it from the perspective of how pg_stat_statements has changed over time. So he looks at how many columns existed in each version: in 8.4 there were only six, but in versions 15 and 16 there are 43 columns of information being tracked. So there's a lot of really relevant data to help you optimize your performance. And below, he gives an example query he likes to use, as well as some recommendations on how to query and use it.

The next piece of content related to this is from stormatics.tech. And they talk about some benefits and shortcomings, and I'll focus a bit on the shortcomings because not many of the other posts mentioned them. First, it does not record performance data for rolled-back transactions. So if you have a transaction that keeps getting rolled back, it's not going to show up in pg_stat_statements. And secondly, the data that pg_stat_statements collects is not persisted across server restarts. So if you need to restart your server, you're going to lose that data.

The next piece of content on the subject is "TRACK PERFORMANCE DIFFERENCES WITH PG_STAT_STATEMENTS". This is from rustprooflabs.com. And they take the approach of comparing performance before a change and after a change. So for example, you run a set of queries against your database, then run pg_stat_statements_reset(), which resets the metrics it has collected. Then you make your change, run the workload again, and you can compare and contrast the timings. He actually did that for a given set of queries and charted the results, so you can see the differences for different configuration settings in Postgres.
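A sketch of that before/after workflow, not their exact scripts:

    -- 1. Run your workload, then save the current numbers somewhere.
    CREATE TABLE stats_before AS
      SELECT queryid, calls, mean_exec_time FROM pg_stat_statements;

    -- 2. Clear the collected statistics.
    SELECT pg_stat_statements_reset();

    -- 3. Apply the configuration change, run the same workload again, then compare.
    SELECT cur.queryid,
           round(prev.mean_exec_time::numeric, 2) AS before_ms,
           round(cur.mean_exec_time::numeric, 2)  AS after_ms
    FROM pg_stat_statements AS cur
    JOIN stats_before AS prev USING (queryid)
    ORDER BY prev.mean_exec_time DESC
    LIMIT 10;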

The next piece of content on this topic is from di.nmfay.com. And this covers a more practical example of where they use pg_stat_statements in their work and how it has helped them address some issues. So if you're interested in learning more about that, you can check out this blog post.

And the last piece with regard to pg_stat_statements is actually a review of all the different posts, along with quick summaries and links to them. You can find this on pgmustard.com. So if you want to learn more about pg_stat_statements, there's a lot of content to check out.

Next piece of content- "How to Fix Transaction ID Wraparound". This is from timescale.com. And this is a great post that talks all about Transaction ID wraparound and basically why it exists. And from the information provided, it was deemed the easiest way to deal with tracking transactions. And it's only in the last number of years that it has started causing significant pain if you generate a ton of different transactions in your database.

Because when this was designed, back in '85 I believe, I'm sure they thought 2 billion transactions was more than enough for everybody. But of course, that's not so true anymore. Then this post goes into how to deal with wraparound failure, and of course the most important thing is to monitor where your transaction IDs are for your database and also for your tables. They have a query to do that here.
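Their query will differ, but the standard way to monitor this uses age() against pg_database and pg_class, something like:

    -- How far each database is from the 2-billion transaction ID limit.
    SELECT datname,
           age(datfrozenxid) AS xid_age,
           round(100.0 * age(datfrozenxid) / 2147483647, 2) AS pct_towards_wraparound
    FROM pg_database
    ORDER BY xid_age DESC;

    -- The same idea per table in the current database.
    SELECT relname,
           age(relfrozenxid) AS xid_age
    FROM pg_class
    WHERE relkind IN ('r', 't', 'm')
    ORDER BY xid_age DESC
    LIMIT 10;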

And they also talk about why this transaction ID wraparound issue causes problems today: it's a 32-bit transaction ID. If it were 64 bits, it wouldn't be a problem for a very long time. They go through each of the different pieces that have been slowly upgraded to support 64 bits, such as the system catalogs, different functions and procedures, the backends, and utilities like pg_basebackup and the vacuum utilities. All of these have been, or are in the process of being, converted to use a 64-bit transaction ID.

But of course, the problem is that those transaction IDs are also stored on the rows. That means when we ultimately move to 64-bit transaction IDs, you're going to have to rewrite all of your data tables. When that time comes, it will definitely be a Herculean task, although there are different ways to minimize downtime. I think doing a logical replication upgrade in that case probably makes a ton of sense.

But we'll have to see what utilities they come up with to help handle this. The post then concludes by talking about different ways to make sure your transaction IDs don't wrap around. Basically, that means keeping up with your vacuuming and making sure your tables are being vacuumed frequently. They talk about manually scheduling vacuums as well, and mention re-indexing your indexes to keep them as small as possible.

They mention a few other items, but another big one that people don't think about is batching their data changes. So if you insert a single row a thousand times, you're using a thousand transactions. But if you do one single insert statement with a thousand rows in it, that's just one transaction. So batching your modifications will definitely help minimize your transaction ID usage. I found this a great post and I highly encourage you to check it out to learn more about transaction IDs in Postgres.
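For example, using a hypothetical events table:

    CREATE TABLE events (payload text);

    -- A thousand of these in autocommit mode uses a thousand transaction IDs:
    INSERT INTO events (payload) VALUES ('a');
    INSERT INTO events (payload) VALUES ('b');
    -- ... one statement per row ...

    -- One multi-row insert consumes a single transaction ID:
    INSERT INTO events (payload) VALUES ('a'), ('b'), ('c');

    -- Wrapping many statements in an explicit transaction also uses just one:
    BEGIN;
    INSERT INTO events (payload) VALUES ('a');
    INSERT INTO events (payload) VALUES ('b');
    COMMIT;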

Next piece of content- "Should a consultant know everything?". This is from hdombrovskaya.wordpress.com. So basically she was discussing consulting with someone and the individual said they found it uncomfortable because quote "you are expected to know everything...". And then they're afraid they're going to get a question that they don't know the answer to.

Now, she hasn't really had a problem with this, because, as she says, she can just go research it. In another case, she was interviewing a person for a DBA position and asked him to rate his knowledge, and he said a seven. After a hundred days on the job, she asked, "What is your level of knowledge now?". He said, "I think a seven". And at the end of a year, she asked him, "What do you think of your knowledge level now?". And he still said a seven.

And this brings up a great point that's actually mentioned in a comment here, which quotes Albert Einstein: "The more I learn, the more I realize how much I don't know". And that fits me to a tee, because I've been doing these Scaling Postgres episodes for years now, four or five years, I can't even remember. And the amount of knowledge that I've actually retained from what I present weekly is a small percentage of the amount of information that I've forgotten.

So in my consulting, when someone asks me a question, I don't have perfect recall of everything I've ever learned. I might remember, 'Oh yes, I remember something about this', but then I have to go load up on that information again. I just don't have perfect recall of everything, because I'm a human and not a computer. But I think this work has helped me draw a lot of parallels and connections between things, so that once I see something, it's a little bit easier to remember.

I mean, I wish I knew everything, but to answer this question: no, a consultant shouldn't be expected to know everything. Postgres is just too huge a topic to know everything about. And there are so many things I don't even attempt to cover. I haven't gone into PostGIS or a lot of the AI things that are happening with Postgres yet, or even the source code; I haven't delved that much into it. I think it's more important to know what you want to focus on, and my focus tends to be on performance. But I thought this was a very thought-provoking article.

The next piece of content is actually a YouTube playlist of all the different videos from Citus Con: An Event for Postgres 2023. There are almost 40 videos of presented content. So if you're looking for some video educational content, here it is.

Next piece of content- "PgBouncer 1.19.0 is Released". This is from pgbouncer.org. And it has a number of enhancements including SHOW STATE, so if you've actually paused your PgBouncers, you can now run this command to know are they paused or are they not paused or are they running, et cetera. As well as it fixed a bug that was actually introduced in 1.18. So if you are running 1.18, you'll probably want to upgrade to this sooner rather than later to be able to patch that bug.

Next piece of content- "Extracting the list of columns from the catalogs". This is from fluca1978.github.io. And he's talking about pg_attribute. So when you create a table with the number of columns and he actually deleted the E column, you can actually use the pg_attribute table to look at all the different columns that exist within that table. And you can see it outputs the name of the column here as well as the system columns as well. The system columns have a negative attribute number whereas the user-defined ones are positive attribute numbers.

And you can see the 'e' column that was dropped is absent, because he's filtering on attributes that are not dropped. So maybe you could find this useful when you're constructing queries against different tables, perhaps using psql's \gexec to run commands against them. But if you're interested in learning more about pg_attribute, definitely check out this piece of content.
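A minimal sketch of that kind of pg_attribute query, using a placeholder table name:

    -- List the user-visible columns of a table, skipping dropped and system columns.
    SELECT attnum,
           attname,
           format_type(atttypid, atttypmod) AS data_type
    FROM pg_attribute
    WHERE attrelid = 'my_table'::regclass   -- substitute your own table name
      AND attnum > 0                        -- system columns have negative attribute numbers
      AND NOT attisdropped                  -- dropped columns remain as placeholders
    ORDER BY attnum;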

Next piece of content- "The History of Slonik, the PostgreSQL Elephant Logo". This is from vertabelo.com. And this was an interesting post talking about the history of Postgres and its logo. So they show some examples like Lennox has the pink one called Tux, but PostgreSQL has an elephant known as Slonik. And they go into the far history of Postgres, back to 1996, where an initial logo was shown.

In 1997, there was some discussion about a mascot, and elephants came up, but later in 1997 it was a cheetah logo. After some more discussion about elephants, they go through an alligator, a cheetah, a gazelle, an eagle, a dog, a bridge, and back to the elephant. So basically, from around 1999 on, it was the elephant. And apparently, Slonik is Russian for little elephant, so that's actually the name. If you want to learn more history about Postgres, its mascot, and its name, definitely check out this blog post.

Next piece of content- "Various Restoration Techniques Using PostgreSQL Point-In-Time Recovery". This is from highgo.ca. And when you're wanting to do a point-in-time recovery, there are three different options. You can restore to the latest point possible, which is essentially the default. You can restore it to a specific log sequence number. And you can also restore it to a specific timestamp, which is the one I tend to use. But they go through the whole process, and show you the commands to run to do each of these three types of point-in-time restorations. So check this blog post out of here interested in that.

Next piece of content- "Postgres 16 highlight- JSON predicates". This is from paquier.xyz. And he's talking about being able to determine if something is some type of JSON. So you have- is JSON a value? Is JSON an array? Is JSON an object? or is JSON the scaler? So if you're wanting to learn more about that, you can check this out.

Next piece of content- "Use PostgreSQL DOMAIN rules to validate columns of data". This is from dev.to in the Francesco Tisiot section. And he's talking about DOMAINs, which are essentially custom data types that you place constraints on. And he's used them specifically for talking about address data. So if you have an interest in learning more about that you can check out this blog post.

Next piece of content- "Performance Tips Using Postgres and pgvector". This is from crunchydata.com. And they're talking about using AI in Postgres which uses the pgvector data type, but there are also indexes that you can use with them. And these indexes are called ivfflat, which I wish I knew what that meant. But apparently, there are indexes for pgvector data types. And then using these, they discuss them when you're working with your list. Also how to optimize index size.

What they found is that the fastest query doesn't necessarily come from the largest list size. There was an optimal list size for the data they were working with, about 500, which gave them the fastest query while keeping the index build time as low as possible. So if you're interested in learning more, you can check out this blog post.
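A sketch of what that setup looks like with pgvector; the table, column, and vector dimension here are hypothetical, and lists = 500 simply reflects the sweet spot they reported for their data:

    CREATE EXTENSION IF NOT EXISTS vector;

    CREATE TABLE documents (
      id        bigserial PRIMARY KEY,
      embedding vector(1536)             -- dimension depends on your embedding model
    );

    -- ivfflat clusters the vectors into "lists"; tune this for your data.
    CREATE INDEX ON documents
      USING ivfflat (embedding vector_l2_ops)
      WITH (lists = 500);

    -- A nearest-neighbour query ordering by distance to a reference vector:
    SELECT id
    FROM documents
    ORDER BY embedding <-> (SELECT embedding FROM documents WHERE id = 1)
    LIMIT 5;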

Next piece of content- "Forcing Join Order in Postgres Using Optimization Barriers". This is from pganalyze.com. And this has Lucas's "Five minutes of Postgres" and he talks about a post we discussed last week in Scaling Postgres on forcing a join order in PostgreSQL. So if you want to learn his perspective on it, check out this piece of content.

Next piece of content- "LET'S PARTY AND UPGRADE POSTGRESQL AND POSTGIS TOGETHER". This is from cybertec-postgresql.com. And they're talking about a case where you're running Postgres 13 with PostGIS 2.5 and you want to actually get to Postgres 15 and PostGIS 3.3. So there's a specific order of events you need to follow in order to upgrade both at the same time. And this blog post goes over that.

Next piece of content- "How to submit a patch by email, 2023 edition". This is from peter.eisentraut.org.And if you want to work on Postgres and submit patches by email, he gives a sequence you can use. Leveraging git format-patch in order to submit them. So if you're interested in that, you can check out this blog post.

Next piece of content, the PostgreSQL Person of the Week is Floor Drees. If you're interested in learning more about Floor and her contributions to Postgres, definitely check out this blog post.

And the last piece of content: we did have another episode of the Rubber Duck Dev Show this past Thursday afternoon. This one was on "Hotwires & Cookbooks With Andrea Rocca", and it covers HTML over the wire. So it's the concept of sending HTML snippets to the client browser as opposed to JSON, and how that works, primarily in Ruby on Rails. So if you're interested in that content, we welcome you to check out our show.
