
Top Ten Postgres Dos & Don'ts | Scaling Postgres 363
In this episode of Scaling Postgres, we discuss a top ten dos and don'ts for Postgres, understanding statistics and extended statistics, aligning columns to reduce padding, and how to handle disaster recovery.
Content Discussed
- Hacking the Postgres Statistics Tables for Faster Queries
- Type alignment and padding bytes: how to not waste space in PostgreSQL tables
- Apr 16, 2025 - PgBouncer 1.24.1
- 8 Steps to Proactively Handle PostgreSQL Database Disaster Recovery
- On expressions' reordering in Postgres
- PgPedia Week, 2025-04-20
- Contributions for the week of 2025-03-17 (Week 11)
- Contributions for the week of 2025-03-24 (Week 12)
- Contributions for the week of 2025-03-31 (Week 13)
- Contributions for the week of 2025-04-07 (Week 14)
- Contributions for the week of 2025-04-14 (Week 15)
- Logical replication from Postgres to Iceberg
- How Timescale Solves Real-Time Analytics in Postgres
Transcript
I don't know what it is, but apparently a lot of people are on vacation, because there are hardly any blog posts this week. So what I'm going to do is maybe make the Consulting Corner just a little bit longer. I was asked by a consulting client this week to come up with a top 10 Postgres do's and don'ts, so be sure to stay till the end to find out what my top 10 were. But I hope you, your friends, family and co-workers continue to do well.
Our first piece of content is “Hacking the Postgres Statistics Tables for Faster Queries”. This is from CrunchyData.com, and I'm not sure I would call this hacking the Postgres tables, but this is a very good blog post explaining the importance of statistics and how they work. So basically Postgres stores statistics about the data in each of the tables, and it does it per column. Those statistics get updated when you run ANALYZE or when autovacuum runs; it analyzes the table and updates those statistics. She lists the statistics it gathers: distinct values, average data width, null fraction, and correlation, where correlation is between the physical order of the tuples and the order of the values in the column. It also collects most common values, or MCVs, as well as histograms. These are exposed in the pg_stats view, which you can just query to look at them. Then it talks about how these statistics are used in query planning to determine how best to approach the plan to execute the query, and she goes through all of this in detail. So I think this is going to be a little hard to convey on the show, but I definitely encourage you to check this out if you want to learn more about how statistics play a role in query execution. And because these statistics are per column, relationships between columns aren't captured; you establish those through extended statistics. You can define functional dependencies, multivariate distinct counts, or multivariate most common values. She walks through each of those examples and shows how you can have a poor-performing query, but if you add an extended statistic, you now get better query performance. She does that for each of those different examples. So again, this is a really great blog post to really understand how statistics work and how they may be able to assist you in some of your query optimization. If you want to learn more, definitely check out this blog post as well as her slides, because this blog post actually comes from a longer deep-dive presentation on statistics that was given at PGConf.EU in 2024.
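To make the extended statistics idea concrete, here's a minimal sketch of my own (the addresses table, its columns, and the statistics name are hypothetical, not from her post) showing the pg_stats view she references and an extended statistics object for two correlated columns:

```sql
-- Hypothetical table where city and zip_code are strongly correlated,
-- something per-column statistics alone can't express.
CREATE TABLE addresses (
    id       bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    city     text NOT NULL,
    zip_code text NOT NULL
);

-- The per-column statistics gathered by ANALYZE / autovacuum.
SELECT attname, n_distinct, null_frac, avg_width, correlation, most_common_vals
FROM pg_stats
WHERE tablename = 'addresses';

-- Tell the planner the two columns are related (functional dependency
-- and multivariate distinct counts), then refresh statistics.
CREATE STATISTICS addresses_city_zip (dependencies, ndistinct)
    ON city, zip_code FROM addresses;
ANALYZE addresses;

-- Compare the planner's row estimate before and after adding it.
EXPLAIN SELECT * FROM addresses
WHERE city = 'Chicago' AND zip_code = '60601';
```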
Next piece of content: “Type alignment and padding bytes: how to not waste space in PostgreSQL tables”. This is from cybertec-postgresql.com, and we've covered this topic in previous episodes. Basically this is about ordering the data columns in each table so that you avoid padding bytes. That packs your data more tightly and makes the table smaller, which can have ramifications for better query performance. He says there are a few different type alignments: c (char), which stands for an alignment of 1 byte; s (short), an alignment of 2 bytes; i (int), an alignment of 4 bytes; and d (double), an alignment of 8 bytes. Every data type falls into one of those 8-, 4-, 2-, or 1-byte categories. I was a little confused by UUID being in the c type, the one-byte type, but apparently it always has a fixed length of 16 bytes, so it doesn't need any alignment padding. Then, using the pageinspect extension, he shows a process for seeing what these padding bytes look like, and they're basically just filled with zeros. Where the padding is happening, and why it pads, is because the first column is a smallint, which is an alignment of two, and the next column is a timestamp, which is an alignment of eight, so it has to skip over six bytes before it can start the timestamp. Then it does an integer, which is an alignment of 4, followed by a double precision, which is an alignment of 8, so it has to add 4 padding bytes before getting to the double precision. All of that is basically wasted space in the heap table. He then goes through the process of ordering your data types to save as much space as possible. The first one he says is to define your frequently accessed columns of type UUID, because that has a fixed size of 16 bytes, so put that large column or columns in there first. Next, define the columns with an alignment of 8 bytes, so bigints, timestamps, etc. Then define columns with an alignment of 4 bytes, again integers. Next is 2 bytes, like smallints, and finally do the 1-byte types or those with variable lengths. That could be booleans, but also text, character, varchar, numeric, etc. He then shows how the table would look with this new column order. Now, I have mentioned this in previous episodes of Scaling Postgres, but normally I don't worry about this at all, because I'm usually working with online transaction processing databases where the columns change so quickly that I don't spend any time on alignment. However, if you're doing data warehousing, or if you anticipate having billions of rows in a particular table in your application, then spending some time on alignment could definitely save you some space in the long run. But if you want to learn more, definitely check out this blog post.
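As a rough sketch of the idea (my own example, not the table from the blog post), here are the same four columns declared in a padding-heavy order and then reordered from 8-byte alignment down to 2-byte:

```sql
-- Wasteful order: smallint (align 2) forces 6 padding bytes before the
-- timestamp (align 8), and integer (align 4) forces 4 more before the
-- double precision (align 8).
CREATE TABLE measurements_padded (
    sensor_id  smallint,
    taken_at   timestamp,
    reading_no integer,
    value      double precision
);

-- Same columns reordered: 8-byte aligned types first, then 4, then 2.
CREATE TABLE measurements_packed (
    taken_at   timestamp,
    value      double precision,
    reading_no integer,
    sensor_id  smallint
);

-- After loading the same data into both, compare the on-disk size:
SELECT pg_total_relation_size('measurements_padded') AS padded,
       pg_total_relation_size('measurements_packed') AS packed;
```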
Next piece of content is actually an announcement of a new release, “PgBouncer 1.24.1”. This release fixes CVE-2025-2291, where if you set a password expiry in Postgres using VALID UNTIL, PgBouncer did not honor it. This release fixes that. It also fixes an issue with PAM authentication that was introduced in release 1.24.0. So if either of those issues affects you, you probably want to upgrade to PgBouncer 1.24.1 as soon as you're able.
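For reference, password expiry is the VALID UNTIL attribute on a role; here's a quick sketch of setting and checking it (the role name is hypothetical):

```sql
-- Set a password expiry on a role; per CVE-2025-2291, PgBouncer
-- did not honor this expiry before release 1.24.1.
ALTER ROLE app_user VALID UNTIL '2025-12-31 23:59:59+00';

-- Check the configured expiry.
SELECT rolname, rolvaliduntil FROM pg_roles WHERE rolname = 'app_user';
```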
Next piece of content: “8 Steps to Proactively Handle PostgreSQL Database Disaster Recovery”. This is from pgedge.com, and they go through 8 steps to follow if you suspect some kind of corruption is happening. First is detecting it, and they mention a few different ways you can do that. You can check the PostgreSQL log files, where you may see a PANIC or FATAL message, or definitely errors saying it could not read a particular block of a file; that's definitely an indication. You can also use pg_amcheck on your database. This is a tool to check the heap and the indexes for consistency, and you do need to install the amcheck extension in each database you're working with. Next, if you created your cluster with checksums, you can verify those with pg_checksums. Now, this requires shutting down the database, so keep that in mind; maybe you would want to somehow clone your production database and then run pg_checksums on that. But that can definitely highlight corruption due to potential hardware failures, and when you run it, it tells you where the issue is. Once you've identified corruption, the next step is basically to stop PostgreSQL immediately if that's in the realm of possibility; basically you want to prevent further damage. Next is to restore from a known good full backup, using point-in-time recovery if you're able to do that. Then step five is to salvage what you can, so maybe do a dump of a table and try to restore that data into a new cluster. Now, if you have a problem with WAL files, they do mention you can use pg_resetwal, but they emphasize this is a last resort, because it's very easy to lose data since the last checkpoint of your database. Then step seven is prevention, which is basically enabling data checksums on your cluster and running regular integrity checks with pg_amcheck. And of course they assume you are running some sort of replicas and have WAL archiving enabled as well. But if you want to learn more, check out this blog post.
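If you want to run those integrity checks by hand, here's a minimal sketch (the table and index names are hypothetical) using the amcheck extension that the pg_amcheck CLI drives:

```sql
-- Install the extension in the database you want to check.
CREATE EXTENSION IF NOT EXISTS amcheck;

-- Check a btree index for internal consistency.
SELECT bt_index_check('orders_pkey');

-- Check a table's heap; any rows returned describe corruption found.
SELECT * FROM verify_heapam('orders');

-- From the shell, the pg_amcheck CLI wraps these checks:
--   pg_amcheck --all
-- And offline, with the cluster stopped, checksums can be verified:
--   pg_checksums --check -D /path/to/data
```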
Next piece of content: “On expressions' reordering in Postgres”. This is from danolivo.substack.com, and it's talking about the ordering of different expressions, usually within a WHERE clause, to optimize for the best performance. This does get a little bit into the weeds with regard to the planner and how it works, but if that's of interest, you may find this blog post worthwhile. He was able to speed some things up by 2 to 3%, or in some cases around 10%, and he questioned whether this type of micro-optimization is beneficial. But if you want to learn more, you can definitely check out this blog post.
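As a toy illustration of what's being discussed (my own example; the app_events table is hypothetical), here are two orderings of the same WHERE clause. Conceptually, evaluating the cheap equality before the expensive regular expression saves per-row work; whether and how Postgres reorders these expressions based on cost is exactly what the post digs into:

```sql
-- Cheap comparison written first, expensive regex second.
EXPLAIN (ANALYZE, BUFFERS)
SELECT count(*)
FROM app_events
WHERE status = 'error'              -- cheap equality check
  AND message ~ 'timeout.*retry';   -- expensive regex match

-- Same predicates written in the reverse order.
EXPLAIN (ANALYZE, BUFFERS)
SELECT count(*)
FROM app_events
WHERE message ~ 'timeout.*retry'
  AND status = 'error';
```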
And now it's time for my Consulting Corner. Earlier this week a client asked me to come up with some Postgres do's and don'ts, and it was going to be part of a presentation. So I started coming up with a whole big long list of do's and don'ts, and it wasn't really working for a presentation. So then I said, alright, what are the top 10 things I would want to communicate to a group of engineers, predominantly developers, who use Postgres but don't keep 100% on top of everything? What are the most important things they should know in their usage of the database? What are some things that you should do and should not do? So I came up with the idea of just doing it as a top 10 list. There's definitely tons of stuff omitted, but I tried to have within that top 10 what I thought was the most relevant, and it reflects what I see when I'm engaging with clients and the kinds of issues they deal with. So it's prioritized along that level as well as the relative importance of each item. I thought this would be fun to walk through on the show, and maybe in the comments you can tell me which ones you disagree with that shouldn't be there, or whether I entirely forgot something that should be included. So let me know in the comments. The first one is basically avoid excessive updates. This is due to Postgres MVCC: when you update one row 100,000 times, essentially you're creating 100,000 row versions in the database. So if you do this a lot, there's a performance impact, and it can also cause excessive vacuums and bloat, of course. Number two is about insert speed: using Postgres COPY is more performant than any other method, usually followed by multi-row inserts and then single-row inserts. Number three is to index concurrently, because indexing non-concurrently basically locks the table against writes. Number four is to use lock_timeout for schema migrations. This setting cancels your migration if it waits on a lock longer than the time you've configured, which keeps it from blocking other processes for too long, basically helping you protect your system. Number five is to use a connection pooler once you need more than 500 database connections. Number six is to avoid excessive explicit locks. This is you explicitly locking your table to perform some operation, or even holding a really long lock on a set of rows, and minimizing these should help you maintain your read and write throughput. Number seven is the usage of multi-column indexes: use these when they're beneficial, as shown by EXPLAIN. I don't normally throw on a lot of multi-column indexes; I usually wait until I see a query problem and then apply a multi-column index if I think it would be beneficial. Number eight is functional indexes: use these when necessary. A lot of times I go to a client site and they say, well, I put this index on and it's not working, and so many times they're working with email and querying on LOWER(email), but they don't have a functional index defined on LOWER(email), therefore the index isn't being used. So if you're querying a column with a function applied to it, you need a functional index for the query to use an index at all. Number nine is partial indexes: use these with low-cardinality fields, since a partial index only indexes a subset of the table's rows. I mentioned this last week in my Consulting Corner, but basically if you have something like a status column that has two or three values, putting a plain index on it is not going to be very efficient.
But it can be efficient if you create independent indexes for each of those statuses. Or maybe you just need one: if you're checking to see whether something needs to be processed, you might create an index where the status is incomplete, and it may only contain a handful of rows whenever you're using that index, so that would be a blazingly fast query. And number ten, I said consider partitioning if you plan to delete data, or once your table is over 100 to 200 million rows. Now, I think this is a very conservative estimate; I've seen tables of 500 million rows and up to a billion that are fine. But I do think this is the level at which you should start considering partitioning. Maybe you do, maybe you don't, but I think that's a pretty good level at which to consider it. So what do you think? What do you think I left off that should be included, and what do you think is on here that should maybe be taken out of a top 10 do's and don'ts for Postgres? Please let me know in the comments.
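To make a few of the top-ten items concrete, here's a minimal SQL sketch with hypothetical table, column, and index names, covering concurrent index builds, lock_timeout for migrations, functional indexes, a partial index on a status column, and declarative partitioning:

```sql
-- #3: build indexes concurrently so the table isn't locked against writes.
CREATE INDEX CONCURRENTLY idx_users_created_at ON users (created_at);

-- #4: cap how long a schema migration may wait on a lock, so it gives
-- up instead of stalling the queries queued up behind it.
SET lock_timeout = '5s';
ALTER TABLE users ADD COLUMN last_seen_at timestamptz;

-- #8: queries filtering on lower(email) need a functional index.
CREATE INDEX idx_users_lower_email ON users (lower(email));

-- #9: a partial index only covers the small slice of rows you care
-- about, such as jobs still waiting to be processed.
CREATE INDEX idx_jobs_incomplete ON jobs (created_at)
    WHERE status = 'incomplete';

-- #10: declarative range partitioning makes removing old data a cheap
-- detach or drop of a partition instead of a huge DELETE.
CREATE TABLE event_history (
    event_id   bigint NOT NULL,
    created_at timestamptz NOT NULL,
    payload    jsonb
) PARTITION BY RANGE (created_at);

CREATE TABLE event_history_2025_04 PARTITION OF event_history
    FOR VALUES FROM ('2025-04-01') TO ('2025-05-01');
```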