Join Costs, zHeap, Autovacuum Tuning | Scaling Postgres 26

Join Over 1,000 Engineers & Get New Episodes Weekly!

In this episode of Scaling Postgres, we review articles covering join costs, zHeap, autovacuum tuning and Postgres configuration for humans.

Content Discussed

YouTube Video

Podcast Audio


[00:00:00] In this episode of Scaling Postgres, we talk about join costs, z heap, auto, vacuumtuning, and human configuration. I'm creston. Jameson. And this is scaling postgres episode 26.

[00:00:21] All right, our first article is Cost of a Join, and this is from andin it he is asking the question of when you have, say, a product table and you want to add astatus for each of the items in that product table, what is the best way to do it? So he has anexample here. Do you want to add a status ID column to the product table and reference it in anew status table? Secondly, do you want to add a status ID column to the product table and letthe app application define what a status is by mapping it to something or three? Do you just wantto add a text column that has the exact value of the status to that table and what's the bestoption? And with that he's considering, all right, what is the actual cost of a join? So he goesthrough a methodology of how he was testing how expensive are joins. So he basically takes atable and he joins it to other tables, increasing the row count and increasing the number oftables that are joined to. And he goes through all of his code that he used to generate thesetests. And then he gives graphs at the bottom of the post here with what he found out. Now I'mgoing to skip down to the lower graph just because I think it's the most relevant, because he didsome tests without indexes and then with indexes and of course, I would tend to go for the onethat's using indexes to see, all right, what is the cost of a join doing it that way? Now, in histesting methodology, he went up to a million rows. So that is a little bit low from my perception,particularly if you're looking to scale postgres because that's where you start seeing impact for alot of rows. And it would have been more interesting to me is, okay, what did up to a billion ortens of billions of rows, how would that be impacted? Secondly, he was only testing tables with, Ibelieve, a maximum of three columns, or they were two or three columns and they wereintegers. So the tables themselves, each row didn't have a lot of data in it. So it makes it a littlebit harder to compare real world applications that probably you may be joining tables that have50 or 100 columns in them and a wide variety of data types, not just fixed integers. But using hismethodology, he's basically seen going joining from one table up to 50, definitely seen anincrease in the amount of time it took, again based upon the row size, but the performance is stillgood. And he says, but still joining 50 tables with 1 million rows each happens in just twelvemilliseconds, which is great. But again, with larger tables and with a larger number of rows, whatwould that performance look like? Because in my experience there is a real cost to joining tomultiple tables with a lot of different rows and you definitely need to minimize joining to additionaltables. But another way I like to think about it when considering performance is what am I askingthe database server to do? So for example, if I have three tables I need to join to and each ofthem have a billion rows, well, that's a lot of data to churn through. But if I have indexes, maybe Ionly want ten rows, then a specific enough index those tables will allow me to pull that data outvery efficiently. So I'm not asking the database server to do that much. So when I think about thecost of the join, it's not necessarily I need to avoid joins, but what data do I need and what is themost efficient way to get it and how can I ask for it from the database? So it has to do less workand that will inevitably lead to great performance, but that just tends to be how I think about it.But again, I like the methodology used in this post and I like the graphing that he did here, but itwould be super interesting to see for really large databases and more real world table sizes,what would that performance look like? But definitely a blog post to check out.

[00:04:18] The next post is actually a YouTube video and it's very very short. It's more like a littlebit like a marketing video and it's called Less Bloat, fewer Rights and just plain smaller thePromise of Zheep. And this is from the Enterprise DB YouTube channel. So basically Zheep is away to redo the postgres heap that Enterprise DB is working on for PostgreSQL. And it's relatedto a post that was done back in January of 2018 called --do or Undo. There is no vacuum. So essentially what the Zheep is doing is trying to avoidhaving to vacuum unused rows. So this goes back to when you do an insert in postgres. Itsimply inserts a row when you want to do a delete, it actually marks that row for deletion and it'sactually vacuum that goes through and vacuums up those rows. When you want to do anupdate, it's basically like an insert and a delete. So whatever you're updating, the new row isinserted and then the old row is marked for deletion and then vacuum has to go in and vacuumit. Now this basically creates Bloat problems and requires vacuum to go in and clean up theseunused rows. But with Zheep they're basically wanting to handle MultiVersion concurrencycontrol a different way, where they say, quote, we handle an update by moving the old rowversion to an undo log and putting the new row version in the place previously occupied by theold one. So it's basically it would do an update in place but it would save the old version in anundo lock. Now, for those of you who have used Oracle, this is very similar to their rollbacksegments concept. So instead of putting an insert in and marking the old row to be removed atsome point via vacuum, you do an update in place, but you save to another logging area whatthe old row was. So if you need to roll back, you can consult that undo log to get that row back.Now the promise of this, if it works, is basically no vacuum, which it sounds really great to mebecause I know that right now there is a big cost for updates and generally if you wantperformance, focus on doing inserts only and try to avoid updates, particularly updating thesame row multiple times. There's a real performance cost to that. But maybe this could allevsome of that pressure. Now they're projecting looking at doing this by maybe around postgresversion twelve, maybe 13. They're not quite sure yet, but definitely something to keep track of.And definitely two pieces of content, both the YouTube video and this blog post to check out.

[00:07:02] The next post is tuning auto vacuum in PostgreSQL and auto vacuum internals. Andthis is from the blog. So basically it goes over what is Auto Vacuum, why is itneeded, and I kind of mentioned why it's needed. What are ways to set up logging for AutoVacuum? Ways to set settings to determine when Auto Vacuum runs both globally andpotentially on a per table basis? How can you identify the tables that need Auto Vacuum settingstuned? Number of processes, how do you control how often it runs and what impact it may haveon your database server? So, definitely a good review post on how to set up and configure AutoVacuum for PostgreSQL.

[00:07:46] The next post is actually another YouTube video and this is called PostgreSQLConfiguration for Humans by Alvaro Hernandez and this is on the high load YouTube channel.Now this is from a presentation back in 2017, but he goes over most of the configuration valuesfor PostgreSQL that you want to pay attention to. And I've also have links to his slides used inthe presentation here that I'll also share. So if you're wanting to learn a little bit more aboutpostgres configuration, definitely two pieces of content to check out.

[00:08:21] The next post is actually super short, it's basically PG. Bouncer 1.9.0 has beenreleased as of August 13. So basically you can check this out and check out the change log tosee if and when you want to upgrade to the next version.

[00:08:37] The last post is Performance Monitoring and Auditing PostgreSQL top resources asthis is from the Several blog and this is basically a summary post that goes over eachof the posts that they've done related to monitoring postgres performance. So they referencecontent that we've actually mentioned on this channel previously, like a performance cheatsheet, things to monitor, audit, logging, best practices, decoding PostgreSQL errors, best alert,notification, tools for PostgreSQL, et cetera. So definitely another blog post to check out.

[00:09:16] That does it. For this episode of Scaling Postgres, you can get links to all the contentmentioned in the show notes. Be sure to head over to, where you can signup to receive weekly notifications of each episode, or you could subscribe via YouTube or itunes.Thanks, Our. --