
5 Times Faster Aggregates | Scaling Postgres 394


In this episode of Scaling Postgres, we discuss enhancements bringing more than five times faster aggregates and 64% faster counts, better performance through sampling, and my course sale ending.

Transcript

I'm taking a break on Thanksgiving to give you an episode of Scaling Postgres. If you celebrated Thanksgiving last week, I hope you had a good one. And before we get into the content this week, I wanted to let you know that my course, “Ludicrous Speed Postgres”, is on sale until the end of December 1st, so essentially the end of Cyber Monday. You can check it out under courses at scalingpostgres.com. This is my performance optimization course for Postgres, and I recorded it showcasing a two terabyte database that actually builds up to about four terabytes once all the indexes and everything are added. So if you are interested in it, you can get it now for a significant discount, but the price will of course be increasing after December 1st. But I hope you, your friends, family and co-workers continue to do well.

Our first piece of content is “Super fast aggregations in PostgreSQL 19”. This is from cybertec-postgresql.com. So we've just had Postgres 18 released not too long ago, and work has already begun on 19. Unfortunately this won't be here for about another year, but it is nonetheless interesting. The blog post starts off by saying that usually Postgres does its joins first and then its aggregates, but you could also do the reverse: aggregate first and then join. As of right now, though, it joins first and then aggregates. So if we look at it here, you have a person table, imagine it has millions of rows in it, and then you have a gender table that basically just has a few rows in it. When you join first, Postgres essentially reads every row in the person table, and for each of those entries it looks up the gender name, because that's what the query asks for, and then it displays the results. Now, there's nothing wrong with this approach, but there's a faster way to do it. So he looks at aggregating first and then joining later, because this is a patch that was added to 19. In this example he has a product table, and those products can have colors or categories; there are just four categories and just four colors. Basically, if you want counts by category name and color, you would join all the tables together and then group by the category name and the color. Now, if you look at what Postgres 18 would do, again, it joins first and aggregates later. You can see it's doing a lot of sequential scans on the color and the category tables, and only then does it do the aggregation; you can see the partial hash aggregate here. Whereas in the new example the hash aggregate happens much earlier, inside of the two loops that check the color table and the category table. So it aggregates first and then it joins to find the names in the color and category tables. And as a result the performance is five times faster: it went from 95 milliseconds down to about 17 milliseconds. So this is great. This type of query suddenly becomes, in this example, five times faster, and the more data you have, the bigger the speed-up becomes. And you get this without adding indexes or making any changes; presumably, once you upgrade to 19 you get these performance benefits. He does say that there are some limitations if you're using CUBE, which I hardly ever see used. But still, a five-fold improvement for just upgrading the database would be great for these classic queries. So check this out if you're interested.
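
Just to make the shape of the query concrete, here's a rough sketch of the kind of join-plus-aggregate the post is talking about. The table and column names here are my own placeholders, not taken from the blog post:

  -- A large product table referencing small lookup tables for category and color.
  SELECT cat.name AS category, col.name AS color, count(*)
  FROM product p
  JOIN category cat ON cat.id = p.category_id
  JOIN color col    ON col.id = p.color_id
  GROUP BY cat.name, col.name;

  -- On Postgres 18 the joins run first and the aggregate runs last. The Postgres 19
  -- patch can instead aggregate over product's category_id and color_id first and
  -- only then join to look up the names, which is what produces the roughly
  -- five-fold speedup described in the post.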

Next piece of content “Settling COUNT(*) vs COUNT(1) debate in Postgres 19”. This is another example of a performance improvement to aggregates in Postgres 19, and this is from thatguyfromdelhi.com. He doesn't hide the headline: you can get a 64% improvement in some counts, as long as you're counting against a not null column. Now, he says that historically count star (the asterisk) has been faster than count one or counting a particular column, because with count star the planner knows it just needs to count rows; it doesn't need to check whether any specific column is null. If you count a particular column, it has to check whether that value is null before counting it. But even count one was slower, because it was treated like counting a particular column. What the planner can now do is convert a count of a not null column, or a count of a constant like one, into a count star, so it just counts the rows and gives you a performance boost. So if we look at it, before the patch, counting a particular column over 10 million rows takes about 195 milliseconds, count one does it in 126 milliseconds, and count star does it in 119 milliseconds. After the patch in Postgres 19, counting the column goes to 117 milliseconds, count one does 116 milliseconds, and count star does 114 milliseconds. So basically there is a 64% improvement from counting a column before the patch compared to after the patch. And again, this happens as long as the column you're counting is not null, because then Postgres knows it doesn't need to check for nulls. So this is a pretty great win, but unfortunately we have to wait till Postgres 19 for it to come out. But check out these two posts if you're interested.
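
To make the three variants concrete, here's a quick sketch of the forms being compared; the table and column names are placeholders of my own, not from the post:

  -- All three return the same number when some_col is declared NOT NULL.
  SELECT count(*)        FROM big_table;  -- just count rows; historically the fastest form
  SELECT count(1)        FROM big_table;  -- constant argument; previously treated like count(column)
  SELECT count(some_col) FROM big_table;  -- must skip NULLs unless the column is NOT NULL

  -- The Postgres 19 change lets the planner treat count(1) and count(not_null_column)
  -- the same as count(*), so they all just count rows.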

Next piece of content “PostGIS Performance: Data Sampling”. This is from crunchydata.com, and the basic premise of this post is that if you need to take an average over a lot of values, you can take a statistical sample and get pretty close to the true average, as long as you're okay with an estimate. So if you wanted the average of a particular column, you would normally just run something like select the average of that column, and it would give you the average over all the values. Now, he set up a table of 10 million values to do some experimentation, and he took a sample to compute the average, and it was very close to the average for the whole population. The way you take this sample is you run the same query as before, select the average of a column from the table, but you add TABLESAMPLE SYSTEM with a value of one in this case. And even though the blog post says 10,000 rows, I think it may actually be closer to 100,000 rows, because I believe the value you put in for the table sample is a percentage. So I think he's looking at about 1% of the rows, which should be around 100,000 values; you put in an integer for whatever percentage of the table you want to sample. And reading fewer rows is of course faster than reading the whole table. But then he also applies this to geospatial data and asks, does it work here as well? And he found that it basically did. Taking a count of everyone in the census table with a certain geometry took 7.2 seconds; a 10% sample took 2.2 seconds to run, and the estimate was pretty accurate compared to looking at everything. Then he took a 1% sample, and again it was probably within 1% accuracy, but it only ran in 0.2 seconds. So this is a way to get much higher performance as long as you're okay with estimates. He's talking about geospatial information here, but you could apply this to other use cases in your database. So perhaps sampling could give you a performance boost if you're okay with estimates. But check this out if you want to learn more.
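
For reference, the sampled query looks roughly like this; the table and column names are my own, and SYSTEM takes a percentage of the table's pages to sample:

  -- Exact average over all rows:
  SELECT avg(value) FROM measurements;

  -- Approximate average over roughly 1% of the table's pages:
  SELECT avg(value) FROM measurements TABLESAMPLE SYSTEM (1);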

Next piece of content “What do the new Index Searches lines in EXPLAIN mean?”, this is from pgmustard.com. When you run EXPLAIN in Postgres 18, you will now see a number of index searches. When it says index searches is one, it means you've done a single descent of an index, so you've gone down to a particular leaf node of that index. So if you're looking for a single entry by primary key, it should show one index search. And he shows some examples here, with information from the Postgres documentation, showing the different kinds of index searches. So for example, when you're looking at a particular column for four distinct values, in Postgres 18 you can see it descends the index four times to find each of those distinct values. Now I'm going to jump over to the docs real quick here, because this example actually helped me: there's an enhancement that was added to Postgres a version or two ago where contiguous values can be read with a single descent of the index. So if you look at this example here, where it's looking in that column for four distinct values, but they are right next to each other, you can see Postgres can just do a single descent of that index and then read across the leaf nodes to get what it needs for the query. So the reason the earlier example did four descents is that those values are spread out across the index, but if they're contiguous, right next to one another, you can get by with a single descent of the index. And then Postgres 18 offers the skip scan. So they have another example here where you're looking for three distinct values in one column and then another value in a second column, and this requires three descents of the index. For the first column you have to find the integer 1 and then the value in the unique column, then the number 2 and the value in the unique column, and then the number 3 and the value in the unique column. My apologies if you're listening to this on the podcast, because this may be confusing since they name their columns with numbers, but it's pretty clear in the Postgres documentation and if you're watching the video. There's no guarantee that these values are contiguous, so it has to do three descents of the index. If we go back to the blog post now, this is the example he was talking about. And this brings up the question: should you be optimizing for fewer index descents? I think generally that's a good idea if you can, but you should prioritize buffers and timing and things of that nature first. Because what he's basically saying is that optimizing purely to minimize index descents probably leads to highly specific indexes, and then maybe you lose more general indexes that could help with the broader queries that happen. Because when you have that many indexes, you may run into, as he says here, write amplification, you lose HOT updates if you're indexing more columns than you need to, and of course you get increased competition for space in shared buffers. But he shows another example here where you have a table with a bigint column full of random numbers and then a Boolean field that of course just has two values in it. And he creates a two-column index: the first column is the Boolean field, the second column is the integer field. Now that means it can take advantage of skip scan, because he's querying for a value in that second integer column without filtering on the Boolean field.
Now, on Postgres 17 you don't have skip scan, so you can see it's reading 168 buffers, which is a lot of data to read for essentially just a single value in the table. Whereas in Postgres 18 the number of buffers read is only nine, so clearly the skip scan works. But you'll notice it shows four index searches, and he asks, why are there four index searches? Shouldn't there be one for the Boolean true and one for the Boolean false? The reason is that boundary conditions and nulls always need to be considered, so you can get one or two extra index searches when these can't be ruled out. So then he modified his query to say the Boolean needs to be in (true, false), and he got the two index searches he expected. But even with two index searches, there's actually a more optimal way to do this: create the index with the integer field first and the Boolean field last. Now you only get one index search, because it just searches for that unique value in the column. And he explains it visually this way: in the multi-column index where the Boolean is first, you have the 5432 value over here under false, and then you have to skip over another part of the index to find where it's true, because the Boolean is first. But in the case where the integer is first, the matching entries are right next to one another, so it can pull back both rows with a single descent. So he's basically saying that if you see more than one index search, you may want to check whether there's a better way to optimize things, if that makes sense for you to do. He also talks a little bit about what pgMustard is doing. So if you want to learn more, definitely check out this blog post.
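
Here's a rough sketch of the two index layouts being compared, using my own table and column names rather than the ones in the post:

  -- Low-cardinality Boolean first: a query filtering only on the integer column
  -- can use Postgres 18's skip scan, but may report several index searches.
  CREATE INDEX items_flag_number_idx ON items (flag, big_number);
  SELECT * FROM items WHERE big_number = 5432;

  -- Integer first: the matching entries sit next to each other in the index,
  -- so the same query needs only a single index search.
  CREATE INDEX items_number_flag_idx ON items (big_number, flag);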

And the next piece of content: there was another episode of Postgres FM last week. This one is on “What's new in EXPLAIN”, so Nick and Michael go over the EXPLAIN changes in 18: a lot of what was mentioned in the previous blog post, the fact that buffers are now enabled by default in Postgres 18, and of course some information about skip scan. So if you're interested you can listen to the episode up here or watch the YouTube video down here.

Next piece of content “We Taught AI to Write Real Postgres Code (And Open Sourced It)”, this is from tigerdata.com. They're saying that when you use an LLM to generate an SQL schema, a lot of times your results are going to be lackluster. For example, it may pick random character length limits for the columns you ask it for, it may default to using serial columns as opposed to identity columns, and it may mix timestamps with and without time zones. We've even seen some issues in previous episodes of Scaling Postgres that had long-term ramifications, particularly around how foreign keys were set up. Well, they've put together an MCP server and also a Claude Code plugin that gives a set of curated instructions to the LLM for generating schemas. In this particular case they follow the practice of using text columns, setting them to not null, and using identity columns for the primary keys. But the thing about it is that this is definitely curated a particular way. For example, my application is written in Ruby on Rails, so I adhere to the standards Rails has: it still uses serial columns, it uses timestamps without time zones, amongst other things. Basically, that is the standard I'm following to make things easier, so if new developers come in to work on the code, they have the expectation that this is a Rails project following the standard conventions. So if these conventions match your standards, that could be good, although I wonder how easy it is to alter said conventions. But nonetheless, this seems pretty interesting, so definitely check it out if you want; here is the GitHub link to get started.
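
Just to illustrate the kind of conventions in play, here's a hypothetical side-by-side of my own; it's not taken from their plugin:

  -- Curated style: identity primary key, text columns, NOT NULL, timestamptz.
  CREATE TABLE users (
      id         bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
      email      text NOT NULL,
      created_at timestamptz NOT NULL DEFAULT now()
  );

  -- Rails-convention equivalent: serial-style key and timestamp without time zone.
  CREATE TABLE users_rails (
      id         bigserial PRIMARY KEY,
      email      character varying,
      created_at timestamp without time zone
  );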

Next piece of content “AI-Powered Postgres index suggestions”, this is from planetscale.com, and this is a new feature in PlanetScale. I haven't seen it introduced by others yet, but they're actually using LLMs to propose index suggestions, which is a pretty good use case for them. So check this out if you're interested.

And the last piece of content: apparently someone has written the book “Just Use Postgres!”. So if you're interested in learning how to just use Postgres for your project, definitely check it out.
 
