
HNSW Indexes, Vacuuming Bloat, Watch Me Now, Connections | Scaling Postgres 281


In this episode of Scaling Postgres, we discuss the benefits and disadvantages of HNSW indexes for working with vector data, configuring vacuum to reduce bloat and optimize performance, the new options available for \watch and all about connections.


Transcript

All right. I hope you, your friends, family, and coworkers continue to do well. Before we start the content for this week, I wanted to mention that we've started making some changes, not necessarily to the show quite yet, but to the website. We're now generating transcripts of each of the episodes. I think we've done them for the last 20 or 25 episodes so far, and we're going to keep working back all the way to episode one. It will probably take at least a week from a show's release before its transcript gets placed on the website, so there is a slight delay. But if you ever wanted to read the content as opposed to listening to or watching the show, you can do that now as well.

But the first piece of content is "pgvector v0.5.0: Faster semantic search with HNSW indexes". This is from supabase.com. We already talked about the release of pgvector 0.5.0 and HNSW indexes in a couple of previous episodes of Scaling Postgres, but there were two posts this week that had some additional good information to provide. This post from supabase.com goes into more depth about how HNSW indexes actually work. Again, this stands for Hierarchical Navigable Small Worlds, which is where the acronym comes from, and they give a representation of it. I'm not going to try to explain it because I'm still trying to wrap my head around it as well.

And actually, I found the second post a little bit easier to interpret, so I'll leave this part of the post for you to look at and interpret how it actually works. But I'll first focus on their performance areas. So again, they're starting with something that has 1,536 dimensions, which is generally what OpenAI produces. When analyzing 224,000 vectors, they compared an ivfflat index to the HNSW index. And with these tuning parameters, to get approximately the same accuracy, or even higher accuracy in this case, HNSW was three times faster than the ivfflat index. So that's huge in terms of a performance gain. And then they tried it with a million vectors.

And they actually saw a six-fold improvement in performance, which of course is even better, while maintaining that high accuracy rating. Now, this performance was so good it even beat some of the results produced by Qdrant, which is a dedicated vector database. So that seems pretty impressive to me. You can see the results produced by Qdrant and then pgvector, but you always have to be careful about comparing apples to oranges here. Still, these are impressive performance improvements. And then they said that as they scaled the database up, performance scaled as you would expect. Maybe it wasn't linear, but it was pretty good.

They talk a little bit about the build parameters that they set for building the index. Generally, higher values are better for higher dimensionality, but then it takes a very long time to build the index. The other benefit they mentioned compared to ivfflat is that with ivfflat you generally need to load all the data into the database first and then build the index, because it relies on the existing distribution of the data. You don't actually need to do that with HNSW indexes. They say, quote, "...you don't need to fully fill your table before building the index".
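To make that concrete, here's a minimal sketch of what that looks like (the documents table and column names here are hypothetical, and this assumes pgvector 0.5.0 or later is installed):

    CREATE EXTENSION IF NOT EXISTS vector;

    CREATE TABLE documents (
        id        bigserial PRIMARY KEY,
        embedding vector(1536)  -- e.g. the OpenAI embedding dimensionality mentioned above
    );

    -- Unlike ivfflat, an HNSW index can be created before the table is populated;
    -- it is maintained incrementally as rows are inserted.
    CREATE INDEX ON documents USING hnsw (embedding vector_cosine_ops);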

So you can just add the index and start adding data, like you would frequently do with B-tree indexes or other index types. They also mentioned the improvements to the ivfflat indexes, mostly with regard to build times and parallelization. But they do point out, comparing HNSW vs. ivfflat, that HNSW will always be slower to build and use more memory than ivfflat. So if your memory is constrained, or you don't have a lot of time to build the index, you're probably going to need to stick with ivfflat. And of course, they mentioned this extension is now available in their Supabase offering.

The next piece of content is "HNSW Indexes with Postgres and pgvector". This is from crunchydata.com. They were saying that ivfflat indexes are basically list-based, whereas HNSW is a graph-based algorithm, so it's an entirely new way of doing approximate nearest neighbor searches. And they say these indexes use approximation, so they're not precise. They're more performant than list-based indexes like ivfflat, but they require longer index build times, which we mentioned before, and require more memory to use, because ideally you want all of the index in memory to get the best performance.

Now, I think this post did a better job of explaining what an HNSW index is and how it works. Basically, imagine you have a 2D representation, with each of the points representing a vector, or a point in your data. The first level of the hierarchy is very sparse with the data. You pick a point and find, say, the 10 nearest points. Once you have those, you drop down to a more dense level and get the 10 closest related points there. Then you do it again for an even more dense hierarchy. So you're becoming more accurate as you go through the different hierarchy levels.

But I do encourage you to read this; they have links to some of the research that was discussed as well as the source code of how this actually works. So if you want to dig into the details, definitely use this as a reference. They talk about the advantages: basically, it's faster, usually with higher accuracy. The trade-offs are the build time and the need for a lot of memory to be able to use these indexes. Then they go into index tuning. There are two parameters you can change as you're building the index: "m" is the number of connections to the nearest neighbors, or vertices, and ef_construction is the candidate list size during the index build, they say here. The ef_construction value should be twice the "m" value, and the higher the values, the slower the index build.
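As a rough illustration of those build parameters (reusing the hypothetical documents table from the earlier sketch; pgvector's defaults are m = 16 and ef_construction = 64):

    -- Higher values improve recall at the cost of build time and memory;
    -- the guideline above is ef_construction at roughly twice the value of m.
    CREATE INDEX ON documents USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 64);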

But again, if you're going to be using this, I definitely suggest you read this post more in-depth to get a better understanding of it. There's also a query tuning parameter, ef_search, that you can adjust at query time, not just at build time; it lets you adjust the number of nearest neighbor candidates maintained during the search, as sketched below. And then lastly, they go into a code sample using the Postgres AI tutorial they have on their site, where they go through the process of running a query, seeing a sequential scan, and then applying an index to get the results and see the index scan. So if you're using AI and you want to start using pgvector and these new indexes, definitely check out these two blog posts.
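For reference, here's a sketch of that query-time knob (hnsw.ef_search is pgvector's parameter name, with a default of 40; the query reuses the hypothetical documents table from above):

    -- Higher ef_search keeps more candidates during the graph search,
    -- trading query speed for better recall.
    SET hnsw.ef_search = 100;

    SELECT id
    FROM documents
    ORDER BY embedding <=> (SELECT embedding FROM documents LIMIT 1)  -- <=> is cosine distance
    LIMIT 10;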

Next piece of content- "Optimizing Postgres's Autovacuum for High-Churn Tables". This is from tembo.io. They're talking about the need for vacuum: when you do an update, it's more like doing an insert plus marking the previous version of the row for deletion, and a delete operation doesn't immediately remove the row either; it just marks it for deletion, and it's vacuum's job to go in later and actually remove those rows. As a consequence, you can end up with table bloat from all these unused rows. So if you're doing a lot of updates, you're going to have a lot of dead rows in the table. And they show an example of that here, where they create a table and insert a million rows.

I think they turned off autovacuum. They did a query and it ran in 191 milliseconds. They then updated the whole table, I think, five times, ran the query again, and it took about a hundred milliseconds longer to execute because it had to go through all those extra dead rows, about 5 million of them. But then, once you do a VACUUM FULL, all those dead tuples are gone, as they show here, and your performance gets back to normal, back to the 190 or so milliseconds. Now, other posts have of course covered this, but this is where it kind of gets interesting, because they have an extension they developed called PGMQ.
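Here's a rough reconstruction of that kind of demonstration (the table name and row counts are illustrative, not the post's exact code):

    CREATE TABLE churn (id int, val int) WITH (autovacuum_enabled = off);
    INSERT INTO churn SELECT g, g FROM generate_series(1, 1000000) g;

    UPDATE churn SET val = val + 1;  -- each full-table update leaves another million dead rows
    UPDATE churn SET val = val + 1;

    -- Watch the dead tuples accumulate:
    SELECT n_live_tup, n_dead_tup FROM pg_stat_user_tables WHERE relname = 'churn';

    -- VACUUM FULL rewrites the table and reclaims the space
    -- (note: it takes an exclusive lock while it runs):
    VACUUM FULL churn;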

Typically, how this queuing system works is you insert data into it, then maybe you do an update or two to change the status, and then you delete that entry from the table. So that's typical queue usage: an INSERT, an UPDATE once or twice, then a DELETE. As a result, you're going to have a lot of dead rows in that table relative to the number of live rows. In a queue, you typically want to keep the live row count near zero, but if it's highly active, you're going to have tons of dead rows in that table. So they use this as an example, and they set the system configuration back to the defaults in terms of autovacuum settings.
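In plain SQL terms (just a sketch of the lifecycle, not PGMQ's actual API), each queue entry goes through something like this:

    CREATE TABLE job_queue (id bigserial PRIMARY KEY, payload jsonb, status text);

    INSERT INTO job_queue (payload, status) VALUES ('{"task": "send_email"}', 'pending');
    UPDATE job_queue SET status = 'processing' WHERE id = 1;  -- leaves one dead row
    UPDATE job_queue SET status = 'done' WHERE id = 1;        -- leaves another
    DELETE FROM job_queue WHERE id = 1;                       -- and a third
    -- Net result: zero live rows but three dead ones, until vacuum cleans them up.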

Then they did a pgbench run, and you can see the result in this graph. We'll just start here: autovacuum runs, so now you have max TPS and the lowest latency. The purple is latency and the pink or red is transactions per second. Over time, the transactions per second drop and the latency increases; then autovacuum runs again, TPS gets boosted back up to its max, and latency drops back down to its minimum. And then it continues to get worse until autovacuum runs again. So this is the cycle you'll run into. But then they said, all right, let's change different configuration parameters to see if we can minimize this. The first thing they did was drop the vacuum cost delay down to 10 milliseconds, so they halved it, and then increased the cost limit to basically its max.

The cost limit is how much work can be done while doing a vacuum, and the cost delay is, once that limit is hit, how long to pause. So basically, they allowed a lot more work to take place during a given cycle, and then only delayed 10 milliseconds before vacuum gets to do work up to that limit again. With that, they did get a little bit of a TPS increase and a little bit of a latency reduction, but nothing too significant. The graph still looks relatively similar, though it does have some higher TPS counts and slightly lower latency.
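In configuration terms, that kind of change looks something like this (the values reflect the post as described above; 10000 is the parameter's maximum):

    ALTER SYSTEM SET autovacuum_vacuum_cost_delay = '10ms';  -- how long to pause once the limit is hit
    ALTER SYSTEM SET autovacuum_vacuum_cost_limit = 10000;   -- how much work per cycle before pausing
    SELECT pg_reload_conf();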

Next, they wanted to see: should we run autovacuum more frequently? So they dropped the vacuum scale factor to see what difference that makes, and again, not too much of a difference. Then they said, okay, let's cut the naptime in half. This is how long autovacuum waits before checking whether it should vacuum the table again. And with this, you can see the vacuums are happening twice as often. So the table definitely needed to be vacuumed, but the naptime was causing it to wait to do that vacuum. You definitely get more vacuum activity by doing that. The next thing they adjusted was the cost delay: instead of 10 milliseconds, they dropped it to zero. Not too much of a difference with that. And then they dropped the naptime down to 10 seconds, as opposed to 30 seconds, and now you can see the vacuum is running quite frequently and staying busy.
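For reference, the settings that control vacuum frequency look like this (the values here are illustrative; the defaults are a 0.2 scale factor and a one-minute naptime):

    ALTER SYSTEM SET autovacuum_vacuum_scale_factor = 0.05;  -- trigger after 5% of rows change, not 20%
    ALTER SYSTEM SET autovacuum_naptime = '10s';             -- check for vacuum work every 10 seconds
    SELECT pg_reload_conf();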

But you see less of a TPS impact than in the previous graphs, as well as less of a latency impact. The only other thing that would have been interesting to find out is how the CPU usage changed. So yes, we're configuring autovacuum to run more aggressively, but what impact did that have on the overall system? Did CPU utilization increase as a result of vacuum running more frequently? Still, I found this really interesting, because you could see how changing individual configuration settings adjusted the latency and the TPS for the workload they were looking at. And if you want to learn more about it, definitely check out this blog post.

Next piece of content- "PostgreSQL 17: Allow '\watch' to stop based on the number of rows returned". This is from dbi-services.com. \watch allows you to run a command repeatedly after a delay. So for example, he runs a pg_stat_activity query here, then types '\watch', and it runs the command, by default, every two seconds, so he gets whatever the results are of the command being run again. And I should mention this is within psql, so this is using the psql client to do this. Now, historically, you could add one more parameter to \watch, and that was to change the amount of delay. By default, it's two seconds.

But you could add a different number here and it would delay that many seconds. So again, this is great for doing DBA-related work. Sometimes I just set a query to run every 10 or 20 seconds to check the status of some job, maybe an index build or something like that. But as of 16, which might be released in the next week or so, in addition to the interval, you can also specify a count. So if you only want it to run three times, as you can see here, it's going to run that query three times and then stop.

Or you could specify any count you want. But in 17, which we're probably a year away from, they added another parameter called min_rows. What it does is keep repeating the query until the number of rows returned drops below the value that you set. This particular query is returning three rows, so he says \watch min_rows=3. It'll keep running as long as there are three or more rows, but once it drops to fewer than three rows, the watch stops. So these are some pretty cool features coming in 16, as well as in 17, which, unfortunately, is about a year away.
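Put together, the psql usage looks something like this (a sketch; the query is just an example):

    SELECT pid, state FROM pg_stat_activity WHERE state = 'active';

    \watch interval=5 count=3  -- 16: rerun every 5 seconds, stop after 3 runs
    \watch min_rows=3          -- 17: rerun until fewer than 3 rows come back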

The next piece of content is also related to this particular feature: depesz.com mentioned it as well in his blog post "Waiting for PostgreSQL 17- Allow \watch queries to stop on minimum rows returned".

The next post is also from depesz.com: "Waiting for PostgreSQL 17- Generate new LOG for 'trust' connections under log_connections". So basically, if you're logging connections, it apparently didn't record anything before when someone connected via trust. But now it will write a log entry if someone connects using the trust method. So that's great from a security perspective.
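If you want to see the new log line, the setup is something like this (the pg_hba.conf rule shown in the comment is illustrative):

    -- With a pg_hba.conf rule like "local all all trust" in place,
    -- PostgreSQL 17 now emits a LOG entry for connections accepted under trust:
    ALTER SYSTEM SET log_connections = on;
    SELECT pg_reload_conf();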

The next post also from depesz.com is "Waiting for PostgreSQL 17- add to_bin() and to_oct()". So these functions allow you to convert numbers into their binary or octal representations. So if you want to learn more about these features, definitely check these blog posts out.
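For example, here's what the new functions return (42 is 101010 in binary and 52 in octal):

    SELECT to_bin(42) AS binary, to_oct(42) AS octal;
    --  binary | octal
    -- --------+-------
    --  101010 | 52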

Next piece of content: there was another episode of Postgres FM last week. This one was on "Connections". They talk about the different ways you can connect in terms of options, security, and performance trade-offs, primarily with regard to connecting over TCP/IP versus using Unix sockets. They also talked about some difficulties users have actually connecting to Postgres when they first install it. But if you want to listen to the show, you can click here or check out their YouTube channel here.

Next piece of content- "Now it's PostgreSQL's turn to have a bogus CVE". This is from opensourcewatch.beehiiv.com. The subtext for this is that "PostgreSQL and cURL aren't the only ones. Someone is faking security alerts for numerous open-source projects".

Now, this next post I'm going to show is referencing this CVE, where Postgres says "...it's not a security vulnerability". Apparently, someone posted a vulnerability for Postgres 12.2 where you could create a denial of service by repeatedly reloading the database. But by virtue of the fact that you have to be a superuser to have the privileges to do that, they don't really consider it a security vulnerability. And that's exactly what this post is talking about: someone has apparently been posting bogus security vulnerabilities. This is something I haven't really heard of before, but if you want to learn more about it, definitely check out this blog post.

There's also a companion post from commandprompt.com, where they sent out a courtesy notification about this purported security vulnerability as well.

Next piece of content- "POSTGRESQL 16 IMPROVES INFINITY: PGSQL PHRIDAY #12". This is from rustprooflabs.com. And the feature he's most looking forward to in Postgres 16 is +infinity. So he's been able to use infinity with numerics, but now it can work with timestamps. So he's super happy about that. But check this out if you want to learn more.

And the last piece of content is "Exploring Postgres Performance: A Deep Dive into pg_stat_statements". This is from stormatics.tech. This is pretty much my favorite Postgres extension because it allows me to highly optimize my database. They go through all the different parameters you can set it up with, as well as all the different columns and what they mean in the pg_stat_statements system view it provides. They also give you a few queries so you can find long-running queries, IO-intensive queries, and even temporary-IO-intensive queries. So definitely check out this blog post if you want to learn more.
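As a taste of the kind of queries they show, something like the following finds your most expensive statements (the column names here are the PostgreSQL 13+ ones, e.g. total_exec_time):

    -- Top statements by total execution time:
    SELECT query, calls, total_exec_time, mean_exec_time
    FROM pg_stat_statements
    ORDER BY total_exec_time DESC
    LIMIT 10;

    -- Statements writing the most temporary blocks (temp-IO-intensive):
    SELECT query, temp_blks_read, temp_blks_written
    FROM pg_stat_statements
    ORDER BY temp_blks_written DESC
    LIMIT 10;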
