Postgres LLM OS & 30 Times Faster Index Builds | Scaling Postgres 301
Join Over 1,000 Engineers & Get New Episodes Weekly!
In this episode of Scaling Postgres, we discuss an LLM OS dedicated to Postgres knowledge, 30 times faster index builds for HNSW indexes, a PgBouncer release and an interview with Michael Christofides.
- Postgres Performance Demystified Webinar
- Postgres.AI Bot. Towards LLM OS for Postgres
- pgvector 0.6.0: 30x faster with parallel index builds
- PgBouncer 1.22.0
- Explaining Postgres Explain With Michael Christofides | Rubber Duck Dev Show 113
- Extension Ecosystem: Jobs and Tools
- PGXN Challenges
- PGSQL Phriday #015: UUID: let's fight!
- PGSQL Phriday #015
- Who Contributed to PostgreSQL Development in 2023?
- What's 'Rows Removed By Filter' in PostgreSQL Query Plans all about?
- GROUP BY reordering
- PostgreSQL 17: Track skipped rows from COPY in pg_stat_progress_copy
- How we built our customer data warehouse all on Postgres
- Finding the root cause of locking problems in Postgres
- Load StackOverflow's StackExchange data in Postgresql
- JSON and SVG from PostGIS into Google Sheets
- A Comprehensive Overview of PostgreSQL Query Processing Stages
- ‘generated always as identity’ columns do not have default values (or do they?)
- GSoC 2024 calls for aid! And Postgres will answer!
- “But how much does a PostgreSQL license actually cost?” A Frequently Asked Question from Oracle users considering PostgreSQL
- Migrating a Terabyte-Scale PostgreSQL Database to Timescale With (Almost) Zero Downtime
Before we get started with the show this week, I wanted to let you know that I am planning a webinar. It's associated with the course, but it's a free webinar in which I'll cover, basically, the framework I use in my consulting work to identify performance opportunities. These webinars will be held over the next few weeks in February 2024. I welcome everyone to attend, and this will be the top link on the show page for this week's episode. Thanks.
Now, on to the show. I struggled this week to pick which post should actually be the top post; I kept flipping back and forth because both of them were very good. One is specifically focused on performance, the other just has a lot of implications for the future. So really there are two best posts this week. But I hope you, your friends, family, and coworkers continue to do well. Our first piece of content is "Postgres.AI Bot. Towards LLM OS for Postgres". This is from postgres.ai. If you've used ChatGPT to ask questions about Postgres, this is the same thing, but on steroids, and then it adds something even more to it. So let's take a look. First, they're building their own bot using GPT-4 Turbo, and they're explicitly feeding it specific sources. Number one, the Postgres documentation. Two, the source code, so it can read the comments. I don't know how well these models can actually look at source code and interpret knowledge from it, but maybe eventually.
Then also specific articles and blog posts he calls credible or from verified authors. I should say the person writing this is Nikolay, who also does the Postgres FM podcast. So basically this model focuses on high-quality sources, and he even lists some of the articles and blog posts he's using down below. The other thing they're doing is verifying what has been stated through DB experiments. This is something Postgres AI has been doing for a while with database cloning, quickly copying databases to run different experiments. So basically that's the other service beyond just having the chatbot available. Then he shows the architecture of what they set up. You can see the data sources here; it looks like they're also pulling in the mailing lists as well as the results of DB experiments that have been done, plus the other pieces of the stack.
He shows an example of talking with the chatbot and some of the results that were received. He also makes a point of mentioning that he lists the source material for everything, unlike ChatGPT today, where you have no idea where the knowledge comes from. He felt it was important to cite the source for each response, and he goes into the Retrieval Augmented Generation, or RAG, which is basically how most of the data is processed and stored. He's using Postgres, pgvector, PostgREST, a custom crawler and scraper, as well as a vectorizer using the OpenAI API. So basically he intends to create an LLM OS, a large language model operating system, dedicated to answering Postgres questions as accurately as possible. He says a few people have tried it and found better responses than what ChatGPT would offer.
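For a sense of what the retrieval side of a RAG pipeline on Postgres can look like, here is a minimal sketch using pgvector. The table, column names, and embedding dimension are hypothetical, not taken from the post; the query embedding would be computed externally (e.g. via the OpenAI API) and passed in:

```sql
-- Hypothetical schema for storing embedded documentation chunks
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE doc_chunks (
    id        bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    source    text,           -- e.g. 'postgres-docs', 'blog:verified-author'
    content   text,
    embedding vector(1536)    -- dimension depends on the embedding model
);

-- Retrieve the chunks most similar to an externally computed query embedding
SELECT source, content
FROM doc_chunks
ORDER BY embedding <=> '[0.01, -0.02, 0.03]'::vector  -- <=> is cosine distance
LIMIT 5;
```

The retrieved chunks would then be stuffed into the model's prompt as context, which is what lets the bot cite its sources.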
But again, the other part he's offering is these database experiments, and I know personally that's what I love to do as well. Once I have an idea of something, I like to empirically test it to find the best path. So I found this post super interesting. This is a paid service, and normally I don't make those the first post, but I found all of this content super interesting. It's also the case that you could build your own chatbot with your own data and implement something similar, because he lists all the sources being used; at this point, the knowledge for the model comes from publicly available sources, although as they do more experiments, that would probably produce more proprietary data they could draw on for their responses. But I thought this was super interesting and welcome you to check it out.
Next piece of content- "pgvector 0.6.0: 30x faster with parallel index builds". This is from supabase.com and, as you can tell, 0.6.0 was released. The main feature of this release, at least according to this blog post, is parallel builds for HNSW indexes. As a reminder, these are the indexes that have super high recall, which to me means high accuracy, a higher probability that you're going to get the actual values you're looking for, and they're much more easily updated than ivfflat indexes. The problem was that they took a very long time to build compared to ivfflat, but now that you can build them in parallel, they're much faster. In the first example here, they show that on the same machine, using the parameters m=16 and ef_construction=200, the build was nine times faster on version 0.6.0 compared to version 0.5.1. Then down here, they show a few different parameter combinations where you can see 0.6.0 being much faster.
It's also much more consistent compared to 0.5.1 pgvector with the non-parallel index builds. Now the charts above were from a 16-core machine. When they went to 64 cores, the speedup went from 9x to 13.5x. So the scaling is definitely not linear, but it still improves things further. But then you may be thinking, where is this 30x you're talking about? Well, it's for unlogged tables. When they tested unlogged tables against the 0.5.1 build time, although I don't know if that baseline was unlogged, that's when they got the 30x improvement for one value. So not everything is 30x faster, but in this blog post, nothing was less than nine times faster. So still a significant improvement with these parallel index builds. I don't really know why people would still be using ivfflat after this. If you happen to know, let me know in the comments section.
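As a sketch of what such a parallel HNSW build looks like in practice, using the m and ef_construction values from the benchmark (the table and column names here are assumptions, and the worker/memory settings are illustrative, not tuned recommendations):

```sql
-- Allow parallel workers during index maintenance (session-level settings)
SET max_parallel_maintenance_workers = 7;  -- workers in addition to the leader
SET maintenance_work_mem = '8GB';          -- HNSW builds are memory hungry

-- Build the HNSW index with the parameters used in the post's benchmark
CREATE INDEX ON items
USING hnsw (embedding vector_l2_ops)
WITH (m = 16, ef_construction = 200);
```

Parallelism is driven by the standard maintenance settings, so no special pgvector syntax is needed beyond upgrading to 0.6.0.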
Next piece of content- "PgBouncer 1.22.0 is released". This is from pgbouncer.org and the main feature that was added is the ability for PgBouncer to respond to DISCARD ALL and DEALLOCATE ALL commands. Apparently some clients using prepared statement support in transaction pooling mode needed to be able to send these commands. What DISCARD ALL does, as the Postgres documentation says, is discard session state, and if you're in transaction mode, it can be important to reset the session state when you're using prepared statements. The other one, DEALLOCATE, de-allocates a prepared statement, and DEALLOCATE ALL de-allocates all of them. So if you're using PgBouncer and you want to use it with prepared statements, you would definitely want to upgrade to the most recent version.
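To make the commands concrete, here is what the statement lifecycle looks like in plain SQL (the statement name and query are of course just an example):

```sql
-- Create and use a named prepared statement
PREPARE get_user (int) AS SELECT * FROM users WHERE id = $1;
EXECUTE get_user(42);

DEALLOCATE get_user;  -- drop a single prepared statement by name
DEALLOCATE ALL;       -- drop every prepared statement in the session
DISCARD ALL;          -- reset the entire session state (statements, temp tables, GUCs, ...)
```

These are the commands PgBouncer 1.22.0 can now handle itself instead of erroring out in transaction pooling mode.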
Next piece of content, I finally started doing my interviews about Postgres, and as I had mentioned on the show before, I'm doing it on The Rubber Duck Dev Show and then sharing the content here. So my first interview was with Michael Christofides of pgMustard and of Postgres FM, and we talked about Postgres Explain. This is definitely a longer format, about an hour in length, so you can either watch the YouTube video recorded here, or listen to the podcast down below.
Next piece of content is actually a gist on GitHub where they're talking about the Postgres extension ecosystem. Basically, this is a jobs and tools post highlighting what they're trying to build this extension ecosystem to be. I believe this is David Wheeler, who's working at Tembo now and who started the PostgreSQL Extension Network (PGXN), and this is basically what they're looking to build. So here are the numerous jobs, one being authoritative: basically being the official canonical source of record for all Postgres extensions. They use examples like the Go Module Index for Go packages, the crates.io index for Rust, and the PAUSE index for Perl.
I wonder what that acronym means. There are also the examples of RubyGems for Ruby, which I'm very familiar with, and PyPI for Python packages. So basically they want to make an extension network matching what those ecosystems are doing. This whole post lists about 20 different jobs. A few of the others: they want to make it beneficial for people to publish their extensions there, make it integrated so it's easier for developers to start writing and publishing extensions, make it easy to find new extensions, and it goes on and on. It's basically a requirements document for what they want to build. So you can check this out if you're interested. I'm definitely going to be keeping my eye on this.
Another post related to it is David's "PGXN Challenges". This is from justatheory.com and he's talking about the challenge of converting PGXN into this new extension ecosystem. He says there are a number of challenges with it as it exists today. In addition, he talks about all the different languages in use: I think the site itself is written in Perl, but then they have a client written in Python and developer tools in Ruby, so it's hard for developers to come in and work on it, because you have to be a polyglot to touch the different pieces. But if you want more of his insight, you can definitely check out this blog post.
Next piece of content- "PGSQL Phriday #015: UUID: let's fight!". This is from mydbanotebook.org and she's thrown down the challenge of talking about UUIDs for the next PGSQL Phriday. The posts need to be in by the Friday before this episode goes out, so we'll just have to see what comes out next week. But she wants to know why you use UUIDs, what problems you've encountered, and what kind of UUID you use. Do you use them with sharding? Do you sort them? Do you use extensions related to them? Etc. I will say I haven't really used a lot of UUIDs.
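For anyone who hasn't used them, the simplest way to get UUID keys in modern Postgres needs no extension at all; this is a minimal sketch (the table is hypothetical):

```sql
-- gen_random_uuid() is built in since Postgres 13 and returns a random
-- (version 4) UUID; older versions needed the pgcrypto or uuid-ossp extension
CREATE TABLE orders (
    id   uuid DEFAULT gen_random_uuid() PRIMARY KEY,
    note text
);

INSERT INTO orders (note) VALUES ('first order') RETURNING id;
```

The random nature of v4 UUIDs is exactly what the sorting and index-locality debates in these Phriday posts tend to be about.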
Next piece of content- "Who Contributed to PostgreSQL Development in 2023?". This is from rhaas.blogspot.com and here Robert posts the top contributors to Postgres for last year, and this is sorted by, I believe, the number of lines of code. This list is for the principal author of the commit. He also has another one showing committers who did the most work but were not the principal author on particular commits, as well as people who sent a lot of emails to the PGSQL hackers list. As you can see, Tom Lane was the top in all of these. But thank you to everyone who works and contributes to PostgreSQL. I greatly appreciate it. I benefit highly from it, and I hope you enjoy these Scaling Postgres episodes because it's, I guess, part of my way of giving back.
Next piece of content. There was another episode of Postgres FM last week. This one was on "Bloat", and here Michael was joined by Chelsea Dole, who's a staff software engineer and tech lead of the data storage team at Brex. They talked about bloat: how to identify it, how to resolve it, and also the dangers of it. I guess the major danger is that if you get into a severely bloated state, autovacuum can't clean up tuples fast enough and your transaction ID age keeps climbing. That's probably the big danger. But also if you're severely bloated, as they mention in the podcast, you're basically hurting your I/O, because there's all this dead space in your data, and when you're pulling pages from disk, some fraction of each page, whatever your bloat percentage is, holds no relevant information. In terms of resolving it, they mention basically optimizing your vacuum settings to help keep a lid on bloat.
But if you really want to remove all of the bloat, you can do a VACUUM FULL, as long as you don't mind blocking reads and writes to the table. If you need to do it online, there is pg_repack, and they did mention that they've heard of people having issues with it. I know I tried it on a terabyte table in a test environment and it hung up and didn't run to completion, so I can definitely attest to having issues with it, although I know others have had success with it. At this point, my preference for getting rid of bloat is actually doing a logical replication upgrade. You don't really need to do an upgrade: you can just create an entirely new database, set up a logical replica, and fail over to it. That'll get rid of all your bloat. Well, as they point out, not all of it, but at least a significant part. So personally that's my go-to, but definitely an interesting episode that I encourage you to check out.
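The logical-replication approach can be sketched in a few statements. This is a simplified outline, not a full runbook (connection details, publication name, and schema copying are assumptions, and sequences, failover, and cutover validation are left out):

```sql
-- On the bloated source database:
CREATE PUBLICATION bloat_free FOR ALL TABLES;

-- On a freshly initialized target database
-- (schema copied beforehand, e.g. with pg_dump --schema-only):
CREATE SUBSCRIPTION bloat_free_sub
    CONNECTION 'host=old-primary dbname=app user=replicator'
    PUBLICATION bloat_free;

-- Once the subscription has caught up, point the application
-- at the new database; the copied rows carry no heap bloat.
```

Because the target receives only live rows, the dead space from the old heap simply never gets copied over, which is why this removes most, though not all, of the bloat.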
Next piece of content- "What's 'Rows Removed By Filter' in PostgreSQL Query Plans all about?". This is from andyatkinson.com. He's talking about how, when you're looking at an explain plan for a query that applies a filter, meaning rows are being checked against a condition rather than found directly via an index, the plan tells you how many rows were removed by that filter. In this particular scenario, they're querying a table looking for a specific name code. There's no index on it, so it has to do a sequential scan, and there's no ORDER BY, just a LIMIT: look for this particular name code and limit to one row. In this example, he was looking at the rows removed by filter and it reported every row being removed but two.
But there was only one row with this name code, so he was a little confused by that. What's happening is that when you ask for a specific name code with LIMIT 1 and no ordering, Postgres starts scanning through the table, and as soon as it sees that name code, it grabs the row and stops, because all it needs is one. It's not going to continue scanning through the rest of the table; that would be unnecessary work. So how many rows were removed depends on where the value you're looking for happens to sit within the physical structure of the table, the heap. But if you want to learn more, you can definitely check out this blog post.
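A sketch of what this looks like, with a hypothetical table and value (the plan shown in comments is illustrative of the shape, not actual measured output):

```sql
-- No index on name_code, no ORDER BY, just LIMIT 1
EXPLAIN ANALYZE
SELECT * FROM regions WHERE name_code = 'XYZ' LIMIT 1;

-- Illustrative plan shape:
-- Limit (actual rows=1 ...)
--   ->  Seq Scan on regions (actual rows=1 ...)
--         Filter: (name_code = 'XYZ')
--         Rows Removed by Filter: 2
```

"Rows Removed by Filter" only counts rows the scan actually examined and rejected before the LIMIT was satisfied, which is why it can be far smaller than the table size.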
Next piece of content- "GROUP BY reordering". This is from blog.anayrat.info and he's talking about a new commit added to Postgres that does exactly what it says: GROUP BY reordering. The planner may adjust the ordering of the GROUP BY keys if it thinks another order will be more optimal. In the example he shows, an optimized ordering returned in 3.5 seconds, while the same query with the GROUP BY keys switched around ran in about five and a half seconds, so it was slower. Clearly the optimum order is the first one.
But when he applied the patch and enabled it, setting enable_group_by_reordering to on while keeping the bad key order, Postgres optimized it and the query returned in 3.5 seconds, like the faster one. So this is a great improvement. Interestingly, he says work on this patch started in 2018, so it took five and a half years to reach this commit. That's a lot of consideration to make sure everything was in place for this enhancement. I definitely hope it gets into Postgres 17, but if you want to learn more, check out this blog post.
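The idea can be sketched like this; the table and column names are hypothetical, and the point is only that the written key order no longer has to match the order the planner considers best (for example, one that matches an existing sort order or index):

```sql
-- Let the planner consider reordering GROUP BY keys (the GUC from the patch)
SET enable_group_by_reordering = on;

-- A hand-written "bad" order: the planner may now internally evaluate
-- the grouping as (high_card_col, low_card_col) if that plan is cheaper,
-- without changing the query's results
SELECT low_card_col, high_card_col, count(*)
FROM events
GROUP BY low_card_col, high_card_col;
```

Since GROUP BY output is a set, reordering the keys is semantically free, which is what makes this purely a planner optimization.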
The last piece of content- "PostgreSQL 17: Track skipped rows from COPY in pg_stat_progress_copy". This is from dbi-services.com. I'm not as excited about this one as the SAVE_ERROR_TO feature of COPY that we talked about on last week's Scaling Postgres episode, because pg_stat_progress_copy only shows a row for a COPY while it is actively running. So for example, down here you can see something being processed, and it tells you how many tuples have been skipped so far.
So that's what this new feature does: it shows you how many tuples were skipped because of errors. But once the COPY is done, that row goes away, so you lose the information. It's only valuable if you're actively monitoring a running COPY and want to know whether errors are happening. So I definitely prefer the other feature, where it actually saves the rows that had errors. I think I would use that a lot more than this particular one, but it's great nonetheless.
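For reference, monitoring a running COPY from another session looks roughly like this (tuples_skipped is the new Postgres 17 column; the rest of the view already existed):

```sql
-- Run this in a second session while a COPY ... ON ERROR/skip-rows load is active
SELECT relid::regclass AS table_name,
       command,
       tuples_processed,
       tuples_skipped     -- new in PostgreSQL 17: rows skipped due to errors
FROM pg_stat_progress_copy;

-- The row disappears from the view as soon as the COPY finishes.
```

So to capture the final skip count you'd have to poll this view before the COPY completes, which is exactly the limitation discussed above.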