Optimizing Trigram Search, Replication Review, Logical Improvements, Timescale Investment

In this episode of Scaling Postgres, we discuss optimizing trigram searches, a review of Postgres replication, improvements to logical replication and a significant Timescale investment.

Content Discussed

YouTube Video

Podcast Audio

Transcript

All right. I hope you, your friends, family, and coworkers continue to do well. Our first piece of content is "Optimizing Postgres Text Search with Trigrams". This is from alexklibisz.com. So this is a very significant post and you can see here it's a 37 minutes read and it has this comprehensive table of contents here on the right covering all the different areas. But given it's the first post that I'm showcasing, you should know that that means it's a good post. So I'd highly suggest reading it. He starts off with some very basic simplistic implementation, which runs in about 360 seconds, but he gets it down to just over 100 milliseconds. So he goes through the process of doing a fuzzy base search using trigrams and showing you how to accelerate that. So first he says, okay, what is text search? It's basically you have one input and it's going to search through one or more columns in a particular table.

Now why Postgres? Basically for a lot of search cases, and I hear this from other places, but Postgres is probably good enough. If it's not good enough you're going to have to get something more sophisticated like Elasticsearch and have someone manage it, et cetera. Then it covers what trigrams are. So it's basically separating a word into three character sequences. So for example, hello would be represented by these trigrams here. Now if you're a little confused here, I think some of the double quotes are missing from some of these, but that's the basic gist. And there's actually a Postgres function called show_tgrm that you can put in a text string and it will give you all of the different trigrams for it. Now he's not addressing full-text search because, well, to my understanding you can't really do a fuzzy search with that. And full-text search is not useful when you have something that's not naturally language-based.

So for example, if you have code or if you have some sequence of numbers like he says, product SKUs and email addresses, et cetera. Now he talks about the test environment he set up. The Amazon data set he's using is a review data set. So we place this into a single table called Reviews and he puts the data in its own columns. He only used five columns from this data set, but it was about a gigabyte in size. Then he uses EXPLAIN ANALYZE with buffers to check the performance of things. So before he gets into his first query, he talks about the trigram operators and you have a similarity function as well as two different operators he's using. So similarity gives you a numeric function that determines how similar some trigrams are to something else. He has this great query here that shows the comparison between two different sets of trigrams: what their intersection is, what the union is, and what the similarity function is.

The percent operator, which basically returns true above a certain default threshold which is, say, 0.3 here. And then the < -> gives you the distance between two sets of text. And that's useful for doing your ranking or doing your sorting. So he's searching for the name Michael Lewis within the reviews, and here's the query he's using. So he's using the percent operator to find those that match a certain threshold of similarity. So only pull those records and then just pull the top ten ordering or sorting by their similarity. With a query like this, you get results that look like this. So you get the score listed here, what was returned, and the review ID. This ran in a minute and 34 seconds, so not great. Then he did a fuzzy search and that gave results similar to this. He's not really happy with the 94 seconds, so he asks, 'Okay, how can we speed that up?'. The first thing he's reaching for is, of course, indexes.

So I typically for text searches use a GIN index, and actually, that's what the Postgres documentation talks about. But you can also use GiST indexes, and he actually uses the GiST index because GiST supports both filtering and sorting, so you can easily do that sorting operation to get the similarity if you use a GiST index. Also with the GiST index, he's using a newer feature called siglin, which is the signature length. You can think of it as how it builds the index is the level of granularity. So if you have a high siglin, the index will be larger, but you'll actually be able to target a smaller portion of the index. So less scanning should be involved to do a given search, whereas if it's smaller, it's going to be covering more of the index when it does a search. But that index's overall size should be smaller. So we actually test a signature length of 64 and 256.

Now using the GiST index, with the shorter siglin, it went down to 4.5 seconds, and with the 256 siglin in, it went down to just under 2 seconds. So that's the first set of improvements. Now he also went into doing an exact search and then a trigram search. So using the like operator, he says just give me an exact search to see how fast that can happen. Doing that exact search on a single column returned in six milliseconds, and the fuzzy search returned in ten milliseconds, so definitely super fast. So his thought process here is to do an exact search to pull out those. Because he's only looking for the first ten, you might not need to do a fuzzy search. So you do the exact search, get what you need then you'll only do the trigram search if you need to. So that's a thought process he went through here about potentially doing that.

So now up to this point, he's just been searching for a single column. So now he wants to do all the columns and he did a few implementations of this. He tried doing UNIONs, querying on each column that really wasn't very efficient. He then tried doing an OR so this column or this column or this column and that efficiency was actually even worse. So finally he said, okay, I'm going to put all the columns together in the index. So he's basically using an expression index. So he's concatenating the different fields together and searching on that. The other thing he discovered there's a set of functions for word similarity. So we were talking about the similarity index. Well, this is a word similarity index and what he discovered is that it's immune to different sizes of text comparisons in terms of matching similarity.

So for example, if you do this fuzzy search of Louis against Lewis, it has pretty good similarity, but once you start having a very large text string like you're concatenating them together, the similarity starts being hurt by that, whereas it maintains it with the word similarity. So he's switching over to use, essentially, that function and some slightly different operators. So he creates his GiST index using the siglin of 256 and he concatenates all the different columns together. He wants to be part of that functional index. And then this is a query that actually queries it. You'll see this is a slightly different operator that's using the word similarity operator and also this uses the word similarity as well. Again, this is for ranking and this is for filtering. Also, this is the trigram searching portion and then he did a like implementation for the exact search and the results look good and here's what the performance times look like.

So using an exact only search, just using like for looking for the exact name, it returned in 37 milliseconds. The trigram search for the exact name was 39 milliseconds. So it's super close, it's almost identical. Whether you're doing ILIKE or a trigram search, it's identical. So I don't even know if you need to first do that exact search and then try trigram. I would just go with this implementation. Same thing with a fuzzy search. The exact search was 87 milliseconds, whereas the trigram search was 113 milliseconds. So I would probably just go with the trigram search and not worry about this exact search.

But basically, this is how he gets it down to 100 milliseconds across four columns. Now, one thing he didn't mention, but I wondered if you could do this. These queries are quite verbose and I wondered if instead you could use a generated column feature of Postgres where you generate a column that contains this concatenated information and then you put the GiST index on that. So essentially you would get rid of certain code and all the concatenation code you could get rid of because you're just querying what your user gave to a particular column. But this was a super impressive post and I definitely suggest you check it out.

The next piece of content- "5mins of Postgres E6: Optimizing Postgres Text Search with Trigrams and GiST indexes". This is from pganalyze.com and this is a follow-up to the post that I just discussed. So if you want to add a little bit more insight into what this post is communicating highly suggest you check out this episode. They also talk a little bit more about GiST indexes in general if you want to learn more about how those work.

The next piece of content- "Replication Revue". This is from enterprisedb.com and this kind of goes through the process of replication and how it was added to Postgres. Basically, in the beginning, there were only WAL files that existed for the write-ahead log for crash recovery purposes. So it was never for replication. But they had the idea, hey, maybe we could use these to build and keep a replica Postgres database in sync. That's when they started having this process called log shipping to keep a replica in place. And then with Postgres 9, they started introducing streaming replication where using a wire protocol could keep a replica up to date. You didn't have to do the log shipping method. Then over the different versions they added Cascading Replication and replication slots to help prevent replicas from falling out of sync. They added different views as well as synchronous replication.

In version 9.4 they added logical decoding so you could decode the WAL files that are being transferred over and follow it on with logical replication in version 10. So this post is talking about how we have a lot of these pieces in place but if you want to actually create a cluster that's aware of what is the master, who are the replicas, and what state are they in, you actually have to query each database to get an understanding of the topology of the cluster. So it's basically stating an argument. It would be great to have this type of tool to understand the state of the database in general. Now separate tools have done things to try and serve that role. He's talking about EDB's bidirectional Postgres extension, which this EDB blog mentions, but other tools are trying to do that as well, such as Patroni for example. But this is a great post talking about where we've been with replication and where we can potentially go in the future to build a system with more high availability. So definitely suggest checking out this blog post.

The next piece of content- "LogicalReplication/Decoding Improvements in PostgreSQL 13 and 14". This is from percona.com and this is a follow-up from the post where they were discussing that Patroni has added some features to address the problem with the logical replication slot failover not happening, or there's no way for it to happen now in a standard Postgres database. They actually built that process into Patroni failover. Well, he's talking about some other improvements that have made logical replication easier to use. Number one, there were some problems with some memory usage and disk usage, especially when you have long transactions or a long load time that could impact logical replication.

Well, they made some changes in version 13 to improve this by spilling over some things to disk and then sometimes people were having, even on 13, some replication delays and CPU load taking care of that spilling to disk and there's been some changes to improve that. So one of the main things in 13 is they added a logical decoding workmen to allow you to allot more memory dedicated to handling logical decoding and not have to spill it to the disk when it runs out of memory. And then secondly, they added the ability to stream in-process transactions in Postgres 14. So if you have a really long transaction happening, you can actually start streaming it to replicas before it's actually committed. And we've discussed different posts about that happening, but they cover it in this post as well and they actually show you how to set that up.

So if you actually want to do streaming logical replication or have that capability for in-process transactions, when you create your subscription, you go ahead and state that with streaming on and that will go ahead and turn on streaming. Or if you have an existing subscription, you can actually alter it to set streaming on. And then there are a couple of different system views that have been added. One is to check the progress of a copy which is with pg_stat_process_copy, which is important when you're initially setting up logical replication as well as additional columns to pg_stat_replication_slots because this allows you to track what transactions are spilling to disks, the size of it, as well as what's streaming. So what's being streamed to other replicas. So definitely some great additions to Postgres. And go ahead and check out this blog post if you want to learn more.

The next piece of content- "Year of the Tiger: $110 million to build the future of data for developers worldwide". This is from timescale.com and they just received a $110 million investment. So that's quite significant. And I believe all told, this organization has received $180 million in funding, if that's what they're talking about here. So you can read more of the details here. But this is significant because, of course, they are an extension of Postgres, so they rely on base Postgres and these funds should help them contribute more to the open source project, making Postgres better itself as well as their solution. So it's been an interesting number of years where, for example, Microsoft has purchased Citus for their scale-out technology and they're doing more with regard to time series functions as well. Now Timescale has received this significant investment. But if you want to learn more, definitely check out this blog post.

Next piece of content- "pg-osc: Zero downtime schema changes in PostgreSQL". This is from shayon.dev. He's talking about a new tool that he's developed called pg-osc, which stands for Online Schema Changes. And basically, it's a tool that he says is based on the implementation design of pg_repack. So what he's done is that any schema change, you're going to remove a column, you're going to add a column, and you're going to change a column. What it does is it just doesn't execute those potentially unsafe changes on an active database, it actually creates a whole new table and then uses transactions to flip over to it. Now, I think this is a great idea and can work for small tables, but if you have a multi-terabyte table or even really high hundreds of gigabytes, I don't see how this is really going to work that well.

I have some clients that have a multi-terabyte table and there's not even enough disk space to replicate the whole table. So we can't even do, for example, pg_repack operations on it unless we were to actually bring up a whole nother volume to do that process. He designed this so that engineers wouldn't have to think about carefully doing migrations without causing issues with the database. But I don't necessarily see this as the solution to that. But if you have relatively small tables, maybe this could work. Now this is relatively new, so I wouldn't call it production-ready yet, but it's an interesting concept. If you want to learn more, you can check out this blog post.

The next piece of content- "GOOGLE CLOUD POSTGRESQL UNDER PGWATCH2 MONITORING". This is from cybertec-postgresql.com and he's talking about how, yes, you can manage some of your cloud-managed databases with third-party tools. So for example, they are using a Google Cloud instance of Postgres to be able to monitor it using PGWatch2. Basically, instead of having some monitor on the instance itself, it calls into the database connection to collect the necessary data. He goes over the process of doing this and why would you potentially want to do it. I mean, maybe you have a multi-cloud infrastructure and you ideally have one tool that you can add more customization to to do monitoring as opposed to relying upon each cloud vendor's solution. But if you want to learn more about how to do that and get it set up, definitely check out this blog post.

The next piece of content. The PostgreSQL person of the week is Julia Gugel. If you're interested in learning more about Julia and her contributions to Postgres, definitely check out this blog post.

The last piece of content. We had another episode of The Rubber Duck Dev Show this past Wednesday. This episode was on "How To Learn a New Code Base". So if you've just been hired and come to a new organization, what's the best way to get started? Or, for example, maybe you're a consultant and you're seeing a new code base for the first time. What are some techniques you can use to get familiar with it? So if you're interested in that type of content, you can definitely check out our show.

Optimizing Trigram Search, Replication Review, Logical Improvements, Timescale Investment | Scaling Postgres 204

Join Over 1,000 Engineers & Get New Episodes Weekly!

Content Discussed

YouTube Video

Podcast Audio

Transcript