
Foreign Data Wrapper, Timescale Compression, Fuzzy Search, Query Execution | Scaling Postgres 179


In this episode of Scaling Postgres, we discuss how to use foreign data wrappers, compression available in timescaledb, working with fuzzy search and describing Postgres query execution.


Transcript

All right. I hope you, your friends, family, and coworkers continue to do well. Our first piece of content is "Understanding Foreign Data Wrappers in Postgres and postgres_fdw". This is from blog.crunchydata.com and it's talking about using Foreign Data Wrappers. A Foreign Data Wrapper basically enables you to speak to another database. In the case of the postgres_fdw Foreign Data Wrapper, it allows you to speak to another Postgres database and create a virtual table locally that mirrors one that exists on the foreign database. So you can query it locally and it goes and pulls the data for you from that other database. Now, there exist many other Foreign Data Wrappers. There are Foreign Data Wrappers for MySQL, SQL Server, and Oracle, as well as just a general file wrapper, so you can use a file as a foreign data source. This blog post walks you through setting up the Postgres one. So first you create the postgres_fdw extension.

You define a foreign server, and they show the syntax you would use here. Then you create a user mapping to set up the authentication so that you can connect to that foreign server. Then you create your foreign tables on your database, referencing the tables on the other server. So you can create a foreign table, or you can import a foreign schema, which is what they did here. Now you can query those tables as if they existed in your database, and you can even join them to tables that do exist within your database. This post conveniently shows a number of psql commands for inspecting the foreign servers, user mappings, foreign tables, et cetera that exist. Now, in terms of optimization, they recommend using use_remote_estimate.
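
To give a rough idea, here's a minimal sketch of that setup; the server name, connection details, role, and schema names are made up for illustration, not taken from the post.

```sql
-- Minimal postgres_fdw setup sketch; all names here are hypothetical.
CREATE EXTENSION IF NOT EXISTS postgres_fdw;

-- Define the foreign server to connect to.
CREATE SERVER remote_db
  FOREIGN DATA WRAPPER postgres_fdw
  OPTIONS (host 'remote.example.com', port '5432', dbname 'reporting');

-- Map a local role to credentials on the foreign server.
CREATE USER MAPPING FOR app_user
  SERVER remote_db
  OPTIONS (user 'remote_user', password 'secret');

-- Either create individual foreign tables or import a whole schema, as the post does.
CREATE SCHEMA remote_public;
IMPORT FOREIGN SCHEMA public
  FROM SERVER remote_db
  INTO remote_public;
```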

So basically you allow the foreign server to calculate its own estimates and pass them over to the local server, but you can also leave that off. Then they talk about being sure to analyze your foreign tables relatively frequently. They even said that if you wanted faster access, maybe you want to create a materialized view of that foreign table so that the data really does exist in a local table and doesn't have to be pulled over the network each time. That may give you some performance improvements at the expense of having to refresh that materialized view. Then they follow up saying how this could be an alternative to ETL: if you don't want to transfer all that data to your server and just want to access a few different pieces of data from that foreign server, it's convenient to set this up and do it that way. If you want to learn more about foreign data wrappers, maybe you want to check out this blog post.
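
As a sketch of those tuning options, continuing with the hypothetical names from the setup above:

```sql
-- Let the foreign server compute and send over its own row estimates.
ALTER SERVER remote_db OPTIONS (ADD use_remote_estimate 'true');

-- Keep local statistics about the foreign table reasonably fresh.
ANALYZE remote_public.orders;

-- Optionally cache the remote data locally, at the cost of refreshing it.
CREATE MATERIALIZED VIEW orders_local AS
  SELECT * FROM remote_public.orders;

REFRESH MATERIALIZED VIEW orders_local;
```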

The next piece of content is "TIMESCALE, COMPRESSION AND OPENSTREETMAP TAGS". This is from rustprooflabs.com and they're talking about some previous posts they had done where they have a bunch of OpenStreetMap data that they use in their application, and they partitioned it using the built-in declarative partitioning in Postgres. What they wanted to do here was see how things would be different with Timescale's partitioning. The two benefits they were looking for from Timescale are, first, no need to manually create the partitions. Now, personally, I don't really have a big problem with this; it's pretty easy to create the partitions, but I guess it is convenient to not have to create them if you don't want to. The second is compression, because by default Postgres only compresses values, via TOAST, once a row exceeds a threshold of around 2 KB, whereas Timescale enables you to compress rows essentially no matter their size. So this post talks about the differences he sees when he migrates from Postgres's native declarative partitioning to Timescale's partitioning scheme.

He goes over setting it up: basically adding timescaledb to shared_preload_libraries and creating the timescaledb extension in the database. Then at that point, he created some tables, converted them to hypertables using the create_hypertable function, and inserted all of his data into them. Now, he kept the old source data so he could do some comparisons of query results, row counts, and things like that. Then he went ahead and added compression as well. His initial compression numbers, looking at JSONB fields, came out to about a 93% compression ratio, which is pretty huge. The size before was 607 megabytes and it went down to 45 megabytes, so that's quite significant. Now, he did say this was a bit of an outlier and that some other columns didn't compress as well.

They were around 40-50% compression, but that's still pretty significant. Then he also follows up by looking at query performance. Take a look at the table down here: it shows the planning time and execution time for the query against the original dedicated table. The plain hypertable took a little bit longer, I suspect because the original is a dedicated table for a given date, whereas with the hypertable the planner also has to narrow things down by date to identify which partition to execute the query against. So that took a little more planning time, but the execution time is just as short. Whereas when you look at a compressed hypertable, the execution time increases significantly, as you can tell here. Now, it was still about one millisecond, but you can tell the magnitude is quite a bit larger. So the compression does save you a lot of space at the cost of query time. This was a very interesting blog post on using TimescaleDB for a particular use case, and if you're interested in that, you can check out this blog post.
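
To give a rough idea of the kind of setup he describes, here's a minimal TimescaleDB sketch; the table and column names are hypothetical, not the OpenStreetMap schema from his post, and the compression settings are just illustrative:

```sql
-- timescaledb must be in shared_preload_libraries before this will work:
--   shared_preload_libraries = 'timescaledb'
CREATE EXTENSION IF NOT EXISTS timescaledb;

-- Turn a regular table into a hypertable partitioned on its date column.
SELECT create_hypertable('osm_tags', 'load_date');

-- Enable compression, segmenting by a commonly filtered column.
ALTER TABLE osm_tags SET (
  timescaledb.compress,
  timescaledb.compress_segmentby = 'region'
);

-- Automatically compress chunks once they are older than 30 days.
SELECT add_compression_policy('osm_tags', INTERVAL '30 days');
```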

The next piece of content- "A Handbook to Implement Fuzzy Search in PostgreSQL". This is from arctype.com and they're talking about various different ways of searching text in Postgres. The first way they cover is just doing a direct equality comparison to find an exact string, and for that a B-tree index works great to find an exact match of a text string. It starts becoming a little more difficult when you want to look for a particular prefix or suffix in the text string. You can still use a B-tree index for that, but you're going to want to use the text_pattern_ops operator class when you create the index, which will enable you to do a prefix or suffix search with a B-tree index and get pretty good performance. Now, they did say you could also do an internal search and it is faster, but not by much at all. Then if you have a lot of text in the field, the B-tree won't even work.
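
Here's a minimal sketch of the prefix-search setup, with made-up table and column names:

```sql
-- A B-tree index with text_pattern_ops supports LIKE patterns anchored at the start.
CREATE INDEX idx_users_name_pattern ON users (name text_pattern_ops);

-- A prefix search that can use the index above.
SELECT * FROM users WHERE name LIKE 'sam%';
```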

For example, when they created an index on a larger column containing a lot of data, they got an error saying the index row size exceeded the maximum for this index. Now, they didn't talk about this here, but the solution is to use a GIN index, because that will let you index that much text. Generally, if you want full-text search, you would use tsqueries against tsvectors to do it. They didn't mention that in this post, but then they went into how to do a fuzzy search. There are basically three ways right now to do a fuzzy search within Postgres. The first way is to break things up into trigrams, and for that there's the pg_trgm extension you can use. So you create that extension and then you can run queries or similarity searches against it to find instances of text that match what you're querying.
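
A small pg_trgm sketch along those lines, again with hypothetical table and column names:

```sql
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- similarity() returns 0 (no shared trigrams) through 1 (identical trigram sets).
SELECT similarity('postgres', 'postgras');

-- The % operator matches rows whose similarity exceeds pg_trgm.similarity_threshold.
SELECT * FROM users WHERE name % 'jonh';
```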

The other way is Levenshtein matching, which you can do as part of the fuzzystrmatch extension; that basically looks for strings that are close to what you're querying, within so many character edits, for example. So this did a Levenshtein comparison with this text string where there are fewer than four differences, and that gives you a particular output. The other way to do it with the fuzzystrmatch extension is a phonetic match, so something that sounds like it. Here, they're using the difference function to do that type of fuzzy search. Now, whereas they didn't mention GIN up above, they do mention it here. There's even a gin_trgm_ops operator class that allows trigram searches to use a GIN index to get you better performance. So if you want to learn a little bit more about text searching and fuzzy string matching in Postgres, I definitely suggest checking out this blog post.
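
Here are sketches of those two fuzzystrmatch approaches plus the trigram GIN index, with hypothetical table and column names and arbitrary search terms:

```sql
CREATE EXTENSION IF NOT EXISTS fuzzystrmatch;

-- Levenshtein distance: rows within three single-character edits of the search term.
SELECT * FROM users WHERE levenshtein(name, 'Jellyfish') < 4;

-- Phonetic match: difference() compares Soundex codes and returns 0 to 4 (4 = sounds alike).
SELECT * FROM users WHERE difference(name, 'Ann') = 4;

-- A GIN index with gin_trgm_ops (from pg_trgm) speeds up trigram searches such as % and LIKE '%...%'.
CREATE INDEX idx_users_name_trgm ON users USING gin (name gin_trgm_ops);
```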

The next piece of content- "Postgres query execution: simple query". This is from dev.to in the Yugabyte section. Now, this is actually a series of three blog posts covering Postgres query execution. The first post talks about simple queries, the second covers a simple query using prepared statements, where it prepares a statement and then executes it multiple times with different parameters, and there's also one on PL/pgSQL query execution. So I encourage you to check out all of these posts. This post gives some interesting insight into what's happening in the background and shows you the different steps that a simple query goes through. Then it talks a little bit about performance, in that usually what you would use to track the timing of different queries is pg_stat_statements. But he says here that that doesn't give the whole picture, in that it only gives you the duration between when ExecutorStart_hook begins and when ExecutorFinish_hook begins.

But there are some other things that aren't included. Namely, if you're on a version before 13, it only gives you the execution time of queries; apparently in 13 you can change it to also include planning time, but it still doesn't include everything. Whereas log_min_duration_statement, which reports to the Postgres logs, does give you a more comprehensive view of how long a query takes to run. We saw this in a previous post covered on Scaling Postgres where they were seeing timing differences between these two. One of the main things that log_min_duration_statement includes is how long it took to deliver all the data to the client. So that's another area where it can be more accurate than pg_stat_statements. But this was a great series of three blog posts explaining query execution in Postgres. So if you're interested in that, you can check out these posts.
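
As a sketch of the settings being discussed, assuming Postgres 13 or later with pg_stat_statements installed; the 250 ms threshold and the final query are just illustrative:

```sql
-- Postgres 13+ with pg_stat_statements loaded: also track planning time.
ALTER SYSTEM SET pg_stat_statements.track_planning = on;

-- Log any statement running longer than 250 ms; the logged duration includes
-- time spent delivering results to the client.
ALTER SYSTEM SET log_min_duration_statement = '250ms';

SELECT pg_reload_conf();

-- Compare planning and execution time per normalized query.
SELECT query, total_plan_time, total_exec_time
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 5;
```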

The next piece of content- "How batch insertion was done in postgres_fdw". This is from highgo.ca and it's talking about the Postgres Foreign Data Wrapper. It goes into more detail about the patch that was recently committed to batch multiple inserts. Before, it was doing one insert at a time over the network to the foreign server, whereas now there's a feature that lets you batch those up to get much better performance. This post goes through the patch that implemented some of that. So if you're interested in that, you can check out this blog post.
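
In PostgreSQL 14, where this work lands, the batching is exposed as a batch_size option on the foreign server or foreign table, as best I understand it. A rough sketch, reusing the hypothetical names from earlier; the value of 100 is arbitrary:

```sql
-- Send inserts to the foreign server in batches of 100 rows instead of one at a time.
ALTER SERVER remote_db OPTIONS (ADD batch_size '100');

-- The option can also be set per foreign table.
ALTER FOREIGN TABLE remote_public.orders OPTIONS (ADD batch_size '100');
```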

The next piece of content- "pgbench --client --jobs". This is from dev.to in the Yugabyte section and they're talking about using pgbench and clarifying what Client and Jobs mean. Now, at the beginning they say Client is about the number of servers and Jobs is about the number of clients; I read through the blog post and I actually kind of disagree with that framing. What the pgbench documentation says is that clients are the number of concurrent database clients, which makes sense. But what I think he's trying to convey is that each database client corresponds to one process on the server, so the number of database clients you spin up is the number of server processes you're going to get. Jobs, according to the documentation, is the number of worker threads, and that makes sense to me as well. The thing to keep in mind is that if you have one system running pgbench against the database and you're asking it to simulate a lot of clients, you may need to give it more than one thread, or you may run into a bottleneck on the client side. But apart from that, this blog post does go into a lot of the details and explains the differences between Client and Jobs for pgbench.
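
For reference, a typical invocation might look like this; the numbers and database name are arbitrary examples:

```
# 32 simulated database clients (-c) driven by 4 pgbench worker threads (-j), for 60 seconds (-T).
pgbench -c 32 -j 4 -T 60 mydb
```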

The next piece of content- "Waiting for PostGIS 3.2: Secure Cloud Raster Access". This is from blog.crunchydata.com, and they're talking about how, when you use PostGIS with raster data, there are two ways you can do it: you can have it in the database, or you can have it outside the database and use PostGIS to access that external raster data. A lot of times people go for the external solution as opposed to storing it in the database, because storing it makes the database really huge, and they list some other disadvantages here in addition to just size. Now, raster format access goes through a library called GDAL, which by default was built for local file systems, but changes have recently been added to support cloud access. The most recent changes allow secure access to cloud resources, and PostGIS 3.2 will enable this as well. So you could have your own private buckets in S3 or Google Cloud Storage and access them from PostGIS. So if you want to learn more about that, you can check out this blog post.
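
As a sketch of the relevant session settings: the first two GUCs are long-standing PostGIS raster settings, while postgis.gdal_vsi_options is, to the best of my recollection, the new PostGIS 3.2 mechanism for passing cloud credentials, so treat that one as an assumption:

```sql
-- Standard PostGIS raster settings: allow out-of-database rasters and enable GDAL drivers.
SET postgis.enable_outdb_rasters = true;
SET postgis.gdal_enabled_drivers = 'ENABLE_ALL';

-- Assumed PostGIS 3.2 GUC for passing cloud credentials (e.g. a private S3 bucket) to GDAL.
SET postgis.gdal_vsi_options = 'AWS_ACCESS_KEY_ID=xxx AWS_SECRET_ACCESS_KEY=yyy';
```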

The next piece of content- "Validate PostgreSQL extension upgrade scripts using pg_validate_extupgrade". This is from migops.com, and they're talking about a new tool called pg_validate_extupgrade. Basically, if you develop your own extensions for Postgres, this helps you validate and better manage your upgrade scripts. So if you want to learn more about this new tool for managing your extensions' upgrade scripts, you can check out this blog post.

The next piece of content is "PG_TIMETABLE V4 IS OUT AND AVAILABLE IMMEDIATELY!". This is from cybertec-postgresql.com, so if you use pg_timetable to help run scheduled, time-dependent jobs in Postgres, you can check out this blog post.

The next piece of content. The PostgreSQL person of the week is Andrey Borodin. If you're interested in learning more about Andrey and his contributions to Postgres, you can check out this blog post.

The last piece of content: we did another episode of The Rubber Duck Dev Show this past Wednesday. This episode was on "Writing Performant Code". So if you're interested in more software development content, particularly around writing more performant code, you can check out this episode.
