PG 14.4 Release, Subscription Skip, Using Schemas, Open Source Citus | Scaling Postgres 220

In this episode of Scaling Postgres, we discuss a PG 14.4 release, how to skip transactions with logical replication, how to use schemas and Citus going fully open source.

Transcript

All right. I hope you, your friends, family, and coworkers continue to do well. Our first piece of content is "PostgreSQL 14.4 Released!". This is from postgresql.org, and this has been an issue we've discussed in the last couple of episodes of Scaling Postgres. Since the release of Postgres 14, there has been a bug that can affect indexes built or rebuilt concurrently, so there is the potential for index corruption. This release resolves that issue, and they again restate how you can check for corruption using the pg_amcheck capability. If you do have corruption, what you basically need to do is rebuild those indexes by reindexing concurrently, because it should now work successfully. You can do that individually, by database, or by schema, and there's even the reindexdb command, which can do the entire cluster concurrently, and you can specify a number of jobs to do it faster. But if you want to learn more, go ahead and check out this blog post.
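Just as a quick sketch, the reindexing they describe can be done at a few different scopes; the object names below are placeholders:

```sql
-- Rebuild potentially affected indexes without blocking writes;
-- "my_index", "my_schema", and "my_database" are placeholder names.
REINDEX INDEX CONCURRENTLY my_index;        -- a single index
REINDEX SCHEMA CONCURRENTLY my_schema;      -- every index in a schema
REINDEX DATABASE CONCURRENTLY my_database;  -- every index in the current database
```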

The next piece of content- "Addressing logical replication conflicts using ALTER SUBSCRIPTION SKIP". This is from postgresql.fastware.com, and there was a previous article released about two weeks ago talking about a way to skip over transactions during logical replication. In that post, they talked about using pg_replication_origin_advance, and this does have some risks associated with it. However, there's this new feature being added to PostgreSQL 15 called ALTER SUBSCRIPTION SKIP, and it actually makes it easier and safer to skip over particular transactions. So this is definitely an interesting blog post to check out. Now, this is very similar to the one that was presented in Scaling Postgres a couple of weeks ago. So I would probably go back and review episode 218, or you could just look at the previous blog post with regard to this. But basically, they set up a subscription and the source table did not have a unique constraint on an ID. They put a unique constraint on the destination.

So basically they're going to hit an error where they're going to send a duplicate, and they set disable_on_error = true for the subscription, so the subscription will disable itself if it runs into an error. They process some transactions on the source database and all these IDs were entered. It hits the conflict on the destination database, and you can see that only 5 and 1 were inserted; all the rest were not. So to continue sending data, they skip over a transaction. Now, the logs give you an important detail in that they tell you where the transaction finished. So this is the LSN of the finished transaction, and that's what you want to use in ALTER SUBSCRIPTION SKIP. Basically, you reference your subscription and give it the LSN that the transaction finished at. Now if you look at the subscription at this point, you see the subscription skip LSN is set, and you can see that the subscription is disabled because subenabled says false.
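The skip command itself is a one-liner; the subscription name and LSN below are just example values:

```sql
-- Skip the failed transaction, using the finish LSN reported in the
-- subscriber's log; the LSN here is only an example value.
ALTER SUBSCRIPTION sub SKIP (lsn = '0/14C0378');
```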

You then go ahead and enable the subscription. Now when you look at that table, you'll see it skipped over the transaction that was inserting more rows and just inserted the final row here. So it did skip over one transaction. And when you look at the subscription again, you can see it's enabled and that there is no skip LSN. Then you can see information in the logs about this completing successfully, and then they talk about the safeguards that were added to make this a lot safer than trying to advance the LSN yourself. So definitely a great feature coming in Postgres 15, and I wonder why they didn't mention in the previous post that you can use that technique, but here's a better way to do it. But definitely check out this blog post if you want to learn more.
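Re-enabling and checking the subscription state would look roughly like this, assuming the same hypothetical subscription name:

```sql
-- Re-enable the subscription, then confirm it's running and that the
-- skip LSN has been cleared after being consumed.
ALTER SUBSCRIPTION sub ENABLE;
SELECT subname, subenabled, subskiplsn FROM pg_subscription;
```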

The next piece of content- "Using Postgres Schemas". This is from aaronoellis.com and they're talking about schemas. So basically schemas are basically a namespace for objects in a database. Tables and indexes can fit within a particular schema within a database and you can have multiple schemas. So they're just like namespaces. And it's interesting, I haven't seen a lot of schema use with Postgres clients. Almost all the consulting clients I deal with are only using the public schema. Whereas my experience with Microsoft, SQL Server and Oracle, there's a lot more schema use. I think part of it is I think Oracle creates a schema for each user. So it basically already has this namespace schema area for each user.

But I haven't seen something similar with Postgres, and maybe that's because, as you'll see here, it can get a little bit complex to set up the permissions so that you can use a schema the way you would anticipate. When you create a table, you can namespace it with the schema name followed by a dot, and you can have the same name for different tables as long as they are in different schemas. So they created a table named example in a private schema and another table named example in the public schema. The other thing to consider when you're working with schemas is the search path. It tells you what schemas Postgres is going to search to find the objects of interest, and you can change your search path, which the post describes how to do.
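As a small sketch of what that looks like (the schema and table names are hypothetical):

```sql
-- Two tables with the same name in different schemas; "private" is a
-- hypothetical schema name.
CREATE SCHEMA private;
CREATE TABLE private.example (id int);
CREATE TABLE public.example (id int);

SHOW search_path;                    -- typically "$user", public
SET search_path TO private, public;  -- unqualified names now resolve to private first
SELECT * FROM example;               -- reads private.example
SELECT * FROM public.example;        -- a fully qualified name always works
```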

And even once you have a search path, you can still explicitly give the fully qualified name to access those particular objects. Then it talks about how, even if you create a new user, create a schema for it as they're doing here, and give that user authorization over the schema, you can create objects there that the user will own, but the postgres superuser can also go in and create objects, and then the postgres user owns those. And if the user that was created tries to access the object created by postgres, you get a permission denied error. So you have to grant certain privileges to the users for the different schemas to make sure everything works as you expect. This blog post goes into that, and it probably reflects why, in the wild, I haven't seen that many Postgres installations using schemas other than the public one. But if you want to learn more, definitely check out this blog post.
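A minimal sketch of the scenario in the post, with hypothetical role and schema names:

```sql
-- "app_user" and "app" are hypothetical names.
CREATE ROLE app_user LOGIN;
CREATE SCHEMA app AUTHORIZATION app_user;

-- If the postgres superuser then creates a table in that schema,
-- app_user gets "permission denied" on it until privileges are granted:
CREATE TABLE app.reports (id int);   -- created (and owned) by postgres
GRANT SELECT ON app.reports TO app_user;
-- or, more broadly:
GRANT SELECT ON ALL TABLES IN SCHEMA app TO app_user;
```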

Next piece of content- "Citus 11 for Postgres goes fully open source, with query from any node". This is from citusdata.com and they're talking about the release of Citus 11. So again, this is a Postgres extension that allows you to do scale-out Postgres. So it lets you have multiple nodes and have a coordinator node that can then query data that's on separate nodes. So it allows you to scale out one database across several database servers. Now, with this release, the newest feature is being able to query from any node. Before, you had to do the query from a coordinator node, but now you can query from the individual data nodes as well, apparently. But one thing that makes this really significant in my eyes is that everything in the Citus extension is now fully open-source. So previously they had two versions. They had the open-source version and what they called their Enterprise version.

And of course, the open-source version, or the Community Edition, had fewer features than the Enterprise release, but apparently they've added all those features into the open-source version. So to me, this is fantastic. And they mention some of the features here that have been enabled for the open-source version, such as rebalancing shards: you can rebalance data between your data nodes using logical replication to avoid blocking writes, so basically you can rebalance shards without downtime. There's also multi-user support, tenant isolation for multi-tenant applications, fine-grained control over inter-node authentication, routing internal connections via a connection pooler, and performance optimizations for data loading. And of course, this blog post goes into each of these features in detail as well as how to do the upgrade. But if you're looking for a scale-out solution, Citus appears to be now fully open source. So if you're interested in that, definitely check out this blog post.
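As a rough sketch, assuming the Citus extension is installed and using a hypothetical table named events distributed by tenant_id, working with Citus looks something like this:

```sql
-- "events" and "tenant_id" are hypothetical names for this sketch.
CREATE EXTENSION citus;
CREATE TABLE events (tenant_id bigint, payload jsonb);
SELECT create_distributed_table('events', 'tenant_id');

-- Rebalance shards across worker nodes; Citus 11 can move shards using
-- logical replication so writes aren't blocked during the move.
SELECT rebalance_table_shards();
```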

The next piece of content- "CASE-INSENSITIVE PATTERN MATCHING IN POSTGRESQL". This is from cybertec-postgresql.com. They're talking about searching for text and you want that search to be case-insensitive. You want to ignore the case. So the easiest way to do that is just to lower both the column you're searching and the search string you're receiving so that they will be the same case. Now personally the way I prefer to deal with this is at the application level. If I know I want everything lowercase like an email address, I go ahead and force lower on insert to just make everything lowercase. That way I don't have to worry about a function on the column I'm using and then I don't have to worry about functional indexes. I just have to lower what I've received from the user, for example. But that's the way I prefer to deal with it. But when you do this technique you're definitely going to want to use a functional index.

But he says there are some disadvantages to this, in that it can take a while to convert the whole value to lowercase, so there's a bit of a speed difference. Now, he didn't talk about indexes in this post, so I assume he was doing all of his speed tests without indexes. The next way to deal with it is the citext extension, or case-insensitive text extension. Basically, it's an extension you install, and it gives you a citext data type. Now, he says the performance can get bad for longer values because internally it basically calls lower with a different collation. Also, regular expression matching is not case-insensitive with citext, and you have to use the case-insensitive operator explicitly, but that's not much of a downside.
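A minimal citext sketch, with a hypothetical table:

```sql
-- The citext approach: equality on the column is case-insensitive by
-- itself; the table and column here are hypothetical.
CREATE EXTENSION citext;
CREATE TABLE accounts (email citext UNIQUE);
INSERT INTO accounts VALUES ('Someone@Example.com');
SELECT * FROM accounts WHERE email = 'someone@example.com';  -- matches
```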

The other technique he explored is using case-insensitive ICU collations, but to do this you have to have Postgres built with ICU support, so it won't work if you haven't built it that way, and most installations I encounter haven't. So he explores this, but it actually has problems doing a case-insensitive pattern match; it doesn't really work, and he goes into the explanations of why that is for different languages. He particularly focuses on German here, and it looks to me like a very difficult issue to deal with, so it doesn't look like that's necessarily a great route to go. But then he gets into some performance checks. He looks at queries doing an exact match, a LIKE match, an ILIKE with the changed collation, and what looks like a regular expression check. In pretty much all the cases, citext wins, except for this last one, but it didn't lose by that much. So again, ignoring indexes, it looks like citext may be the more efficient way to go for case-insensitive text, but you can run into an error if you try to use ICU collations for pattern matching. But if you want to learn more, definitely check out this blog post.

The next post- "PostgreSQL Fuzzy Text Search: Not so fuzzy to fuzziest". This is from the blog.brendanscullion.com. He talks about all the different ways to do fuzzy searches from least fuzzy to most fuzzy and he even talks about how to enhance their performance by what kind of indexes you can use. The first example he gives is simple pattern matching. So everybody should be familiar with this using LIKE or ILIKE with a wild card to try and match something in the database. The other way you could do this is by using a regex. So you could use the similar keyword that gives a little bit of a more friendly way to do a regex expression or you could use the Tilde symbol to kind of give the POSIX syntax. Now to improve performance, the only way you can improve it is if you do a prefix search. So basically you're doing a left-anchored search and your wildcard is at the end. That's the only type of search you can get a speed-up for with indexes.

And basically, you use a standard B-tree index, but with the text_pattern_ops or varchar_pattern_ops operator class. Next, he looked at something that's more fuzzy-ish, which is text search vectors. Basically, this is the full-text search feature of Postgres, so you're converting your text to a tsvector and converting your query to a tsquery. It breaks things up into words, potentially simplifying them, which is why he calls it fuzzy-ish, and it does searches to try and find matches between them. Now, of course, with a full-text search, a GIN index is going to give you the most speed-up. You can also use a GiST index, but I tend to use GIN indexes. The next, fuzzier, area is trigrams. Basically, you take your text and break it up into three-character sequences, hence trigram, and you can do matches against that using similarity to find how similar your query is to the stored text.
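Before getting further into trigrams, here's a quick sketch of those first two indexed approaches (all names are hypothetical):

```sql
-- B-tree index with text_pattern_ops for left-anchored LIKE searches.
CREATE INDEX products_name_pattern_idx ON products (name text_pattern_ops);
SELECT * FROM products WHERE name LIKE 'wid%';  -- prefix search can use the index

-- Full-text search against a tsvector, sped up with a GIN expression index.
CREATE INDEX docs_body_fts_idx ON docs USING gin (to_tsvector('english', body));
SELECT * FROM docs
WHERE to_tsvector('english', body) @@ to_tsquery('english', 'postgres & scaling');
```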

There are also word similarity and strict word similarity functions, and you can use GiST or GIN indexes with the trigram operator classes to speed up those searches. The next fuzzier one is Levenshtein distance. It's a different algorithm for calculating similarity: basically, I believe he said, it counts the edits needed to turn one string into another to calculate the distance. But unfortunately, with this one there's no way to get a performance improvement, so it's just going to be slow because no index is possible. The next one, which he calls very fuzzy, is phonetic similarity, using functions like soundex, metaphone, or dmetaphone to look for things that sound similar. You can speed up these methods using a function-based index because they are essentially just functions. So this is a pretty interesting blog post that shows all the different methods you can use for fuzzy or similarity searching in Postgres. So check out this blog post if you're interested.
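A rough sketch of these fuzzier methods, using the pg_trgm and fuzzystrmatch extensions (the table and column names are hypothetical):

```sql
-- Trigram, Levenshtein, and phonetic matching; all names are hypothetical.
CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE INDEX people_name_trgm_idx ON people USING gin (name gin_trgm_ops);
SELECT * FROM people WHERE name % 'jonh';   -- trigram similarity operator, index-assisted

CREATE EXTENSION IF NOT EXISTS fuzzystrmatch;
SELECT levenshtein('kitten', 'sitting');    -- edit distance, returns 3; no index possible
CREATE INDEX people_name_soundex_idx ON people (soundex(name));
SELECT * FROM people WHERE soundex(name) = soundex('Jon');  -- phonetic match via a function index
```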

The next piece of content- "Quick and Easy Postgres Data Compare". This is from crunchydata.com, and they're talking about a technique where you have, say, a production database and a reporting database and you want to make sure that the data between them matches. How could you do that? The technique that they use to do this is actually a foreign data wrapper. So they basically were able to query both of them and put the results in one table. And how they did it is in this one table that's collecting the results. They give a source name that identifies where the data came from either the production database or reporting database and the ID column to identify the row.

Then they take all the data columns that exist in that row, other than the ID or primary key, and hash them into an MD5 hash. Basically, if the hashes are identical, the row is identical, and you can see the results of that here. Where the MD5s are identical, it's an equal value; where they differ, there's a difference in the data. But you can also have instances where a row is not in the target or not in the source, and the post gives you the queries you would run against these tables to set this up. And again, the primary way they're doing this is using foreign data wrappers so that one database can query another. But if you want to learn more about how to do this, definitely check out this blog post.
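As a simplified sketch of the hashing idea (not the exact queries from the post), using a hypothetical orders table and comparison tables:

```sql
-- Hash every non-key column so rows can be compared across databases;
-- "orders" and its columns are hypothetical.
SELECT id,
       md5(concat_ws('|', customer_id, status, total)) AS row_hash
FROM orders;

-- Comparing hashes collected from the two sources (hypothetical table names):
SELECT p.id
FROM prod_hashes p
JOIN report_hashes r USING (id)
WHERE p.row_hash IS DISTINCT FROM r.row_hash;
```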

The next piece of content- "5mins of Postgres E22: Reducing AWS Aurora I/O costs with table partitioning & understand partition pruning". This is from pganalyze.com and he's talking about a blog post that was put on the AWS blog where they were showing how they could significantly reduce Aurora I/O costs by using table partitioning. And I think the costs dropped up to a third. Now, this post actually came out last week, but I didn't cover it. The reason is because at least with I/O, that's kind of the differentiator of Aurora and Postgres. So I didn't see this as something that you could replicate with just open-source Postgres. Now you can definitely do partitioning and it uses partition pruning, and there are definitely advantages, but a little bit of apples to oranges to me. But in this episode, he covers that and he also discusses what partition pruning is, and how it is beneficial if we want to use it with Postgres. So if you're interested in that, you can definitely check out this blog post.

The next piece of content- "Oracle vs PostgreSQL - Transaction control statements". This is from migios.com. If you're looking to convert from Oracle to Postgres, you'll definitely see that the transaction control statements behavior is a little bit different in terms of how transactions begin and are committed and things of that nature. So this blog post goes over those differences. If you're interested in that, definitely check out this blog post.

The next piece of content- "Handle empty strings when migrating from Oracle to PostgreSQL". This is from aws.amazon.com, and they're giving some advice with regard to how to handle empty strings when you're going from Oracle to Postgres. And they say here quote "Oracle databases treat NULL and empty strings as the same...", whereas with Postgres, they're different. So NULL means unknown, and an empty string is a string of length 0, as they say here. So if you want to learn about the differences and how to manage them as you're transitioning from Oracle to Postgres, you can check out this blog post.

The next piece of content: the PostgreSQL person of the week is Vigneshwaran C. If you're interested in learning more about Vigneshwaran and his contributions to Postgres, definitely check out this blog post.

The last piece of content, we did have another episode of The Rubber Duck Dev Show this past Wednesday evening. This one was on Ractors, which are actors for Ruby, so they use the actor model of concurrency. They were introduced in Ruby 3.0 as experimental, and we talk about them and look at their performance. So if you're interested in that type of content, we definitely welcome you to check out our show.
