Index Corruption From ICU Collation Change

In this episode of Scaling Postgres, we discuss index corruption from a ICU collation change, another caveat for PgBouncer prepared statements, ways to version data, and using Postgres as a cache.

Content Discussed

YouTube Video

Podcast Audio

Transcript

Do you know at the end of the movie Revenge of the Sith, where Obi-wan defeated Anakin and he had said that he was supposed to destroy the Sith, not join them? Reading this first blog post gives me memories of that, but I hope you, your friends, family, and coworkers continue to do well. So our first piece of content is "The collation versioning problem with ICU 73". This is from postgresql.verite.pro, and the reason why I said this reminded me of Revenge of the Sith is because the ICU collations were supposed to deliver us from issues with GlibC changing when we upgraded operating systems. So this was supposed to be a way to prevent collation changes from breaking our indexes and potentially corrupting our data. Unfortunately, in ICU 73, there's a bug due to a quote "...uncommon move..." where they "...changed the root sort order without upgrading to a whole new Unicode version". So of course this can break sorting in Postgres and some warnings that Postgres has set up to detect this.

They're counting on these version number changes to put warnings in the logs that say something has changed. Like, for example, right after connecting, it does a warning database, whatever the database name, has a collation version mismatch. Or when you're using the collate clause, it says collation and the collation you're using has a version mismatch. And basically, if you ignore these warnings, you can end up with data corruption. As he says "...silent data corruption". Now he shows an example of what can happen. So for example, in ICU 72, when you do a sort in this specific way, you can see A and B are ahead of A and 2. But when you sort it in ICU 73, A,2 is ahead vs A, B, but the versions are exactly the same, 153.120 and 153.120.

So if there's a sort change, these versions should change as well. So that way you know you need to essentially reindex your indexes to avoid any kind of data corruption. Now, we took a look at this on Fedora 38, created a database, put some words in, did an upgrade, and it did go from ICU 72 to 73. The issues were there was no warning about the version mismatch. When searching for certain specific existing values within the index, they were not found. When searching for the same specific values without the index, just looking at the heap, they were found. So they were there, but that wasn't reflected in the index due to the correlation mismatch.

When checking with the B-tree index checker from amcheck, corruption is detected. So he shows all of that here. Now, the good news is that ICU 74, the next version, is at the RC stage or the release candidate stage, and it should be released in the next few days. So maybe people will never see this upgrade to ICU 73, although it could of course happen. But just know if you intend to upgrade to ICU 73 or you've already done it, you need to go ahead and do a reindexing. He also adds at the end here, that if you're using bytewise sorts, so a C, POSIX, C.UTF-8 for your collation, you shouldn't have to be concerned about these upgrade difficulties. But he says he's going to release a blog post about this in the future. So definitely something to be aware of if you use ICU correlations with your Postgres database.

Next piece of content- "Prepared Statements in Transaction Mode for PgBouncer". This is from crunchydata.com and this follows up on the topic we discussed in last week's Scaling Postgres episode with the release of PgBouncer that supports prepared statements, specifically version 1.21, and he reiterates this release and its importance. Basically, PgBouncer can now support prepared statements in a transaction and statement pooling mode. Prepared statements basically mean the query only has to be parsed once and the query only has to be planned once and basically it can be executed many times at that point. Now, because of this, a lot of the performance improvements you'll see are going to be in the reduction of the planning stage. For example, he did a test here where he joined a table to itself about 14 times, making the planner do a lot of work to figure out how to run this query.

You'll see, the planning time is 170 or so milliseconds for the planning, and even though the execution time was much shorter, it on average looks like twelve milliseconds or so. But if he prepares this statement and then executes it multiple times, you could see during the first execution it's still around 170 milliseconds for the planning time, but the planning time thereafter is like two-hundredths of a millisecond. So super fast, hardly any work needs to be done. So that's the major benefit of the prepared statements. He discusses the importance of it working in transaction mode because that's how most people set up PgBouncer to be able to maintain more than one connection from your app to a single Postgres server connection. Basically, it allows you to minimize the number of server connections you're using and how when you're running in transaction pooling mode, on the old version you would get errors like this "ERROR prepared statement does not exist", whereas after the upgrade to the new version of PgBouncer, you get success and no such error.

Now, he does mention a caveat we mentioned last week in that this only supports protocol-level prepared statements, it doesn't support SQL level prepared statements. So what you saw above where he was doing a prepare that wouldn't work. So the library you're using needs to use protocol-level statements. But the second caveat that was not mentioned last week is the potential issues with deallocation. So apparently the protocol supports destroying prepared statements, but the library that most people use to connect to Postgres. Even if they're using Ruby or Python, usually it's still going through libpq. This library or interface doesn't support preparing closed statements until Postgres 17, which is a year away and what he says is quote "Hence, many drivers and programs will resort to doing a manual DEALLOCATE".

So basically these libraries will have to be adjusted. He mentioned he did do a patch to the Perl Postgres library to offer an option to PG Skip deallocate, because without this, apparently he sometimes got a destroy failed error prepared statement. A reference to it does not exist. So definitely, even though PgBouncer supports this, you need to make sure that whatever library you're using in whatever language is using protocol-level prepared statements and doesn't have this deallocated issue or it has a way to get around it. But if you want to learn more, definitely check out this blog post.

Related to that is pganalyze's "Five minutes of Postgres" where they cover this topic as well. "PgBouncer 1.21 adds prepared statement support and transaction mode".

Next piece of content- "Versioning data in Postgres? Testing a git like approach". This is from specfy.io and he's talking about the desire to do data versioning and he kind of tried to take it from an approach of how git works. Now, in the first part of the post, he's talking about and explaining how git works to help you kind of understand what he's going for here. But I'll jump down to the meat of the blog post down here and what he wants to set up is say, a post table. So blog posts and it's associated with a blobs table. So this is where all the versions essentially reside. So if you're going to be updating a post, all the content will be added to this blob and then you simply change the blob ID reference to that newer version of the blob. And he did want to support being able to walk to see what the previous version was. So he is including a parent ID here. So that's how we want to do his versioning. So what that means when you create a row, you insert the content into the blob and he's using JSON to store a lot of the data. For example, he just has the title here.

Then once you know that blob ID, then you can insert it into the Post table and use that blob ID to be able to identify which version to use. If you update a row, you essentially insert into that blobs table the new version, and then you update the version that's referenced in the post table using this command here. If you revert a row, he actually has two different ways to do it. One, you just, in the post table, point to the original version or the previous version. Another option is to actually copy the original blob and then point to that. I don't know if I'm necessarily a fan of that, but that's possible to do. In terms of deletion, he said you could do soft deletes using a deleted column in the table or do a soft version delete where you're essentially doing the same thing in both tables.

Or a hard delete where basically you insert a version stating that it's been deleted and then remove the reference in the post table. Oh, and then a full hard delete where you just delete both apparently. And he has a query on how you can list the versions. Then he actually wanted to talk about branching where you essentially queue up multiple blobs or multiple blog posts and have them associated with one branch that I guess you can commit all at one time into your posts. To me, this seemed to push the git analogy a little too far. I would have probably kept it more simple and not have this table. But he did say the biggest drawback of this solution is dealing with schema drift. So as the schema changes over time, how are you going to be able to handle all of these different versions and migrate between them?

He said he dealt with it by having an array of ignored fields, which seems a little complicated to me. As I was thinking about this, looking back at his original simplified schema with just a post and a blob table. Personally, I think what I would do is just have a post table and a post versions table, that way I wouldn't have to deal with the JSON in here. So I think that would probably help to minimize schema changes because you can just do standard migrations of this post versions table. I could also imagine a scenario where you can just do the versioning in one table and have different columns define which is the active version. But I thought this post was pretty interesting and you can check it out if you want to learn more.

Next piece of content- "You Don't Need a Dedicated Cache Service- PostgreSQL as a Cache". This is from martinheinz.dev and basically, he wants a cache that supports expiration. So basically, data will expire in the cache. Eviction, so basically data can be evicted maybe if it reaches a certain size when the cache is full. Support invalidation, so overwrite data when it changes. Something that's performant, doesn't have persistence and a key value stored. Now how he's designing this cache is basically he's using an unlocked table, so he's avoiding writing to the WAL. That should give you a lot more performance, but you still have to deal with vacuum. So if rows are going to be highly updated, you're still going to be fighting vacuum with this. So I don't know how well that would work. In terms of expiration, he basically says you need to set up a delete process that gets removed after a particular retention period. So basically put that in a Cron job. He's using a pg_cron extension here instead. He said you may be able to do it with triggers.

I don't necessarily want to do that because that would just slow things down more. He didn't go too much into eviction, but that would require tracking a last_read_timestamp and essentially being updated after every select, which now you're doing an update with every select, which is going to slow things down a little bit more. So I think the performance is where you're going to be hurting trying to do this. And he mentions there are also downsides, one of them being performance. So I thought this was an interesting thought experiment, but I don't think I'm going to be giving up things like memcache any soon for my application cache. But if you need cache-like storage, unlogged tables are a good candidate for that. Just keep in mind they do get erased if your database crashes. It's not crash recovery proof because you're not using the WAL. So just make sure you don't put anything in there that you don't want to lose. But check out this blog post if you want to learn more.

Next piece of content- "Using pgactive: Active-active Replication Extension for PostgreSQL on Amazon RDS for PostgreSQL". This is from aws.amazon.com and this is an extension on RDS. But I looked for a reference that it was open source but I couldn't find it. It says "...pgactive is based on the open-source BDR project", which is a bi-directional replication that I think Second Quadrant originally did that is now a part of EDB. But looking through this, I don't see that it's open source, so it may only be available on RDS. So if you're looking for that you may want to check out their service. Or also the new Active-Active Logical Replication that's available in Postgres 16. I don't know how ready it is for primetime and production workloads yet, but that's a new feature that's here today in the most recent version of Postgres.

Next piece of content. There was another episode of Postgres FM last week. This one was on "Over-indexing". So whereas most of the problem is not having the correct or the right indexes on, it can also be a problem if you're putting on too many indexes. Now the most obvious problem is you're going to be slowing down inserts and updates because all those indexes need to be kept up to date as data is added. But they also mentioned a few other problems that you can encounter. You could actually slow down selects because of increased planning time. Again, like we saw on the post earlier where someone did 14 joins and how the planning time got really long. Well, the same thing can happen with too many indexes that might be covering the same row, so it has to spend more time planning out what it needs to do.

They also discussed the lock manager and how it has a I'll call it magic limit of 16 and once you exceed 16 objects that need to be locked as part of a query, in terms of a shared access lock, you start to slow things down. It's no longer, I think, a fast-path query. So you can get to this number of 16 by having too many indexes on a table or you're joining multiple tables with a lot of indexes, or even having a partition table where you're not pruning down to a single partition. But the more partitions you're touching and joining to other tables and having multiple indexes, you could potentially hit that 16 really quickly. Now they did say most of the time you're not going to see this because you actually have to have lock contention with the same query being run many, many times. And a lot of the frequent queries aren't going to be all these complex joins happening at the same time, but it's just something to be aware of and to, again, keep your indexes set at the appropriate mount for your workloads. But I found this to be a great episode. You can listen to it down here or you can check out their YouTube video here.

Next piece of content- "Animize dump of your Postgres data". This is from tembo.io and they're talking about using the pgcrypto extension along with the postgresql_anonymizer extension to anonymize your data that resides in Postgres. Now, the scenario they're doing here is they're saying you have a Postgres database with data you want to anonymize and he wanted to bring it down to a local system, anonymize it, and then you can use it at that point. Personally, what I would do is I'd probably put a separate Postgres instance in the production area. Due to this process, he describes here, anonymizes it on that production instance in the secure area and then downloads it to a local system. But talk to your security engineer about that. But once he gets the extension set up in the database, he just loads the data into local database he created, sets up the extension and defines the columns he wants to anonymize and then goes ahead and anonymizes the database. So if you're interested in that, you can check out this piece of content.

Next piece of content- "Using psql Variables to Introspect Your Script". This is from fluca1978.github.io and you can use variables in psql and they have the format that looks something like this. You can do a gset to set them and then you can reference them using this syntax here. So if that's of interest, you can check out this blog post.

Next piece of content- "Human/version sort in PostgreSQL - revisited". This is from depesz.com. In here, he's talking about ICU correlations and he wants to be able to sort this sample data in a way a typical human might do it. So basically, he used a syntax down here to set up what he called a human_sort and then applied that collation as part of his query. And now it sorts the way one would expect. So definitely feel free to review this post if you want to understand 100% of what he's doing here to get this collation sort the way he wants it to work.

Next piece of content- "A quick glance at pg_basebackup compression". This is from highgo.ca he's talking about in newer versions of Postgres, starting from 15, you can now add a compression option and specify whether you want to tap it on the server or the client. It also can support more than just gzip. Like. Personally, I really love using zstd. Now, what this basically does is determine which system has the encryption load. Is it the client or is it the server? So based upon how you want to do your backup, you can define either to have the Postgres server database to it or whatever client you're taking the backup from. And they show an example of it working here. Now, I actually haven't tested this yet, but I wonder if it uses all the cores with an example like zstd. I know what I've typically done is take pg_basebackups without using the compress option, and I pipe it into zstd or pgiz, which is a parallel gzip to use all the cores of the machine. But if you want to learn more, you can check out this blog post.

The last piece of content- "Quick Overview of PostgreSQL's Table Access Method". This is from highgo.ca. This is just a quick summary of the table access method in Postgres and this is the interface that's used to define additional ways of storing Postgres data. Essentially different storage engines. And it's what, as he says here Orioli DB is using to develop their custom engine that's trying to do things like support update in place and a redo log. So if you're interested in that, you can check out this blog post.

Index Corruption From ICU Collation Change | Scaling Postgres 288

Join Over 1,000 Engineers & Get New Episodes Weekly!

Content Discussed

YouTube Video

Podcast Audio

Transcript