Schema Change Mistakes, Canceling Statements, pg_rman, Pedantry Removal

In this episode of Scaling Postgres, we discuss mistakes you can make when doing schema changes, how best to cancel statements, looking into pg_rman, and looking at a pedantry removal.

Content Discussed

YouTube Video

Podcast Audio

Transcript

All right. I hope you, your friends, family, and coworkers continue to do well. Our first piece of content is "Common DB schema change mistakes". This is from postgres.ai, and he's talking about mistakes that frequently happen when you're trying to do a schema change to a database. Now, sometimes these are called DB migrations in different application frameworks. But it's basically you wanting to add a table, add a column, add an index, and you're changing the schema in some way. And typically, as he says here, these changes tend to be incremental. So you perform these changes one step at a time. Typically they're reversible, so application frameworks have the ability to allow you to reverse a change after it's done. And then they're also versionable, so you can store them within your source code repository, for example. This is a great post describing different mistakes that can be made.

So he's bucketed these into three primary areas. One is concurrency-related mistakes. Either failing to acquire a lock in a particular amount of time, or you acquire a lock but you leave it open for too long, impacting other transactions. Or you're updating too many rows at once, which can cause locking and other concurrency issues. The other bucket is mistakes related to the correctness of steps or logical issues. So you have a schema change that happens that's unexpected, or you have a mismatch between what the application code expects and what the database expects, or even just unexpected data. Then a miscellaneous where he's talking about exceeding potentially a statement timeout with a migration, or you have vacuum behaviors, or even primary key links. So this post covers about 18 different cases and we'll just run through them quickly.

The first one, case 1, is a schema mismatch. So imagine you go to create a table, but then you get an error that says the relation already exists. So this indicates there's some sort of problem with your process of applying migrations because something has already run create tables or it was done manually. Basically, you would want to try to resolve whatever that issue is, how did this get run before it was supposed to be run? And what you don't want to do is case 2, using IF [NOT] EXISTS, because you could use this. But then the problem with this is that it silently fails and you don't know that you have this problem that caused this. Something else created that table before your schema change ran. So basically, the appropriate way to resolve this is to find out what caused this to be run in advance or if it was run manually, and to remedy that issue as opposed to trying to use IF [NOT] EXISTS.

Case 3 is hitting your statement timeout. And he says typically you see this if you run in a testing environment that's not a replica of production. You can be under a statement timeout, but once you move to production, sometimes this gets hit. So you want to do schema changes that you know are going to spend a little bit of time in a production-like environment, like having a copy of the production database in some way to be able to test things like that out. They're talking about their product, actually, the Database Lab Engine, which allows you to do thin clones of a database for this type of testing. Case 4 is an unlimited massive change. So maybe you're deleting or updating a lot of data in a table and like this query here, updating took 76 seconds to do. So that's way too long that you'll want to hold locks on a system, typically. And he says some of the problems that could happen, you could run into locking or contention issues just by the number of rows that are being updated.

Your checkpointing could happen super frequently given the updated amount of data that needs to be done and that could actually lead to bloat issues. So how you would typically want to update a lot of data at a time is to do it in batches and he recommends limiting the actual active work to about 1 second. So maybe you would put this in a loop, do some work for 1 second or approximately that amount of time, then take a break and then apply the change again. And then what you could also do is vacuum between changes as well to make sure that you're avoiding bloat issues and even doing analysis to make sure that your statistics are kept up to date. Case 5 is to acquire an exclusive lock and wait-in transaction. So basically you're doing an alter table command, adding a column, but then you do something else that causes a long delay and basically, you want to avoid this if you can. Anything that requires an exclusive lock because this is a very fast operation, but it requires an exclusive lock, so you want to keep that separate.

So basically, break out your migration. Usually, it's always better to do complex operations in terms of multiple steps, particularly since one step can take a long time to do. And they're using an example here in case 6 where, say, you're copying a lot of data after you add a column. So you would want to break those steps out and do the COPY command. Maybe you wouldn't want to do that in batches as well. Case 7, waiting to acquire an exclusive lock for long blocking others. So you don't want to run into a lock queue where things are waiting for a lock on a table and then you have other locks that are waiting for that lock behind it. So one way to avoid that for your schema migrations, what you want is to apply a lock timeout and a statement timeout as well. But basically, the lock timeout should avoid a queue of locks building up. So basically, if a statement is unable to acquire a lock within the lock timeout, it'll go ahead and cancel it. So a canceled migration is usually better than one that would cause locking issues.

Case 8 is the careless creation of a Foreign Key. So basically it needs to check for consistency in the foreign table to make sure that all the values are there. So the best approach to this type of migration is to do it in two steps. The first step is to create the Foreign Key with the not valid option. That creates the foreign key it's present but all the historical values are not valid yet or haven't been checked. And then as a second step, you do alter table and validate that constraint and that shouldn't necessarily block but it will validate that constraint and make sure everything's good. Case 9 is the careless removal of a foreign key and basically, the problem you can run into here is if you're not using a lock timeout. Generally, you would always try to use a lock timeout for so many of your migrations other than when you're creating indexes to avoid any kind of lock queue issues. Case 10 is the careless addition of a CHECK constraint.

Again, this is similar to the foreign key adding and what you would want to do is create the constraint as not valid and then as a second migration or a second step, alter the table, validating that constraint and that'll allow you to add that check constraint to the table. Case 11 is a careless addition of NOT NULL. So NOT NULLs are a little bit different in that they can lock up the table to verify things are not null and there are a couple of different ways to handle it based on your version of Postgres. Before 11, you don't really have a good way of dealing with it and maybe you would want to go with a CHECK constraint instead of NIT NULL on the column. But if you're at 11, there's a way to do that and then if you're on version 12, the easiest way to deal with it is you go ahead and create a CHECK constraint that's NOT NULL and you can create it as not valid and then validate that constraint. But once that constraint is in place, you can then set NOT NULL for the column and version 12 and higher will allow you to apply that relatively quickly.

Case 12, careless change of columns data type. Now there's also mentioning renaming objects, so I think these are related. So careless change of a column's data type or renaming the objects. You want to do something similar in this case that he recommends and that is actually creating a new column, writing to both columns, probably using a trigger, backfilling the data in batches, switching your application to use the new column, and then dropping the old column once it's fully done. So I would do that for both the renaming objects and a need to change the column's data type. So case 13 is careless CREATE INDEX. Basically, you never want to CREATE INDEX, you always want to do CREATE INDEX CONCURRENTLY. Similarly, with DROP INDEX, you want to DROP INDEX CONCURRENTLY because that avoids locking the table during the index creation.

So case 16 is adding a column with DEFAULT. The great thing about this is that if you're on Postgres 11 and later, this works without any issues. Case 17 is leftovers of CREATE INDEX CONCURRENTLY. So this is a downside of CREATE INDEX CONCURRENTLY is that if it runs into issues or has a failure, it will actually leave around an invalid index and you can actually inspect the table and you can see the index is invalid. And if that's the case, you need to drop the index concurrently and then add it back concurrently. Case 18 is four-byte integer primary keys for large tables and basically recommends using the eight-byte integer types for those. But this is a great post, covers mistakes related to database schema migrations and gives very good recommendations on how to get around those issues.

The next piece of content- "HELP, I CANNOT CANCEL A POSTGRESQL QUERY!". This is from cybertec-postgresql.com. And basically, he's saying when you're running a query, there are times where you can't cancel it. So what's the problem with that? And he says there is a secret key that's sent with a new connection that you can send a cancel request using the library. I actually haven't used that. What I typically do because I'm the sysadmin is I use pg_cancel_backend. And if that canceling of the statement doesn't work, then resorting to terminating the whole backend by using pg_terminate_backend and you just put in the process ID of whatever process you want to terminate. Now the cancel backend sends a SIGINT signal and the terminate backend sends a SIGTERM signal. Now, it doesn't immediately just stop the process or cancel the statement, it actually finishes work before it cancels or terminates.

And in the code, he says it actually looks for a check for interrupts to check for these, and then it goes ahead and finishes its work. Now some of the reasons that you could get an issue where a query doesn't get canceled is that execution gets stuck in a loop that doesn't contain this check for interrupts. And he says this would basically be some sort of a Postgres bug. The other possibility is you're stuck in a third-party C function that doesn't have the ability to give control back. And basically, this would indicate another bug in this particular function. And then third, it's stuck in a system call that cannot be interrupted. So this may be some sort of problem at the operating system or hardware level. So this kind of example should be relatively infrequent where you can't cancel it, I would hope. Now, he does say do not use kill -9 unless you're desperate.

Because of the issue, when you go to use a kill -9 on a process, that process has not had the opportunity to clean itself up. He says the postmaster detects if one of its child processes did not shut down cleanly, and then it kills all other Postgres processes and initiates a crash recovery. Depending upon when the last checkpoint was, that could be a fair amount of time to bring the database back up. So you never really want to do this even for just a Postgres process, unless you absolutely have to. And never, never kill the postmaster process, the main Postgres process. Now, he did have a trick here to cancel a stuck query without crashing the server, in that he created a custom C function that basically did an infinite loop.

And then he called that function and then he tried to cancel it. It didn't work. He tried to terminate the backend. That didn't work either. What he actually did to that process was he attached a debugger and sent a ProcessInterupts command, and that process for interrupts command enabled that process to stop. Now, how you can create an infinite loop within a C function is you can actually use that CHECK_FOR_INTERUPTS function as long as you have it in there, then you can actually cancel this query at some point because the system will check for any interrupts that are happening. But this is an interesting blog post and if you want to learn more, definitely encourage you to check it out.

The next piece of content- "How to do backup and restore for PostgreSQL using pg_rman". This is from highgo.ca. Now, from my previous work with Oracle, I know Rman is their main backup and recovery tool, and I haven't really used pg_rman for Postgres, but apparently, it's very similar. I guess the DSL that it uses is similar to Rman, but this post describes how to use Rman to do a backup and restore of a Postgres system. Now, I didn't see a lot in here, nor on the pg_rman site itself, about what are the advantages or disadvantages relative to something like pg_base_backup, because that's what I tend to use. But if you come from Oracle and you're looking for a utility similar to Rman, maybe check out pg_rman.

The next piece of content- "Removing some pedantry off Postgres". This is from mydbanotebook.org, and they're talking about an issue where when you're doing a SELECT and you're looking for a count of all rows from a table, they say that there's no real reason to use count (*) instead of just counting, because when you use a row number, you don't have to use an asterisk. An asterisk means something entirely different when you're selecting all rows from a table, like, it doesn't mean SELECT ALL rows necessarily. So basically they think it doesn't really need to have this asterisk. So if you run a select count without the asterisks from the test, you actually get an error. So what they did is they went in and made a patch for Postgres so that they removed it so that you can do a select count without the (*). So if you want to find out more about how they did this, you can definitely check out this blog post.

The next piece of content- "5mins of Postgres E19: Speeding up sort performance in Postgres 15". This is from pganalyze.com. And they're talking about enhancements to Postgres 15 that have improved sort performance. Now, we covered a similar post posted to the Citus Data blog last week. This one covers one that was done in the Microsoft blog, but it's pretty similar in terms of the content. But if you want to learn more about PGAnalyze's perspective on this post, definitely encourage you to check it out.

The next piece of content, the PostgreSQL person of the week is Masahiko Sawada. If you're interested in learning more about Masahiko and his contributions to Postgres, definitely check out this blog post.

The last piece of content, we did have another episode of The Rubber Duck Dev Show this past Wednesday evening. This one was on "Typed or Untyped Ruby". So we were talking about the advent of static analysis tools for Ruby 3, but in general, we were also talking about static vs. dynamic typing. So if you're interested in that type of content, we definitely welcome you to check out our show.

Schema Change Mistakes, Canceling Statements, pg_rman, Pedantry Removal | Scaling Postgres 217

Join Over 1,000 Engineers & Get New Episodes Weekly!

Content Discussed

YouTube Video

Podcast Audio

Transcript