
Faster Copy, psql Variables, backup_label, Bad Encoding | Scaling Postgres 268


In this episode of Scaling Postgres, we discuss faster COPY in Postgres 16, how to use psql variables, the importance of backup_label and issues with bad character encoding.


Transcript

All right. I hope you, your friends, family, and coworkers continue to do well. Well, ask and you shall receive. On last week's episode of Scaling Postgres, I brought up that Postgres 16 Beta 1 was released and that the release announcement said COPY was 300% faster in some cases. And I was like, 'What is this? I haven't heard of this enhancement,' because it's something that I really need, and I asked in last week's episode if anyone knew anything about it.

Well, lo and behold, Lukas's "Five minutes of Postgres" covers this exact topic. So he gets first position this week with this post- "How bulk loading with COPY can be 300% times faster in Postgres 16 Beta 1". This is from pganalyze.com. And of course, watch the video here. But basically, what he was saying is that there was an item in the release notes, which is why I couldn't see any reference to COPY when I scanned them, to quote "allow more efficient addition of heap and index pages". So this is about expanding the size of a file as you're adding data, which COPY inevitably does.

It allows that to happen much faster, and he discusses how this was kind of a side detour, an enhancement they could make along the way while working toward async I/O benefits, which may potentially land in 17. But the particular bottleneck was related to relation extension locks, that is, locks placed on the relation, the table, while it's being extended. There was a more efficient way to do it and that's what the patch, or I believe there were two patches, did.
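For context, the kind of bulk load this speeds up is an ordinary COPY (or \copy from psql); here's a minimal sketch, where the table and file names are just made-up examples:

    -- hypothetical table and CSV file; the enhancement applies to any large COPY
    CREATE TABLE measurements (id bigint, reading numeric, taken_at timestamptz);
    \copy measurements FROM 'measurements.csv' WITH (FORMAT csv, HEADER true)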

Now he says not everyone's going to benefit from this and it depends on how much you're being held back by this lock. So if you look in pg_stat_activity, look for the Lock wait event extend, and if you're seeing that, presumably this enhancement will allow COPY to load your data faster. So definitely thank you for this. Great piece of information and definitely watch his episode or check out this post if you want to learn more.
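As a rough sketch of that check, using the standard pg_stat_activity columns:

    -- sessions currently waiting on a relation extension lock
    SELECT pid, wait_event_type, wait_event, state, query
    FROM pg_stat_activity
    WHERE wait_event_type = 'Lock'
      AND wait_event = 'extend';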

Next piece of content- "Variables in psql, how to use them?". This is from depesz.com. And I actually used PSQL variables this week and I was pulling my hair out trying to figure out some things that this post covers. So if you want to use variables and PSQL, definitely keep this as a reference because this is very comprehensive. He goes in-depth on all the different ways that you can set up and use variables. And basically, the simplest way to do it is you just use the "-v" command and you say this variable equals this value.

Now he says you can also use "--set" and "--variable" as the longer parameter names. But one thing he also mentioned that you should be aware of is that variables don't work when you're trying to just run a command. Like if you use "-c" and pass a command to psql, the variables don't work. Which I was pulling my hair out about. He also shows an example where you can capture a query's result into variables using \gset, and you can set variables in the psql session itself. So this is a really comprehensive post on using variables in psql, so I would definitely bookmark this and keep it as a reference.
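As a small, hedged example of the basics covered there (the variable, table, and file names are made up here):

    # pass a variable on the command line; note that -c does not interpolate variables
    psql -v user_id=42 -f report.sql

    -- inside report.sql, reference it with a colon (use :'user_id' when you need it quoted as a string)
    SELECT * FROM orders WHERE customer_id = :user_id;

    -- set a variable inside a session, or capture a query result with \gset
    \set batch_size 1000
    SELECT max(id) AS max_id FROM orders \gset
    \echo :max_id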

Next piece of content- "Do I Really Need That backup_label File?". This is from rhaas.blogspot.com. And the backup_label file is something that gets creative for you automatically when you're using a pg_basebackup. But if you use pg_backup_start and pg_backup_stop to do your backup, you need to retain the information that pg_backup_stop returns and write it to a backup label file so that you can have it during the restore.

Now he said there are some conditions where you won't even need the backup_label file. Things will restore without issue, and he goes over instances where there are some differences but they really don't impact your ability to restore the database. But when you have a lot of activity and multiple checkpoints occurring during the backup, that's where you can run into problems with recovery.

And primarily, it relates to knowing where to start the recovery from. So the backup_label file tells Postgres at what point in the WAL to start recovery. Otherwise, it relies on the control file, which is not reliable for this purpose. It can work in some cases, as he demonstrates here, but it can cause you to run into problems. So definitely, you're going to want to retain that backup_label file if you're using the start/stop backup technique. But if you want to learn more, check out this blog post.

Next piece of content- "ERRO: INVALID BYTE SEQUENCE- FIX BAD ENCODING ON POSTGRESQL''. This is from cypertec-postgresql.com. And I have to say, time zones don't bother me that much. Cashing doesn't bother me that much. But bad encoding, or encoding problems, are the bane of my existence. I hate dealing with these types of issues. But this post goes into different encodings and explains it. So basically, there is a server-level encoding that is set at the database at the point of creation and you can never change it. Because otherwise, you'd corrupt your text strings in the database. There's also a client encoding that you can set when you connect and what it does is it translates those into the server encoding as it's working with the data.

But usually for consistency, I like keeping them the same. He sets up some test data because, of course, he's going to show you different ways of producing corruption. Now in the example here, showing how you can see some of these errors in Postgres, he took a latin1-encoded file and tried to load it into Postgres using a UTF8 encoding. And it actually gives you an "invalid byte sequence" error for that encoding. So as he says here, being this strict about data integrity is super good, compared to some other database systems he mentions. So that's good, but you can still get corruption occurring if some of the encodings are similar for certain characters.
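Roughly, that failing case looks like this sketch, with a hypothetical table and a file that is really latin1 on disk, loaded into a UTF8 database:

    -- client encoding says UTF8, but the file's bytes are actually LATIN1
    SET client_encoding TO 'UTF8';
    \copy customers (name) FROM 'latin1_names.txt'
    -- ERROR:  invalid byte sequence for encoding "UTF8": 0xfc  (the byte shown depends on the data)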

He then switched it to using latin1 and it was actually able to insert this UTF8-encoded data. But if you look at it, it is different from what you would expect in the database. As long as you pull it out with the same encoding, it'll translate back fine, but that can still cause you some issues. And then he goes into ways that you can fix bad encoding. Basically, it comes down to taking a dump of the data, or maybe a COPY of it, and loading it back in with the correct encodings. So if you want to learn more about encoding, definitely check out this blog post.
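Very roughly, and with made-up database and table names, that repair amounts to pulling the text out under the encoding it was actually stored with and loading it back under the right one:

    # dump the data in the encoding it was effectively stored with
    pg_dump --data-only --table=customers --encoding=LATIN1 olddb > customers.sql

    # load it into a correctly configured UTF8 database
    PGCLIENTENCODING=UTF8 psql newdb < customers.sql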

Next piece of content- "PgBouncer 1.19.1 is released". This is from pgbouncer.org. And this is a minor release, but it includes, quote, "two important bug fixes to avoid rare crashes", and these crashes weren't introduced in 1.18. So again, if you're running 1.18, you'll probably want to upgrade to this as soon as you can.

Next piece of content- "POSTGRESQL FOR THE SQL SERVER DBA. TRANSACTION ISOLATION, AND TABLE BLOAT". This is from softwareandbooze.com. And even though it says this is for the SQL server DBA, it also just means anyone who's relatively new to Postgres. There's not a lot of reference to SQL server in here. But he talks about Postgres's transaction isolation and with that, you're definitely going to cover the multi-version concurrency control. And he has pretty quick points to make with regard to it. Number one- PostgreSQL creates new rows for updates, always.

Number two- The old rows eventually become "dead" and their space needs to be cleaned up. Number three- The VACUUM process frees up the space and keeps tables in tip-top shape. He doesn't go into details about that here. Number four- Vacuuming the database and tables is so important that the autovacuum daemon should always be running. Number five- As part of the autovacuum process, the table is analyzed to update statistics. Now of course, if things don't get vacuumed up, you end up with table bloat.

So there are too many dead rows in the table that haven't been cleaned up. A lot of times, you cannot actually remove that space from the file, and the only way to shrink it is to rewrite the file. Then he goes into all sorts of different autovacuum settings and even tuning individual tables' vacuum settings, again, to keep bloat under control. But definitely check out this blog post if you want to learn more.
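As a hedged sketch of the kind of per-table check and tuning he walks through (the table name here is hypothetical, and the values are just examples):

    -- how many dead rows are lingering, and when autovacuum last ran
    SELECT relname, n_live_tup, n_dead_tup, last_autovacuum
    FROM pg_stat_user_tables
    ORDER BY n_dead_tup DESC;

    -- vacuum a busy table more aggressively than the global defaults
    ALTER TABLE orders SET (
        autovacuum_vacuum_scale_factor = 0.05,
        autovacuum_vacuum_threshold = 1000
    );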

Next piece of content- "SQL:2023 is out". This is from peter.eisentraut.org. And the new standard is out and he has hyperlinks to all of the different parts of the ISO website. And he says here, quote "Note if you want to purchase a copy, you'd probably only want part 2, unless you intend to work in particular areas…". So you can check this post out if you want to learn more.

Also from peter.eisentraut.org is "PostgreSQL compile times". Now, the findings aren't really Postgres-specific, even though it's Postgres they're compiling, but it's just interesting to look at the comparison between macOS and Ubuntu, and the different compiler versions, to see how fast you can make things. And definitely, the thing that improved performance the most was doing things in parallel. So you specify the number, I don't know if he uses processes or threads, to do the compiling, but check out this post if you want to learn more about that.
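For what it's worth, the parallel build he's referring to is just the standard make job flag; a minimal sketch for a Linux box:

    # build Postgres with one make job per CPU core
    ./configure
    make -j"$(nproc)"
    # on macOS, $(sysctl -n hw.ncpu) plays the same role as nproc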

Next piece of content- "New Site, New Partman". This is from keithf4.com. And he's actually the creator of PG Partman, so this is a tool to help you maintain your partitions in Postgres. And this is an announcement for version 5.0.0 of PG Partman. And it looks like as part of this release, he's dropping the older, trigger-based partitioning. So he's just going to be using the declarative partitioning and helping you keep those tables up to date and remove ones that you no longer need. I haven't personally used PG Partman, but if you're interested in this tool you can check out this blog post.

Next piece of content- "pgBackrest SFTP support". This is from pgstef.github.io. If you're interested in learning more about this enhancement, you can check out this blog post.

Next piece of content- "Disaster Recovery for PostgreSQL on Kubernetes". This is from percona.com. And this post is talking about the Percona operator for PostgreSQL that Percona develops, and they're showing you how you can use that for disaster recovery purposes. So if you're interested in that you can check out this blog post.

Next piece of content- "SVG Images from Postgres". This is from crunchydata.com. They've actually created a library that they're calling pg-svg that is basically just a set of functions that you query against your geospatial data or, I think, PostGIS data. And it will actually create SvG images for you. Then, of course, you can use it on the web. So it looks like they show an example here and show different ways that you can use these new sets of functions. So if you're interested in that, you can check out this blog post.

There was another episode of Postgres FM this week. This one was on "Parallelism", which covers basically all the different parallel operations that you can do in Postgres. So if you're interested in that, you can listen to the audio episode or click to watch the video version.

Next piece of content, the PostgreSQL person of the week is Carlos Chapi. If you're interested in learning more about Carlos and his contributions to Postgres, definitely check out this blog post.

For the last piece of content, we did have another episode of the Rubber Duck Dev Show this past Thursday afternoon. This one was on "The Making of Avalon in Dragon Ruby with James Stocks". So we get to talk with James about how he's starting to build his game, called Avalon, in Dragon Ruby. So if that type of content is of interest to you, we welcome you to check out our show.
