
Postgres 15 RC1, ICU Collations, Listen & Notify, Understanding TOAST | Scaling Postgres 235


In this episode of Scaling Postgres, we discuss the release of the Postgres 15 Release Candidate 1, new ICU collations features, how to use listen & notify and understanding TOAST.


Transcript

All right. I hope you, your friends, family, and coworkers continue to do well. Our first piece of content is "PostgreSQL 15 RC 1 Released!". This is from postgresql.org. So this is the first release candidate, which means we're really close to releasing Postgres 15, and they're projecting to do it in basically two weeks, on October 13, 2022. Most of the changes at this point are bug fixes. Now is a good chance to try out this release if you're planning to upgrade as soon as it comes out.

The next piece of content- "ICU features in PostgreSQL 15". This is from peter.eisentraut.org. So he says that since version 10, Postgres has had ICU support, but only for specific objects, not at the whole database level. Now this is possible with Postgres 15. So when you create your new cluster using initdb, or your operating system's packaging of it, you can specify the locale provider to be ICU, and then you can also specify the ICU locale you want to use. So this is great and it should make it a lot easier for people to start using ICU collations. Now, there is an objective to eventually make ICU collations the default as opposed to the libc collations. But there are still some places in the Postgres code where you need to define the default locale using libc as well, and he talks a little bit about that here.
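To make that concrete, here's a minimal hedged sketch of both forms in Postgres 15 (the database and locale names are just examples, not taken from the post):

    -- At cluster creation time:
    --   initdb --locale-provider=icu --icu-locale=en-US ...
    -- Or per database (note that a libc locale still has to be defined,
    -- which is the remaining dependency he mentions):
    CREATE DATABASE mydb
        TEMPLATE template0
        LOCALE_PROVIDER icu
        ICU_LOCALE 'en-US'
        LOCALE 'en_US.utf8';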

Now, the reason why this is particularly important is that when you do version upgrades of the operating system on which your database runs, the libc version typically changes, and there can be hidden collation changes that essentially break the indexing of your database. That could cause rows to appear to be missing or unique indexes to not operate properly, so it can cause a big mess. Basically, if you're changing operating system versions, which typically implies a libc version change, you have to reindex all the text-based indexes to handle that. And that pretty much includes every non-number-based index, because even some JSONB-type indexes, I think, run into issues with collation changes. But I really like the last bullet point he mentions here, which is, quote, "I would like to move toward making ICU the default".

So basically, quote, "...get to a point where we think ICU and UTF-8 first, and where all the binaries are built with ICU enabled...". I really like this path to avoid these surprising version changes, because presumably you could then upgrade operating system versions and upgrade your ICU collation version whenever you wanted to; you wouldn't have to do it along with an operating system upgrade. But it looks like this is still a work in progress. The database-level feature is ready for 15, but actually making ICU the default sounds like it may be version 16 or version 17. We'll just have to see. But if you want to learn more about that, definitely check out this blog post.

Now related to that, the next piece of content is "5mins of Postgres E37: New in Postgres 15: Logical replication column filters & database-wide ICU collations". So of course the content related to the previous article is the database-wide ICU collations, and Lukas talks about that. But he also mentions the logical replication column filters that are coming in Postgres 15 as well, which were mentioned in last week's Scaling Postgres episode. But if you want to learn more about that, definitely check out this piece of content from pganalyze.com.

Next piece of content- "LISTEN/NOTIFY: AUTOMATIC CLIENT NOTIFICATION IN POSTGRESQL". This is from cybertec-postgresql.com and they're talking about essentially the pub sub-feature of Postgres called LISTEN/NOTIFY. Basically, this lets you check for changes and do some work. Now one way you could do it is doing polling. So you could have some sort of code that has a loop, checks for changes, say to a table, does some work, and then it sleeps for a period of time. But then it polls again. So this is a pub/sub where there's an actual change in the database and it notifies you that the change has happened rather than you having to check periodically. So first there's the LISTEN command. And that essentially lets you listen on a channel that you choose whatever name you want to make it. And then if a particular event happens you can notify those who are listening to that channel. So you notify and then say that same channel name that says other clients are listening on and you can optionally include a text-based payload.

So to get started, you would just open a client connection and say LISTEN and then whatever channel name; he chose x here. Then, if you want to send a notification, you just say NOTIFY x, the channel name, and then whatever payload you want to send, say some message. Now, the psql client displays this kind of information: it says "Asynchronous notification 'x' with payload", shows the payload that was received, and gives the PID of the server process that sent it. Different clients are going to behave differently, so if you have a Ruby client or a JavaScript client, there are different ways to hold a connection open and keep listening for notifications to come in. But this is how psql shows incoming notifications. Now, you don't usually want to send a NOTIFY by hand; you typically want the NOTIFY to be triggered by something. In the example he has here, he creates a table, then creates a function that will be fired by a trigger on an INSERT or UPDATE.
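Putting those pieces together, a minimal sketch of that trigger pattern might look like this (the table and function names are hypothetical, not the post's exact code, and it uses pg_notify, the convenience function discussed next):

    CREATE TABLE t (id serial PRIMARY KEY, payload text);

    -- Fired after a row change; sends the new value on channel 'x'.
    -- Listeners only receive it once the transaction commits.
    CREATE FUNCTION notify_change() RETURNS trigger AS $$
    BEGIN
        PERFORM pg_notify('x', NEW.payload);
        RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER t_notify
        AFTER INSERT OR UPDATE ON t
        FOR EACH ROW EXECUTE FUNCTION notify_change();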

Then it sends a message by executing the NOTIFY command with the given channel and a payload, where essentially he's just sending the contents of the row. But there's another convenient function you can use called pg_notify: the first parameter is the channel name, and the second parameter is the text-based payload. So you can see here that when it sent the notice, the payload was essentially the row contents. Now, they do emphasize the point that the notification is not sent immediately, as soon as something is inserted. It needs to wait until that transaction is committed, because other sessions need to be able to see that the insert has happened. So if you do this as part of a transaction, the notice will be sent as soon as it's committed. But this is a great post that explains how LISTEN/NOTIFY works and how you can get it set up.

Now, related to that, there's another piece of content, this one from crunchydata.com, called "Real-time Database Events with pg_eventserv". So they're covering LISTEN/NOTIFY as well. They have a diagram here that shows a separate web service they designed, I believe in Go, that essentially uses LISTEN/NOTIFY and then sends out information via WebSockets. So you could design something equivalent to pg_eventserv in any language you want, but this is the particular web component they created to do it. It's still using LISTEN, so it happens to listen on a channel they called People, and then they're using, again, that same trigger mechanism to send notifications on the People channel. Then their pg_eventserv picks up those notifications and sends out those messages via WebSockets.
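Their trigger function builds a JSON payload and sends it as text; a hedged sketch of that technique might look like this (the channel name people follows their example, while the function name here is illustrative):

    -- Build a JSON representation of the new row and cast it to text,
    -- since a notification payload must be text
    CREATE FUNCTION notify_people() RETURNS trigger AS $$
    BEGIN
        PERFORM pg_notify('people', to_jsonb(NEW)::text);
        RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;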

So they're actually using JSONB to put everything together and then using pg_notify to send the JSONB they've constructed as text. They're casting it to text because the payload of the notification needs to be text, but then presumably their pg_eventserv parses that text back into an object-based format. So you can download this and run it, and it gives this little interface here: once you go to localhost 7700, you put in the channel name and it will show notifications as you insert rows into the database. So this is another type of implementation, and you can actually look at the Go code to see how to consume these messages. But if you want to learn more about that, definitely check out this blog post.

Next piece of content- "TOASTing strategies in PostgreSQL". This is from dbi-services.com, and this is a series of three blog posts explaining TOAST, and they're relatively short, but packed full of a lot of useful information. If you're not as familiar with TOAST, which stands for The Oversize Attribute Storage Technique. It's basically a separate table dedicated to storing additional information that can't fit in a particular row for a particular column. When you create a table, there are actually three different strategies that can be used for storing the data by a column. So for example, one column could be Main, which means there's no compression and no out of line storage, which means it's not going to be stored in the toast. There's Plain, which can be compressed but has no out of line storage, it's not going to be stored in the TOAST.

Then there's Extended, which can be compressed and can be stored in the TOAST table. So he created a table here with three different data types (an int, a text, and a numeric), and then you can actually look at the strategies that the database assigns to them by default. An integer is Plain, which means it's never compressed or stored in the TOAST table. A numeric is Main, so it can be compressed but won't be moved to the TOAST table, whereas text is Extended, so it can be compressed and stored in the TOAST table. Then he mentions there's also External, which doesn't do compression but does offer out-of-line storage. He says this is good for pre-compressed data: maybe you want to store PDFs or PNGs or MP3s, anything that's already compressed. You don't want to compress it again because you're just wasting CPU cycles. So that's what this post talks about, basically the strategies that are available.
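You can reproduce that check yourself with something like this (the table name is illustrative):

    CREATE TABLE toast_demo (a int, b text, c numeric);

    -- In psql, \d+ toast_demo shows a Storage value per column:
    --   a | integer | plain
    --   b | text    | extended
    --   c | numeric | main

    -- The strategy can be changed per column, e.g. External
    -- for data that's already compressed:
    ALTER TABLE toast_demo ALTER COLUMN b SET STORAGE EXTERNAL;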

The second post- "TOASTing in PostgreSQL, TOAST tables". This shows how TOASTing actually works. It shows that when you look at your list of relations, like if you do \d at a psql prompt, it won't show these extra TOAST tables; you have to explicitly look for them. They show up in the pg_toast schema, and the name is pg_toast followed by the OID of the table. The best way to find the OID of a table, because it can potentially change, is to query pg_class. So he did this here to identify the particular TOAST table, and he shows what its structure looks like.
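A query along these lines finds a table's TOAST table (reusing the hypothetical toast_demo table from the sketch above):

    -- reltoastrelid in pg_class points at the table's TOAST table
    SELECT c.relname AS table_name,
           t.relname AS toast_table
    FROM pg_class c
    JOIN pg_class t ON t.oid = c.reltoastrelid
    WHERE c.relname = 'toast_demo';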

So basically you have a chunk ID, which is an OID identifying the particular TOASTed value; a chunk sequence number, which is an integer, because if the data exceeds what can fit in a single row it gets split across multiple rows, so you can have as many sequence numbers as needed for that particular value; and then the actual data, which is a bytea, so everything gets converted to a byte array when it's stored in the TOAST table. So, another brief post giving a great explanation of TOAST.
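To see the chunks themselves, you can query the TOAST table directly (pg_toast_16384 is a hypothetical name; substitute the one you found via pg_class):

    SELECT chunk_id,
           chunk_seq,
           octet_length(chunk_data) AS chunk_bytes
    FROM pg_toast.pg_toast_16384
    ORDER BY chunk_id, chunk_seq;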

And then the third one is "TOASTing in PostgreSQL, let's see it in action". Here he's using the same table and inserts a very large value into a text field, but when he takes a look at how many TOAST rows exist, there are actually zero. So why is that? Well, it's being compressed, and it's being compressed so well that it doesn't have to be stored in the TOAST table. Then he talks about compression and how, by default, PostgreSQL uses its built-in pglz compression method, but as of Postgres 14, you can use lz4. He took a look at that compression algorithm, and the performance benefit, he says, is quite huge.
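For reference, these are the two usual places this gets configured, assuming a Postgres 14 or later build with lz4 support:

    -- Per column, at table creation time
    CREATE TABLE t_lz4 (data text COMPRESSION lz4);

    -- Or as the default for newly created columns
    SET default_toast_compression = 'lz4';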

So inserting a million rows using each compression algorithm, lz4 was able to store the data in 3 seconds vs. 51 seconds for the default Postgres compression. That's a pretty significant win, and the stored size is almost half with lz4, so it seems like a win all the way around. He shows the different ways you can turn this on and enable it for your Postgres database as well. Lastly, he shows that if you go back and make the inserted data totally random, it's harder to compress that type of data, and you can now see that 81 chunks have been created in the TOAST table. So if you want to learn more about TOAST, I highly suggest checking out these three blog posts.

Next piece of content- "BOOK RELEASE! MASTERING POSTGIS AND OPENSTREETMAP". This is from rustprooflabs.com, and basically, he's announcing a book that is available for purchase and it's all about PostGIS and using OpenStreetMap data to work with it. So if you're interested in getting started with Spatial Analysis and Postgres and PostGIS, definitely recommend checking out this blog post and the book to see if you're interested.
Next piece of content- "All about pg_stat_statements". This is from dev.to in the Yugabyte section, and this is exactly what it says talking all about pg_stat_statements, which is probably the most used extension that's available for tracking the performance of your statements that are running in Postgres. So if you want to learn more about how that works, definitely check out this blog post.

Next piece of content- "Preferred types". This is from mydbanotebook.org, and she's talking about an issue where someone was doing a Power function and they noticed that when they tried to do a mathematical adjustment minusing it by one, it wasn't working. It was giving the exact same value, even though one was subtracted from the version one code. The reason is because of floating point errors. So basically, if you don't explicitly state it, the Power function, when you're putting in numbers, assumes it's a double precision or it's a float, a double. But in order to get the correct answer, you need to actually cast it to a numeric or ensure that it's a numeric when you're doing it. And if you do that, you get the right answer. But it just so happens the double is actually what the default is. So if you don't specify what the data type is, it assumes it's a double, and she goes in and shows how you can discern what the preferred data types are for Postgres. So definitely check out this blog post if you want to learn more about that.

Next piece of content- "Introducing pgsqlite: a pure python method to import SQLite databases into Postgres". This is from innerjoin.bit.io, and they're talking about a utility that can convert your SQLite database into Postgres and it's a Python module. He goes over the process of creating that and has a link to the Python module.

But the other thing he mentions, which has been around a while longer, is pgloader. This is another way to do a migration into Postgres, from three different types of databases: MySQL, SQLite, or even Microsoft SQL Server. And this is from pgloader.io.

Next piece of content. There was another episode of Postgres FM this week. This one is on "Why is Postgres popular?". So definitely listen to this episode, or check out the YouTube video if you're interested in that.
And the last piece of content, the PostgreSQL person of the week is Nicolas Thauvin. If you're interested in learning more about Nicolas and his contributions to Postgres, definitely check out this blog post.
