
JSONB Performance, Bidirectional Replication, Most Recent Record, PG14 JSON | Scaling Postgres 201


In this episode of Scaling Postgres, we discuss JSONB performance, bidirectional replication, getting the most recent record and using JSON in Postgres 14.

Transcript

All right. I hope you, your friends, family, and coworkers continue to do well. Our first piece of content is "Postgres large JSON value query performance". This is from evanjones.ca, and he's talking about the performance of JSON, JSONB, and HSTORE in Postgres, particularly as you add more data to them, especially once you exceed the roughly two-kilobyte limit per row. Now, this is because Postgres generally has an eight-kilobyte page size and aims to fit a minimum of four rows per page, so you end up with about a two-kilobyte limit per row. What he's reporting is that when accessing these columns, performance gets pretty darn slow, 2-10 times slower. Now, this is due to both the compression and the use of the TOAST table. So whenever a row exceeds that two-kilobyte limit, the excess spills over into the TOAST table. Just looking at the JSONB test that he did, the query time for inline uncompressed was 746.

When you compress it, it goes to 1,178. For TOAST uncompressed it's 3,393, and for TOAST compressed it's 7,624. So again, these are query times. As you can see, when data is stored in TOAST and compressed, it's ten times slower than accessing data that is uncompressed in the actual row, not in TOAST. So that's a pretty big performance difference. Now, he doesn't mention indexes or anything, so I'm assuming this is just a raw reading of the table. The reality is that accessing and pulling the data out takes ten times longer versus having it smaller and uncompressed in the row. So definitely something to keep in mind if you want to store large JSON blobs in a JSONB field or any of these other types. He suggests you may want to consider splitting things that exceed 2 KB across multiple rows, as well as using, say, lz4 compression to hopefully get you a little bit more speed. But definitely a super interesting post and something to keep in mind as you're determining how you want to store your data.
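
To make that concrete, here is a minimal sketch of how you might switch a JSONB column to lz4 compression and inspect how values are actually being stored. The table and column names (events, payload) are placeholders, not from the post, and lz4 assumes the server was built with lz4 support.

```sql
-- Hypothetical table: switch the jsonb column's TOAST compression to lz4 (PG14+).
-- Only newly written values use lz4; existing rows keep their old compression
-- until they are rewritten.
ALTER TABLE events
  ALTER COLUMN payload SET COMPRESSION lz4;

-- Inspect how individual values are stored: which compression method was used
-- (NULL means uncompressed) and how many bytes the stored value takes.
SELECT id,
       pg_column_compression(payload) AS compression_method,
       pg_column_size(payload)        AS stored_bytes
FROM events
LIMIT 10;
```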

The next piece of content is actually an enhancement of the previous post: "5mins of Postgres E3: Postgres performance cliffs with large JSONB values and TOAST". This article is from pganalyze.com. So basically, he primarily covers the previous JSONB post, and he references a few more posts describing how TOAST works and some other things to keep in mind with regard to it, as well as how updates can be really slow. Because as he says here, quote "When you update a value that's in the TOAST, it always duplicates the whole value". So this could be for text fields or even a large JSONB field. If you only want to change one section of that JSONB, the entire JSONB needs to be duplicated, and that also has an impact on WAL files, which in turn has an impact on vacuum. So it can really cause performance issues. Going back to what the original post mentioned, maybe breaking it up into multiple rows would make that easier to work with. You just need to be cautious about how often you're updating very large JSONB fields, but these are two great pieces of content to check out.
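
As a hedged illustration of why that matters, again with a hypothetical events table and payload column: even a one-key change to a large JSONB value rewrites the entire value.

```sql
-- Even though only one key changes, Postgres writes a brand new copy of the
-- whole jsonb value (and of its TOAST chunks if it is stored out of line),
-- which also generates correspondingly large WAL and later vacuum work.
UPDATE events
SET payload = jsonb_set(payload, '{status}', '"processed"')
WHERE id = 42;
```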

The next piece of content- "PG Phriday: Nominally Bidirectional". This is from enterprisedb.com and what they're talking about is there was a post that talked about using pglogical to achieve bidirectional replication. This was on the Amazon blog and I actually reported on this in the previous episode of Scaling Postgres. His suggestion is to do anything but this. So whatever you read from this post, he says definitely don't do it because you're going to get yourself in a world of hurt. Then at the bottom here is a famous quote "Here Be Dragons". So it's super dangerous to try and use pglogical to achieve some sort of multi-master setup or bidirectional replication.

There are so many things that can get out of sync and cause issues, forcing you to rebuild to get logical replication back up and running again reliably. He goes through all the different issues you'll encounter. One of the main ones is that replication slot positions are not passed to replicas. So if you have a failover event, that slot information is going to be lost, and you need to follow specific procedures to try and get logical replication back up and running. When you're relying on it for bidirectional replication, that can cause a number of issues. He also points out that last-update-wins conflict resolution can get you into problems.

He gives the example of merge conflicts where, say, two nodes update the same balance, which could definitely cause big issues with the data. Basically, he advocates that if you want multi-master or some sort of bidirectional replication solution, use a piece of software that was designed for that. Now of course, EnterpriseDB does offer BDR, which is their bidirectional replication product, but there are other solutions that are also viable. But I've also seen posts discussing BDR that seem to actively discourage some uses of it. So basically it has its role, but it's not perfect for everything, because you really need to change your application to support this type of database replication. But definitely a great post to read and consider if you're thinking about setting up some sort of multi-master or bidirectional replication.

The next piece of content- "Select the most recent record (of many items) with PostgreSQL". This is from timescale.com. They're talking about a scenario where you have a lot of devices, a lot of trucks that statistics are being reported on on a regular basis. So maybe you have, say, a set of trucks whose mileage is being reported fuel, longitude, and latitude, and you want to get the truck and its most recent value. They have an example of that query here and they said by default, the Timescale solution indexes by the timestamp. Well, that's not going to work too well when you're looking for a truck ID. So the best solution to do that is to do a compound index where you're actually indexing on the truck ID and the timestamp. So that can allow you to return that query pretty quickly. Now I should say that this is talking about Timescale, but a lot of this also applies to Postgres as well, particularly if you're using say, partition tables.

But another scenario that can cause real slowdowns is where you're looking for the most recent value for every truck, and there are a few techniques you can use to do that, some better than others. They talk about using a naive GROUP BY, which in most cases is probably not going to give you the best performance. They talk about using a lateral join to find this type of data as a viable option (sketched below). Timescale also has a SkipScan capability, so that's another one. If you aren't using SkipScan, they say you can do a loose index scan, which is another technique. Lastly, you can just log the last value for each truck in a separate table, so you basically take the hit at insert time as opposed to query time. They have a table down here comparing these different techniques and the benefits and disadvantages of each one. So if you want to learn more about that, definitely check out this blog post.
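
Here is a sketch of the lateral join technique they describe, again with hypothetical trucks and truck_readings tables: for each truck, the inner query fetches only that truck's most recent reading.

```sql
SELECT t.truck_id, r.ts, r.fuel, r.longitude, r.latitude
FROM trucks t
CROSS JOIN LATERAL (
    SELECT ts, fuel, longitude, latitude
    FROM truck_readings
    WHERE truck_id = t.truck_id
    ORDER BY ts DESC
    LIMIT 1        -- one row per truck, served by the (truck_id, ts DESC) index
) r;
```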

The next piece of content- "Working With JSON in Postgres 14". This is from aaronbos.dev and he's talking about the new subscripting that's available in Postgres. So basically before we had to use these operators, I'll call it the arrow( - >) or the double arrow( - >>) operator to pull out individual values of JSON or JSONB. Well, now you can use subscripting. So he has some example data he inserted into a table and shows you how easy it is to use by just using the syntax to be able to pull out the first name or the last name or even do a WHERE. The key thing you have to remember is that when you're using this, you do need to double quote it because what gets returned automatically is JSONB, whereas when you're using the double arrowhead operator, it actually returns it as text.

So you just need to remember to do that if you're going to be using the new subscripting capabilities. He says you can even access items in arrays by just using a numeric index, and then updating data is a lot easier than using the jsonb_set or jsonb_set_lax functions to do it. So here you can change a value by just assigning the subscript to the particular value you want. You can also update an array value by its index using syntax like this. You can even add a key when it doesn't exist already, so you're inserting a new key into the JSONB for all rows in this case. Or you can append to a JSON array by simply assigning to a new index, as sketched below. So this is a great post. It's very brief and it tells you exactly how to use some of these subscripting techniques. So if you want to learn more, check out this blog post.
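
A small sketch of what that looks like, using a hypothetical people table with a jsonb column named data rather than the post's own example data:

```sql
-- Reading: subscripting returns jsonb, so literals on the other side of a
-- comparison need the double quotes of a JSON string.
SELECT data['first_name'] FROM people;
SELECT * FROM people WHERE data['last_name'] = '"Smith"';

-- Updating a key (or adding it if it doesn't exist yet) by assigning to the
-- subscript, instead of calling jsonb_set().
UPDATE people SET data['first_name'] = '"Alice"' WHERE id = 1;

-- Array elements can be addressed by index; writing past the end of the
-- array extends it (any skipped positions are filled with nulls).
UPDATE people SET data['tags'][0] = '"admin"' WHERE id = 1;
UPDATE people SET data['tags'][2] = '"new-tag"' WHERE id = 1;
```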

The next piece of content- "How to gain insight into the pg_stat_replication_slots view by examining logical replication". This is from postgresql.fastware.com. They're talking about the new view that's available in Postgres 14. Basically, it lets you track disk spilling where the data that has to be sent for logical replication exceeds the logical decoding work_mem, or when it's streaming. So it's in the process of streaming an in-process transaction, for example. And it basically helps you configure this configuration parameter. But this post goes into a lot of detail about logical replication, how it works, and how this view gives you insight into exactly what's going on. So if you want to learn more, definitely check out this blog post.

The next piece of content- "AUTOMATIC PARTITION CREATION IN POSTGRESQL". This is from cybertec-postgresql.com and they're discussing a way to automatically create partitions in Postgres. Now, the way that I do it is I just set up a cron job with my language of choice which could be Ruby or any other language. You could even do it as a bash script, or you could even create a function in Postgres to create these tables ahead of time. So they talk about using some of these different things using an operating system schedule.

You could also find a scheduler that works within Postgres, and there's an extension, pg_partman, that will do it for you as well. But they actually tried to do something in real time using LISTEN and NOTIFY in conjunction with triggers. Now, they said they got something working, but it really was a hit to performance and they wouldn't advocate doing it this way. Still, it was an interesting experiment to see if it would actually work. If you want to learn more about that, definitely check out this blog post.
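
As a sketch of the simpler scheduled approach mentioned above (not Cybertec's trigger experiment), here's what a partition-creating function might look like, assuming a hypothetical measurements table range-partitioned by month on created_at:

```sql
-- Create next month's partition if it doesn't already exist; meant to be
-- called from a cron job, e.g. psql -c "SELECT create_next_month_partition()".
CREATE OR REPLACE FUNCTION create_next_month_partition() RETURNS void AS $$
DECLARE
    start_date date;
    end_date   date;
    part_name  text;
BEGIN
    start_date := date_trunc('month', now() + interval '1 month');
    end_date   := start_date + interval '1 month';
    part_name  := 'measurements_' || to_char(start_date, 'YYYY_MM');

    -- %I quotes the identifier, %L quotes the date literals.
    EXECUTE format(
        'CREATE TABLE IF NOT EXISTS %I PARTITION OF measurements
         FOR VALUES FROM (%L) TO (%L)',
        part_name, start_date, end_date
    );
END;
$$ LANGUAGE plpgsql;
```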

The next piece of content- "Secure Permissions for pgBackRest". This is from blog.crunchydata.com, and basically what they wanted was to only give permissions to pgBackRest, which is a backup solution to just be able to view or read the Postgres data files but not have any modification permissions to the database at all for security reasons. This post describes how to get that set up and get that working. Now you may not use pgBackRest, but a lot of the techniques basically get a user that just has read-only access to the Postgres data files that need to run the backup could be applicable to any other tool or way you intend to back up the system. So if you want to learn more, definitely check out this blog post.

The next piece of content- "PostgreSQL Timestamps and Timezones: What You Need to Know — and What You Don't". This is from blog.yugabyte.com. This is a super long post, so apparently there's a lot you need to know about timestamps and PostgreSQL. So this is definitely a tour de force set of work covering this topic. So if you're interested in learning more about times, timestamps, time zones, dates, etc in Postgres, you can check out this blog post.

There's also a companion blog post addressing primarily intervals: "PostgreSQL Timestamps and Timezones: How to Navigate the Interval Minefield". So basically, intervals are the distance between two points in time, so they could be so many hours, so many minutes, so many days, et cetera. So this is from an interval perspective and it's equally long, but if you're interested in that type of content, you can definitely check out this blog post.
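
As a tiny illustration of the concept: subtracting two timestamps yields an interval, and intervals can be added back to timestamps.

```sql
-- The difference between two timestamps is an interval.
SELECT timestamptz '2022-03-15 18:30+00' - timestamptz '2022-03-01 06:00+00' AS elapsed;
-- => 14 days 12:30:00

-- Intervals can be added to timestamps for date arithmetic.
SELECT now() + interval '2 hours 30 minutes' AS later;
```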

The next piece of content- "Creating a Postgres Foreign Data Wrapper". This is from dolthub.com and they're talking about a way you can create your own foreign data wrapper. As a reminder, a foreign Data Wrapper allows your database to talk to other data sources. So you can use Postgres SQL to query from your database and it would actually contact this external data source and query it on your behalf. So you can send it to other databases, you could send it to files, as it says here, and you could even send it to other databases like MySQL, Oracle, or Microsoft SQL Server. So they set up their own way of creating their own foreign data wrapper. So if you want to learn about how to do that, you can definitely check out this blog post.

Next piece of content- "How to use regular expression group quantifiers in PostgreSQL". This is from endpointdev.com, and they had an address in an HL7 V2 format in a Postgres data column that they needed to parse or at least extract the text from. They had a grep pattern that worked, but when they tried to put it into Postgres, it really wasn't giving them what they wanted, so they actually had to use a particular technique to be able to properly extract the address. So if you want to learn more about how they did that, you can check out this blog post.

The next piece of content, the PostgreSQL person of the week is Álvaro Herrera. So if you want to learn more about Álvaro and his contributions to Postgres, definitely check out this blog post.

The last piece of content: we had another episode of The Rubber Duck Dev Show this past Wednesday. This episode was on "Exploring the Jamstack". This is a development technique where you predominantly use pre-rendered pages and JavaScript to enhance the application or even just the website. So if you're interested in long-form, developer-based content, maybe you'd like to check out our show.
