
Saving Space, Working with TOAST, Protecting Data, High CPU Utilization | Scaling Postgres 203


In this episode of Scaling Postgres, we discuss how data alignment can save space, working with TOAST as it relates to JSONB, ways to prevent data deletion and a high CPU incident.

Transcript

All right. I hope you, your friends, family, and coworkers continue to do well. Our first piece of content is "How much disk space can you save by using INT4/INT instead of INT8/BIGINT?". This is from depesz.com, and he's talking about threads where people debate whether you should use an INT4 or just use an INT8 because the space doesn't matter that much. So he actually ran a test: he created a table with just an INT8 column and another table with just an INT4 column, inserted a million rows into each, and then checked the sizes, and the sizes were identical. Now, he did mention that this is on a 64-bit system. He did the same test on a 32-bit system, and there you actually see that the INT4 table is a bit smaller. Why is that? It's due to data alignment: on a 64-bit system, Postgres aligns the data to 8 bytes.
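As a rough sketch of the kind of test he ran (the table names here are my own, not necessarily the ones in the post):

```sql
-- One table with a single INT8 column, one with a single INT4 column.
CREATE TABLE test_int8 (v int8);
CREATE TABLE test_int4 (v int4);

-- Load a million rows into each.
INSERT INTO test_int8 SELECT i FROM generate_series(1, 1000000) AS i;
INSERT INTO test_int4 SELECT i FROM generate_series(1, 1000000) AS i;

-- On a 64-bit build, both report the same size, because each row's data
-- is padded out to an 8-byte alignment boundary anyway.
SELECT pg_size_pretty(pg_relation_size('test_int8')) AS int8_size,
       pg_size_pretty(pg_relation_size('test_int4')) AS int4_size;
```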

Now that kind of makes sense because there are 8 bits to a byte and it's a 64-bit processor, so essentially each chunk of data is aligned to 64 bits. What that means is that an INT4 only uses 4 bytes of its 8-byte segment, and the other 4 bytes are wasted padding. But if you have two INT4s in a row, like he did in this experiment, and load the table, you get the exact same size as just using one INT4 or one INT8. And if you do an INT4 and then an INT8, the INT8 spills over to the next grouping and you get a size of 42 megabytes. Similarly, using INT2s or BOOLs, you're still going to get the exact same size as an INT8 table. So on a 64-bit system, the alignment is going to be 8 bytes.
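A sketch of the multi-column variant of the same experiment (again with made-up table names):

```sql
-- Two INT4s pack into one 8-byte slot, so this ends up the same size as the
-- single-column tables above.
CREATE TABLE test_int4_int4 (a int4, b int4);

-- An INT4 followed by an INT8 pushes the INT8 onto the next 8-byte boundary,
-- wasting 4 bytes of padding per row, so this table comes out larger.
CREATE TABLE test_int4_int8 (a int4, b int8);

INSERT INTO test_int4_int4 SELECT i, i FROM generate_series(1, 1000000) AS i;
INSERT INTO test_int4_int8 SELECT i, i FROM generate_series(1, 1000000) AS i;

SELECT pg_size_pretty(pg_relation_size('test_int4_int4')) AS packed,
       pg_size_pretty(pg_relation_size('test_int4_int8')) AS padded;
```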

Now, every row also carries a number of hidden system columns that Postgres uses to manage the table and track data visibility. The ones you don't normally see are tableoid, cmax, xmax, cmin, xmin, ctid, et cetera, and each of these has a storage cost per row (you can see them by selecting them by name, as in the sketch after this paragraph). Now, a lot of people want to leverage this alignment to fit their data as compactly as possible. But the problem is that as soon as you add another column to the table or remove a column, you're going to break your carefully chosen alignment. So I don't think this is a really good thing to rely on for, say, OLTP use cases, typical operational data stores. But if you have an analytical database, data mart, or data warehouse, leveraging this alignment could allow you to save a fair amount of space. So you could take this alignment into account and order your Boolean fields, integers, floats, and other data types appropriately. But if you want to learn more about that, you can definitely check out this blog post.
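Here's what selecting those system columns by name looks like, reusing one of the sketch tables from above:

```sql
-- System columns exist on every table but are only returned when you name
-- them explicitly.
SELECT tableoid, ctid, xmin, cmin, xmax, cmax, *
FROM test_int4_int4
LIMIT 3;
```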

The next piece of content- "PostgreSQL: TOAST compression and toast_tuple_target". This is from blog.anayrat.info and he's talking about TOAST, which is The Oversized-Attribute Storage Technique. Basically, a row has to fit within a block, which is 8 KB by default, so when a row gets too large, Postgres stores the oversized column data in the TOAST table, and the row in the heap keeps a pointer to it. He talks a little bit about the TOAST structure here, but also about the role compression plays: sometimes Postgres compresses values so the row still fits in the heap, sometimes it moves data into the TOAST without compression, and sometimes it compresses it there. There are some different parameters you can adjust. For example, above the toast_tuple_threshold it's going to try to compress the columns to fit the row on the block, and there's also toast_tuple_target, which defines what's going to spill over into the TOAST table.
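toast_tuple_target can be changed per table as a storage parameter; a minimal illustration with a hypothetical table:

```sql
-- Create a table and lower its TOAST target so large text values get
-- compressed or moved out of the main heap sooner than the default ~2 KB.
CREATE TABLE documents (id bigint, body text);
ALTER TABLE documents SET (toast_tuple_target = 256);

-- Confirm the setting.
SELECT reloptions FROM pg_class WHERE relname = 'documents';
```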

By default, both of these are 2 kilobytes. Then in terms of TOAST, there are per-column storage settings that determine whether a column can be stored in the TOAST or not (see the sketch after this paragraph). For example, the most common setting you'll see is PLAIN, where the column is stored in the heap only and without compression; you'll typically see this with integers, floats, and things of that nature. There are other values as well, and for text or string data you'll typically see a column flagged as EXTENDED, which means it can be compressed and stored in the TOAST if necessary.
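Those per-column storage modes can be inspected and changed like this (sticking with the hypothetical table from the previous sketch):

```sql
-- Inspect the storage mode of each column:
-- 'p' = PLAIN, 'x' = EXTENDED, 'e' = EXTERNAL, 'm' = MAIN.
SELECT attname, attstorage
FROM pg_attribute
WHERE attrelid = 'documents'::regclass AND attnum > 0;

-- For example, force the text column to be moved to the TOAST table
-- without compression.
ALTER TABLE documents ALTER COLUMN body SET STORAGE EXTERNAL;
```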

Now, he did some examination of this with JSONB, looking at the effect on performance, and he's not a fan of using JSONB in general, only for sparse attributes, particularly ones that tend to be NULL. Like if you have a set of products with a lot of different feature flags, or different attributes where many of them may be NULL for a particular product, that could be a good use case for JSONB where it's just flagging something. But in general, there are a lot of disadvantages that he mentions here. One is you can't do any type checking, you can't have any integrity constraints, and you don't have any foreign keys. Also, writing queries becomes a little bit more complex and you don't have statistics on the keys of a JSONB field. You also lose some storage efficiency because you're now storing the keys in each row as well, not just the values. And you can't partially update a JSONB value: if it's stored in the TOAST, you have to de-TOAST it, make the changes, and then re-TOAST it, so it has a performance impact. Now, he particularly looked at the performance of aggregation, so he had separate columns and he did an average on each column to check the timing.
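The comparison he ran looks roughly like this (column and key names here are my own):

```sql
-- Averaging a plain numeric column only reads fixed-width data in the heap.
SELECT avg(price) FROM products_columns;

-- Averaging a value stored inside a JSONB document has to de-TOAST and
-- decode the whole document for every row, then cast the extracted text.
SELECT avg((attributes->>'price')::numeric) FROM products_jsonb;
```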

Aggregating those four columns over this data set took around two minutes and 50 seconds. Then he placed that data in a JSONB field and ran an aggregation against that, and the query took about eight minutes and 40 seconds to execute, so roughly three times slower. So it's a clear example that there is a performance hit when you're using JSONB this way. What's interesting is that it apparently has to decode the whole JSONB value for each aggregation, which is definitely a burden. He goes into an analysis of why that difference exists. He did mention one way you could potentially speed things up a little more: using the new lz4 compression that exists in Postgres 14 to get things in and out of the TOAST faster (a quick sketch of that follows below). But if you want to learn more, you can definitely check out this blog post.
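On Postgres 14 and up, switching to lz4 TOAST compression looks like this (using the hypothetical table from the previous sketch):

```sql
-- New values written to this column will be compressed with lz4; existing
-- values keep whatever compression they were originally written with.
ALTER TABLE products_jsonb ALTER COLUMN attributes SET COMPRESSION lz4;

-- Or make lz4 the server-wide default for newly compressed values.
-- ALTER SYSTEM SET default_toast_compression = 'lz4';
```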

The next piece of content- "Safer Application Users in Postgres". This is from blog.crunchydata.com, and they're talking about a situation where someone sent them a message that said, quote "We deleted our database". So basically, help! Then they helped them do a restore. What this post is talking about is that you should probably configure your database so that your application users cannot delete data: they can't drop tables, they can't truncate tables, and, as they suggest later on, maybe they can't even delete rows. The way they propose setting this up is you create an administrative user that's going to own the database or the schema, and then you create an application user. With your administrative user, you create a particular schema, which they called "prod" here, and then they granted usage on that schema to myappuser.
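The setup they describe looks roughly like this sketch (the myappuser role and prod schema names follow the post; the admin role name and passwords are placeholders of mine):

```sql
-- An administrative role that owns the schema and its objects.
CREATE ROLE admin_user LOGIN PASSWORD 'admin-password-here';

-- A restricted role the application connects as.
CREATE ROLE myappuser LOGIN PASSWORD 'app-password-here';

-- As the administrative user: create the schema and let the app role use it.
CREATE SCHEMA prod AUTHORIZATION admin_user;
GRANT USAGE ON SCHEMA prod TO myappuser;
```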

Then they created a table with the admin user, and then with myappuser they tried to delete it and weren't able to. What they did do is grant SELECT, INSERT, and UPDATE on all tables in the schema "prod" to myappuser. Now, you could also choose to give it CREATE TABLE or ALTER TABLE access, because a lot of application frameworks rely on an application user to make changes to the schema. Or maybe you want to create a separate user for these types of schema changes, because normally, when an application isn't doing a schema change or a schema migration, it only needs SELECT, INSERT, and UPDATE.
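Limiting the application role to reads, inserts, and updates then looks something like this sketch (the default-privileges statement is an extra step I'd add; the transcript doesn't spell it out):

```sql
-- The application can read and write rows, but cannot DELETE, TRUNCATE,
-- or DROP anything in the schema.
GRANT SELECT, INSERT, UPDATE ON ALL TABLES IN SCHEMA prod TO myappuser;

-- Make sure tables the admin role creates later pick up the same grants.
ALTER DEFAULT PRIVILEGES FOR ROLE admin_user IN SCHEMA prod
    GRANT SELECT, INSERT, UPDATE ON TABLES TO myappuser;
```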

Now, in terms of deletion, the way they address that is that you basically never delete any data with the application user. What you do is you mark it for deletion. So for example, you add a deleted column, a timestamp, to essentially all of your tables, and when you want to delete something, you just set that column to the current timestamp to flag the row as deleted. Then in your SELECT queries, you basically show all the data where deleted is NULL. So that's a way to handle deletions without actually deleting the data and without giving that permission to the application user (a minimal sketch follows below). It's up to you if you want to run your database in this way, but it is definitely a more protective way to run it. But if you want to learn more about the process that they set up for doing this, you can definitely check out this blog post.
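A minimal sketch of that soft-delete pattern (the table name is a placeholder):

```sql
-- A nullable timestamp column marks rows as deleted without removing them.
ALTER TABLE orders ADD COLUMN deleted timestamptz;

-- "Deleting" a row is just an UPDATE, which the application role may run.
UPDATE orders SET deleted = now() WHERE id = 42;

-- Normal reads filter the soft-deleted rows out.
SELECT * FROM orders WHERE deleted IS NULL;
```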

The next piece of content- "5mins of Postgres E5: Finding the root cause of high CPU utilization in Postgres with EXPLAIN and the Linux perf command". This is from pganalyze.com, and they are doing a review of the "A Hairy Incident" post that we covered in last week's Scaling Postgres episode. They look at it in more depth, talk about a few other blog posts that were linked from it and related to it, and discuss the situation in general. I actually think it's a good second viewpoint on this incident and what happened, and I encourage you to check out this episode as well.

The next piece of content- "The significance of LSN in PostgreSQL System". This is from highgo.ca and they're talking about the log sequence number as it relates to the WAL, in terms of identifying exactly where you are in the WAL. He mentions that the reason he was examining this is that he's looking at a potential shared storage solution for Postgres. So it's an interesting use case for why he was looking at it, but if you're interested you can check out this blog post.
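As a quick illustration of what an LSN looks like in practice (these queries are mine, not from the post):

```sql
-- The current write position in the WAL, reported as an LSN.
SELECT pg_current_wal_lsn();

-- The number of bytes of WAL between two LSNs.
SELECT pg_wal_lsn_diff(pg_current_wal_lsn(), '0/0');
```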

Next piece of content- "SQLcl to transfer data from Oracle to PostgreSQL or YugabyteDB". This is from dev.to in the Yugabyte section. This is talking about a very simple way to transfer data from Oracle to Postgres. Basically, SQLcl is a command-line way to export data to a CSV format. So he does that, then runs an awk command to make a few changes for Postgres compatibility, and then loads the data into Postgres. So if you're looking for a quick and dirty way to do that, you can definitely check out this blog post.
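The Postgres side of a load like that is just a COPY of the cleaned-up CSV; something along these lines, with placeholder file and table names (the SQLcl export step itself isn't shown here):

```sql
-- Run from psql so the CSV file is read from the client machine.
\copy products FROM 'products.csv' WITH (FORMAT csv, HEADER)
```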

The next piece of content- "Queries in PostgreSQL: 1. Query execution stages". This is from postgrespro.com and this is a series of blog posts talking about the internals of PostgreSQL. This particular post covers query execution in terms of parsing, transformation, planning, and execution, as well as the extended query protocol that allows preparation and parameter binding. So if you want to learn more about the internals of Postgres, definitely check out this blog post.
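At the SQL level you can see the prepare/bind/execute split with an explicit prepared statement; a small example against a hypothetical table:

```sql
-- Parsing and rewriting happen at PREPARE time; planning happens at or
-- before execution; parameter binding happens at EXECUTE time.
PREPARE get_user (int) AS
    SELECT * FROM users WHERE id = $1;

EXECUTE get_user(42);
```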

The next piece of content- "PG_TIMETABLE V4.4 IS AVAILABLE IMMEDIATELY!". This is from cybertec-postgresql.com, and this is a scheduling tool for Postgres, and it operates outside of Postgres. However, it uses the Postgres database to keep track of scheduled jobs in Postgres. So if you want to learn more about the update to this tool, definitely check out this blog post.

The next piece of content, the PostgreSQL person of the week is Henrietta Dombrovskaya. So if you're interested in learning more about Henrietta and her contributions to Postgres, definitely check out this blog post.

The last piece of content, we had another episode of The Rubber Duck Dev Show this past Wednesday. This episode was on "How To Get Started Coding". So if you haven't started your development journey yet and are trying to determine where to start, maybe you'd like to check out our show.
