Boundless Text, Revoked Permissions, Index Bloat, Hardware Performance | Scaling Postgres 182

In this episode of Scaling Postgres, we discuss having boundless text fields, revoking public schema permissions, less index bloat in PG14 and comparing hardware performance.

Transcript

All right. I hope you, your friends, family, and coworkers continue to do well. Our first piece of content is "Postgres: Boundless 'text' and Back Again". This is from brandur.org, and he's talking about the situation with Postgres: you can use a varchar data type, a varchar with a particular limit, or a text data type, and there are no performance implications from any of those choices; they all work equally well. In other database systems, you pay penalties for one over the other. Like maybe a variable character field is less efficient than a simple character field, but with Postgres, it doesn't make a difference. Now, because of that, a lot of people who use Postgres find freedom in it: you can just make everything a text field and forget about it. But then you run into problems of end users really exploiting that boundless capability.
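
As a rough illustration of that point (the table and column names here are just for demonstration), these column definitions all behave the same way in Postgres:

```sql
-- Illustrative sketch only: in Postgres these three column types are
-- stored the same way and perform the same.
CREATE TABLE text_example (
    a text,          -- unbounded
    b varchar,       -- also unbounded, equivalent to text
    c varchar(255)   -- enforces a length limit, with no extra storage cost
);
```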

He's mentioning here a story from Stripe from 2018. Now, I don't believe this is with regard to Postgres specifically, but he's talking about a case where they had a boundless ability to insert data without limits into what looks like a Mongo database. It started having issues with, he says, crashing HTTP workers, tying up database resources, et cetera. So basically, they had to try and find those limits. But those limits were exceedingly large. So they worked out a solution with limits plus some exceptions for certain customers. Then he talks about a new position where he similarly saw that the database had no constraints on a lot of different fields, like text fields for example, and he didn't want to repeat the same issue. So what he's advocating for is that even though there's no performance difference, just for the sake of managing the database, you want to set some type of limits.

So one approach he's considering is one length limit for short strings like names, addresses, email addresses, et cetera; a somewhat longer limit, say 2,000 characters, for things like descriptions; and then a size of around 20,000 characters for really long text blocks. Now he's saying that even though you have these limits in the database, you of course want to enforce them at your application layer as well, because a lot of times you can present a friendlier message than if you just ran into a database constraint. Now I actually agree a lot with this because I know in my experience using Ruby on Rails, at one point they changed how database migrations were generated to remove the varchar limit. It used to apply a default limit of 255 characters to any variable character field.

Now, of course, you could adjust that up or down in the migration, but by default that was the limit, and I actually appreciated that. But now there's essentially no limit, so you have to impose your own if you want one. So part of me did like the limits that were historically set by default. Now he was thinking about different ways you could implement this as a standard, and one idea is using DOMAINs. So you could create, say, a text_standard DOMAIN that has a particular limit, a text_long DOMAIN, or a text_huge DOMAIN. But the problem is that using these DOMAINs kind of hides what the actual limits are, so you're not quite sure at a glance. But anyway, I felt this was an interesting blog post and discussion, and if you want to learn more, you can check out this one.
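
As a minimal sketch of what that DOMAIN approach could look like (the specific limits below are illustrative; only the 2,000 and 20,000 figures come from the post):

```sql
-- Hypothetical tiered text DOMAINs with length limits.
CREATE DOMAIN text_standard AS text CHECK (length(VALUE) <= 200);
CREATE DOMAIN text_long     AS text CHECK (length(VALUE) <= 2000);
CREATE DOMAIN text_huge     AS text CHECK (length(VALUE) <= 20000);

-- They can then be used like regular column types.
CREATE TABLE widgets (
    id          bigserial PRIMARY KEY,
    name        text_standard,   -- short strings: names, emails, etc.
    description text_long,
    body        text_huge
);
```

The downside he mentions is visible here: looking at the table definition alone, you can't tell what the limits actually are.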

The next piece of content- "Waiting for PostgreSQL 15 - Revoke PUBLIC CREATE from public schema, now owned by pg_database_owner". This is from depesz.com and he's talking about Postgres 15, not 14, which is going to be released this fall. But in the next version of Postgres, there is a patch done which is a pretty big change in that you can no longer create objects in the public schema by default, you have to be a database owner to do it. So this is best seen for the example that he gave. So he created a database called X, then created a user called Test. Then created another database called Test, where that owner he just created, called Test, is the owner of it. Now when that user, Test tries to create a table in the X database where it's not an owner, it fails. So it is unable to create objects in essentially the public schema of this database, but because it's the owner of the Test database, it can easily create a table in that database. So this is a pretty big change that may unfortunately break some applications, but overall I think it's a worthwhile change for security improvement. So if you want to learn more about this, you can check out this blog post.

The next piece of content- "INDEX BLOAT REDUCED IN POSTGRESQL V4". This is from cybertec-postgresql.com and he's talking about the case of index bloat. So basically when you have data that's being updated and there are indexes that are impacted by that data, those indexes need to be updated. If a row gets removed, for example, or a row is in the index, if that row needs to be removed because it was updated or an older version, that index entry needs to be removed. If not, you can cause some table bloat. And if you update the same row multiple times, you can actually get page splits and a larger bloated index due to that happening. Now he mentions that there are some features that Postgres already has that handle some of this. So the first feature is HOT updates. He mentions HOT tuples, but I think he means HOT updates because HOT means heap-only tuple updates.

So basically, if you're updating data in a way that does not affect any indexed column, Postgres can just update the tuple in the heap and not touch those indexes. So the indexes don't need to be altered, and that helps avoid index bloat. The other feature, as he says here, quote: "When an index scan encounters an entry that points to a dead tuple in a table, it will mark that entry as 'killed'". Therefore it doesn't have to scan over that entry again, and those entries can be deleted when the index page is full in order to avoid a page split. What version 14 adds is what they're calling bottom-up index tuple deletion. It, quote, "...is able to delete index entries that point to dead tuples right before an index split is about to occur". So basically it helps you avoid page splits in the index, which reduces bloat.
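
As a rough way to see heap-only tuple updates in action (the table and column names here are illustrative, loosely following the shape of his test table), you can check the HOT update counters in pg_stat_user_tables:

```sql
-- Only id is indexed, so updating "changed" does not touch any index
-- and can generally be done as a heap-only tuple (HOT) update.
CREATE TABLE hot_demo (
    id        bigint PRIMARY KEY,
    unchanged integer,
    changed   integer
);

INSERT INTO hot_demo
SELECT g, g, 0 FROM generate_series(1, 100000) AS g;

UPDATE hot_demo SET changed = changed + 1;

-- Compare total updates to HOT updates for the table.
SELECT n_tup_upd, n_tup_hot_upd
FROM pg_stat_user_tables
WHERE relname = 'hot_demo';
```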

Now, they have an example case here where he created a table with three columns. The first is an ID with a primary key, the second is an integer column that won't be updated (unchanged), and the third is an integer column that will be updated (changed). So he ran a lot of iterations of updates on the changed column, as demonstrated below, and then looked at the size and the level of bloat in the indexes. First off is the index on the unchanged column: it's about eight times larger in version 13 than in version 14. So at the top are the version 13 results, and below that the version 14 index is about an eighth of the size.

Plus, its bloat in version 13 is at 95%, whereas it's at 60% in version 14. So with these additional features, this index on the unchanged column is able to avoid many of the page splits caused by all those row updates, since the updates don't touch the column it indexes. It also looks to help out the primary key, which does get accessed periodically: in version 14 it is a little bit smaller and there is less bloat, about 13% compared to 35% in version 13. So this is definitely a great improvement for managing and controlling index bloat if you're going to have a lot of updates happening within a table. So if you want to learn more about it, you can check out this blog post.
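
If you want to do a similar check on your own indexes, one way (assuming the pgstattuple extension is available, and reusing the illustrative table from above) is to look at index size and leaf density:

```sql
CREATE EXTENSION IF NOT EXISTS pgstattuple;

-- Lower avg_leaf_density generally means more wasted space (bloat).
SELECT pg_size_pretty(pg_relation_size('hot_demo_pkey')) AS index_size,
       avg_leaf_density
FROM pgstatindex('hot_demo_pkey');
```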

The next piece of content is "Aurora PostgreSQL db.r6g compared to db.r5 with YBIO". This is from dev.to in the AWS Heroes section. Now, I don't cover a lot about Aurora in Scaling Postgres, but I thought this post was interesting because it gives another example of hardware differences between the Graviton processors that Amazon offers to the intel processors that they offer. Because that's the question, would this make sense to move to the Graviton processors and what is their performance like, given the cost. Now, Aurora is based upon Postgres but it's not Postgres, of course.

I think this is an interesting post to look at from a hardware performance perspective. Now, there was a tweet by someone saying they had some performance issues with the Graviton processors but not the Intel ones, so this post walks through a performance comparison between those two platforms as well. And basically, his results showed Graviton was a little bit slower but on the whole pretty equivalent, and the Graviton processors offered an option that was about 20% cheaper. So if you're evaluating different hardware, maybe you want to check out this post to see what his results were for the comparisons. Of course, you would always want to do your own comparison as well; this just gives you another data point.

The next post- "How We Went All In on sqlc/pgx for Postgres+Go". This is from brandur.org and this is talking about client utilities for accessing Postgres, specifically in Go. So in terms of the Go language, he mentioned seven different libraries that can do PostgreSQL connections and each have their own way of handling it. Essentially, think of them as ORMs for Go for accessing Postgres. Now, what he ultimately decided on was SQLC which basically, using comments, denotes the query you want to send against the database and then in the comment, it becomes a function or a method you can call to execute this query. So I thought this was an interesting post exploring different ways of accessing Postgres with Go. So if you're interested in that you can check out this blog post.

The next piece of content- "PSYCOPG3 INITIAL REVIEW". This is from rustprooflabs.com and this is talking about another client library, this time for Python. He's looking at migrating from psycopg2 to psycopg3. And basically, his assessment of the migration was easy and he thinks the connection pool rocks. So they've made a number of improvements in version 3 with a lot of improvements it appears for connection pooling. But overall in all of his analysis, he saw great performance gains in general, particularly when the connection pooler was used.

There's a companion post called "THINKING PSYCOPG3" from varrazzo.com where he discusses its development and the changes he wanted to make in version 3. For example, using PQexecParams instead of just PQexec with the parameters already merged into the query, the improvements to the connection pooler that were talked about, and also doing a lot of work asynchronously as opposed to having everything happen synchronously. Now, this is a little bit of an older post, but if you want some more background on client libraries working with Postgres, maybe you want to check this out.

The next piece of content- "Using jq to get information out of pgbackrest". This is from fluca1978.github.io and this is focused on using JQ which is a way to parse out JSON from the allows you to parse that and pull out information in a command line utility. Now, I actually have explored something similar because AWS's client, like if you're wanting to do backup or snapshots or things of that nature also requires a way to parse JSON unless you're forcing the output to be text. So I thought the JQ was a great tool to be familiar with if you're working with Postgres and need to do some JSON parsing. So if you want to learn more, you can check out this blog post.

The next piece of content: the PostgreSQL person of the week is Akshay Joshi. So if you're interested in learning more about Akshay and his contributions to Postgres, you can definitely check out this blog post.

The last piece of content, we did another show of The Rubber Duck Dev Show. This week, we covered "Programming with Dates, Times & Zones". So if you want some more long-form, developer-based content, definitely check out this episode.
