Go Faster, GIN Indexes, Collation Stability, PG14 & Beyond | Scaling Postgres 194

In this episode of Scaling Postgres, we discuss going slower to go faster, GIN indexes, collation stability and features of PG14 and beyond.


Transcript

All right. I hope you, your friends, family, and coworkers continue to do well. So our first piece of content is "Slow things down to make them go faster". This was presented at Postgres Build 2021 and comes from the website vyruss.org. There's a brief summation of the presentation, and they present the YouTube video as well as a link to the slides here. Now, this was a very good presentation and it talked a lot about how to make Postgres go faster by not slowing it down with your application, looking at locking contention and ways to work better with Postgres's process model. The idea behind slowing things down to make them go faster is that sometimes you can throw so many connections onto Postgres that its limited CPU resources have trouble context switching between all the different connections, because each connection is its own process in Postgres. They presented an example here where with 100 connections, you get a TPS (transactions per second) of around 1,500 and a latency of 52 milliseconds.

Whereas if you ramp that up to 1,000 connections, given a particular instance size, which is r5.8xlarge, the TPS drops by about a fifth to around 1,200. And the latency gets significantly longer, over ten times longer at 668 milliseconds. So this is the problem that he's talking about: when you try to throw too many connections at Postgres, you can end up with this hindered performance. If you actually reduce the number of connections, you can get more throughput. Now, the way you handle this is basically by using a connection pooler. For example, pgBouncer is a popular one that I tend to use and is what he mentioned in this talk, but there are others you can use as well. It basically slows down those connections so you can actually run faster. He also covers some other points, including trying to avoid explicit locking and perhaps using the serializable isolation level for some use cases, but definitely a great presentation. I suggest you check it out.
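As a rough sketch of how a pooler caps the number of real Postgres backends, here is a minimal pgbouncer.ini in transaction-pooling mode; the database name, ports, and pool sizes are hypothetical, not taken from the talk:

```ini
[databases]
; hypothetical application database routed through the pooler
appdb = host=127.0.0.1 port=5432 dbname=appdb

[pgbouncer]
listen_addr = *
listen_port = 6432
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
pool_mode = transaction   ; hand the server connection back after each transaction
max_client_conn = 1000    ; many application connections are accepted...
default_pool_size = 50    ; ...but funneled into only this many Postgres backends
```

The application then connects to port 6432 instead of 5432, so Postgres itself only ever sees the small, fixed pool.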

The next piece of content- "Understanding Postgres GIN Indexes: The Good and the Bad". This is from pganalyze.com and they're talking about GIN indexes. Now, the thing to keep in mind with a GIN index compared to the typical index, which is a B-tree index, is that in a B-tree index there is a corresponding entry for every row in the table. So it's essentially a one-to-one relationship: it looks a value up in the index and then it knows where to find the rows in the heap. Whereas a GIN index is a little different in that it stores multiple locations per value. So basically the index is organized by value, which is what makes it good for text searching as well as JSONB fields that have multiple values within the structure of the field. You can think of a typical book index as how you would think about a GIN index: it has a particular value it knows about and it knows the locations where to find the information for that value.

And what they say here is that a single row in a GIN index can be represented in many places within the tree. The next thing they talk about is the use cases I mentioned. They talk about full-text search and how you can use the GIN index to do full-text searching using particular operators. You can also do similarity searches using the gin_trgm_ops operator class when you create that GIN index, and that will allow you to do similarity searches against a table using that index. They did mention, of course, that you can use a GIN index on JSONB columns, and they showed some use cases for that where you can query for a particular value of a field in the data you have stored in there. You can also use a different operator class like jsonb_path_ops to do containment queries against JSONB and find rows where a particular value is contained within the column.

So you could use the containment operator for those use cases. They even mentioned you can use a B-tree index on a JSONB field using this type of syntax where you're targeting a specific value you want to store in that B-tree. That's an alternative to using a GIN index. Now, sometimes you're going to want to combine these, so a lot of times you may want to look at a given customer ID that they mentioned here as well as a particular set of data within a JSONB field. There are a couple of different ways you can do that, but one of the great ways to do that is using an extension called btree_gin which allows you to combine a B-tree index and a GIN index at the same time and they show a use case of doing that here. Now, one of the downsides of the fact that you're storing the index by value and it references multiple locations is that it has expensive updates, as they mentioned here.
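To make the index types discussed above concrete, here is a sketch of how each of them might be created; the table and column names are hypothetical:

```sql
-- Full-text search: one GIN entry per lexeme in the document body.
CREATE INDEX docs_body_fts ON docs USING gin (to_tsvector('english', body));

-- Similarity search: requires the pg_trgm extension and its operator class.
CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE INDEX docs_title_trgm ON docs USING gin (title gin_trgm_ops);

-- JSONB: jsonb_path_ops builds a smaller index that only supports
-- the containment operator @>.
CREATE INDEX events_payload_idx ON events USING gin (payload jsonb_path_ops);
SELECT * FROM events WHERE payload @> '{"status": "failed"}';

-- B-tree on a single extracted JSONB field, as an alternative to GIN.
CREATE INDEX events_status_idx ON events ((payload ->> 'status'));

-- Combining a scalar column with JSONB in one index via btree_gin.
CREATE EXTENSION IF NOT EXISTS btree_gin;
CREATE INDEX events_cust_payload_idx ON events USING gin (customer_id, payload);
```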

Because once that value changes, you may have to go and visit all those different locations where that value is referenced to update that information on where to find it. So the way it handles that is it actually defers some of these updates and maintains, as they mention here, a pending list of those deferred changes that need to be done. Now, what triggers flushing those deferred updates is the gin_pending_list_limit, which is four megabytes by default. So if the pending list hits that limit during an index update, Postgres knows it needs to do the work to commit the entries to the actual index instead of the pending list. Another option is the gin_clean_pending_list() function, which empties the pending list out and updates all of the index entries, and autovacuum running on a table with a GIN index will flush it as well. But this can also lead to some performance problems; they mentioned a problem that GitLab had with this recently and talk about the situation that they encountered with it.

However, there are some ways to handle the disadvantages of a GIN index. They list six different ways here that they propose you could do that. One, you could reduce the gin_pending_list_limit, and this may allow these flushes to happen more frequently and faster, as they mention here. Two, you could increase the limit so that maybe the flush doesn't happen during prime times, but you better control when it happens by making the limit larger. Three, you could turn off fastupdate, which no longer stores things in a pending list; it just does the work as it comes in. So basically, as they say here, you're quote "...taking the overhead of each individual INSERT/UPDATE" at the time it happens as opposed to deferring it. Four, you could tune autovacuum to run more frequently. Five, you could explicitly call the gin_clean_pending_list() function more frequently instead of relying on autovacuum. Or six, you could drop the GIN index. But this is a great blog post explaining GIN indexes, some potential downsides of them, and of course their advantages. So if you're interested in that, you can check out this blog post.
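The first, third, and fifth of those options can be sketched as follows; the index name is hypothetical, and the limit is given in kilobytes:

```sql
-- Options 1/2: lower (or raise) the flush threshold for one index.
ALTER INDEX docs_body_fts SET (gin_pending_list_limit = 512);

-- Option 3: disable the pending list entirely, paying the GIN
-- maintenance cost on every INSERT/UPDATE instead.
ALTER INDEX docs_body_fts SET (fastupdate = off);

-- Option 5: flush the pending list explicitly, e.g. from a scheduled job.
SELECT gin_clean_pending_list('docs_body_fts');
```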

The next piece of content- "Collation Stability". This is from rhaas.blogspot.com and he's talking about an issue of collations with regard to indexes: by default, Postgres uses the operating system libraries for handling collations, but it does have support for ICU-based collations. Now, these collations are important for the ordering of indexes, and unfortunately, if your collations change, say you do an operating system upgrade, even though you don't change the Postgres version, the collations can change if the underlying libraries change significantly. Then when you go to search for data using an index, you're going to get invalid results from it, which is a big issue. Unfortunately, what this article mentions is that, quote, "...the system Thomas Munro designed to detect changes in the collation version had enough problems that it had to be reverted".

So this was a change that would at least notify you that your collation versions had changed, which may impact your indexes. Unfortunately, there were some issues with it and they had to remove it. So this post doesn't have a good plan for handling this; I mostly wanted to showcase it to let you know that this is an issue that exists. One way to mitigate it is basically using a separate library like the ICU collations, using those instead of your built-in operating system library; that way you can specifically version control when a change happens. Basically, if you run into this and your indexes stop working after some sort of an upgrade, the solution is to re-index your text-based indexes. You wouldn't need to reindex, say, a Boolean field or an integer field or things like that; it's basically the things that get sorted, like your text-based fields. But if you want to learn more about these collation issues, I definitely encourage you to check out this blog post.
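As a sketch of that mitigation, you can pin a column to an ICU collation and, after a suspect upgrade, rebuild the text indexes; the collation, table, and index names here are hypothetical:

```sql
-- Define an ICU collation (requires Postgres built with ICU support).
CREATE COLLATION de_icu (provider = icu, locale = 'de-DE');

-- Tie a text column to it so ordering no longer depends on the OS libc.
CREATE TABLE people (name text COLLATE "de_icu");
CREATE INDEX people_name_idx ON people (name);

-- If a library upgrade does change collation behavior, rebuild the
-- affected text-based indexes.
REINDEX INDEX people_name_idx;
```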

The next piece of content- "PostgreSQL 14 and beyond". This is from postgresql.fastware.com. This is a really long post, talking about all the different features that have been introduced in Postgres 14, as well as some future features coming in Postgres 15. So what are some things that are being planned in the community at this time? So in terms of the big features they are talking about in 14, they're talking about the logical replication improvements, additional SQL features, data corruption fixes, indexing changes, extended statistics, as well as vacuuming. He follows up with performance improvements that were done. Now, in terms of 15, they again mentioned a lot of different things.

They're working on a lot of improvements, again focused on logical replication to make that better; server-side compression in the backup tooling, which would be great to see to make backups smaller; improvements in automatic switchover and failover from the primary to replicas; improvements in hash indexes; a shared-memory stats collector that keeps cumulative statistics in shared memory rather than passing them through a separate collector process; and changes with regard to parallelism, in terms of parallel inserts, COPY FROM, and things of that nature. So if you want to learn more about all the updates that came to Postgres 14 as well as potential future improvements, definitely check out this blog post.

The next piece of content- "PostgreSQL vs Python for data cleaning: A guide". This is from blog.timescale.com. This article talks a lot about data scientists using Python to prepare their data, particularly cleaning it before they start their analysis. Well, this shows you a way you could do it actually within PostgreSQL and it should be faster and more efficient because you're working right within the database as opposed to using external Python tooling to be able to do that kind of cleaning process. So if you want to learn more about this process, definitely check out this blog post.

The next piece of content- "PRIMARY KEYS VS. UNIQUE CONSTRAINTS IN POSTGRESQL". This is from cybertec-postgresql.com. They're talking about the difference between designating something as a primary key versus just creating a column with a unique constraint. Basically, the only difference is that the primary key is NOT NULL, whereas a column with a UNIQUE constraint can be NULL. Although you can, of course, specify it to be NOT NULL as well, that's basically the only difference.

The system does have a special designation for a primary key, so that is something that exists and it was mentioned later in this blog post or in the comments, I believe, that some tools rely upon that primary key. They'll give specific messages if one doesn't exist or it needs to be there in certain cases. So you do want to use a primary key generally with every table that you have because there are some tools and utilities that rely upon it. But in general, it's identical to a UNIQUE constraint that is NOT NULL. But if you want to learn more about that, you can check out this blog post.
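That near-equivalence, and the special designation, can be seen directly in the system catalogs; the table names here are just for illustration:

```sql
-- These two tables enforce effectively the same rule on id...
CREATE TABLE a (id int PRIMARY KEY);
CREATE TABLE b (id int UNIQUE NOT NULL);

-- ...but only table a's constraint is marked as a primary key
-- (contype 'p' vs 'u'), which is what tools look for.
SELECT conrelid::regclass AS tbl, contype
FROM pg_constraint
WHERE conrelid IN ('a'::regclass, 'b'::regclass);
```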

Next piece of content- "PostgreSQL 14 Database Monitoring and Logging Enhancements". This is from percona.com. They mention a number of the different enhancements that are directly related to monitoring and logging. The main one they're talking about here is the query identifier now being unique across the different statistics views, so you can make a note of it in the log, in EXPLAIN output, as well as in the pg_stat_activity view. The next thing they mention is that you now get I/O timings for autovacuum and autoanalyze, and there's also additional connection logging that you can enable. So if you want to learn more, you can check out this blog post.
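Here is a quick sketch of seeing that identifier in each of those places (the setting defaults to auto, which enables it when pg_stat_statements is loaded):

```sql
-- Make sure query identifiers are computed.
SET compute_query_id = on;

-- The identifier appears in pg_stat_activity...
SELECT pid, query_id, query FROM pg_stat_activity;

-- ...in EXPLAIN VERBOSE output as "Query Identifier"...
EXPLAIN (VERBOSE) SELECT 1;

-- ...and can be added to log lines via the %Q escape, e.g.
-- log_line_prefix = '%m [%p] %Q '
```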

The next piece of content- "Transition Tables in Incremental View Maintenance (Part II): Multiple Tables Modification case". This is from yugonagata-pgsql.blogspot.com and he's talking about the new feature that's being worked on: Incremental View Maintenance. This would allow you to create a materialized view that essentially keeps itself updated. This particular post is looking at a view that encompasses multiple tables and how the view is kept up to date when both of those tables are modified. So if you want to learn more about the status of this new feature, you can definitely check out this blog post.

Next piece of content- "PostGIS Day 2021". This is from blog.crunchydata.com and this blog post basically mentions a number of different presentations that were done on PostGIS Day. They talk about PostGIS for neuroscience, talking about different climate tools as well as different extensions in Postgres, and different types of mapping projects. So if you want to review these presentations, check out this blog post.

The next piece of content- "Tricks for Faster Spatial Indexes" is also from blog.crunchydata.com. They're talking about how, if you add randomization to your data before creating the indexes that are relevant to PostGIS, particularly GiST indexes, you can actually get performance improvements, and this blog post explains why that is the case. The performance was about 11% faster, so not a huge improvement. But if you can do this simple randomization before building these indexes, you'll get a speed boost. So if you're interested in learning more about that, check out this blog post.
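A minimal sketch of that randomization idea, assuming a hypothetical roads table with a geom column (the post itself has the full details):

```sql
-- Rewrite the rows in random order before building the GiST index.
CREATE TABLE roads_shuffled AS
SELECT * FROM roads ORDER BY random();

CREATE INDEX roads_shuffled_geom_idx
    ON roads_shuffled USING gist (geom);
```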

The next piece of content- "PERMISSIONS REQUIRED FOR POSTGIS". This is from rustprooflabs.com, and they're talking about the permissions required to get PostGIS installed and running. So if you're interested in that, you can check out this blog post.

The last piece of content. The PostgreSQL person of the week is Fabien Coelho. If you're interested in learning more about Fabien and his contributions to Postgres, definitely check out this blog post.