
Postgres Releases = Reindexing | Scaling Postgres 290

Join Over 1,000 Engineers & Get New Episodes Weekly!

In this episode of Scaling Postgres, we discuss my Postgres Performance Starter Kit, new releases of Postgres, removing JIT, and a lightweight message queue.

Content Discussed

YouTube Video

Podcast Audio

Transcript

Hey, we got some new releases. Unfortunately, you may have to do some reindexing. But before we get into that, I hope you, your friends, family, and coworkers continue to do well. Now, before we get into this first topic, I actually wanted to announce that I created a free mini-course. I'm calling it the PostgreSQL Performance Starter Kit. Now, it's called a starter kit because it's very introductory and, as you can see here, it's basically an intro to SQL, an intro to pg_stat_statements, and an intro to EXPLAIN, and it's less than 30 minutes of content.
Now, this is not the big performance optimization course that I'm planning, but it's for developers who don't have as much experience with these tools and want to get up to speed before taking that course. So if you're a regular viewer of Scaling Postgres, I don't think you'll learn very much from this mini-course, but feel free to share it with your developer friends in case they want to start learning more about PostgreSQL performance optimization. Now, with regard to the bigger course, I have a lot of information about that, but I'm going to put it at the end of this episode, so be sure to wait till the end to find out the details.

But to get started, the first piece of content is "PostgreSQL 16.1, 15.5, 14.10, 13.13, 12.17, and 11.22 Released!". This fixes three security issues as well as 55 bugs, predominantly in Postgres 16. And the first thing they mention here is that this is the final release of PostgreSQL 11. So if you're still on Postgres 11, you're going to want to upgrade as soon as possible. The security issues are a memory disclosure in aggregate function calls, which has a CVSS base score of 4.3; a buffer overrun from an integer overflow in array modification, which has a base score of 8.8, so the most severe; and lastly, the role pg_cancel_backend can signal certain superuser processes, which has a score of 2.2.

From what they're saying here, this last one won't affect most Postgres processes, but it may affect extensions that you're using. They have a number of the bug fixes and improvements listed here, so feel free to look that over. But of course, the most important part is the changes that will require reindexing. So first, all your GiST indexes need to be reindexed. Secondly, B-tree indexes that use the interval data type, and thirdly, BRIN indexes that use the date, timestamptz, and timestamp data types with a minmax_multi opclass need to be reindexed. So be sure to check your database and reindex those when you upgrade to these versions. They put a little reminder here that you've been able to reindex concurrently since version 12. But make note, I think it's not until version 14 that you can reindex partitioned tables concurrently as well. But check out this post if you want to learn more.
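Just as a quick sketch of my own (not from the announcement), here's roughly how you could list the GiST indexes and rebuild one without blocking writes; the index name in the REINDEX statement is only a placeholder.

```sql
-- Rough starting point: list GiST indexes (adapt the filter for the
-- B-tree-on-interval and BRIN minmax_multi cases as needed).
SELECT c.relname AS index_name
FROM pg_index i
JOIN pg_class c ON c.oid = i.indexrelid
JOIN pg_am am ON am.oid = c.relam
WHERE am.amname = 'gist';

-- Rebuild without blocking writes (REINDEX CONCURRENTLY is available
-- since Postgres 12; partitioned tables are supported from Postgres 14).
REINDEX INDEX CONCURRENTLY my_gist_index;  -- placeholder index name
```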

Next piece of content- "Django: Maybe disable PostgreSQL's JIT to speed up many-joined queries". This is from adamj.eu and he ran into an issue in his Django application where they had a crazy join here joining to I can't even tell how many tables. Maybe it was ten. But basically, the JIT compilation of the query was taking 99.7% of the runtime, which is insane. And with all of these joins, apparently there weren't that many rows. And he says down here, even though they had few results, this query took 3 seconds to JIT compile as he mentioned, but milliseconds to execute. But that's kind of crazy. And he said he's using pgMustard and they recommended either reducing cost estimates or disabling the JIT compiler. So he opted to do that because it was easier. And what it looks like he did, he changed his database connection options to turn off the JIT compiler, and that individual test that he was running that took 3 seconds, now took milliseconds.

So that definitely fixed the problem. But of course, what I'm thinking when I look at this is that you can also turn JIT off in a particular session, so that's one option, or for a particular user that connects to the database, you could turn it off just for that workload, or you could rethink everything. The reason it was doing all of these joins was a Django feature, multi-table inheritance, where the query joins all of the subclass tables, and apparently there are a lot of subclasses. So I would question whether you want to use the application and the database in this fashion. I mean, if it works for this use case, fine. But seeing this is a bit of a warning flag to me, because what other potential performance problems could come into play down the line? But check out this blog post if you want to learn more.
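Just for reference, here's my minimal sketch of those options in plain SQL; the role name is something I made up.

```sql
-- Turn JIT off for just the current session:
SET jit = off;

-- Or persist it for a particular application role (made-up role name):
ALTER ROLE reporting_app SET jit = off;

-- Or, instead of disabling JIT entirely, raise the cost threshold so it
-- only kicks in for genuinely expensive plans (the default is 100000):
ALTER SYSTEM SET jit_above_cost = 500000;
SELECT pg_reload_conf();
```
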
Next piece of content- "PGMQ: Lightweight Message Queue on Postgres with No Background Worker". This is from tembo.io, and we've discussed this in previous episodes of Scaling Postgres, this PGMQ extension, and they're saying it is a self-regulated queue, so it has no background worker to check or update statuses. So basically, it is a queue system that uses FOR UPDATE and SKIP LOCKED to be able to lock a given entry so no one accesses it, as well as skip any that are already locked. So you're just moving down the line with each item in the queue.

But what's new to me is the self-regulating part, in that they're using a visibility timeout. So apparently, once something is grabbed from the queue, it's marked invisible to all other workers for a period of time. The entry stays in the queue, it's just marked invisible once it's picked up, and then that message is archived or deleted once the job is finished. If it fails for any reason, it just becomes visible again for, presumably, other workers to pick up. So with this implementation, they don't need a background worker checking the state of things. I found this pretty interesting. Check out this blog post if you want to learn more about it, or there's a link to the extension right here.
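To make the pattern a bit more concrete, here's my rough sketch of a visibility-timeout read in plain SQL. This is not PGMQ's actual code, and the table and column names are hypothetical.

```sql
-- Claim the next visible message by pushing its visibility time (vt) into
-- the future so other workers skip it; SKIP LOCKED avoids waiting on rows
-- another worker is claiming at the same moment.
UPDATE message_queue q
SET vt = now() + interval '30 seconds'
WHERE q.msg_id = (
    SELECT msg_id
    FROM message_queue
    WHERE vt <= now()
    ORDER BY msg_id
    LIMIT 1
    FOR UPDATE SKIP LOCKED
)
RETURNING q.msg_id, q.message;

-- When the job succeeds, archive or delete the row; if the worker dies,
-- the message simply becomes visible again once vt passes.
DELETE FROM message_queue WHERE msg_id = 1;  -- placeholder id
```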

Next piece of content- "Tuple shuffling: Postgres CTEs for Moving and Deleting Table Data". This is from crunchydata.com. This blog post describes exactly what it's doing. So he's showing examples of using CTEs to move data around. Now the benefit of this is that it all happens within one transaction. So if you wanted to archive some data, for example, you could do this in a delete statement, and they're just deleting data that hasn't been modified in a year. You use the RETURNING function to return all the columns from what was deleted, and then you insert that into the archive and you're just selecting all those columns from the deleted to put in there, assuming the schema is the same. And he says you can also filter the data that you're actually inserting into the archive by using a WHERE clause as well. So here, you can only archive the important data from what was deleted.

He shows an even more complicated example where he's deleting from multiple tables and inserting the rows into a single archive table using UNION ALL. But just keep in mind, this all happens in a single transaction. So if anything should fail within any part of this statement, it'll just roll back and no changes will have been made. That's a big benefit of this. Next, he goes into updates. Here he's updating some balances and returning the modified rows from this query as part of the CTE. And with this data he inserts into an awards table, so once a balance update crosses a certain amount, they earn certain awards. Then he looks at partitioning, where he's deleting the source rows and inserting them into partition tables based upon the category. Most of the partitioned tables I deal with have a lot of data, so I couldn't imagine using this for that use case, but it's just an example of how it works. So check out this blog post if you want to learn more.
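And here's a similar sketch for the update case, again with hypothetical table and column names: the CTE returns the updated rows, and the outer INSERT only records awards for balances that crossed a threshold.

```sql
WITH updated AS (
    UPDATE accounts
    SET balance = balance + 100
    WHERE id = 42
    RETURNING id, balance
)
INSERT INTO awards (account_id, awarded_at)
SELECT id, now()
FROM updated
WHERE balance >= 1000;  -- only rows that crossed the threshold earn an award
```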

Next piece of content- "Ruby on Rails Neighbor Gem for AI Embeddings". This is from crunchydata.com and the primary author of the pgvector extension, Andrew Kane, also writes a lot of Ruby Gems, which are libraries for the Ruby language, because I think that's what he does at Instacart. And he's developed a new gem called Neighbor. Why? Because it helps do nearest-neighbor searches. So if you're familiar with the Ruby ecosystem, you just set this up in your Gem file to get the library, and it makes it really easy to do database migrations and rails with the pgvector extension. So here you can see you're defining a vector that you're calling embedding and what the size limit is, and it even understands the index type. So here it's creating an HNSW index with the particular vector 12 ops opsclass, as well as giving you options to define embedding in your model and then being able to find the nearest neighbors using euclidean or cosine. So if you use Ruby on Rails, this is a great library to add to your environment if you're using the pgvector extension. So check out this blog post if you want to learn more.

Next piece of content. There was another episode of Postgres FM last week. This one was on "Data model trade-offs". So Nikolay and Michael were talking about different trade-offs in how you structure your data. Should you put a lot of columns in one particular table? Should you move columns into a separate table based on index usage, data usage, and how it's being queried? And how far should you normalize or denormalize? Really, a lot of it is "it depends", based on your use case and where the hotspots are in your application. But I found this a great episode. You can go ahead and listen to it here or you can watch the YouTube video here.

Next piece of content- "Introducing Dynamic PostgreSQL: How We Are Evolving the Database to Help You Sleep at Night". This is from timescale.com and pretty much this blog post can be summed up by this graph here and the statement "Buy the Base, Rent the Peak". So what they're doing is offering a dynamic computational load with a particular range. So you purchase a base amount of performance and it will never drop below that. So, so many CPUs, for example. But then if a load increases to a certain level, it will dynamically ingest and give you more compute resources up to a particular peak. And once the load is subside it will drop back down to your base level. They did this because they saw a lot of customers that purchased their services to handle the peak load, which is typically what you do with databases unless you have this kind of dynamic capability.

And this is something Timescale has done. Now this, to my understanding, is for their Timescale Cloud product, so I don't think it's inherent in their open-source offering. And what they state is that, quote "...customers running production workloads will save 10-20% using this solution when migrating from AWS RDS for PostgreSQL and 50-70% when migrating from AWS Aurora Serverless". So they definitely have a lot of information covering this new capability, which I encourage you to check out, because it could offer a novel way to help you reduce costs based on your use case. So check out this blog post if you're interested.

Next piece of content- "USING MULTIPLE TRIGGERS ON THE SAME TABLE IN POSTGRESQL". This is from cybertec-postgresql.com and this is talking about trigger ordering on the same table. And the rule, he says right here is that "PostgreSQL first executes statement-level BEFORE triggers, then row-level BEFORE triggers, then row-level AFTER triggers, and finally statement-level AFTER triggers". And if you have multiple triggers of the same type, it just does it alphabetically. So he shows a very quick example here where he adds three different BEFORE INSERT triggers and two AFTER triggers and applies them in reverse alphabetical order.

So C, B, A for the BEFOREs and C, B for the AFTERs. But when you actually insert into the table, the BEFORE triggers fire first, then the AFTERs, and within each group it's always in alphabetical order (A, B, C, then B, C), not based upon how they were created. Now he puts a note down here, quote "...PostgreSQL execution order diverges from the order decreed by the SQL standard", so that's something to be aware of: PostgreSQL does it alphabetically. But if you want to learn more, definitely check out this blog post.
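Here's a tiny illustration of that ordering rule with made-up trigger names: two BEFORE INSERT row triggers created in reverse alphabetical order still fire as trig_a, then trig_b.

```sql
CREATE TABLE t (id int);

CREATE FUNCTION log_trigger() RETURNS trigger
LANGUAGE plpgsql AS $$
BEGIN
    RAISE NOTICE 'firing %', TG_NAME;
    RETURN NEW;
END;
$$;

-- Created in reverse alphabetical order...
CREATE TRIGGER trig_b BEFORE INSERT ON t
    FOR EACH ROW EXECUTE FUNCTION log_trigger();
CREATE TRIGGER trig_a BEFORE INSERT ON t
    FOR EACH ROW EXECUTE FUNCTION log_trigger();

-- ...but they fire alphabetically:
INSERT INTO t VALUES (1);
-- NOTICE:  firing trig_a
-- NOTICE:  firing trig_b
```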

Next piece of content- "SQL Scoping is Surprisingly Subtle and Semantic". This is from buttondown.email. So if you're looking for a brainstorming puzzle, I think this probably fits the bill. So he's proposing looking at a select statement such as this and saying, 'Okay, what should the result of this query be? And then what's this one and this one and this one?'. And he actually just has some text here that you have to wade through before you get to the answer. And then he goes more in-depth about it. So I won't spoil it here, but if you want to learn a little bit more about how Postgres handles some of these things, definitely check out this blog post.
Next piece of content- "Writing a storage engine for Postgres: an in-memory Table Access Method". This is from notes.eatonphil.com and this is a tour de force blog post. This is super long as you can tell, and he did a lot of work to actually create an in-memory table access method. So if you want to go through his process and try to understand how he did that, and maybe you want to play around and create your own, definitely encourage you to check out this blog post.

Now, talking about my course that's coming up. First, I wanted to say that I really want to make this course special, so it's not going to be like a book or necessarily like other online video courses you may have taken. Now, it will still be a video course, and the lessons will be similar to how they're done in the free mini-course I have here. But there's going to be a live component to it, so I plan to release modules weekly and have Q&As along with them. I might even do some live lessons; I haven't determined that yet. But don't worry, anything that's live will definitely be recorded and be a part of the course. My plan is to create an as-realistic-as-possible, multi-terabyte database with billions of rows for testing and training purposes.

So I really want to see where the hang-ups are and find ways to make certain queries faster. Now, the planned course start date is the end of January, so probably January 29, 2024, which is about ten weeks away. However, I am looking for some early adopters to sign up for the course and provide feedback as I develop it. If you want to be an early adopter, I will be offering a 50% discount on the course, and details about this offer will be available on Black Friday, as well as Cyber Monday. For those of you outside of the US, that is 11/24 and 11/27, so about a couple of weeks away. But if you are interested in the course and you haven't signed up for my newsletter yet, please go ahead and do that because I'll probably be announcing some things there as well.
