PostgreSQL 16 Beta 1, Rust Functions, Partitioning Memory Problems, Tags & Arrays | Scaling Postgres 267
Get New Episodes Weekly!
In this episode of Scaling Postgres, we discuss the release of PostgreSQL 16 Beta 1, creating Rust functions with PL/Rust, memory problems related to partitioning and prepared statements, and modeling tags with arrays.
- PostgreSQL 16 Beta 1 Released!
- Postgres 16 Release Notes
- PostgreSQL 16: part 5 or CommitFest 2023-03
- More Memory, More Problems
- Postgres: on performance of prepared statements with partitioning
- A Postmortem on High Memory Usage with Prepared Statements
- Build high-performance functions in Rust on Amazon RDS for PostgreSQL
- Tags and Postgres Arrays, a Purrrfect Combination
- PGSQL Phriday #009 Invitation: Making Changes
- 5 Ways to Get Table Creation Information in Postgres
- Huge Pages and PostgreSQL
- Tuning huge pages in Postgres
- How to Secure PostgreSQL: Security Hardening Best Practices & Tips
- ChatGPT plugins now support Postgres & Supabase
- Using v4 of the Postgres H3 extension
- Florin Irion
- Monetizing Your Content Part 2 With Seb Wilgosz | Rubber Duck Dev Show 86
All right. I hope you, your friends, family, and coworkers continue to do well. Our first piece of content is "PostgreSQL 16 Beta 1 Released!". This is from postgresql.org. They show a number of the feature highlights down here. There are a lot of areas with regard to increased parallelism for certain queries that it executes and aggregate functions. And it also mentions "PostgreSQL 16 can also improve the performance of concurrent bulk loading of data using COPY up to 300%".
I tried to remember an article that covered this and there's no reference that I see. I looked in the release notes and I don't see anything mentioned, so if you know what they're talking about here, maybe comment or hit reply if you're on the Scaling Postgres mailing list. I'd be interested to know what they're talking about because I have a particular use case and I would be highly interested in knowing how to get improved copy performance.
They talked about the replication enhancements in terms of being able to sync from a standby, which you couldn't do previously, as well as other performance improvements. They mention enhancements to the developer experience, different security features, monitoring and management, et cetera. So feel free to check out this post as well as the release notes. So this is the beta version of the release notes here that goes into more detail.
And the next piece of content related to this is postgrespro.com, which released a post talking about the CommitFest that happened in March of 2023. So some of these features may actually have ended up in Postgres 16. But check out these pieces of content if you're interested in that.
Next piece of content- "More Memory, More Problems. The single Postgres setting that almost took down our database". This is from medium.com, specifically from The Prefect Blog. But what this post is actually talking about is that using partitioning in prepared statements can result in memory problems. So this basically is a post-mortem of an incident that happened for this organization. And they were seeing high memory utilization and you could see as high as 70-80%, which is really high, and then trying to figure out what the cause is.
Now it wasn't your typical suspects, but what it ended up being is that they were implementing partitioning and the framework that they were using was using prepared statements. And that resulted in excessive memory usage due to caching of plans. So basically, Postgres runs a statement, I believe, five times, and on the sixth iteration, it creates a generic plan. The more partitions you have, it can result in poor performance.
But again this is a post-mortem, so I definitely encourage you to read the whole thing to understand all the different things they tried and what they suspected or not suspected. But what caused the memory droppage is they actually added telemetry to queries and it actually busted the cache of their prepared statements. So basically they were all custom plans essentially, and that's what caused this. So ultimately, the way that they resolved it is just by changing the plan cache mode to force a custom plan.
And with their partition tables, that helped with their memory usage issues. And to see this in some more detail, this is the post to look at from amitlan.com. And I believe we covered this in a previous episode of Scaling Postgres. And you can see here this wasn't a memory issue but a latency issue on how you increase the number of partitions with prepared statements and it kept going up. And you can see here the comparison between the generic plan, where latency keeps increasing. Whereas the custom plan was pretty flat. But he actually did a patch to correct this.
As you can see generally, the generic plan is more efficient, at least in terms of this latency benchmark than the custom plan but you can't see that because of the issue with the generic plans. But this hasn't made it into Postgres yet. Now I also encourage you to check out Lucas's perspective on the subject with his "Five minutes of Postgres" where he talks about a post-mortem on high memory usage with prepared statements.
And he said that patch actually didn't make it into 16, but he's hoping that it makes it into Postgres 17. So definitely something to keep in mind if you're using prepared statements and partitioned tables. A lot of my clients, because they use PGBouncer in transaction pooling mode, actually have already cut off prepared statements. So there shouldn't be as much of the risk of that. If you've already cut them off for your framework, hopefully this type of issue wouldn't affect you. But definitely check out these pieces of content if you're interested.
Next piece of content- "Tags and Postgres Arrays, a Purrfect Combination". This is from crunchydata.com. And they're talking about a scenario where you're wanting to tag something in your database. And in this scenario, they're talking about cats and you have a cats table and then you have a cat_tags table that links to a tags table. So this way you can uniquely see the tags in the system but yet linked to two different objects like, in this case, cats.
And this is typically how you would set up a relational model but this has a number of downsides. Number one- it gets pretty large in that this is almost half a gigabyte of data for 1.7 million cats and 4.7 relationships. And querying it, when you're looking for information on a single cat, it takes under a millisecond. But if you're wanting to look at the cats that have a tag or two, it goes from half a second into almost a full second when you're looking for cats that have just two or three tags.
And the queries get kind of cumbersome too if you look at this. So the next thought is could we use an array? In this case, he wanted to use an integer array, but still keep the tags table. So still, you have a unique reference of all the tags that are in the system. But then it's just an array on the actual table that tells you what tags a particular cat has. Now this is smaller, less than half the size, and of course, you do need to place a GIN index on this array.
And the queries are relatively performant when looking at a number of tags. For example, finding a cat with three tags is seven times faster than the previous version. But the SQL is still relatively complicated to use in my opinion. I mean it's probably fewer lines of code. But then they said well what about just using straight text in the array? So using a text array to list each tag. Now there was a little bit bigger than the integer version, but not by much, and finding out what tags a cat has, it couldn't be simpler than this query here.
You basically list out the tags that are in there. And then finding the count of cats that have three particular tags on it is super easy to write. So I really like how simple it is to write these queries. The size is about the same, much smaller than the relational, typical way you would do it. And the performance is equivalent to the other array. So it's basically much faster than doing it the relational way. So in my opinion, this looks like a win all around. And what he lists down here is that it's faster to query, smaller to store, and simpler to query, so all wins.
The main disadvantage from the text perspective is that you no longer have unique tags in the system. But you could probably easily construct this by maintaining a separate table that you don't have to join to every time. It just has a record of tags that have been added. Yes, it would be something additional to maintain but do you even really need that for some use cases? But definitely, I like this perspective on creating tags on tables.
Next piece of content-"pgsql phriday #009 invitation: making changes". This is from di.nmfay.com. And basically, since we're coming up at the end of the month, we're getting ready to start the next PGSQL Phriday. So the topic for this one is "database change management" or database migrations, or she says, "schema evolution". So how do you impart changes to your database to the schema structure? So that's what this blogging event is all about and if you're interested, go ahead and follow the instructions and make your post by Friday June 2nd.
Next piece of content- "5 Ways to Get Table Creation Information in Postgres". This is from crunchydata.com. And I would probably rename this post to "Five ways to try to get table creation information". Because there's no great way to do it and in each of these scenarios, there are downsides to all of them. But he's pretty thorough in trying to find ways to find this type of DDL information. The first one he says is Logging. So if you have your log statements set to at least DDL, you can at least see that a table was created.
Of course, the downside of this is- how often do you purge your logs? So if you're looking for a table that's existed for a while, do you still have the logs that record when it happened? But this is one good way to track changes to the scheme in the database. The second one is the Operating System File. So you can actually look in the operating system to see the dates on which things happened. But the downside of this is that, if a table is rewritten, it creates a whole new file.
So if you have done VACUUM FULL or a CLUSTER on it, or what if you have a replica that you set up and you failed over to that new replica, I'm sure those dates are no longer going to be accurate. But it goes into some more detail about that. The next one he talks about is System Catalogs. So this is basically looking at different tables and trying to draw conclusions based upon the OID as to when the table was created. Then maybe get a sense of when they were created based upon that. The next one is Parsing WAL Files.
So again, this is, if you've kept them around, these are things that frequently get archived and then ultimately destroyed. But if you have them, you can use the tool pg_waldump to be able to scan through those files. And then the last thing he mentioned is Backups. So if you've kept your backups around a while and you have a catalog to be able to scan through and look at the different files and when they were created, you could potentially do it that way. But if you want to get more details about that definitely check out this blog post.
Next piece of content- "HUGE PAGES AND POSTGRESQL". This is from cybertec-postgresql.com. And they're talking about setting up Linux huge pages on your system for Postgres. And this post goes into a lot of depth about huge pages, why they exist, and how to determine if they're set up on your system. So they have some queries to look out for here as well as how to set it up. Now they say that because of an enhancement to 15, it makes it super easy to do the calculations to determine how many you need. Basically, you just take the Postgres binary, pass it into the shared buffers you're going to use, the data directory, and then pass in C with the shared_memory_size_in_huge_pages and it outputs how many you need to set.
Then you need to set the huge pages parameter in postgresql.com. Normally it is on try, so you could choose to turn that on if you want to make sure they're definitely getting the huge pages. So once you enable your huge page settings on your operating system, you can check it using this proc/meminfo command here. And you can set them up to persist on a restart and after you've configured your PostgreSQL settings, now you can actually start the database. So this will require a database restart, if not a stop and a start, to make sure that this is enabled.
And they follow up by talking about being able to also configure min_dynamic_shared_memeory, which is available in 14. So if you want to enable that, it may require a few more huge pages that you need to configure. And lastly, they show what can you do if you're earlier than version 15 to set this up. So definitely check out this blog post if you want to set that up, as well as the next "Five minutes of Postgres" by Lucas called "Tuning huge pages in Postgres". And this is on the pganalyze.com website. So he goes into more detail about this in his piece of content as well.
Next piece of content- "How to Secure PostgreSQL: Security Hardening Best Practices & Tips". This is from enterprisedb.com. And this is a pretty tour to force post of going through all the different levels of security you can set in Postgres. So he talks about physical access when you're actually connecting to it, the different ways to connect, either within the operating system or exterior TCP/IP socket. Talking about the firewalls that you would need, talking about encryption in transport, this would be setting up TLS typically, defining authentication, and setting up who can connect by a pg_hpa.com.
And determining where those connections can come from, what users can connect, and how they can authenticate themselves, either using passwords or other types of single sign-on systems. And then talking about roles and then what rights and privileges each role has, setting up password policies, as well as defining access control, so what objects do these roles have access to. And then following up, of course, with role-level security. So a very comprehensive blog post of everything related to securing your Postgres installation.
Next piece of content- "ChatGPT plugins now support Postgres & supabase". This is from supabase.com. And they've created a retrieval plugin for ChatGPT. So basically, this allows you to use a local Postgres instance as a private data source for ChatGPT to query as a part of it answering questions. So they go through the process of setting this up and getting it connected to ChatGPT. If you're interested in that you can check out this blog post.
Next piece of content- "USING V4 OF THE POSTGRES H3 EXTENSION". This is from blog.rustprooflabs.com. And he's talking about the H3 Postgres extension, which is something that Uber developed for doing a hex grid and PostGIS. Well, he discusses his experience with V4, where he says quote "unfortunately they changed every function name used in my original blog post". So basically, this is a review of it using V4. So he goes into that in this blog post so check this out if you're interested.
There was another episode of Postgres FM last week. This one was on "Corruption", where they discussed the various types of corruption, how they can come about, and what to do and not to do if you come across it. So you can listen to the podcast or check out the YouTube video down here.
Next piece of content. The PostgreSQL person of the week is Florin Irion. If you're interested in learning more about Florin and his contributions to Postgres, definitely check out this blog post.
And the last piece of content, we did have another episode of the Rubber Duck Dev Show this past Thursday afternoon. This one was on "Monetizing Your Content Part 2 With Seb Wilgosz". So if you produce content as a developer, we discuss how you could potentially monetize that content. If that's of interest to you, we welcome you to check out our show.