
Tidy Vacuum, Dropping Roles, Merge Command, PgBouncer Tutorial | Scaling Postgres 209


In this episode of Scaling Postgres, we discuss how to optimize vacuum, how to drop roles, the new MERGE command in PG15, and a PgBouncer tutorial.


Transcript

All right. I hope you, your friends, family, and coworkers continue to do well. Our first piece of content is "PG Phriday: Tidying up with VACUUM". This is from enterprisedb.com, and this is a very good post about how to configure vacuum. It's a follow-up to their post on XID wraparound and how to deal with that, which we covered last week on Scaling Postgres, and this one goes into the configuration of vacuum itself. The first section he talks about is "Releasing the Throttle". So basically, how do you get vacuum to do more work? Because by default, it's configured to impact the system very minimally, so it can work on a variety of server or machine types. But these are the different parameters you can adjust to make vacuum run faster.

Now, the number one adjustment that I tend to make is raising the vacuum cost limit, the autovacuum_vacuum_cost_limit, because once you increase it, more work can be done per vacuum cycle. The way vacuum works is essentially it has a set of tasks it has to do, and each of those tasks is assigned a cost. So here's the vacuum_cost_page_hit, page_miss, and page_dirty. Once those costs reach the limit, it pauses for the delay, so it stops doing work for a time. If you increase the limit, you can also adjust the costs to be lower. However, I tend to increase the limit and minimize the delay.

That way, you can get vacuum to do more work in a given unit of time. They also say that older versions of Postgres had the delay set to 20 milliseconds, but as of Postgres 12, they moved the default to 2 milliseconds. So a much shorter delay, given the faster storage systems today. Then he goes into the section "Expanding the Workforce". So by default, you get three autovacuum workers to vacuum your different tables. I usually leave it at that, maybe push it to five, but not much more than that. It also depends on the tables in the customer's database, because if they have one huge table and then a bunch of smaller tables, then three may be sufficient.
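As a rough sketch of what that tuning might look like, here are the relevant settings applied with ALTER SYSTEM. The values are placeholders for illustration, not recommendations, and should be tuned for your own hardware and workload.

```sql
-- Illustrative values only; tune for your own hardware and workload.
ALTER SYSTEM SET autovacuum_vacuum_cost_limit = 2000;   -- default -1, which falls back to vacuum_cost_limit (200)
ALTER SYSTEM SET autovacuum_vacuum_cost_delay = '2ms';  -- default is 2ms as of Postgres 12 (was 20ms before)
ALTER SYSTEM SET autovacuum_max_workers = 5;            -- default 3; changing this requires a restart
SELECT pg_reload_conf();                                -- the cost settings take effect on reload
```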

But if you have a lot of partitioned tables and none of them are really large, then more autovacuum workers make more sense. The thing to keep in mind is that adding more workers doesn't make things go faster, because they share the global cost limit and cost delay settings you set up here. The next section they cover is "Factoring in Thresholds". These settings determine when autovacuum should process a table. You can set the scale factors and the vacuum thresholds to determine how often tables in your database get vacuumed, and you can also set these per table, as shown below. So those are settings you can use to fine-tune when a particular table gets vacuumed.
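For reference, autovacuum kicks in on a table once dead rows exceed the threshold plus the scale factor times the table's row count. A minimal sketch of overriding that per table, using a hypothetical table name and example values:

```sql
-- Hypothetical table; the values are examples, not recommendations.
ALTER TABLE big_events SET (
    autovacuum_vacuum_scale_factor = 0.01,  -- vacuum once ~1% of rows are dead (default 0.2)
    autovacuum_vacuum_threshold    = 1000   -- plus this base number of dead rows (default 50)
);
```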

The last section he covers is "Ahead of the Pack", which basically defines at what point freezing should be done. Now, by default, this kicks in at 200 million transactions, but with a very active OLTP system, he suggests putting this up to, say, a billion, or at least something north of that 200 million. You definitely want to stay well away from the 2 billion wraparound limit, but you should be able to move it, give yourself a little more headroom, and not have to freeze as often. So this was a great post about how to configure vacuum, and if you're interested, I definitely suggest you check it out.
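For example, raising that trigger might look something like this; the 1 billion figure is just the ballpark mentioned in the post, not a recommendation.

```sql
-- Raise the anti-wraparound freeze trigger from the 200 million default,
-- while staying well below the ~2 billion wraparound limit.
ALTER SYSTEM SET autovacuum_freeze_max_age = 1000000000;  -- takes effect only after a server restart
```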

Next piece of content- "HOW TO DROP ROLE OR DROP USER IN POSTGRESQL". This is from cybertec-postgresql.com and basically, roles own objects in the database. And when you try to drop a role, it won't drop those objects because there are objects that are dependent upon it. Now, with other parts of the system you can do a CASCADE, but why doesn't that work? He mentions here that because roles are cross-database in a Postgres data cluster, so therefore, you can't go into each database and delete all those objects or it doesn't support that. Because as he says here, "An SQL statement can only affect objects in the database to which you are connected".

So you can't really drop a role in a way that also drops all its objects across multiple databases. Then he talks about how you can avoid trouble with DROP ROLE, and what he advocates is setting up group roles. So these are separate roles that own all the objects in the databases, the schemas, the tables, whatever you're creating. These group roles own those objects, and the user roles are just members of those roles. So essentially the users of the system never really own anything, so they can easily be dropped if you need to. He also says there are two commands you can use to drop or reassign objects more easily. One is DROP OWNED BY, and this drops all the objects owned by a particular role. Now, this works per database, so you basically have to go into each database and run DROP OWNED BY for a particular role to drop all the dependent objects of that role.
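A minimal sketch of that group role pattern, with hypothetical role names:

```sql
-- Hypothetical role names for illustration.
CREATE ROLE app_owner NOLOGIN;            -- group role that owns all the objects
CREATE ROLE alice LOGIN PASSWORD 'xyzzy'; -- user role that owns nothing itself
GRANT app_owner TO alice;                 -- alice can create and use objects as a member

-- Later, in each database the user touched, clean up anything it does own:
DROP OWNED BY alice;
```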

You can also do a reassignment, so you can use REASSIGN OWNED BY a particular role to reassign its objects to, say, a group role. And he talks about the different ways you can do that here. But he did say there are a few objects left behind by DROP OWNED BY. One is databases that are owned by the role, but those can easily be addressed by transferring ownership or deleting the database if it's not needed, as well as tablespaces. But again, you should be able to reassign ownership or drop those separately. Then he says, but what if I still get an error when I try to drop the role "postgres"? I don't know when you would want to do this, but he shows there is a way, I guess if you really need to. This was another great post and I highly suggest you check it out.
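Put together, the cleanup sequence might look like this, again with hypothetical names; REASSIGN OWNED BY and DROP OWNED BY have to be run in every database where the role owns something.

```sql
-- Run in each database where the role owns objects.
REASSIGN OWNED BY alice TO app_owner;  -- hand object ownership to the group role
DROP OWNED BY alice;                   -- remove remaining privileges and default ACLs

-- Once no dependencies remain anywhere in the cluster:
DROP ROLE alice;
```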

Next piece of content- "Waiting for PostgreSQL 15- Add support for MERGE SQL command". This is from depesz.com. So this is a SQL standard command that's being added to Postgres called MERGE. So this is huge. You can think of it like INSERT ON CONFLICT DO something like do an update or do nothing on steroids. So basically, it enables you to do a statement and have a bunch of conditions. It could either do a DELETE, UDATE, INSERT or even do nothing. So they have an example here. When something matches and this condition matches, do an update.

When something matches, then do a delete. When something is not matched with an additional condition, do an insert, and when it's not matched otherwise, then do nothing. So this is kind of like an else. It allows you to define a modification that should be made to the database and the conditions under which rows will be inserted, deleted or updated. He goes through additional examples here. So this is huge, and he even says huge thanks to all involved, which I definitely echo. It looks like this is going to be getting into Postgres 15. There are also separate docs on the MERGE command that go into more detail and include some additional examples. So if you're interested in that, definitely check out these pieces of content.
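As a rough sketch of the shape of the command, here's a hypothetical example syncing a staging table into a target table; the table and column names are made up for illustration.

```sql
-- Hypothetical tables: apply staged changes to an accounts table.
MERGE INTO accounts AS a
USING staged_accounts AS s ON a.id = s.id
WHEN MATCHED AND s.deleted THEN
    DELETE                                         -- matched and flagged: remove the row
WHEN MATCHED THEN
    UPDATE SET balance = s.balance                 -- matched otherwise: update it
WHEN NOT MATCHED AND NOT s.deleted THEN
    INSERT (id, balance) VALUES (s.id, s.balance)  -- new row: insert it
WHEN NOT MATCHED THEN
    DO NOTHING;                                    -- anything else: leave it alone
```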

The next piece of content is a YouTube video, and I was actually directed to this by one of the listeners of Scaling Postgres, recommending that I check out this YouTube channel. This is the High-performance Programming YouTube channel, and on about a weekly basis they've been posting videos about Postgres, at least more recently. The video they released this week was a "PgBouncer Tutorial", and they show graphically how it works, as well as going through the commands to set up PgBouncer. They even did a scaling test showing much higher performance with PgBouncer, since it avoids the overhead of constantly creating and dropping connections when talking to Postgres directly. So if you're interested in that, you can definitely check out this piece of content.

So the next set of posts are all from depesz.com, and they're talking about new features coming in PostgreSQL 15, specifically around JSON, adding a bunch of features to Postgres that support the SQL/JSON standard. The first is "Waiting for PostgreSQL 15- SQL/JSON constructors". This gives you SQL-standard constructors for JSON: JSON, JSON_ARRAY, JSON_ARRAYAGG, JSON_OBJECT, and JSON_OBJECTAGG. Now, a lot of the functions to do some of these things already exist, but these are the SQL standard functions that have been added, and he goes through the process of using them.
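As a rough sketch of what those constructors look like, based on the SQL-standard syntax the post describes (the kv table and its columns are hypothetical):

```sql
-- SQL/JSON constructors.
SELECT JSON_OBJECT('id': 1, 'name': 'widget');  -- build an object from key/value pairs
SELECT JSON_ARRAY(1, 2, 'three');               -- build an array from a list of values
SELECT JSON_OBJECTAGG(k: v) FROM kv;            -- aggregate rows into a single object
SELECT JSON_ARRAYAGG(v) FROM kv;                -- aggregate rows into a single array
```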

The next post is "IS JSON predicate", and this allows you to test whether a value IS JSON, IS JSON ARRAY, IS JSON OBJECT, IS JSON SCALAR, et cetera. So it lets you do these types of tests, and again, he goes through some examples in the post. The third post is on "SQL/JSON query functions". This adds the query functions JSON_EXISTS, JSON_QUERY, and JSON_VALUE, so you can evaluate JSON with them, and he gives some further examples of how to use these functions. So if you're interested in these new SQL standard JSON capabilities coming to Postgres, definitely check out these blog posts.
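Here is a brief sketch of how those predicates and query functions read; the events table and its jsonb payload column are hypothetical, and the full option syntax is covered in the posts.

```sql
-- Hypothetical events table with a jsonb column named payload.
SELECT payload IS JSON OBJECT FROM events;                          -- predicate test
SELECT JSON_EXISTS(payload, '$.user.id') FROM events;               -- does the path exist?
SELECT JSON_VALUE(payload, '$.user.id' RETURNING int) FROM events;  -- extract a scalar value
SELECT JSON_QUERY(payload, '$.user.tags') FROM events;              -- extract an object or array
```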

The next piece of content- "Parallel commit in postgres fdw". This is from highgo.ca. This is another post about something coming in Postgres 15 which is the ability to do a parallel commit. So if you're doing a commit and you have multiple remote databases that you're connecting to via a Postgres Foreign Data Wrapper, there is a parallel commit function that enables you to commit to these in parallel. And it looks like it does give you a little bit of a speed boost because of that. So if you're interested in that, you can definitely check out this blog post.

The next piece of content- "How Postgres Chooses Which Index To Use For A Query". This is from pganalyze.com and this is an in-depth post that goes deeply into the internals and looks at a lot of source code to follow through the path of how Postgres chooses which index to use for a query. So if you want to know how Postgres does it at a detailed level based on the source code, this is definitely a blog post to check out.

The next piece of content- "Queries in PostgreSQL: 3. Sequential Scan". This is from postgrespro.com. This is another in-depth article about how sequential scans work within Postgres. So again, if you're looking for a more detailed blog post about how the internals of Postgres work, this is another excellent post to check out.

The next piece of content- "Ultimate Guide to Citus Con: An Event for Postgres''. This is from citusdata.com and this post is about the upcoming Citus Con which is April 12, 2022. That covers, of course, their Citus product, but a lot of Postgres content as well. So they cover the three keynotes and then multiple different talks about Citus as well as all of the different Postgres talks. So if you're interested in learning more about the content at this conference, you can definitely check out this blog post.

The next piece of content- "Identify PostgreSQL Performance Bottlenecks With pg_stat_statements". This is from timescale.com and they're talking about how this has been recently added to their Timescale Cloud account, but this is also applicable for Postgres as well. I always recommend clients enable pg_stat_statements and use it because it's the best way to find out what queries are using the most time in your database. They show you some of the commands that you can use to run or evaluate the different queries and the different rows that are returned, how to identify long-running queries, and how to examine your hit cash ratio for different queries.

But then, interestingly, they also discuss looking at queries with a high standard deviation. So looking at averages is one thing, but you might miss queries that are incredibly slow with certain data. In other words, there's a wide gap between the average and the longest-running instance of that query. They have a technique where you divide the standard deviation of the execution time by the mean execution time to get a coefficient of variance. This is a way to identify those queries that aren't just in and around the mean but have a wide variance. They also give an example of some slower queries. But if you're interested in learning more, you can check out this blog post.
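A rough sketch of that coefficient-of-variance idea, again against the Postgres 13+ column names; the calls floor is an arbitrary illustrative filter.

```sql
-- Statements whose runtime varies the most relative to their average.
SELECT query, calls, mean_exec_time, stddev_exec_time,
       stddev_exec_time / NULLIF(mean_exec_time, 0) AS coeff_of_variance
FROM pg_stat_statements
WHERE calls > 100                      -- skip rarely-run statements
ORDER BY coeff_of_variance DESC
LIMIT 10;
```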

Next piece of content- "Installing Crunchy Postgres Operator v5 on EKS". This is from b-peng.blogspot.com, and they're talking about installing the Crunchy Data Postgres operator version five. This is for Kubernetes, and they're installing it in conjunction with Amazon's Elastic Kubernetes service, which is EKS. So if you're interested in the install process for that, you can definitely check out this blog post.

The next piece of content. The PostgreSQL person of the week is Claire Giordano. If you're interested in learning more about Claire and her contributions to Postgres, definitely check out this blog post.

The last piece of content, we did have another episode of The Rubber Duck Dev Show this past Wednesday. This one was on "Redundant Infrastructure on the Cheap". Basically, how can you put together redundant servers inexpensively, without necessarily relying on a platform as a service to deliver your solution? So if you're interested in that type of content, we welcome you to check out our show.
