
Postgres Releases, Postgres Distributed, Privilege Template, Real-Time Dashboards | Scaling Postgres 266

Join Over 1,000 Engineers & Get New Episodes Weekly!

In this episode of Scaling Postgres, we discuss new Postgres releases, EDB Postgres Distributed, a privilege template and real-time dashboards.

Content Discussed

YouTube Video

Podcast Audio

Transcript

All right. I hope you, your friends, family, and coworkers continued to do well. Our first piece of content is "PostgreSQL 15.3, 14.8, 13.11, 12.15, and 11.20 Released!". This is from postgresql.org. And as you can tell, there are new versions of Postgres. This addresses over 80 bugs and two specific security issues. The first security issue is a "CREATE SCHEMA... schema_element defeats protective search_path changes". And the second is "Row security policies disregard user ID changes after inlining".

They say the affected versions are 11 and higher, but these issues are rather old, and earlier, unsupported versions probably had them as well. So that's yet another reason to upgrade to a supported version. They also mention that PostgreSQL 11 is at EOL (end of life) at this point, so if you're still on 11, you definitely want to move up to a more recent version. But if you want to learn more about this and all the different fixes made in these releases, definitely check out this piece of content.

Next piece of content- "Defining Extreme High Availability in a Postgres World". This is from enterprisedb.com. And they're talking about something that just appeared on my radar today but that is EDB Postgres Distributed. Now, EDB, or Enterprise DB, has a lot of custom solutions built on top of, in conjunction with, Postgres and I definitely think more of them as enterprise-related solutions.

And because I tend to focus on just open source Postgres, I usually don't address all the different products they have available. But this one caught my eye because of what it's doing and what it's offering, and because it's kind of a different way to think about handling high availability with Postgres. Now, I've actually listed four different blog posts from EnterpriseDB, because there are references to previous articles; they will be listed up here and in the show notes.

But I will say, if anyone from EDB is watching, I haven't covered as much EDB content because, with some of the recent changes to the website, it's hard for me to discern what is new content and what is old content. So that's just some user feedback with regard to that. But the purpose of this post is discussing high availability in general and an ideal solution. And they show a setup here where you have a single Postgres listener, so it is decoupled from each instance. Basically, the application talks to the listener, and the listener determines which instance is the primary and which are the replicas.

If there's any sort of failover, the Postgres listener just follows along and knows which instance to write to, which one is the primary. And I really liked this concept and thinking about it this way, because essentially the Postgres cluster becomes an actual cluster, as opposed to the way the Postgres cluster terminology is normally used, where it really just refers to a single instance.

And you have a secondary interface to talk to whichever database is available. So they're talking about this theoretically, but they've actually built Postgres Distributed to handle this. And they have something they call the "Postgres Distributed Proxy", and that's essentially what the application talks to. And they've actually built a lot more intelligence into it: it's able to handle its own consensus store, I believe using the Raft protocol. So basically, etcd is no longer a requirement for building this type of solution.

And what I particularly found interesting, as discussed here, is that they're not using physical replication to keep primaries and replicas in sync; it's all using logical replication. And that has a lot of advantages, because physical, or binary, replication requires that the versions be very closely matched for each of the databases. Maybe you could have the replica a minor version or two ahead of the primary, but they definitely can't be on different major versions.

And there are other things as well, like locking a table in a way that will impact replicas when doing streaming replication. They're very tightly coupled, as it were. Whereas logical replication is very loosely coupled. So you can have logical replicas on entirely different database versions. You could have a version 13.1 and a version 15.1, and they're replicating between each other. So their distributed solution uses logical replication.
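
As a rough sketch of that loose coupling with plain open source Postgres (not their product specifically), a publisher on one major version can feed a subscriber on another; the connection string, database, and names here are placeholders:

```sql
-- On the publisher (e.g. a Postgres 13 instance): publish all tables
CREATE PUBLICATION app_pub FOR ALL TABLES;

-- On the subscriber (e.g. a Postgres 15 instance): pull changes from that publication
CREATE SUBSCRIPTION app_sub
    CONNECTION 'host=old-primary dbname=app user=replicator password=secret'
    PUBLICATION app_pub;
```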

Now, I'm not quite sure how they've gotten over some of the performance issues that I've seen with logical replication. Like, if I'm trying to do logical replication of a super large multi-terabyte table, it is super slow trying to get it in sync. So I don't know how they're necessarily dealing with that. Still, this concept of using logical replication is super interesting, and the second post goes into some of the advantages for database maintenance operations. So, for example, you can do an automatic failover or switchover rather easily, and their system is set up to do that using notifications, so there's no real concept of needing to promote. You can actually do REINDEX or VACUUM FULL operations on one of the nodes, and it will eventually catch up again, because it's using logical replication.

You're able to patch systems independently and do major version upgrades in place on particular instances within the cluster. So this definitely opens up a lot of different options, and actually, I've really wished an option like this were available in open source Postgres, because I find this the ideal path moving forward. I mean, there may be some downsides I'm not seeing, of course, but I highly encourage you to check out these four different blog posts from EDB to see what they're working on with Postgres Distributed.

Next piece of content- "PostgreSQL Basics: A Template for Managing Database Privileges". This is from red_gate.com. And they're talking about normally when you create a user and that user creates objects, it is the owner of those objects. And you can get into a situation where all these different users of the database have all these object ownerships and it makes it quite difficult. What if that person leaves? How do you transfer object ownership? That just becomes a big issue to deal with. But there's a way to set it up to make your privileges more consistent. And they actually have a template that they're following to do this.

So the first thing they do is set up specific group roles, so groups will own objects as opposed to individual users. They create a DDL group (ddl_grp), which is basically a data definition group for creating all of the data objects in the system. They create a data modification group (dml_grp), which handles inserts, updates, and deletes of the data. And then also a read-only group (read_only_grp), although if you're using a recent version of Postgres, you don't necessarily need this; you could use one of the built-in roles to do it.
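
As a minimal sketch, those three group roles could be created as non-login roles like this (the names follow the article's convention; the exact options in their template may differ):

```sql
CREATE ROLE ddl_grp NOLOGIN;        -- owns and creates schema objects
CREATE ROLE dml_grp NOLOGIN;        -- inserts, updates, and deletes data
CREATE ROLE read_only_grp NOLOGIN;  -- read-only access
```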

But with those roles in place, you then grant users to those roles. As you can see, dev_admin has rights to do DDL commands, while dev1 and dev2 can just do data modifications; they can't create any new objects. And then you assign privileges to each database, and they say you can do it database by database, or you can actually modify the template1 database so new databases get this setup. So the first thing they do is revoke ALL on the database of your choice from PUBLIC.
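
Roughly, attaching the users to the groups and locking down the database might look like this sketch (dev_admin, dev1, and dev2 are the article's example users; app_db is a placeholder database name):

```sql
GRANT ddl_grp TO dev_admin;   -- dev_admin can run DDL
GRANT dml_grp TO dev1, dev2;  -- dev1 and dev2 can only modify data

-- Nobody connects to the database unless a role they belong to is granted CONNECT
REVOKE ALL ON DATABASE app_db FROM PUBLIC;
```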

So basically, you prevent anyone from connecting to it unless they are explicitly granted the CONNECT privilege. And then they also revoke CREATE on the public schema from PUBLIC, which prevents anyone from creating objects in the public schema. Then, for the three groups that were created, the DDL, DML, and read-only groups, you grant CONNECT and TEMPORARY. So basically, because users are part of these groups, they can now connect to this particular database.
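
A sketch of those statements, again with app_db as a placeholder database name:

```sql
-- Stop everyone from creating objects in the public schema
REVOKE CREATE ON SCHEMA public FROM PUBLIC;

-- Let the three groups (and therefore their members) connect and use temporary tables
GRANT CONNECT, TEMPORARY ON DATABASE app_db TO ddl_grp, dml_grp, read_only_grp;
```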

And then here we get down to the individual grants on existing objects. So they grant USAGE and CREATE to the DDL group for that particular schema, and then they grant all privileges on TABLES and SEQUENCES to the DDL group as well. With regard to the DML group, as well as the read-only group, they grant USAGE on the schema, but not the CREATE option; that's only for the DDL group. So the DML group gets USAGE on the schema, as well as grants of SELECT on all the tables and the sequences. And you're probably going to want to grant INSERT, UPDATE, and DELETE to the DML group as well; I didn't see that here.
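
For existing objects, the grants might look roughly like this, assuming a schema named app (the INSERT, UPDATE, and DELETE grant for the DML group is my addition, since it appears to be missing from the article):

```sql
-- DDL group: can create objects in the schema and gets full rights on existing ones
GRANT USAGE, CREATE ON SCHEMA app TO ddl_grp;
GRANT ALL PRIVILEGES ON ALL TABLES IN SCHEMA app TO ddl_grp;
GRANT ALL PRIVILEGES ON ALL SEQUENCES IN SCHEMA app TO ddl_grp;

-- DML and read-only groups: can use the schema but not create objects in it
GRANT USAGE ON SCHEMA app TO dml_grp, read_only_grp;
GRANT SELECT, INSERT, UPDATE, DELETE ON ALL TABLES IN SCHEMA app TO dml_grp;
GRANT USAGE, SELECT ON ALL SEQUENCES IN SCHEMA app TO dml_grp;
GRANT SELECT ON ALL TABLES IN SCHEMA app TO read_only_grp;
```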

And those are for existing objects. For future objects, you do that by using the ALTER DEFAULT PRIVILEGES command. And they give two options here; I personally liked the second option. For the DDL group, because that's the one that's going to be creating all the objects in this particular schema, you say: grant these rights, SELECT, INSERT, UPDATE, DELETE, to the DML group; grant SELECT to the read-only group; and then grant the different sequence permissions as well (a sketch of this follows below). And with that, that's a basic template that gets your database set up in a secure fashion using groups.
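
A sketch of those default privileges, again with app as a placeholder schema; because the DDL group is the one creating the objects, the defaults are declared FOR ROLE ddl_grp:

```sql
ALTER DEFAULT PRIVILEGES FOR ROLE ddl_grp IN SCHEMA app
    GRANT SELECT, INSERT, UPDATE, DELETE ON TABLES TO dml_grp;
ALTER DEFAULT PRIVILEGES FOR ROLE ddl_grp IN SCHEMA app
    GRANT SELECT ON TABLES TO read_only_grp;
ALTER DEFAULT PRIVILEGES FOR ROLE ddl_grp IN SCHEMA app
    GRANT USAGE, SELECT ON SEQUENCES TO dml_grp;
```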

And users aren't going to be owning those objects. The only thing to keep in mind is that when you're running a database migration, you're always going to want to set the role to be the DDL group, because that's the group that needs to be able to create these objects, something like the sketch below. But if you're interested in learning more, check out this blog post.
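
A minimal sketch of that, with a purely illustrative table:

```sql
SET ROLE ddl_grp;  -- new objects are owned by the group, not the individual user
CREATE TABLE app.example (
    id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY
);
RESET ROLE;
```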

A second post related to that is from cybertec-postgresql.com, and that is "POSTGRESQL ALTER DEFAULT PRIVILEGES- PERMISSIONS EXPLAINED". So they go into more depth explaining how ALTER DEFAULT PRIVILEGES works. Again, it sets privileges for future objects. And they explain the different defaults and what it means to set things like FOR ROLE: this is the role that is going to be creating the objects. They also explain the IN SCHEMA clause, and what it means when it's present and when it's not. So if you want more clarity on that, you can definitely check out this blog post.

Next piece of content- "Builder Customer-Facing Real-Time Dashboards with Postgres". This is from crunchydata.com. And they are talking about exactly this. You want to track some sort of metrics analytics and have some sort of events pipeline. Collecting data and running reports off of it. Well, they show you how to do this with stock Postgres. Basically, they come up with a metric table here with the different columns that may be of interest to you. And they give a variant that shows different events you're maybe capturing instead. So a different type of schema.

They also give suggestions for what indexes to place on it. And then, you may not want to do this at first, but eventually, at some point, you may want to actually partition that metrics table, and they show an example of doing that. They partitioned it by the occurred_at column. So you'll have tables covering so many days or so many months, however you choose to partition it, that you can just delete when they're no longer needed. Oh, and one important thing they mention is that when you're building this type of analytics database, they actually suggest creating a separate database for this purpose, rather than using your existing transactional database.
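
A rough sketch of such a partitioned metrics table; the columns other than occurred_at are illustrative and not necessarily the ones in the post:

```sql
CREATE TABLE metrics (
    customer_id bigint      NOT NULL,
    name        text        NOT NULL,
    value       numeric,
    occurred_at timestamptz NOT NULL
) PARTITION BY RANGE (occurred_at);

-- One partition per month; old months can simply be dropped when no longer needed
CREATE TABLE metrics_2023_05 PARTITION OF metrics
    FOR VALUES FROM ('2023-05-01') TO ('2023-06-01');

CREATE INDEX ON metrics (customer_id, occurred_at);
```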

And then they give some metrics down here about the pipeline they created and how it's working for them. For example, a single day of metrics, I believe for a specific customer, returns in 1.5 milliseconds; querying 10 days' worth of metrics for a cluster returns in 30 to 50 milliseconds; and a full 30-day view of metrics is around a hundred milliseconds. So maybe that can give you some guidance on how this would work for you as well. So check out this blog post if you're interested.

Next piece of content- "Practical AI with Postgres". This is from crunchydata.com as well. And they've basically talked about how you can use ChatGPT to do Postgres development work. So for example, he gave an example where he asks ChatGPT: "Can you help me generate a schema for a multi-tenant CRM map in Postgres?" So that created a tenant's table. User table accounts, contacts, deals, and apparently, a few more tables as well. But they give each of the columns and the foreign keys, it didn't list indexes here. But now this is not perfect. Everyone will probably want to change a little something about it.

He makes some comments about what he would change here, but then he says the great thing about ChatGPT is that you can then ask it, quote, "Can you help me generate some sample records for the above schema?". And boom, it does it for you. And he said at first it only gave him 2 tenants, but then he asked for 10 tenants and it gave him a lot more. Then they cover indexing, and I'm assuming this wasn't a great response, because it returned a lot of indexes. What I tend to do is start with primary keys and foreign keys, and then see where you need to go from there.

And then reporting: say you want to run reports, you can actually ask ChatGPT for suggestions on how to do that. You can even just ask questions like "How do I round a date to the nearest date in Postgres?", and it correctly identifies the date_trunc function (a quick example follows below). So these are just different ways that, when doing your Postgres development work or learning Postgres, you can use ChatGPT to help you along the way.
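
For reference, date_trunc rounds a timestamp down to the unit you ask for:

```sql
SELECT date_trunc('day', now());   -- e.g. 2023-05-21 00:00:00+00
```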

Next piece of content- "Limitations in Postgres Index Definition". This is from ongres.com. And they're talking about different ways that indexes can't be used. So the first way they're talking about is that if you're trying to do an expression index, so this is an index that is based upon a function, that function needs to be immutable. So if it's not immutable, you're not going to be able to create an index on it.

The second one they talk about is system columns. These are the columns that are normally not shown, but they help manage the row within Postgres, and you can't create indexes on them. The third is system catalogs: you can't create your own indexes on system catalogs; Postgres won't let you do that. And lastly, foreign tables: you can't really create new indexes on foreign tables. But if you want to learn more, check out this blog post.

Next piece of content- "Working with Time in Postgres". This is from crunchydata.com. And they're talking about different time functions and how you typically store time, how you can store the time zone as well, and convert between different time zones. And then they go into all sorts of different functions to return and modify time as well as talking about intervals and time ranges. So if you want to learn more about time in Postgres, check out this blog post.

Next piece of content- "Overview of ICU collation settings". This is from peter.eisentraut.org. And this is a very comprehensive post about all the different settings you can change when you use ICU collation. All the different ways you can customize it. So as you can see, there are a ton of settings for collation. So if you want to learn more about that definitely check out this blog post.

Next piece of content- "MapScaping Podcast: Rasters and PostGIS". This is from blog.cleverelephant.ca. This is a very quick post, but he's referencing a podcast "Rasters In A Database?", where they're discussing the issue of how much data for a graphical information system should you be storing in the database.

And basically, a lot of times "put it in the database" is the right answer, particularly, he says, if it's "replacing a pile of CSV files...". But if that data is a collection of GeoTIFF images, having those files in the database will, as he says, quote, "be slower, will take up more space, and be very annoying to manage". So it's probably best to store pointers in the database that reference those files. But check out this blog post if you're interested in that.

Next piece of content- "Experimental load balance setup based on shared storage using pgpool and neon serverless". This is from highgo.ca. And this is a post about doing exactly what it says, using PGpool in neon compute nodes where neon separates compute from storage. So basically, they're using shared storage to be able to have multiple instances of Postgres. So if you want to learn more about how to do that, definitely check out this blog post.

Next piece of content- "What's new in Citus 11.3 & Postgres for multi-tenant SaaS workloads''. This is from citusdata.com. They actually have three posts talking about the Citus 11.3 release. Again, this is the Microsoft project where they are doing scale-out PostgreSQL. So if this product is of interest to you you could check out this post. Their second post is on Django-multitenant, which is a tool to build scalable SaaS apps and the third is on Postgres and Citus or tenant monitoring with Citus and Postgres with citus_stat_ tenants. So these are all enhancements that have been added to the most recent version of Citus.

Next piece of content. There was another episode of Postgres FM last week. This one was on "ChatGPT x PostgreSQL". Much like the blog post we looked at earlier in the show, this one talks about whether you should use it, and if so, what for, and some things to be mindful of. So you can listen to the podcast here or look at the video here.

Next piece of content. The PostgreSQL person of the week is Ryan Booz. If you're interested in learning more about Ryan and his contributions to Postgres, definitely check out this piece of content.

And the last piece of content, we did have another episode of the Rubber Duck Dev Show this past Thursday afternoon. This one was on "Leveling Up For Juniors With CodeWithJulie". So we discussed when you're a junior developer and you're in the process of leveling up your skills, what are some things to be mindful of. So if you're interested in that type of content, we welcome you to check out our show.
