
Postgres 16 RC1, Bi-Directional Replication, All About Parameters, Foreign Keys & Polymorphism | Scaling Postgres 280


In this episode of Scaling Postgres, we discuss the release of Postgres 16 RC1, implementing bi-directional replication, different ways you can set up Postgres parameters and how to handle polymorphism and foreign keys.


Transcript

All right. I hope you, your friends, family, and coworkers continue to do well. Well, I'm starting to plan out how to potentially do interviews, but I would like to get your feedback on who you would like me to interview. Would you be interested in talking with developers who use Postgres on a regular basis? Would you be interested in talking with people who actually commit to Postgres? Or representatives from a company to learn more about a technology? Go ahead and let me know in the comments or if you're a part of the email list, you can simply reply to my weekly email. Let me know who you think would be good to interview.

But for this week the first piece of content is "PostgreSQL 16 RC1 Released!". This is from postgresql.org. And there's only one change from Beta 3: a fix for a performance regression in COPY. So it pretty much should be unchanged from Postgres 16 Beta 3. And it looks like, if there are no issues, it could be made generally available as soon as September 14th. So that's only like two weeks away. That's great news! But if you're eager to get your hands on Postgres 16, go ahead and download the release candidate and give it a try.

Next piece of content- "Bi-directional replication using origin filtering in PostgreSQL". This is from postgresql.fastware.com. And this is something that was mentioned last week in Scaling Postgres by a cybertec-postgresql.com blog post. They were talking about their favorite things from Postgres 16 and they were talking about something leading to multi-master replication. And I haven't really heard that much about it but this post actually goes into it. So basically when you set up logical replication, you set up a publisher, the subscriber, and apart from setting up certain filters on the columns and the rows, all that data just gets transferred.

But there's a new feature where they're tracking origin, which is to say 'Did this data originate on the local system or did it come in via logical replication from another node?'. And now when you create a subscription, you can specify whether you want that origin to be ANY, which just copies all the data and is the current default. Or you can specify NONE, which means quote "...the subscription will request the publisher to only send changes that don't have an origin". So these are changes that originated on the publisher system and did not come from anywhere else. And they have some diagrams to explain that better and actually show you the syntax of how you could set this up. But where it gets interesting is that you could set up multi-master replication doing this. So you have a Node 1 primary and a Node 2 primary. On Node 1, you set up a publication, and you also set up a subscription that subscribes to the publication on Node 2.

And then on Node 2, you set up a publication and you set up a subscription there to subscribe to the changes from Node 1. And you set the origin to NONE for both of these. So basically, the only changes that get transferred from Node 1 to Node 2 are those that originated on Node 1. And he has the syntax necessary to set up a structure like this. So you create publications on each of the nodes and then you set up the subscriptions, again specifying origin equals NONE. Then he shows how you can insert data into one node and it won't create an infinite loop replicating itself. And he shows how you can even add a third primary node. But because this is so new, there are definitely limitations. The first is that adding a new primary node that has existing table data is not supported.

So my assumption is it has to do the full copy from whatever it's replicating from to collect all the data. The second limitation is that with copy_data=true and origin=NONE, the initial data is copied directly from the publisher, so it's not possible to know the true origin of that data. He follows that up with cautions and notes, covering conflict issues, certain restrictions with regard to this, and how to handle backups and restores. So it's definitely early days, but it is very interesting that Postgres core could eventually support a full-featured multi-master setup. These are the first stages of getting that set up. So if you're interested in that, definitely check out this blog post.
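
To make that more concrete, here's a rough sketch of what the two-node, origin-filtered setup could look like on Postgres 16. The table, database, and connection details are my own assumptions for illustration, not taken from the post.

```sql
-- On Node 1: create the table and publish it.
CREATE TABLE items (id bigint PRIMARY KEY, data text);
CREATE PUBLICATION pub_node1 FOR TABLE items;

-- On Node 2: same table and its own publication.
CREATE TABLE items (id bigint PRIMARY KEY, data text);
CREATE PUBLICATION pub_node2 FOR TABLE items;

-- On Node 1: subscribe to Node 2, only taking changes that
-- originated there (origin = none filters out forwarded rows).
CREATE SUBSCRIPTION sub_node1_from_node2
    CONNECTION 'host=node2 dbname=appdb user=repuser'
    PUBLICATION pub_node2
    WITH (copy_data = false, origin = none);

-- On Node 2: subscribe to Node 1 the same way.
CREATE SUBSCRIPTION sub_node2_from_node1
    CONNECTION 'host=node1 dbname=appdb user=repuser'
    PUBLICATION pub_node1
    WITH (copy_data = false, origin = none);
```

Using copy_data = false here sidesteps the initial-copy caveat mentioned above, assuming both tables start out empty.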

The next piece of content- "PostgreSQL Parameters: Scope and Priority Users Should Know". This is from percona.com. And they're talking about all the different parameter settings that you can make in Postgres and at the level that you can make it. He actually goes through 12 different levels which I definitely haven't used that many. But let's go ahead and see what they've listed here. So the first has compiled time parameter settings. So at the point at which you are compiling, you can make certain changes. I've never used that because I perceive them as more low-level controls. The second is data directory/initialization-specific parameter settings. So these are the things when you're creating your Postgres cluster that you can change. One they mentioned here is data_checksums.

You can also set the encoding and locale settings at the same time. So I definitely do this when I'm setting up clusters. Third is PostgreSQL parameters set by environment variables. So when you're starting Postgres, you can actually set environment variables like PGDATA. I really haven't used this a lot, if at all. Fourth is the configuration file. So this is the one, of course, I use the most, the postgresql.conf file. Although, as they talk about here, I generally like to include separate files and directories to do the configuration and override what's there. There's also the postgresql.auto.conf file, which is set by the ALTER SYSTEM command. You generally don't want to change that manually; just use the ALTER SYSTEM command to make changes there. Fifth is command line arguments to the postmaster.

So this is on the command line, where you can pass in some of these configuration variables. I haven't had the need to do this, but they say that various utilities take advantage of it. For example, quote "...Patroni passes some of the most critical parameters as a command line argument". So you can do it this way. Sixth is database-level settings. So this is altering the settings of a specific database. Here they're altering the database and setting max_parallel_workers_per_gather to four. So you can have that set per database. And then they talk about context. What that means is that some settings can't be set at the database level, the user level, or the session level. They can only be made for the whole cluster.

An example of that is max_connections; that only makes sense at the scope of the full cluster, as do things like TCP/IP settings. Seventh is user-level settings. So for a user, you can set specific parameters. Eighth is a database-user combination. So you could say that a particular user, when using a particular database, gets these particular configuration parameters. Ninth is parameters set by the client connection request. So when you make a connection, you can specify some parameters in the client that you're using. Tenth is session-level settings. So in your console, when you're connected via psql, you can just SET a certain configuration variable. Eleventh is transaction-level settings. So within a transaction, you can use SET LOCAL to set particular configuration settings.

And twelfth is object-level settings, which you can actually do as part of a function. So you can specify what configuration variables should be set for a particular function when it's run. There are a lot of different places to set these, but the main place to go to understand where something is set is definitely the pg_settings system view. You can see here where they're using pg_settings to identify many of the different configuration variables, and it has a source column telling you where the variable was set, as well as, if it came from a file, which file and what line number. It's very useful. But if you want to learn more, check out this blog post.
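
As a quick illustration of several of these levels, here's a sketch with assumed database, role, and function names; the parameter choices are just examples, apart from the max_parallel_workers_per_gather one mentioned above.

```sql
-- Cluster-wide; this writes to postgresql.auto.conf.
ALTER SYSTEM SET work_mem = '64MB';
SELECT pg_reload_conf();

-- Database level (the post's example).
ALTER DATABASE appdb SET max_parallel_workers_per_gather = 4;

-- User level, and a database-user combination.
ALTER ROLE reporting_user SET statement_timeout = '5min';
ALTER ROLE reporting_user IN DATABASE appdb SET work_mem = '256MB';

-- Session and transaction level.
SET work_mem = '128MB';
BEGIN;
SET LOCAL work_mem = '512MB';
COMMIT;

-- Object (function) level.
ALTER FUNCTION refresh_reports() SET work_mem = '256MB';

-- And to see where a setting currently comes from:
SELECT name, setting, source, sourcefile, sourceline
FROM pg_settings
WHERE name = 'work_mem';
```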

Next piece of content- "CONDITIONAL FOREIGN KEYS AND POLYMORPHISM IN SQL: 4 METHODS". This is from cybertec-postgresql.com. And they're talking about polymorphism, where you have, call it a parent class, of a vehicle and then from the vehicle, you have a child class of motor vehicle. But that's also a parent to a subclass, this is also a car. A bicycle has a parent vehicle but not a motor vehicle. And when you want to reference that in the code, so you have a customer here that has a contract, you can just specify the vehicle. All of these different types of objects are vehicles because that is the ultimate parent class. Now the problem with this is if you want to implement foreign keys. Because this relationship doesn't really work with foreign keys.

As they show down here, you have a contract table and you have the vehicle ID, but what should it reference? Should it reference the bicycle table or the car table? Those are the ones that actually contain data; they don't actually have a vehicle table. Well, this blog post gives you some solutions for how to deal with this, because you cannot directly write a conditional foreign key. That would mean something like: if there is a vehicle type column and the type is car, then this references the car table, or if the type is bicycle, then it references the bicycle table. That feature doesn't really exist. So how do you handle it?

The first method they propose is modeling conditional foreign keys with multiple foreign keys. Basically, your contract table would have a bicycle ID column referencing the bicycle table and a car ID column referencing the car table. Of course, if you have many more tables, you're going to need many more columns to deal with that. And you'll have to set constraints to make sure that at least one is not NULL, which they're doing here. The second method of dealing with it is storing all the referenced data in a single table. So basically, you just have a vehicle table and it has all the different columns that you need.

Now, if you still want to reference the car table and the bicycle table, you can do it through views, which they're showing here. And of course your contract table then basically just has a vehicle ID referencing that common vehicle table up here. Method three is still using a single table, but all of those additional columns that are specific to cars or bicycles go into a JSONB field. So that gives you fewer columns, but the SQL isn't as clean as having individual columns, and it's also harder to maintain data constraints for each column. The fourth method they propose is pointing the foreign key in the other direction and "faking it". So you actually set a reference to contract from the car table or the bicycle table.

But as they say here, that's not really right. And then lastly, they do mention that PostgreSQL does have table inheritance, and you could try implementing this with it, but it doesn't really work. Of course, an option they didn't mention here is just not to use foreign keys when you have a polymorphic relationship like this. Or, if you really need it, maybe you fake it with triggers or constraints to maintain that referential integrity. For example, Ruby on Rails has the ability to support polymorphic relationships, and it maintains an ID and a type column for your models or classes that are set up that way. But of course, when I'm using that feature, I'm not using foreign key relationships with it. But if you want to learn more, definitely check out this blog post.
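
As a minimal sketch of that first method, multiple foreign keys plus a constraint, with table and column names assumed rather than copied from the post (I've made the check require exactly one reference, a slightly stricter variant of the at-least-one constraint described above):

```sql
CREATE TABLE car (
    id            bigint PRIMARY KEY,
    license_plate text NOT NULL
);

CREATE TABLE bicycle (
    id           bigint PRIMARY KEY,
    frame_number text NOT NULL
);

CREATE TABLE contract (
    id          bigint PRIMARY KEY,
    customer_id bigint NOT NULL,
    car_id      bigint REFERENCES car (id),
    bicycle_id  bigint REFERENCES bicycle (id),
    -- the contract must point at exactly one kind of vehicle
    CHECK (num_nonnulls(car_id, bicycle_id) = 1)
);
```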

Next piece of content- "Open Source Bait and Switch: Licensing and Beyond". This is from momjian.us. And he's talking about an issue where companies start off building tools that are fully open source. And then as the company has success and the tools that they built have success, eventually they start changing their licensing model to a less open-source one. This post is actually a response to something written by Peter at percona.com where he's talking about "Open Source Bait and Switch: Licensing and Beyond". He's talking about a company called Akka and he had a lot of issues while transitioning from an Apache 2.0 license to a BSL license.

With the BSL, basically the source code is available but there are a lot of restrictions on how you can use it. And he talks a lot about this issue. Now, related to this, Peter Eisentraut released a blog post called "All kinds of licenses" talking about various projects and the different licenses that are available. And he references the change that HashiCorp recently made, where they're moving from a more open-source license to the BSL as well. So this is definitely a resource to look at some of these. And then this article from thenewstack.io talks about HashiCorp's licensing change as well. Linked to this is a discussion where members of the community are actually forking Terraform, which I think is their primary tool or product, to make it independent of the company.

So here's a link to that- opentf.org, where they're going to be forking Terraform and maintaining its open-source nature, as opposed to following HashiCorp, which has changed to a business source license. Now you might be wondering what all this has to do with Postgres. The reason I find this interesting is that Postgres is open source at this time, and it's maintained by a community. But I think we always have to be aware of these kinds of licensing changes, which happen particularly at companies, and be aware that this could happen as well if those companies decide to change their licensing paradigm. But I thought this was an interesting development and wanted to share it with you.

Next piece of content- "Scaling Vector Data with Postgres". This is from crunchydata.com. And he has a number of recommendations on how to start working with larger datasets of vectors or AI work. His first recommendation is a physical separation of the data. So his recommendation is if you have an. Application databases actually create a separate database to store all of your vector data. Because you are unsure how much you'll be using or what the load is going to be. It's going to be a very dynamic changing environment so go ahead and keep that in a separate database. He says if you really must access it from your application, you can always use a foreign data wrapper or just a separate connection to access that data.

Second is to definitely consider performance before you scale up your hardware. So take the time to tune queries and to get indexes right. pgvector, the extension, just came out with an entirely new index type and different ways you can set it up. Thirdly, he talks about categorizing and separating data appropriately. They even have an example through their Postgres AI tutorial that you can walk through. It talks about best practices in terms of separating and categorizing this data. The next area he covers is caching, because once you have a particular result, you can go ahead and cache it and just redeliver it, as opposed to having to generate it anew each time. There are two different ways to do caching.

One is pre-caching, which is caching at storage time, and the other is post-query caching, which is caching after running a query for a customer. And he talks about different ways to handle that. The last recommendation is dimensionality reduction. So for example, OpenAI has vector embeddings with around 1,500 dimensions. But if you're able to reduce that dimensionality, you can actually get much better performance. And he goes through the process to do that. So if you're using AI and you're starting to scale up your use of vectors, I definitely encourage you to check out this blog post.

Related to that, the next piece of content is "PGVECTOR 0.5.0 FEATURE HIGHLIGHTS AND HOWTOS". This is from jkatz05.com. And he's talking about the pgvector 0.5.0 release, and he primarily covers the new index type, HNSW, which is Hierarchical Navigable Small Worlds. We talked about this in a previous episode of Scaling Postgres, but one thing I'll emphasize: in that discussion, ivfflat indexes were generally faster but had lower recall. But when you're talking about OpenAI-style embedding data, the new HNSW index actually has much better performance than ivfflat and seemingly higher recall.

Although, he said, it does take a while to build. So definitely, I think that's the main feature of this release. He talks about the index-building options that are available, as well as how to use the new index in pgvector. The other thing he mentions is that this release also has improved performance for distance functions. These are calculations like Euclidean distance or cosine distance. Euclidean had a rather dramatic increase in performance, maybe one-third faster, whereas the cosine improvement was a little negligible; it was an improvement, but not by much. Also, someone said they saw greater performance on ARM64 systems. And the last big feature he mentions is the parallelization of ivfflat index builds. He shows you how to configure and do that. So definitely check out this blog post if you want to learn more.
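
For reference, here's what using the new HNSW index might look like; the table, the tiny three-dimension vector, and the option values are illustrative assumptions, not tuned recommendations from the post.

```sql
-- Requires pgvector 0.5.0 or later.
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE documents (
    id        bigint PRIMARY KEY,
    embedding vector(3)  -- tiny dimension just to keep the example short
);

-- HNSW index using cosine distance; m and ef_construction are the
-- build-time options discussed in the post.
CREATE INDEX ON documents
    USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 64);

-- Query-time speed/recall trade-off.
SET hnsw.ef_search = 100;

SELECT id
FROM documents
ORDER BY embedding <=> '[0.1, 0.2, 0.3]'
LIMIT 10;
```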

Next piece of content- "Data Redundancy With the PostgreSQL Citus Extension''. This is from percona.com. And Citus is known for scaling out Postgres. So you could shard your data and place sharded tables in specific nodes to scale out your database and give it better performance. But he says you could also turn it around and give greater redundancy by doing this as well. So he says, quote "... one can tune a hardware RAID for either performance or redundancy. The same can be said for Citus data sharding".

So here's an example where he set up four Citus nodes and maintained two copies of each sharded table across the nodes. So if you lost any one node, you would still have all your data. You can think of it like a four-disk RAID 10 array; essentially your entire database system becomes redundant. And he actually goes through the code to set this up in LXD containers as a proof of concept. I find it super interesting that you can do that, and if you're interested, definitely check out this blog post.
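
I haven't reproduced his exact LXD setup, but conceptually the sharding-with-copies part might look something like this on the coordinator, with the table, shard count, and worker nodes being assumptions on my part.

```sql
-- Assumes the citus extension is installed and the worker
-- nodes have already been added to the coordinator.
CREATE EXTENSION IF NOT EXISTS citus;

-- Keep two copies of every shard so losing one worker
-- doesn't lose data.
SET citus.shard_replication_factor = 2;
SET citus.shard_count = 8;

CREATE TABLE events (
    device_id   bigint,
    recorded_at timestamptz,
    payload     jsonb
);

SELECT create_distributed_table('events', 'device_id');
```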

The next piece of content, there was another episode of Postgres FM last week, and this one was on "Decoupled storage and compute". So they're talking about products like RDS Aurora, Google Cloud AlloyDB, and Neon. These products separate the storage layer from the compute layer, and they discuss whether this is something Postgres should develop the capability for. And I actually always thought this would be cool to do. For example, I have some clients that have multi-terabyte databases, and the issue when you want to scale up and down is that you generally have to duplicate your storage.

So for every new replica you want to add, you have to add that whole storage solution. It would be great to have one storage layer and then be able to increase or decrease the number of compute nodes needed at a given point in time. Now, your storage system would also have to be able to dynamically handle increased and decreased load with regard to that. But I think that would be an interesting capability to have, and clearly other products are building this capability, like Aurora, AlloyDB, Neon, etc. But if you want to learn more about that, you can listen to the episode here or watch the YouTube video down here.

Last piece of content- "Top 10 Postgres Management Tasks". This is from crunchydata.com. The first one is to add a statement timeout. So they give an example of you altering your database and you set the statement timeout to 60 seconds so any statement above that will time out. This protects your database from long-running queries. So it's a great thing to have in place. The second is to confirm you have enough memory and you do that by checking your cache hit ratio. If it's not 98-99%, you should generally consider moving up to a bigger instance size or increasing the memory. But they do note here that for warehouse analytical workloads, you're probably going to have a lower cache hit ratio just because you have so much data for those types of workloads.

Number three is to check your shared buffers. The recommendation they have here is to set them between 15 and 25% of your machine's total RAM. The fourth is to use SSL/TLS for data in transit. Five is setting up backups. Six is to stay on top of Postgres releases and upgrade frequently. Seven is implementing pg_stat_statements to track your queries. Eight is adding indexes, and nine is checking for unused indexes. Then ten is to review your connection settings: how many connections are running against the database, and are you running out of them? But check out this blog post if you want to learn more details.
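
For a couple of the later items, the kinds of queries you might use look like this; these are common approaches, not necessarily the ones shown in the post.

```sql
-- Top queries by total execution time (requires pg_stat_statements).
SELECT query, calls, total_exec_time, mean_exec_time
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;

-- Indexes that have never been scanned: candidates for removal,
-- after checking replicas and any uniqueness requirements.
SELECT schemaname, relname, indexrelname,
       pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
WHERE idx_scan = 0
ORDER BY pg_relation_size(indexrelid) DESC;
```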
