
Partitioning, Logical Replication Upgrade, Columnar Compression, HAProxy Connections | Scaling Postgres 88


In this episode of Scaling Postgres, we discuss partitioning, logical replication upgrades, columnar compression and HAProxy connections.


Transcript



[00:00:00] In this episode of Scaling Postgres, we talk about partitioning, logical replication upgrades, columnar compression and HAProxy connections. I'm Kristen Jameson, and this is Scaling Postgres episode 88.

[00:00:22] Alright, I hope you're having a great week. The first piece of content is "Webinar: PostgreSQL Partitioning - Follow Up", and this is from 2ndquadrant.com. This is a webinar about partitioning with PostgreSQL, particularly covering a number of the new features that came with version 12. Most of them, I would say, relate to performance, but some others add more foreign key capabilities and other things that enhance referential integrity. This was a great piece of content. Now, it's a little long: the webinar replay runs over an hour and a half, I believe, with some questions at the end, but it is filled with a lot of content. It explains what partitioning is, how you can set it up, things that work well, and some things that don't quite work yet. For example, REINDEX doesn't apply to partitioned tables; you can run it on individual partitions, but not on the partitioned table itself. Similarly, creating an index concurrently doesn't work on partitioned tables, so you would have to do that on each partition manually. This presentation just has a lot of great information, and I definitely encourage you to check it out. You can click the link here to view the complete recording; you do have to fill out a contact form, but then you're able to watch it immediately.
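To make that concrete, here is a minimal sketch (mine, not from the webinar) of declarative range partitioning in PostgreSQL 12, using a hypothetical measurements table, along with the two limitations mentioned above:

```sql
-- Hypothetical example: declarative range partitioning (PostgreSQL 12 syntax).
CREATE TABLE measurements (
    recorded_at timestamptz NOT NULL,
    device_id   int         NOT NULL,
    temperature numeric
) PARTITION BY RANGE (recorded_at);

CREATE TABLE measurements_2019_11 PARTITION OF measurements
    FOR VALUES FROM ('2019-11-01') TO ('2019-12-01');
CREATE TABLE measurements_2019_12 PARTITION OF measurements
    FOR VALUES FROM ('2019-12-01') TO ('2020-01-01');

-- Creating an index on the partitioned table cascades to each partition,
-- but it cannot be done CONCURRENTLY on the parent in PostgreSQL 12.
CREATE INDEX ON measurements (device_id, recorded_at);

-- REINDEX works per partition, not on the partitioned table itself (as of v12).
REINDEX TABLE measurements_2019_11;
```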

[00:01:47] The next post is "Upgrading Postgres major versions using logical replication", and this is from cybertec-postgresql.com. They had a previous post talking about the different ways to do version upgrades with PostgreSQL; this one uses logical replication. The first thing it discusses is the benefits: minimal downtime is required, because once you have the new system set up and replicating, you can choose to switch over at any point you want; it's flexible, so you can target just the data you want to replicate, since this is logical replication rather than binary, block-level replication; it's safe, in that you can always go back to the prior database if you need to; and it gives you a lot of opportunities to test and make sure that things are okay. The cons: there are quite a few steps to set it up, this way of upgrading is always per database as opposed to per cluster, it could take a long time for big databases that are highly active, and large objects need to be exported and imported manually. Then it goes through the process. Basically: turn on logical replication on the old primary, make sure all the tables have a primary key or, if not, set REPLICA IDENTITY, and it has a little query so you can identify any tables that don't meet that criterion. Then set up the new database and export the schema as well as the roles; one way to do that is pg_dumpall with globals only plus a pg_dump of the specific database with schema only. During this type of upgrade you ideally want to keep your schema in a static state. Then you create a publication on the old DB and a subscription on the target database, and they include a warning: just be aware that this creates a replication slot on the old DB, so if any connection breaks down and replication stops for any reason while the primary keeps running, WAL files can start to accumulate. Keep that in mind. Then check the replication process: make sure there are no errors, the replication slot is active, and all tables are actively replicating on the subscriber side; they mention that pg_subscription_rel.srsubstate should equal 'r' for all tables. Then go through your own verification process, and then it's switchover time. Now, in their particular case they shut down the old instance and start the old DB in read-only mode. I don't think that's necessarily a requirement, but it's definitely the safer way to make sure that you have all the data, because you could choose to start using the new database at whatever point you want. You'll have to work out that switchover process yourself: do you want to have downtime, and how can you make it as short as possible? Once you're done, you just drop the subscription on the new DB. So this blog post walks through the process of upgrading your database using logical replication; if you want to explore that possibility, it's definitely one to check out.
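As a rough sketch of those steps (the database, publication and connection names are hypothetical, and the no-primary-key query is my own approximation rather than the one from the post), the SQL side looks something like this:

```sql
-- On the old primary (source), after setting wal_level = logical:

-- Find ordinary tables without a primary key; these need REPLICA IDENTITY set.
SELECT n.nspname, c.relname
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE c.relkind = 'r'
  AND n.nspname NOT IN ('pg_catalog', 'information_schema')
  AND NOT EXISTS (
      SELECT 1 FROM pg_constraint
      WHERE conrelid = c.oid AND contype = 'p');

-- Copy roles and schema to the new cluster first, e.g. with:
--   pg_dumpall --globals-only > roles.sql
--   pg_dump --schema-only mydb > schema.sql

-- On the old database: publish all tables.
CREATE PUBLICATION upgrade_pub FOR ALL TABLES;

-- On the new (target) database: subscribe. Note this creates a replication
-- slot on the old DB, so WAL can accumulate there if replication stalls
-- while the primary keeps running.
CREATE SUBSCRIPTION upgrade_sub
    CONNECTION 'host=old-primary dbname=mydb user=replicator'
    PUBLICATION upgrade_pub;

-- Verify on the subscriber that every table has finished the initial copy
-- and is streaming ('r' = ready).
SELECT srrelid::regclass, srsubstate FROM pg_subscription_rel;

-- At switchover time, once the application points at the new DB:
DROP SUBSCRIPTION upgrade_sub;
```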

[00:05:09] The next post is "Building columnar compression in a row-oriented database", and this is from the timescale.com blog. Basically, they've added native compression to TimescaleDB, and their technique has resulted in 90% to 96% compression, which is pretty dramatic. Now, Postgres is of course a row-oriented relational database system; it doesn't store data in a columnar fashion, so I was interested in what they did here. Even if you're not interested in TimescaleDB, I particularly liked looking at the technique they describe, so I'll scroll down to some of the more technical details. You typically have row-oriented data: a timestamp, a device, a status code and a temperature for each row. What they actually do is clump batches of rows together and store them in an array format, so the data is effectively stored by column: all the timestamps in a batch become one array, as do all the device IDs and status codes, and the position within each array identifies which original row a value belongs to. By doing it this way, they can apply type-specific compression to each column to get dramatically higher compression ratios, for example one method for timestamps, another for integers, another for floats, and simply reducing the number of rows also reduces the per-row storage overhead. They also use other techniques, such as storing min and max ranges, for example the min and max timestamp of each batch, so queries can skip whole batches quickly, and much of this data ends up TOASTed and compressed. They also show some performance benchmarks: compression did very well, and in a lot of cases queries even performed better than against the uncompressed data. I found this blog post really interesting and fascinating. We've covered other posts that talk about using ZFS or other compressed file systems to reduce the size of the database and, in some cases, boost performance, but this is another very interesting method, actually converting the data into a more column-based format. So if you're interested in learning more about that, definitely a blog post to check out.
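This is not TimescaleDB's actual internal implementation, just a plain-SQL sketch of the batching-into-arrays idea described above, with made-up table names:

```sql
-- Row-oriented source data.
CREATE TABLE readings (
    ts          timestamptz NOT NULL,
    device_id   int         NOT NULL,
    status_code int,
    temperature float8
);

-- A "columnar" archive: each row holds a batch of values per column,
-- plus the min/max timestamp so queries can skip whole batches.
-- Element i of each array corresponds to the same original row.
CREATE TABLE readings_compressed (
    ts_min      timestamptz,
    ts_max      timestamptz,
    ts          timestamptz[],
    device_id   int[],
    status_code int[],
    temperature float8[]
);

-- Fold a batch of rows into one archive row (a real implementation would
-- batch, e.g. 1000 rows at a time); the large arrays end up TOASTed and
-- compressed by Postgres.
INSERT INTO readings_compressed
SELECT min(ts), max(ts),
       array_agg(ts          ORDER BY ts),
       array_agg(device_id   ORDER BY ts),
       array_agg(status_code ORDER BY ts),
       array_agg(temperature ORDER BY ts)
FROM readings;
```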

[00:08:07] The next post is "PostgreSQL Application Connection Failover Using HAProxy with xinetd", and this is a follow-on from their post on simple application failover using libpq features. They're talking about connecting through HAProxy, which then routes you to the read-write primary or to read-only replicas, and you can use HAProxy's load balancing features. It's used in conjunction with xinetd, the extended Internet services daemon, which can listen for requests on custom ports and respond by executing custom logic. In this case they have a bash script that gets called to check the status of a database instance: it runs a query to see whether the currently running database is in recovery. If it's in recovery, the script returns a 206 status code, meaning it's a standby; a 200 means it's the primary; and any other result returns a 503 Service Unavailable. This check is what lets HAProxy route connections to the proper location, and they show how to configure HAProxy to use this xinetd service. So if you're interested in using HAProxy as a means to manage connections between a primary and its replicas, and potentially do failover, definitely a blog post to check out.
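The heart of that health check is simply a query for the server's recovery state; a minimal version (not the exact script from the post) looks like this, with the HTTP status mapping handled by the surrounding xinetd script:

```sql
-- Run by the health-check script (e.g. via psql -tAc) on each node.
-- pg_is_in_recovery() returns:
--   false -> the node is a read-write primary  -> script answers HTTP 200
--   true  -> the node is a read-only standby   -> script answers HTTP 206
-- If the query fails entirely, the script answers HTTP 503 (unavailable),
-- and HAProxy drops that node from the relevant backend.
SELECT pg_is_in_recovery();
```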

[00:09:36] The next post is "Virtual, computed columns in PostgreSQL 12". So again, this is talking about the new generated columns feature, and this is another developer's take on it. Here he's primarily interested in using it for text search through JSON. He takes a JSON blob he received, extracts, say, the title into a generated column, and then goes a bit further and converts it into a tsvector. He mentions some things to be aware of: you need to specify the text search configuration explicitly, like 'english', because the generation expression has to be immutable, and the default language setting can change, so a function that relies on it isn't immutable. Even a function like concat() doesn't work with GENERATED ALWAYS; you have to use the concatenation operator, two pipes (||), to join text together if, as in his case, you want to search on both the name and the title. So it's a relatively simple post about generated columns for a text search use case. Related to that, he has a second post called "Fine Tuning Full Text Search with PostgreSQL 12", which goes more in depth. He continues to use generated columns, but he gets deeper into actually doing searches: he has a scenario where three different individuals want to do different types of searches for a conference, and he talks about parsing out the data, using different dictionaries to get different search results, applying weights to keywords, and ranking. It's a very comprehensive post that covers a lot of detail if you want to learn more about full text search. What's also interesting is the discussion in the comments, where a few people debate the benefits of storing the data in generated columns versus doing it dynamically with expression indexes in PostgreSQL. So if you check out this post, be sure to check out the comment section as well. And I should mention, this is from Rob Conery's blog.
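As a sketch of that approach (the table and column names here are made up, not taken from his post), a generated tsvector column over a JSON payload looks roughly like this:

```sql
CREATE TABLE talks (
    id    bigserial PRIMARY KEY,
    body  jsonb NOT NULL,
    -- Extract a plain-text title from the JSON blob.
    title text GENERATED ALWAYS AS (body ->> 'title') STORED,
    -- The generation expression must be immutable, so the text search
    -- configuration ('english') is spelled out explicitly, and concat()
    -- (which is not immutable) can't be used; the || operator can.
    search tsvector GENERATED ALWAYS AS (
        to_tsvector('english',
            coalesce(body ->> 'name', '') || ' ' || coalesce(body ->> 'title', ''))
    ) STORED
);

CREATE INDEX talks_search_idx ON talks USING gin (search);

-- Query the generated column with a tsquery.
SELECT id, title
FROM talks
WHERE search @@ websearch_to_tsquery('english', 'postgres partitioning');
```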

[00:12:03] The next post is "Monitoring PostgreSQL Databases using PMM", and this is from percona.com. PMM is Percona Monitoring and Management, an open source tool set. It comes as a Docker image, and the post goes through the process of installing it; it looks like it uses Grafana and perhaps Prometheus to monitor your PostgreSQL instance. It walks through the whole installation and setup, including the different permissions required. So if you're looking for another monitoring tool for PostgreSQL, definitely another one to check out.

[00:12:41] The next post is "Benchmark Partition Table - 1", and this is from a Blogspot.com blog. Basically, with all the different partitioning features, they've actually been extending pgbench, and I believe this is for PostgreSQL version 13. So this is a patch that adds partitioning options to the pgbench benchmarking tool, letting you choose the number of partitions and the partitioning method, and the post runs some tests and looks at the results. If you're interested in using pgbench in the upcoming version 13, definitely a blog post to check out. The next post is "Postgres CASE Statement Basics by Example", and this is from the Pinopoly blog. It's a very simple post that covers the CASE statement, which is basically a conditional expression in PostgreSQL and the SQL standard in general; if you want a refresher on that, definitely a blog post to check out. And lastly, "How to Query with PostgreSQL Wildcards like a Pro". Again, this is another simple post covering the different ways you can query using wildcards: the percent sign matches any sequence of characters, and the underscore matches exactly one character. It also covers LIKE and ILIKE, as well as using the lower() function to search the database ignoring case. So if you're interested in that type of content, definitely a blog post to check out.
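For a quick refresher on those last two topics, here is a small sketch using hypothetical tables:

```sql
-- CASE is SQL's conditional expression.
SELECT name,
       CASE
           WHEN score >= 90 THEN 'excellent'
           WHEN score >= 60 THEN 'passing'
           ELSE 'failing'
       END AS grade
FROM exam_results;

-- In LIKE patterns, % matches any sequence of characters (including none)
-- and _ matches exactly one character. ILIKE is the case-insensitive form.
SELECT * FROM users WHERE name LIKE 'Jo%';         -- Jo, Joe, Joanna...
SELECT * FROM users WHERE name ILIKE 'j_n';        -- Jan, Jon, jen...
SELECT * FROM users WHERE lower(name) LIKE 'jo%';  -- case-folded comparison
```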

[00:14:11] That does it for this episode of Scaling Postgres. You can get links to all the content mentioned in the show notes. Be sure to head over to scalingpostgres.com, where you can sign up to receive weekly notifications of each episode, or you can subscribe via YouTube or iTunes. Thanks.
