Sharding, Federation, Linux HugePages, Performance Tuning

In this episode of Scaling Postgres, we discuss one method to shard a database, federating your database, the benefits of configuring linux HugePages and performance tuning.

Content Discussed

YouTube Video

Podcast Audio

Transcript

All right. I hope you, your friends, family, and coworkers continue to do well. Our first piece of content is "Herding elephants: Lessons learned from sharding Postgres at Notion". This is from notion.so. They're talking about sharding their Postgres monoliths. So they did have a monolithic database, one single database running their application, and they decided to shard it into multiple databases. More specifically, 32 physical separate databases. So they talk about the process they went through and their experience with it. They store information from customers and they use a concept of blocks, files, and spaces to store them. So the first decision was how they wanted to shard it. They looked at a few different technologies such as Citus for Postgres, which is the well-known extension for scaling out Postgres across multiple instances. They also looked at Vitess for MySQL, but they weren't necessarily happy with the clustering logic.

They said they wanted more control over the distribution of their data. So they actually decided to do an application-level sharding. So basically, the application needs to know what shard to go to to get the right data to deliver it to the customer. Now, they basically decided to shard it at the block level and everything that is connected to it. So it looks like spaces or workspaces are associated with blocks, and discussions are comments so all of those tables moved into this sharded database solution. They decided to partition it by their workspace ID. I'm assuming this is similar to either an account ID or a customer ID or maybe one level down more granular like maybe a customer can have more than one workspace ID. But that's basically how they decided to partition it and they wanted to plan ahead to have sufficient capability to scale out in the future. What they decided on, as I mentioned, is 32 physical databases at this point, but they actually created 480 shards.

Now, the reason they did this is because of how many factors get to 480. Like all of these different numbers are a factor of 480. So it gives them a lot of choices on how to split out their sharding into however many databases. Here's an example of where they went from this monolithic structure to 32 physical databases. Each physical database has 15 logical shards and it looks like they sharded it based on the schema. So there are essentially 15 schemas in each of the physical databases. Now, one of the main reasons they said that they started exploring sharding is because they were having issues with auto vacuum keeping up as well as having a risk of hitting the TXID wraparound. But they don't really mention it up here about using partitioning because partitioning is a way to break up your tables so that they can be vacuumed faster and ideally avoid the TXID wraparound.

Now, that's not to say they would eventually have to shard anyway, but I wonder if they actually considered partitioning first and then, if that got to a point where it didn't work all that well, they fell back to sharding, because again, sharding is a lot of work to do. They did mention partitioning when they were deciding what they were doing for sharding and they decided not to do any partitioning at all. Basically, the application decides what exact schema and table to contact. But even before sharding, I wonder if the partition solution was explored at all. I didn't see that mentioned here. And then this blog post goes into great detail about how they actually did the migration. So at first doing double writes to both systems, doing a backfill of the sharded system, doing a verification process, including interesting things like having separate staff members do the verification versus who wrote the migration or the backfill to ensure there wasn't just copying code over for the verification.

In other words, it was an independent verification. Then, of course, planning the switchover process. They talked about the way that they kind of synced up the data between them and they were considering writing paths to both databases, but they were a little bit concerned about inconsistencies coming up with that. They considered logical replication, but ultimately it looks like they decided against that because they were having problems with the initial snapshot step. But interestingly, Postgres 14, that's one of its advantages to really speed some of this up. Now that's not to say it wouldn't have worked in other ways, but that's an interesting thing to think about. Then the third way that they ultimately went with was an audit log with a catch-up script to basically capture everything in a log and then send it to the Sharded database to catch it up. But this is a great post that goes into a lot of detail about how they decided to go with application level Sharding versus some other solution for their database growth. So I definitely encourage you to check this out.

The next piece of content- "Multi-Cluster Postgres Federation on Kubernetes with Postgres14 Using GitOps". This is from blog.crunchydata.com and this looks to be basically a refresh of a post that was previously linked here called "Active-active federated Postgres cluster on Kubernetes". So it essentially uses the newer version of the Postgres Kubernetes operator that they developed along with using some Postgres 14 specifics. But essentially the post is the same where you have three regions: an east region, a central region, and a west region. Each database writes locally and then sets up logical replication to copy those changes to each of the other databases in the cluster, essentially. So something would be written to the central cluster locally, but then logical replication would send it over to the east and west and these are partitioned so there's no chance of a conflict between things.

So for example, this is part of the setup for the Postgres operator. But here they get into creating the structure and essentially you have one hippo's parent table, but then you have partitions for each area; east, central, west, and default. Basically, data gets UPDATED or INSERTED and it gets placed into the default table. But then a function runs as a trigger that then moves it essentially into the appropriate partition. So this is a pretty interesting technique and you don't need to use Kubernetes for this and you don't need to use Postgres 14. Basically, you need to use a version that supports logical replication and you can implement this. So it is a pretty interesting technique. If you want to learn more about doing it with Kubernetes, you can definitely check out this blog post.

The next piece of content- "Why Linux HugePages are Super Important for Database Servers: A Case with PostgreSQL". This is from percona.com. They're targeting the solution for these Linux huge pages as a way to avoid out-of-memory errors in Postgres. Now of course there are a lot of other benefits in terms of using less memory, which may have advantages for caching and things of that nature, but they go through and demonstrate the problem here. So they have a particular size system with 192 memory and 40 CPU cores. They show the configuration here and they start running pgbench. You can look at the memory as their pgbench, or I should say sysbench is running.

You can see that the memory goes from 71 GB available down to 1.5 GB and then it actually starts using swap space at that point and the page table size has grown to 25GB from an initial size of 45 megabytes. So that's pretty huge. Now this also makes some of these processes pretty large, which makes them a target for the out-of-memory killer if it's going to kill something. But they're using HugePages as a potential solution to this. So first they find out what the VmPeak for the Postgres process is to be able to configure HugePages to an appropriate size and then they go ahead and set it up. They go ahead and restart the database and they can tell from stopping and starting the database there's

not too much of a difference between the memory. But then once he starts running this for many hours, you can see it starts at 38GB and then it ends still at around 38-39GB. So it's not using all of that memory. He says you can see that the page table size remains the same. So it's around 62 megabytes as opposed to getting up to 25GB without using Linux HugePages. So this is a pretty impressive improvement in memory savings that can be used for other things as well as hopefully reducing the probability that the outer memory killer would be used in Linux. So definitely a piece of content. I highly suggest you check it out.

Next piece of content- "Webinar Recap: Tuning Tips to Maximize Postgres Performance". So this is from enterprisedb.com and there is a linked YouTube video here that you can take a look at. It's about an hour in length, but it goes through all sorts of performance improvements or things you should look at when tuning Postgres. It talks about hardware, talks about VM settings, talks about the operating system configuration, talks about a number of different Postgres parameters you can set, and finally ends up with query optimization. So this was a great piece of content to keep track of when you need to optimize your Postgres installation.

The next piece of content- "Tune sorting operations in PostgreSQL with work_mem". This is from aws.amazon.com. They're talking about configuring work_mem and how to find an appropriate setting for it so that it's not using so much disk. So more work_mem allows memory to be used for more sorting operations. So if you have work_mem set too low, he shows some examples where you're producing more temporary files. But the thing that I tend to look at in an EXPLAIN ANALYZE plan is the fact that you're getting an external merge on the disk. So this can be an indication that your work_mem is too low. It may also mean that you need an index. He does cover this here too.

There could be query optimization or index optimizations that would be more ideal than altering work_mem. You also need to be careful of setting work_mem too high. For example, he set it here to 1.4 GB. So you definitely wouldn't want to set that on the database configuration itself. He set it actually in his individual session that he was connected to do the query. So if you do need to increase work_mem for certain queries, this could be a good solution to set it per session. But generally, you want to avoid setting it too high and use all the memory of your system in Postgres for typical connections that don't need a lot of performance. He does cover some of this in the post below. So if you want to learn more details about setting work_mem for your system, definitely check out this blog post.

Next piece of content- "SHOW TABLES IN POSTGRESQL: WHAT'S WRONG WITH IT?" This is from cybertec-postgresql.com. They're talking about SHOW TABLES, which is a command you can use in MySQL to look at all the different tables in a database. Well, Postgres doesn't have this because it's not an ANSI or SQL standard way of doing it. Postgres has a couple of different ways that you can do it though. The main way that I use, that he mentions first here, is being able to use the psql client because there you can use the slash commands to get a list of different tables.

So you can use \dt to get a list of the different tables, or you can look at them from a particular schema or for a particular table or even get extended details about them. You can also always go to the system tables by doing a SQL query to either the pg_class_table or even the pg_tables. Then lastly, there's also the ANSI standard way there's an Information schema that you can query. And looking at the information_schema.tables view or tables, you can get information about all the tables of the system. So these are some different ways that you can query this schema of your database using Postgres

Next piece of content- "USING QUERY ID IN POSTGRES 14". This is from rustprooflabs.com, and they're talking about a new enhancement to Postgres 14 where they have made the query ID more visible and in more places of the system. Essentially, you can think of this as a UUID or Universally Unique ID for queries. So you can identify queries across pg_stat_statements or across pg_stat_activity to trace the behavior and activity of certain queries. So if you want to learn more about how you could potentially do that and even log it, you can check out this blog post.

The next piece of content- "Backup Label in PostgreSQL". This is from highgo.ca. This explains the Backup Label file and how it works in PostgreSQL specifically for doing usually exclusive backups. So you would run a pg_start_backup and a pg_stop_backup and then backup the data files in between that. So it talks about the purpose of the backup label and how it's used. So if you want to learn more, you can check out this blog post.

The next post is a short one on "MULTIRANGE TYPES IN POSTGRESQL 14". This is from postgresonline.com. They explain how you can use multirange types in Postgres 14 and how you can use different functions like Array Aggregate or array_agg as well as unnest to be able to present different multirange types.

The next piece of content- "Installing PostgreSQL on OpenBSD''. This is from fluca1978.github.io. He's talking about installing Postgres on OpenBSD, so if you have an interest in doing that, you can check out this blog post.

The last piece of content, the PostgreSQL person of the week is Sebastiaan Mannem. So if you're interested in learning more about Sebastiaan and his contributions to Postgres, definitely check out this blog post.

Sharding, Federation, Linux HugePages, Performance Tuning | Scaling Postgres 187

Join Over 1,000 Engineers & Get New Episodes Weekly!

Content Discussed

YouTube Video

Podcast Audio

Transcript