10-100 Times Higher Latency With Distributed Postgres

In this episode of Scaling Postgres, we discuss how distributed Postgres solutions can result in 10-100 times higher latency, building incremental backup, DBMS of the Year and understanding explain.

Content Discussed

YouTube Video

Podcast Audio

Transcript

Did you know that some distributed Postgres solutions can increase latencies up to ten to a hundred times? We'll talk about that in today's episode, but I hope you, your friends, family, and coworkers continue to do well. Our first piece of content is "An Overview of Distributed PostgreSQL Architectures". This is from crunchydata.com. What I first noticed about this is it's from Marco Slot, and he mentioned that he is now working at Crunchy Data, whereas before he used to be at Citus Data which then was acquired by Microsoft. So that's a pretty interesting move. I also saw that Crunchy Data has a hosted solution called Crunchy Bridge, I believe, and now it supports Citus. So I wonder how much of a hand he had in that. But this particular post is exactly as it says, an overview of distributed Postgres architectures, and specifically he's looking at things such as network-attached block storage. An example is EBS on AWS. So as a post attaching a disk directly to an instance, it uses a network-attached storage solution.

He talks about read replicas, which are an out of box capability from Postgres. The next one he covers is DBMS-optimized cloud storage. So this is like Aurora or AlloyDB at Google, where there's more of a separation between the storage solution and the database engine itself. Talks about active-active. So multiple primaries, which is something that BDR does in terms of bi-directional replication. He talks about transparent sharding, which is something Citis does, and I think a few other products too, and then talks about distributed key-value stores that have SQL bolted onto them and he uses a Yugabyte as an example. Now what he first establishes here is that the fastest way to get Postgres is on a single machine because you are eliminating so much latency compared with these other distributed solutions. So as he says, quote "There's virtually no network latency on the database layer, and you can even co-locate your application server".

So if you have one server that's running the application and talking to the database, you can't really match the speed of that any other way. Particularly if you're directly attaching high-speed NVMe storage directly on that server, as opposed to having to go over the network. And he says here, quote "Millions of IOPS are available depending on the machine configuration". He also says "Disk latency is measured in microseconds" So that's definitely the fastest way to go. He doesn't say this here, but I would definitely say it's not the safest way to go because the reason that people choose these distributed architectures is for the protection of their data and higher availability of the data. And he does talk about that, of course. So that's being able to set up replication so you can copy data to another location or distribute it and partition it and protect it and make it available and not vulnerable to a single machine going down, and now you have no application anymore.

He specifically talks a lot about latency, because if you look at a lot of these distributed systems, the number one disadvantage of them is the higher latency for setting up a distributed Postgres system. And how latency really hurts, particularly on OLTP systems. But the first one he covers here is the network-attached storage. So this basically separates the database engine itself from the storage area. So you do have to communicate across a network to access your storage. Plus he says, quote "The block storage service internally replicates to 2-3 storage nodes". So this definitely gives you higher durability, as he says here, and higher uptime to avoid failures of a particular one. But it comes with pretty big cons with higher disk latency. For locally attached storage, maybe it's around 20 microseconds or so, but it can be 1,000 microseconds for something like EBS. And even the IOPS is significantly impacted. So you go from millions to tens of thousands.

Although of course, you can pay a lot more money to get maybe up to 100,000. I don't know what the current limit is, at least on AWS, but you can definitely spend a lot more money to solve this. I don't know how much more money impacts this. I know they have different tiers or different versions of their storage systems where they're looking to improve this, but it's just a reality of physics when you separate things in this way, there's a latency in its communication. The next area he covers is read replicas. So this is having a primary system and creating replicas of the data that maybe you want to query for read-only purposes. And of course, there's latency inherent in that. And if it gets too extreme, depending on how your application works, you can run into problems. And he gave an example of a shopping cart. So if a customer purchases something and then goes to look at the shopping cart, if the purchase is sent to the primary database, but you're using a replica to determine what's in the shopping cart.

If there's significant latency with the replica being updated, someone looks at their cart and they don't see the item that they just placed in it, that could be a problem. So as he says here, that's pretty much eventual consistency. So eventually you can read-your-writes. The next error he mentions is DBMS-optimized cloud storage. Again, this is things like Aurora and AlloyDB where they've created basically a whole new storage solution to handle database changes. How he explains it as opposed to a standard installation where you're writing WAL to the storage system, the write-ahead log, but you're also writing to the data pages in the database files. So when changes are made within the database system in memory, it runs a checkpoint to save those changes to the disk. Whereas he says these DBMS-optimized solutions actually just write the WAL and pass the WAL around to different storage areas for these solutions.

And sending this WAL all around definitely increases the amount of writes that happen, as does the latency as well, because frequently you're copying this over availability zones in AWS's case. So he mentioned some of the pros of this as well as some of the cons. So for example, write latency is high by default, and of course, you have the high cost and pricing of all that write activity essentially. Now, one of the biggest pros I see with this particular solution that I've seen some clients run into with just using standard replicas with these solutions is that you can reuse the storage. So there's one storage system essentially, and you can put as many compute nodes on top of it as you want, whereas with the typical Postgres read replica solution, every replica has its own copy of the data, whereas they're using a common storage pool with these solutions, to my understanding.

The next area he covers is active-active. So this is where you have multiple primaries, each receding reads and writes. In that, you're going to inherently have latency because something's updated here and updated there and it takes a while for it to be put in sync. Then you have collision problems too, because what if they choose the same value to update by one in his example on both servers at the same time? Well, what should the value be? How are those conflicts managed? You need to be very careful with how you program your application if you're going to be using an active-active solution. And this solution, again, is basically eventual consistency. Eventually, you'll be able to read the writes that have been done, but there's no monotonic read consistency and a possibility that updates might conflict after a commit, which is not really possible in the other solutions that have been mentioned. The next one he mentioned is transparent sharding. So this is the Citus solution, basically distributing your data across multiple tables. Now, of course, this comes with a lot of advantages, but again, the major disadvantage is you have high read and write latency because you're sending queries to a particular node.

It must send them to other nodes that do the computation and then send it back over the network to the node that requested it to present the answer to the application. Then he mentions here data model decisions have a high impact on performance as well, but the pros are it definitely allows you to scale your reads and writes in terms of CPU and IOPS, so that's a benefit. Lastly, he does cover distributed key-value storage with SQL. I'm not really going to cover that too much because I think the architectures are too dissimilar to Postgres to make a good comparison. It feels a little bit like apples to oranges, even though they speak the Postgres dialect. But I found this post super interesting. Thinking about all these distributed Postgres solutions does give benefits, but one of the primary disadvantages that you'll see in all these is performance, specifically with regard to latency. So definitely a great blog post. I suggest you check it out.

Next piece of content- "Incremental Backup: What to Copy?". This is from rhaas.blogspot.com and last week where we mentioned that there's been a commit in Postgres 17 for an incremental backup solution. This allows backing up just what has changed. So you don't have to do a full backup every time. Now you can just back up what has changed since the last backup. Robert Haas is actually the committer of this particular patch, and he goes through his thought process of how he designed it. I found this super interesting. So first he considered how general file copy tools identify what's changed, and they may use file sizes, modification times, or checksums to identify which files have changed. But that really wouldn't work for Postgres because what he wants is a block-level solution. So he needs to identify the modified blocks.

He said he might have figured out a way to do some checksums with regard to that, but he also didn't want to rely too much on timing, specifically the timing that the database server is at to make some of these decisions. And he does admit that, quote "... pgBackRest has had success with these kinds of techniques...". But he chose a different path. The next way he considered is basically keeping a bitmap of what blocks have changed. And he mentions that pgtrack does this. He didn't choose this particular approach because he was a little worried about quote "...code complexity, storage space requirements, crash safety, and performance". So what he ultimately used was PostgreSQL's write-ahead log to identify the modified blocks. Now, using the write-ahead log to do this, he knew we couldn't use the full write-ahead lock because that would be way too large of a thing to track.

I literally have some customers whose WAL files in a given day exceed the size of the database by maybe up to two times. So you can have a really high amount of WAL generation. So just trying to read through that to see what's changed is a no-go, basically, he said. But he thought, could we just get the information we need and store that? Because he says, quote "We only need information about which blocks were modified and which relations were truncated". So he doesn't need to know how many times something was changed, when it was changed, or the order. He just needs to know whether it was or not. He was able to define a very small amount of data from the WAL that would be needed to be retained for this purpose. And that is the WAL summarizer. So we covered this incremental backup last week, and you have to turn on a WAL summarization process in order to be able to do incremental backups. And that's what the WAL summarization process does.

Basically, it stores these summaries of what blocks have changed in the files under pg_wal/summaries. And he says, quote "You get one file per checkpoint cycle...". Basically, when you want to do a backup, you just request the backup and you specify the backup manifest file and it compares that with the summaries to determine what needs to be backed up in that incremental backup. He did have to deal with particular issues like what if a relation is truncated and then a new relation is created with the same name, how to handle some of those potential issues and conflicts. He discussed that here. But a great blog post explaining the process and he definitely encourages everyone to test as soon as they're able, because it would be great to identify and get this working really well because as he says here, quote "...this feature is not guaranteed to appear in PostgreSQL 17...". But I think a lot of people would very much like for it to be there. So if you have the ability to test out these changes with the developer version, I'd go ahead and do it.

Next blog post is also from rhaas.blogspot.com and this is "Incremental Backups: Evergreen and Other Use Cases". So here he's theorizing on how people may use incremental backups. I actually covered a lot of this last week. Maybe you have a full backup once a week and incrementals every day, or maybe you have full backups every day and incrementals in between, and then the WAL files to bring you up to a particular point in time. So he's interested to see how people are going to be using this new capability. But he also mentioned something here is you don't have to keep a ton of incremental backups. What you could do is if you have a full backup in this example on Sunday and then create an incremental on Monday, you can use pg_combinebackup to combine Mondays with Sundays to create a new Monday full backup. So now this is your new full backup. Then you can do another incremental backup, and if you want to consolidate them, you could use the same procedure. But these are some thoughts on how it might be used.

Next piece of content- "Waiting for PostgreSQL 17 - Add support for incremental backup". This is from depesz.com and this is a post that goes into detail running the commands to get incremental backup working. Now, because I covered this last week in a separate blog post, I won't cover this here, but you can check this out for a different person's perspective on using it.

Next piece of content- "PostgreSQL is the DBMS of the Year 2023". This is from db-engines.com and this is great being recognized yet again. What I found interesting, they have the last winners for the last ten years, and PostgreSQL has won four years out of the last ten years, which is an impressive feat because other database systems only won at most two years out of the ten. So that's pretty good.
Next piece of content, there was another episode of Postgres FM last week. This was on "EXPLAIN", and this is a solo episode by Michael where he talks all about the explain plan and the best approach to use this command to understand how your queries are performing. So you can definitely listen to the episode with this link or watch the YouTube video down here.

Next piece of content- "Create an AI Expert With Open Source Tools and pgvector". This is from percona.com and if you're looking to build your own artificial intelligence solution, your own expert on your proprietary data, and you don't want to necessarily use something like ChatGPT or unincorporated. This blog post walks you through how to set up your own expert system using a fair amount of Python, along with Postgres and the pgvector extension. Now in their examples here, they are using their Percona operator and Kubernetes. But as long as you have Postgres and the pgvector extension, you should be able to mimic what they're doing here. But they have code for all the processes of building and setting this up. So if you're looking to do some AI work, definitely encourage you to check out this blog post.

Next piece of content- "Best pg_dump compression settings for Postgres in 2024". This is from kmoppel.github.io and he did some tests with various different compression solutions with pg_dump, gzip, lz4, and zstd, and his ultimate conclusion was that "zstd indeed should be the new default pg_dump compression method for most datasets". That was a conclusion I came to, I think six months to a year ago when I started using zstd before they were incorporated into the Postgres tools. So basically I would pipe pg_dump or pg_base_backup into zstd because it can use multiple CPUs in parallel. Now he has his results down here.

I had a little difficulty interpreting some of what he mentioned here, but you can definitely review this yourself. I know when I was looking at it a while ago, it seemed like lz4 was very fast for smaller sets of data. So maybe if we want to compress WAL files, lz4 might be faster. Zstd took a little bit of time to rev up, as it were. So it worked really well on really large data files like a backup would be. The benefits that we found are really good compression ratios, and really good speed compared to something like gzip. And the third big thing is being able to use multiple cores. Now his tests here, he said he didn't use multiple cores, but that's a huge advantage of zstd compared to stock gzip. But check out his blog post if you want to learn more about this comparison.

Next piece of content- "My PostgreSQL wishlist". This is from ryanguill.com and I found this fascinating. It's basically, I don't know if it's 20 or 30 different wishes that he wants to come to Postgres. It ranges from I wish I could set a column as insertable but not updatable. So basically make it immutable. So I know some frameworks they have created_at dates for every table they make. Well, that is an example that created_at once its set should not be changed. The other one is I wish I could set a column to be generated on update and basically that's what the updated_at column is typical for frameworks that are supposed to be updated with every data change to the table. Now you can use triggers for this, but his wish is for something that you don't really have to use explicit triggers, and that there's some sort of feature even if it uses triggers internally. He has a whole set of wishes around foreign keys and JSON, some things that seemed really hard to do. I wish that we could control the default order of columns without having to rewrite the table. That sounds really hard to do. But I thought this was pretty interesting, and a lot of interesting features that you hmm, what if.

And for the last piece of content, there are actually two blog posts. So if you use Ruby or Rails, Andrew Atkinson was actually on two different podcasts discussing his book coming out about Postgres and Rails. So you can check out "Remote Ruby- Unleashing the Power of Postgres with Andrew Atkinson", or you can check out "The Rails Changelog- #014: PostgreSQL for Rails Developers with Andrew Atkinson".

10-100 Times Higher Latency With Distributed Postgres | Scaling Postgres 298

Join Over 1,000 Engineers & Get New Episodes Weekly!

Content Discussed

YouTube Video

Podcast Audio

Transcript