
Scaling Lessons, Synchronous Commit, Connection Management, BLOB Cleanup | Scaling Postgres 161


In this episode of Scaling Postgres, we discuss 5 years of scaling lessons, working with synchronous commit, managing connections and cleaning up BLOBs.

Transcript

All right, I hope you, your friends, family, and co-workers continue to do well. Our first piece of content is "Lessons Learned From 5 Years of Scaling PostgreSQL". This is from onesignal.com, and they're talking about their experience scaling up and managing a 75 terabyte database stored across nearly 40 servers. So that is definitely quite the database. They cover different sections: bloat, database upgrades, XID wraparound, replica promotion, partitioning, and sharding. Now, they first go into a higher-level overview of the data, and basically, they have two dominant tables they talk about. One is their subscribers table, which looks like it has tens of billions of rows, and the other is a notifications table. They had to do a fair amount of updates from what they're saying here. So the notifications table has a heavy update load, and the subscribers table is write-heavy in terms of INSERTs and UPDATEs. Of course, when you have a lot of UPDATEs or a lot of DELETEs, you're going to experience bloat.

So they talk about table bloat as being an issue. What you need to do is, of course, set appropriate autovacuum values to make sure dead rows get taken care of. Now, they don't go into detail about their particular autovacuum setup for keeping table bloat to a minimum. They do talk about tuning autovacuum, but they don't give the exact values that they use. But they do reference a 2ndQuadrant article about it. Then they talk a little bit about schema optimizations they've done. One is partitioning some of these large tables so that vacuum can run on each partition, as opposed to taking forever to run on one big, huge table. That's one aspect. The other is actually moving some columns to a separate table, even though it might logically make sense to have them on the original table.

If you have information that's not updated as frequently, having it in a second table means that that data doesn't need to be vacuumed and then frozen as often. So that could lead to some benefits. In terms of reducing bloat, they talk about using VACUUM FULL, but again, that locks the whole table, so it's generally not a great option on a live system. They talk about using pg_repack, although for indexes I would probably just use the REINDEX CONCURRENTLY capability of Postgres. But pg_repack is a good tool to use for tables. They did have some specific issues with one table, where they found that pgcompacttable, another tool, reduces bloat in a non-blocking manner. So definitely some tools to handle bloat. Now, in terms of database upgrades, they've chosen to go with the logical replication option using pglogical. So they logically replicate the data over to the new version; that's how they've handled upgrades, and they discuss that here.
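
Putting those two bloat angles into a rough sketch, per-table autovacuum tuning to keep bloat down, plus the online tools for removing bloat that's already there; all names and values here are illustrative, not OneSignal's:

    -- Make autovacuum kick in earlier on a large, update-heavy table
    ALTER TABLE notifications SET (
        autovacuum_vacuum_scale_factor  = 0.01,  -- vacuum after ~1% of rows are dead instead of the default 20%
        autovacuum_analyze_scale_factor = 0.02,
        autovacuum_vacuum_cost_delay    = 2      -- cost delay in milliseconds; lower means more aggressive
    );

    -- Rebuild a bloated index without blocking writes (Postgres 12 and later)
    REINDEX INDEX CONCURRENTLY idx_subscribers_app_id;

    -- Rewrite a bloated table online with pg_repack (requires the extension to be installed);
    -- run from the shell: pg_repack --dbname=mydb --table=subscribers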

And minor upgrades are pretty easy to do with Postgres in terms of just updating the binaries. They talk about the importance of tracking transaction ID (XID) wraparound. It's super important to monitor, but well-vacuumed tables should keep it under control, and there are also a few things you can tune, such as autovacuum_freeze_max_age. Next, they talk about replica promotion, but I really consider this their high availability setup and how they handle it. They use HAProxy, and it looks like they're kind of in an intermediate stage of developing this. Then they talk about their partitioning: right now they partition into 256 partitions, and they're considering moving to 4,096 partitions in the future. Then they talk about sharding, which, as they say here, is a natural extension of partitioning. So if you want to get a little bit of experience from someone who's been running a pretty large Postgres installation for five years, definitely check out this blog post.
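
The post doesn't show their schema, but as one way to get a fixed number of partitions like that, declarative hash partitioning looks roughly like this (the table, columns, and a partition count of 4 are purely illustrative):

    CREATE TABLE subscribers (
        id         bigint      NOT NULL,
        app_id     uuid        NOT NULL,
        created_at timestamptz NOT NULL DEFAULT now()
    ) PARTITION BY HASH (id);

    -- one of these per partition; they use 256, four are shown here for brevity
    CREATE TABLE subscribers_p0 PARTITION OF subscribers FOR VALUES WITH (MODULUS 4, REMAINDER 0);
    CREATE TABLE subscribers_p1 PARTITION OF subscribers FOR VALUES WITH (MODULUS 4, REMAINDER 1);
    CREATE TABLE subscribers_p2 PARTITION OF subscribers FOR VALUES WITH (MODULUS 4, REMAINDER 2);
    CREATE TABLE subscribers_p3 PARTITION OF subscribers FOR VALUES WITH (MODULUS 4, REMAINDER 3);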

The next piece of content- "The Power of Synchronous_commit". This is from momjian.us and he's talking about the different setting changes that synchronous commit does. He mentioned that he updated some documentation in October 2020 and was in the November 12th, 2020 release where it needed to communicate a bit more about what that parameter value does. He goes into detail about it because it handles how things get committed at the local level, meaning at a single server level, as well as when you're having replicas involved. For example, he mentions when synchronous_commit is off, then the database doesn't wait for records associated with a commit to be written to the write-ahead log. So it does guarantee data consistency. It just means that everything that's been committed and returned back to the client says, okay, it's been committed.

If the server goes down, there's no guarantee that that commit will make it to the database files, but the database will be consistent. He says this is in contrast to fsync=off, where there's no guarantee of data consistency. He mentions that the local option for the setting waits for the commit to be durably recorded locally, meaning the WAL has been written to disk. The other settings control how writes to replicas are handled. remote_write waits for one or more standby servers to have written the transaction information out, which survives a Postgres crash on the standby but not an operating system crash. The on option, which is the default, is similar to remote_write but is durable against standby operating system crashes, not just Postgres crashes. remote_apply waits for the transaction to be replayed on the standby, making it visible to future read-only transactions there. So if you want to learn a little bit more about the synchronous_commit setting, you can definitely check out this blog post.
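
To recap those levels in one place, here is a small sketch; synchronous_commit can be set per session or even per transaction, and the standby names in the last line are just examples:

    SET synchronous_commit = off;           -- don't wait for the local WAL flush
    SET synchronous_commit = local;         -- wait for the local WAL flush only
    SET synchronous_commit = remote_write;  -- also wait for standbys to have written (not flushed) the WAL
    SET synchronous_commit = on;            -- default; wait for standbys to have flushed the WAL
    SET synchronous_commit = remote_apply;  -- wait for standbys to have replayed the transaction

    -- The remote_* behavior only applies when synchronous_standby_names is set, e.g. in postgresql.conf:
    -- synchronous_standby_names = 'FIRST 1 (standby1, standby2)'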

The next piece of content- "Your Guide to Connection Management in Postgres". This is from blob.crunchydata.com. He's talking about different methods of pooling connections to make your Postgres server more efficient, because the more connections that are left open, they can be reused. This results in lower latency to get queries back from the database because you don't have to build up a connection, particularly an SSL or a TLS connection. Now, the first thing he mentions is being able to handle application-side pooling. So this is your application framework handling some of that pooling itself and keeping a number of connections open. So this may be based upon the number of workers you have and maybe they use a common pool of connections to do their work against Postgres.

Having those processes keep the connections open for a time and reuse them will make querying the database through the application framework more efficient. Once you get to a certain scale, you may need to implement server-side connection pooling. This is using a tool such as PgBouncer to pool the connections and use a common set of server connections for all the application connections coming in. Then he mentions the scenario where you make connections across multiple databases. Some application frameworks support sending particular database queries to certain databases. He does mention, for example, that Rails has some libraries that help with routing via a specific model. So if you want to learn a little bit more about things you could be doing with connection pooling to make your database more efficient, definitely check out this blog post.
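
To make the server-side option concrete, a minimal PgBouncer configuration sketch might look like this; the database name, paths, and pool sizes are illustrative:

    ; pgbouncer.ini
    [databases]
    myapp = host=127.0.0.1 port=5432 dbname=myapp

    [pgbouncer]
    listen_addr = 0.0.0.0
    listen_port = 6432
    auth_type = md5
    auth_file = /etc/pgbouncer/userlist.txt
    ; share server connections per transaction rather than per client connection
    pool_mode = transaction
    ; how many client connections PgBouncer will accept
    max_client_conn = 1000
    ; how many actual Postgres connections to open per database/user pair
    default_pool_size = 20

The application then connects to port 6432 instead of 5432, and PgBouncer multiplexes those client connections over the much smaller server-side pool.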

The next piece of content is "CLEANING UP A LARGE NUMBER OF BLOBS IN POSTGRESQL. So this is from cybertec-postgresql.com and they're talking about using BLOBs or binary large objects and actually storing them in the database. So one scenario if you're working with binary objects is to store the file and just store a reference to it in the database. The other way is to actually store the binary information of that file in the database so you no longer essentially need that file. Now, one way you can do that is you use the large object import function (lo_import) and give it a file name and it will actually store that binary data in this file in the database and it returns an object ID here. Now, the typical way you would probably implement this is to create a separate table. Then when you do an INSERT into the table, use this large object import function and it will return the object IDs.

Then you can store the reference to the large object in a known table that you have. Now, for an experiment, he loaded about a million of these objects into Postgres. Then he took a look at how these large objects are stored, and basically they're stored as byte arrays. So anything that's a BLOB is essentially an array of bytes. For example, he set the bytea output format to escape, and when he queries pg_largeobject for a particular object ID, you can actually see the data that's stored in that byte array. Next he mentions removing BLOBs, and that's easy to do using the large object unlink function (lo_unlink) and referencing the object ID. It unlinks the object and removes it from the database. However, I'm assuming a client ran into an issue here, because how do you remove millions of BLOBs at a time?

When you actually try to do a large number of these, you can run out of shared memory. The reason is that each unlink requires a lock, and it's not a row lock, but one of the locks governed by max_locks_per_transaction. So you can't remove, say, a million of these binary large objects in a single transaction. The way I would deal with that is to just delete the BLOBs in chunks. The other possibility is changing the Postgres configuration, but it's definitely something to keep in mind if you need to delete these en masse. The last thing he covers is how to deal with orphan BLOBs. Because again, the suggestion is to do an INSERT where you save the reference to the binary large object, but what if you end up with an orphan object? Well, he talks about how you can handle those here. So if you want to learn more about that, definitely check out this blog post.
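
A chunked delete along those lines could look something like this, assuming you really do want to remove every large object in the database; the batch size is just an example, and with autocommit each statement runs as its own transaction, so the lock table is cleared between batches:

    -- Repeat until no rows remain
    SELECT lo_unlink(oid)
    FROM pg_largeobject_metadata
    ORDER BY oid
    LIMIT 10000;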

The next piece of content- "Managing Transaction ID Exhaustion (Wraparound) in PostgreSQL". This is from keithf4.com, and it's titled Keith's Ramblings. Much like the previous post, he talks about five years of knowledge and monitoring. This is highly important to do, particularly with a highly active database. He presents a query that is apparently used in the Crunchy Data PGmonitor. But it's a CTE query that actually tracks the transaction ID and how soon you're going to hit that 2 billion limit. Because he mentions a percent towards wraparound and a percent towards emergency autovacuum.

In other words, how soon is it going to hit the mark where Postgres has to do an aggressive vacuum to take care of things? He also mentions autovacuum_freeze_max_age, for which the default value is 200 million. If you have a highly active database, you may want to increase this a bit, but definitely to no more than a billion, from what I'm familiar with. Once you have your monitoring set up, the next consideration is how to fix things if you actually start approaching this problem. One way to do it for the whole database is to just vacuum the whole DB specifying FREEZE, with a couple of jobs running at a time, and doing an ANALYZE as well.
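
With the vacuumdb client application, that whole-database approach looks roughly like this (the job count is just an example):

    # Freeze and analyze every database, processing two tables at a time
    vacuumdb --all --freeze --analyze --jobs=2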

That'll freeze the rows and free up transaction IDs. But there is also a per-table fix. You can use this query to determine which databases are going to hit the limit sooner than others, then check each table within that particular database to see which one is closest to hitting that wraparound point, and then separately VACUUM FREEZE that table. He also has a technique here for doing multiple tables at a time if you need to. So if you're looking for some techniques to monitor your transaction IDs for wraparound as well as ways to resolve it, you can definitely check out this blog post.
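
His exact queries are in the post, but the basic building blocks are age() checks against the system catalogs (the table name in the last statement is illustrative):

    -- Which databases are closest to wraparound?
    SELECT datname, age(datfrozenxid) AS xid_age
    FROM pg_database
    ORDER BY xid_age DESC;

    -- Within the worst database, which tables are closest?
    SELECT relname, age(relfrozenxid) AS xid_age
    FROM pg_class
    WHERE relkind IN ('r', 'm', 't')
    ORDER BY xid_age DESC
    LIMIT 20;

    -- Then freeze just that table
    VACUUM (FREEZE, VERBOSE) my_big_table;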

The next piece of content- "Postgres is Out of Disk and How to Recover: The Dos and Don'ts". This is from blog.crunchydata.com, and the first thing they're talking about with regard to this is to make sure you have a backup. Then they also talk about the scenarios under which you would get a full disk. So one is that the archive command is failing and the WAL is filling up with disk space. The second is that the replication slots are essentially orphaned. There's a disconnected standby, and the result is on the primary, the WAL is filling up. The next scenario is that there are a lot of database changes that generate a ton of WAL that consumes all the available disk space.

So maybe a mass of DELETEs combined with INSERTs, or, more likely, a mass of UPDATEs. The last scenario is that your data simply outgrew the disk and you don't have any monitoring or alerting to let you know. The first thing they talk about is what not to do. Number one is to never remove the WAL. Definitely never remove that. Number two, don't immediately overwrite the existing data directory with a restore from a backup. Well, I wouldn't do that either. Number three, don't just resize in place. I don't know if I 100% agree with this, because there are some ways that this may be possible. Then they say what you should do is take a file system backup right now. I don't necessarily agree with that. They say to create a new instance with sufficient space. I'm like, okay, that's possible, or fix the underlying issues.

Generally, if you're running on a cloud provider, what I would do is see if you can just increase the disk size; AWS, for example, has a way to expand EBS volumes, and I would go ahead and do that. Or, is there some other file or some other reserved space on that volume that you can delete, at least to get the database back online? Once you get the database back online, you can then potentially delete orphaned replication slots, which may be a cause, or you can adjust the traffic so that you're not updating so much and the WAL backlog can be resolved, or you can fix the problems with the archive command. So there are a lot of different ways that you could potentially resolve this. This post does cover the main ways that running out of disk space happens with Postgres. So if you're interested in learning more, definitely check out this blog post.
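
For the orphaned-slot case specifically, inspecting and dropping a slot looks like this (the slot name is illustrative):

    -- See which slots exist and how much WAL each one is holding back on the primary
    SELECT slot_name, active,
           pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
    FROM pg_replication_slots;

    -- Drop a slot whose standby no longer exists
    SELECT pg_drop_replication_slot('old_standby_slot');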

The next post is "FINALLY - A SYSTEM LEVEL 'READ ALL DATA' ROLE FOR POSTGRESQL". This is from cybertec-postgresql.com. They're talking about a new patch that's been added for Postgres 14, which should be coming up this fall, in that you now have a pg_read_all_data role and a pg_write_all_data role. So they really love this feature because he says what tends to happen is, yes, you want to be able to spend time and craft your roles appropriately, but too many people just grant superuser roles inappropriately and when you're doing that, you're actually opening up your server to more risks. So for example, you can run operating system commands on the database server with superuser permissions.

So you want to minimize how many people have that access, and these roles are a great way to do it. If you have a data analyst and all they need to do is query the data, you can just give them the pg_read_all_data role. Or if you have an application user that needs to read and write data, you could give them both of these roles. Now, he does talk about some workarounds if you don't have these roles yet; really, these amount to more structured ways of handling it, creating roles and assigning the appropriate permissions to do essentially what these new roles do. So if you want to learn more about setting this up, definitely check out this post.
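
On Postgres 14 and later, that boils down to a couple of grants (the role names here are illustrative):

    GRANT pg_read_all_data TO data_analyst;
    GRANT pg_read_all_data, pg_write_all_data TO app_user;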

The next piece of content is "CItus Talk at CMU: Distributed PostgreSQL as an Extension". This is from citusdata.com. They're talking about a presentation that was done at CMU relatively recently which talked about PostgreSQL and how it does scale out Postgres via being an extension. It covers some of the topics pointed out at the different watch times of the YouTube video, such as when to use Citus to distribute Postgres, using the PostgreSQL extension APIs, transparent sharding, distributed query engine, and distributed transactions. If you want to learn more about Citus, you can definitely check out this blog post and talk.

The next piece of content is "Benchmarking PostgreSQL: Setting up a RAID Array from Scratch". This is from enterprisedb.com and the main subject of this article is setting up a RAID array because they did have an AWS cloud instance and they were moving to an on-premises server to get more consistency out of their benchmarking. And therefore they're using the software RAID tool: mdadm. But even I've used this on Amazon because when your database gets up to a certain size, there are limits to how big of an EBS volume you can have. If you have a database larger than that, then you need to basically stripe across multiple EBS volumes. So even if you're in AWS and your database is large enough, you may be using this anyway. He basically goes through the process of setting up this mdadm RAID arrays for Postgres volumes. So if you're interested in that, you can check out this blog post.

The next piece of content- "How does Postgres handle external parameters?: This is from highgo.ca. So this is an in-depth look at the parameterization that Postgres does with queries. So if you want to get under the covers of Postgres and understand how that works, definitely check out this post.

The next piece of content- "Regression Analysis in PostgreSQL with Tensorflow: Part 2- Data Pre-preprocessing". This is from enterprisedb.com. So if you want to learn more about Tensorflow for doing machine learning with Postgres, you can definitely check out this post.

The last piece of content, the PostgreSQL person of the week is Devrim Gündüz. So if you're interested in learning more about Devrim and his contributions to Postgres, definitely check out this blog post.
