Upgrade Struggles, Hyperloglog, PgBouncer Usage, Postgres Scaling

In this episode of Scaling Postgres, we discuss Postgres upgrade struggles, how and why to use hyperloglog, using pgbouncer for session vs. transaction pooling, and methods to scale Postgres.

Content Discussed

YouTube Video

Podcast Audio

Transcript

All right. I hope you, your friends, family, and coworkers continue to do well. Our first piece of content is "A tale about (incomplete) upgrade from PostgreSQL 12 to 14". This is from depesz.com. And he's talking about a process of upgrading a bunch of different clusters for a really big customer. They've been doing it for a while and the projected destination was 14, not even 15, because he said he's been working at this at the very end of the post for over a year now.

So a lot of times, PG upgrade is a great tool to use if you can handle a little bit of downtime, but there are some cases he describes here where suddenly it's not so great. And more recently, I myself have been working on some logical replications for some clients. But there's a whole lot of things you need to be cautious of, and I felt this was a great blog post describing some of the issues you can run into when trying to do an upgrade.

Now a lot of times it goes off without a hitch. But the more complex the type of database you're working with, as well as the size, you can run into issues. Their particular issue here is they had a ton of objects like 400 to 500 tables in each database. And not just one schema but many schemes such that some databases had over 200,000 tables. And the size varies between a couple of hundred megabytes to six terabytes, so there are times like this where things can get complicated.

So he goes over the process of wanting to upgrade from 12 to 14 as well as from Ubuntu 18.04 to 20.04. So the first idea was to use pg_upgrade, but the reason they decided not to do that is because of the low count data changes. Basically, coalition data changes will cause you to reindex all the indexes. So this is exactly why some of my clients that I'm working with were going to do a logical replication upgrade.

So the next step was to try logical replication, but they ran into a problem where they were using too many replication slots with all the tables. It just exceeded their max_wal_senders of fifty, which that's pretty darn high. So how they dealt with that is they broke it up and did a certain number of tables at a time. Which is what I've had to do for some of my clients as well. Not so much because it would run into this specific issue, but because it would take too long to get in sync.

In other words, that one application worker on the destination, we needed more of them to be able to transfer all the data that needed to happen. But doing this he ran into a problem when applying indexes and constraints. So the idea is you just have the table with minimal constraints, basically just the primary keys, and then you apply all the other constraints than indexes and triggers after all the data has been copied over.

And he tried using pg_restore with a number of job options to be able to apply those in parallel. But they ran into problems with that. They could manually add things that failed or you could write your own solution to be able to do this to avoid conflicts when setting up these additional indexes, so you can get around that. But then they ran into another problem where even though there was a slot, the primary database was not retaining WAL. So after much investigation, he actually devised a solution to try and force retention of this WAL so it doesn't lose it.

Now this only happened with certain database clusters, others worked fine and didn't encounter this problem. So not quite sure what the problem is and he's even not sure now, but this is just the solution he had for the Postgres clusters that had issues. But I highly encourage you to check out this blog post, because it has a lot of great information about handling backups and things to watch out for.

Next piece of content- "High-compression Metrics Storage with Postgres Hyperloglog". This is from crunchydata.com. And they're talking about this extension Hyperloglog and what it is great at is getting approximate calculations. So it's not going to be exact but allows you to count distinct values in an approximate sense. And the advantages are it's a lot faster and it's a lot smaller to store. So they go through the process of setting it up in the blog.

Now because it's an approximation you don't want to do any critical calculations with it. But a lot of analytics, or metrics-type data, it's usually acceptable. So some examples he shows here- what pages received the most unique views during the time period? What hour has the highest unique visitors? What refer generated the most page views last week? If you're off by a little bit, it's not going to be a big deal.

So they go through the process of setting up the extension and setting up a dedicated table with the HLL data type. And basically, this rolls up by a certain time period. Here they're doing an hourly roll up and the less granular you are, I think there are bigger advantages in terms of storage size and things of that nature. And he also explains how Hyperloglog actually works. So feel free to review that if you're interested. But of course, I'm interested in the benefits. And as you can see here, it resulted in being about 5% the size of the original table before doing the roll-up.

The query performance went from 1.1 seconds to 80 milliseconds. So that's just about 7% of that time. And of course, the trade-off is in terms of its accuracy. And you can look at these two tables to see how accurate they can be. It seems like most of them are around, maybe, 2-5% off in terms of accuracy. So if that's good enough for you, you can gain the benefits of using Hyperloglog. So definitely check this blog post out if you're interested.

Next piece of content- "A comprehensive guide to PgBouncer/Postgres compatibility". This is from jpcamara.com. Now this is kind of an overview post of another post he did. The other post he did, that is linked here, is about 9,000 words. So it's essentially a little book on PgBouncer. Now what he did is he basically took a look at the PgBouncer compatibility table, which essentially, if you look on the PG bouncer site under features, it talks about the different types of pooling that are available.

Session pulling is basically a one-to-one connection with Postgres. There's no real difference when connecting to it in terms of feature set but you're not going to be able to connect more connections than connections that exist at the database. There's transaction pooling, which that's the main reason why a lot of people use PgBouncers. You can put more PgBouncer connections against fewer database connections because a single database connection uses many PgBouncer connections.

And then statement pooling which, as they say, is the most aggressive option and not many people use it because you have to be very cautious. Things are happening outside of a transaction. But because most people use session pooling or transaction polling, they showed the different features that are available in session pulling versus transaction pooling. So basically, he wrote a very comprehensive guide. This first post has links to his 9,000-word article on it and goes into depth. So basically, this is essentially a mini book- "PgBouncer is useful, important, and fraught with peril".

But you can see, this is very comprehensive, talking about all the benefits and the downsides of using transaction pooling. Because that's what most people use but there are some downsides as you could tell from using it. I can't hope to cover everything here but I did read this through. There are some things that I don't 100% agree with, but there's a lot of good content in it and I definitely suggest you check it out.

Next piece of content- "How Figma and Notion scaled Postgres". This is from pganalyze.com. And this is Lucas's next episode of "Five minutes of Postgres" and here he talks about how two companies decided to scale Postgres. Figma decided to take what I guess would call a microservices strategy. Basically taking tables or groups of tables in the database and moving them out of one database and putting them in another. So basically, he discussed how they did an analysis to determine how coupled tables were to one another. For example, were there joins that exist in between them?

Because if you're going to move those tables out into its own database, you can't really do cross-database joins. So you want them relatively independent. So it's basically its own microservice. Like you can imagine if you're tracking, say events, maybe that event tracking could happen within its own database. And he compared that against Notion, which actually did an application scaling where they segmented their database based upon customers. So many customers were in this database, so many customers were in the other database, and so the same tables existed in every database and you couldn't join to any of them. It's just a particular customer's data residing with one of the scale-down databases.

And this is essentially how Citus works. They do scale out and put particular customers' data all on one shard. But I believe he said that Notion has an application-level sharding as opposed to the database system doing it. But as you're deciding how you want to potentially scale your database, definitely recommend checking out this piece of content.

Next piece of content- "Logical Replication on Standbys in Postgres 16". This is from crunchydata.com. And it is this new feature in 16 where you can set up your standbys to do logical replication. This wasn't possible before. So basically you can have your primary database system that streams to a standby using standard physical replication. And then you do logical replication to your replicants. And again the benefit of logical replications, you can do it across versions. You can only do certain tables or certain types of data to transfer to these replicas.

And then the other benefit, in the event of a failover, those replica servers can still follow that old standby or new primary now. So this still continues to work. Now I keep waiting for the point where the physical replication slot can fail over to the primary as well. And maybe that's coming in a future version of Postgres, I know there's an extension that does it now. But this blog posted a very good job of explaining how to set up logical replication. I definitely recommend you check it out if you're interested in that.

Next piece of content- "WHAT IS A SCHEMA IN POSTGRESQL?". This is from cybertec-postgresql.com. And whenever I think of schema, I think of a namespace. So it's just a namespace for your database. But this post talks about a hierarchy of how your data is set up. Generally, you have your instance. So this is a version instance of Postgres. And on that instance, you can place one or more databases. And within each database, you can have one or more schemers or, as I like to call them namespaces.

And within that schema or namespace, you're going to have many tables. And then those tables of course have many rows. Now how they explained schemas is that they give you a grouping for your tables. And by default, Postgres has a public schema. So any object you create in the new database will be created in the public schema. Now you can just query the table using the table name, or you can also state the schema name, dot, and the table name. So essentially, you were asking for the table that exists in the public schema. The post also shows you how you can create new schemas.

And when you have that, you need to use a different namespace if you're going to contact that table because you can have the same table name in different schemes. Now as opposed to namespacing every table, you can actually set a search path to only look at a particular schema. So then you no longer need to specify the schema with a table name. They have a little discussion about using schemas as well as how to change their name as well as drop them. So if you're interested in learning more about schemas, you can check out this blog post.

Next piece of content-"pgsql phriday #009 roundup". This is from di.nmfay.com. And these are all the posts that happened during pgsql phriday. The first post we'll talk about is "DATABASE CHANGE MANAGEMENT VS PERFORMANCE". This is from pgmustard.com. And I actually liked his 3 essentials for database change management. Number 1- version control with the ability to branch. So as you're wanting to make database changes you want that under version control so you can track what's happening.

Number 2- continuous integration with unit tests. So ideally, set up testing so that you can validate that the application works as you intended or the schema is set up as you intend. And Number 3- automated deployments being the primary way of modifying production. In other words, you shouldn't just be logging into production and doing something. There should be some sort of tooling behind it that actually runs the change or the migration on the database.

Now then he goes into performance and that's definitely something to be wary of as you're making changes. You want to make sure you don't impact performance. And while that's very hard to do, he also talks about his product pgMustard, which helps with that once changes have made it into production and you can hopefully find the root cause with regard to it.

The next post is "Database Change Management". This is from andyatkinson.com. And he's a Rails developer, like myself, and he essentially talks about how Rails migrations, or active record migrations, handled this. From setting up the changes you want to make and how they're monitored and deployed into the system. Now he makes a very important point here is that any modifications you make, you don't want to impact running production.

And he lists a tool here called Strong Migrations that helps to identify things that could cause unintentional locking of your database. So basically whenever you do a database migration you probably don't want to make anything worse. You only want to make things better. Now one thing I didn't see in here, and maybe it's part of what the strong migrations does, is using a lock timeout.

So for the vast majority of migrations, you would want to use a lock timeout to prevent any modification that you're doing from causing a lock queue and blocking things up. So definitely strongly advise using that in your migration toolset if you're not already. One of the questions asked as part of the blogging event was also rolling back and he says quote "I don't normally roll back a schema change." and this is for me as well. I basically roll forward and there's someone else that talks about that more in-depth in the next post.

The next post about this is "10 REQUIREMENTS FOR MANAGING DATABASE CHANGES". This is from softwareandbooze.com. This is a pretty long post and he is listing 10 requirements, so that's much more than the three of the first post. But I believe he is affiliated with a vendor that provides the tool that does this. So that may be why this is much more comprehensive. But me being a Rails developer, that's the only tooling that I think I need. Many of these posts do list other migration tools like Flyway and Squitch, I think. But thus far, Rails migrations have worked for me very well so far. But definitely check out this post if you're wanting something more comprehensive.

The next post on this topic is "ON ROLLBACK". And this is from scarydba.com. So talking about rollback, he actually said that it was super rare that they actually rolled back anything. And what he advocates is using Roll-Forwards, where basically you don't roll a change back, just do a new migration with the change you want to make. So just make it easy to make those changes when possible. So that way you don't have to roll back. You can just roll forward. Create a new change that corrects if anything bad happened, and then correct the data after the fact if you need to.

Next post on this topic is "Database Change Management" from opensource-db.com. And he references a Flyway again as well as Liquibase. And he mentioned six things he likes about Flyway in terms of simple configuration, version control, migration scripts, repeatable migrations, and integration automation and compatibility. So you can check that out if you're interested.

And the last post on this topic is "three big ideas in schema evolution". This is from di.nmfay.com. She's talking about Squitch in terms of orchestration and Graphile-Migrate in terms of Idempotence and then Migra in terms of Comparison, like doing diffs between different schema states. So definitely check out these blog posts if you want to learn more about migrations and change management.

Next piece of content- "SQL Recipe: Compare with Neighbors". This is from antonz.org. And he's talking about the scenario where you want to maybe compare sales from one month to a previous month. Or compare financial results from one period to the same period last year. So any type of those comparisons, you could use a lag() function over a SQL window. Now this isn't Postgres-specific, but he explains how to do it in different ways. So if you're interested in that, you can check out this blog post.

Next piece of content- "TIL- Filling prepared statement placeholders automatically with pgbadger". This is from kmoppel.github.io. And he's talking about the issue when you're looking at prepared statements. Generally in the logs, it will show the parameterized query and then the actual values of those parameters in a second line.

And so you basically have to manually put those parameters into that query before you run it, or you need some sort of custom script to do it. Well apparently, there's a setting in pgbadger that allows you to dump all queries that merge the values into the parameters. So you actually get queries that you can run in your console. So check out this blog post if you want to learn more about that.

Next piece of content- "Using NGINX as a PostgreSQL Reverse Proxy and Load Balancer". This is from highgo.ca. So a lot of times, I've seen HAProxy used as a load balancer. Well, this post explains how to use NGINX to do the same thing. So you can check this blog post out if you're interested in that.

Next piece of content, "CI/CD with Crunchy Postgres for Kubernetes and Argo". This is from crunchydata.com. So if you use Postgres Kubernetes and you want to set up a CI/CD system, maybe you're interested in checking out this blog post.

Next piece of content- "Please welcome Pg_index_watch- a utility for dealing with index bloat on frequently updated tables". This is from dataegret.com. And this is a new utility that actually monitors your database. So similar to what AUTOVACUUM does, but it actually tells you when your indexes are bloated and can kick off indexing jobs. So if you're interested in that, you can check out this blog post.

Next piece of content. There was another episode of Postgres FM Last week. This one was on "Zero-downtime migrations". So you can click here to listen to the episode or you can watch it on YouTube down here.

Last piece of content, the PostgreSQL person of the week is Sharan Foga. If you're interested in learning more about Sharon and her contributions to Postgres, definitely check out this blog post.

Upgrade Struggles, Hyperloglog, PgBouncer Usage, Postgres Scaling | Scaling Postgres 269

Join Over 1,000 Engineers & Get New Episodes Weekly!

Content Discussed

YouTube Video

Podcast Audio

Transcript