
Scaling Postgres Celebration | Scaling Postgres 300


In this episode of Scaling Postgres, we look back over 300 episodes, discuss a way to find locking processes, review disadvantages of cloud-hosted solutions, and cover how to partition with UUIDv7.


Transcript

Welcome to the 300th episode of Scaling Postgres. I can't believe it's been almost six years since I did the first episode. Officially, it was February 27, 2018, and here's what episode one was about: petabyte scale and locking. I think the Postgres version at the time was version 10, so it's been quite a while. What's amazing is looking at how many links used to be in the episodes; episode one has, I think, eight links, whereas in preparing the content for this week, I think there are close to 30 links. So even though it's great to celebrate the 300th episode, Postgres itself has come a long, long way. We've had six new versions, and the amount of content produced every week has at least doubled, if not more. So that's incredible. There are so many more companies now that weren't there in the past supporting Postgres or designing Postgres solutions. So that's just phenomenal. But I would also like to thank you, the viewer, or the listener if you listen to the podcast version of the show, because without you, I wouldn't have kept doing it. So thank you so much for that. I greatly appreciate it.
As always, I hope you, your friends, family, and coworkers continue to do well. I think that's enough nostalgia. But what about moving forward? Of course, the thing I'm going to mention is my course, which I said was going to be coming in 2024. I projected the end of January. I'm not going to quite make that, but I can now say it will definitely open on February 26th, 2024. So the course open date is in about four weeks or so. Of course, this will be available to all the early adopters who signed up for it. But if you're looking to get in on the course, there will be opportunities to do so. The sales page hasn't been prepared yet, but it's very close, and I thought I'd give you a sneak peek of some things that I'm working on.

So my plan was to create as real a database as I could with two terabytes of data in it. As you can see, this is the database I'm working with currently. It is just over two terabytes, and I have a little snapshot here of the different tables in it. So there's an events table for capturing page views and things of that nature that you can see has 5 billion rows in it. There's an email transmissions table with a billion rows, a payments table with 150 million rows, about 20 million users, and about 4,000 accounts. So this is kind of going to be like a software-as-a-service application, and I'll provide more details about what the application is as we move forward into the course open date.

But I thought I'd run a query here really quick, because you'll see in some of the content this week people were talking about their really large table being 20 million rows, and I was like, I've got a 5 billion row one. So I just ran a query to see how fast it would return. What I'm doing is just pulling a month of data by looking at a particular account and a count of the user ID, which is not part of any index. Uncached, it ran in 138 milliseconds, whereas cached it was at 1.8 milliseconds. And I should say there are, relatively speaking, hardly any indexes on the database yet. So this size will only grow. I don't want to make it too large, but it is over the two-terabyte threshold at this point. I should say that a lot of this is subject to change. I can recreate the database if I need to make alterations as I'm finishing up the course. But that was just a little sneak peek.
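To give a sense of the shape of that query, here is a rough sketch. The table and column names are my assumptions for illustration, not the actual course schema.

```sql
-- Illustrative only: table and column names are assumptions, not the course schema.
-- Pull one month of data for a single account and count a column that is not
-- covered by any index.
SELECT count(user_id)
FROM events
WHERE account_id = 1234
  AND created_at >= '2024-01-01'
  AND created_at <  '2024-02-01';
```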

Let's go on to the first piece of content, which is "One PID to Lock Them All: Finding the Source of the Lock in Postgres". This is from crunchydata.com. If your system is having issues and it looks like it's because it's locking excessively or it has a lock that is really long-running, these are some techniques to be able to find it. So the first place you want to look is pg_stat_activity, and specifically look at the wait_event/wait_event_type to see what's waiting, and look for things that are not idle. So you'll see active ones and also idle in transaction ones. Here are the results of querying this view. If we look at the state, you do see an active one and then one that's idle in transaction, and you can see the one that's active has a wait_event_type of Lock and it's waiting on a relation. Now it's probably pretty obvious which one is doing the blocking here, but if you have a system with hundreds or multiple hundreds of processes, it could be hard to identify this.
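As a minimal sketch of that first step (not necessarily the exact query from the post), something like this shows every non-idle session and what it is waiting on:

```sql
-- Show sessions that are not idle, along with what they are currently waiting on.
SELECT pid, state, wait_event_type, wait_event, query
FROM pg_stat_activity
WHERE state <> 'idle';
```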

So the next thing they propose is to find which PID is locking the table, meaning which process holds the lock on the relation. So you go to the pg_locks view, you put in the process ID of the one that's blocked, and look for where the lock is not granted. It'll give you this row, and then you can see the relation that it's trying to get access to but can't. This may be obvious if you look at the query, but this tells you right here. Then you query pg_locks again using the relation OID, this time where the lock is granted. You can see this is the process that has that lock, and you can look at the PID for it. So this is the one that's holding the lock. Then you can query pg_stat_activity again for this PID and you get the exact process that has the lock and has not released it yet. This was the same one that we saw earlier, just because there are not as many processes.
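Roughly, those three steps look like this; the PIDs and the relation OID are placeholders you would swap in from your own output:

```sql
-- Step 1: find the ungranted lock for the waiting process (replace 4321 with its pid).
SELECT locktype, relation, mode, granted
FROM pg_locks
WHERE pid = 4321 AND NOT granted;

-- Step 2: find which process holds a granted lock on that relation
-- (replace 16384 with the relation OID from step 1).
SELECT pid, mode, granted
FROM pg_locks
WHERE relation = 16384 AND granted;

-- Step 3: look up the holder's pid in pg_stat_activity to see what it is doing.
SELECT pid, state, query
FROM pg_stat_activity
WHERE pid = 9876;
```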

Now he says this is pretty straightforward to do for one example, but you can have chains of locks, where something is blocked by something else, which is in turn blocked by something else. What he's provided is a CTE query to be able to identify patient zero, the one that's causing all the locks to happen. You can see the output of this particular query here. I would definitely keep this to use in your own database. You can see all these active states that are blocked, and the culprit is this one up here. In this case, it's slightly different in that they're doing a SELECT FOR UPDATE. So this caused all these other updates and an ALTER TABLE statement to be blocked. Now he then goes on to ways to end the process holding the lock. If you're in the process you can do a commit or a rollback.
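I won't reproduce his CTE here, but as a simpler sketch of the same idea, pg_blocking_pids() can show, for every waiting session, which PIDs are blocking it:

```sql
-- Not the article's CTE; a simpler way to see lock chains. For every session that
-- is being blocked, list the pids blocking it, directly or via the lock queue.
SELECT pid, pg_blocking_pids(pid) AS blocked_by, state, query
FROM pg_stat_activity
WHERE cardinality(pg_blocking_pids(pid)) > 0;
```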

If you're outside of it, you can try to cancel the query or even terminate the whole backend. He talks a little bit about other causes of locks, but the other thing to mention is that this process of finding locks works great if they're long-running, but if they're very transient, or they happened 2 hours ago and are not happening anymore, how do you figure out what's going on? Well, that's where you're going to want to turn on log_lock_waits, because then the logs can lead you through the same process to identify the patient zero, or the culprit. As well, to minimize problems, especially when you're doing migrations, set a lock_timeout. That'll cancel a query if it's unable to do its work because of a locking situation within this particular time period. So that's a must-have for migrations in my experience. But if you want to learn more, check out this blog post.
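For reference, those pieces look something like this; the PID is a placeholder and the 5-second timeout is just an example value:

```sql
-- Cancel just the running query, or terminate the whole backend (placeholder pid).
SELECT pg_cancel_backend(9876);
SELECT pg_terminate_backend(9876);

-- Log any statement that waits on a lock longer than deadlock_timeout (default 1s).
ALTER SYSTEM SET log_lock_waits = on;
SELECT pg_reload_conf();

-- For migrations: give up after 5 seconds instead of queueing behind a lock indefinitely.
SET lock_timeout = '5s';
```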

Next piece of content- "Why Postgres RDs didn't work for us (and why it won't work for you if you're implementing a big data solution)". This is from Max Kremer on medium.com. He said they started using Postgres RDS and he says, quote "When you're storing a large time series table (say 20 million rows) and the size of a table on disk is 10 GB and you want to scan the table...". Basically, he wanted to get a report on throughput, which is very hard to do on RDS the way they've set up things. This is what I was talking about earlier, saying 20 million is large because, of course, I've seen a lot larger. They tried to improve their situation by increasing the IOPS of their EBS volumes, but it didn't really help with the performance that they were looking for. They then switched to Aurora and they were hoping for, quote-unquote "Promises of 5-7 X performance increases". 
Unfortunately, their costs went up instead of down, and apparently, they still weren't getting the performance they were looking for for the cost associated with it. So they went and rolled their own. They basically set up EC2 instances. They had a master or primary and a standby replica, and it says they were kept up to date using WAL-G. So I don't know if they were using streaming replication or log shipping; I'm not quite sure about that. They said the "WRITE node could be slow since it's only ingesting data". So apparently there's not a lot of data coming in on a daily basis. "EBS was good enough, but they were actually able to double the throughput using ZFS with compression". And then on the READ node, or the replica, they set up ephemeral NVMe drives, as well as ZFS, and I imagine that really boosted the read performance.

But also take a look at their bill. It went from $11,000 down to $2,100. I'm assuming that's monthly, so that's really great savings. That's a fivefold improvement. Also, "Queries that would take hours or timeout altogether run in seconds", so that's pretty incredible. And I must admit for my servers I still just use EC2 instances. I don't use RDS for anything at this point. But I will say with regard to ZFS, be careful because I have other clients that have set up ZFS for the storage with compression, and because there's only one apply worker with Postgres, the replicas are having a really hard time keeping up with streaming replication from the primary, so sometimes it delays significantly. So there may be some cases where ZFS and compression really help things, but it can also cause some issues from my experience. But check out this blog post if you want to learn more.
Next piece of content- "How to partition Postgres tables by timestamp based UUIDs". This is from pganalyze.com and in this "Five minutes of Postgres", he covers a forum post talking about where they used ULIDS to partition a table to get improved performance. That article is here in elixirforum.com and again, he said, quote "We had a very large table (28 million rows)..". So again that reference of about 20 million was really large, so I just thought that was interesting. But they were getting timeouts on querying anything other than the primary key to which I think 28 million is not really large and what perhaps was going on. There was probably an indexing solution to resolve this, but they went ahead and went forward with partitioning by ULID and they described how they did it, and they said queries went from multiple minutes to sub-seconds.

So I didn't see a lot of detail about what made the difference other than partitioning, because in my experience, partitioning either doesn't give you a performance improvement or it's only slightly incremental versus just having a large table. It's really great for easing maintenance, because vacuum and autovacuum can work much more efficiently on multiple small tables versus one large table. But going back to Lukas's video, he actually shows doing this with a development version of Postgres 17 using UUID version 7, which has a time component at the beginning and a random component at the end. He used range partitioning by day, looking at the time component of the UUID, and I thought this was fascinating. So I definitely encourage you to check it out, because this is probably how we're going to be doing this in the future. I can't wait for UUID version 7 to get into Postgres. It looks like it might get into Postgres 17, and he discusses that in the video. We'll just have to see.
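As a rough sketch of the idea, and not the exact setup from his video: because UUIDv7 values sort by their leading 48-bit millisecond timestamp, you can range partition directly on the uuid column, using boundary UUIDs that encode each day's start. The boundary values below are my own encodings of 2024-02-01 and 2024-02-02 UTC, and on released Postgres versions you would still generate the UUIDv7 values in the application or via an extension.

```sql
-- Sketch only: range partitioning on a UUIDv7 primary key by day.
CREATE TABLE events (
    id uuid PRIMARY KEY,   -- expected to be UUIDv7, generated by the app or an extension
    payload jsonb
) PARTITION BY RANGE (id);

-- Boundary UUIDs are the day-start millisecond timestamps (the first 48 bits of a
-- UUIDv7) with the remaining bits zeroed.
CREATE TABLE events_2024_02_01 PARTITION OF events
    FOR VALUES FROM ('018d61f7-1800-0000-0000-000000000000')
                TO  ('018d671d-7400-0000-0000-000000000000');
```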

Next piece of content- "PGXN creator David Wheeler joins Tembo to strengthen PostgreSQL extension ecosystem". This is from tembo.io and exactly as it says, David, who is the creator of the PG extension network, has joined Tembo with the objective of maybe making that the definitive extension repository for Postgres, or at least putting work towards doing that. They have a few objectives they've listed down here. One, be the canonical source of record for all Postgres extensions. Two, be super easy for developers to publish their extensions in the ecosystem. Three, make extensions easy to find and understand. Next, make extension documentation consistent, standardized, and comprehensive. Next, provide stable, easy-to-use, and comprehensive APIs that downstream developers can use to build other tools and products. Next, provide automatic binary packaging for a wide variety of platforms. Lastly, provide intuitive, powerful interfaces for managing extensions. So this is super interesting to me if we start using something like this for extension repositories to be able to easily install and uninstall extensions. So we'll just have to see.

The next post is from justatheory.com, and this is David writing about "I'm a Postgres Extensions Tembonaut". So he talks about his transition here as well.

Next piece of content, there was another episode of Postgres FM last week. This one was on "pgvector", and in this one, Michael was joined by Jonathan Katz, who's a PostgreSQL core team member and has done a lot of work on the pgvector extension. They talked all about pgvector, the two primary indexes that are being used now, ivfflat, as well as HNSW, and where potentially pgvector is going in the future. What I thought was interesting is that people have been advocating pgvector going in the core, but the thing that Jonathan mentioned is that maintaining it as a separate extension at this particular point in time helps it be more nimble compared to being in the core Postgres product. Meaning, Postgres does yearly versions, whereas they can iterate quite rapidly in an independent extension. So given that all this machine learning and artificial intelligence is so new, adaptability and being able to react quickly is highly beneficial. But it was a really great episode. Definitely encourage you to check it out.

Next piece of content- "Extensions for PostgreSQL, how we participate at Adjust". This is from engineering.adjust.com and they indicate down here they have about 30 extensions, most of them private, but 13 of which are public. So they are available. This is basically a directory of all the different extensions that they have available. One I thought that was particularly interesting is fist_last_agg. It basically gives you a function that lets you grab the first row when doing a group by, or the last row when doing a group by. They also have an istore which is basically a key-value store but is strictly for integers only. So that has a unique use case or they have multiple different enums like one set up for country or currency or a device type. So pretty interesting. You may want to check this out to see if there are any of these extensions you want to consider trying out.
Next piece of content- "PostgreSQL 17: copy and SAVE_ERROR_TO". This is from dbi-services.com and this is really great. So normally if you are importing data and there's some sort of error with the data. So for example this data layout gets imported with no problem. This data layout is going to have problems with a few rows here and you'll get an error with regard to it and it will just stop and not import anything. The table will be empty. But if you define a SAVE_ERROR_TO, it will load the data it can and then, I believe, save it to a location. Although in his example he saved it to NONE so I guess he just ignores it. But I assume this feature is supposed to write it to a separate file. Then you can make adjustments only in the file with the rows that have the issue, correct it there, and then upload those correctly. So I think that's a great addition. Check out this blog post if you want to learn more.

Next piece of content- "PostgreSQL is now a CVE Numbering Authority (CNA)". This is from postgresql.org and the Postgres security team can now submit CVEs on their own.

Next piece of content- "POSTGRESQL PARALLEL QUERY PROBLEMS IN JDBC AND DBEAVER'. This is from cybertec-postgresql.com. It was an interesting problem where parallel queries were not working when using this particular tool. DBeaver and if you look down here when they analyze queries from it, you can see two workers were planned but none were launched. So the planner thinks it can go ahead and do a parallel plan, but the execution does not use it. It didn't start any extra processes. There are some limitations for parallel query, but the particular problem with this one is that it uses the JDBC Driver. If you specify a limit at the protocol level here, it actually doesn't give you a parallel plan. The solution to it is to change it to zero and now the parallel plan works. So that's really strange. The solution if you need to limit the results, don't use this feature here, but just add a limit to your query if you're working on that.

The last piece of content- "POSETTE: An Event for Postgres 2024". This is from citusdata.com, and it is actually a rebranded Citus Con. So Citus Con is now POSETTE, which stands for Postgres Open Source Ecosystem Talks, Training, and Education. So definitely look for more information about that in the future.
