Redis vs Postgres, Hard Quadrant, Optimizing Joins, Materialized Views | Scaling Postgres 170

In this episode of Scaling Postgres, we discuss Redis vs. Postgres, working in the hard quadrant, how Postgres optimizes joins, and working with materialized views.

Transcript

All right. I hope you, your friends, family, and coworkers continue to do well. Before we get started with our first piece of content, I wanted to mention that I will be starting a new live show called The Rubber Duck Dev Show. This show will be all about software development. It'll be me as well as Chris, so if you're looking for more developer-focused content, maybe you want to check it out. The last link in the Scaling Postgres show notes will take you to the channel trailer. This will be a livestream show where we can accept questions, so be sure to join us if you're interested.

Our first piece of content is "Do You Really Need Redis? How to Get Away with Just PostgreSQL". This is from atomicobject.com and he's talking about certain use cases where people tend to use Redis, but that Postgres can handle as well. The first use case he mentions is Job Queuing. So frequently you have multiple worker boxes that run asynchronous jobs, typically in a web application, and a lot of times Redis is used for this purpose. However, you could use PostgreSQL, most notably by using SKIP LOCKED when selecting the records to work on. So here's an example: within a transaction, a CTE selects the next pending job using FOR UPDATE SKIP LOCKED. When you then do the update to indicate the job is running, it will only pull one job and won't pull any rows that are currently locked by some other process.
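As a rough sketch of that pattern, here's what the dequeue step might look like, assuming a hypothetical jobs table with status and created_at columns (the table and column names are mine, not the article's):

```sql
-- Claim the next pending job without blocking on rows other workers hold.
WITH next_job AS (
    SELECT id
    FROM jobs
    WHERE status = 'pending'
    ORDER BY created_at
    LIMIT 1
    FOR UPDATE SKIP LOCKED  -- skip any rows locked by other workers
)
UPDATE jobs
SET status = 'running'
FROM next_job
WHERE jobs.id = next_job.id
RETURNING jobs.*;
```

Each worker that runs this claims a different pending job, because locked rows are simply skipped rather than waited on.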

Because presumably you have multiple boxes with multiple processes that are processing jobs, this prevents them from stepping over one another. Now, he says at some scale this locking does break down and may become less efficient, but for most of the apps he's worked with this has been sufficient. One reason why Redis might have an advantage here is that it doesn't need to worry about stepping on other processes or coordinating locks on the data: it's single-threaded, so there is only one thread accessing the data it needs to work with. The next use case is Application Locks. So you can establish a lock on something, and you can use something in Postgres called advisory locks to do the same. I've actually known some clients that run a separate Postgres instance whose only responsibility is advisory locks.

It didn't really store data for them, but it was a separate instance dedicated to this feature. The next use case is Pub/Sub, so being able to publish and subscribe to notifications. PostgreSQL has the LISTEN/NOTIFY capability, so you could use this in lieu of Redis.
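For reference, here's a minimal sketch of both features; the lock key and channel name are arbitrary examples, not from the post:

```sql
-- Application locks via advisory locks, keyed by an arbitrary bigint:
SELECT pg_advisory_lock(42);      -- block until the lock is acquired
SELECT pg_try_advisory_lock(42);  -- non-blocking variant; returns true or false
SELECT pg_advisory_unlock(42);    -- release the session-level lock

-- Pub/Sub via LISTEN/NOTIFY:
LISTEN job_events;                      -- subscribe to a channel
NOTIFY job_events, 'job 123 finished';  -- publish a payload to subscribers
```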

Now, Redis does have a lot of advantages, and some of the things he mentions here are caching data with particular TTLs and storing and manipulating ephemeral data. I'll also add one thing that Redis excels at compared to Postgres: if you want to rapidly update a single row, a single count, or a total value, it's much better to do that in Redis, because when Postgres does that update, it doesn't update the row in place, it creates a new row version. Therefore you could have a huge bloat problem if all you're doing is updating a particular value super fast. In that use case, maybe Redis would be better. But the interesting thing is that Postgres may eventually support different storage engines. When something like zheap is ready, maybe you could create a table that utilizes zheap, and then you wouldn't have that type of bloat issue when you just want to update a value really fast. But this blog post was an interesting comparison of Redis versus Postgres and how Postgres has some features that people have typically used Redis for. So I encourage you to check it out.

The next piece of content is "The Hard Quadrant". This is from momjian.us and he's talking about how frequently he has to work with locks and shared data, and the complexity inherent in that. Going back to the previous post, Postgres has multiple processes that could be accessing the data at any one time, and it handles that by locking rows or tables or things of that nature. Redis, on the other hand, is single-threaded, so it doesn't need any locks: a single worker thread accesses the data it needs, so there's no need to lock anything. So by its nature, Postgres is going to be more complex in that area. He references a talk that was given and shows a slide that he feels expresses the reason for the complexity. That slide shows a four-quadrant image with two different axes.

One axis is whether data is unshared or shared; the other is whether that data is mutable or immutable. So basically, if you have things that are immutable, it doesn't matter whether they're shared or unshared; there's no need for synchronization because the data is never going to change. And when something is mutable, like changing someone's email address, but nothing is shared, as in Redis, there's also no synchronization; it's pretty easy to handle. But if you have multiple processes that need access to something mutable, then you get a lot more complexity. So it's this shared-mutable quadrant that causes a lot of complexity, and whenever you can make something immutable, or unshared, it becomes easier to work with. So I thought this was a very interesting concept and I encourage you to check out this blog post as well as some of the talk.

The next piece of content is actually a YouTube video, "PostgreSQL Optimizer Methodology". This is from Robert Haas and it was posted on the CMU Database Group channel. Specifically, he's talking about optimizing joins and how the optimizer works through choosing the proper join for a query. He also covers the statistics gathered by ANALYZE and ways you could actually fool the optimizer depending on the queries that you send to it. So this was a very interesting deep dive into how the optimizer works, and if you're interested in that, I highly encourage you to check this video out.

The next piece of content- "CREATING AND REFRESHING MATERIALIZED VIEWS IN POSTGRESQL". This is from cybertec-postgresql.com. He's talking about materialized views. A standard view is a virtual table that never stores the data it presents; basically, its query runs dynamically every time. A materialized view actually creates a table out of what you are querying. So for example, he creates a base table, inserts a number of rows into it, and then creates a materialized view from that table. Now, the data in the materialized view is minimal, just two rows. But that is what a materialized view is: it saves that data so it doesn't have to go back to the main table again like it would with a standard view. The thing to keep in mind is that a materialized view is just another table. So, for example, you're going to want to vacuum it on a regular basis and analyze it on a regular basis.

So, for example, if you've just created it, maybe you want to run a VACUUM ANALYZE on it. Now, one of the things he mentions here is that, because it is a standard table, you can also use the USING clause, which specifies which storage format you want to use. Right now there is just the heap, but again, in the future they're talking about having zheap or zedstore, so eventually you could store it in a different format. You can also set different table parameters; for example, this one has actually disabled autovacuum for the table. So that's something you could choose to do. You can also create it in its own tablespace. Lastly, you can modify the materialized view with an ALTER MATERIALIZED VIEW statement.
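Pulling those options together, a creation statement might look something like this; the view, table, column, and tablespace names here are hypothetical:

```sql
-- Create a materialized view with an explicit storage format, autovacuum
-- disabled, and its own tablespace (all names are made up for illustration).
CREATE MATERIALIZED VIEW sales_summary
    USING heap
    WITH (autovacuum_enabled = off)
    TABLESPACE fast_disk
AS
    SELECT region, sum(amount) AS total
    FROM sales
    GROUP BY region;

-- Storage parameters can be changed later:
ALTER MATERIALIZED VIEW sales_summary SET (autovacuum_enabled = on);
```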

Then, of course, the most important thing that people tend to do is refresh the materialized view, because a materialized view's data is static as of the time it's created and is never refreshed automatically. So you have to do that manually. For example, you just say REFRESH MATERIALIZED VIEW and give it the view name. But the problem with that is it locks the materialized view while the refresh runs. So typically what you would want to do is refresh it concurrently. But what he does mention here is that in order to do it concurrently, you need a unique index on the materialized view. So be sure to create one if you want to refresh it concurrently. This is a great post all about materialized views, so if you're interested in learning more, definitely check out this blog post.
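Continuing the hypothetical example from above, the refresh pattern looks like this:

```sql
-- A plain refresh locks the materialized view while it rebuilds:
REFRESH MATERIALIZED VIEW sales_summary;

-- A concurrent refresh requires a unique index on the materialized view:
CREATE UNIQUE INDEX ON sales_summary (region);
REFRESH MATERIALIZED VIEW CONCURRENTLY sales_summary;
```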

The next piece of content- "Fun with pg_checksums". This is from blog.crunchydata.com, and checksums are meant to detect corruption of the Postgres data that's on disk. Now, by default these are not enabled, unfortunately; typically you enable them at the point at which you create the cluster. I haven't frequently worked with clients that have this enabled. And right now, even with the different operating system installers, checksums won't be enabled when you install Postgres unless you specify it. Now, there is some performance impact with enabling them, so you'll want to evaluate what that is for your workload.

But if you are concerned about the integrity of your data files, you would definitely want to enable them, and this post goes through the process of how you can do that. You can enable checksums after your database has been created and running, but you do have to shut down your database and then let pg_checksums run through all the data files to write the checksums. So on a terabyte-sized database, that can take quite a bit of time. The post goes through that process of enabling checksums for your database and even tracks the timing, using the time command to see how long it would potentially take.
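As an aside, you can check whether a cluster already has checksums enabled with a simple query:

```sql
-- data_checksums is a read-only setting; it reports on or off depending on
-- whether the cluster was initialized (or later converted) with checksums.
SHOW data_checksums;
```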

Now, they did mention one thing here: they have a patch proposed for Postgres 15 that may allow you to apply checksums in a piecemeal manner. So, for example, shut down the database, run pg_checksums for a while, stop it, start the database again, and then try again later to incrementally build up the checksums. But we'll see what happens with that. A great feature would be to be able to apply a lock on a given file, run the checksums for it, and then release the lock, all while the database is running. But I haven't heard of anyone working on a patch to do that yet. If you want to learn more about pg_checksums, you can definitely check out this blog post.

The next piece of content- "ABOUT CROSS JOIN IN POSTGRESQL". This is from cybertec-postgresql.com. They're talking about cross joins, which produce the Cartesian product of two tables when you join them. So a cross join of two tables, one with three rows and one with two rows, gives you six rows of data. If you use the comma join syntax in the FROM clause, by default it's going to do a cross join unless you put a WHERE clause in there, for example WHERE a.id = b.id. Forgetting that WHERE clause can be pretty dangerous if you have a lot of rows in the tables.

So for example, because it's a product of the total number of rows in each table, if you were to do a cross join of a million-row table with another million-row table, you would actually get a trillion rows back. So a lot of times a cross join is not what you want. What they mention here is that the safest way to avoid accidentally doing one is to use the explicit join syntax: INNER JOIN, CROSS JOIN, LEFT OUTER JOIN, et cetera. That's a safer way to do it to avoid mistakenly doing a cross join. But they do say there are some use cases, such as a lateral cross join or joining against a single value, where maybe you do want to use a cross join. So if you want to learn more about them, you can check out this blog post.
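To illustrate the difference, using two hypothetical tables a and b that each have an id column:

```sql
-- With the comma syntax, forgetting the WHERE clause silently produces
-- a Cartesian product:
SELECT * FROM a, b;                    -- cross join: 3 rows x 2 rows = 6 rows
SELECT * FROM a, b WHERE a.id = b.id;  -- effectively an inner join

-- The explicit syntax makes the intent unambiguous:
SELECT * FROM a INNER JOIN b ON a.id = b.id;
SELECT * FROM a CROSS JOIN b;          -- the Cartesian product, stated explicitly
```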

The next piece of content- "Postgres 14 highlight- Monitoring for COPY". This is from paquier.xyz, and Postgres 14 introduces a new view called pg_stat_progress_copy that tracks the status of a COPY while it's in progress. So if you're copying data into or out of Postgres and it's going to take a particularly long time, you can now track its progress with this view. If you want to learn more about this, definitely check out this blog post.
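For example, from another session you could watch an in-flight COPY with something like this; the column list below is a subset of what the view exposes:

```sql
-- Each row is one backend currently running a COPY (Postgres 14+).
SELECT relid::regclass AS table_name,
       command,          -- COPY FROM or COPY TO
       type,             -- FILE, PROGRAM, PIPE, or CALLBACK
       bytes_processed,
       tuples_processed
FROM pg_stat_progress_copy;
```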

The next piece of content is a YouTube channel. I'm highlighting the Percona YouTube channel again because, over the last week, they've posted a number of new videos, some on Postgres. So if you're interested in more video content, check out this YouTube channel.

The next piece of content is "Promoting specified node in Pgpool-II". This is from pgsqlpgpool.blogspot.com. This covers a new feature in Pgpool-II that lets you specify which node should become the primary. So historically, you would run with a primary and some standby nodes with Pgpool-II, but you could never explicitly tell it to promote a given node; it would automatically choose which node to promote. With this new feature in 4.3, you can specify the exact node you want to promote. So if you're interested in learning more about that, check out this blog post.

The next piece of content- "How (and why) to Become a PostgreSQL Contributor". This is from blog.timescale.com. This post runs through how you can become a contributor to PostgreSQL code development, walking through different steps: identifying your motivation (why would you want to do it?), learning about the Postgres development process, identifying the first patch you want to submit, and how to contribute while avoiding common mistakes. So if you're interested in becoming a contributor or learning more about the process, definitely check out this blog post.

The last piece of content, the PostgreSQL person of the week is Josh Berkus. So if you're interested in learning more about Josh and his contributions to Postgres, definitely check out this blog post.
