
New Postgres Releases, Privilege Escalation CVE, Chaos Testing, High Availability | Scaling Postgres 228


In this episode of Scaling Postgres, we discuss new Postgres releases, a new privilege escalation CVE, chaos testing a high availability Kubernetes cluster, as well as addressing other high availability questions.

Content Discussed

YouTube Video

Podcast Audio

Transcript

All right. I hope you, your friends, family, and coworkers continue to do well. Our first piece of content is "PostgreSQL 14.5, 13.8, 12.12, 11.17, 10.22, and 15 Beta 3 Released". So quite a few releases. This is part of a regular release schedule, so basically, all the recent versions of Postgres have been updated. They also give an FYI that Postgres 10 will stop receiving fixes on November 10, 2022. So that's coming up rather quickly. If you're still on Postgres 10, you'll want to upgrade to at least version 11, if not further. They mention they resolved over 40 bugs with these releases, as well as fixed the security issue covered next. And finally, this marks the third beta of Postgres 15, so you can download and test that out if you're interested. But if you want to learn more, definitely check out this post from postgresql.org.

The next piece of content is related to the CVE that was fixed in these releases, in the article "PostgreSQL Extensions Impacted by CVE-2022-2625 with Privilege Escalation". This is from enterprisedb.com and it goes into more depth about this vulnerability. It's rated a 7.1, which is a little on the high side, so you're probably going to want to patch this as soon as you can. The post works through the vulnerability. Basically, there are two scenarios: a rogue extension that you install could acquire more permissions than it should, or an existing user could set the stage so that, when an extension is installed, they gain a privilege escalation.

It walks through the CVE and its ramifications in more depth than the Postgres announcement. Now, if you're not able to patch relatively soon, there are different mitigation strategies you can use that they go into in this post. One of them is a convenient query to determine whether any extension objects are owned by a user other than the extension owner or a bootstrap superuser. If any are found, you have to investigate what security impact that has, and they go over some mitigation strategies to handle it if you're not going to be patching right away. So if you're interested in that aspect, definitely check out this blog post.
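To give a rough idea of what that kind of ownership check looks like, here is a minimal sketch limited to tables and other relations. It is not the exact query from the EDB post, just an illustration of the approach using the system catalogs:

```sql
-- Sketch only, not the query from the post: list relations that belong to an
-- extension but are owned by a role other than the extension's owner.
-- A fuller check would also cover other object types (functions, etc.) and
-- allow objects owned by the bootstrap superuser.
SELECT e.extname,
       c.relname         AS object_name,
       obj_owner.rolname AS object_owner,
       ext_owner.rolname AS extension_owner
FROM pg_extension e
JOIN pg_depend d
  ON d.refclassid = 'pg_extension'::regclass
 AND d.refobjid   = e.oid
 AND d.deptype    = 'e'                 -- extension membership dependency
JOIN pg_class c
  ON d.classid = 'pg_class'::regclass
 AND d.objid   = c.oid
JOIN pg_roles obj_owner ON obj_owner.oid = c.relowner
JOIN pg_roles ext_owner ON ext_owner.oid = e.extowner
WHERE c.relowner <> e.extowner;
```

Because ownership lives in a different catalog column for each object type, a complete check has to repeat this pattern for functions and other extension members, which is what makes the query in the post more involved.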

Next piece of content- "Chaos testing of a Postgres cluster managed by the Zalando Postgres Operator". This is from coroot.com and they're talking about the Zalando Postgres operator. So from what I understand, Zolando is a fashion site. So it appears their IT department has developed this Postgres operator for using Kubernetes to manage Postgres installations. And they're using this particular Postgres operator to set up a Postgres cluster. And they're using their tool, Coroot, to monitor it, while they use chaos testing to test different failure scenarios and see how it reacts. So I believe this is how they set up their Postgres cluster using the Zalando operator and they're creating three instances.

One is a primary with two replicas, and they have different supporting servers as well as the application, with lines signifying direct traffic. I believe this diagram is part of the Coroot application that's monitoring things. Now, normally I wouldn't look at such a product-heavy post, but this post is really cool in that they do a test like 'All right, let's simulate a replica failure'. So they fail a replica, using a network chaos toolkit to do that, but then they have a video representation of what happens. So this is what happens when you isolate a replica: you can play it, and it shows the connections dropping from that replica and the active connections on the still-existing replica.

So you can see in real time how the operator handles these failure scenarios. Then they go over what happens when the primary fails; they even have a video of that. They show what happens if you have excessive replication lag, which they simulate by introducing 80% packet loss on traffic to the replica, causing it to delay and fall behind. Then they add a connection pooler and break connectivity to the pooler, and they even do a major Postgres upgrade to see how that impacts the cluster. So even though this shows a lot about a particular product, it's fascinating to see how the operator handles these different failure scenarios. And if you're interested in that, definitely check out this blog post.
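If you want to watch the replication lag part of a test like this yourself, plain streaming replication already exposes it on the primary. Here's a minimal query assuming a standard streaming setup; it's not specific to the Zalando operator or Coroot:

```sql
-- Run on the primary: per-standby replication lag from the standard
-- streaming-replication statistics view.
SELECT application_name,
       state,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes,
       replay_lag                                        -- time-based lag
FROM pg_stat_replication;
```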

Next piece of content- "How High Availability Works in Our Cloud Database". This is from timescale.com, and they're going over at a very high level how they handle high availability for their cloud solution. Timescale Cloud is what they call it now. Basically, they're highlighting the feature set that kind of already exists with Amazon AWS. They're emphasizing how they separate their compute nodes from their storage nodes and that if you have a failure of a compute node, you can actually take the storage node and put it on another compute node. Which is how Amazon basically works. And given that the availability of the compute nodes is much lower than the storage nodes, 9 times out of 10, if you have a failure, it's going to be in one of the compute nodes and basically, there's a scenario to replace them and put the storage node on another compute node.

But again, that's pretty much Amazon basics. As for how they do that flip-over, they do say they're using Kubernetes for some of this. Now, they didn't mention whether they're using a particular operator or how they're using it, but they say there may be a future blog post about it, so that may have more detail. Again, this was a more high-level post. Then they talk about storage, how they have a load balancer that directs where your read and write traffic goes and how it routes traffic between different nodes, and it looks like it is using the standard WAL streaming mechanism. They talk about what happens when a failure occurs, basically doing a promotion, and the last area they cover is backups and how they handle those. So if you want a little high-level insight into how Timescale Cloud manages its databases, you can definitely check out this blog post.
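For context on the promotion step, plain Postgres streaming replication exposes the same building blocks. This is just a generic sketch, not how Timescale Cloud actually orchestrates its failovers:

```sql
-- On any node: returns true on a standby, false on a primary.
SELECT pg_is_in_recovery();

-- On a standby (Postgres 12+): promote it to become the new primary.
SELECT pg_promote();
```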

Next piece of content- "PG Phriday: Do's and Don'ts of Postgres High Availability Q&A". This is from enterprisedb.com, and there was a post talking about Postgres High Availability, the Do's and Don'ts back at Postgres Build 2021, but there were some questions that weren't addressed, and they're basically going through those unaddressed questions there. One is talking about split-brain scenarios, one is talking about the ordering of starting a replica in primary. What order should you start in them? Could that trigger a failover? Talking about different backup commands when you have a standby that's falling a little bit far behind and potentially causing WAL to accumulate on the primary. Using multi-host connection strings as well as other failover solutions such as pg_auto_failover and talking about that a little bit and then actually some Patroni questions as well for tools for doing high availability. So if you want to learn more about that, definitely check out this blog post.

Next piece of content- "5mins of Postgres E30: Postgres performance puzzle, Linux readahead and tuning effective_io_concurrency". This is from pganalyze.com. And this is the performance puzzle that I covered in the last two episodes of Scaling Postgres. Lukas basically addresses this in his episode as well and actually talks about how it relates a little bit to the setting of the effective I/O concurrency and how that can potentially impact Linux read ahead. Now, I will say one thing that I was a little confused by is that he states one of the reasons for the slowness is that the "...index layout is ineffective".

I was a little confused by that because the index will be an ordered structure of those values, so I don't see it as necessarily ineffective. The reason for the slowness is actually that you scan the index and then need to go to the heap because, in this performance example, you're counting rows from the heap, not the index. So whether you use the first index or the second index, if you're only pulling back data from the index itself, it'll be equally fast. The slowness comes when there's low correlation between the ordering of the index and the heap. The first index is highly correlated, in terms of order, with the data on the table, which is what makes it fast and able to do essentially a sequential scan.

The correlation between the order of the second index and the order of the data on the table is very low. Therefore, when doing a search, you can still scan the index quickly, but when you actually need to go to the heap, you're doing random I/O all over the place. If you do just an index-only scan, counting the rows in the index is super fast; it's when you go to the heap that things get super slow, because of all that random I/O. So I think the issue is that the correlation of the index to the data in the heap in the slow example is very low, and that's what causes the problem. But if you want to learn more about that, you can definitely check out this blog post.
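If you want to check this sort of thing on your own tables, Postgres exposes the planner's view of correlation in pg_stats. Here's a minimal sketch; the table and column names are hypothetical, not the ones from the puzzle:

```sql
-- correlation ranges from -1 to 1; values near 1 or -1 mean the heap's
-- physical row order closely matches the column's sort order, so an index
-- scan on that column touches the heap mostly sequentially.
SELECT attname, correlation
FROM pg_stats
WHERE tablename = 'test_table';        -- hypothetical table name

-- Compare plans and buffer counts for a well-correlated versus a
-- poorly-correlated column to see where the random heap I/O shows up.
EXPLAIN (ANALYZE, BUFFERS)
SELECT count(*) FROM test_table
WHERE col2 BETWEEN 1000 AND 2000;      -- hypothetical column

-- effective_io_concurrency (settable per session) controls prefetching for
-- bitmap heap scans, the knob discussed alongside Linux readahead.
SET effective_io_concurrency = 200;
```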

Next piece of content- "HAVING, a less understood SQL clause". This is from smallthingssql.com. Basically, how I explain HAVING is it's the where clause that happens after a group by because basically, you group a set of data. Then if you want to further filter it and say only show me the rows where the total is greater than 100, that's what HAVING is used for. It's essentially a where clause for the rows that get generated by a group by. Now, in this example, he comes up with a pretty complex example where he has a couple of tables and to get the data he's interested in without using HAVING, he did this very complex 4 union, really a ton of unions to be able to get the data necessary. He filters this big long query into this very simple query by introducing the HAVING clause as well as doing the group by a cube also helped eliminate some of the complexity. But if you want to learn more about that, definitely check out this blog post.

Next piece of content- "Queries in PostgreSQL: 6. Hashing". This is from postgrespro.com and they're talking about hash joins and how they work in Postgres. And these blog posts on PostgresPro definitely are super comprehensive. So if you really want to know all there is to know about hash joins, definitely encourage you to check out this blog post.

Next piece of content: the next episode of Postgres FM has been posted, and they've been doing a really good job of posting episodes every week. Before, with Postgres TV and Postgres FM, it was a little more inconsistent, but for over a month now they've been pretty consistent about weekly episodes. So for now I'll post the episode link, but definitely check this out. Interestingly, they don't have a link to the Postgres TV YouTube because they're posting these on the Postgres FM YouTube channel as well. But if you're looking for more long-form content, you can definitely check out these episodes. The most recent one was on "Vacuum", as you can see here.

The next piece of content. The PostgreSQL person of the week is Adam Wright. If you want to learn more about Adam and his contributions to Postgres, definitely check out this blog post.

The last piece of content: we did have another episode of The Rubber Duck Dev Show this past Wednesday evening. This episode was on "Open Source Experiences And The Pay Gem With Chris Oliver". He's done many projects, but one of them is the Pay Gem, a Ruby library that helps you abstract away different payment providers and provides a common interface for working with them. If you're looking for a more long-form developer discussion, we encourage you to check out our show.
