
Man-In-The-Middle, pg_auto_failover, Lesser Known Features, LZ4 Compression | Scaling Postgres 191


In this episode of Scaling Postgres, we discuss new releases of Postgres due to a man-in-the-middle vulnerability, the high availability solution pg_auto_failover, lesser known Postgres features and LZ4 compression.


Transcript

All right. I hope you, your friends, family, and coworkers continue to do well. Our first piece of content is "PostgreSQL 14.1, 13.5, 12.9, 11.14, 10.19, and 9.6.24 Released!". This announcement was made at postgresql.org. The primary reason for the releases is two security issues: one where the server processes unencrypted bytes from a man-in-the-middle, and one where libpq processes unencrypted bytes from a man-in-the-middle. So it's both a server and a client issue, where their interactions can allow a man-in-the-middle attack. Now, it primarily affects you only if you're using trust authentication with a client certificate requirement or using cert authentication. So if you're using, say, MD5 or SCRAM authentication, you should be unaffected by this particular vulnerability. But again, they've made changes, so it is a good practice to upgrade anyway. The releases also include a number of bug fixes and improvements. It says most of these affect 14, but they may impact prior versions as well.

Now, related to this, there was another post at enterprisedb.com called "Postgres 'MITM21' Vulnerabilities". That's an abbreviation for man-in-the-middle 2021. They go into more depth about this particular vulnerability. They do say it is classified as high severity, but it's also high complexity, so certain conditions have to be in place for it to be exploited. However, this blog post gives a good overview of the vulnerability and its impact. They even include a query to let you check whether your pg_hba.conf file has any of the known vulnerable configurations, namely trust with a client cert requirement or just cert authentication. So I definitely encourage you to check that out.
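As a rough illustration of that kind of check (not the exact query from the EDB post), something like the following against the pg_hba_file_rules view, available since PostgreSQL 10, would flag rules that use cert authentication or trust combined with a clientcert option:

```sql
-- Hypothetical check, loosely modeled on the idea in the EDB post: list
-- currently loaded pg_hba.conf rules that use "cert" authentication or
-- "trust" together with a clientcert option.
SELECT line_number, type, database, user_name, auth_method, options
FROM pg_hba_file_rules
WHERE auth_method = 'cert'
   OR (auth_method = 'trust'
       AND array_to_string(options, ',') LIKE '%clientcert%');
```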

Also related to this, there was a YouTube video released called "Securing PostgreSQL From External Attack". This is from the EDB YouTube channel. It speaks directly to mitigating this sort of attack, because ideally your PostgreSQL server should be on its own internal network and only speaking to very specific servers. If you need to make any sort of connection to it, you should be going through some sort of proxy or bastion host. So it's basically about having a well-locked-down system and network, and it's about a 43-minute presentation covering some ways to do that.

The next piece of content- "PgBouncer 1.16.1" has been released. This is also related to the security fix, because PgBouncer sits between clients and Postgres: you connect to it as if it were Postgres, and it connects on to a Postgres server. Presumably they had to make some updates to handle the security fixes in PostgreSQL. You can check this piece of content for additional information about that.

The next post is "An introduction to the pg_auto_failover project". This is from tapoueh.org and is talking about pg_auto_failover, which is now on version 1.6.3. Now, I find this interesting because I would say this is probably the simplest way to set up automatic failover, where it handles which instance is the primary and which is the replica. It basically lets you set up a high availability configuration, ideally with three or more instances. The post goes into the details of the project, such as its early history, developing the prototype, and the reason why they built it. Basically, it came out of customer demand from people using Citus. It also covers a bit of the architecture, how it works, why it works the way it does, and how it differs from solutions built on distributed consensus systems, such as Patroni with etcd.

So it's a somewhat different implementation and, as a result, simpler and less prone to some issues. Because the thing about high availability is that when automatic failover relies on something like Patroni and etcd, there's a lot of sophistication you have to be aware of to make sure the system stays up and reliable. pg_auto_failover might be a simpler alternative. It's definitely a long blog post that goes into a lot of the history and some of the architecture, so if you are looking for high availability solutions, check out this post to see if it might fit your needs.

The next piece of content- "Lesser Known PostgreSQL Features". This is from hakibenita.com and, like many of his posts, this is a very long one. But I think of it as 18 mini blog posts put together, covering all sorts of different things you can accomplish in PostgreSQL, some of them lesser-known. I'll just list a few of these. One is how to get the number of updated versus inserted rows in an upsert; there's a particular way to do that. Another is how to grant permissions on specific columns. So some of these are ways to do things rather than features per se.

The first one isn't really a feature, for example: you use the xmax column to check whether a row was inserted or updated. The second one shows that you can grant permissions not just on a table but at an individual column level. Other items cover matching against multiple patterns, finding the current value of a sequence without advancing it, and using \copy with multiline SQL. So a lot of quick hits, I would say, of different features that PostgreSQL offers. I highly encourage you to skim through it; it doesn't take that long, and you may discover some features and capabilities you were not aware of.
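As a rough sketch of those first two items (based on the general techniques rather than copied from the post, with a made-up users table and role), the xmax trick and column-level grants look something like this:

```sql
-- Hypothetical table and role names, assuming "id" is the primary key.

-- 1) Tell inserted rows apart from updated rows in an upsert: xmax is 0 for
--    freshly inserted rows and nonzero for rows the upsert updated.
INSERT INTO users (id, email)
VALUES (1, 'a@example.com'), (2, 'b@example.com')
ON CONFLICT (id) DO UPDATE SET email = EXCLUDED.email
RETURNING id, (xmax = 0) AS inserted;

-- 2) Grant privileges on specific columns rather than the whole table.
GRANT SELECT (id, email), UPDATE (email) ON users TO reporting_role;
```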

The next piece of content- "What is the new LZ4 TOAST compression in PostgreSQL 14 and how fast is it?". This is from postgresql.fastware.com. In version 14, lz4 compression has become available as an alternative to PostgreSQL's built-in TOAST compression. This compression only applies to TOASTed values: when the data you want to insert into a row exceeds the size limit, it spills over into the TOAST table, and by default it is compressed using pglz. Now lz4 is available as an alternative compression algorithm in version 14. As the post shows, if you describe a table with \d+ in psql, it gives you the compression setting for each column.

In addition, you can alter a column to change its compression algorithm, so you can alter a column and set the compression to lz4. But you do need to be aware that lz4 support needs to be enabled when PostgreSQL is compiled and packaged. Now, I tested this on Ubuntu, and the Ubuntu PostgreSQL 14 package does include lz4 support by default, so you don't need a separate build step to get it. If you do change the setting for a column, only newly written data is compressed with the new algorithm. You can use the pg_column_compression function to see what a given row's value is compressed with: row one still shows pglz, but the next row inserted after they changed the column shows lz4.
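As a minimal sketch of the commands being discussed (the table and column names here are made up, and it assumes PostgreSQL 14 built with lz4 support):

```sql
-- Switch a column to lz4; only values written after this use lz4, while
-- existing values keep the compression they were stored with.
ALTER TABLE documents ALTER COLUMN body SET COMPRESSION lz4;

-- Check which algorithm a stored value was actually compressed with
-- (returns NULL if the value was not compressed).
SELECT id, pg_column_compression(body) FROM documents;

-- The server-wide default for newly written values can also be changed.
SET default_toast_compression = 'lz4';
```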

So I imagine that if you want to apply this to a large table, you would essentially need to rewrite all of the data in the table to get the new compression. But the interesting part of this post is that they also go into performance. They checked size comparisons first, comparing table sizes between the two algorithms, and found that lz4 produces slightly larger tables than PostgreSQL's default compression. The advantage is in its compression and decompression speed. For example, they checked INSERT performance, and lz4 is very close to uncompressed in terms of INSERT speed, and up to five times faster than the default compression.

So that could result in a huge performance gain if you have very large pieces of information you're inserting that need to go in the TOAST. They then checked SELECT performance, and again, there was an advantage with the lz4 compression in terms of speed. Then they tried multiple clients. And still, the lz4 compression had advantages at both the INSERT level and the SELECT level. So it definitely seems from these tests that lz4 has a big advantage in terms of speed when working with TOAST compression. So if you want to learn more, definitely check out this blog post.

The next piece of content- "PostgreSQL 14's enable_memoize For Improved Performance of Nested Loop Joins". This is from blog.jooq.org, and he's talking about a new feature in PostgreSQL 14. He included a Twitter post someone made that said "Wow, memoize in PostgreSQL 14 makes one of my queries using join lateral 1,000 times faster". Okay, so the question is: what is this? He goes into what memoization is. Basically, when a given function keeps producing the same output for a given input, you can cache that value for each input, essentially building a hash map, so that you can answer a particular question very quickly rather than computing it every time. So it is a form of caching. Now, in Postgres 14 this is turned on by default; he checked it and the current setting was on. If you run an EXPLAIN plan on a query that can potentially use this with a nested loop, you can see the Memoize node in the plan where it's being used.

Now he ran some benchmarks. A generic run didn't show a significant variation; there was a slight advantage with memoization, but not much. So then he benchmarked a lateral join, and with that, the query took about 3.5 seconds with memoization off and about 1 second with it on. So that was a pretty significant boost in performance. I mean, it's not 1,000-fold like that Twitter post suggested, but it was a pretty significant speedup, and maybe at larger data volumes it could lead to an even greater improvement. Now, he did try a correlated subquery, which essentially gives you the same results as a lateral join, but he didn't see the same advantage. So it looks like the planner would need to be taught to use memoization in those particular cases when you're querying in that fashion. But it's definitely an interesting feature being added to version 14, and if you want to learn more, check out this blog post.
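As a rough sketch of how you might check and exercise this (with made-up table names, not the post's exact benchmark), assuming PostgreSQL 14:

```sql
SHOW enable_memoize;   -- on by default in PostgreSQL 14

-- A lateral join that parameterizes the inner side by c.id; with memoization,
-- the plan can include a "Memoize" node under the nested loop so repeated
-- customer ids reuse cached inner results.
EXPLAIN
SELECT *
FROM customers c
CROSS JOIN LATERAL (
    SELECT *
    FROM orders o
    WHERE o.customer_id = c.id
    ORDER BY o.created_at DESC
    LIMIT 3
) recent_orders;

-- To compare against the old behavior in the same session:
SET enable_memoize = off;
```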

The next piece of content- "Generating more realistic sample time-series data with PostgreSQL generate_series()". This is from timescale.com and is a follow-up to their earlier generate_series post; this one focuses on creating more realistic data, such as realistic distributions of numbers, realistic text, and sample JSON. Now, a lot of the time when I'm generating sample data, I'm using developer tools, like Faker in Ruby; in Python I think there's something similar. I imagine these SQL-based solutions are not as robust as those libraries; however, I bet the performance of generating a lot of data would be a lot higher this way compared to using some of those other libraries. So if you're interested in doing that, you can check out this blog post.
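As a minimal sketch of the general idea (not the post's exact recipes), generate_series combined with random() can produce simple sample time-series rows:

```sql
-- Made-up schema: one reading every 5 minutes for a week, with a noisy
-- temperature value and a randomly chosen device name.
SELECT
    ts,
    round((50 + 10 * random())::numeric, 2) AS temperature,
    (ARRAY['sensor-a', 'sensor-b', 'sensor-c'])[1 + floor(random() * 3)::int] AS device
FROM generate_series(
    '2021-11-01 00:00:00+00'::timestamptz,
    '2021-11-08 00:00:00+00'::timestamptz,
    interval '5 minutes'
) AS ts;
```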

The next piece of content- "Multifactor SSO Authentication for Postgres on Kubernetes". This is from blog.crunchydata.com and, following up on the man-in-the-middle vulnerability, this is a configuration that could have sidestepped it: multifactor authentication, where you have a certificate but also have to provide a password, such as through SCRAM or MD5. This blog post explains how to set up that type of dual authentication, something you have (a certificate) and something you know (a password), but it does it for Kubernetes. So if you're interested in learning how to do this for Kubernetes using their Postgres operator, you can check out this blog post.
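Outside of Kubernetes, the general idea in pg_hba.conf looks roughly like this (a sketch, not the exact rule from the Crunchy Data post; the clientcert=verify-full syntax requires PostgreSQL 12 or later):

```
# Require SSL, a SCRAM password, and a client certificate whose CN matches
# the database user name.
hostssl  all  all  0.0.0.0/0  scram-sha-256  clientcert=verify-full
```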

The next piece of content- "Postgres pgagroal connectionpool". This is from dev.to. Basically, this is another connection pooler that's similar to pgBouncer. So it enables you to do connection pooling for Postgres, but its focus is on performance. And this blog post explains a little bit about using it, particularly on Red Hat Enterprise Linux Distributions. So if you're interested in learning more about that, you can check out this blog post.

The next piece of content- "FIND MISSING CROSSINGS IN OPENSTREETMAP WITH POSTGIS". This is from rustprooflabs.com, and they're talking about the 30-day map challenge that's going on in November 2021. For day five, there was a particular submission, and he completed it using OpenStreetMap data along with PostGIS, and QGIS for visualization. So if you want to learn about how he did that and his implementation, you can check out this blog post.

The next piece of content, the PostgreSQL person of the week is John Naylor. If you're interested in learning more about John and his contributions to Postgres, definitely check out this blog post.

The last piece of content. We did have another episode of The Rubber Duck Dev Show this past Wednesday. This episode was on how to start testing if you really don't have any tests for your project. So you can check out this video or the podcast audio.
