background

Postgres 15 Released, File Systems, Connection Pooling, Secure Connections | Scaling Postgres 237

Join Over 1,000 Engineers & Get New Episodes Weekly!

In this episode of Scaling Postgres, we discuss the release of Postgres 15, the performance of different file systems with Postgres, options for connection pooling and how to secure your connections.

Content Discussed

YouTube Video

Podcast Audio

Transcript

All right. I hope you, your friends, family, and coworkers continue to do well. Our first piece of content is "PostgreSQL 15 Released!". This is from postgresql.org and they've released Postgres 15. They mentioned some of the main improvements made down below. They talk about improved sort performance with some benchmarks showing speed-ups of 25% to 400% based upon the data type sorted. Improved performance benefits with window functions with things like row number, rank, dense rank and count and SELECT DISTINCT can now be executed in parallel. They talked about a number of Foreign Data Wrapper improvements as well as the new encryption options for all files as well as pg_base_backup. Now on the development side, they talk about the new merge command, new regular expression functions, creating views that query data using the permissions of the caller as opposed to the view creator, and also a host of logical replication improvements. So definitely check out this blog post or any of the other content that we've been covering on Scaling Postgres if you want to learn more about all the new features in Postgres 15.

Next piece of content is "Be Ready! Public schema changes in Postgres 15". This is from crunchydata.com and they're talking about where a general user can no longer create objects in the public schema because it was deemed too much of a security risk. So if you have a general user and you just go to create a table, you're going to get permission denied for schema public. This makes it not as easy to get started with Postgres, but it's relatively easy to get started with a general user. If you want to do that, you could always use an admin user to create new objects. Or you could do it the way he describes here, where essentially you've created your user, and he creates a user here. You create a schema for that user and authorize that user to use it.

That means it becomes the owner of that schema essentially. And then when you create tables, it gets created in that schema. But you don't have to specify the schema beforehand because it's already part of the default search path. And he also shows you how you could share that data with other users. So basically grant usage on that schema for that user to another user and then grant SELECTs on tables. Or maybe you want to grant INSERT or other types of privileges for other users to the objects in that schema. But he says the easiest way to do this is by using roles. So you create a role, grant the privileges to that role and then you assign users to that role. So definitely check out this blog post if you want to learn more about that.

Next piece of content- "A Look at Postgres 15: MERGE Command with Examples". This is from crunchydata.com and he's talking about the new MERGE command which basically allows you to take two data sets and merge together the changes. He says one primary use case for this is if you're collecting a bunch of sensor data in batches, he uses the example where you place that sensor data in a temporary table and then merge it with the main permanent table. So he shows you exactly how to do that for this particular use case. So if you want to learn more about that, definitely check out this blog post.

Next piece of content- "Postgres vs. File Systems: A Performance Comparison". This is from enterprisedb.com. Now, if Postgres 15 wasn't just a release, this would probably be the first piece of content because this is a great piece of content, doing a lot of analysis using two different types of systems, hardware systems that is, to determine what difference file system storage makes with Postgres. So for example, a lot of people are just using ex4t if you're using a Linux system. But zfs is also becoming more popular now and btrfs has been around for a while. So basically, they used all these different file systems and then used pgbench to run different metrics against them to see what difference the file system makes with Postgres. Because Postgres, unlike database systems like Oracle, doesn't write directly to the hardware. Postgres relies on the operating system's file systems to write to the disk.

Now, this is a super comprehensive post with a lot of graphs, as you see here when I'm scaling through it. But overall, after looking at all this, the conclusion I came to is that ext4 is still the most performant file system you can use with Postgres. But they did have problems with btrfs, so you probably don't want to use that. Zfs was interesting. It is getting close to the performance of ext4 and because you can turn off full-page writes with it, you can actually have a much more consistent performance level of your disk compared to the checkpoints that are pretty visible when using ext4. But zfs still was not quite up to the performance of the ext4 file system.

In some use cases, they were showing it was half as fast when doing select queries of a small data set. That's not great, but in a lot of the other performance areas, it was much closer to ext4. Now, I do have some experience with a client who actually chose zfs as the underlying file system with Postgres, but they did that because they wanted the compression that zfs offered. And of course, when you're adding encryption on top of it, you're adding even more overhead to it, potentially impacting performance. But this was a great analysis and I definitely recommend you check this out to help you make the decision about what file system you want to use under Postgres.

Next piece of content- "PGBOUNCER: TYPES OF POSTGRESQL CONNECTION POOLING". This is from cybertec-postgresql.com, and the first thing they're talking about is why would you want to use it. Well, basically whenever you create a new connection in Postgres, it forks a process and that has a lot of overhead associated with it. So what a lot of people do is implement a connection pooler such as PgBouncer because they can very quickly create connections. It doesn't use a process-based model to do that and it maintains a back-end connection to Postgres. So you can think of it kind of like a proxy. He shows an example where he has a query where he's simply doing select one. When he processes it using pgbench over 10 seconds and ten concurrent transactions, he gets 295,000 transactions per second. Now this is without pooling at all. It's essentially creating ten connections and doing the performance run. But here he says, 'Okay, create a new connection for each query'.

Now upon doing that, the transactions per second go from 295,000 down to 3,000 or 3.7 thousand. So a 98.7% drop in performance. So it's huge, that overhead of creating new connections for each query, essentially. Now you're not always going to do that, but it just demonstrates how important not having to create a connection every time you need to use the database is and the argument for using a connection pooler like PgBouncer. Then he goes on to talk about the different types of pooling that PgBouncer can do. One is session pooling where essentially each connection gets its own session. So it's essentially a one-to-one; you connect to PgBouncer, you connect to Postgres. It's simply an intermediary. Then there's transaction pooling and basically, PgBouncer grants client connections coming in per transaction. So this potentially lets you have more clients than you have Postgres connections because you can have a many-to-one relationship.

If there's any latency going on or if there's nothing actively happening with a given connection, PgBouncer can give it to another connection that's coming in. So it helps you utilize more of your resources and you can have more connections on the client end. And I've seen as high as 5-10x the number of client connections to Postgres connections. So you could, for example, have 500 or 1,000 client connections going through 100 Postgres connections. And then lastly is statement pooling where basically each statement coming in could go to a different connection on Postgres. Now I haven't seen statement pooling used that much. I have done some session pooling, but for performance reasons, transaction pooling is the usual go-to if you want pretty good performance. It's not as much as statement pooling, but you can run into issues having transactions being spread across connections, for example. But if you want to learn more about that, definitely check out this blog post.

Next piece of content- "Don't make databases available on the public internet". This is from tailscale.com, and this is in reference to the post that we discussed last week where bit.io was discussing how many Postgres instances are open to the internet and not asking for SSL encryption. So one of the things is just don't make it available on the internet as the title suggests, but their solution to that is to actually use a proxy system. So for example, this is a VPN solution, Tailscale, and if you're in the VPN's network, you don't have to worry as much about configuring certificate SSL and using encrypted connections if you're going through presumably their VPN network. So basically, you don't have to be as stringent with the Postgres client. But this new open-source software they released, pgproxy, handles the Strict TLS and encrypted connections to the destination Postgres server that's being hosted somewhere. So they've released this proxy essentially as open-source software.

Now, interestingly, you could probably do the exact same thing with PgBouncer as we just discussed. So you could, for example, set the client connections to not require SSL and not use certificate authentication if it's a secure internal network. But then the server connections that PgBouncer makes, you can specify it to require SSL and to use verify full to avoid man-in-the-middle attacks. Another solution is just using any kind of VPN to talk to the ultimate server that you're working with. This could be using things like IPsec as well. So there are scenarios where you potentially could have Postgres open on the Internet, but you need to be very careful with your security if you're going to do that.

Now, another solution addresses this in the next piece of content. "VPC Peering: From Zero To Hero". This is from timescale.com, and they're talking about a feature of AWS called VPC Peering, where you can actually have two accounts set up so that they communicate with each other as peers and virtual private networks essentially. And they talk about how to set that up with their Timescale cloud service. But presumably, other providers offer this as well. So if you have your other resources in an Amazon VPC, maybe you could use another hosted provider's VPC and Peering to be able to securely connect them without going over the Internet in an unsecured fashion.
Next piece of content- "PGSQL Phriday #001 Wrap-up and review". This is from softwareandboos.com, and he's reviewing the community blogging that happened, and he talks about each of the blog posts that were done. These should have all been covered in last week's Scaling Postgres episode, so feel free to check that out as well if you'd like.

Next piece of content- "Six findings we rely on when managing PostgreSQL indexes". This is from knock.app and they're talking about how their engineering team thinks about and uses indexes. 
Now this is a more basic post, but in the interest of time, I'd actually like to recommend watching Lukas' take on it because this is the episode he covered this week in his "5mins of Postgres E39: Postgres 15 release, and 6 tips for managing indexes". And this can be found on pganalyze.com.

Next piece of content- "Don't Forget to Tune Stats Collector for PostgreSQL 14 and Older". This is from percona.com and he's talking about how with Postgres 15, they moved the Stats Collector from essentially writing files to disk to being stored in memory. But every version from 14 and older is essentially writing to disk. And there's actually a recommendation that if you want the most performance out of your system and basically avoid statistics impacting disk writes, you can actually use a RAM-based file system for it. Two recommendations they have is for RAMFS or TEMPFS and ultimately it looks like they recommend TEMPFS and they walk through how to set that up for your Postgres installation. So definitely check out this blog post if you want to learn more about that.

Next piece of content- "Partitioning in Postgres, 2022 edition". This is from brandur.org and he's talking about how far partitioning has come. Basically, he references years ago when before declarative partitioning came out, a partitioning solution and all the work that had to go into it was overwhelming. But with all the improvements that have been made in the versions over the years, he says it was quite easy to convert a large table into a partition table and he goes through the process of doing that here. But he does talk about some drawbacks. One is you can't concurrently create an index, which is a big issue I wish they would resolve at some point soon.

But there is a way to get around it when you're adding indexes is that you can create a concurrently on the child and then at the end just create the normal index on the parent and it should see that the index has already been created on the children and it'll just create it relatively quickly. He does say there is an outstanding patch for this and he hopes it's going to get in Postgres 16. The other drawback is not being able to support unique indexes across the whole table. So you can create unique indexes within a partition, but it doesn't operate on the whole parent table essentially.

Now with regard to that, the next post, "Global Index, a different approach" addresses exactly that point. When you're working with partition tables, again, you can't have a unique index across all of those tables and this is a bit of an issue and they've discussed different approaches and he says 'Hey, this may be a different approach'. So there's work definitely being done by highgo.ca here on trying to figure out how to do a global index for partition tables. So if you want to learn more about that, definitely check out this blog post.

Next piece of content- "5 Years of Postgres on Kubernetes". This is from thenewstack.io and this is from, I believe, a founder of CrunchyData. And CrunchyData, I think, made one of the first, if not the first Postgres operators to use in Kubernetes. And he talks about the five-year experience of doing it and all the changes that have been made, especially with the most recent version. So if you are looking to use Kubernetes with Postgres, definitely recommend checking out this blog post.

The second blog post related to that is "Run PostgreSQL in Kubernetes: Solutions, Pros and Cons". This is from percona.com and this is essentially a review of five different operators for Postgres that use Kubernetes. They review the CrunchyData PostgreSQL operator, the CloudNative PG operator from EnterpriseDB, the Stackgres operator from OnGres, the Zalando Postgres Operator, as well as the Percona Operator for PostgreSQL. So as you can tell here, there may be a little bit of a bias in this review because it's from percona.com, but nevertheless, definitely recommend looking this over if you're looking to start using Postgres with Kubernetes.

Next piece of content- "Postgres Full Text Search vs the rest". This is from supabase.com and in this post, they compared several different search engines to Postgres's Full Text Search. They compared MeiliSearch and OpenSearch, which is the open-source variant of ElasticSearch, I believe. SQLite FTS, and Typesense. Now he goes through the whole process of how they did the measurements, but I'm looking down here for the ultimate results and as you can tell, Postgres is in green and you can see it holds its own with essentially all the other search engines. Not quite sure what the issue is with OpenSearch here in terms of the latency of what's returned, but they show the number of results that were returned for given queries and the average latency for them.

You can see Postgres sometimes gets up there in terms of average latency, but overall pretty good showing compared to these other dedicated search engines, except for potentially the SQLite solution. So it's pretty impressive. Now, what they did say is that some of the other search engines handled misspellings much better. So there were some details in the actual results of the search where they excelled. So it wasn't just a pure speed decision. But of course, we should mention that Supabase does run, I believe on open-source Postgres. So there may be a bit of bias in this comparison, but nevertheless, an interesting set of results if you would like to check them out.

Next piece of content- "How to Set Up Streaming Replication to Keep Your PostgreSQL Database Performant and Up-to-Date". This is from enterprisedb.com and this post does exactly that. It walks through setting up streaming replication including asynchronous, as well as synchronous replication. So if you're interested in that, you can check out this blog post.

Next piece of content, there was another episode of Postgres FM this week. This one was on "PostgreSQL 15" and its new release. So if you want to learn more about that, definitely check out their show or click here to watch the YouTube video.

Next piece of content. The PostgreSQL person of the week is Christoph Berg. If you want to learn more about Christoph and his contributions to Postgres, definitely check out this blog post.

The last piece of content, we did have another episode of The Rubber Duck Dev Show this past Wednesday evening. This one was on "Live Streaming Tools & Toys With Aaron Francis". So if you want to learn more about the software and hardware that are used to do streaming, for example, live coding streaming or just recording coding, definitely welcome you to check out our show.

episode_image