
Index Downsides, TCP Keep Alive, Development with Postgres, Learning PL/pgSQL | Scaling Postgres 189


In this episode of Scaling Postgres, we discuss the downsides of indexes, TCP keep alive options, developing with Postgres as your DB and learning PL/pgSQL.

Transcript

All right. I hope you, your friends, family, and coworkers continue to do well. Our first piece of content is "How partial, covering, and multicolumn indexes may slow down UPDATEs in PostgreSQL". This is from postgres.ai, and they're talking about cases where indexes can have downsides. The first thing he mentions is "Primum non nocere", which means "first, do no harm". It's kind of like the Hippocratic Oath: when you add an index, you want to make sure you don't harm the performance of the database. Of course, anytime you add an index, there is a burden on the system to maintain it, but he's talking about certain indexes that can have a pretty big impact on performance.

Now, he first covers the different index types he's going to test out. The first is partial indexes, which only index a portion of the table. So, for example, with a predicate of billed is not true, you're only indexing the rows where billed is false. Then he talks about multicolumn and covering indexes. A multicolumn index has multiple columns as part of the index, to help efficiently pull rows when you're asking for information from both columns. A covering index is basically an index with an additional payload added to it. Now, the benefit of these is that they allow you to get more index-only scans. An index-only scan gets all the information the query needs just by using the index.

You don't then have to go to the actual table in the heap to pull more information. So if all you're asking for in the SELECT is column one and column two, then in both of these cases, either with a multicolumn index or a covering index, you only need to go to the index to pull the information. Then he set up a number of experiments to see what impact different indexes have when certain columns are being updated. He has an example of an asset table with a primary key plus three more columns: an owner ID, a created-at timestamp, and a price. Basically, he inserted random data into this table, and he wanted to query the sum of the price for a given owner ID. He also adds a variant where the price is NOT NULL to test a partial index.
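To make that concrete, a rough sketch of the setup being described might look like this. The exact table and query definitions are in the original post; the table and column names here are assumptions based on the transcript.

```sql
-- Hypothetical recreation of the test table described above
CREATE TABLE asset (
    id         bigserial   PRIMARY KEY,
    owner_id   bigint      NOT NULL,
    created_at timestamptz NOT NULL DEFAULT now(),
    price      bigint                -- nullable, so a partial index can target NOT NULL rows
);

-- The two SELECTs being benchmarked: sum of the price for a given owner,
-- with and without the NOT NULL filter
SELECT sum(price) FROM asset WHERE owner_id = 123;
SELECT sum(price) FROM asset WHERE owner_id = 123 AND price IS NOT NULL;
```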

Then he tested an UPDATE of the price based on the ID. So he took these three different queries and ran pgbench against them. Now, he had some additional considerations here, and you should definitely review this post, but I'm not going to go into each of the points he makes. Based on what he wanted to test, he started with a baseline case where essentially no additional indexes are added, to see how these transactions perform. Then he added a one-column index on the owner ID; a multicolumn index on the owner ID and the price; a partial index on the owner ID where the price is NOT NULL; a covering index on the owner ID that includes the price as a payload; and finally a partial covering index on the owner ID, including the price as a payload, where the price is NOT NULL.
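The six scenarios he describes correspond roughly to index definitions like these. Again, this is only a sketch; the index names are my own, and the exact definitions are in the post.

```sql
-- The UPDATE being benchmarked: change the price of a single row by primary key
UPDATE asset SET price = price + 1 WHERE id = 456;

-- 1. Baseline: no index beyond the primary key

-- 2. One-column index on the owner ID
CREATE INDEX idx_asset_owner ON asset (owner_id);

-- 3. Multicolumn index on owner ID and price
CREATE INDEX idx_asset_owner_price ON asset (owner_id, price);

-- 4. Partial index on owner ID, limited to rows where price IS NOT NULL
CREATE INDEX idx_asset_owner_partial ON asset (owner_id) WHERE price IS NOT NULL;

-- 5. Covering index: owner ID with the price carried as an INCLUDE payload
CREATE INDEX idx_asset_owner_cover ON asset (owner_id) INCLUDE (price);

-- 6. Partial covering index: same as above, limited to price IS NOT NULL
CREATE INDEX idx_asset_owner_cover_partial ON asset (owner_id) INCLUDE (price)
    WHERE price IS NOT NULL;
```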

Then he looked at the resulting performance. Basically, without any index, the transactions per second are quite low on the SELECTs because they're doing a parallel sequential scan. The second SELECT, where the price is NOT NULL, again uses a parallel sequential scan and is a little bit slower, about 50 transactions per second. So really, really slow. The UPDATE transactions per second were pretty fast, 46,000 transactions per second. He also includes the HOT update ratio. A HOT update is a heap-only tuple update: Postgres can just go to the heap and update the tuple's price, in this case, without having to change anything in the index. This column shows you the percentage of heap-only tuple updates that you get. Now, adding the one-column index on the owner ID boosted the SELECT transactions by about 1,000-fold for query one and query two.
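If you want to check the heap-only tuple update ratio on your own tables, the standard statistics view exposes the raw counters; something like the query below works. The ratio calculation is my own, not taken from the post.

```sql
-- HOT update ratio per table, from the cumulative statistics views
SELECT relname,
       n_tup_upd,
       n_tup_hot_upd,
       round(100.0 * n_tup_hot_upd / nullif(n_tup_upd, 0), 1) AS hot_update_pct
FROM pg_stat_user_tables
ORDER BY n_tup_upd DESC;
```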

So that was quite significant, a 1,000-fold improvement in performance. The UPDATE transactions per second actually went a little bit higher, and the heap-only tuple ratio was equally high. There was a little bit of an increase in the index size, though; the index size is measured before and after the test, so it shows how much bloat happened as a result. Because the indexed column itself never changed, the price updates were basically all heap-only tuple updates, and not much changed in the index. Now, looking at the multicolumn index or the covering index, because those behaved similarly, they actually gave about a 2,000-fold improvement on the SELECTs, so faster than the one-column index, but they suffered performance-wise on the UPDATEs.

The UPDATEs dropped to about 30,000 transactions per second, a loss of about a third of the performance, and you're not getting any heap-only tuple updates because the price is now part of the index. That's the key thing to keep in mind here: if the column you're adding to an index, be it a multicolumn index or a covering index, is a column that gets updated, you are not going to get heap-only tuple updates. That can cause quite a bit of harm to your transactions per second for UPDATEs. As you can see in the results table, you're also going to get a fair amount of bloat in that index, which will require more vacuuming to resolve. So basically, this can be one significant downside of adding indexes to improve SELECT performance.

So by adding some of these indexes, you got a 2,000-fold improvement in SELECT performance, but you've reduced your UPDATE transactions per second by about a third. Keep in mind that this only happens because it's a column that is being updated frequently. If that column isn't being updated, then you won't see this kind of reduction in UPDATE performance, and you won't see the bloat. And there may be cases where the gain in SELECT performance outweighs the reduction in UPDATE performance. He mentions in the blog post that he would probably go with the single-column index, because it still gives you a 1,000-fold improvement while allowing a lot of heap-only tuple updates and avoiding a lot of index bloat. So this was a great post and I highly encourage you to check it out, because it showcases, in a pretty clear way, some examples of where adding new indexes can have downsides.

The next piece of content- "TCP KEEPALIVE FOR A BETTER POSTGRESQL EXPERIENCE". This is from cybertec-postgresql.com and they're talking about TCP keepalive. Basically, it helps detect broken connections or abandoned connections. Now, they mentioned there are some new features in different versions that can help track and monitor some of these. So they are mentioning that there are new session statistics in version 14 of Postgres that allow you to track sessions abandoned if there's a disconnect. It also has a new parameter called idle_session_timeout which closes idle connections after a while. But they say the problem with this particular feature is that it can terminate healthy idle connections as well. Maybe you want your connection pooler for your application to maintain those database connections open even though there's no active traffic going through them, but you want them to be there and ready to process database statements if one comes in.

What they're saying is that you can adjust your TCP keepalive values to get a much better solution to this problem. They explain that TCP keepalive exists to keep "...network connections from being idle" and to "...detect if the other communication end has left without closing the network connection". It has three parameters that you can adjust. One is the idle time: how long a connection is allowed to sit with nothing going on before keepalive kicks in. Then there's the interval: after the idle time has passed, probes are sent at this interval to check whether the other end is still there. And then the count: how many failed probes it takes before the connection is considered dead. Now, one thing they point out is that with the default idle time of 2 hours, a lot of firewalls or routers may already have closed the connection. So the first thing you're probably going to want to do is reduce the idle time to less than whatever that timeout is, and that will help you keep an idle database session alive.

And secondly, you can adjust the interval and the count to detect dead connections sooner. Otherwise, with an interval of 75 seconds and a count of nine, it's going to take approximately ten minutes. So you're probably going to want to reduce those to detect invalid connections more quickly. Now, to change these parameters, you can do it in the postgresql.conf file, since they are Postgres configuration variables, and on the client side there are matching connection parameters you can set for each of them. Finally, you can do it at the operating-system level, which would cover every application on the machine, not just the database, but that could be another option if you want to go that route. Basically, these settings give you fine-grained control: you can keep open the sessions you want to keep, and you get early detection when a database connection goes down. So if you're interested in learning more, I highly suggest you check out this post.
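The transcript doesn't spell out the parameter names, but on the server these are the tcp_keepalives_* settings, and libpq clients have matching keepalives_* connection options. A sketch of tightening them might look like this; the specific values are only illustrative.

```sql
-- Server side (postgresql.conf or ALTER SYSTEM); values are illustrative
ALTER SYSTEM SET tcp_keepalives_idle = 300;      -- start probing after 5 minutes idle
ALTER SYSTEM SET tcp_keepalives_interval = 30;   -- probe every 30 seconds
ALTER SYSTEM SET tcp_keepalives_count = 3;       -- give up after 3 failed probes
SELECT pg_reload_conf();

-- Client side, the libpq connection string equivalent:
--   host=db.example.com keepalives=1 keepalives_idle=300
--   keepalives_interval=30 keepalives_count=3
```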

The next piece of content is actually a YouTube video from ElixirConf: "I can do all things through PostgreSQL: Lessons for the Elixir dev". Now, this is a great video. It's about 35 minutes in length and has a fair bit of humor in it. It basically covers where you would want to use the tools offered by Postgres as opposed to trying to solve problems on your own within your application code. Now, Elixir is known to be a highly performant language, so if there are cases where it's more advantageous to have Postgres do the work even there, you could probably apply the same lesson to just about any other language, particularly Ruby, which I am also familiar with.

Ruby tends to be much slower at these kinds of tasks than Elixir. Now, they cover things like using constraints within the database, using citext to handle case-insensitive text, the benefits of Postgres for text search and fuzzy matching, as well as Bloom filters. Basically, in a lot of areas where you're dealing with a large data set, it's more efficient to use the tools within Postgres than to pull a bunch of data over and process it in Elixir. But even if you're not familiar with Elixir, I thought this was a great presentation.
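As a small example of the citext idea mentioned above, something like the following works; this is my own sketch, not code from the talk.

```sql
-- citext makes comparisons case-insensitive without LOWER() everywhere
CREATE EXTENSION IF NOT EXISTS citext;

CREATE TABLE users (
    id    bigserial PRIMARY KEY,
    email citext NOT NULL UNIQUE   -- 'Bob@Example.com' and 'bob@example.com' collide
);

-- Matches regardless of the case the address was stored with
SELECT * FROM users WHERE email = 'BOB@EXAMPLE.COM';
```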

The next piece of content- "Exploring PL/pgSQL: Strings, arrays, recursion, and parsing JSON". This is from eatonphil.com and he's basically giving a little mini-course on starting to use PL/pgSQL. He starts off very basic with the easiest function you can create. He's basically renaming the length function and it's called "length" itself. So it's basically this is one of the simplest functions that you can design and he shows how you can use named and unnamed arguments, specify out declarations, and then do a few things like working with numbers and recursion strings and arrays, and then even building a parser for JSON within it. So I thought this was a great post communicating how to get started with PL/pgSQL. If you're interested in that, you can check out this blog post.

The next piece of content- "Secure PostgreSQL 14 with CIS Benchmark". This is from blog.crunchydata.com, and they've released a new version of their Center for Internet Security benchmark that applies to PostgreSQL 14. This is a benchmark that's predominantly used again by the Center for Internet Security. It discusses how to handle things like installation and patches, directory and file permissions, logging, monitoring, auditing, user access, control connections and logins, Postgres settings, replication, et cetera. So if you want this benchmark

document, you can definitely check out this blog post and they have a link to it here the next piece of content.

The next piece of content, also security related, is "Database Security Best Practices on Kubernetes". This is from blog.crunchydata.com, and they're talking about different practices you should use for securing Postgres on Kubernetes. A lot of these are applicable even if you're not running on Kubernetes, but they call out the specific configuration variables you're going to want to set for it. They cover areas such as running the container as an unprivileged user, how to encrypt your data, how to handle credential management (passwords, keys, and things of that nature), how to keep your software up to date by keeping up with new versions, how to follow configuration best practices, how to limit where things can write by defining certain file systems as read-only, and how to identify the weakest link. Kubernetes is one thing, but you have to cover other areas as well, like your hardware, software, operating system, the applications themselves, who has access, etc.

Then a final post is "Encrypting Postgres Data at Rest in Kubernetes". This is also from blog.crunchydata.com. This post specifically talks about running Kubernetes and Postgres within Amazon AWS using their EKS service, and how you can handle encryption with this setup. So if you're interested in these security articles, you can definitely check them out.

The next piece of content. The PostgreSQL person of the week is Alexander Kukushkin. If you're interested in learning more about Alexander and his contributions to Postgres, you can check out this blog post.

The last piece of content: we had another episode of The Rubber Duck Dev Show this past Wednesday. In this episode, we talked about how many third-party libraries your application code should use. So if you're interested in that discussion, you can check it out. The next episode will cover how much planning you should do before you start coding.
