Automatic Indexing, Function Pipelines, With Hold Cursors, Query Scans | Scaling Postgres 188

In this episode of Scaling Postgres, we discuss automatic indexing, function pipelines, with hold cursors and the different query scans.

Transcript

All right. I hope you, your friends, family, and coworkers continue to do well. Our first piece of content is "Why is it hard to automatically suggest what index to create?". This is from depesz.com. He's talking about a post we covered in Scaling Postgres a week or two ago about pganalyze releasing an index advisor for Postgres. It's a live tool where people can paste in their query and maybe some schema information, and it gives a prediction of the best index to create. He actually tried it with a sample table layout and a query looking at four different columns from that table, and the advisor recommended a single index with all four columns in it. But there are a lot of issues with this. Basically, he arrives at the same conclusion I did, which is that it's not quite ready for primetime. It does basic things, but getting to the point of sophisticated index prediction, there's a lot more that would need to be done.

So you could take a look at this and say, well, we could put an index on each column individually, or, as he points out, there are six possible two-column indexes and four possible three-column indexes. That still doesn't take into consideration selectivity: how unique the values in a column are and how frequently they occur, or cardinality, which is another way to phrase it. He works through an example and talks about how this impacts whether you would want an index on a column. So basically, if a column has high cardinality, meaning a lot of different unique values, it is a very good candidate for an index. But if a column only has, say, four different values, then putting a plain index on it doesn't make a whole lot of sense if you're talking about millions or even billions of rows. However, it could be a good candidate for a partial index, where you index just the rows matching a particular value using a WHERE clause in the index definition.
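
To make that last point concrete, here is a minimal sketch of a partial index; the table, columns, and values are hypothetical and not taken from the post.

-- "status" only has a handful of distinct values, so a plain index on it
-- is of limited use on a large table. A partial index only covers the rows
-- matching one of those values; you could create one per value if needed.
CREATE INDEX orders_pending_created_at_idx
    ON orders (created_at)
    WHERE status = 'pending';

-- A query that repeats the same condition can use this much smaller index:
SELECT id, created_at
FROM orders
WHERE status = 'pending'
  AND created_at > now() - interval '1 day';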

He has a number of conclusions down here that I think are very good to follow if you're thinking about the best way to apply an index. The point of the post is that all of this makes it very hard to automatically come up with a way to design indexes for a given set of data. Now, we'll probably get there eventually, but it's still going to be a while. His recommendations are: if the selectivity of a condition is low, meaning almost all the rows match it, then there's no point in putting an index on that column. If you have a condition that you always use, in this example WHERE status equals 'done', then you can put it in the WHERE part of the index definition. Also, don't add too many columns to an index, because that makes it larger and less efficient.

In particular, if you have two columns and adding a third column to the index doesn't significantly reduce the number of rows returned, it may make sense to just go with the two-column index and let the planner filter out the rest by looking in the heap. Another thing he mentions is inequalities, greater than or less than, which he had in his query: an index can only be used efficiently for one inequality condition at a time, and generally you want that column at the end of the index. So you could, for example, have an index on column A, then column B for the greater than or less than comparison, and put the status condition in the WHERE clause of the index, as sketched below. So this is a great post to help educate you on how to design your indexes, as well as to discuss how hard it is to automatically determine what indexes should go on a particular table. But I think we'll eventually get there, because all of these considerations are just more information to feed into the model to make better decisions. It's just not quite ready for primetime.
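
Here is a rough sketch of the shape of index described above; the table, columns, and values are placeholders rather than the actual schema from the post.

-- Equality column first, the inequality column last, and the always-present
-- condition pushed into the partial-index WHERE clause.
CREATE INDEX test_a_b_done_idx
    ON test (a, b)
    WHERE status = 'done';

-- An index like this matches queries of the form:
SELECT *
FROM test
WHERE a = 123
  AND b > 1000
  AND status = 'done';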

The next piece of content- "Function pipelines: Building functional programming into PostgreSQL using custom operators". This is from blog.timescale.com and they're talking about a very interesting extension that they've just released and so because it's a separate extension you can use it on any version of Postgres. Now, they have the Timescale extension for TimescaleDB so, clearly, you could use it there and they say it's available to use now in kind of a beta state. However, you can download the extension separately and use it on any version of Postgres. Basically, what it does is it uses a combination of custom data types, custom operators, and custom functions to enable you to do some interesting time series calculations. For example, they were looking to do a specific type of data analysis that they describe in the post here.

But in order to get information about variances over time for particular devices, you had to write a query with a window function and a subquery, and it's not the most readable query. You can clearly do it, but using this new extension with its pipeline functions makes the query much simpler: it creates a pipeline of functions that you send your data through. So for example, the timevector here is a custom type that holds a timestamp and a value together. You pass it through a sort, then through a delta to get the difference between each value and the previous one. These are essentially functions that you're just passing data through, which is why it's called a pipeline. Then you take the absolute value of those deltas and add them all up, grouped by device ID.
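
As a rough illustration of what this looks like, here is a sketch in the style of their examples; the table and column names are assumptions, and the exact function names may differ in the released extension.

-- Each "->" sends the timevector through the next function in the pipeline:
-- sort by time, take deltas between consecutive values, take absolute values,
-- then sum them up per device.
SELECT device_id,
       timevector(ts, val) -> sort() -> delta() -> abs() -> sum() AS total_change
FROM measurements
WHERE ts >= now() - interval '1 day'
GROUP BY device_id;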

So I found this really interesting if you do a lot of data analysis work, because it is much easier to read this and reason about what's happening versus looking at the raw SQL. Now, of course, this is not built into PostgreSQL and you can't use it everywhere, but if your job requires doing a lot of analysis work, these types of functions could assist you. I believe they said they've developed 60 or 70 different functions. You might want to check out this extension, because it is quite interesting what's possible, particularly for time series data, which is what these functions are designed for. They list the categories of functions they've come up with: various unary mathematical, binary mathematical, compound transforms, and some lambda elements. So it's definitely an interesting blog post, and if you want to learn more, I definitely encourage you to check it out.

The next piece of content- "WITH HOLD CURSORS AND TRANSACTIONS IN POSTGRESQL". This is from cybertec-postgresql.com. They're talking about transactions in PostgreSQL and when you're using cursors, generally, those all operate within a transaction and once that transaction is done, the cursor is no longer needed and is removed. A cursor basically allows you to grab one or several rows at a time from a query rather than getting all the data at once. So you can do different operations with it. In the example here, they have a PL/SQL code block that they have set up where they are looking at the information schema tables to then delete certain tables from a schema.

But you can also use cursors directly in SQL. Now, what they're talking about is that, again, cursors normally live within a transaction, but there is a clause you can use, WITH HOLD, that will keep the cursor in existence after the transaction is complete. You do need to be cautious and close it, because otherwise it's going to keep utilizing resources until it's closed or the session ends. I personally haven't used a lot of cursors in my work or found them necessary for my application, but if you want to learn more about cursors and how they're used, particularly within transactions, definitely check out this blog post.
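
For reference, here is a minimal sketch of the syntax; the table and column names are made up.

BEGIN;
DECLARE recent_events CURSOR WITH HOLD FOR
    SELECT id, payload FROM events ORDER BY id;
COMMIT;  -- the cursor's result set is materialized at commit and kept around

FETCH 100 FROM recent_events;  -- still works outside the original transaction

CLOSE recent_events;  -- important: frees the resources held by the cursor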

Next piece of content- "Analyzing Scans in PostgreSQL". This is from arctype.com and they're talking about looking at how the query plans of Postgres work and specifically, they're talking about the different scans that are possible within Postgres. So for example, they're taking a look at the scan based on a particular query. The first scan they cover is, of course, sequential scans, where you're just scanning the table straight through, it's not utilizing an index, it just goes directly to the heap, directly to the table, and reads all the data. The next one they look at is an index scan where you have applied an index and it shows you how an index scan is applied once you're looking for a particular value in that index.

Then they cover the index-only scan, where the data being returned is essentially all contained in the index itself. Then they talk about queries against textual data, where generally what you'll see is a bitmap heap scan along with a bitmap index scan, and how those work in tandem to retrieve the data, particularly for text values. They also cover BitmapAnd and BitmapOr nodes, which can be relevant when combining conditions, for example in OR queries. Lastly, they cover parallel scans, which basically use multiple workers to scan a table and return the data faster when you're doing a sequential scan. So if you're interested in learning more about the scans Postgres uses for queries, you can definitely check out this blog post.
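
If you want to see these plan nodes yourself, EXPLAIN will show them. Here is a quick sketch against a hypothetical table, with the node you would typically see noted in the comments; the planner's actual choice depends on the indexes and the data.

EXPLAIN SELECT * FROM orders;                   -- Seq Scan
EXPLAIN SELECT * FROM orders WHERE id = 42;     -- Index Scan on the primary key
EXPLAIN SELECT id FROM orders WHERE id < 1000;  -- often an Index Only Scan
EXPLAIN SELECT * FROM orders
 WHERE note LIKE '%refund%';                    -- with a trigram index, typically a
                                                -- Bitmap Index Scan + Bitmap Heap Scan
EXPLAIN SELECT * FROM orders
 WHERE id = 42 OR customer_id = 7;              -- may use BitmapOr over two index scans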

The next piece of content- "Useful queries to analyze PostgreSQL lock trees (.aka. lock queues)". This is from postgres.ai. So basically, when Postgres goes to create a lock on, say a row for example, and then another process comes through and needs to update that row, it needs to wait. So essentially that has created a queue of processes that need to access data, but it is blocked by a lock. You could also call these lock trees because if another process comes along wanting to update that row, it needs to wait behind the second process until it's able to do its update and is allowed through.

Well, you can sometimes get these in your database system and need to diagnose them, and the pg_locks system view is the place to look. But working with that data, given how quickly it changes as queries run through the system, can be a little difficult. He has drawn on a number of different resources here, all of them very good, to come up with what he thinks is the best query for analyzing your lock queues. It's quite a long query, and he does talk about how it works and the important points of it, but mainly he wanted to share the query he uses to help him trace issues with lock queues. So if you're interested in that, you can check out this blog post.
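
His query is much more elaborate, but as a simple starting point, something along these lines (not taken from the post) shows which backends are waiting on locks and who is blocking them.

-- Ungranted lock requests, i.e. backends currently waiting in a lock queue.
SELECT pid,
       locktype,
       relation::regclass AS relation,
       mode,
       granted
FROM pg_locks
WHERE NOT granted;

-- For a given waiting backend, list the PIDs that are blocking it:
SELECT pg_blocking_pids(12345);  -- 12345 is a hypothetical waiting pid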

Next piece of content- "Projecting monthly revenue run rate in Postgres". This is from blog.crunchydata.com. He's talking about developing a query to calculate the projected monthly revenue run rate as well as an annual run rate. He goes through each of the steps of this relatively complex query of how we built it up and how it does the proper calculations for, say, a subscription business. So if you want to get some more query experience, you can definitely check out this blog post.

Next piece of content- "Postgres text search: balancing query time and relevancy". This is from sourcegraph.com and he's talking about not necessarily full-text search, but he's talking about the trgm extension and its usage because I believe they work with source code. They use trigrams instead of full-text search, which basically searches for words, whereas the trigrams are broken up into a series of three characters. Those are the general types of searches that they use. He was showing how when you do a search using a trigram search, you can relatively quickly get a set of relevant results returned. But you can also do the query such that you increase the relevancy and it gives you a ranking. But the problem he says, is with performance, the query not asking for a ranking runs in about 80 milliseconds, whereas one that gives you a ranking runs in about 7 seconds.

So it's quite a bit slower, and he says, unfortunately, due to how it's designed, there's no real way around it. But he does mention, further down, that someone proposed using a RUM index. This type of index is larger than, say, a GIN index, but it's designed to be faster specifically for ranking. They posted a link showing an example where, yes, the GIN index is smaller, but the RUM index stores more information to help with that ranking. So basically, if you use the pg_trgm extension and you want ranking or relevancy, maybe you'd want to try a RUM index, or read this post to see the issues he was dealing with.

The next piece of content- "Does QGIS work with Postgres 14?". This is from elephanttamer.net. This is a super short blog post that basically says yes it does. So apparently there were some issues in versions postgres eleven to twelve upgrade that caused some issues, but apparently with 14 he has not discovered any issues so far.

Next piece of content- "Upgrade Amazon RDS and Amazon Aurora PostgreSQL version 9.6". This is from aws.amazon.com. Basically, they're communicating that version 9.6 of Postgres, on these hosted platforms, will be classified as End of Life at the beginning of next year. Which is appropriate because 9.6 is no longer receiving releases from the Postgres community.

The next piece of content, the PostgreSQL person of the week is Ivan Panchenko. So if you're interested in Ivan and his contributions to Postgres, definitely check out this blog post.

The last piece of content, we had another episode of The Rubber Duck Dev Show this past Wednesday. We discussed "Practicing Continuous Integration & Development". So if you're interested in learning more about that topic, you can check out this piece of content. Our upcoming show will be about how many third-party libraries you should use in your application code.
