
Combining Indexes, Regular Expressions, Parallelism, PostGIS | Scaling Postgres 75


In this episode of Scaling Postgres, we discuss combining indexes, regular expressions, parallelism in Postgres and PostGIS content.


Transcript



[00:00:00] In this episode of Scaling Postgres, we talk about combining indexes, regular expressions, parallelism and PostGIS. I'm Kristen Jameson, and this is Scaling Postgres, episode 75.

[00:00:20] Alright, I hope you're having a great week. Our first piece of content is "Combined indexes vs. separate indexes in PostgreSQL". This is from cybertec-postgresql.com, and they're basically talking about indexes on single columns versus indexes covering multiple columns. For this discussion, they create a table with three columns, a, b and c, insert a million rows with random values, and then create an index that covers all three columns, so a multicolumn index on (a, b, c). Now, the first thing they start with is something pretty basic: PostgreSQL will rearrange filters for you. It doesn't matter how you order your WHERE clause, it'll still use the conditions and scan through the index in the proper order; it does that job for you, regardless of how you place it relative to the index. It even applies transitive properties and breaks them out into how they should be. So even if you have A = B, B = C, C = 10, it knows to do A = 10, B = 10, C = 10. Also, with a multicolumn index, it can use the leading part of the index to do searches. Since a is the first column in this multicolumn index, you can search a = 10 and it will use the index to scan those rows. And the key thing is, you can use it in the WHERE clause as well as the ORDER BY clause. They did a WHERE clause on a and then ordered by b and c, which is a natural fit for the index: it filters by the a column and then orders by b and c, the next columns in the multicolumn index. But of course, they also discuss cases where a composite or multicolumn index won't work. If you're searching only on the second or third column, it's not going to use the index; you get, as they show here, a parallel sequential scan, not an index scan.
Now, they say that sometimes, on occasion, it may still scan through the index, essentially doing a sequential scan of the index, under certain conditions. They simulated this by setting the sequential page cost relatively high, and when they ran the query it did scan through the index, but it was also a lot slower. This one took 63 milliseconds filtering on the second column, whereas the next closest example took four milliseconds, so it's definitely much slower. In other words, it can't really search efficiently on only the last or middle columns; it needs at least the first column, or the first two columns, of the multicolumn index. Then they go into understanding bitmap index scans. This is where it can combine the results of indexes: in this example, it does a bitmap scan where a equals four, another bitmap scan where a equals a different number, and then a BitmapOr to combine the results. So that's also something it can do. In the last set of examples, they created single-column indexes on all three columns, so no longer the multicolumn index, and ran the same search. It can use those indexes and combine them, but if you'll notice, it believes it's more efficient not to use all three indexes at once: it uses what it considers the two most efficient indexes, combines them with a BitmapAnd, and then just applies a filter for the column whose index it did not use. So the optimizer used the indexes on b and c, but a plain filter on a. Lastly, they follow up with optimizing min and max in SQL queries: when you're looking for the min or the max of a value, it will also use the indexes. So this post gives a good overview of how indexes basically work, and maybe some of these ideas can help you decide whether to use single-column or multicolumn indexes as you're developing your tables.
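The setup the episode describes can be sketched in SQL; the table and index names here are my own assumptions, not taken from the post:

```sql
-- Hypothetical recreation of the setup: three integer columns,
-- a million random rows, one multicolumn index.
CREATE TABLE t (a int, b int, c int);
INSERT INTO t
  SELECT (random() * 1000)::int, (random() * 1000)::int, (random() * 1000)::int
  FROM generate_series(1, 1000000);
CREATE INDEX idx_t_abc ON t (a, b, c);
ANALYZE t;

-- Can use the index: filter on the leading column, order by the rest.
EXPLAIN SELECT * FROM t WHERE a = 10 ORDER BY b, c;

-- Usually cannot use the index: the leading column a is not constrained,
-- so you typically get a (parallel) sequential scan instead.
EXPLAIN SELECT * FROM t WHERE b = 10;
```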

[00:04:12] The next post is "PostgreSQL regular expressions and pattern matching", and this is from 2ndquadrant.com. It says Postgres supports multiple different types of pattern matching. One is POSIX-style regular expressions; another is SIMILAR TO, which is an operator added in SQL:1999; and there's the SQL LIKE operator. Now, I usually tend to use the LIKE operator, but this particular post talks about POSIX-style regular expressions: you can use the tilde (~) for matching a regular expression, case-sensitive; tilde-star (~*) matches a regular expression, case-insensitive; exclamation-point-tilde (!~) means does not match the regular expression, case-sensitive; and exclamation-point-tilde-star (!~*) means does not match the regular expression, case-insensitive. It goes through some different examples just to show it. So this is a pretty easy post to look at if you want to start using regular expressions in your queries.
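As a quick sketch of those four operators (these are standard PostgreSQL operators; the literal values are my own examples, not the post's):

```sql
SELECT 'PostgreSQL' ~  'Post';   -- true: matches, case-sensitive
SELECT 'PostgreSQL' ~* 'post';   -- true: matches, case-insensitive
SELECT 'PostgreSQL' !~ 'post';   -- true: no case-sensitive match for 'post'
SELECT 'PostgreSQL' !~* 'post';  -- false: it does match case-insensitively
```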

[00:05:14] The next post is "Parallelism in PostgreSQL", and this is from percona.com. Actually, the next four posts this week are from percona.com, so they've definitely posted quite a few this week. First they talk about the components of parallelism, mostly the parallelism features that were introduced in 9.6. They cover the fact that PostgreSQL is process-based: generally you have one process handling a particular query, because generally you have one process per connection. But with the parallelism features they introduced, you can get additional worker processes. So here you have a gather node that coordinates multiple workers: if PostgreSQL believes a piece of work can be done in parallel, it spawns multiple workers that send their tuples up to the gather node in order to parallelize that query. They talk about the different capabilities this enables. The first is parallel sequential scans. If you're scanning a whole table looking for data, it makes total sense to do it in parallel with as many cores as you have. In this first example, I believe they did it with just a single processor, and it executed in, looks like, 4,343 seconds, whereas when they did it in parallel using ten workers, it completed in 1,628 seconds. They post the graph here, and what they did note is that there is a diminishing set of returns eventually. So once you go parallel, number one, the speedup is not linear, and number two, at some point the work of coordinating all the different workers exceeds the benefit of adding another worker. But they do show how this definitely improves performance. Next, they talked about parallel aggregates. This is when you're trying to, say, do a count of the number of rows in the table, and how optimized that can get.
And again, you have a gather node working with partial aggregates that are computed over different parts of the table, leading up to a finalize aggregate on the leader. Now, here they didn't see quite as much of a difference: they say they got a performance boost of just over 16% doing it in parallel with ten workers compared to just using a single one. So that's not as significant, but it does give you a bit more performance. Then they talked about parallel index scans on B-trees and how that works, as well as discussing a little bit about parallel bitmap scans and parallel joins. So if you want to learn a bit more about parallelism in PostgreSQL, this is definitely a blog post to check out. The thing I always like to keep in mind is: is your workload more transaction-based, where you want a lot of connections doing work? In that case, maybe parallelism isn't that much of a benefit, because you're already asking it to handle many, many connections doing small queries. However, if you have more of an analytical workload, where you're doing a lot of sums and counts with fewer users, but wanting to use more cores of the machine to handle each query, then definitely parallelism is a benefit. So just something to keep in mind if you're looking at optimizing your parallelism settings.
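A minimal sketch of what a parallel aggregate looks like in practice; the table name and the setting value here are illustrative, not from the post:

```sql
-- Cap on workers per Gather node (value is illustrative)
SET max_parallel_workers_per_gather = 4;

-- On a large enough table, a parallel count typically plans as:
--   Finalize Aggregate
--     ->  Gather
--           Workers Planned: 4
--           ->  Partial Aggregate
--                 ->  Parallel Seq Scan on big_table
EXPLAIN SELECT count(*) FROM big_table;
```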

[00:08:41] The next post, also from percona.com, is "How to adjust Linux Out-Of-Memory killer settings for PostgreSQL". They go into a lot of detail about the out-of-memory killer and why it exists: basically, you don't want a rogue process or program taking the entire memory of a machine. Unfortunately, PostgreSQL is usually the primary thing running on its machine, and it tends to be a candidate for being killed. The recommendation is to set the vm.overcommit_memory kernel variable to 2 to reduce the probability of this happening. That means, quote, "the kernel is not supposed to overcommit memory greater than the overcommit_ratio". And this overcommit_ratio is another kernel setting where you specify the percentage of memory the kernel can overcommit. This is pretty much the safest option recommended for PostgreSQL. So if you want to learn a little about the out-of-memory killer in Linux and some of the settings suggested for PostgreSQL, definitely a blog post to check out.
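Those kernel settings are adjusted with sysctl on Linux; a sketch, where the ratio value is illustrative rather than a recommendation from the post:

```shell
# Mode 2: don't overcommit beyond the configured ratio
sudo sysctl -w vm.overcommit_memory=2
# Percentage of RAM the kernel may commit (value is illustrative)
sudo sysctl -w vm.overcommit_ratio=80

# Persist the settings across reboots
echo "vm.overcommit_memory=2" | sudo tee -a /etc/sysctl.conf
echo "vm.overcommit_ratio=80" | sudo tee -a /etc/sysctl.conf
```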

[00:09:46] The next post is "PostgreSQL: simple C extension development for a novice user (and performance advantages)". Here they're talking about developing a C extension, and they go through the process of how you do it. So if you're interested in developing additional features on PostgreSQL, this is definitely a great post to check out. They created a very simple extension called addme that just adds two arguments together, written in C. Then, for comparison, they also wrote a PL/pgSQL function to see what the relative performance was like. They ran the timings: just using the straight SQL plus operator, because you're just adding two parameters together, they got a timing of 0.65, and using the C function call they also got 0.65. However, using the PL/pgSQL function call they got 1.4, so noticeably slower, and they graphed it out here to show the reduced performance. So creating a C extension definitely gets you more performance if you have custom functions you want to develop. Definitely an interesting blog post to check out if you're looking for a way to squeeze more performance out of PostgreSQL.
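A minimal C function of this shape, roughly matching the addme example described; this is a generic sketch of PostgreSQL's version-1 C calling convention, not the post's exact code:

```c
/* addme.c -- minimal PostgreSQL C function adding two int4 arguments */
#include "postgres.h"
#include "fmgr.h"

PG_MODULE_MAGIC;

PG_FUNCTION_INFO_V1(addme);

Datum
addme(PG_FUNCTION_ARGS)
{
    int32 a = PG_GETARG_INT32(0);
    int32 b = PG_GETARG_INT32(1);

    PG_RETURN_INT32(a + b);
}
```

After building it against the server headers (typically via PGXS), it would be registered with something like `CREATE FUNCTION addme(int, int) RETURNS int AS 'addme' LANGUAGE C STRICT;`.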

[00:11:02] The next post, if you're interested in doing more PL/pgSQL development, is "Using plpgsql_check to find compilation errors and profile functions". Basically, this is an extension that allows you to check for compilation errors in function code and find dependencies in functions, and it also offers function profiling. Now, if you use the profiling features, you do need to add this extension to the shared_preload_libraries setting in the postgresql.conf file. They have an example from a previous post where creating a function generated no error, but when they actually ran it through the plpgsql_check_function command, it found that a needed relation does not exist. So this could be a great benefit if you're doing a lot of PL/pgSQL development, and as they point out, it can also help you find dependencies as well as profile functions. So if you're doing a lot of PL/pgSQL development, definitely a blog post to check out.
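A sketch of how that check looks in practice; the function and table names here are my own, not the post's:

```sql
CREATE EXTENSION plpgsql_check;

-- This compiles without error even though the relation doesn't exist,
-- because the body is only parsed, not fully validated, at creation time.
CREATE FUNCTION broken_fn() RETURNS void AS $$
BEGIN
    INSERT INTO missing_table VALUES (1);
END;
$$ LANGUAGE plpgsql;

-- plpgsql_check finds the problem without running the function;
-- it reports that relation "missing_table" does not exist.
SELECT * FROM plpgsql_check_function('broken_fn()');
```

As the episode notes, shared_preload_libraries is only needed for the profiling features; static checking like the above works with just the extension installed.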

[00:12:09] The next post is "Postgres 12 highlight: replication slot copy", and this is from paquier.xyz. They're talking about replication slots, which are used in streaming replication, logical replication and logical decoding. There is now a new feature where you can actually copy a slot: pg_copy_logical_replication_slot and pg_copy_physical_replication_slot. One use case: maybe you want to do a backup with a permanent slot and then make a copy of it, which gives you the same restart LSN. Then two replicas or standbys can each get their own slot to follow when doing a restore.

[00:12:54] So that's pretty interesting. They also talked about scenarios where maybe you're doing logical decoding or logical replication, and you can create a slot at the LSN of an existing slot to be able to test decoding, maybe to inspect the output you're getting from something like pg_recvlogical. So definitely an interesting feature being added to PostgreSQL 12, and if you want to learn more, definitely check out this blog post.
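The two copy functions are called like ordinary SQL functions; a sketch with illustrative slot names:

```sql
-- Copy a physical slot so a second standby gets its own slot
-- starting from the same restart LSN (slot names are illustrative).
SELECT pg_copy_physical_replication_slot('backup_slot', 'standby2_slot');

-- Copy a logical slot, e.g. to test decoding from the same position;
-- optional extra arguments can make the copy temporary or switch the
-- output plugin.
SELECT pg_copy_logical_replication_slot('main_slot', 'test_slot');
```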

[00:13:20] The next post is "Waiting for PostgreSQL 13: add support for --jobs in reindexdb". So they're already talking about version 13, and there is actually a --jobs parameter added to parallelize certain work, in this case reindexdb. They show here that when they parallelize it with eight jobs, it completes in about 6.8 seconds, whereas when they didn't parallelize it, it completes in 24 seconds. So definitely an improvement. I know twelve hasn't been released yet, but they're already working on version 13.
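A sketch of the command line, where the database name is illustrative:

```shell
# Reindex all tables in 'mydb' using eight concurrent jobs (PostgreSQL 13+)
reindexdb --jobs=8 mydb

# For comparison, the serial run
reindexdb mydb
```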

[00:13:59] The last set of posts are all related to PostGIS. The first post, from cleverelephant.ca, is "PostGIS Overlays". Basically, if you have a particular overlay you want to compute, this goes through how you can get the combined set shown here. Now again, I'm not an expert on PostGIS, but this post covers how to do that, so if you're interested, definitely a post to check out. The next post is "Simple SQL GIS". This is a case where they had a particular area that included water, and they were able to use additional data of where the water exists and then use some SQL to compute a difference, creating this type of representation of the data. So if you have a use case for this, definitely a blog post to check out.

[00:14:57] And the last post is "Waiting for PostGIS 3: ST_AsMVT performance". They're talking about greater performance for this PostGIS function, and this is from crunchydata.com. So again, if you're interested in the upcoming PostGIS 3, definitely a blog post to check out.

[00:15:17] That does it for this episode of Scaling Postgres. You can get links to all the content mentioned in the show notes. Be sure to head over to scalingpostgres.com, where you can sign up to receive weekly notifications of each episode, or you can subscribe via YouTube or iTunes. Thanks.
