background

CTE Warning, PG 11 Features, Death by DB, Correlation | Scaling Postgres 31

Join Over 1,000 Engineers & Get New Episodes Weekly!

In this episode of Scaling Postgres, we review articles covering a CTE warning, Postgres 11 new features, death by database and column correlation.

Content Discussed

YouTube Video

Podcast Audio

Transcript



[00:00:00] In this episode of Scaling Postgres we talk about a CTE warning. Postgres Elevenfeatures death by database and column correlation. I'm creston. Jameson and this is scalingpostgres episode 31.

[00:00:22] Alright, I hope you're having a great week. Our first article is Be careful with CTE inPostgreSQL and this is from Hakai Benita. Now a CTE is a common table expression which youcould also think of as the with clause. So it's basically taking a subquery and putting it earlier inthe statement and being able to call it again. So he shows in a simple example of one here withCT as select all from Foo. So CTE is what he's named it and then he can then do select all fromCTE where an ID equals something. Now, what he's saying you need to be careful of is asimplistic example here where he has a foo table and he's looking up by an ID that has an indexon it and that lookup took less than one millisecond. Whereas using a CTE, of course you'redefining all the records in foo so that is going to take a while. But then you do a select all fromCTE where the ID equals the same and it suddenly takes 227 milliseconds. So basically what'shappening here is the CTE kind of materializes, what you're asking for basically caches it andthen can use that query again. And it's not, for example, pushing down this ID into the Cde here.It's essentially materializing. This part that you've done and he references some things from thedocs where it says, quote a useful property of with queries is that they are evaluated only onceper execution of a parent query. So basically you could use that with query multiple times and it'snot going to reexecute it. But again the quote says however, the other side of this coin is that theoptimizer is less able to push restrictions from the parent query down into a with query with anordinary sub query and again with the explained plan. You can see where just the simple querylooking for a particular where ID is using a index scan. Using the CTE example, it actually has todo a sequential scan. Now, unlike a CTE just using a basic sub query.

[00:02:35] So select all from and then do the sub query here select all from Foo where the IDequals that it is able to push that down. So you can see it does do a sequential scan in this caseand you get a query response of less than a millisecond. Now he also mentions Oracle actuallylooks different in that it is able to essentially inline the CTE and give the same response timewhether using it as a CTE or a subquery in quote Oracle is not materializing CTEs by default.

[00:03:12] So basically this is just a decision that PostgreSQL has adopted that you just need tobe aware of, that it does materialize its CTEs and because of that it can't push down predicatesinto your CTE from the parent query and if you're needing to do that, just use a subquery to do it.So it is a warning in terms of you should understand how PostgreSQL works and this was agreat post that goes through it and I encourage you to check it out.

[00:03:41] The next series of posts all talk about features coming in PostgreSQL eleven and ourfirst post is PostgreSQL eleven Something for Everyone. And this is from the LWN net and it'san article contributed by Peter Shogein. My apologies on that pronunciation, but this article goesthrough a lot of the major features of PostgreSQL eleven and a little bit of the background in it.So of course some of the biggest features are the partitioning changes and we'll go into someadditional articles and talk a little bit about this more in detail. But I thought this was probably thebest article to read about some of the upcoming changes and the reasoning behind it and whyit's happening now versus not earlier and things of that nature. But it goes over the partitioningimprovements, parallelism improvements, just In Time compilation of queries. And again, basedupon my understanding, this is something that you need to enable when you're actually buildingfrom source. So for example, using the Hyphen hyphen with Hyphen LLVM flag there's theintroduction of procedures. So not just functions, but procedures where you can actually docommits and rollbacks within them, the introduction of covering indexes. So basically you canadd additional columns to an index that doesn't necessarily index byte, but it includes additionalcolumns in essentially the index payload so that you can increase the number of index onlyqueries. And this is a great --one here the instant add a column with a default value so not having to rewrite the wholetable. So again, I definitely encourage you to read this if you're interested in a lot of the newfeatures.

[00:05:27] The next post highlights some additional features, but it's a little bit shorter of somethings that were notable to him. And this is postgres eleven. A first look, this is from the CraigKirstein's blog. Number one is quitting postgres. Typically you had to do a backslash Q, but nowyou can type Quit or exit. So this is great for new users to using postgres and then a headlinefear column addition no more. Essentially this is where adding a column with a not null and adefault essentially had to rewrite the whole table. He talks about performance improvementsgoing back to the parallelism and he's highlighting the capabilities of doing a parallel hash joincapabilities of doing parallel append. And I believe this comes in cases where you're doing aunion and have multiple select queries. It can parallelize that as well as parallel index creation,which again that's a great improvement. He mentions the Just in Time compilation that'spossible and he also covers creating statistics which wasn't mentioned in the previous post, butwith ten you could start defining multiple columns that you can create statistics across. But withpostgres eleven now you can create statistics based on expression indexes. And lastly, hementions some improvements with PG underscore pre Warm. So again another post talkingabout the new features coming in postgres eleven.

[00:06:56] Now my understanding is that postgres eleven is due to be released essentially anyweek now and they just released beta number four, so it's definitely going to be releasedrelatively soon.

[00:07:08] The next article is actually a PDF of slides that were given as a talk. Again, the title isMajor Features Postgres eleven. And this is by Bruce momgian. And again he goes over againthe partitioning improvements, the parallelism improvements, the introduction of storeprocedures, the executor stage compilation which basically is the just in time compiler,preventing table rewrite for alter table add column with non null default. Again, that's a huge onefor large databases. He mentions finer grained access control, so there's some additionalaccess control features that have been added you can check out in this presentation in additionto some write ahead log improvements. I believe this is where you can actually increase the sizeof the wall files that are being generated, which may be helpful for larger databases. Again, hereferences the allowing the quit and the exit when you're going to be exiting psql. So again this isabout 24 slides worth of information. And again, if you're looking to learn and understand moreof what's coming in postgres eleven, definitely another piece of content to check out.

[00:08:16] The next post is why covering indexes are incredibly helpful. And this is from theCrunchy Enterprise PostgreSQL blog. And essentially this is talking again about a feature thatwas mentioned in the previous piece of content, but it's dedicated to covering indexes. This iswhere you're adding additional columns to an index. It's not indexing on that column butessentially it's an additional payload so that when you use the index that data can immediatelybe used. It doesn't have to go to the heap to pull additional data. So essentially this increasesthe amount of index only scans you can have without that data exactly having been indexed. Sothis goes into detail about that and gives a very practical example of using it. So if you're adeveloper and potentially interested in this feature, this is definitely a blog post to check out. Tolearn more.

[00:09:08] The next post is Partitioning improvements in PostgreSQL eleven and this is from thesecond quadrant.com blog and again this just goes into more detail of the partitioningimprovements that we've discussed. So just to run those through quickly, PostgreSQL elevenhas introduced hash partitioning. So if you don't have a natural key to partition by, this allows youto just use a hash to do it. Update can now move rows between partitions if it needs to insert onconflict update now works across partitions or for partition tables. And you also have support fora default partition in a partition table. So if partition table is not necessarily targeted, data can beplaced in a default partition. --Better DDL support are basically easier to work with partition tables for DDL commands. Sofor example, you can use create index on a partition table whereas you used to have to do themindividually. You can do create unique constraints as well as primary key constraints as well as apartition table can have foreign key constraints. And then he talks about the performance work interms of pruning the partitions in version ten that only happened at one stage at query plan time.However, with eleven with some improvements it can now do this type of pruning at query plantime when the query parameters are received as well as at each point where one query nodepasses values as parameters to another node. So this should allow much more queries to targetonly the partitions that they're interested in. So definitely huge improvements in that area andthey also mentioned partitionwise joins as well as partitionwise aggregates. So again, this is apost dedicated to those partition improvements. If you are using partitioning or considering, itdefinitely a blog post to check out for PostgreSQL eleven.

[00:11:03] The next post is death byte database. And this is from the obid GitHub IO blog. Andit's basically kind of like a parable and a story of working with someone who had a particulardata model for their application and how that data model essentially constrained what they coulddo and all the workarounds that would have to be done to be able to make it work. They wantedit to, just because of how the database was modeled. So basically the post is emphasizing reallythinking about your data model and doing proper database design. Now, related to this, there isa linked YouTube video, it's a presentation that's approximately 45 minutes in length called howto fake a database design by Curtis Poe. And I definitely suggest you checking this out if you'rea developer and wanting to learn more about database design. I really like some of his keyconcepts to keep in mind when you're doing it. So this was a great piece of content to check out.

[00:12:01] The last post is Correlation of PostgreSQL columns explained. So basically this isexplaining that after you analyze a PostgreSQL table, the distribution statistics are located in thePG stat system view and there's a correlation column. So what is it correlating basically how welldoes the order you want in the index match the physical ordering of the data on the table? Andhe describes this as the logical ordering versus the physical ordering and a perfect correlationbetween that would be a one or a negative one depending upon the order and a zero meansthere is no correlation. So basically if you're using a sequence for primary ID. That primary keyindex that generally gets created should have a perfect correlation because all new data comingin has an increased index number that should be perfectly correlated with an ordered index.Whereas if you have random data being inserted into a value and you have an index on that,there's probably no correlation at all between the index and where the data is located in theheap. Now he mentions why should I care about this? Because when you have a goodcorrelation, you're going to get faster response times for your queries. Generally because there'sonly one place in the heap it has to pull it out and doing a sequential retrieval of that data shouldbe faster than having to do random I O all over the heap area. Now, he does an example herewhere he's creating two tables, one with a high correlation and a low correlation. So going backwhere you have an index and just a series of numbers, you see the correlation is one. Whereaswhen you are randomly ordering data in an additional table he's creating the correlation isessentially close to zero, which means no correlation. And you can see the different executiontimes when querying those tables. The correlated table returns in one millisecond and theuncorrelated returns in 2.7 milliseconds, so more than twice as slow. Now, are there some thingsyou can do in order to increase your correlation if you're really interested in speed and hementions some of this at the end of the post. If you need to efficiently scan bigger portions of anindex, it's good to keep the table in index order. However, there are no index order tables inPostgreSQL. However, a high correlation column can be maintained in two ways. First, he saysautomatically where table rows are inserted in logical column orde --r and there are no updates or deletes on the table. So the physical ordering should be identicalto the logical ordering. So again, this is the example I mentioned where you have a primary keycolumns generated by sequences or perhaps with timestamps. And the second way he says isby a clustering. So you can use the cluster command to actually rewrite a whole table so that thephysical ordering is identical to the logical ordering. But again, if you're having updates anddeletes, it's going to alter the order of that. So if you're really looking to push the performance,this is definitely something you may want to potentially consider, particularly if you have a largeanalytical database or maybe a data warehouse and you have more flexibility in terms ofordering data in the table, this might be something to consider.

[00:15:22] That does it. For this episode of scaling Postgres us. You can get links to all thecontent mentioned in the show notes. Be sure to head over to Scalingpostgres.com where youcan sign up to receive weekly notifications of each episode. Or you could subscribe via YouTubeor itunes. Thanks. --

episode_image