
Query Optimization, Normalization, Visualizing Vacuum, Sharding | Scaling Postgres 66


In this episode of Scaling Postgres, we discuss different query optimizations, normalizing to save space, visualizing vacuum, and sharding's future.


Transcript



[00:00:00] In this episode of Scaling Postgres, we talk about query optimization, normalization, visualizing vacuum, and sharding's future. I'm Creston Jameson and this is Scaling Postgres, episode 66.

[00:00:22] Alright, I hope you're having a great week. Our first piece of content is that PostgreSQL 12 Beta 1 is released. So you can download it now and try it out. The link is in the Postgres news and it goes over all the different feature highlights. I'm not going to go into depth on all of these because we've mentioned some of these features in previous episodes. But talking about index performance, functionality and management, mainly with regard to management, the reindex operation that can now be done concurrently, that's pretty huge, and also some performance and efficiency gains for different indexes. There are also inlined WITH queries, or basically common table expressions, where they're no longer materialized by default, but you can always add that option if you want. Again, different partitioning improvements, JSON path queries per the SQL specification, a different collations feature, most common value extended statistics, so you can create statistics to cover more complex relationships between columns, the generated columns feature, the new pluggable table storage interface, page checksums, and authentication and connection security changes. And also notable behavior changes that are definitely something to watch out for. One is that the recovery.conf configuration file is now merged into the main PostgreSQL configuration file, so that's definitely something to be aware of. Also, interestingly, JIT compilation is now enabled by default, and OIDs can no longer be added to user-created tables. So definitely a lot of new features coming in 12, some of them very welcome. So if you want to try it out, feel free to go ahead and download it.
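As a quick illustration of that CTE change (the orders table here is made up, not from the release notes): in PostgreSQL 12 a simple, side-effect-free CTE referenced once can be inlined into the outer query, and you can ask for the old behavior with the MATERIALIZED keyword, roughly like this:

    -- PostgreSQL 12: simple CTEs are inlined into the outer query by default
    WITH recent_orders AS (
        SELECT * FROM orders WHERE created_at > now() - interval '7 days'
    )
    SELECT count(*) FROM recent_orders;

    -- Force the pre-12 behavior (materialize the CTE once) if you relied on it
    WITH recent_orders AS MATERIALIZED (
        SELECT * FROM orders WHERE created_at > now() - interval '7 days'
    )
    SELECT count(*) FROM recent_orders;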

[00:02:06] The next article is a tale of query optimization, and this is from parallelthoughts.xyz. He talks about a query that they were running. I believe this is the Visual Website Optimizer platform.

[00:02:21] A query that a customer was running was slow, and as he started the investigation, this is what the query looked like. They actually have tables per account, so it wasn't another customer's workload causing the slowness, and they were trying to figure out why it was taking so long to run.

[00:02:38] Now, what I found interesting is that he shows the planning time and execution time, but he doesn't show the EXPLAIN plan anywhere. I really like posts like this where they look at a performance optimization and what they did to get queries running faster, and I usually like looking at the EXPLAIN ANALYZE output to see, okay, where exactly does it need to be improved, or at least to hypothesize, oh, that's the problem. But he didn't include that anywhere in this particular post that I could see. Maybe he did look at EXPLAIN ANALYZE, but what he did first was basically start looking at what could be the problem. Maybe it's the ILIKE doing the scan, so he looked at that query in isolation, and that wasn't too bad. So then he said, all right, is it the multiple join statements? He looked at all the different join statements, and that was pretty fast too. So he said maybe it's the URL subquery. He tried using an EXISTS instead, and that was pretty fast. So he tried moving the subquery into a CTE, trying to break up what was required, like he said, but it was still extremely slow. Again, for some of these tests, particularly this one, I would have liked to have seen the EXPLAIN ANALYZE plan to see if it targeted what the problem was. But basically the issue was the ampersand-ampersand (&&) operator. Now, he did mention he ran EXPLAIN ANALYZE when looking at this && operator, and he said there were multiple rows of just these filters, and he says, quote, which means that not only was this operation expensive, it ran multiple times. So he basically isolated the problem and ran select queries to pull out the data himself to be able to do the comparison, and then unnest what was there. As he says, we can look at this operation as traversing items in an array, or rows in a table, and stopping as soon as the condition match is met. So basically, bringing it all together, he redesigned the query like this, and the final execution time was 2 seconds. Now, I don't know if you caught how long the original one took, but the original query took 24 minutes. So going from that to 2 seconds is a huge advantage. So I think there are definitely some insights to gain, and if you want to look into this post in more depth, I definitely encourage you to check it out.
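As a rough, hypothetical sketch of the two operators being discussed (the table and column names below are invented, not the author's actual query): the array overlap operator versus unnesting the array and stopping at the first match.

    -- Array-overlap filter: checks whether two arrays share any element
    SELECT count(*)
    FROM events e
    WHERE e.url_ids && ARRAY[101, 205, 317];

    -- The same check written with unnest and EXISTS, the general shape the
    -- post moves toward: traverse the array and stop at the first match
    SELECT count(*)
    FROM events e
    WHERE EXISTS (
        SELECT 1
        FROM unnest(e.url_ids) AS eu(url_id)
        WHERE eu.url_id IN (101, 205, 317)
    );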

[00:05:15] The next post is how I decimated Postgres response times for my SaaS, and this is from the blog checklyhq.com. Now, this is again another query performance optimization post, but this one's a little bit simpler, so I'll skip to the end, where basically it was a single multicolumn index that drastically improved the performance. He had an index on a check ID, because they do API checks, and he had an index on a date field, and making it a multicolumn index for queries such as this, where the check ID is a certain value and it's ordered by that date-time field, resulted in an enormous performance benefit for him. So again, this is a simpler perspective on a query optimization, just using a multicolumn index to satisfy WHERE and/or ORDER BY clauses and make them run more efficiently. But if you're interested, definitely a blog post to check out.
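A hedged sketch of that pattern (table and column names are hypothetical, not taken from the post): one composite index that serves both the equality filter and the ordering.

    -- Separate single-column indexes on check_id and created_at each help a
    -- little, but a composite index matches the query's filter-then-order shape
    CREATE INDEX check_results_check_id_created_at_idx
        ON check_results (check_id, created_at DESC);

    -- A query like this can then walk the index in order and stop early
    SELECT *
    FROM check_results
    WHERE check_id = 42
    ORDER BY created_at DESC
    LIMIT 50;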

[00:06:19] The next post is Optimizing your App by Understanding your Postgres Database. The speaker is actually from Citus Data, and this was a presentation given at RailsConf, part of the Confreaks YouTube channel. The video is about 40 minutes in length, and it is a very, very good presentation, because he approaches it from the perspective of what your database statistics are telling you. So not just looking at a single query, but looking at the database holistically. The number one thing he says to look at first, for example, is your cache hit ratio, and to keep that in the 99% realm, and if it's not, to look at some ways to improve it. He also covers things like the pg_stat_database view, the pg_stat_user_tables and pg_stat_user_indexes views, to get an overall sense of how your database is running: how often vacuums are running, whether tables are being read via indexes or sequential scans, whether it's more updates, deletes, or inserts. He also covers examining pg_stat_statements, where you can get statistics on your overall queries against the database. So it's definitely a very good presentation that I encourage you to check out.
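For reference, a commonly used query for that cache hit ratio (a sketch of the general idea, not necessarily the exact query from the talk):

    -- Share of block reads served from shared buffers across all databases;
    -- values around 99% are the usual target for OLTP workloads
    SELECT round(
               sum(blks_hit) * 100.0
               / nullif(sum(blks_hit) + sum(blks_read), 0),
               2) AS cache_hit_ratio_pct
    FROM pg_stat_database;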

[00:07:32] The next post is Normalize to Save Space, and this is from Luca Ferrari at fluca1978.github.io. He basically looks toward normalization as a way to save space. He says, I don't tend to over-normalize data ahead of design.

So he had this case where a name existed on essentially a statistics table, or I should say a sensor table, and was also located in another one. So he decided to normalize it and remove the name from that table. And of course, when he did that, he went from an overall database size of 13 GB, not huge, down to essentially 9 GB. So it's just a short little post demonstrating how, when you practice normalization, you can actually decrease the amount of data your database contains.
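A hedged sketch of that kind of change (the table and column names are invented for illustration, not from the post): pull a repeated text column out of the large table and replace it with a small foreign key.

    -- Before: every reading repeats the sensor name as text
    CREATE TABLE readings_before (
        id          bigserial PRIMARY KEY,
        sensor_name text NOT NULL,
        value       numeric NOT NULL,
        read_at     timestamptz NOT NULL
    );

    -- After: the name lives once in a lookup table; readings carry an integer
    CREATE TABLE sensors (
        id   serial PRIMARY KEY,
        name text NOT NULL UNIQUE
    );

    CREATE TABLE readings (
        id        bigserial PRIMARY KEY,
        sensor_id integer NOT NULL REFERENCES sensors (id),
        value     numeric NOT NULL,
        read_at   timestamptz NOT NULL
    );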

[00:08:25] The next post is Visualizing PostgreSQL Vacuum Progress, and this is from Dave Pacheco's blog at dtrace.org. Now, this is a pretty long post, but it's very interesting. He goes over how they visualized vacuum's progress, because vacuum for a lot of people is a black box, and there is one area that remains a black box even using the new pg_stat_progress_vacuum view that has been added in recent versions. He's using Grafana to visualize the progress of vacuum. So this is a single table being vacuumed. The blue line is the total number of heap blocks that need to be scanned, and the view can tell you how many heap blocks have been scanned, so you can see over time it's progressing toward that goal, essentially. Now, the green line indicates what vacuum phase it's in. So basically, this is scanning the heap, and then it goes into the next phase, which is vacuuming indexes. It vacuums each index, identifying and removing references to dead tuples, essentially. And then it goes on to actually vacuuming the heap and removing those tuples. So you can see here that the heap blocks being vacuumed go up to the heap blocks total over time. So the view does give us insight into when things are being scanned and when things are being vacuumed. But this middle part here, the vacuuming indexes phase, we have essentially no idea what's going on. Now, with a manual vacuum, you can run it in verbose mode and it will tell you each index as it's happening, so you can get some insight there, but it's not contained within the pg_stat_progress_vacuum view. Now, this is an interesting visualization, and throughout this post he talks all about how vacuum operates, so again, this is another great post I suggest you check out. But going back here to multiple index scans: if maintenance_work_mem is not high enough, vacuum actually has to do multiple index scans to vacuum up all the dead tuples. This more complex graph kind of explains it, where it goes through the phase of scanning the heap, doing the index scans, vacuuming the heap, but then it must go back into scanning the heap again because maintenance_work_mem wasn't sufficiently sized to hold everything that needed to be vacuumed in memory. So it must basically do a second pass of everything. So definitely a message to make sure your maintenance_work_mem is sufficiently sized. So if you're having large vacuums, maybe you want to look into this type of visualization to know what's going on and how many times you're essentially doing a vacuum, like this demonstrates here, where it had to essentially vacuum twice because maintenance_work_mem wasn't sufficiently set. Now, he does provide the view here showing their vacuum progress layout in Grafana. So again, I find this super interesting, and he goes into a lot of depth explaining how things work. So if you want more insight into your vacuum process, or even just to know a little bit more about it, it's definitely a blog post I suggest checking out.
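If you want to poke at the same data the post is graphing, the progress view can be queried directly; a minimal sketch (the join to pg_class for the table name is an addition for readability, not something from the post):

    -- Shows each running vacuum's phase and how far through the heap it is,
    -- plus how many index-vacuum passes have been needed so far
    SELECT c.relname,
           p.phase,
           p.heap_blks_total,
           p.heap_blks_scanned,
           p.heap_blks_vacuumed,
           p.index_vacuum_count
    FROM pg_stat_progress_vacuum p
    JOIN pg_class c ON c.oid = p.relid;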

[00:11:39] The next post is Introducing pg_auto_failover, a high availability and automated failover Postgres extension, and this is from the citusdata.com blog. Now, in a previous episode, I believe it was episode 63, I mentioned pg_auto_failover; that was from the Microsoft blog announcing it. This one again makes the announcement on this blog, and it gives a good introduction to how there is a monitor, which is a Postgres database itself with the auto failover extension, and it basically registers and checks the health of the different active Postgres nodes. So it's basically a self-contained high availability solution, or I shouldn't say self-contained, but this one open source project is a high availability Postgres solution. You register each Postgres node with the monitor, and the monitor basically manages these Postgres instances for you, and if it notices that one goes down, it switches the traffic to a new one. It goes over in depth how this works, it has a good flowchart of their finite state machine, and it even goes into exactly what commands you need to run to make it work. So again, this is a good complement to the previous post I covered. So if you're interested in a high availability solution that looks relatively easy to set up, definitely check out this blog post and the previous one in episode 63 and try it out. I'm actually going to be looking into this myself, because I do find it pretty interesting and wonder how easy it will be to set up for perhaps some of my systems.

The next post is Swoop Dedup, and this is from the 2ndQuadrant.com blog. They're talking about the problem of duplicate rows, and they have an imaginary scenario where they created a contacts table with no controls in it, no keys or anything to prevent duplicate records. Let's say you used COPY to load it multiple times because there were issues: it's supposed to have 26 records, but it has 104. Here they use the solution of first creating a primary key as a serial type, and then they use a window function, ROW_NUMBER() OVER with a PARTITION BY on the different columns, to find duplicate names, and then do a DELETE statement from that query where the row number is greater than one. So eliminate, for example, the rows where the row number is two or three, anything more than one, removing those rows and keeping the first one. So this is one solution you can use to remove duplicate data from your tables.
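As a hedged sketch of that deduplication approach (the table and columns below are invented here, not necessarily those in the post):

    -- Add a surrogate key so otherwise-identical duplicate rows can be told apart
    ALTER TABLE contacts ADD COLUMN id serial PRIMARY KEY;

    -- Number the copies within each group of identical values and
    -- delete everything past the first copy
    DELETE FROM contacts
    WHERE id IN (
        SELECT id
        FROM (
            SELECT id,
                   row_number() OVER (
                       PARTITION BY first_name, last_name, email
                       ORDER BY id
                   ) AS rn
            FROM contacts
        ) numbered
        WHERE rn > 1
    );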

[00:14:12] The next post is The Future of Postgres Sharding by Bruce Momjian, and this is from momjian.us. Again, this is something that's coming in the future; Postgres is slowly moving toward having sharding built in. He talks about scaling up, scaling out, and what exists for sharding today, where essentially, relying on the Postgres foreign data wrapper, you can send data to multiple foreign servers, as they're called, and with the most recent version, now 11, you can push down different types of aggregates to those servers. What they're saying here is that some of these operations need to be done serially, for example when calculating aggregates, but they basically want to have parallel shard access, so that's one of the next stepping stones. The other thing is joins with replicated tables. So for example, maybe you have certain tables you want to exist on each shard to make push-down queries easier when looking at data on a particular shard, or in a particular foreign database. So they're looking into that, as well as, of course, shard management and global snapshots. So this is a pretty brief presentation, but it puts forth some of the additional features that are needed to make sharding a built-in component of community PostgreSQL.
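To make the FDW-based approach he's describing concrete, here's a rough, hypothetical sketch (the server names, tables, and connection options are all made up): hash-partitioning a table and placing a partition on a remote server through postgres_fdw.

    -- On the coordinating server: declare a remote server and a user mapping
    CREATE EXTENSION IF NOT EXISTS postgres_fdw;

    CREATE SERVER shard1
        FOREIGN DATA WRAPPER postgres_fdw
        OPTIONS (host 'shard1.example.com', dbname 'app');

    CREATE USER MAPPING FOR CURRENT_USER
        SERVER shard1
        OPTIONS (user 'app', password 'secret');

    -- A hash-partitioned parent table whose partitions can live on remote servers
    CREATE TABLE measurements (
        device_id bigint NOT NULL,
        read_at   timestamptz NOT NULL,
        value     numeric
    ) PARTITION BY HASH (device_id);

    -- One partition stored remotely via the foreign data wrapper
    -- (a matching measurements_p0 table must already exist on shard1)
    CREATE FOREIGN TABLE measurements_p0
        PARTITION OF measurements
        FOR VALUES WITH (MODULUS 2, REMAINDER 0)
        SERVER shard1;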

[00:15:34] The next post is Parallel PostGIS and PostgreSQL 12. Now, the TLDR, he says, is that PostgreSQL 12 and PostGIS 3 have finally cracked the parallel spatial query execution problem, and all major queries execute in parallel without extraordinary interventions. Now, we mentioned this previously on Scaling Postgres, where you basically had to alter costs to trigger some of the PostGIS queries to run in parallel, but with the default costs in these versions you can do parallel sequential scans, aggregates and joins. So he's particularly pleased by this capability. So if you use PostGIS, maybe you want to start checking out PostgreSQL 12 to see if it can benefit your queries.
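For context on those earlier "extraordinary interventions", a hedged sketch of the general kind of workaround people used (not the post's exact recipe): making parallel plans look cheaper to the planner, or raising the declared cost of a spatial function so a parallel plan is considered worthwhile.

    -- Per-session nudging: lower the parallel cost thresholds
    SET parallel_setup_cost = 100;    -- default is 1000
    SET parallel_tuple_cost = 0.01;   -- default is 0.1

    -- Or raise the planner cost attached to the spatial function itself
    ALTER FUNCTION ST_Intersects(geometry, geometry) COST 1000;

    -- Then check the plan for parallel workers
    EXPLAIN (ANALYZE)
    SELECT count(*)
    FROM parcels p
    JOIN zones z ON ST_Intersects(p.geom, z.geom);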

[00:16:24] The next post is Deploying Active-Active PostgreSQL on Kubernetes. This post is using SymmetricDS, which I believe is a way of doing a master-master type PostgreSQL setup using a trigger-based replication system. So if you're wanting to use Kubernetes with PostgreSQL, it's definitely a blog post to check out.

[00:16:46] That does it for this episode of Scaling Postgres. You can get links to all the content mentioned in the show notes. Be sure to head over to Scalingpostgres.com, where you can sign up to receive weekly notifications of each episode, or you can subscribe via YouTube or iTunes. Thanks.
