
Scaling Out, Planner Estimation, Create Statistics, Stay Curious | Scaling Postgres 103


In this episode of Scaling Postgres, we discuss how to scale out, how the planner estimates, uses of CREATE STATISTICS, and investigating PostgreSQL running environments.


Transcript



[00:00:00] In this episode of Scaling Postgres, we talk about scaling out, planner estimation, CREATE STATISTICS, and staying curious. I'm Creston Jamison, and this is Scaling Postgres, episode 103.

[00:00:21] All right, I hope you're having a great week. Our first piece of content is "Preparing your Postgres data for scale out", and this is from dev.to. They're discussing the different preparations you would need to make if you're wanting to potentially scale out your database. Now, they talk about two different types of scaling: scaling up, which is basically putting your database on a larger instance, giving it more memory, CPU, disk, et cetera, and scaling out, which is when you add one or more instances and try to scale across multiple databases. And they have a little table that covers the ease of doing both, the effectiveness, the unique benefits, and the gotchas. Basically, the easiest is definitely the scale up. It's relatively quick to do; the gotcha is a single point of failure. Whereas scaling out requires more planning up front, but it does have some interesting benefits in terms of keeping the database online. Now, one of the first areas they cover is whether you want to do read-only or read-write. Read-only is just basically setting up some read replicas, which are relatively easy to do, and we have some tutorials at Scaling Postgres that discuss how to do that. Read-write is harder, because then you're going to actually have to shard your data across different databases. Now, going into sharding, they talk about different patterns you can use. You could do range-based partitioning, and they use the example here where you have a set of last names and you want to put certain last names into separate database systems. Now, it's a contrived example, but it still demonstrates what range-based partitioning means. But of course, the disadvantage is that you can get hotspots with it. Like they mention, there are tons of people with the last name Smith in the United States.
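
To make that concrete, here is a minimal sketch of the range pattern using PostgreSQL's declarative partitioning. The users table, its columns, and the split points are all hypothetical, and in a true scale-out each partition would live on a separate server rather than in a single database as shown here, but the routing idea is the same:

    -- Hypothetical example: route rows by last-name range.
    CREATE TABLE users (
        id        bigint,
        last_name text NOT NULL,
        email     text
    ) PARTITION BY RANGE (last_name);

    -- "Shard" 1: last names starting A through M.
    CREATE TABLE users_a_m PARTITION OF users
        FOR VALUES FROM ('A') TO ('N');

    -- "Shard" 2: last names starting N through Z.
    CREATE TABLE users_n_z PARTITION OF users
        FOR VALUES FROM ('N') TO (MAXVALUE);

Note the hotspot risk mentioned above: with boundaries like these, the partition holding all the Smiths will grow much faster than the others.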

[00:02:04] The next type of partitioning you can consider is vertical partitioning. This is basically a service-based approach. So maybe you have some microservices, and each of those has its own separate database. With that, each service can be scaled up, and each individual database that is responsible for part of that service doesn't have to be as large. But as you scale, you may eventually get to the point where the database for one of those services is too large and you need to scale it out. Then they go into hash-based partitioning. Basically, you have some sort of value, you create a hash of it, and you assign data to a particular partition based on that hash. Now, I think a lot of solutions like Citus may be using something similar to this, or some sort of partitioning key, in order to do sharding, so I would say this is probably one of the more popular ones. And then lastly, they talk about directory-based partitioning. Basically, you have some service that sits between your app and the database and dictates where the data goes. This is probably more complex, because you have to have that separate service that interprets where the data should go. And then of course there's difficulty when you actually do decide to shard, because eventually you may want to bring that data back together, and then how do you do cross-shard joins and things of that nature? But this blog post is a pretty good overview of the things you need to consider if you eventually get to the point where you actually want to scale out across multiple database instances. So if you're interested in that, it's definitely a blog post to check out.

The next piece of content is "Planner selectivity estimation error statistics with pg_qualstats 2". Quite a mouthful. This is from rjuju's blog at rjuju.github.io, and he's talking about pg_qualstats, specifically pg_qualstats 2, which is an extension (I'll have a link here) that keeps statistics on the predicates found in WHERE statements and JOIN clauses. So, for example, I believe pg_stat_statements does not record the exact values being used, whereas this is something that pg_qualstats tries to track. Now, they had a previous post, which they discuss here, about being able to do index prediction, that is, what indexes would be good to add. Whereas this particular feature is about looking at relationships between columns within a table, in terms of estimating whether they need additional statistics. By default, statistics are tracked per column, and they don't track relationships between columns. So, for example, Postgres doesn't track how many zip codes are in a state, or any kind of relationship like that. He developed an example where the planner, even after a vacuum analyze, believed it would come up with 12,500 rows, but the query actually returned zero rows; there were no rows that matched. So this is an example of the planner producing its estimate by looking at the statistics of each column individually and trying to draw a conclusion about both of them, because the WHERE statement takes both into consideration. There is a relationship between these columns, but the planner doesn't know that. So it's making an estimate of, say, half the rows match this and half the rows match that, looking at them separately to get some kind of estimate, but it's very inaccurate, because there are actually zero rows. And he goes through the source code, showing how the planner comes to this conclusion, and looks at the pg_stats view to get that information.
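
As a rough illustration of the kind of misestimation he demonstrates (this is not his exact reproduction; the table t and its values are made up), two perfectly correlated columns will fool the per-column statistics:

    -- Hypothetical reproduction: val1 and val2 are perfectly correlated.
    CREATE TABLE t AS
        SELECT i % 100 AS val1, i % 100 AS val2
        FROM generate_series(1, 100000) AS i;
    ANALYZE t;

    -- The planner multiplies the individual selectivities of val1 and val2,
    -- so this contradictory predicate still gets a nonzero row estimate
    -- even though no row can ever match (val1 always equals val2 here).
    EXPLAIN SELECT * FROM t WHERE val1 = 1 AND val2 = 2;
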
And then it gets worse when you try to do a join. In this example here, it thinks it's going to find 313,000,000 rows where there are actually zero rows. So it can get worse. Now, what pg_qualstats, at least version 2, can do is help detect this problem. So here's an example of a query you can run against the pg_qualstats table that's part of the extension. When looking at one column in isolation, the mean ratio, which indicates how relatively accurate the statistics are, is close to one. So just looking at one column, it gives a good estimate. But looking at one column in relation to another, meaning ANDed with another, as he references here for the val2 column, the mean ratio is huge; it's very far from one. So what that means is that these columns would be good candidates for creating statistics on, so you can define the relationships between them when they're looked at in combination. And he covers extended statistics, where you can actually use CREATE STATISTICS on these two values so the planner understands the relationship. Now, when you do the query after that, you can see it expected one row and got zero rows, so it's much more accurate; it's not, say, hundreds of millions off. With that, he also mentions some slides that Tomas Vondra put together for a talk on the subject of CREATE STATISTICS. And actually, a webinar was published this week called "Webinar: All you need to know about CREATE STATISTICS [Follow Up]". So this is an hour-long webinar about CREATE STATISTICS. You can just click the link here and register for it, and you get immediate access to watch it. So if you want more information about CREATE STATISTICS and how it can help you educate the planner on relationships between columns, it's definitely a webinar to check out.
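
Here is a sketch of the kind of fix he describes, reusing the hypothetical table t from the sketch above. The statistics name is made up, and the mcv statistics type requires PostgreSQL 12 or later:

    -- Tell the planner which (val1, val2) combinations actually occur.
    CREATE STATISTICS t_val1_val2_stats (mcv) ON val1, val2 FROM t;
    ANALYZE t;  -- extended statistics are built during ANALYZE

    -- The combined predicate's row estimate should now be close to the
    -- real answer instead of the product of two independent guesses.
    EXPLAIN SELECT * FROM t WHERE val1 = 1 AND val2 = 2;
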

[00:07:35] The next piece of content is actually a YouTube video: "PostgreSQL at low level: stay curious!", from the All Systems Go YouTube channel. What he covers here is not so much PostgreSQL itself, but how PostgreSQL runs on an operating system and in containers, and all sorts of different tools you can use to see what kind of behavior you're getting from your Postgres instance at a low level, wherever Postgres runs: is it running in a Kubernetes cluster, in a VM? He talks about all the different tools to analyze it, from perf to strace. So if you want to learn more about the actual environment you're going to be running PostgreSQL in, it's definitely a YouTube video to check out.

[00:08:20] The next piece of content is "Researching PostGIS slowness (2019 edition)", and this is from 2ndquadrant.com. He's talking about a customer they were working with who had a particular query where a difference of just 0.001 between two values used in calculating a point made the query 60 times slower. They tried to replicate the environment using the exact versions of PostgreSQL, PostGIS, and the proj library, and they couldn't see any difference. Then they tried Ubuntu 18.04, and finally they saw the difference, and what it came down to was actually a difference in glibc versions. Once they upgraded to the most recent glibc version, both queries became fast. So this is just something to keep in mind: PostgreSQL is great, but it also relies on other libraries, as does PostGIS, and on other things in your environment. So be sure to stay aware and keep up on patches and upgrades for all the different libraries you're using in your system, to make sure you're getting optimum performance. If you're interested in this story, go ahead and check out this blog post.

[00:09:34] The next piece of content is "Optimizations in GROUP BY and SELECT DISTINCT". This is from thatguyfromdelhi.com, and he's talking about differences in performance when doing a SELECT DISTINCT of a certain number of columns from a table versus doing a GROUP BY. Sometimes GROUP BY can give you better performance, and it has some specific optimizations, such that if a certain column is present, it only uses that column for the group key. He also made an interesting observation that a unique index wouldn't trigger this, but a non-unique one would, and he looked over different variations of choosing different columns and different keys to see what the differences were. So it's an interesting blog post to check out if you're interested in learning about specific optimizations you could be using with GROUP BY or SELECT DISTINCT.
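
As a simple illustration (the table and columns here are hypothetical), these two queries return the same rows, and comparing their plans is essentially what the post does:

    -- Both return the unique (col1, col2) combinations, but GROUP BY can
    -- sometimes pick a different, occasionally better, plan than DISTINCT.
    SELECT DISTINCT col1, col2 FROM some_table;

    SELECT col1, col2 FROM some_table GROUP BY col1, col2;

    -- Compare the plans and timings yourself:
    EXPLAIN ANALYZE SELECT DISTINCT col1, col2 FROM some_table;
    EXPLAIN ANALYZE SELECT col1, col2 FROM some_table GROUP BY col1, col2;
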

[00:10:25] The next piece of content is "Creating a PostgreSQL procedural language - Part 3 - Executing user code". This is part three of a series we've covered in previous episodes, about building a procedural language using Julia, and this is the next phase, where they actually execute user code. So if you've been following along, the next blog post is available.

[00:10:46] The next piece of content is actually a new website that seems to have been put up recently. It's called PostgreSQL Life, and it looks like its intent is to interview a PostgreSQL person of the week. As of February 27th, they've posted the first interview. Quite interesting; maybe check this on a weekly basis to see all the different people working in Postgres.

[00:11:14] The next piece of content is "Parallel Vacuum in upcoming PostgreSQL 13". This talks about the parallel vacuum coming in 13 that we've discussed before. They cover how you can enable it, the different settings that will be available, and they looked a little bit at its behavior. It's a little early, so they had some unexpected results so far. But interestingly, because you have to give it a PARALLEL option when you run vacuum, I wonder if autovacuum cannot do it in parallel yet and you have to do it as a manual vacuum. I didn't see any reference to that, but it's a question I'll have to keep track of as version 13 gets closer to release later in the fall. And this is from the highgo.ca website.
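
For reference, the manual invocation in PostgreSQL 13 looks like this; the table name is hypothetical, and the PARALLEL option controls how many workers process the table's indexes:

    -- Vacuum my_table using up to 4 parallel workers for index processing.
    VACUUM (PARALLEL 4) my_table;

    -- PARALLEL 0 explicitly disables parallelism.
    VACUUM (PARALLEL 0) my_table;
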

[00:11:56] The next piece of content is on recursive queries. This is from habr.com, and it looks like it's been translated from the Russian original at postgrespro.ru. It's all about recursive CTEs, using WITH RECURSIVE. So if you're interested in this content, it's definitely a blog post to check out.
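
As a quick refresher on the syntax the article covers (the employees table here is hypothetical):

    -- Classic example: walk an employee hierarchy from the top down.
    WITH RECURSIVE subordinates AS (
        SELECT id, manager_id, name
        FROM employees
        WHERE manager_id IS NULL                     -- anchor: the root
        UNION ALL
        SELECT e.id, e.manager_id, e.name
        FROM employees e
        JOIN subordinates s ON e.manager_id = s.id   -- recursive step
    )
    SELECT * FROM subordinates;
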

[00:12:15] The next series of posts are all about PostGIS. The first one is "PostGIS Day in STL", meaning St. Louis, and there are a number of presentations about PostGIS that were given there. This is from the cleverelephant.ca blog.

[00:12:32] The next piece of content is "Visualizing OSM data in QGIS", and this is from cybertec-postgresql.com. So if you have an interest in doing this kind of mapping with OpenStreetMap data, it's definitely a blog post to check out.

[00:12:46] The next post is "Setting up SSL certificate authentication with Pgpool-II", and this is from highgo.ca. With version 4 of Pgpool-II, they added some additional, as they say, secure authentication features, one of them being SSL. So they talk about how you can generate an SSL certificate and start using it with Pgpool-II.

[00:13:12] And the last piece of content is "How to use the SVM Machine Learning Model with 2UDA - PostgreSQL and Orange (Part 2)". So again, this is about machine learning; I don't have a background in this, but part two of this blog post is available from 2ndquadrant.com.

[00:13:33] That does it for this episode of Scaling Postgres. You can get links to all the content mentioned in the show notes.
