
Postgres Rise, Row Count Estimates, Nulls, Linux Config | Scaling Postgres 44


In this episode of Scaling Postgres, we review articles covering the rise of Postgres, row count estimates, nulls, and Linux configuration.

Content Discussed

YouTube Video

Podcast Audio

Transcript



[00:00:00] In this episode of Scaling Postgres, we talk about the rise of Postgres, row count estimates, nulls, and Linux configuration. I'm Creston Jamison, and this is Scaling Postgres episode 44.


[00:00:20] Well, I hope everyone had a great holiday. I know we still have New Year's coming up, so I hope your New Year's is a good one. Again, we're still kind of light in the content department, so there are just four posts that I've selected this week.

The first is "PostgreSQL On The Rise: 2018 Postgres Findings & 2019 Trends". This is from the Severalnines.com blog, and it talks about the rise of Postgres and goes over some different resources that it pulls together, so it's an interesting post to look over. For example, they look at the DB-Engines ranking, which shows that PostgreSQL is number four in terms of most-used databases and number two among open source databases. But what's super interesting is how much its score has increased compared to all the other database systems they track. It also mentions that DB-Engines declared PostgreSQL the database management system of the year for 2017, so we'll have to see what 2018 brings. Even in the Stack Overflow survey, it's in third position among the top used database technologies. They also reference the article "Why Oracle's Missteps Have Led to PostgreSQL's Moment" along with another one, "Has the Time Finally Come for PostgreSQL?", and then they highlight a few of their best blog posts. So if you want to look at where PostgreSQL is trending, this is a blog post to check out.

The next article is "Row count estimates in Postgres". This is from Medium.com, on the pgMustard blog, and it basically talks about row count estimates and why they're important. Primarily, they're important so the query planner can make good decisions about which plan to use to execute your queries. Postgres keeps track of these estimates by analyzing statistics about the data in each of your tables on a periodic basis, usually along with autovacuum, although you can also run an ANALYZE manually, or do a VACUUM ANALYZE manually. They talk about a tool they have called pgMustard that helps you diagnose and fix performance issues with your PostgreSQL database, and they flag it when there's a factor-of-ten difference between how many rows the planner thinks are there, based upon the statistics, and how many rows were actually returned for a particular query. Then they go into how not to fix your plan: basically, you don't want to try tricks to force the choice of one index over another or turn off certain planner features. What you should probably do instead is help the planner out. So increase the frequency with which those statistics are collected so they're more up to date; or, if you have some jobs that are drastically changing the data in tables, maybe you could alter those. If not, you can either run ANALYZE or VACUUM manually more frequently, or you can adjust the default statistics target in the configuration, or set the statistics target on a per-column basis for specific tables. And then lastly, they follow up with multivariate statistics. These are statistics that you can set up; they are not collected automatically, but if you are aware that there is some relationship between two pieces of data, like certain cities, most of the time, are only in certain states, or certain states are only in certain countries, you can define that relationship in statistics starting with PostgreSQL 10. So those are some tricks and tips to hopefully get better plans for your queries.
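As a rough sketch of the tips mentioned above, here is what those manual statistics options can look like in SQL. The table and column names (orders, city, state) and the statistics object name are hypothetical, and the multivariate statistics object requires PostgreSQL 10 or later.

```sql
-- Refresh the planner's statistics for one table right away (hypothetical table name)
ANALYZE orders;

-- Collect finer-grained statistics for a single column of that table
ALTER TABLE orders ALTER COLUMN city SET STATISTICS 500;

-- Or raise the cluster-wide default and reload the configuration
ALTER SYSTEM SET default_statistics_target = 200;
SELECT pg_reload_conf();

-- PostgreSQL 10+: declare that two columns are correlated so the planner
-- stops treating their selectivities as independent
CREATE STATISTICS orders_city_state_dep (dependencies) ON city, state FROM orders;

-- Statistics objects are only populated by the next ANALYZE
ANALYZE orders;
```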
The next post is "A PostgreSQL story about NULL IS NULL = NULL AND NOT NULL". So basically this is a post about nulls. If you're not familiar with what they are exactly and how to use them, it's a good post to go over. But just to summarize it, nulls are undefined. So if you do SELECT 10 = NULL, it's not going to tell you false; it has no idea what null is. It's undefined: it could be ten, it could not be ten. So it just returns null itself.

[00:04:34] Even if you say something undefined equals something undefined, well, maybe, maybe not; we don't know, it's undefined. And if you want to actually get a true or false value out of it, you have to do SELECT NULL IS NULL, or you can do the check IS NOT NULL. And in that initial example, where you do SELECT 10 = NULL, you can do SELECT 10 IS NULL, and it will actually give you a false return value. You can also do this for a row of data: the ROW function lets you build a row of data. Now, I don't necessarily know how useful this would be, but they do have an example where you have a row with all nulls, and you can do a test saying it IS NULL, which is true, whereas if any of the values are not null, it's going to give you a false value. They also talked about null handling in LIMIT clauses. So if you do a LIMIT using null, it's basically just going to return all rows; it's not going to enforce a limit. And then they talk about ORDER BYs. In an ORDER BY you can specify whether the nulls are going to come first or last, so just keep in mind where the nulls are going to be if you're ordering data. And then they talk about aggregate functions, where generally they ignore null values, except in the case of count(*). So if you're looking to learn about how nulls work inside PostgreSQL, this is definitely a blog post to check out.
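To make the null behaviors described above concrete, here is a small sketch you can run in psql; the people table and its email column are hypothetical.

```sql
SELECT 10 = NULL;                -- NULL, not false: the comparison is unknown
SELECT NULL = NULL;              -- also NULL: two unknowns are not known to be equal
SELECT 10 IS NULL;               -- false
SELECT NULL IS NULL;             -- true

SELECT ROW(NULL, NULL) IS NULL;  -- true: every field of the row is NULL
SELECT ROW(1, NULL) IS NULL;     -- false: at least one field is not NULL

-- LIMIT NULL does not enforce any limit; all five rows come back
SELECT * FROM generate_series(1, 5) AS g LIMIT NULL;

-- Choose where NULLs sort in an ORDER BY (hypothetical table/column)
SELECT * FROM people ORDER BY email NULLS LAST;

-- Aggregates skip NULLs, except count(*)
SELECT count(*) AS all_rows, count(email) AS rows_with_email FROM people;
```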

[00:05:55] The last post is "Benchmark PostgreSQL With Linux HugePages". So this is a benchmarking post, and I always find it interesting to see the best ways to configure your system. But of course, they mention here that you definitely want to do this in your own environment and for your own application: make changes and then test to ensure that those changes give you a performance improvement versus a decline. They show their benchmark machine, and I should say this is all about Linux kernel changes, so not so much configuration of PostgreSQL; they do list the PostgreSQL configuration they started with, but their focus was adjusting the Linux configuration parameters. The first thing they do is disable transparent huge pages, and their belief is that disabling this feature and defaulting to classic huge pages is always recommended. From there they look at different huge page configurations, and they show the benchmark results with huge pages set at 4 megabytes and 1 GB, with different client counts and database sizes. Generally, for this setup, they tended to get the best transactions per second with the highest huge pages setting. But again, there's a little bit of a caveat: the largest database size they went up to was, I believe, 96 GB.
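If you want to see how this relates to your own instance, here is a minimal sketch from the SQL side; it assumes PostgreSQL 9.6 or later for pg_size_bytes and the common 2 MB huge page size on x86_64. The actual huge page reservation is a Linux kernel setting (vm.nr_hugepages), not something you change from SQL.

```sql
-- Is PostgreSQL attempting to use huge pages for shared memory? ('try' is the default)
SHOW huge_pages;

-- The allocation that benefits most from huge pages
SHOW shared_buffers;

-- Rough estimate of how many 2 MB huge pages shared_buffers alone would need;
-- the real shared memory segment is somewhat larger than shared_buffers
SELECT ceil(pg_size_bytes(current_setting('shared_buffers'))
            / (2 * 1024 * 1024.0)) AS approx_2mb_hugepages;
```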

[00:07:16] And I regularly work with databases in the terabytes range, so I'm not so sure how that would fit with a terabyte-range database, because their conclusion here is that deciding on the size of the huge page to use requires a bit of trial and error, but it can potentially lead to significant TPS gains where the database size is large but remains small enough to fit in the shared buffer. Well, if you're at a multi-terabyte stage, I don't think that's going to fit in the shared buffers of a database system. So that's just something to keep in mind if you look over these benchmarks. That does it for this episode of Scaling Postgres. You can get links to all the content mentioned in the show notes. Be sure to head over to Scalingpostgres.com, where you can sign up to receive weekly notifications of each episode, or you can subscribe via YouTube or iTunes. Thanks.
