
Statistics Speed Up, Synchronous Commits, TLS Updates, Avoiding Cursors | Scaling Postgres 128


In this episode of Scaling Postgres, we discuss speeding up performance with statistics, setting synchronous_commit, updates to TLS settings and how to avoid cursors.

Transcript



[00:00:00] In this episode of Scaling Postgres, we talk about a statistics speed up, synchronous commits, TLS updates and avoiding cursors. I'm Kristen Jameson, and this is Scaling Postgres, episode 128.

[00:00:22] All right, I hope you, your family, friends and coworkers continue to do well. Our first piece of content is How we used Postgres extended statistics to achieve a 3,000x speedup, and this is from affinity.co. So they had a query here that was giving them poor performance, and when they ran an EXPLAIN ANALYZE on it, they noticed that the statistics thought it was going to pull one row as a result of this join. But when the query was actually run, it was pulling over 2,000 rows, and it was taking a lot longer than they anticipated, on the order of minutes to run. So whenever you have a difference between what the planner thinks the results should be and the actual results, you want to check out your statistics. So that's basically what they did. Now, this post then goes into the basics of statistics: how Postgres keeps statistics on each column. It doesn't do cross-column statistics by default, which we'll mention here in a second, but it does it per column, and it takes a sampling of it. And you can set the statistics target for the database itself, but you can also set it for individual columns, as they indicate here. Now, extended statistics actually allow you to define additional statistics where you're defining relationships between columns. In the example they're using here, say you have artists, albums and songs, where there's a relationship between each of these. A song belongs to a particular album, typically, maybe more than one, and that belongs to an artist, although maybe more than one in that case too. But there's a different relationship: there's a much smaller number of albums and artists for a given song. So that's not a randomized distribution; there's correlation between them. And when you have that, you can actually, in Postgres 10 and above, create extended statistics to indicate that fact so the query planner can take it into account. So they give an example of the command here, where they create statistics based upon album ID and artist ID from songs. Now, one thing they also suspected is that they were using an inefficient join. So they go ahead and review the nested loop join, merge join and hash join. And they noticed that the nested loop can run into problems if there are too many rows in the left relation, because if it has to do a repeated lookup on the right side of the relation for a join, it has to execute it each time. So I believe they were hoping that if they updated the statistics, it would potentially choose a different join strategy. Now, what they observed is that the most costly step in their EXPLAIN ANALYZE output was looping over index access to this particular index, and they did it over 13,000 times. So even if that index access is fast, it's going to take forever if you have to do it over 13,000 times. Plus, you know, there's a problem if there's a difference between what the planner's estimate is and the actual counts, like we saw with the rows. So what they did is, quote, maxed out per-column statistics for all the columns involved.

[00:03:27] I think these are relatively generic, but they're an org ID and list ID in one table and an ID and entity attribute ID in the next table, and then they created dependency statistics for each of these as well. Now, as a result of that, they got a different query plan that actually finishes in 42 milliseconds, so vastly faster than the two-minute one. And therein lies the 3,000-fold speed increase. Now, it's interesting they didn't mention what their improvement would be by just maximizing these statistics for each column versus using the extended statistics, or correlation statistics, that they added. Because I would assume that increasing the sampling rate of each column would allow the planner to give better estimates, but I'm not sure how much of an additional benefit the extended statistics would give. That would have been an interesting thing to know. But if you run into a scenario where the planner expects one result from a query but the actual results are quite different, maybe check out your statistics to see if you can improve how much you're sampling, or potentially look for correlations between columns that may be able to get you better performance. And if you're interested in learning more about that, go ahead and check out this blog post.
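
As a rough sketch of the two techniques described above (not code from the article itself), maxing out the per-column statistics target and adding dependency statistics might look something like this, using the artist/album/song example with hypothetical table and column names:

    -- Raise the per-column sample target to the maximum of 10000
    -- (the default is 100); table and column names are illustrative.
    ALTER TABLE songs ALTER COLUMN album_id  SET STATISTICS 10000;
    ALTER TABLE songs ALTER COLUMN artist_id SET STATISTICS 10000;

    -- Extended (dependency) statistics, available in PostgreSQL 10+,
    -- tell the planner that these two columns are correlated.
    CREATE STATISTICS songs_album_artist_deps (dependencies)
        ON album_id, artist_id FROM songs;

    -- Re-gather statistics so the planner can use the new targets.
    ANALYZE songs;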

[00:04:40] The next piece of content is PostgreSQL synchronous_commit options and synchronous standby replication. This is from Percona.com, and they're talking about the synchronous_commit setting. Now, first they talk about WAL propagation. The WAL is the write-ahead log: it's the log of all activities that are happening on the database in terms of object changes and inserts, updates and deletes, and they go through this flow diagram here. So this is assuming you have one local PostgreSQL system and then a replica, the remote PostgreSQL system. So first the WAL inserts are written to the WAL buffers in memory. They are then written to the operating system page cache, and then on some interval they are flushed to the actual disks of that local system. Meanwhile, if you have replication set up, the WAL sender sends it to the WAL receiver, and it does a remote write to the page cache on the remote system. Then at some interval it flushes it to the disk, and usually some time later it applies those changes that it has received to the database system itself, so that the inserted data is accessible for others to read, for example. Now, with this flow you can determine what setting you want to use for synchronous_commit. So when the setting is off, essentially you're just writing to the WAL buffers and you have no guarantees that things actually make it to the disk. So clearly that's fast, but you're opening yourself up to data loss even if you're operating on a single server. So if you care about your data, you probably don't want to choose off. The next option is local. Local is assuring that the WAL records actually get flushed to the disk, so everything is safe on the disk, at least on that local system. The next option is remote_write. This is assuring that the WAL is written to at least the OS page cache on the remote system. Now, this gives you a fast acknowledgment so that there's little slowdown in the system, but you can't be sure, if the replica fails, that it was written to the WAL files. The on setting for synchronous_commit ensures that it gets written to the disk on the replica, and this is the default setting for Postgres, so you have to change it from this setting if you want to choose a different synchronous_commit option. And then the last option is remote_apply. This takes the longest but is the most durable, and it assures that the information written on the local database appears on the remote database. So it gets written all the way to the database and is available for selects. So it's the most durable but has the greatest performance penalty. And then the next section of this article explains how you can change the scope of synchronous_commit. You can of course set it at the cluster level, but there are also multiple other levels that you can set it at. So for example, you can do a SET LOCAL of the setting, which does it at the transaction level and enables you to customize how durable you want a particular transaction to be. Or you could also set it at the session level or at the user level, so you can have certain users where synchronous_commit is off, for example, or some users where synchronous_commit is on. So you can definitely tailor this so that certain transactions or certain work you want to do is more durable, and other work is maybe faster and less durable. So if you are interested in learning more about how to adjust your synchronous_commit settings, definitely check out this blog post from Percona.com.
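
To make those scoping options concrete, here is a minimal sketch of setting synchronous_commit at the transaction, session and role level; the role name is a made-up example, not something from the article:

    -- Per transaction: only this transaction accepts lower durability.
    BEGIN;
    SET LOCAL synchronous_commit = 'off';
    -- ... low-durability work here ...
    COMMIT;

    -- Per session: applies until the session ends or the setting is reset.
    SET synchronous_commit = 'remote_apply';

    -- Per role: every session for this (hypothetical) user defaults to off.
    ALTER ROLE reporting_user SET synchronous_commit = 'off';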

[00:08:08] The next piece of content is TLS related updates in PostgreSQL 13. This is from Highgo.ca, and they're talking about additions to the postgresql.conf file where you can now define an SSL minimum protocol version and a maximum protocol version. In previous versions of Postgres, the default minimum was TLS version 1.0, which is not considered secure anymore. In Postgres 13 it's increased to 1.2. These settings enable you to customize your TLS configuration to match what is acceptable for your environment. The next thing they mention is that they also allow minimum and maximum protocol versions for the client as well, so you can pass in these values when you connect as a psql client. The next area they mention is channel binding when using SCRAM. We've covered this in a previous episode of Scaling Postgres, but this covers how you can set up channel binding to help authenticate, from the client's perspective, the server that you're connecting to. And they also mention a new SSL password option for the client when verifying SSL connections. So definitely some great changes to take into account with PostgreSQL 13.
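
For reference, a minimal sketch of what these PostgreSQL 13 settings might look like; the hostnames and values here are illustrative assumptions, not taken from the article:

    -- Server side: bound the accepted TLS versions. These can also be set
    -- directly in postgresql.conf; a reload is enough to apply them.
    ALTER SYSTEM SET ssl_min_protocol_version = 'TLSv1.2';
    ALTER SYSTEM SET ssl_max_protocol_version = '';   -- empty = no upper bound
    SELECT pg_reload_conf();

    -- Client side (libpq/psql connection string, PostgreSQL 13+):
    --   psql "host=db.example.com sslmode=verify-full
    --         ssl_min_protocol_version=TLSv1.2 channel_binding=require"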

[00:09:18] The next piece of content is Oracle to PostgreSQL: cursors and common table expressions. This is from 2ndQuadrant.com. So I actually found this quote interesting: "In a 23 year career with PostgreSQL, I have only actually found a need to use cursors twice, and I regret one of those." So he's basically communicating that with Oracle you apparently use a lot of cursors when doing programming, whereas with PostgreSQL you don't tend to do that, and there are other ways to achieve the same thing. And he talks a little bit about how cursors can be expensive compared to relying on set-based logic, essentially using code that's already been written, pure SQL, versus programming something yourself to return data. Now, as an example, he used being able to create some hierarchical data. As opposed to using a cursor to do this, he used a common table expression and the WITH RECURSIVE feature. So he used this capability to generate a hierarchy of data where normally you might try to use a cursor or some other method like that. So if you're interested in learning more about this, definitely check out this blog post from 2ndQuadrant.
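
As a rough sketch of the WITH RECURSIVE approach he's describing, using a generic self-referencing table rather than the article's own schema:

    -- Walk a hierarchy without a cursor: start at the root rows and
    -- repeatedly join children onto the rows found so far.
    WITH RECURSIVE tree AS (
        SELECT id, parent_id, name, 1 AS depth
        FROM   categories
        WHERE  parent_id IS NULL
        UNION ALL
        SELECT c.id, c.parent_id, c.name, tree.depth + 1
        FROM   categories c
        JOIN   tree ON c.parent_id = tree.id
    )
    SELECT * FROM tree ORDER BY depth, id;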

[00:10:29] The next piece of content is Eight fascinating things you probably didn't know PostgreSQL can do. This is from EnterpriseDB.com, and this is an eclectic set of different SQL tricks or tips that you can use with Postgres. I'm not going to run through all of these in detail, so you should definitely consult this blog post. But they're talking about utilizing whole-row references, comparing a selection of columns, hard-coded tables, custom config parameters, how booleans can stand alone, converting column data types for free, finding which rows belong to which underlying partition, and how tables are types. So if you're interested in working a bit more in depth with Postgres, definitely check out this list of tips and tricks, including some I haven't really seen before.
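
Two of the listed tricks, sketched from memory rather than from the article itself; the table and parameter names are illustrative:

    -- Whole-row references: the table alias by itself is a composite value.
    SELECT a FROM accounts a;                  -- each row as a single composite
    SELECT * FROM accounts a WHERE a IS NOT NULL;

    -- Custom config parameters: any dotted name can carry a session value.
    SET myapp.current_tenant = '42';
    SELECT current_setting('myapp.current_tenant');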

[00:11:22] The next piece of content is Updating the PostgreSQL root.crt file. He's describing a scenario where you have SSL authentication set up between two different Postgres servers; perhaps you're using the Postgres foreign data wrapper, as they mention here. So you have server certificates and server keys, client certificates and client keys, and your own certificate authority that has generated this root.crt file and signed all of these different certificates. Well, what happens when the root certificate comes up for expiration? What's the best way to handle that? What you don't want to do is recreate the root certificate, overwrite the old one, regenerate all the different keys for a single server and then update it; you'll run into problems because the other server is still using the old root certificate. So he says that the correct way to do it is to generate a new CA certificate and add it to the same root.crt file, so that it basically trusts both CA certificates, the old one and the new one. Then you generate your new certificates signed with the new CA certificate, and that allows you to update all of your certificates. So if you want to know a bit more about the process, definitely check out this post from 2ndQuadrant.com.
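
A rough sketch of the rotation procedure he describes, with hypothetical file names; the exact openssl invocations are my assumption, not commands from the post:

    # 1. Generate a new CA key and self-signed CA certificate.
    openssl req -new -x509 -days 3650 -nodes \
        -keyout new_ca.key -out new_ca.crt -subj "/CN=my-postgres-ca"

    # 2. Append the new CA certificate to the existing root.crt so that
    #    both the old and the new CA are trusted during the transition.
    cat new_ca.crt >> root.crt

    # 3. Re-issue the server and client certificates signed by the new CA
    #    and roll them out; once everything is switched over, the old CA
    #    certificate can be removed from root.crt.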

[00:12:41] The next piece of content is PostgreSQL: GROUP BY expression. This is from Cybertec-postgresql.com, and they're talking about GROUP BY. So, for example, you can group data by the values in a column. Here they do it by region, or they group by whatever the first column is. So that's pretty much a basic GROUP BY. A GROUP BY expression is where you define an expression to group by: in this case whether the production is greater than 9,000, and I think this is the case of barrels of oil, or you can use a CASE statement in a GROUP BY. And that is what I've used a lot of this for: CASE statements in a GROUP BY in order to get particular data. So this continues and shows you how to use that, as well as following up with grouping sets. So if you want to learn more about some capabilities of GROUP BY, definitely check out this post from Cybertec-postgresql.com.
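
A small sketch of the kinds of GROUP BY expressions discussed, with made-up table and column names in the spirit of the oil-production example:

    -- Group by an expression rather than a bare column.
    SELECT production > 9000 AS big_producer, count(*)
    FROM   oil
    GROUP  BY production > 9000;

    -- Group by a CASE statement to bucket the data.
    SELECT CASE WHEN production > 9000 THEN 'high' ELSE 'low' END AS bucket,
           sum(production)
    FROM   oil
    GROUP  BY 1;

    -- Grouping sets produce several groupings in one pass.
    SELECT region, country, sum(production)
    FROM   oil
    GROUP  BY GROUPING SETS ((region), (region, country), ());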

[00:13:34] The next piece of content is Using Postgres row-level security in Python and Django. This is from Pganalyze.com, and they're talking about implementing Postgres's row-level security. So in this scenario, with your application, you're actually going to need to set up a role per database user, because you want to do the security in the database system itself. In their example they have a salespeople table, they create each salesperson, and then for each salesperson they create a role. They happen to use the ID from the salespeople table; there's probably a better way to do that, but that's what they use for this example. And each of these roles is a member of the salespeople role, so you can grant certain permissions to that. Then they enabled row-level security so each role can access only the rows that it should. And the way they switch to each person is that they use the SET ROLE command. So with Django it uses a common database user to connect and pull all the data, but when it needs to act as a particular user, it issues SET ROLE before running the command. So after showing how the database side works, they then show the implementation in Django itself. And the main things they have to do are: after inserting a new salesperson, they actually need to create the role and grant it membership in the salespeople role, and in addition, whenever they're doing a query, they need to set the role to that person. Now, I don't necessarily think that I would use row-level security for something like a web application unless it's for internal use only and these database users have access to other database systems, so it makes sense that they already have a database user set up for a specific purpose. Otherwise, to me, this tends to get very complicated. The other thing they also mention here is performance. The performance of the row-level security feature of Postgres isn't as high as when you're not using it, so if you're interested in scaling Postgres, this may be an option you want to avoid. But if you're interested in learning more about row-level security and a basic setup for it, you can check out this post from Pganalyze.com.
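
Here's a minimal sketch of the database-side pattern described above; the table, role and column names are assumptions for illustration, not the article's code:

    -- A group role for all salespeople, plus one role per salesperson.
    CREATE ROLE salespeople NOLOGIN;
    CREATE ROLE salesperson_42 LOGIN IN ROLE salespeople;

    -- Enable row-level security and grant the group access to the table.
    ALTER TABLE deals ENABLE ROW LEVEL SECURITY;
    GRANT SELECT, INSERT, UPDATE ON deals TO salespeople;

    -- Each row records its owning role; the policy compares it to the
    -- role currently in effect (after the application issues SET ROLE).
    CREATE POLICY deals_owner ON deals
        USING (owner_role = current_user);

    -- What the application does before querying on a salesperson's behalf:
    SET ROLE salesperson_42;
    SELECT * FROM deals;   -- only rows where owner_role = 'salesperson_42'
    RESET ROLE;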

[00:15:41] The next piece of content is How to set up PostgreSQL on an IPv6-enabled network. This is from Highgo.ca, and it talks about how to set up IPv6 for PostgreSQL, going into a lot of detail about how you can do this. So if you're interested in doing that, check out this blog post.

[00:16:00] The next piece of content is a webinar, Similarity Queries in PostgreSQL, follow-up. So this is a webinar on similarity queries, and there's about a ten-minute YouTube video introducing what the webinar is like. So you can take a preview of it here, and if you're interested, then click here to register for the full webinar.

[00:16:20] And the last piece of content is that the PostgreSQL Person of the Week is Charles Clavadetscher. So if you're interested in learning more about Charles and his contributions to PostgreSQL, definitely check out this blog post. That does it for this episode of Scaling Postgres. You can get links to all the content mentioned in the show notes. Be sure to head over to Scalingpostgres.com, where you can sign up to receive weekly notifications of each episode, or you can subscribe via YouTube or iTunes. Thanks.
