
Performance Impressions, Mystery Solving, GiST Performance, K-Anonymity | Scaling Postgres 89


In this episode of Scaling Postgres, we discuss Postgres 12 performance impressions, mystery solving, GiST and SP-GiST performance and k-anonymity.


Transcript



[00:00:00] In this episode of Scaling Postgres, we talk about performance impressions, mystery solving, GiST performance, and k-anonymity. I'm Creston Jameson, and this is Scaling Postgres episode 89.

[00:00:21] Alright, I hope you're having a great week. The first piece of content is PostgreSQL v12 initial query performance impressions, and this is from cybertec-postgresql.com. They have done some performance checks for version twelve, as they mention they've done for previous releases. They're comparing version eleven to version twelve, predominantly using pgbench and, I believe, a number of other analytical queries they've come up with. So they go through their testing methodology, the different setups they used, the hardware, the software, the configuration options they set, and then they get into the results, of course. Now, they had a scale indicator here. This indicates whether a particular query or workload was small enough to just fit in the buffers, whether it would consume more of the RAM of the machine, or whether it was so large it would probably spill onto the disks. So those are the different scales they used. And then they have the mean time for version eleven and version twelve, what the mean time difference was, as well as the standard deviations.

There were some cases where twelve was faster, those would be the negative percentages here, but overall it was a little bit slower than eleven. So he took a total average, and it was 3.3% slower. But it was interesting looking at some of the queries here. Now, a lot of these are very fast, like less than one millisecond, so it's hard to gauge speed at that type of resolution. But even among the analytical queries, one was 0.8% slower, one was 1% faster, one was 8% slower, and then one was 3% slower. So just looking at the analytical queries he had developed, it definitely looks like it's a little bit slower.

The other interesting thing mentioned was the inserts being slower: the pgbench default insert is 20% slower, and the pgbench extra-indexes insert is 12% slower. Now, he mentioned that perhaps this is due to the space improvements to the indexes; maybe it takes a bit more time to get the data inserted. But what I also found interesting is that the standard deviation is huge, like 177%. Looking at the average time, it's 0.4 milliseconds for this insert, but the standard deviation was 0.2, so that's a huge standard deviation. The same thing with the extra indexes: it's 0.5 milliseconds in terms of speed, but the standard deviation is 0.2. So a huge range on the inserts, and I'm not quite sure what that is; he doesn't offer a theory from what I could see. Now, of course, he does mention these are predominantly artificial tests, so you have to take everything with a grain of salt. But definitely consider investigating insert performance as you adopt version twelve to see if there are any differences from version eleven. So overall, a very interesting blog post, and I encourage you to check it out.
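If you want to spot-check insert timing yourself, something like the following sketch could work; the table name and values here are illustrative, not taken from the article:

```sql
-- Illustrative single-row insert timing check (table name is made up).
-- Run the same statements on a v11 and a v12 instance and compare.
CREATE TABLE insert_check (id serial PRIMARY KEY, val integer, note text);

\timing on
-- Note: EXPLAIN ANALYZE actually executes the insert.
EXPLAIN (ANALYZE, BUFFERS)
INSERT INTO insert_check (val, note) VALUES (42, 'benchmark row');
```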

[00:03:40] The next post is Making Mystery-Solving Easier with auto_explain.

[00:03:45] So they had a situation where they were working with a customer who had tried all these variations to get better performance after migrating from Oracle to PostgreSQL. They adjusted shared_buffers and tried swapping the JDBC drivers they were using; they thought perhaps it was the replica, but using just the primary database had no impact. So what they did is set log_min_duration_statement to zero to log all the queries and get some timing, and it looked like the problem was an update statement. So they took one of the queries and ran it through EXPLAIN ANALYZE and didn't see an issue. Now, I don't know if it was the exact query that was hitting the database, as we'll see lower down here, but they took one of the example queries, ran it through EXPLAIN ANALYZE, and it was using the index; there was seemingly no problem. So what they did is use auto_explain, which is an extension that you can put in your shared_preload_libraries and configure. They set auto_explain.log_min_duration to zero and turned on log_analyze, so it gives you the actual row counts as well as the cost estimates.
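As a rough sketch, enabling auto_explain for a single session looks something like this (the parameter names are the extension's real ones; a zero threshold logs every statement's plan, so use it carefully in production):

```sql
-- Load auto_explain for the current session (or add it to
-- shared_preload_libraries in postgresql.conf to enable it server-wide).
LOAD 'auto_explain';

SET auto_explain.log_min_duration = 0;  -- log the plan for every statement
SET auto_explain.log_analyze = true;    -- include actual row counts and timings
```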

[00:04:55] And when doing that, they noticed that the filter being applied was converting the integers into doubles. So it would take the column and cast it to double precision, then take the value that was input and make it double precision as well, and do the same for the second column and the second value.

[00:05:18] So of course this caused a sequential scan, and it was no longer using the index. And they finally traced it down to a third-party interface that was adding an executeUpdate method for database connections, and basically, as they say here, quote, it "basically converted all the int arguments to double precision", which prevented it from using the index that was there. So basically the additional software between the actual application and the database system was adding this and causing performance problems. So definitely something to be aware of, and also a tool, the auto_explain extension, to help diagnose it. I do wonder, though, whether with log_min_duration_statement it was possible to capture more of the query and diagnose it earlier without having to use auto_explain. But definitely something to keep in mind if you suddenly run into performance problems. And again, this is by Richard Yen at richyen.com.
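To make the failure mode concrete, here is a contrived sketch of the query shape they describe; the table is hypothetical, not from the post:

```sql
-- Hypothetical table with a plain integer primary key.
CREATE TABLE accounts (id integer PRIMARY KEY, balance numeric);

-- A plain integer comparison can use the btree index on id:
EXPLAIN SELECT * FROM accounts WHERE id = 12345;

-- Casting the column to double precision (what the driver layer was
-- effectively doing) defeats the index and forces a sequential scan:
EXPLAIN SELECT * FROM accounts
WHERE id::double precision = 12345::double precision;
```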

[00:06:19] The next article is PostgreSQL 12: Implementing K-Nearest Neighbor Space Partitioned Generalized Search Tree Indexes. This is from 2ndquadrant.com, and basically they're looking at performing a nearest-neighbor search, a simple linear distance using the distance operator, and seeing which indexes give you the best performance. First they didn't use any indexes at all and just retrieved five results, the closest locations to certain coordinates, with the data they had on hand; it took about 2 seconds to return the data. Then they added a GiST index to the location column, which took about three minutes to build, but after they did that the query ran in about one millisecond, so definitely a lot faster with the GiST index. Then they tried an SP-GiST index, again adding it to the location column, and this one ran in 0.358 milliseconds, so even twice as fast. And they have a table of the comparisons here: unindexed (I'm going to ignore the estimates), the actual query time was 2.5 seconds; with a GiST index it was about a millisecond (0.939); and with the SP-GiST index it was 0.358. In addition, the index size was smaller with the SP-GiST index compared to the GiST index, and its creation time was a minute and a half, so half the time of the GiST index. So if you're doing k-nearest-neighbor searches or looking for location distances, definitely consider an SP-GiST index as opposed to just a GiST index. If you want to learn more about how they set up the data, the test, and the results, it's definitely a blog post to check out.
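As a rough sketch of the kind of setup they benchmarked (the table and column names here are assumptions, not the article's):

```sql
-- Hypothetical table of named points.
CREATE TABLE places (name text, location point);

-- The two index types being compared:
CREATE INDEX places_location_gist   ON places USING gist (location);
CREATE INDEX places_location_spgist ON places USING spgist (location);

-- K-nearest-neighbor: the five places closest to a given coordinate,
-- ordered by the <-> distance operator, which both index types support
-- (SP-GiST gained KNN ordering support in PostgreSQL 12).
SELECT name
FROM places
ORDER BY location <-> point '(-93.27, 44.98)'
LIMIT 5;
```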
The next post is Achieving K-Anonymity with PostgreSQL. This is talking about PostgreSQL Anonymizer, and it's from the blog taadeem.net. They say they use various strategies to anonymize data so it can be analyzed without being able to identify a single individual; they mention randomization, faking, partial destruction, shuffling, and noise addition. Now they're talking about generalization, which is taking certain data and making it more general. For example, in the case of a zip code, give a zip code range, so it's not specific, but the person lives somewhere within that area. Or for a birth date, maybe you give a range of years that the birth date falls in, but not the actual date; even just giving the year they were born would be another way to make it more general. They give an example of using this generalization technique where you have an SSN, a first name, a zip code, a birth date, and a disease. They got rid of the Social Security number, redacted the first name so it was removed, then used ranges for the zip code and the birth date, but kept the disease, so you could analyze the disease based upon the ranges of birth and zip code without being able to identify the specific individual that had it. Then they talked about k-anonymity, which as they say here is an industry-standard term used to describe a property of an anonymized data set. Basically, it tells you how easily someone can identify an individual from the data given. You specify which columns are being used as indirect identifiers, in this case the zip code and the birth date, and you can run a function to get a k factor (a rough sketch of this follows at the end of this section). In this case it's three, which means for any value of the zip code and the birth range, you will find at least three individuals in the generalized patient table; any one grouping has at least three individuals. As the value gets closer to one, it means you can identify a given individual from those combinations. The more columns you have in a table, for example, the more likely you are to drop your k down to where you could actually identify an individual from the data given. So basically you want to keep that value above one to keep the data anonymized. Definitely an interesting blog post, and an interesting tool given the different laws being enacted with regard to privacy.

The next post is Gzip in PostgreSQL, from the cleverelephant.ca blog by Paul Ramsey, and he's talking about a very simple extension he set up that supports gzip. Now, this is not something to use to compress the data you're typically storing in PostgreSQL; it's just, as he says here, a binding of zlib to SQL. So if you have a process where you need to extract data from PostgreSQL and then zip it up, for example for consumption as part of an API, or you're receiving gzipped data that you want to insert into PostgreSQL, this extension gives you a means to do that (there's a quick sketch below as well). So if you're interested in it, definitely a blog post and a new extension to check out.

The next post is PostgreSQL Query Plan Visualizer. When you're analyzing a query plan, the representation is given to you in text, but there's a Postgres Explain Visualizer tool that's been developed. It's open source, so you could install it yourself, but on this blog post they host it, calling it the Explain Plan Visualizer. You can paste your plan in, along with the query that produced it, and it will give you a graphical representation, presumably something easier to read. So if you want to look into that, maybe check out this blog post, as well as the separate PEV tool that you can host yourself.
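Returning to the k-anonymity post, the general shape with PostgreSQL Anonymizer looks something like the sketch below. Treat it as an approximation: the table and column names follow the post's patient example, but the exact label syntax and function names may differ between versions of the extension, so check its documentation.

```sql
-- Declare which columns are indirect identifiers (approximate syntax;
-- verify against your version of PostgreSQL Anonymizer).
SECURITY LABEL FOR k_anonymity ON COLUMN generalized_patient.zipcode
  IS 'INDIRECT IDENTIFIER';
SECURITY LABEL FOR k_anonymity ON COLUMN generalized_patient.birth
  IS 'INDIRECT IDENTIFIER';

-- Compute the k factor: the smallest group size over all combinations of
-- the indirect identifiers. A result of 3 means every zipcode/birth
-- combination matches at least three rows.
SELECT anon.k_anonymity('generalized_patient');
```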
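And for the gzip extension, usage is essentially plain function calls in SQL. The function names below match my reading of the extension's README, so verify them against the project itself:

```sql
-- Assumes the extension has been installed, e.g. CREATE EXTENSION gzip;
-- Compress a text value to a gzipped bytea:
SELECT gzip('this is my test string');

-- Round-trip: decompress and convert the bytea back to text:
SELECT convert_from(gunzip(gzip('this is my test string')), 'UTF8');
```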

[00:12:36] The next post is Streaming Replication Setup in PostgreSQL 12: How to Do it Right, and this is from the Highgo blog at highgo.ca. This post basically goes through how to set up streaming replication in PostgreSQL twelve given the changes that have been made: the recovery.conf file is gone, and it now uses two different signal files. So this is another piece of content if you need to set that up. A second related post is Replication Failover with pg_rewind in PostgreSQL 12. Again, with the recent changes in PostgreSQL twelve, this goes through the process of replication failover with pg_rewind, or what I call failing back. Once you've failed over to a new primary, that is, made a replica the new primary, if you want to fail back to the old primary, pg_rewind helps you accomplish that process, and they go through how to do it in detail. So if you're interested in that, definitely a blog post to check out.
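As a minimal sketch of what replaces recovery.conf in version twelve (the hostname and user below are placeholders):

```sql
-- On the replica, the connection settings that used to live in
-- recovery.conf now go in postgresql.conf / postgresql.auto.conf,
-- for example via ALTER SYSTEM or by editing the file directly:
ALTER SYSTEM SET primary_conninfo = 'host=primary.example.com user=replicator';

-- Then, instead of standby_mode = on, you create an empty standby.signal
-- file in the replica's data directory; a recovery.signal file likewise
-- triggers targeted recovery mode.
```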

[00:13:36] The next post is Configure HAProxy with PostgreSQL Using Built-in pgsql-check. Now, we've covered a previous article related to this, HAProxy using xinetd, and I believe they didn't use pgsql-check in that case; they used a custom script, which, looking at that post, seems like a better way to do it compared to this one. But here the objective was to use the built-in pgsql-check, and the problem is that it doesn't really do authentication when verifying that it can connect successfully to check whether a database system is in replication mode or not. So they actually set it up using modifications to the pg_hba.conf file, which I'm not necessarily a fan of, but it is a way to make it work. This blog post explains how to make these changes so that you can do failover and switchover procedures using HAProxy and the built-in pgsql-check tool. So if you're interested in doing that, this is the blog post to check out.

[00:14:45] The next post is Introduction and How-To: etcd Clusters for Patroni. So if you use Patroni and want to use etcd as your consensus store, this is a pretty long, comprehensive blog post that explains etcd and how to set it up for that purpose.

[00:15:02] And then the last post is PostgreSQL 12 Generated Columns, and this is from fluca1978.github.io. It's another blog post that talks about the new generated columns feature in PostgreSQL twelve.

That does it for this episode of Scaling Postgres. You can get links to all the content mentioned in the show notes. Be sure to head over to scalingpostgres.com, where you can sign up to receive weekly notifications of each episode, or you can subscribe via YouTube or iTunes. Thanks.
