
BRIN Indexes, Partition Enhancements, Tips, Standbys | Scaling Postgres 73


In this episode of Scaling Postgres, we discuss BRIN indexes, partition enhancements, general tips and standby scaling.

Transcript



[00:00:00] In this episode of Scaling Postgres, we talk about BRIN indexes, partition enhancements, tips and standbys. I'm Creston Jameson, and this is Scaling Postgres episode 73.


[00:00:20] Alright, I hope you're having a great week. Our first piece of content is "BRIN Index for PostgreSQL: Don't Forget the Benefits", and this is from the percona.com blog. They're talking about the BRIN index, which is a block range index. It stores ranges of values, so this type of index is optimal for data that is stored incrementally, for example a table populated by date or by an incrementing primary key. They explain that a BRIN scan returns all the tuples in all the pages in the particular range that's chosen. Basically the index is lossy: it doesn't store an exact representation of the data, it stores ranges, and then it pulls up a range and finds the specific data you're looking for within it. So sometimes a search takes a little longer, but the benefit is that the index is incredibly small, and that helps with caching. They created a table, inserted about 8 million rows, and ran a query with no index, which finished in about 1.8 seconds. Then they created a B-tree index, and the query that took 1.8 seconds now takes 2.6 milliseconds, so super fast with the B-tree. The size of that B-tree, though, is 171 megabytes. Then they look at the BRIN index, created with a USING brin clause. The same index that was 171 megabytes as a B-tree is 64 KB as a BRIN index, so vastly smaller, well under 0.1% of the B-tree's size, which gives it a much higher probability of being cached. With the BRIN index the query finishes in 87 milliseconds. Remember, the original query with no index finished in 1.8 seconds, so 1,800 milliseconds, whereas this one finishes in 87 milliseconds. Not bad, a pretty good improvement. However, the B-tree still beat it at 2.6 milliseconds, so in this example the BRIN is over 30 times slower. You get a huge space savings at the cost of not necessarily the best performance, although it probably depends on the query. They also have a section on storage and maintenance, where they note you can specify pages_per_range, which is how many pages go into a particular range. So you can make the index less lossy, and presumably improve performance a little, at the cost of taking up more space. I kind of think of it like a JPEG image: you can have a higher quality image that's a lot larger, or a lower quality image that's a lot smaller, and BRIN is the same concept. So if you're interested in potentially using a BRIN index, definitely a blog post to check out.
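
To make that concrete, here is a minimal sketch of the two index types, assuming a made-up append-only sensor_log table rather than the exact table from the Percona post:

    CREATE TABLE sensor_log (
        id      bigserial PRIMARY KEY,
        logdate timestamptz NOT NULL,
        reading numeric
    );

    -- B-tree: fast point lookups, but large (171 MB in their test)
    CREATE INDEX sensor_log_logdate_btree ON sensor_log USING btree (logdate);

    -- BRIN: stores only summary values per block range, tiny (64 KB in their test)
    CREATE INDEX sensor_log_logdate_brin ON sensor_log USING brin (logdate);

    -- pages_per_range (default 128) trades size for precision: smaller
    -- ranges are less lossy but the index takes up more space
    CREATE INDEX sensor_log_logdate_brin_narrow ON sensor_log
        USING brin (logdate) WITH (pages_per_range = 32);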

[00:03:28] The next post is "Partitioning Enhancements in PostgreSQL 12", and this is from the 2ndQuadrant.com blog. They take an example from their previous post on generated columns in PostgreSQL 12 and build a data warehouse style example using partitioning. They create a media calendar, breaking out all sorts of different date and time components as integers in a single table, and a time dimension where they look at time of day, hour of day, minute of day and second of day. Then they do a cartesian product, joining these two together and placing the results into a partitioned table. The intermediate table is an hours_to_days table, partitioned by range on day, and they put all of these values into the partitions as defined here. In total they're generating 6.3 billion rows. Getting into the first enhancement, it's basically that you may now have a foreign key relationship that references a partitioned table. So here they created a sale table with a foreign key referencing the partitioned table, which is a definite benefit for integrity purposes. They also mention some new introspection tools: pg_partition_tree, pg_partition_ancestors and pg_partition_root. You can use \d+ in psql to display all the partitions, but you also have these functions that show you the partition tree. For example, hours_to_days is the parent; the output defines whether each table is a leaf or not and what level it's at, since you can have multiple partition levels, and it references the parent relid. There's also pg_partition_root to identify the root table, as well as pg_partition_ancestors. The other thing mentioned is that altering a table to attach a partition no longer requires an exclusive lock. So that's great. Detaching a partition still does, but at least attaching one no longer requires an exclusive lock, which is definitely a benefit. And they list three other improvements: one, the COPY command has reduced a bit of overhead, allowing for faster loading; two, the tablespace specification for a parent is now inherited by the child; and three, pg_catalog.pg_indexes now shows indexes on partition children. So definitely some great enhancements for partitioning coming in version 12.
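
Here is a minimal sketch of those features, using simplified table definitions rather than the ones from the post:

    -- a range-partitioned table; the primary key must include the partition key
    CREATE TABLE hours_to_days (
        day  date NOT NULL,
        hour timestamptz NOT NULL,
        PRIMARY KEY (day, hour)
    ) PARTITION BY RANGE (day);

    CREATE TABLE hours_to_days_2019 PARTITION OF hours_to_days
        FOR VALUES FROM ('2019-01-01') TO ('2020-01-01');

    -- new in PostgreSQL 12: a foreign key may reference a partitioned table
    CREATE TABLE sale (
        id   bigserial PRIMARY KEY,
        day  date,
        hour timestamptz,
        FOREIGN KEY (day, hour) REFERENCES hours_to_days (day, hour)
    );

    -- the new introspection functions
    SELECT * FROM pg_partition_tree('hours_to_days');
    SELECT pg_partition_root('hours_to_days_2019');
    SELECT * FROM pg_partition_ancestors('hours_to_days_2019');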

[00:06:01] The next post is "Postgres tips for the average and power user", and this is from the citusdata.com blog. The first tip they mention is actually one I haven't heard of or used before: in psql you can tag your queries and use Ctrl+R to do a search. You can just use Ctrl+R without necessarily tagging anything, type part of a query, and it will pull up the different queries you've run historically; then you just hit Enter to run that command. So that's pretty interesting. But he's talking about using a tag in a comment to be able to find a particular query you're looking for. Then he talks about better psql output: there's a backslash option he covers, plus a pager you can install that he likes better, and he gives some configuration you can add to your .psqlrc file in order to enable it. The next is getting a stack trace for an error, and he goes through the process of being able to do that. Then there's inserting fake data; if you've watched previous episodes of Scaling Postgres, there have been many blog posts about generate_series, and they're talking about it here. They mention a better pg_stat_activity, giving some suggestions on how to better read the active statements going on in PostgreSQL, as well as a process to manage multiple versions of Postgres that you may be running, using pgenv. So some tools that you could use in your Postgres environment, and another blog post to check out.
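
The tagging trick looks roughly like this; the query, table and tag are made-up examples:

    -- in psql, run a query with an identifying comment at the end
    SELECT count(*) FROM orders
    WHERE created_at > now() - interval '1 day'; -- tag:daily-orders

    -- later, press Ctrl+R and type "daily-orders" to pull the query
    -- back out of history, then hit Enter to run it again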

[00:07:31] The next post is "Standby in production: scaling application in the second largest classified site in the world", and this is from AvitoTech on Medium. They have a very large Postgres installation, greater than 20 terabytes across over 100 nodes. This is Avito, the biggest Russian classified site and the second largest classified site in the world. The post is very long and pretty comprehensive, and it's basically all about standbys, covering how they can be used from a high availability and a scaling perspective. He talks about physical replication, which is essentially streaming the WAL files, and briefly about logical replication, but primarily it's about physical replication and using standbys to horizontally scale reads. He goes a little into the history and then into specific things they've run into and how they've been able to resolve them. Some of the main things he covers are how to avoid stale reads from a replica, including the techniques they use compared to some other approaches; how they have implemented caching; some issues they discovered with regard to deadlocks on a replica; issues related to excessive WAL files and how they handled those; and standby pools. So this is a really comprehensive blog post covering a number of different items they have encountered while scaling Postgres. If you've run into problems with scaling as it relates to standbys, this is definitely a blog post to check out, because he covers a lot of information, a little too much for me to go through comprehensively in this episode.
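
As one illustration of the stale-read problem, a common technique is to compare WAL positions between primary and standby. This is a generic sketch, not necessarily the exact approach Avito uses, and the LSN value shown is just an example:

    -- on the primary, right after a write, capture the current WAL position
    SELECT pg_current_wal_lsn();

    -- on a standby, check how far WAL replay has progressed
    SELECT pg_last_wal_replay_lsn();

    -- the standby is safe to read once replay has passed the captured
    -- position ('0/1652E40' stands in for the value captured above)
    SELECT pg_last_wal_replay_lsn() >= '0/1652E40'::pg_lsn AS caught_up;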

[00:09:18] The next post is "The mysterious backend_flush_after configuration setting", and this is from cybertec-postgresql.com. This is a setting I hadn't heard about or looked into, but they reference Andres Freund, who basically said that if your workload is bigger than shared buffers, you should enable the backend_flush_after parameter for improved throughput and also less jitter. So they said, well, let's take a look and try it. Basically it communicates something to the OS about how to handle flushing writes to disk; that's my interpretation in a nutshell. He tested this using a four-CPU, eight-gigabyte system and pgbench. With the setting at its default, the baseline came in at 0.67 milliseconds. When he set it to 512 KB, there wasn't a significant change, less than 1%. But increasing it into the megabyte range, he got close to a 4% to 5% improvement in performance. So this is another one of those configuration settings where the effect probably depends on your load and what you're doing. If you want to try to squeeze a little more performance out of your system, take a base measurement, alter this parameter, and see whether you get a difference with your workload. So if you're interested in getting a little bit more performance, definitely a blog post to check out.
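
If you want to experiment with it, here is a minimal sketch of changing the setting; the values are examples, so measure with pgbench before and after:

    -- the default is 0, meaning forced writeback is disabled
    SHOW backend_flush_after;

    -- try a value for the current session only
    SET backend_flush_after = '512kB';

    -- or set it cluster-wide and reload (2MB is the maximum allowed value)
    ALTER SYSTEM SET backend_flush_after = '2MB';
    SELECT pg_reload_conf();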

[00:10:42] The next post is "Suggesting single column primary keys almost automatically". This is a continuation of a previous blog post by Luca Ferrari, and it's from fluca1978.github.io. He talked previously about having to generate primary keys for a great number of tables in a database that didn't have any, and he's come up with a kind of automated method to identify the most probable best primary key, again by looking at the system tables. He has all his code right here, and he's consulting the pg_stats view in order to determine good candidates. He goes through his methodology and then runs it to suggest primary keys for different tables. So if you have a particular use case for this, definitely a blog post to check out.
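
As a rough sketch of the idea, and not Luca's exact query: columns whose statistics show every value distinct and no nulls make reasonable single-column primary key candidates:

    -- relies on up-to-date statistics, so run ANALYZE first
    SELECT schemaname, tablename, attname
    FROM pg_stats
    WHERE n_distinct = -1          -- all sampled values are distinct
      AND null_frac = 0            -- no nulls observed
      AND schemaname NOT IN ('pg_catalog', 'information_schema');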

[00:11:35] The next post is also by Luca Ferrari, titled "Checking PostgreSQL Version in Scripts". Basically he had a psql script and wanted to be able to check the version as it relates to partitions, because on version 10 he wanted to use one particular type of syntax, and on version 11 a slightly different one that used a default partition. So he came up with a method to determine which Postgres version is running, so that he can conditionally choose the syntax during query execution. If you have a use case for this, definitely a blog post to check out.
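
The general technique looks something like this in psql, a minimal sketch using \gset and \if (conditional blocks are available in psql since PostgreSQL 10):

    -- read the server version as a number, e.g. 110005 for 11.5
    SELECT current_setting('server_version_num')::int >= 110000
        AS is_pg11_or_later \gset

    \if :is_pg11_or_later
        \echo 'version 11 or later: a DEFAULT partition is available'
    \else
        \echo 'version 10: use explicit partition bounds instead'
    \endif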

[00:12:12] The next post is "Per User Encryption with Postgres", and this is from The Startup on Medium.
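
The functions the post is built around come from the pgcrypto extension. Here is a minimal sketch of the approach discussed below; the secrets table and the key string are made-up examples:

    CREATE EXTENSION IF NOT EXISTS pgcrypto;

    CREATE TABLE secrets (
        id      bigserial PRIMARY KEY,
        payload bytea
    );

    -- encrypt with a symmetric key; in practice the key lives in an
    -- environment variable on the application server, not in the database
    INSERT INTO secrets (payload)
    VALUES (pgp_sym_encrypt('my sensitive data', 'app-secret-key'));

    -- decrypt with the same key
    SELECT pgp_sym_decrypt(payload, 'app-secret-key') FROM secrets;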

[00:12:19] This is a pretty simple post to start off with in regards to Postgres, but he's talking about basic encryption you can do using pgp_sym_encrypt and pgp_sym_decrypt to encrypt and decrypt data within the database. He talks about how you can define an environment variable, probably on your application server, and then encrypt and decrypt data without having the keys on the database. He takes it a step further and talks about sandwich encryption, where, for example, your users have their own key and the server has its own key, and you double-encrypt the data with both. That way even you can't decrypt the data; only the user, with their own decryption key, can do it. So if you have a need to do that, definitely a blog post to check out. The last post is "pgBackRest S3 configuration", and this is from pgstef.github.io. pgBackRest is a separate backup tool for PostgreSQL, and you can configure it to use S3 to store and retrieve data; this blog post shows you how to do that. Now, it doesn't actually use Amazon S3; he uses a tool called MinIO, which is a server compatible with Amazon S3's object storage. Nonetheless, he goes through and shows you how to configure pgBackRest, does a backup with it, and shows how to get it running. So if you're interested, definitely a blog post to check out. That does it for this episode of Scaling Postgres. You can get links to all the content mentioned in the show notes. Be sure to head over to scalingpostgres.com, where you can sign up to receive weekly notifications of each episode, or you can subscribe via YouTube or iTunes. Thanks.
