background

Multi Column, Multi DB, Dos & Don'ts, RUM Indexes | Scaling Postgres 65

Join Over 1,000 Engineers & Get New Episodes Weekly!

In this episode of Scaling Postgres, we discuss multi-column indexes, using multiple databases, Postgres dos & don'ts as well as using RUM indexes.

Content Discussed

YouTube Video

Podcast Audio

Transcript



[00:00:00] In this episode of Scaling Postgres, we talk about multicolumn, multidatabases do'sand don'ts and run indexes. I'm Kristen Jameson and this is Scaling Postgres episode 65.

[00:00:20] Alright, I hope you're having a great week. Now our first article is multicolumnindexes, and this is from Medium in the PG mustard area. And they're talking about an indexover multiple columns and usually how you would start indexing your database or your tables oryour application would be putting on single column indexes. But then if you start gettingperformance issues, a lot of times you move to multicolumn indexes. And this is kind of acounterpoint in that there's kind of no magic bullet and it talks about some disadvantages. Sothey talk about when malduk column indexes fall flat and that's predominantly when you'reneeding to search on the second column in a multicolumn index or the third column or the firstand third thin column of a multicolumn index, then it won't really work. And they go into someclear examples about this and why it won't work. And they even say how to spot some inefficientefficient index use. And that's when you have like a high amount of rows removed by the filter,when you look at the query plane, when you run Explain or Explain Analyze. And they say whenyou're sticking with just single column indexes, it basically can do a scan on each of those singlecolumn indexes and then do a bit scan between them. And that can work in a lot of cases.There's just cases where multicolumn index definitely excels. So if you want a refresher on whatcases you may want to move to a multicolumn index, definitely a blog post to check out the nextpost is managing multiple databases in Rail Six. And this is from the Citusdata.com blog. Andwith the new Rail Six they have the ability to support talking to two different database systems.I'm sure other frameworks have that, but this is calling out Rails because this could beadvantageous if you have a primary PostgreSQL database and then a Replica PostgreSQLdatabase. Here they call it a follower and basically you would direct all data changes to theprimary and then all essentially read only queries to the Replica or the follower. And they talkabout how to configure Rails to set this up to work in Rail Six, and how a lot of times, basicallyyou just tell Rails to automatically utilize the reading role, which is hitting the Replica for gettinghead requests and probably leave post patch put things of those other http verbs for just hittingthe primary database. So if you use Rails or another web framework and you want to look intohow they're offering the ability to talk to multiple databases, definitely a blog post to check out.

[00:03:01] The next post is don't do this. This is from the Wiki postgresql.org. Now I saw thispost earlier, but heard some mention of and I'm going to mention another blog article related toit. I'm not quite sure who authored this, but it does have some, I would say, opinionated choiceson why you should and should not use different features. Now, some of them are clear in termsof my opinion, but some of them not so much. Like, for example, a lot of ones that definitelydisagree with are don't use the psql capital W or include the password, don't use between.

[00:03:40] I agree with I always have to take a step back and think about it, usually with dates, Idefinitely do what they're suggesting here, or at least timestamps, but like, things that's differentis that I frequently do use timestamp with that time zone just because of the framework that Itend to use. As that framework offer different options. I might use timestamp including the zonewith it, but a lot of that is just due to the framework that I'm using. Text storage. Basically they'resaying you should mostly use Vericare or text and rarely use Vericare in, although I find itbeneficial. And they say that this is when you should do it is using a limit when you actually wantto limit the text. And a lot of times with my columns I do want to limit because I don't want peopleto be able to insert data into the database in an unbounded fashion. A lot there's one here thatsays don't use serial. But again, a lot of the frameworks they're using the serial data types asopposed to the new identity columns. Now, hopefully the frameworks that I use will be switchingover, but as of now they are not. So it's an interesting blog post and has a lot of advice to lookinto, but --I guess I wouldn't necessarily take it as gospel. And again, for everyone, they say whenshould you? I guess some of these when should you? I would say what is your framework or arethere things that you're using like an application framework that makes it more difficult to kind ofdo some of the recommendations in here? I guess I don't have a high difference of opinion witha lot of these, but a lot of the reason I do do some of these things is because of the applicationframework and what capabilities it offers.

[00:05:22] The next post is Table Inheritance what's it good for? And I believe this is kind ofmentioned because in here it says don't use table inheritance, essentially. And it says whenshould you with regard to table inheritance? Never, almost. But here this blog post byPerspectives on Ledger SMB is talking about a use case that they felt inheritance is good.Basically.

[00:05:47] They had some sort of objects and those objects had notes for each of those types ofobjects. So they were using a parent notes table and then did an inheritance to like an invoicenotes table and then other different types of notes table and they found that this particularinheritance feature works for them. I wasn't 100% wedded on the idea as I read through this blogpost as something that I would do, but I could see it as a use case where you want to make thedecision to do it. Definitely the different developers need to be on board to understand how thisworks, but it's definitely a use case I could see where table inheritance could possibly be useful.So if you're interested in looking into that, definitely blog post to check out.

[00:06:32] The next post is using docker Hub PostgreSQL images and this is from secondquadrant postgresql.com and they're talking about a docker Hub and it has images forPostgreSQL. And basically they started using it for a particular use case and they're thinking thata lot of people are going to use these because it's the PostgreSQL image on docker Hub. Butwhat I found interesting about this, they talk about different warnings that you should consider.Like number one is it's ephemeral by default. And that's the thing that has me nervous aboutusing containers with PostgreSQL is making sure the data just doesn't go away. And apparentlythis docker image on Docker Hub is ephemeral by default. So you can essentially, as he sayshere, quote by default, these images make it very easy to lose your data forever. So I myselfhaven't looked into using Docker with PostgreSQL and part of the reason is the fear of thislosing the data. And then another one he talks about is data safety. And here's a warningdocker's default storage configuration on most installs may perform poorly and isn't necessarilycrash safe. The docker documentation says as much. So again, more warnings about this. Andthen he talks about how the locale and encoding and collation is set up, that it's using root bydefault, how to handle connections between containers and some other information to help youunderstand and use the PostgreSQL container on Docker Hub. So if you're interested in lookinginto that, definitely a blog post to check out.

[00:08:01] The next post is actually presentation. It's Streaming Replication the Basics. And thisis from PG Day in Belgium, 2019 by Stefan Forkat. And it's at Pgsth. GitHub IO.

[00:08:14] And this talks about streaming replication with PostgreSQL and it talks about thebasics but it definitely goes on a little bit further than that because this presentation is 54 pagesin length. He goes over the basics of Wall and how streaming replication works, how to set it up,but then he also goes into issues such as monitoring, so how you monitor using PG Statreplication, the PG Wall receiver, how to do failover, promote the different databases and evenfailback using PG Rewind. So it does start off with the basics, but then it goes in much moredeeper about all the issues related to replication, even mentioning synchronous replication. So ifyou're interested in getting an overview of streaming replication and then going much more indepth. Definitely a blog post to check out.

[00:09:07] The next blog post is Indexes and PostgreSQL part eight, Rum. This is fromHaber.com and I believe this was posted on Postgrespro Ru and it basically explains the Rumindexes. So it's kind of like the, as they say here, quote the next generation gen has been calledRum. So it's the next version of the R --um index because typically you use gen with data types like for text search, for JSON, Bqueries, for arrays, things that have multiple values in the data type. But they're saying there arecertain limitations in gen and basically Rum has been developed to get around those. Now, it'snot offered as a part of the core PostgreSQL install, but you can install it as an extension. And ittalks about the use case here.

[00:10:04] Now, one of the things they mentioned here is they actually added more informationto help things with ranking and make it more efficient, better for tech search with Rum indexes.And at least one disadvantage here is like looking at the size. A gen index for something thatthey were looking at was 179 megabytes, whereas the Rum index was 457 megabytes. Butbasically it makes search much faster. There was a presentation that I found from PostgresOpen 2016, and I'll include this YouTube channel link in the notes, but it talks about create indexusing Rum and by Oleg Bartunov. And he's describing the Rum index and he was seeingperformance improvements like six to ninefold in text searching because, again, the gen indexhas problems with ranking because it doesn't have enough information in it to do that. And it canalso do some things like also order by date time when you're searching on a particular textquery. So if you use a lot of full text search in PostgreSQL, these are two posts to check out tosee if you may want to look into using the Rum index.

[00:11:14] The next post is a glance at PG Cron to automatically schedule database tasks. Andthis is from Luca Ferrari at fluca 1978 GitHub IO. And he's talking about something that wasdeveloped by Citus as a way to store scheduled jobs in PostgreSQL using PG Cron. So this isan extension that you can install and it has a Cron like syntax and you can set up to rundatabase jobs directly within your database system.

[00:11:45] I'm probably going to still be sticking with Cron, but if you have a use case where youactually want to store the jobs in the database itself to complete on a periodic basis, definitely ablog post to check out.

[00:11:57] The last post is an overview of Sharding in PostgreSQL and how it relates tomongoDBs. And this is from Percona.com blog. Now this is related to a YouTube video that wasshared a few weeks ago in Scaling Postgres, but it talks about first partitioning in PostgreSQLand particularly declarative partitioning. And then it goes into Sharding with a concept of you'restoring one of these partition tables in a totally separate server instance and using a PostgreSQLforeign data wrapper to actually communicate with that secondary instance that has that partitiontable on it and it goes through the code and shows you how you can do this today. And they alsogive some advice on when does it make sense to partition a table and then when should youtake the next step, potentially and look at sharding. Now, again, there are solutions that do thisout of the box, like Cybus, but this is where you're just doing it using the PostgreSQL tools thatare available today in essentially the community edition. And of course, the biggestconsideration they talk about is how you shard the data, because ideally you want the queries toall take place on that one shard. Like for example, if you have a system where your customersare set up by accounts, maybe you would want to shard by account, so that when you'reprocessing one customer's data, it all happens on one of those database shards. So if you'rewanting to look into PostgreSQL for sharding, even though it's not here yet and there's somedisadvantage with regard to it, this is a blog post to check out if you're interested in that. And italso does a brief comparison of MongoDB, which has essentially replication and sharding kind ofbuilt into it on how it compares to PostgreSQL and what you can do today.

[00:13:45] That does it. For this episode of Scaling Postgres, you can get links to all the contentmentioned in the show notes. Be sure to head over to Scalingposgres.com, where you can signup to receive weekly notifications of each episode, or you could subscribe via YouTube or itunes.Thanks. --

episode_image