Multiple Databases, Slow Queries, Sorting Performance | Scaling Postgres 27

Join Over 1,000 Engineers & Get New Episodes Weekly!

In this episode of Scaling Postgres, we review articles covering scaling with multiple databases, detecting slow queries, and sorting performance.

Content Discussed

YouTube Video

Podcast Audio


[00:00:00] In this episode of Scaling Postgres, we talk about scaling with multiple databases,detecting slow queries, sorting performance, and using foreign data wrappers. I'm KristenJameson, and this is scaling postgres episode 27.

[00:00:22] Alright, we have a lot of content to go over this week, but I'll try to move through itquickly.

[00:00:29] The first post is Scaling at Instacart, distributing data across multiple postgresdatabases with Rails. So this is from the company Instacart and this is from the blog.

[00:00:44] So they had a challenge, basically, they had a lot of data and they were going to berunning out of space. So they were going to be hitting a six terabyte limit raised to twelveterabytes before they finished on Amazon RDS. So basically the solution that they actually cameup with was segmenting their data into two separate databases. So the post describes it as theyhave a four sided marketplace, essentially customer shoppers who pick up and deliver thegrocers to customers, as well as they work closely with retailers and consumer packaged goodcompanies.

[00:01:22] So basically, they examined all the tables in the database and they only found a fewdozen tables that were needed by multiple domains. And they're saying, for example, when acustomer places orders, the orders are written to the customer's database in the customerdomain. However, that same data is also needed in the fulfillment domain. As this diagramdemonstrates, they wanted to separate their data into two databases along a domain boundary.So this would be the customer domain, say, and this would be the fulfillment domain, say, andeach database would have ownership of specific tables. Now, if you're looking at this, you'reprobably thinking about microservices. However, they did this a little bit differently. So they usedone application that actually talks to both databases, depending upon what table they wanted tofill information in on. So they weren't using a microservices architecture, but they were havingseparate databases and they would call it to the database based upon the information that theyneeded. So essentially, they have a monolith application. And this code shows how they're usingtwo database connections with Rails. Now, one of the databases is essentially the master of aparticular table, and because they wanted to avoid master to master, they basically assigned thisparticular database would own this table, and if the other database needed that data, it would becopied over. And then they put some controls in place, as demonstrated in this code, at both thedatabase user level to prevent writes as well as in the application to prevent writing to essentiallywhat should be a readonly table. And then basically they did one way sync between thedatabases, copying the table information that was needed from one database to another usingwhat they're calling an application level data pump. Now they said they experimented usingforeign data wrappers in conjunction with materialized views. But they said, quote, theexperiment wasn't successful, but we found materialized view refreshers to be potentiallyproblematic at scale. But it's interesting. I wonder if the foreign data wrappers could haveworked. But basically they didn't use foreign data wrappers. They didn't seem to use logicalreplication, nor did they use logical decoding. Those are some postgres feature they could haveused, but they decided to do replication at the application level and they said it gave themgreater logical control. And they're using something called Hub, which is their own Pub subimplementation. And they go over some of the things that they're doing whereby they arechoosing an eventual consistency model. So eventually those copy read only tables will bebrought over to the next database, they do some things to verify data integrity of course andthen they went through their whole process of actually migrating a table to reside in this separatedatabase. So this is definitely an interesting approach to take and again it harkens back tolooking at microservices where each of those services tend to have their own database but herethey have a monolithic application. But I guess due to the constraints they were seeing withRDS, they decided to actually shard their database solution into two or potentially moredatabases, I'm not sure, but they use that one monolithic application to write two specific tablesin --each of those databases and then sync data that's needed from one to the other. So definitelyan interesting blog post to check out.

[00:04:56] The next post is Three ways to detect slow queries in PostgreSQL and this is from theCyberTech blog and basically number one is using the slow query log. So this isin the PostgreSQL configuration. You can set the log min duration statement and any statementthat is longer than this many milliseconds will be logged to the PostgreSQL log. So any queriesthat happen will show up in there and you can look for slow queries in there and they note hereyou can also do it for specific databases. So one database could have a setting that's differentfrom another. And then if you happen to see a slow query pop up you would basically use anexplain analyze on that query to see what potentially took it so long to run. And what was thequery plan? Now, another alternative with regard to query plans that they mention is that youcan use auto explain. So this is something that you can load into your session and turn it on andit will automatically log queries that are longer than a specific duration and he says you can evenin the configuration do set session preload libraries to enable it. So that way if you have a slowquery it will not only log that query but also give an explain plan for what plan was taken toexecute that query and the third is checking PG stat statements. So the first options are great forfinding slow queries. However, when you have many many thousands or millions of shortrunning queries, but it's because they're running for so long. Even something that's taking 100milliseconds, if you're running it thousands, millions of times, that could actually be thebottleneck. And if you go to reduce the speed of that query, even though it is pretty fast, if youcan shave off 50% of how long it takes to run, that could have a huge benefit if you're running ita lot.

[00:06:46] And basically to enable PG Stat statements, you just add it to the shared preloadlibraries of your file, you create the extension and then it allows you to collectstatistics on queries. Now, he doesn't go into depth about using this in this post, but there is alink post about using PG Stat statements in order to track how often these queries are run andthe resources used. So, good blog post to check out. The next post is from the same blog. It'sPostgreSQL improving sort performance. And this is again from the blog. Hestarts off creating some sample data and then determining how to sort. Now, in his example, hesorted a lot of data and it basically had to go to an external disk merge. And he says, of course,the most important performance to change in order to speed up sorts, not to be on disk if thereare large sorts, is to increase your work memory. So in this example, he had four megabytes ofwork memory, he sets it up to 1GB and now that sort happens in memory and it goes from takingsix and a half seconds to 2.7 seconds. So definitely work memory is something that you canincrease to give more memory to sort operations and other types of aggregations. However, thething to keep in mind he doesn't necessarily mention here is that work memory is used per userand even sometimes multiple times within a user's query. So you do need to keep a ceiling onthat. If you have a lot of connections to your database, if you have a dedicated analyticaldatabase that has not that many users, then you can feel free to crank it up. But if you have likea transactional work will load with many, many users or many connections, you're going to wantto keep a ceiling on how much work memory you're actually using. And then of course, one ofthe main ways to improve sorting is indexing. So applying an index where he's using like a limitten, it blows the doors off the other examples. So definitely having an index to assist with sortingis definitely the way to go.

[00:08:52] Now, related to indexes is our next post. So it's indexes on rails. How to make themost of your postgres database. This was from the Carol Galanchak, my apologies, I'm not surehow to pronounce that, but this is just a good overview of indexing in general, particularly forpostgres. So he goes over the index types terms of Btree hashes Bren, Gin, gist and then whenyou want to get to the stage of optimizing queries, how do you do that? Use explain or explainanalyze and then he goes on to addit --ional advice in terms of the sequence of the columns in a B tree index matter. In other words,you could do a query that just uses the first part of the index or both, but you can't use a query,won't use it if you're just looking at the last column that's indexed in a multicolumn index. Hetalks about partial indexes which are actually my favorite performance improvement for queriesbecause used in certain situations they can drastically increase the performance of queries. Hetalks about expression indexes which have their uses if you're always querying by or alwaysquerying in conjunction with a function. And then he goes through about optimizing like queries.He basically says quote forget about Btree indexes for this case, so what do you use instead?You basically use a gen index and taste. And he also says to take advantage of trigram matchingprovided by the PG trigram extension. And lastly, like the previous post covers ordering and howthat can benefit from indexes. And of course, once you have a large database, or even not solarge, you always want to add your indexes concurrently. And throughout this he gives this codein Rails so he tells you how to implement it using Rails Migrations. So if you're looking for anindex review, definitely a blog post to check out.

[00:10:43] The next post is Transactional Data Operations in PostgreSQL using common tableexpressions. And this is from the Rob Connery blog and he talks about essentially his own storethat he has and how he has set up some different tables. And basically when an order comes in,he wants to record the order, record the items in the order so that's a separate table as well asrecord the downloads that someone can access as well as potentially products. And he goesover and shows you how he tends to like to set up his database in terms of using on deletecascade for foreign keys, trying to avoid nulls where he doesn't want them, when he has a notnull, always trying to set a default. And he actually generates random UUIDs in certain cases incertain columns using the PG crypto extension for postgres. Now, what's interesting about thispost is he's used a technique I haven't seen used a lot where he actually uses CTEs in order todo transactions. So for example, CTE is a common table expression essentially I like to think ofas a many function for a subquery. And the advantages he says is that CTEs execute within asingle transaction. So he can do an insert into his order table here with the value he's interested,return all that information and then in this new items essentially he's calling or I should sayretrieving the information returned and then using that to insert into his order items table,returning that value. So essentially he can run this transaction with select all from new items andhe assures that both get entered and he builds this implementation incrementally. And in thisexample he says, all right, well, what have you do about the downloads table? And then heshows a downloads implementation just by adding another common table expression. Then hesays, okay, what happens if you have multiple order items? So here he's basically showing thatexample where basically for the insert statement, he's using a select sub query to do it. Nowthen he goes into all the examples he showed above, he's hard coded the solution and how doyou handle that coding? How do you handle data integrity and validation? So what he says is hisfirst inclination is to basically put all of that in a function that you can just call and then pass inthe data needed as part of it using a JSON B variable. And another solution he shows ispotentially using a prepared statement. So it's an interesting use case. I'm not necessarilywedded to doing this particular type of implementation, but it was something I really hadn't seenbefore, so I thought it was definitely a good thing to share and show.

[00:13:22] The next post is Simplifying recursive SQL queries. So if you didn't get enoughlearning about CTEs in the previous post, this one goes even deeper because this actually usesrecursive CTEs. And this is where a table essentially references itself. So say you have a list ofemployees and one of them is the manager of another row in that table, and you have like aManager ID and how best to show who that person manages and queries. Now again, this startsoff slow, talking about what are CTEs? And then showing an example of a basic common tableexpression. And --then it goes into what's a recursive CTE basically showing where you do a union with basicallyhave a start point and then at the test at which point it exits. And when you call that, here's theresults you get basically print one to and add one to it every time until it exits, once the point isless than ten. So this post actually goes into some of the actual use case he's dealing with. AndI'll let you review this post to see if this type of hierarchical use cases is something that you needto work with. Now, we also covered this in a previous episode of Scaling Postgres, and this isfrom the blog. In their Fun with SQL series. They did recursive CTEs andPostgres. This was from back in May. In this example they're using essentially employees with aManager ID. So again, if you have a use case where you have a table that has a hierarchy withinit, one row reference than another, these are two posts definitely to check out how to work withthat using Recursive CTEs.

[00:15:01] The next post is Foreign Data Wrappers in PostgreSQL and a closer look at Postgresforeign data wrapper. And this is from the Procona database performance blog. So this goesover what a foreign data wrapper is. It basically enables you to contact a separate databasefrom within postgres and do queries against it or insert or update data. And specifically thePostgres foreign data wrapper enables you to communicate with postgres instances. Now, whatI find interesting about this is that the future of Postgres Sharding is probably going to involvethese foreign data wrappers as we've mentioned on previous episodes of Scaling Postgres. SoI'm always interested to find content that is talking about using foreign data wrappers. And this isa fairly long post that actually goes into how you set it up to communicate with a separatepostgres database. They go into some of the advantages of the foreign tables and talk aboutquery optimization and even how to set up writable foreign tables. And then they also discussissues with regards to pushing down joins predicates aggregates and push down is basicallywhere the work is pushed down to the foreign database server as opposed to the one that you'reactually logged into. So if you're looking into foreign data wrappers as a scaling initiative or youneed to access other databases from your PostgreSQL instance, definitely a blog post to checkout.

[00:16:26] Now, related to that is they have PostgreSQL accessing MySQL as a data sourceusing MySQL foreign Data Wrapper. So basically this is another post by the Procona DatabasePerformance Blog and it covers connecting to MySQL from a PostgreSQL server and it goesthrough all the different options to configure it. So if you use MySQL, definitely a blog post tocheck out, the next post is actually a YouTube video. And this is from the EuroPython conferenceYouTube channel and the title is Craig Kirstens or Kirstein's Postgres at any Scale. And this waspresented at EuroPython on July 23 to the 29th in 2018 that this YouTube video was recentlyposted. Now, this presentation has been given at other conferences, but this is the most recentversion that's come up in the areas I've monitored. So if you haven't watched it yet, definitelycheck it out. The latter part is related to Citus Data, which is a Sharding solution for PostgreSQL,but a lot of the advice is directly related to basically the community postgres version. Sodefinitely a YouTube video to check out if you haven't watched a previous version of it yet.

[00:17:39] That does it. For this episode of Scaling Postgres, you can get links to all the contentpresented in the show notes. Be sure to head over to where you can signup to receive weekly notifications of each episode. Or you could subscribe via YouTube oritunes. Thanks. --