
Sharding's Future, Query Optimization, Replication Read Performance, PostGIS | Scaling Postgres 13


In this episode of Scaling Postgres, we review articles covering the future of sharding PostgreSQL databases, query optimization, replication read performance and PostGIS.

Content Discussed

YouTube Video

Podcast Audio

Transcript



[00:00:00] In this episode of Scaling Postgres, we talk about sharding's future, query optimization, replication read performance and PostGIS. I'm Creston Jamison, and this is Scaling Postgres episode 13.

[00:00:21] Alright, our first blog post is "Built-in Sharding for PostgreSQL", and this is from the Robert Haas blog. In it he talks about basically the future of sharding for PostgreSQL, the community version. Now, there are a couple of other solutions that allow you to do sharding today, such as Citus, which is something that I'm familiar with, but this focuses on what core PostgreSQL is going to be doing for it. And this is a really great post and I definitely suggest you read it, and even all of the linked content here; it's very interesting what they're thinking about and planning for the future. So I definitely suggest you review this. But basically he talks about, and this is a quote, "One way to look at sharding is as a form of partitioning." This is where you take a table and divide it into partitions, where the partitions might happen to be foreign tables rather than local tables. So foreign tables are basically ones that reside in another database. And there exists technology today called a foreign data wrapper, where you can communicate with other PostgreSQL databases, or even Oracle or Redis or other databases. Basically, you use that concept to be able to push down queries and direct inserts and updates to particular partitions using this foreign data wrapper technology, essentially.
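As a rough sketch of that idea, you can already combine declarative partitioning with the postgres_fdw extension so that one partition is a foreign table living on another server. This is only an illustration, not the future built-in sharding the post describes; the server name, credentials and table definitions here are hypothetical, and hash partitioning requires PostgreSQL 11 or later:

```sql
-- On the coordinator: make the remote shard reachable via postgres_fdw.
CREATE EXTENSION IF NOT EXISTS postgres_fdw;

CREATE SERVER shard2 FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'shard2.internal', dbname 'app');

CREATE USER MAPPING FOR CURRENT_USER SERVER shard2
    OPTIONS (user 'app', password 'secret');

-- A hash-partitioned table where one partition is local
-- and the other lives on the remote server.
CREATE TABLE measurements (
    device_id   bigint NOT NULL,
    recorded_at timestamptz NOT NULL,
    value       double precision
) PARTITION BY HASH (device_id);

CREATE TABLE measurements_p0 PARTITION OF measurements
    FOR VALUES WITH (MODULUS 2, REMAINDER 0);

CREATE FOREIGN TABLE measurements_p1 PARTITION OF measurements
    FOR VALUES WITH (MODULUS 2, REMAINDER 1)
    SERVER shard2;

-- Inserts and queries are routed to the right partition,
-- local or remote, by the planner.
INSERT INTO measurements VALUES (42, now(), 0.5);
```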

[00:01:47] So he's basically saying the future is being paved, but it could be another year, a couple of years, maybe three years, we don't really know. But we keep getting closer to the point where the core, community PostgreSQL could support sharding using all the different features that have been built up in version ten and version eleven. So definitely a blog post to check out, as well as each of these linked posts. And he talks about basically needing to do more push down into essentially the partitioned databases; when you're doing sharding, you're essentially partitioning the data across multiple databases. He talks about assuring ACID compliance when you're pushing down an update to a particular partition, so that it happens on a global scale and is consistent, and also basically ease of use: how easy is it to provision these multiple databases, manage them and have everything work in tandem. So again, definitely a blog post to check out for this week.

[00:02:53] The next blog post is "Simple tips for PostgreSQL query optimization", and the subtitle is "how to speed up your database performance 100-fold". And this is part of the Statsbot blog.

[00:03:08] Now it is pretty simple. So they talk about some use cases where they added a date-based multicolumn index and got a 100-fold, or they say 112-fold, performance improvement. Definitely possible. I have seen even higher performance improvements than that with indexes, particularly partial indexes. But this is covering some information about multicolumn indexes. They go over EXPLAIN ANALYZE and how you can use that to analyze exactly what a query is doing.
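As a minimal illustration of the kind of check the post walks through (the table and column names here are made up):

```sql
-- Hypothetical events table used for illustration.
CREATE TABLE events (
    id         bigserial PRIMARY KEY,
    user_id    bigint NOT NULL,
    created_at timestamptz NOT NULL,
    payload    text
);

-- EXPLAIN ANALYZE runs the query and reports the actual plan,
-- row counts and timing, so you can see whether an index was
-- used or the planner fell back to a sequential scan.
EXPLAIN ANALYZE
SELECT count(*)
FROM events
WHERE user_id = 42
  AND created_at >= now() - interval '7 days';
```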

[00:03:41] It also talks about how, at least from an analytical perspective, you may want to consider one index per query, although with a transactional database you want to minimize indexes and possibly try to use indexes that target more than one query.

[00:03:55] But basically, mostly what it covers is these multicolumn indexes, talking about the column order and how that's important, and how it can affect not just the WHERE clause, which is considered filtering, but also joins between two tables, where those indexes can be utilized, as well as ordering.
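To sketch the column-order point, using a hypothetical events table with user_id and created_at columns: an index on (user_id, created_at) can serve an equality filter on the first column plus a range filter or ORDER BY on the second, while the reverse column order generally cannot serve this query as efficiently.

```sql
-- The leading column handles the equality predicate; the second
-- column supports the range filter and the ordering within each user.
CREATE INDEX events_user_created_idx ON events (user_id, created_at);

-- This query can be answered by an ordered scan of that index:
SELECT *
FROM events
WHERE user_id = 42
  AND created_at >= now() - interval '7 days'
ORDER BY created_at DESC
LIMIT 50;
```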

[00:04:17] So definitely a blog post to check out if you want to get more up to speed on indexing. Although I will say, regarding this comment here where he says, quote, "date columns are usually one of the best candidates for the first column in a multicolumn index as it reduces scanning throughout in a predictable manner": while that can be the case a lot of the time, there are plenty of instances I have encountered where using a different type of column first and then following up with a date can give a lot better performance, particularly if you're ordering by things. But definitely a blog post to check out if you want to get more up to speed on this.

The next blog post describes a performance boost for a particular request, and this is on Medium by Oyulio Oletskaya, I believe. He talks about what is basically a search query. So they want to do a similar, or LIKE, search across two different tables, searching for something related to series and episodes by name, so doing a partial search. Now this is a little bit unique because they are actually using a citext column, which is a case-insensitive text column that basically performs a lower() when comparing values. There's a reason why these particular data types exist, but there are also some caveats in their use, and based upon their use case, they were getting sequential scans; they weren't able to use the index because they were using a trigram index. Now that might be a little bit familiar to you if you've seen the previous episode of Scaling Postgres, where we were discussing uses of ORs and how it is possible to potentially use a union with those. And he discusses the different options he tried to get better performance. So here he goes into the possible solution of using SQL unions; that got them some better performance, but because of the ORM they were using, they decided not to go down that path.
What they ended up doing is actually aggregating those two bits of data, the episode name and the season name, into one column in one table to get a lot higher performance, of course, because you're no longer having to do joins and it can directly use a GIN trigram index to get very fast response times. So this blog post was an interesting examination because, again, it's from a Ruby developer's perspective: an issue he had to deal with, the business requirements he had to work around in terms of case insensitivity, an ORM that they really didn't want to have to work around, and kind of the compromise solution he came up with for this problem. So another blog post to check out.
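A rough sketch of that final approach, assuming a hypothetical episodes table and using the pg_trgm extension for the trigram index:

```sql
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- Hypothetical table with a denormalized column that combines the
-- season name and episode name, avoiding a join at query time.
CREATE TABLE episodes (
    id          bigserial PRIMARY KEY,
    season_name text NOT NULL,
    name        text NOT NULL,
    search_text text NOT NULL  -- e.g. season_name || ' ' || name, kept in sync by the app
);

-- A GIN trigram index supports fast ILIKE / '%...%' searches,
-- which also covers the case-insensitivity requirement.
CREATE INDEX episodes_search_trgm_idx
    ON episodes USING gin (search_text gin_trgm_ops);

-- Case-insensitive partial match that can use the index:
SELECT id, name
FROM episodes
WHERE search_text ILIKE '%winter%';
```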

[00:07:22] The next blog post is "High Availability and Scalable Reads in PostgreSQL", and this is from the Timescale.com blog, with the subtitle "a detailed primer on scaling PostgreSQL via streaming replication". So this is a pretty comprehensive blog post, and it goes into kind of the why of replication and how it works: synchronizing data from a primary to a replica, how it works using WAL files, essentially the write-ahead log, and distributing that to the different replicas by different mechanisms. I found this particular point very interesting, where you're defining how you want to replicate the data. So there are basically three different options.

One is asynchronous replication. In other words, you are going to send the data to the replica without waiting to see whether it has confirmed that the data has been written or not. Now, each of these options is a balance. Asynchronous replication gives you the highest write performance: you don't have to wait to ensure that data has been written to the replicas before returning to the application that something's been written to the primary, for example. In terms of read consistency, this is classified as the weakest, because it's eventually consistent across all your replicas: it's committed on the primary, and eventually it will get out to all the replicas. And the data-loss risk is the highest, because you have no guarantees that all the data was written to the replicas.

The next replication mode is synchronous write replication. That means data has been written to the WAL files of the replicas, but not necessarily applied to the database. So it could exist in the WAL file, so it's been logged and written to disk, but if you query that replica, the database has not been updated yet. Now, the write performance of that is at about a medium level, because you just need to get the data onto disk on the replicas, in the WAL files, not necessarily into the database yet. The read consistency gets a little bit closer than asynchronous, of course, but it's still not a strong guarantee, and there's definitely a lower risk of data loss because at least it's on disk on the replicas.

The final replication mode is synchronous apply replication, and this is where you want the data to have been written and be present in the database of each of the replicas and be queryable. So here the write performance will be the lowest, because you have to wait for everything to be written to the database itself, not just the WAL file; it's going to have to update indexes and whatnot on each of the replicas. Now you have a strong consistency guarantee, because everything is essentially committed all at once across all the databases, but again, that's part of why the write performance is poor. It definitely offers the lowest risk of any data loss. So this whole section was a really great review.

And then finally they go into measuring the performance impact of each of these modes. And of course, this is by TimescaleDB, which is an extension of PostgreSQL that allows you to store time-series data; it basically allows you to accept fast inserts into specialized tables. And they're showing the performance difference: relative to no replication, when you add asynchronous replication the performance drops by 1%; when you go to synchronous write, it drops by up to 28%; and when you have synchronous apply, where you're applying to all the replicas at once, essentially you've halved the performance of PostgreSQL. And then the next area of performance they measured is how well it scales. So they looked at a three-node setup and got an average read improvement of 2.5x, and they looked at a five-node setup and got about a 4.8x improvement. So they're basically creating more replicas so that they can spread queries out across each of them. So that's very interesting. But there is one thing they did that I think is different from what you may experience in real-world conditions. For these performance measurements, this is a quote, "we present performance for a 100 million row data set that does fit in memory". So everything is fitting in memory, and that means the data can be fully cached, because the whole data set fits within memory on each of the replicas. In more real-world conditions, I normally haven't seen that; typically the database is much larger than the memory being used.
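In PostgreSQL, these three modes map onto the synchronous_standby_names and synchronous_commit settings on the primary. A sketch of the relevant configuration, where the standby name is hypothetical:

```sql
-- Shown as ALTER SYSTEM; equivalently, edit postgresql.conf and reload.

-- Asynchronous replication: commits return without waiting on standbys.
ALTER SYSTEM SET synchronous_standby_names = '';

-- Synchronous write: wait until the standby has written (not yet
-- applied) the WAL to disk before the commit returns.
ALTER SYSTEM SET synchronous_standby_names = 'standby1';
ALTER SYSTEM SET synchronous_commit = 'remote_write';

-- Synchronous apply: wait until the standby has applied the WAL,
-- so the change is already visible to queries on the replica.
ALTER SYSTEM SET synchronous_commit = 'remote_apply';
```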

[00:12:03] So the database hardly ever all fits in memory.

[00:12:07] The consequence of that is that you're not going to get a linear performance increase, because depending upon the queries hitting each replica, they're going to cache different data. So what you'll see is your cache hit rate drop: maybe it hits 90%, maybe it hits 85%, because different queries are hitting different replicas each time. Now, the only way to ensure a linear increase in performance is if you actually had a tool or a mechanism to target the same queries at the same replica database. That way you can assure high cache hit rates for each of the replicas.

[00:12:44] So I don't think in real-world conditions you will always see what it describes here, but it's definitely an interesting blog post to take a look at.

[00:12:53] Now, if you want to learn a little bit more about TimescaleDB, I presented on it in a previous episode of Scaling Postgres, but there is also another presentation that's been recently posted on the Hakka Labs YouTube channel, and it's called "Rearchitecting a SQL Database for Time-Series Data". And this was presented at DataEngConf, I guess the Data Engineering Conference, in New York City in 2017. It's by Mike Freedman, who's the co-founder and CTO of TimescaleDB. So again, this presentation goes over TimescaleDB and what it is: essentially a PostgreSQL extension that allows you to get really high performance for time-related data being fed into PostgreSQL. So if you're not familiar with it, I definitely suggest you check out this YouTube video so you can understand what potential benefits it may offer you.

[00:13:50] The next blog post is "PostGIS for Managers", and this is a presentation by Paul Ramsey of Carto.com. Now, this is called PostGIS for managers, I believe, because he's essentially talking to people who may want to get their feet wet with PostGIS, or who are already using the Oracle equivalent, I believe called Oracle Spatial, and how you could potentially migrate to open source solutions such as PostgreSQL and its PostGIS support, as opposed to using something like the Oracle Spatial solution. So he talks about the concept of support, how you transition to it, and some of the feature parity with the Oracle solution. So if you're looking to get into PostGIS, or if you're using it on another platform, this would definitely be a presentation to check out so you can see what the different capabilities are for it.

[00:14:46] The last post is "Fun with SQL: Recursive CTEs in Postgres". This is from the Citusdata.com blog, and CTEs are basically common table expressions: essentially they're WITH clauses that allow you to do subqueries more easily, but they also offer some additional capabilities, so you can use them for recursive calls, which can be used in cases where you have hierarchical data stored in PostgreSQL. And he goes over some basics of how to use it, and then also a more real-world example of where you can use this, where you're trying to pull all the different people that are managed by a particular person. So if you're unfamiliar with this capability and want to potentially add it to your toolkit, this is a pretty short post, so I suggest you check it out.
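A minimal sketch of that kind of recursive query, assuming a hypothetical employees table with a self-referencing manager_id column:

```sql
-- Hypothetical org chart: each row points at its manager.
CREATE TABLE employees (
    id         int PRIMARY KEY,
    name       text NOT NULL,
    manager_id int REFERENCES employees (id)
);

INSERT INTO employees VALUES
    (1, 'Alice', NULL),   -- top of the hierarchy
    (2, 'Bob',   1),
    (3, 'Carol', 1),
    (4, 'Dave',  2);

-- Everyone managed, directly or indirectly, by Alice (id = 1):
WITH RECURSIVE reports AS (
    SELECT id, name
    FROM employees
    WHERE manager_id = 1
    UNION ALL
    SELECT e.id, e.name
    FROM employees e
    JOIN reports r ON e.manager_id = r.id
)
SELECT * FROM reports;
-- returns Bob, Carol and Dave
```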

[00:15:36] That does it for this episode of Scaling Postgres. You can get all the links to the content presented in the show notes. Be sure to head over to Scalingpostgres.com, where you can sign up to receive weekly notifications of each episode, or you can subscribe via YouTube or iTunes. Thanks.
