
Analytical DBs, Recursive Queries, Replication | Scaling Postgres 2


In this episode of Scaling Postgres, we review articles covering analytical DBs (potentially using GPUs), recursive queries and different forms of replication.

Content Discussed

YouTube Video

Podcast Audio

Transcript



[00:00:00] In this episode of Scaling Postgres, we talk about recursive queries, indexing, analytical workloads using GPUs, massive streams of data, and multiple forms of replication. I'm Creston Jameson, and this is Scaling Postgres episode two.

[00:00:25] All right. The first article is Indexing with Postgres: When Less is More. This is on Medium, part of PlanGrid Technology. This post goes into a query that was running slower than anticipated. They use Python and they use SQLAlchemy as their ORM, and they had a query that, when they ran EXPLAIN ANALYZE, was taking 1.8 seconds, which was slower than they felt it should be given the amount of data they were pulling. What's interesting about this article is that it goes into depth on how Postgres works to retrieve your result set: doing index scans to retrieve the data, potentially applying recheck conditions, as well as doing bitmap heap scans to actually pull the data from disk. And he goes into how some things were happening that he didn't quite expect. Basically, he had to go over his query again and really think about how he was pulling the data and what he needed to do to make it more efficient, and he even did some experimentation as well. By doing that experimentation with the query and taking a look at the EXPLAIN plan and the performance, he was able to make the query much more efficient. For example, before, he was using two columns to do a join to the table he was interested in, but he realized he could filter it down to one column and make the query much more efficient. So he goes from, I believe, 1.8 seconds down to 23 milliseconds. So definitely a big win. Some of the points he covers are that when you're having a performance problem, you definitely want to be testing out assumptions on a production-like database, because Postgres takes into account table size, cardinality of the data, and the resources available. All of those come into play when it's determining the best route to take to process a query as efficiently as possible. And as he says at the end, "less is more" isn't an axiom you can live by, but "measure twice, cut once" is. So basically, when you have queries that are slow, you definitely need to test them against production data to ensure that they're as optimized as they can possibly be.
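To make the kind of investigation he describes a bit more concrete, here is a minimal, hypothetical sketch of using EXPLAIN ANALYZE to compare a two-column join against a simplified one-column version. The tables and columns are invented for illustration and are not from the PlanGrid post.

```sql
-- Hypothetical schema loosely inspired by the discussion above
-- (not the actual PlanGrid tables).
CREATE TABLE projects (id bigint PRIMARY KEY, org_id bigint NOT NULL);
CREATE TABLE issues (
    id         bigserial PRIMARY KEY,
    project_id bigint NOT NULL REFERENCES projects (id),
    org_id     bigint NOT NULL,
    status     text   NOT NULL
);
CREATE INDEX issues_project_id_idx ON issues (project_id);

-- Original style of query: joining on two columns, which can push the
-- planner toward bitmap index/heap scans with recheck conditions.
EXPLAIN (ANALYZE, BUFFERS)
SELECT i.*
FROM issues i
JOIN projects p ON p.id = i.project_id AND p.org_id = i.org_id
WHERE p.id = 42;

-- Simplified query: project_id already identifies the project, so the
-- extra org_id condition (and the join) can be dropped entirely.
EXPLAIN (ANALYZE, BUFFERS)
SELECT i.*
FROM issues i
WHERE i.project_id = 42;
```

Comparing the two plans and their actual timings on production-like data is exactly the "measure twice" workflow the article is advocating.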
The next article we're going to look at is from thenextplatform.com, and it's called Bringing GPUs to Bear on Bog Standard Relational Databases. This isn't explicitly about Postgres and doesn't explain how it works, but it brings something to the forefront that I think more people need to think about and be aware of. They're talking about the separation between transactional databases and analytical databases. Historically you have online transaction processing (OLTP) and online analytical processing (OLAP), and typically these were two different workloads. Your transaction processing database is very efficient at processing transactions as fast as possible: inserts, updates, selects, deletes. Whereas your analytical database is tailored for workloads where you're analyzing a lot of data, maybe doing a lot of joins, and it's okay for queries to run longer, but you usually have fewer users running those queries. That separation is still in effect today. They make a quote here: "every company, even Google and Facebook, has this split between transaction processing and warehousing." And another quote: "the split is not going away anytime soon." So it's basically a recognition that this has to happen.

What I've seen with some clients is that they use the one database for everything, and at times they get into problems doing a lot of analytical work, even customer-demanded analytics. Sometimes it's best to separate that off into a separate database, or even build summaries or aggregate tables in the database to be able to handle those queries very fast, as opposed to trying to process and read across millions of rows at a time. So it's just something to keep in mind. The other interesting thing about this article is that it goes into people looking at using GPU processing capacity to do massively parallel operations, like you would do in an analytical-type database. And interestingly, many of the vendors they mention have PostgreSQL as the core engine that runs these massively parallel queries. So I thought this was a very good article to check out. Again, it's not necessarily educational in terms of teaching you something, but it gives a broad view of the state of massively parallel analytical databases as well as their future potential with GPUs.

Okay, the next article we're going to look at is from Medium again, from Alibaba Tech: How to Create a Stream System with Tens of Millions of Feeds. This is a blog post from Alibaba, which I believe is essentially China's Google, and they go into how they're using Postgres. There's a diagram where they're streaming all of the incoming data into PostgreSQL instances, which then pass it off to a storage service, and then they're using a database system called HybridDB for PostgreSQL to actually do the analytical processing. So here again is an example of the separation between the transaction processing that's happening and the analytical processing that's happening. Now, if you're curious what HybridDB is, essentially it is their own product developed from PostgreSQL, and I believe it's mentioned a little bit in this blog post, but they basically took the Greenplum database. I'm going to the Greenplum database site now, and this is an open source, massively parallel data platform. So it does this kind of analytical processing and uses the PostgreSQL core as the engine. Now, it's not the most up-to-date PostgreSQL; I believe they're targeting moving to version nine as the core relatively soon, but it is meant for very massively parallel processing of data. So basically they took that Greenplum database and modified it for their own use case in developing this HybridDB. The post goes into their data ingestion process, and under the heading of advantages of the scheme they've come up with, the number one thing they cite is performance. PostgreSQL's powerful concurrent write performance combines with HybridDB's analytic performance: a single PostgreSQL instance can support millions of writes, which get batch loaded into their HybridDB analytic database. Their HybridDB analytic database then enables real-time, millisecond-level queries through MPP's superior analytical performance. The other things they mention are the data handling and cleaning they've set up for their system, as well as some other benefits they go into. So I thought this was a great article to look at to get, again, the perspective of how people are using Postgres and what it is best for in terms of transactional processing. I definitely encourage you to check out this article, and if you're interested in analytical processing and open source, maybe check out the Greenplum database site, which is greenplum.org.
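Circling back to the aggregate-table idea mentioned above, here is a minimal sketch of what that can look like in plain Postgres using a materialized view. The table and column names are hypothetical and are not from either article.

```sql
-- Hypothetical raw event table standing in for high-volume OLTP data
-- (not from either article).
CREATE TABLE page_views (
    account_id bigint      NOT NULL,
    viewed_at  timestamptz NOT NULL DEFAULT now()
);

-- Pre-aggregate into a small rollup so analytical/dashboard queries don't
-- scan millions of raw rows on the transactional database.
CREATE MATERIALIZED VIEW daily_page_views AS
SELECT account_id,
       date_trunc('day', viewed_at) AS day,
       count(*)                     AS views
FROM page_views
GROUP BY account_id, date_trunc('day', viewed_at);

-- A unique index lets REFRESH ... CONCURRENTLY run without blocking readers.
CREATE UNIQUE INDEX ON daily_page_views (account_id, day);
REFRESH MATERIALIZED VIEW CONCURRENTLY daily_page_views;

-- Dashboards query the rollup instead of the raw event table.
SELECT day, views FROM daily_page_views WHERE account_id = 123 ORDER BY day;
```

This is obviously much simpler than a separate analytical database or an MPP system like Greenplum, but it's often enough to keep customer-facing analytics fast on a single Postgres instance.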
Okay, for the next article, this is a bit of a fun one. It's not necessarily related to scaling your database, but it's something of interest to developers using Postgres, and I thought it was fun to take a look at because I am actually a weekend dungeon master. The blog post is Exporting a Hierarchy in JSON with Recursive Queries, and this is from the site tapoueh.org. There will definitely be a link for this in the show notes that you can take a look at. What was interesting is that they're using Dungeons and Dragons characters and a WITH RECURSIVE query in order to produce a JSON document using SQL. So not necessarily related to scaling Postgres, but definitely a fun and interesting article to take a look at.
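As a rough sketch of the technique the post uses, here is WITH RECURSIVE walking a generic, made-up parent/child table and emitting JSON; unlike the post, which uses Dungeons and Dragons data and builds a nested document, this simply flattens the hierarchy into a JSON array.

```sql
-- Hypothetical hierarchy table (not the article's D&D schema).
CREATE TABLE nodes (
    id        int PRIMARY KEY,
    parent_id int REFERENCES nodes (id),
    name      text NOT NULL
);

-- Walk the tree from the root(s) downward, then aggregate the rows
-- into a single JSON value.
WITH RECURSIVE tree AS (
    SELECT id, parent_id, name, 1 AS depth
    FROM nodes
    WHERE parent_id IS NULL
    UNION ALL
    SELECT n.id, n.parent_id, n.name, t.depth + 1
    FROM nodes n
    JOIN tree t ON n.parent_id = t.id
)
SELECT jsonb_pretty(jsonb_agg(to_jsonb(tree) ORDER BY depth, id))
FROM tree;
```

The article itself goes further and nests children under their parents, but the recursive CTE plus the JSON functions are the core building blocks either way.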

The next piece of content we're going to take a look at is actually a YouTube video. If you are interested in multi-master replication, there was a recent YouTube video posted that I felt was very good; I watched through the whole thing. The title is Geographically Distributed Multi-Master Replication with PostgreSQL and BDR, and BDR is an acronym for Bi-Directional Replication. Natively, Postgres does not have a facility for doing multi-master replication, but there are other solutions available from vendors that can do that for you. This talk goes into the two different camps these solutions fall into: one is a tightly coupled relationship between the masters and one is a loosely coupled relationship between the masters. What they say about tightly coupled solutions is that most things are local. So, for example, your storage is shared. Typically, in terms of consistency and isolation, it tends to preserve the ACID model, so a lot of applications can work with it fairly well. The disadvantage is geographical distance: it doesn't really work well to have your masters geographically separated, because these solutions tend to work synchronously, particularly if there's shared storage, with transactions happening in a synchronous fashion. In terms of data conflicts and collisions, they can be prevented more easily, and application compatibility is pretty transparent. The other camp is loosely coupled. This is where your storage is pretty much independent, so you have entirely separate databases with their own storage systems and replication happens between them. The storage is not shared and it depends on eventual consistency. So in location A some data gets updated, and it eventually ends up in location B. Geographically speaking, it's very good at handling that. It's not so great for consistency, of course, so you tend to have database conflicts, and usually these solutions rely on optimistic resolution of those data conflicts. And the committing of transactions to the database is typically asynchronous because there's such a distance between these systems. Now, this talk focused more on BDR, which is a loosely coupled solution, and they like to think about it a little differently, using the acronym PACELC.

[00:12:22] So basically, when things are connected, meaning the whole system is working, you have multiple masters and they're replicating data between one another, and you can choose either to have latency between them being updated, or consistency. But when you have a breakdown in your replication, so when it's partitioned, you can choose either availability or consistency. If you choose consistency, it probably means only one of those masters is operating, whereas if you choose availability, probably multiple masters are still working even if the link is broken, but now you're potentially in an inconsistent state. Another thing that was really great about this presentation is that if you're going for a loosely coupled solution, he emphasized the need to make application changes, and he goes over how to handle conflicts.

[00:13:11] How are you going to handle keys?

[00:13:15] Are you going to use something like UUIDs? Are you going to have central key distribution for the keys on your tables? How are you going to handle schema changes? And he really emphasized the importance of testing. So if you're considering multi-master replication and you're using Postgres, I definitely suggest you check out this video.

Okay, moving on to something related to replication, the next article is from the Citus Data blog and it's Three Approaches to PostgreSQL Replication and Backup. It talks about three approaches, although reading it I felt it was more like approaches to replication and recovery. The first approach is where you have streaming replication between two Postgres databases, and if you want to bring up a new node, essentially you have to start replicating data from the primary using streaming replication; basically it has to stream the entire database over the wire to set up the new replica. The second approach is where you're doing volume-level replication, replicating at the storage layer. Now, this is not something that PostgreSQL supplies natively; it is a storage-layer solution to replication. The third approach is basically taking backups from your primary, including the WAL files. Then you restore that backup, apply WAL files until the replica is close to the primary, and then you start streaming. The advantage of this approach is that you don't have to put load on the master while building the new replica. They don't explicitly show how to do this and don't have code examples, but if you're thinking about different techniques for bringing up replicas for your Postgres installation, I definitely suggest checking out this article to see their discussion of it.

The next article we're going to cover is Replication Topologies in PostgreSQL, and this is from the Ops blog. This is a more basic post that just goes over PostgreSQL replication in general, and it presents how you do master-slave replication and how you do hot standbys. It talks about eventual consistency, because generally when you're setting up a hot standby it's asynchronous; it does not write to both in a synchronous way. But it also talks about some settings you can use to actually make them synchronous standbys. And it talks a little bit about the logical replication that Postgres provides, as well as cascading replication for your replicas. So if you're just getting started with replication, this would be a good post to look at. Similarly, another post that was updated is PostgreSQL Replication Slots, again from the Ops blog. Here they talk about using replication slots. Historically you haven't needed to use replication slots to do replication, but this talks about setting them up and how to get them running for your replication setup. Now, the last piece of content we're going to take a look at is another YouTube video. This is PostgreSQL Replication in 2018. So, again on this theme of replication, it goes through and talks about all the different replication options and things to look at. If you're looking at replication and thinking about changing the scheme you're using, this is definitely a video to watch. Now, I will warn you that the audio of this presentation is not great, so just be aware of that as you're reviewing it. But it is definitely a piece of content to take a look at if you're looking into replication.
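As a rough sketch of what the replication slots post is getting at, this is the core idea rather than a full replication walkthrough, and the slot name here is made up:

```sql
-- On the primary: create a physical replication slot so the primary retains
-- WAL until the standby has consumed it (the slot name is arbitrary).
SELECT pg_create_physical_replication_slot('standby_1');

-- Check which slots exist, whether they're in use, and where they are in the WAL.
SELECT slot_name, slot_type, active, restart_lsn FROM pg_replication_slots;
```

In the PostgreSQL 9.x-era setups these posts describe, the standby's recovery.conf would then reference that slot via primary_slot_name alongside its primary_conninfo setting.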
Now, that does it for episode two of Scaling Postgres. Be sure to check out the links in the show notes to each article or piece of content that I mentioned. Be sure to head on over to Scalingpostgres.com so that you can get weekly notifications of each episode, or you can feel free to subscribe via YouTube or iTunes. Thanks.
