
World's Best, Subtransactions, Sharding, Schema Changes | Scaling Postgres 105


In this episode of Scaling Postgres, we discuss how Postgres is the world's best database, performance ramifications of subtransactions, the current state of sharding and safe schema changes.

Content Discussed

YouTube Video

Podcast Audio

Transcript



[00:00:00] In this episode of Scaling Postgres, we talk about the world's best database, subtransactions, sharding and schema changes. I'm Creston Jamison, and this is Scaling Postgres episode 105.

[00:00:20] Alright, well, we're definitely living in crazy times with the Coronavirus around, so I do hope that you and your family, friends and coworkers are safe. And of course I also hope you're having a great week. Our first piece of content is "PostgreSQL is the world's best database", and this is from 2ndQuadrant.com. This just goes through a list of features that he feels makes Postgres the best database, and I'll just go over some of the highlights from here. First is transparent security. So unlike a commercial vendor that usually keeps things secret until they have a solution and then resolves it, any security issue is released in the open and everybody works on it in the open to find the solution. So the work on security is very transparent. The next area is "multiversion concurrency control is good for you". Now, the thing to keep in mind in terms of performance is that this, as he mentions, is best for high insert and select workloads. So right now, updates and deletes take a bit of a hit, because once something is updated it doesn't do an update in place; it marks the old row version as no longer used. So essentially, if you have a lot of updates going on, it's creating more and more Postgres rows that need to be vacuumed, so there's a performance penalty for that. With some of the new storage solutions they're talking about coming online, that may speed up some of this, because it still maintains the multiversion concurrency control, but it's just something to keep in mind in terms of performance. Next is "PostgreSQL does everything". He's proposing that all of the features you would see in NoSQL solutions such as Riak, Redis and Mongo, it has the ability to do, although of course maybe without some of the richness that exists in those solutions. The next section is "hook it to a map". So PostGIS is very significantly used and chose PostgreSQL as its backend database, so there's a lot of support for that. "PostgreSQL is growing", leading the way in open source.
It is a very stable, well run project. I have to agree with that. "PostgreSQL builds solutions that are stable forever." Now, you can run them forever because it is totally open source. There's no built-in support, of course, but you can run particular versions forever, although they still come with upgrades. "Declarative is better than imperative." So he's advocating for SQL versus other types of query languages for a database.
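As a quick sketch of the MVCC point above (the table, column names, and values here are invented for illustration), you can watch dead row versions accumulate from updates and wait for vacuum:

```sql
-- Each UPDATE leaves behind a dead row version that autovacuum later cleans up.
CREATE TABLE counters (id int PRIMARY KEY, value int);
INSERT INTO counters VALUES (1, 0);

-- Run a couple of updates; each one creates a new row version.
UPDATE counters SET value = value + 1 WHERE id = 1;
UPDATE counters SET value = value + 1 WHERE id = 1;

-- Check accumulated dead tuples awaiting vacuum. Note that the statistics
-- views are updated with a short delay, so the counts may lag slightly.
SELECT n_live_tup, n_dead_tup
FROM pg_stat_user_tables
WHERE relname = 'counters';
```

This is the overhead he's referring to: read-heavy workloads rarely notice it, but update-heavy workloads pay for it in vacuum activity.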

[00:02:48] "Extend PostgreSQL the way you like", so all of the different extensions that you can add to it, and you can even write your own software to make PostgreSQL your own. "Go big and go wide", talking about a lot of performance improvements for significantly large workloads like partitioning, parallel execution, partial indexes, tablespaces, caching, et cetera.

[00:03:11] All the variety of indexes that are available: GiST, SP-GiST, KNN GiST, GIN, BRIN, B-tree, et cetera.
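To make that index variety concrete, here is a small sketch (the table and column names are invented for illustration):

```sql
-- B-tree is the default and comes with the PRIMARY KEY.
CREATE TABLE docs (
    id         serial PRIMARY KEY,
    body       tsvector,     -- full text search document
    tags       jsonb,        -- semi-structured data
    created_at timestamptz   -- naturally ordered insert-time column
);

CREATE INDEX docs_body_gin     ON docs USING gin  (body);        -- GIN for full text
CREATE INDEX docs_tags_gin     ON docs USING gin  (tags);        -- GIN for jsonb containment
CREATE INDEX docs_created_brin ON docs USING brin (created_at);  -- BRIN for large, ordered tables

-- GiST and SP-GiST are typically used for geometric, range, or
-- nearest-neighbor (KNN) searches, e.g. on a tsrange or PostGIS geometry column.
```

Each index type trades size, build time, and supported operators differently, which is part of the flexibility the post is highlighting.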

[00:03:20] "Roll it in, roll it out". They're talking about how you can use this as the heart of a data management platform where you can easily roll data in or send it out: you can use logical replication, you can use foreign data wrappers for managing data. "The license is wide open", so a very liberal BSD-style license. A fantastic documentation. It's developed with test driven development to be able to achieve, as he says, quote, "this provides for zero known bug releases". Internationalization and localization. Cloud operation, so you can operate in the cloud. Standards compliance, always working towards the standards, and language features. So I felt this was an interesting post. You may want to take a look at all the different features that Postgres has, particularly if you're having discussions with other developers with regards to its use.

[00:04:10] The next piece of content is "Subtransactions and performance in PostgreSQL". So this is a post that he says came from troubleshooting a PostgreSQL performance problem. So first it goes into: what are subtransactions? So with a transaction, basically you do a BEGIN, you start some level of work and then you eventually end it or commit it. And within that transaction you can choose to roll it back before the commit, and the work is done as a unit. Now, subtransactions, he says, are a way to roll back just part of what you are working on, and you can start a subtransaction inside a transaction by creating a savepoint with a name as an identifier. And then if you want to roll back to that point in the transaction, you just do ROLLBACK TO SAVEPOINT and give the name you provided. And here's the use of a subtransaction. So you do a BEGIN, you do some work, and you do something that results in an error, a division by zero error. Then when you continue to try to do more work, that transaction was aborted, so everything was lost. But if you do the same thing, you begin a transaction, you do some work, do a SELECT statement, and then you create the savepoint. Then you cause that division by zero error. At this point in the transaction you can do a ROLLBACK TO the savepoint and then do some more work. So it's as if this error never happened, because you're reverting to the savepoint you created. Now, in terms of subtransactions in PL/pgSQL, anytime you have a block with an exception clause, essentially you're creating a subtransaction, even if the exception handler is never executed. I believe we saw that from a previous post that was talking about these, and we'll mention why this is a little bit relevant later. Then he talks about compatibility with other databases. Other databases have different behaviors; for example, they may constantly auto commit things, whereas PostgreSQL waits for that commit. It doesn't auto commit something.
But there are drivers that are used with PostgreSQL, like, they're saying, the JDBC driver, that have a parameter called autosave, and that could potentially lead to some performance problems.
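The savepoint flow described above can be sketched like this (the table and values are invented for illustration):

```sql
BEGIN;
INSERT INTO accounts (id, balance) VALUES (1, 100);   -- some work
SELECT balance FROM accounts WHERE id = 1;

SAVEPOINT before_risky;                   -- starts a subtransaction
SELECT 1 / 0;                             -- ERROR: division by zero

-- Without the savepoint, the whole transaction would now be aborted.
ROLLBACK TO SAVEPOINT before_risky;       -- only the subtransaction is undone
UPDATE accounts SET balance = balance + 10 WHERE id = 1;  -- work continues
COMMIT;                                   -- the INSERT and UPDATE both persist
```

This is the same mechanism a PL/pgSQL exception block, or a driver's autosave option, uses under the covers, which is why those features carry the subtransaction cost discussed next.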

[00:06:25] So he did this as a performance test. He created an unlogged table with a reduced fill factor and inserted some data into it, and then ran pgbench with a few queries. And he set up savepoints. So it ran a savepoint, set a random number, and then executed a SELECT and then an UPDATE. And he did two tests. One had 60 savepoints for test number one. The next test, test number two, had 90 savepoints, and we'll see why this is relevant here. So he had a machine with eight cores. He set the test to run six concurrent clients for ten minutes. And then this is the pgbench he ran, and it gives the output here. And then he used this perf command to be able to track what the database is doing. And you can see, in test number one with 60 subtransactions, the transactions per second were 167, and nothing looks relatively out of the ordinary here. But when he goes up to 90 subtransactions, the transactions per second drops to less than half of the previous 167, to 69 transactions per second, as well as having a lot of work being spent in locking areas like LWLockAttemptLock, LWLockRelease and LWLockAcquire. So he says, quote, "even if we take into account that transactions in test two are 50% longer, there is still a performance regression of 60% compared to test one." Now, here he goes over the implementation of how subtransactions happen, how transactions are stored in the pg_xact directory, and how subtransactions are stored in the pg_subtrans subdirectory, and then goes into visibility, and how you can determine whether these transactions and subtransactions are visible. But the important point here is that it creates a bit of a snapshot to determine what's visible, and it, quote, has room for at most 64 non-aborted subtransactions per session. If there are more than 64 subtransactions, the snapshot is marked as suboverflowed.
So basically, that's why 60 subtransactions performed well, but when he went to 90 subtransactions in this test, it slowed down. So he says, quote, "a suboverflowed snapshot does not contain all the data required to determine visibility, so PostgreSQL will occasionally have to resort to pg_subtrans." And that results in essentially lightweight locks while trying to look up this information. Now, he gives a couple of different ways you can diagnose the problem of too many subtransactions: number one, if your workload performs well when run single-threaded, but badly when run in several concurrent database sessions; two, you often see the wait event SubtransControlLock in pg_stat_activity; or three, if you export a snapshot using the pg_export_snapshot() function, the resulting file in the pg_snapshots subdirectory of the data directory will contain the line "sof: 1" to indicate that the subtransaction array overflowed. Now, the thing to keep in mind is that we've been talking about using savepoints, and that's how he did his test here, but this also applies to exception blocks, because exception blocks create subtransactions even when not fired. So you can imagine, in PL/pgSQL code, if you have a loop that goes over and executes this multiple times, once you hit that 64 limit you're going to start getting performance ramifications from that. So definitely something to keep in mind, which kind of goes to what he's signifying in the image at the top of this post: when you're looping over a thousand times and you have an exception block, you're going to start running into performance problems due to subtransactions. So definitely a great blog post; I suggest you check it out. The next post is "Horizontal scalability / sharding in PostgreSQL core: missing pieces of the puzzle". So this is a follow-on post from a series of three that they did back in August 2019 talking about the future of sharding with PostgreSQL. And basically the intent is to go the foreign data wrapper route.
So use foreign data wrappers to create separate foreign servers, and essentially those would be the shards. So you have a primary database and it uses the foreign data wrapper technique to treat those as separate shards of a common database. And he goes over that process, discussing how you could potentially create it. And a little bit of this is ideally what should be done, but what are the missing pieces of the puzzle? So basically the core PostgreSQL community wants to eventually get to supporting sharding. This is probably similar to a Citus type solution. But in terms of what's missing: number one, you would need a global transaction manager, with two-phase commit for foreign data wrapper transactions. So he mentions this is in process and describes the state of it, as well as that HighGo Software is contributing to this, and he says it looks promising for Postgres 14.
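A rough sketch of the FDW-based sharding idea he describes might look like this (the server name, host, and table layout are all hypothetical):

```sql
CREATE EXTENSION postgres_fdw;

-- Declare a remote Postgres instance as a foreign server (one per shard).
CREATE SERVER shard1 FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'shard1.example.com', dbname 'app');
CREATE USER MAPPING FOR CURRENT_USER SERVER shard1
    OPTIONS (user 'app', password 'secret');

-- Hash-partition a table and place one partition on the remote shard,
-- keeping the other local.
CREATE TABLE measurements (ts timestamptz, device_id int, reading numeric)
    PARTITION BY HASH (device_id);

CREATE FOREIGN TABLE measurements_p0
    PARTITION OF measurements FOR VALUES WITH (MODULUS 2, REMAINDER 0)
    SERVER shard1;

CREATE TABLE measurements_p1
    PARTITION OF measurements FOR VALUES WITH (MODULUS 2, REMAINDER 1);
```

Queries against measurements then route rows to the right partition, local or remote; the missing pieces he lists (global transaction manager, global snapshots, parallel FDW scan, shard management) are what would make this production-grade in core.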

[00:11:39] The next area is a global snapshot for cluster-wide read consistency, so basically visibility of where transactions are across the cluster of machines. And again he goes over the current status of this, although he doesn't project here when it will be done, so it still may be a ways off. The next area is parallel foreign data wrapper scan, so basically asynchronous query execution. And it looks like, as he says, while this patch is destined for PG 14, they may get a piece of it in Postgres 13, so that's something that could potentially be beneficial. And lastly, mostly regarding management, he mentions shard management. So the partitions on foreign servers do not currently get created automatically as described above. So basically it's the management of a cluster: how are you going to handle the creation of the different tables on the foreign servers? So this is something that needs to be investigated. Now, they also mention they have their own version of PostgreSQL that does something like this, and you can choose to look at that if you'd like. Now, apart from this, he also says there's a need for general management tools as well. So for example, how do you manage the cluster and monitor the machines? How do you do backup and recovery when you have all these different databases sharded around?

[00:13:02] How do you support high availability? If one of the foreign servers goes down, how do you handle that? So there are a lot of areas of maintenance and management you need to consider. But if you want to look at what the future of sharding is, here's an updated post you can check out.

[00:13:21] The next post is "Fear-free PostgreSQL migrations for Rails". Now, this is dedicated to Rails, and it's talking about database migrations that alter the schema of your database. And this is from GoCardless.com, and they're talking about how schema changes can run into problems. So first he goes over migrations, what they are, how they work, and then mentions how you can run into problems.

[00:13:47] Now, for versions prior to eleven, if you added a column with a default it would lock and rewrite the table. That doesn't happen since eleven, but there are other areas where you can encounter problems. When a table gets an access exclusive lock, you could get stuck behind a lock queue, because there are other locks in front of it and that lock is waiting in a queue, which locks everything up. So it can still cause problems, and essentially downtime for your production environment, if you don't do migrations carefully. So they describe that, but then they go on to potentially solving the problem, saying they had developed a gem, which is a Ruby library, called Active Record Safer Migrations to help prevent that. Well, now they have a new solution called Nandi. Now, not many people may be using Ruby, but what's interesting is the techniques that they use, so you may want to consult this library or look at what they do. Of course, one of the main things they do is use a lock timeout and a statement timeout. So if any statement exceeds a particular amount of time it errors out, or if any lock is maintained for greater than 5 seconds it goes ahead and stops the migration. So those are great things to have. But also, when you're creating a foreign key, you're locking two different tables with an access exclusive lock, and there's a particular technique you need to use in order to not bring your database down. And this library, Nandi, does it for you. So for example, the first step it does is create the new column that's needed, then it creates the constraint, and finally it validates the constraint. So that's a process to avoid issues, and they're continuing to work on this library. So if you want to learn more about ways to migrate your schema safely, definitely check out this blog post, particularly if you use Ruby or Rails.
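The timeout settings and the three-step foreign key technique described above can be written as plain SQL, independent of Ruby (table and constraint names here are invented for illustration):

```sql
-- Timeouts keep a stuck migration from blocking the whole application:
-- give up on lock waits quickly and cap how long any statement may run.
SET lock_timeout = '5s';
SET statement_timeout = '10s';

-- Step 1: add the column without a constraint (brief lock, no table rewrite
-- on modern Postgres).
ALTER TABLE payments ADD COLUMN customer_id bigint;

-- Step 2: add the foreign key as NOT VALID, so existing rows are not scanned
-- and the access exclusive locks on both tables are held only briefly.
ALTER TABLE payments
    ADD CONSTRAINT payments_customer_fk
    FOREIGN KEY (customer_id) REFERENCES customers (id) NOT VALID;

-- Step 3: validate in a separate statement; this scans the table but takes
-- only a weaker lock, so normal reads and writes continue.
ALTER TABLE payments VALIDATE CONSTRAINT payments_customer_fk;
```

Splitting step 2 from step 3 is the key trick: the expensive full-table scan happens under a lock that does not block concurrent traffic.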

[00:15:32] The next piece of content is "How to provision and deploy highly available Postgres with Terraform and Ansible automation scripts". This is from Enterprisedb.com, and essentially on GitHub they've released a postgres-deployment repository that contains Terraform and Ansible scripts to be able to stand up a replicated environment with management tools as well. Now, they are using the EnterpriseDB management tools, but still, the fact that they open sourced it means you could definitely look at how they have their Terraform set up and how they have their Ansible set up, to see if you want to use that or even modify it for your own use cases to be able to manage your Postgres deployment. So if you're interested in that, definitely a blog post to check out. The next piece of content is actually a PDF from Momjian.us, "Databases, Containers, and the Cloud". So he talks about using Kubernetes and database containers for managing databases, and he goes over what it is and how it can work. I found slide 14 interesting on the benefits for databases: rapid creation and destruction, and less overhead than a VM, are not huge benefits for databases; there's a little bit of a benefit for scaling and migration, in his estimation.

[00:16:55] The most powerful benefit is automated deployment, although you could be using something like Terraform and Ansible to essentially do the same thing, which is actually what I use; I use Ansible to do my deployments. But it's interesting nonetheless if you want to check out a Kubernetes and Postgres presentation. The next piece of content is "Webinar: State of full text search in PostgreSQL 12 follow up". Now, this is a webinar that covers a YouTube video we posted back in episode 100 about full text search. So this is the same individual and presentation, but it's a webinar version and a little bit longer. I believe I preferred the version that was presented in episode 100, but you can check out the webinar version just by clicking here to register. And this is from 2ndQuadrant.com. Also from 2ndQuadrant.com is "With a little help from my friends". So this is basically, for Postgres 12, talking about WITH clauses and CTEs and the enhancement that was done in terms of inlining, as well as how to use it in various use cases.
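For reference, the Postgres 12 CTE change means WITH queries are now inlined into the outer query by default, and the old fencing behavior has to be requested explicitly. A minimal sketch (the table name is invented):

```sql
-- Force the pre-12 behavior: compute the CTE once as an optimization fence.
WITH recent AS MATERIALIZED (
    SELECT * FROM orders WHERE created_at > now() - interval '1 day'
)
SELECT count(*) FROM recent;

-- Explicitly allow inlining, so predicates and indexes on "orders"
-- can be used directly (this is also the default in Postgres 12
-- for CTEs referenced only once).
WITH recent AS NOT MATERIALIZED (
    SELECT * FROM orders WHERE created_at > now() - interval '1 day'
)
SELECT count(*) FROM recent;
```

Comparing EXPLAIN output for the two forms is a quick way to see whether the fence is helping or hurting a given query.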

[00:18:02] The next post, also from 2ndQuadrant.com, is "How to deploy a multi-master BDR cluster in Kubernetes". So again, BDR is bi-directional replication, and this covers how you would deploy it using Kubernetes. And they have a video that shows the different commands to run through, as well as a text description of it. So if you're interested in BDR on Kubernetes, definitely check out this post.

[00:18:26] The next piece of content is that the new PostgreSQL Person of the Week is Christophe Pettus. So this is an interview with Christophe about his work and contributions in Postgres. And finally, the last post is "Querying spatial features with pg_featureserv". So this is the next microservice that Crunchy Data has introduced for the purposes of providing a web interface for PostGIS data, and they tell you how to get started with it right here. So if you're interested in that, go ahead and check out this article. That does it for this episode of Scaling Postgres. You can get links to all the content mentioned in the show notes. Be sure to head over to Scalingpostgres.com, where you can sign up to receive weekly notifications of each episode. Or you could subscribe via YouTube or iTunes. Thanks.
