
Multi-Terabyte Scaling, Kubernetes DBaaS, Encryption, High Availability | Scaling Postgres 42


In this episode of Scaling Postgres, we review articles covering multi-terabyte scaling, building a Kubernetes DBaaS, encryption, and building high availability.

Transcript



[00:00:00] In this episode of Scaling Postgres, we talk about multi-terabyte scaling, Kubernetes database as a service, encryption, and high availability. I'm Creston Jameson, and this is Scaling Postgres, episode 42.

[00:00:18] Alright, our first article is "PostgreSQL affiliate projects for horizontal multi-terabyte scaling". This is from the CyberTec postgresql.com blog, and it's a follow-on to a post released a couple of weeks ago about terabyte scaling. It talks about what you want to do when scaling to the multi-terabyte level, which essentially means some form of horizontal scaling. And he says, quote, "be warned, the road's getting bumpy, now we usually need to change the application and also the surrounding bits", which basically means doing some sharding. He mentions three different extensions or derivatives of Postgres that you can use to do a multi-terabyte scale-out of PostgreSQL. First he mentions sharding via PL/Proxy stored procedures, which he describes as some glue to get stored procedure calls to reach your sharded database, and he notes this has been battle-tested at Skype. So this is one solution, but it does require using stored procedures. The next option is Postgres-XL, which he says, quote, could perhaps be described as a PostgreSQL-based sharding framework. It lives somewhat under the umbrella of the PostgreSQL development group, but it lags a version or two behind current PostgreSQL. He says it lets you run queries on tens of terabytes of data with relatively few restrictions, although he does mention some caveats, and the biggest cluster he's heard about has 130 terabytes of data on it. The next solution he mentions is sharding with Citus. Citus is an extension of PostgreSQL, and we've mentioned it many times on Scaling Postgres. It's an extension you can add to PostgreSQL to shard your database across multiple database servers, and they also offer a hosted service in addition to the extension, so that's a possibility. And lastly there is Greenplum, which is an actual fork of PostgreSQL dedicated to data warehousing, so generally for massively parallel processing; you probably wouldn't want to use it for online transaction processing applications. So those are some of the options available today if you're looking to do a scale-out of PostgreSQL, as sketched below. But keep in mind that over the next few versions of PostgreSQL, the community is moving toward using parallelism, partitioning, and foreign data wrappers to build out the ability to horizontally scale your PostgreSQL database system. With each new feature added to the core, we're getting closer to the point where PostgreSQL will do something like this out of the box. But if you have a large data size you need to manage today and want to do a scale-out, here are some great options.
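To make the Citus option concrete, here is a minimal sketch of sharding a table across worker nodes. The table, columns, and distribution key are hypothetical examples; CREATE EXTENSION and create_distributed_table are the documented Citus calls:

```sql
-- Load the Citus extension on the coordinator node
-- (it must also be installed on each worker).
CREATE EXTENSION citus;

-- A hypothetical multi-tenant events table.
CREATE TABLE events (
    tenant_id  bigint NOT NULL,
    event_id   bigint NOT NULL,
    payload    jsonb,
    created_at timestamptz DEFAULT now()
);

-- Distribute (shard) the table across the cluster by tenant_id.
SELECT create_distributed_table('events', 'tenant_id');
```

Queries that filter on the distribution column can be routed to a single shard, while broader analytical queries fan out across all shards.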

[00:03:14] The next post is actually a YouTube video, "PostgreSQL and Kubernetes: Database as a Service without a vendor lock-in", posted on the Dalibo YouTube channel. He covers PostgreSQL in general, but then talks about how you could run it on Kubernetes and all the different component parts you would need to put in place to build your own database as a service, basically bringing up multiple database systems. Maybe you're going for a microservices architecture, or potentially you want to do your own scale-out. Kubernetes can coordinate all the different database systems you need to set up, and you can do this using open source software without having to rely on a vendor such as AWS or Google Cloud or whomever. So this is an interesting presentation on how you could actually put this together. And again, with the future of scaling being scale-out across multiple database instances, this is definitely a presentation that could prove useful in the future.

[00:04:19] The next post is "Databases vs. encryption", and this is from the 2ndQuadrant.com blog. Now, I wouldn't necessarily say this is databases versus encryption so much as: how do you encrypt data in a database, given all the different requirements for data protection? One of the big ones is full disk encryption. There are a lot of vendors that provide it and tools you can use to do it, and that's kind of a baseline you would want. But for really protecting your data from people who actually get access to the database or to your application, it's probably not going to be sufficient. He then mentions the pgcrypto extension, which addresses some of the issues of encrypting data in your database. But a disadvantage here, he says, quote, is that "it means the database has to know the keys that will decrypt the data", and those are likely part of SQL queries, so the issue of leaking data into server logs and monitoring systems is still there. So he contends it's not a great solution. What a lot of people do instead is application encryption: within the application itself, for data you want to secure, you encrypt it, store it in the database, and then it's the application that decrypts it. So the data is stored in an encrypted or scrambled fashion in the database. Now, he mentions that this just leaves the database acting as, quote, "dumb storage", which means it can't really compare values. One thing this blog post proposes is doing encryption as part of a separate service, potentially on its own instance. The main benefit is if you want to compare data, say, search against encrypted values: you rely on this service to do the encryption and decryption for you so that values can be compared. He also talks about an experimental extension, ccnumber, that is looking at doing some of this, so that's definitely something to take a look at. I've also heard of other tools that enable blind indexing: even though you have an encrypted field, you can build what's called a blind index over some of that data without revealing what the contents are. The post definitely discusses some of the issues with encryption in databases, and if you need to improve your security by encrypting data in your database, it's definitely one to check out.
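To make the pgcrypto concern concrete, here is a minimal hedged sketch; the table and key are hypothetical, but pgp_sym_encrypt and pgp_sym_decrypt are standard pgcrypto functions. Notice how the key literal appears in the SQL text, which is exactly what can leak into logs:

```sql
CREATE EXTENSION pgcrypto;

-- Hypothetical table holding an encrypted column.
CREATE TABLE users (
    id        bigserial PRIMARY KEY,
    email_enc bytea  -- ciphertext produced by pgp_sym_encrypt
);

-- The symmetric key ('secret-key') is embedded in the query text,
-- so it can end up in server logs and monitoring systems.
INSERT INTO users (email_enc)
VALUES (pgp_sym_encrypt('alice@example.com', 'secret-key'));

SELECT pgp_sym_decrypt(email_enc, 'secret-key') AS email FROM users;
```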

[00:06:50] The next post is "Waiting for PostgreSQL 12 – Add log_statement_sample_rate parameter". So this is a feature that has been committed for PostgreSQL 12; we don't know for sure it will make it into the release, but hopefully it should. It addresses an issue with log_min_duration_statement, which lets you log any statement that runs longer than a certain duration. Basically, you can search through the logs and find slow statements, but the issue comes when you have very short-running statements, just so many of them. One option is to use pg_stat_statements, which keeps aggregate statistics about queries, so you can deduce which of those many, many relatively fast queries are potentially making the system slow overall. The other option is this enhancement, which logs only a sample of the statements that qualify, so that you can set log_min_duration_statement a bit lower, or even log all statements, but only a sample of them. In the example test here, they were able to reduce the size of their log file more than tenfold because they set the sample rate to essentially 10%, changing it from 1 to 0.1. So this could be a great feature if you want to reduce your log size but still get some visibility into the queries running against your database.
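As a hedged sketch of how this might be used once the feature ships (the values are illustrative; log_statement_sample_rate is the parameter the post describes, and ALTER SYSTEM and pg_reload_conf are standard PostgreSQL):

```sql
-- Log statements that run longer than 100 ms...
ALTER SYSTEM SET log_min_duration_statement = '100ms';

-- ...but only log roughly 10% of those that qualify,
-- instead of the default of 1 (log every matching statement).
ALTER SYSTEM SET log_statement_sample_rate = 0.1;

-- Apply the changes without restarting the server.
SELECT pg_reload_conf();
```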

[00:08:13] The next post is "Evaluating high availability solutions for TimescaleDB and PostgreSQL". So even though they're talking about TimescaleDB, which is an extension that lets PostgreSQL act as a time-series database, the post is really about high availability: how well can a cluster of PostgreSQL instances fail over? They're talking about physical replication, where you have one primary database and two replicas, and they did an evaluation, using the criteria they lay out here, of automated solutions that would let them automatically fail over from the primary to one of the replicas. The one they chose was Patroni. They talk about how you basically set it up: you need some sort of consensus store to record which node is the primary and which are replicas, and they're using etcd here. And they walk through how the process actually works in terms of doing the failover when the primary fails, and how the next node is elected to be the new primary. So if you're potentially looking for a high availability solution, you might want to check out this post and see if you agree that Patroni would be the solution you'd like to use.
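As a small hedged aside, the role detection that tools like Patroni automate can be inspected by hand with standard PostgreSQL queries (nothing Patroni-specific here):

```sql
-- Run on any node: true means this instance is a replica
-- (in recovery), false means it is the primary.
SELECT pg_is_in_recovery();

-- Run on the primary: which replicas are streaming,
-- and how far along they are in replaying WAL.
SELECT client_addr, state, replay_lsn
FROM pg_stat_replication;
```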

[00:09:29] The next post is "How to upgrade PostgreSQL 10 to PostgreSQL 11 with zero downtime", and this is from the Severalnines.com blog. Basically, this uses logical replication to do the upgrade from PostgreSQL 10 to 11. We've seen other posts in previous issues that discuss this process, and this is another one describing how to set up logical replication and then switch over when you want to do the upgrade. Basically, you have an existing database server acting as the publisher for logical replication, and a version 11 server serving as the subscriber; once all the data is synced, you switch the application over to use the new primary. So with the release of PostgreSQL 11, if you're looking to use logical replication to upgrade from 10 to 11, this is definitely a blog post to check out.
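A minimal hedged sketch of the publish/subscribe setup the post describes (host, database, and object names are hypothetical, and the publisher also needs wal_level = logical; CREATE PUBLICATION and CREATE SUBSCRIPTION are the standard logical replication commands):

```sql
-- On the PostgreSQL 10 server (publisher):
CREATE PUBLICATION upgrade_pub FOR ALL TABLES;

-- On the PostgreSQL 11 server (subscriber), after restoring
-- the schema, e.g. with pg_dump --schema-only:
CREATE SUBSCRIPTION upgrade_sub
    CONNECTION 'host=old-primary dbname=app user=replicator'
    PUBLICATION upgrade_pub;

-- Once the subscriber reports the tables as synchronized,
-- point the application at the new server.
```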

[00:10:28] The next post is "At 22 years old, Postgres might just be the most advanced database yet", and this is from the Arcentry.com blog. Basically, this post talks about the feature set that matters to this organization and how they use PostgreSQL, its extensions, and all its different features to get exactly what they need, from pub/sub messaging to triggers to foreign data wrappers to JSONB and different add-ons. So if you want a little insight into how someone is using the features of PostgreSQL, it's a blog post to check out, and a sketch of the pub/sub piece follows below. The last post is "Why the RDBMS is the future of distributed databases, featuring Postgres and Citus", and this is from the citusdata.com blog. Again, this is a discussion of all the different features of PostgreSQL and what makes it unique compared to other database systems. They argue that one of the most important considerations in choosing a database system is development cost, and that using a relational database system such as PostgreSQL, together with the scale-out features of Citus, is probably the best choice today. So if this type of content is interesting to you, it's definitely a blog post to check out.
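As a hedged illustration of the pub/sub messaging mentioned above (the channel name and payloads are made up; LISTEN, NOTIFY, and pg_notify are built-in PostgreSQL):

```sql
-- Session A: subscribe to a channel.
LISTEN price_updates;

-- Session B: publish to that channel; every listening
-- session receives the notification and its payload.
NOTIFY price_updates, 'AAPL changed';

-- Equivalent function form, convenient inside triggers:
SELECT pg_notify('price_updates', '{"symbol": "AAPL"}');
```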

[00:11:51] That does it for this episode of Scaling Postgres. You can get links to all the content mentioned in the show notes. Be sure to head over to Scalingpostgres.com, where you can sign up to receive weekly notifications of each episode, or you can subscribe via YouTube or iTunes. Thanks.
