
Impact of Virtualization, Compressing Data, Parallelism | Scaling Postgres 18


In this episode of Scaling Postgres, we review articles covering the impact of virtualization on databases, techniques to compress your data and parallelism.

Transcript



[00:00:00] In this episode of Scaling Postgres, we talk about the impact of virtualization, compressing data, parallelism and cross-cluster monitoring. I'm Creston Jameson and this is Scaling Postgres episode 18.

[00:00:21] All right, our first article is the State of Developer Ecosystem Survey in 2018 by JetBrains, and we're looking at the databases section here. I just like to take a look at these when they come up. The first question is which databases do you regularly use, if any? And PostgreSQL came in a respectable second to MySQL. But what's interesting is the question: do you plan to adopt or migrate to other databases in the next twelve months? PostgreSQL comes in third, but it's definitely the highest-rated relational database that people are moving to; the next highest would be maybe SQLite or Oracle. So definitely some interesting stats to be aware of.

[00:01:13] The next post is The Impact of Virtualization on Your Database, and this is from the ScyllaDB blog. This post compares a bare metal instance to a virtualized instance, prompted by a new instance type that is now available in AWS called i3.metal. Now, the i3 series is a storage-focused instance type. Normally for databases I see people using the memory-optimized instance types, but this is a storage instance type, and it's the only one available as a metal instance right now. So that's what they were comparing. This isn't necessarily PostgreSQL-specific, but it's just in general taking a look at some of the differences here, because a lot of people run PostgreSQL, maybe as an RDS instance, or they're using virtual machines to run it, and the question is what kind of impact this can have, comparing bare metal to virtualized. These are pretty much identical configurations; they actually cost the same amount, about $5 per hour. But just going to the metal instance gets you some more cores and a little bit more memory, so there's already a little bit of a performance gain due to that. They did some different analysis here, and basically the metal instance came out ahead, but I'll let you take a look at the post. Looking at the conclusion, they have this quote: "We showed in this article that despite having 12% more CPUs, i3.metal can sustain a 31% higher write throughput and up to 8 times lower read latency than the virtualized i3, confirming our expectation that removing the hypervisor from the picture can also improve the efficiency of resources." So in the interest of scaling your PostgreSQL database instance, if you have opportunities to use bare metal, that can give you some performance boost for your application. So just something to keep in mind.

[00:03:24] The next post is Scaling Past the Single Machine, and this is from the Mode Se blog. Upon seeing that title I was thinking sharding, but they're actually doing something a little bit different. They make the assessment that they're just about to hit the stage of big data, and a lot of the data they are collecting is of a time series nature. So they're getting massive amounts of data in, and they wanted to find a good way to help manage the growth of that data while still granting access to it. How they're doing this is an interesting way: using native partitioning to partition out the data, then storing some of those older partitions on a second database and accessing those through a postgres foreign data wrapper. So basically they're archiving data to a secondary database server and granting access to it through a foreign data wrapper, which is a pretty interesting technique. Now, the other piece of this is that they're using a file system that enables them to compress the data significantly, and they tried two different approaches. They looked at a CStore foreign data wrapper, which was pretty good in insert transactions per second and pretty efficient on disk use, but selecting from it performed very poorly. However, with the technique of using Btrfs with Zstd, basically a file system with compression applied, they were able to get savings on the disk being used, high insert transactions-per-second performance, and select performance approaching that of having no compression. They go over the steps they used to put this into place. For their overall solution they have a quote here: "the end result is a reduction of disk usage to 12% of the original". So pretty big savings, while keeping query performance at around 90% of the original. This is an interesting technique I haven't seen before, but it seems like a huge win if you are concerned about how much data you're storing. So, definitely a blog post to check out and an interesting technique to take a look at.
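
For a concrete picture of the archiving side of this technique, here is a minimal sketch using native partitioning plus postgres_fdw. The table, server, and column names are hypothetical (they are not from the original post), and the foreign server would point at the secondary instance whose storage sits on the compressed file system.

```sql
-- Hypothetical sketch: time-series data partitioned by month, with an
-- older partition living on a secondary "archive" server via postgres_fdw.
CREATE EXTENSION IF NOT EXISTS postgres_fdw;

CREATE SERVER archive_server FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'archive.internal', dbname 'metrics_archive');

CREATE USER MAPPING FOR CURRENT_USER SERVER archive_server
    OPTIONS (user 'archive_user', password 'secret');

CREATE TABLE measurements (
    recorded_at timestamptz NOT NULL,
    device_id   int NOT NULL,
    reading     numeric
) PARTITION BY RANGE (recorded_at);

-- Recent data stays on the primary server.
CREATE TABLE measurements_2018_06 PARTITION OF measurements
    FOR VALUES FROM ('2018-06-01') TO ('2018-07-01');

-- An older month is a foreign table pointing at a matching table that
-- already exists on the archive server (whose storage is compressed).
CREATE FOREIGN TABLE measurements_2018_01 PARTITION OF measurements
    FOR VALUES FROM ('2018-01-01') TO ('2018-02-01')
    SERVER archive_server
    OPTIONS (schema_name 'public', table_name 'measurements_2018_01');
```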

[00:05:48] The next blog post is part of the Fun with SQL series that Citus Data has done, and it's about functions in Postgres. So if you're a developer, this is kind of a catch-all of some different functions you can use with PostgreSQL.

[00:06:03] The first one they talk about is arrays and how you can store arrays in a single column within PostgreSQL. Now, even though there is an array data type, you can also use functions that aggregate a typical column into an array. So, for example, this aggregates email addresses into an array, and the result will look like this. But you can use an additional function that they show here, array_to_string, which breaks it out with a comma and a space. So there are a couple of functions you could potentially use for this if you want to, without doing it on the application side.
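
As a rough illustration of those two functions, assuming a hypothetical users table with an email column:

```sql
-- Aggregate a normal text column into an array...
SELECT array_agg(email) AS emails
FROM users;

-- ...or flatten that array into a single comma-separated string.
SELECT array_to_string(array_agg(email), ', ') AS email_list
FROM users;
```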

[00:06:38] Another one they talk about is some time functions, where you can look at differences between times. So, for example, now() gives you the current timestamp, and you can look at differences from that point in time using intervals; their example goes back seven days. You can also use date_trunc to truncate by day or by week. So these give you other functions that help you manipulate time-based data. Then they go over JSON and the different JSON functions available for working with the json and jsonb data types, particularly talking about the different operators you can use with them. So, a pretty short post, but if some of these are new to you, definitely check out the blog post. And again, this is from the Citusdata.com blog.
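
A couple of quick examples of the kinds of expressions being described, against a hypothetical events table with a created_at timestamp and a jsonb payload column:

```sql
-- Rows from the last seven days, bucketed by day with date_trunc.
SELECT date_trunc('day', created_at) AS day, count(*)
FROM events
WHERE created_at >= now() - interval '7 days'
GROUP BY 1
ORDER BY 1;

-- jsonb operators: -> returns json, ->> returns text.
SELECT payload -> 'user' ->> 'email' AS email
FROM events
WHERE payload ->> 'type' = 'signup';
```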

[00:07:28] The next post is Using force_parallel_mode Correctly. This is a pretty good post that gives you some good insight into the parallelism features of PostgreSQL. A good summary of the post is that if you're not debugging code that you're writing, you should probably never use force_parallel_mode, because it has a particular purpose when doing development in terms of testing out parallel queries. He does mention a use case where it can be beneficial when you're creating a function, but most of the time you don't want to force parallel mode for your entire database cluster. If you're looking to get more parallelized queries, there are two other parameters you should actually be adjusting: one is parallel_setup_cost and the other is parallel_tuple_cost. And there are a couple of other ones you can adjust too, min_parallel_table_scan_size and min_parallel_index_scan_size. So a good blog post that talks a little bit about how you can change some of these other configuration parameters to get more of your queries using parallelism, if that is of interest to you or you feel it could give you some additional performance. So, definitely a blog post to check out. And this is from the Robert Haas blog.
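
A small sketch of the distinction being drawn here; the specific values are illustrative only, not recommendations from the post:

```sql
-- force_parallel_mode is a debugging/testing aid, e.g. while developing a
-- query or checking that a function is safe to run in a parallel worker.
SET force_parallel_mode = on;

-- To get more parallel plans in normal operation, adjust the cost and size
-- thresholds instead.
SET parallel_setup_cost = 100;               -- default 1000
SET parallel_tuple_cost = 0.005;             -- default 0.1
SET min_parallel_table_scan_size = '1MB';    -- default 8MB
SET min_parallel_index_scan_size = '64kB';   -- default 512kB
```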

[00:08:49] The next post is PostgreSQL Clusters: Monitoring Performance. This is a blog post that talks about how, if you have multiple Postgres clusters running, like multiple servers, maybe 5 or 10 of them, you could potentially query and look into what's going on in each of those servers. Now, this had a pretty interesting idea where basically you use foreign data wrappers. They're focusing on the pg_stat_statements view to look up this information, so they talk about how you would set things up to query this view and get access to the information across multiple systems. They set up a user, they created the foreign data wrapper extension, they set up the foreign servers, in this case two servers that they want to contact and get statistics from. They map users to the appropriate user on the systems they want to monitor. They created a schema for the particular table or view they wanted to access, limiting it to pg_stat_statements, and they created a single view on, essentially, their monitoring workstation that aggregates the different pg_stat_statements together. The post even proposes a query that could allow you to take a look at how queries are performing across your multiple PostgreSQL instances. So, definitely a really interesting blog post. Again, looking at the future of PostgreSQL, when potentially sharding will be available, and maybe people are installing PostgreSQL on Kubernetes, this may be a technique to use to monitor the performance of those without having to connect to each PostgreSQL instance individually.
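
Here is a rough sketch of that kind of setup for two monitored servers. The host names, credentials, and schema names are made up for illustration and this is not the exact SQL from the post; pg_stat_statements must already be installed on each monitored server.

```sql
-- Run on a central monitoring database.
CREATE EXTENSION IF NOT EXISTS postgres_fdw;

CREATE SERVER pg1 FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'pg1.internal', dbname 'appdb');
CREATE SERVER pg2 FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'pg2.internal', dbname 'appdb');

CREATE USER MAPPING FOR CURRENT_USER SERVER pg1
    OPTIONS (user 'monitor', password 'secret');
CREATE USER MAPPING FOR CURRENT_USER SERVER pg2
    OPTIONS (user 'monitor', password 'secret');

-- Import only the pg_stat_statements view from each server.
CREATE SCHEMA pg1_stats;
CREATE SCHEMA pg2_stats;
IMPORT FOREIGN SCHEMA public LIMIT TO (pg_stat_statements)
    FROM SERVER pg1 INTO pg1_stats;
IMPORT FOREIGN SCHEMA public LIMIT TO (pg_stat_statements)
    FROM SERVER pg2 INTO pg2_stats;

-- One view that aggregates statements across all monitored servers
-- (column names such as total_time follow the versions current at the time).
CREATE VIEW all_stat_statements AS
    SELECT 'pg1' AS server, query, calls, total_time
      FROM pg1_stats.pg_stat_statements
    UNION ALL
    SELECT 'pg2' AS server, query, calls, total_time
      FROM pg2_stats.pg_stat_statements;

SELECT server, query, calls, total_time
FROM all_stat_statements
ORDER BY total_time DESC
LIMIT 20;
```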

[00:10:38] The next blog post is Elephant in the Room: Database Backup. Of course, everybody knows that they need to back up their database, but then the question is, what are you doing to ensure that that backup is going to work if it needs to? This talks about the various issues that different organizations have had and broadcast about. They seem to focus a fair amount on some of the issues that GitLab had.

[00:11:06] And here's a quote: "So, in other words, out of five backup/replication techniques deployed, none are working reliably or set up in the first place", which is a really bad situation to be in. He talks a little bit more about some of the issues that were encountered, and basically they didn't have good insight into whether their backup was working or not. Part of it is just a rehash of the GitLab incident, but it's also a reminder that it's not good enough just to have some notifications set up. You want to have something contacting you on a regular basis that is resilient to something failing. For example, maybe you want to set up some sort of job that ensures the size of the database backup is consistently where you expect it to be, and that it's not suddenly zero or suddenly empty. And on a periodic basis, be that weekly or monthly, actually restore that backup and look at the data to ensure that, yes, all the data is there and it is as up to date as you expect. Doing those things is essential to make sure that you can restore from your backup in case catastrophe happens. Definitely a blog post to check out if you're interested in that sort of thing. And this is from the hackernoon.com blog.

[00:12:33] The next post is Gracefully Scaling to 10k PostgreSQL Connections for $35 a Month, Part One. Now, I was curious why he's saying $35 a month. This is basically a post on PgBouncer and how, by default, PostgreSQL allows 100 connections. You can start bumping that up to 200 connections, 300, 400, but once you get up to 400 or 500, the performance impact or the memory utilized for those connections starts being really onerous, and the solution is to use a connection pooler. You could use pgpool or PgBouncer; I've had more experience with PgBouncer, and it seems that's mostly been the go-to one that I'm aware of. So this basically goes over PgBouncer and how you could have it accept 10,000 connections but then combine that down into maybe 100 or so Postgres connections, or maybe ten; it depends on how fast your queries are. For this purpose you're generally doing transaction pooling. So on the application side, maybe you have 10,000 threads in use, each sending queries at random times, and PgBouncer can take those and funnel them into, say, ten to 100 Postgres connections so that you don't have to have as many connections on the PostgreSQL instance, with PgBouncer handling the creation and teardown of those connections. As for the $35 a month, it's not mentioned in the post, but I'm assuming they're projecting that as the cost of one PgBouncer server.
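
As a rough idea of what this kind of transaction pooling looks like in PgBouncer's configuration file, here is a minimal sketch; the host, database name, and pool sizes below are illustrative and not taken from the post.

```ini
; Minimal pgbouncer.ini sketch (hypothetical host, database, and sizes).
[databases]
; Clients connect to "appdb" on PgBouncer; it forwards to the real server.
appdb = host=10.0.0.5 port=5432 dbname=appdb

[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
; Return the server connection to the pool after each transaction.
pool_mode = transaction
; Up to 10,000 client connections accepted by PgBouncer...
max_client_conn = 10000
; ...funneled into a small number of actual PostgreSQL connections.
default_pool_size = 20
```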

[00:14:17] Although in the post they talk about a scenario where they are using 40 PgBouncer instances, so that definitely wouldn't be $35 a month. But they're proposing that if your application is sending fast, efficient queries, and generally they can't be too long or this won't work that well, then each PgBouncer can be configured to allow 10,000 incoming connections while only making ten connections upstream. But again, a lot of that depends on how fast your queries are. This is an interesting blog post that goes into a lot of detail and thought about how best to configure PgBouncer and how it can help you with scaling your database. So definitely a blog post you can check out. And this is from the Future Tech Industries blog on Medium.

[00:15:07] The last two posts are from the Severalnines.com blog, both related to performance. The first is A Performance Cheat Sheet for PostgreSQL, and it essentially goes over some of the things you can look at if you are looking to improve your performance. One is looking through queries using EXPLAIN, and then, when you want more than just the projected cost and what the planner wants to do, you can use EXPLAIN ANALYZE to actually run the query and see what it really does. It also goes into some vacuum areas and some considerations with regard to that. It covers different configuration parameters you can set, talks about some of the logging you would probably want to enable, and also discusses hardware considerations. So if you are a little bit new to PostgreSQL performance, definitely a blog post to check out.
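
The basic distinction mentioned there, using a hypothetical orders table:

```sql
-- EXPLAIN shows the planner's estimated plan and costs without running the query.
EXPLAIN SELECT * FROM orders WHERE customer_id = 42;

-- EXPLAIN ANALYZE actually executes the query and reports real timings and row counts.
EXPLAIN ANALYZE SELECT * FROM orders WHERE customer_id = 42;
```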

[00:16:04] The second post, again from the Severalnines.com blog, is Tuning Input/Output (I/O) Operations for PostgreSQL, and this is all about I/O as it relates to performance. Again, this is kind of a more beginning-to-intermediate-level blog post, but definitely something good to check over, just to compare what you are doing with your PostgreSQL instance against some of the guidance and general advice in here. They talk about indexes, partitioning, the checkpointing process, VACUUM and VACUUM ANALYZE, the effects of fill factor, different vacuuming tips, as well as I/O problems related to disk sorting, and even the database file system layout, in other words, how many disks you are using and how they are partitioned, and then considerations for running PostgreSQL in the cloud. So, definitely a blog post to check out, just to compare their suggestions to how you are currently running your PostgreSQL system. And maybe there are some tips you can find to help optimize yours.
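
To make a couple of those topics concrete, here is a small example of fill factor and checkpoint-related settings; the values are illustrative only, not recommendations from the post, and the orders table is hypothetical.

```sql
-- A lower fillfactor leaves free space in each page, which can help
-- update-heavy tables.
ALTER TABLE orders SET (fillfactor = 80);

-- Spreading checkpoints out reduces I/O spikes; these can go in
-- postgresql.conf or be set with ALTER SYSTEM and a reload.
ALTER SYSTEM SET checkpoint_timeout = '15min';
ALTER SYSTEM SET checkpoint_completion_target = 0.9;
ALTER SYSTEM SET max_wal_size = '4GB';
SELECT pg_reload_conf();
```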

[00:17:08] That does it for this episode of Scaling Postgres. You can get links to all the content presented in the show notes. Be sure to head over to Scalingpostgres.com, where you can sign up to receive weekly notifications of each episode, or you can subscribe via YouTube or iTunes. Thanks.
