
auto_explain Overhead, Postgres Data Lake, Citus Shard Rebalance, SSL Authentication | Scaling Postgres 157


In this episode of Scaling Postgres, we discuss auto_explain's overhead, setting up a Postgres data lake, rebalancing a Citus sharded database, and implementing SSL authentication.


Transcript



[00:00:00] Hello. In this episode of Scaling Postgres, we talk about auto_explain overhead, a Postgres data lake, Citus shard rebalancing, and SSL authentication. I'm Creston Jameson, and this is Scaling Postgres, episode 157.

[00:00:24] Alright, I hope you, your friends, family and coworkers continue to do well. Our first piece of content is "Can auto_explain (with timing) have low overhead?" This is from pgMustard.com, and they're talking about the auto_explain extension. What it does, when it detects a slow query, is automatically write an explain plan to the log so that you can check it out and see the plan for that slow query. But as he mentions here, quote, "many places warn about its overhead, but I found concrete information difficult to come by." He's heard anecdotal numbers ranging from around 3% to about a factor of two, so a pretty wide range, but he wanted to look at this in more depth.

One thing to keep in mind when doing an explain analyze is that the system clock can add overhead, depending on the system you're using. He has a link to an article from Ongres that we actually covered in a previous episode of Scaling Postgres that talks about this, and there's a tool included with Postgres, pg_test_timing, so you can check how slow the system clock is on the system you're using. For the particular machine he used here, it looks like he's doing it on a Mac with about four gigs of RAM and two cores, so a relatively small system. But running pg_test_timing, it looked like over 90% of the calls were less than a microsecond, so that's pretty good.

The first thing he wanted to do was take a baseline. He used pgbench to get a baseline of latencies and came up with about 1.5 milliseconds average latency. Then he configured auto_explain, adding it to shared_preload_libraries, and set it to take a full sample of all statements, so basically every statement gets an explain. When he ran pgbench again, he saw about 26% higher latency, so that's not insignificant.

Of course, the thing you really want to do is only log the slowest queries. So here he's still taking a full sample, but only looking at queries that run longer than ten milliseconds. When he runs pgbench again, he only sees a 0.8% difference in terms of being slower, so that's very negligible. But it hasn't really started doing the analyze yet. In the next step, doing the analyze without the timing of each step, he turned log_analyze on, and running pgbench again it's only 0.3% slower than the baseline. So the analyze is not adding a lot of overhead as long as you're only doing it for these slow queries. Then he added the timing in, to track the time of each step, by setting log_timing to true, ran pgbench again, and this time it's 2.2% slower. So that's definitely great information to have: if you're using auto_explain, you want to focus on only explaining the slowest queries, and then even if you have timing on, it's not going to be a significant hit to your database performance.

He then added a lot more parameters to the logging, in terms of triggers and settings and buffers, et cetera, as well as the JSON format. He ran it again, and it was actually a little bit faster than the 2.2% of the previous run that just turned on timing. He can't really explain this, but it may have to do with variation, depending on how many times he ran these tests and what the standard deviation is. But overall it looks like, as long as you're not trying to track every statement, the hit to the database is around the 2% mark, at least for the load he was putting on the database in this example. Your experience may be different based on your load and your system, but it's definitely interesting information, and if you want to find out more, I encourage you to check out this blog post.
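
To make that concrete, here is a minimal postgresql.conf sketch along the lines of what the post describes. The parameter names are the standard auto_explain settings; the threshold and the extras mirror the post's setup, but treat the exact values as illustrative rather than the author's exact configuration:

    # requires a restart; loads the auto_explain module
    shared_preload_libraries = 'auto_explain'

    auto_explain.sample_rate = 1.0          # consider every statement (1.0 = full sample)
    auto_explain.log_min_duration = '10ms'  # only log plans for queries slower than 10 ms
    auto_explain.log_analyze = on           # include actual row counts and run times
    auto_explain.log_timing = on            # per-node timings; the roughly 2% case above
    auto_explain.log_buffers = on           # extra detail added at the end of the post
    auto_explain.log_triggers = on
    auto_explain.log_settings = on
    auto_explain.log_format = json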

[00:04:19] The next piece of content is "Build your own data lake for reporting purposes in a multi-services environment". This is from Fretlink.com, and they're talking about a data lake that they set up essentially on Postgres. They start off describing some of the reporting issues they were encountering, in terms of the demands for different types of data and trying to satisfy those needs from different customers, internal users, et cetera, as well as data quality issues from producing spreadsheets and passing them around. So it's a good read to get an overview of what their environment was like.

The way they approached resolving these situations starts off essentially in this graph here. They broke up their monolithic application into three separate services: a carrier service, a pricing service, and an expedition service. The main tables of the application are shown in green, and they're all part of the public schema. What they did is create a separate schema called reporting, and then they created views within those reporting schemas. So basically, the developers who manage each service defined an interface, essentially views, over what data can be pulled out for reporting purposes, along with the definition of the columns in each view. They did this for each of their services.

The next step is to use streaming replication to stream that data into a separate reporting server. My assumption is that this is one cluster that just has multiple databases reflecting what is coming from each of the application services, so essentially a one-to-one copy. Then, with that data in this separate instance, they created a separate reporting database, and that reporting database has a schema for each service. The reporting database has a carriers schema that uses a foreign data wrapper to access the carrier service database and essentially reads its reporting views; the pricing schema in the reporting database accesses the pricing service database and reads the reporting views there to get the information needed; and the same goes for the other services. This essentially allows them to query this one reporting database and run queries across the different databases by using these schemas.

Now, one advantage of this is that pretty much all of these queries run against essentially real-time data. They didn't mention using materialized views or any kind of summary tables or things like that. Those would make the data not real time, but they would let you optimize access to it. So this structure may work at a certain data level, but once you're hitting billions or tens of billions of rows, you may need to implement something slightly different, something less real time, to handle performance for reporting purposes. They go ahead and include the different scripts they used to set this up, so you can check that out.

They also took it another step and integrated other external data into the system. In this example, they're showing how they have information located in Google Sheets or some other external service. Basically, they set up a way to synchronize it with a separate database within the reporting server, and then they follow the same process where the reporting database has a dedicated schema for that data and reads it through a foreign data wrapper, and the same thing for any other external data they access. They then follow up the post describing how they've automated all of this to run and process data for them, and how they use Metabase to give access to this data in a visual and graphical format. So this is a very interesting post about a process they went through to essentially put data in the hands of their end users and allow them to query it. If you are interested in that, you can check out this post from Fretlink.com.
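
For reference, here is a minimal sketch of the reporting-view-plus-foreign-data-wrapper pattern described above. The service, schema, column, role and password names are hypothetical placeholders; the post's own scripts will differ:

    -- On each service database: expose only what reporting is allowed to read
    CREATE SCHEMA reporting;
    CREATE VIEW reporting.carriers AS
        SELECT id, name, country, created_at   -- hypothetical column list
        FROM public.carriers;

    -- On the central reporting database: one schema per service, wired up with postgres_fdw
    CREATE EXTENSION IF NOT EXISTS postgres_fdw;

    CREATE SERVER carrier_service
        FOREIGN DATA WRAPPER postgres_fdw
        OPTIONS (host 'localhost', dbname 'carrier_service', port '5432');

    CREATE USER MAPPING FOR reporting_user SERVER carrier_service
        OPTIONS (user 'reporting_user', password 'secret');

    CREATE SCHEMA carriers;
    IMPORT FOREIGN SCHEMA reporting FROM SERVER carrier_service INTO carriers;

    -- Cross-service reporting queries can now join carriers.*, pricing.*, etc.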

[00:08:19] The next piece of content is "Scaling out Postgres with the Citus open source shard rebalancer". This is from Citusdata.com, and it's talking about the open source release of Citus 10 and how it now supports rebalancing the shards you have across a scaled-out Citus cluster. They give an example of why this is important. Maybe you start your Citus cluster with two nodes, with three shards on each. Well, if you add a node C, you need some way to move shards to node C, because otherwise node C is not being used. That's where the rebalancer comes in: it can take shards from the other nodes and move them to C, essentially rebalancing across the cluster no matter how many nodes you have.

They also mention a couple of different rebalancing strategies. Number one is by shard count, which is essentially what this is: node A had three, node B had three, so you move one from A and one from B to node C, and now they each have an equal number of shards. The other way to do it is by disk size. This can become important if you're sharding by, say, customer, and you have large customers and small customers; you could rebalance by data size so that you're balancing the data size between your nodes, like here, where the large customer three moves to node A and the small customer two moves to node B. They also have examples of other rebalancing strategies you can use, or you can develop your own, and they have some information about that here.

They also talk about a way to actually shrink your cluster. So in addition to scaling out, you can also scale down if you don't need all of the nodes you've scaled up to, and there's a Citus drain node command to do that. Essentially it rebalances the shards, collapsing them onto fewer nodes. Now, one downside they did mention is that during a rebalance the data being moved is readable but not writable. So you need to take into account that while a shard is being rebalanced, you can't actually write to that data. They did say that on the Azure version of Citus there is a way around that, but not for the open source version at this time. But if you want to learn more about this, definitely check out this post from Citusdata.com.
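
As a rough sketch, the workflow described here boils down to a few SQL calls on the coordinator. The node name and port are placeholders, and the function names are as exposed in Citus 10; check the docs for your version:

    -- add the new worker node C to the cluster
    SELECT citus_add_node('node-c', 5432);

    -- move shards onto it; the default strategy balances by shard count
    SELECT rebalance_table_shards();

    -- or balance on data size instead
    SELECT rebalance_table_shards(rebalance_strategy := 'by_disk_size');

    -- to shrink the cluster, drain a worker's shards onto the remaining nodes
    SELECT citus_drain_node('node-c', 5432);
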
The next piece of content is "Setting up SSL authentication for PostgreSQL". This is from Cybertec-postgresql.com, and they're talking about setting up SSL on your Postgres server. In the postgresql.conf file there are a number of settings you can set, and the main one is turning ssl on, as long as you have a certificate file and a key file; there are a number of other SSL parameters you can adjust as well. Now, you can do a reload of the configuration at this point, but it won't actually take effect yet; you need to do a restart to get SSL up and working. The next thing you need to change on the server side is the pg_hba.conf file, because you need to use a connection type of hostssl for it to enforce SSL connections, so you move from just a host type to a hostssl type.

The next part of the post talks about getting that certificate file and certificate key. You can of course purchase those, but this runs through the process of creating your own self-signed certificate and how to set that up. The next thing they mention is that, for clients connecting to a Postgres server with SSL enabled, there are numerous ways to connect, and the client can define SSL modes all the way from disable to verify-full; of course, the server also has to support SSL for these to work. So at the point of connection you can say sslmode=disable to not use any SSL, or, what I would say is one of the more common options, require, so that it requires SSL when it connects, or maybe prefer if your server still supports non-SSL connections. But if you want the full protection, having a certificate on the client and doing verify-full gives you, as they say here, the strongest protection possible, because you are validating the actual certificate the server is using and the client trusts the signer of that certificate. So if you want to learn more about setting up SSL on Postgres, definitely check out this blog post.
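
Putting the pieces together, the server and client side changes look roughly like this. The file names, network range, authentication method and hostnames are assumptions for the sketch, not values from the post:

    # postgresql.conf (requires a restart to enable SSL)
    ssl = on
    ssl_cert_file = 'server.crt'
    ssl_key_file = 'server.key'

    # pg_hba.conf: this rule only accepts SSL connections
    # TYPE     DATABASE  USER  ADDRESS       METHOD
    hostssl    all       all   0.0.0.0/0     scram-sha-256

    # client side: require SSL, or verify-full with a trusted root certificate
    psql "host=db.example.com dbname=appdb user=app sslmode=require"
    psql "host=db.example.com dbname=appdb user=app sslmode=verify-full sslrootcert=root.crt"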

[00:12:49] The next piece of content is "Enhancing PostgreSQL 13 Security with the CIS Benchmark". This is from Crunchydata.com, and it covers a new version of the CIS PostgreSQL Benchmark that's been published. CIS stands for the Center for Internet Security, which is a nonprofit organization. Crunchy Data has helped develop this security benchmark for Postgres, and it covers things like installation and patches, directory and file permissions, logging, monitoring and auditing, user access and authorization, connection and login, PostgreSQL settings, replication, and special configuration considerations.

Now, this is one standard. Another post this week, from Enterprisedb.com, is "How the EDB Postgres STIGs can be used to secure your PostgreSQL database". These are another set of standards: a STIG is a Security Technical Implementation Guide, based on the Defense Information Systems Agency, so this is the DoD, the government organization's, set of standards. They have developed a Security Technical Implementation Guide for Postgres, and this post talks about all of that and using it to secure your Postgres systems. So either of these two posts is a great one to look through to help you follow a standardized process to secure your Postgres installations.

[00:14:15] The next piece of content is "Monitoring PostgreSQL with Nagios and Checkmk". This is from Highgo.ca, and it's talking about setting up Nagios for monitoring along with Checkmk, although I don't know if that's how that's pronounced. They go through all the commands you would need to run to set this up on Red Hat Enterprise Linux: installing Postgres, then setting up Nagios with the command files to start monitoring Postgres, as well as adding Checkmk to give a more, as they say, user-friendly view of monitoring your Postgres installation. Apparently this uses check_postgres under the covers, which basically runs queries using psql against the database to check a number of different parameters. So it gives you more than just node information; it also gives relevant database statistics. If you're interested in that, check out this blog post.

[00:15:10] The next piece of content is "Musings of a PostgreSQL Data Pontiff, Episode One". This is from Crunchydata.com, and it introduces a data science blog series they're going to be doing; it mostly gives some background on the author. A second post was also made, "Musings of a PostgreSQL Data Pontiff, Episode Two", on "hot magic" analytics and graphical output with PL/R. Basically, they're using Postgres to do data analysis, using functions within Postgres to do that analysis and then using it in conjunction with the procedural language R to generate different plots, such as this one. So if you are interested in learning more about data science with Postgres, definitely check out these two posts from Crunchydata.com.

[00:16:02] The next piece of content is "Traveling salesman problem with PostGIS and pgRouting". This is from Cybertec-postgresql.com. They're talking about the classic traveling salesman problem, where a salesman has to hit a number of points on a sales route and wants the most efficient route to do that. This post discusses using Postgres and the pgRouting extension to accomplish that and to find the shortest route through multiple sets of points. So if you're interested in that type of geo analysis, definitely check out this post.
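
To give a flavor of what that can look like in SQL, here is a minimal sketch using pgRouting's cost-matrix and TSP functions. The ways table and the vertex ids of the stops are hypothetical, this is not the post's own query, and the exact function signatures depend on your pgRouting version:

    -- build a pairwise cost matrix between the stops, then order them into a tour
    SELECT seq, node, cost, agg_cost
    FROM pgr_TSP(
        $$SELECT * FROM pgr_dijkstraCostMatrix(
              'SELECT id, source, target, cost, reverse_cost FROM ways',
              ARRAY[101, 205, 333, 421],   -- vertex ids of the points to visit
              directed := false)$$,
        start_id := 101);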

The next piece of content is "Introducing pg_query 2.0: the easiest way to parse Postgres queries". This is from Pganalyze.com, and they've released a new version of pg_query for parsing Postgres queries. It actually uses the Postgres parser itself, because it's built as a C library incorporating Postgres code to parse the queries, and they use it in their own tool pgAnalyze. So they've open sourced the query parsing part. They have a number of libraries in other languages that can use it, such as Ruby and Go, and others have written wrappers that use the C library to do this query parsing. So if you're interested in that, definitely check out this blog post.

[00:17:20] The next piece of content is "Cloud Native PostgreSQL for application developers". This is from Enterprisedb.com, and Cloud Native PostgreSQL is the name they've given to their Kubernetes operator at EDB. This post talks about using this Kubernetes operator as an application developer to manage your Postgres instances, if you're doing testing, I assume, with multiple different ones. So if you're interested in that, you can check out this blog post.

The next piece of content is "Announcing the Crunchy Postgres Operator 4.6 with rolling updates, pod tolerations, node affinity, and more". This is from Crunchydata.com, so if you want to learn more about the new version of their Postgres operator, definitely check out this blog post.

[00:18:03] And the last piece of content: the PostgreSQL Person of the Week is Anastasia Lubennikova. So if you're interested in learning more about Anastasia and her contributions to Postgres, definitely check out this blog post.

That does it for this episode of Scaling Postgres. You can get links to all the content mentioned in the show notes. Be sure to head over to Scalingpostgres.com, where you can sign up to receive weekly notifications of each episode, or you can subscribe via YouTube or iTunes. Thanks.
