
DBMS of the Year, Better Data Migrations, Idle Connection Impact, Security Implementation Guide | Scaling Postgres 147


In this episode of Scaling Postgres, we discuss PostgreSQL as the DBMS of the year, running better data migrations, the impact of idle connections and a security implementation guide.

Content Discussed

YouTube Video

Podcast Audio

Transcript



[00:00:00] In this episode of Scaling Postgres, we talk about DBMS of the Year, better data migrations, idle connection impact and a security implementation guide. I'm Creston Jamison, and this is Scaling Postgres episode 147. Alright, I hope you, your family, friends and coworkers continue to do well. Our first piece of content is "PostgreSQL is the DBMS of the Year 2020". This is from DB-Engines.com, and they have ranked it based upon its gain in popularity over the last year. They define popularity according to their methodology, which is basically about how much people are communicating about it: for example, job offers, professional profile entries and citations on the web. This is actually the third time that PostgreSQL has been named DBMS of the Year, so that's another great achievement for Postgres. If you're interested in learning more, you can check out this site.

[00:01:08] The next piece of content is "Stop worrying about PostgreSQL locks in your Rails migrations". Now, this is talking about Rails migrations and it does cover some Ruby and Rails code, but the information in this post is valid for anyone using Postgres. Even if you use Django or Python or some other language, you can find important things to do and be wary of if you are working with a large database. They have a database that handles, they say, up to 15,000 requests per second, and they have, I believe, over 100 full-stack developers working on it, so it seems to be a pretty large application. And they have refined their procedures to do migrations, which are essentially DDL changes to their database, safely. The first thing they say you want to be aware of is adding an index. They are using a Rails migration, so this is Ruby code, but it basically just adds an index, and of course that is a locking operation until the index completes. What you want to do is create that index concurrently. You can do that using the Ruby syntax by specifying the concurrent algorithm when you add it, so you definitely want to do that. However, the other thing they apparently have set for their application is a statement timeout for the user running these migrations. So to avoid that index operation being timed out, as part of their migration they first store the current statement timeout in a variable, then they set the statement timeout to zero, so basically no statement timeout for the session they're in. They add the index and then they reapply the old value of the statement timeout. One thing they also have to do is disable the DDL transaction wrapper that Rails provides, in order to run this concurrent index migration. So, a few steps, but the main one here that differs from just creating your indexes concurrently is altering the statement timeout so that the index build will run; that's something they have to do in their environment. The next area they talk about is adding a foreign key, and with the Ruby DSL you can just do add_foreign_key with the table and then the reference. Now again, this would prevent some writing on both tables. So one way they could do it is set a statement timeout to limit that locking: here they set the statement timeout to 5 seconds, try adding the foreign key, and then, if that's successful, revert the statement timeout to what it was before.
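In plain SQL terms, outside of the Rails DSL, the index pattern they describe looks roughly like the following sketch. The table and index names here are hypothetical and the timeout handling is simplified.

    -- Run this outside a transaction block, since CREATE INDEX CONCURRENTLY
    -- cannot run inside one (which is why the DDL transaction is disabled).
    SHOW statement_timeout;        -- note the current value so it can be restored
    SET statement_timeout = 0;     -- no timeout while the index builds
    CREATE INDEX CONCURRENTLY index_users_on_email ON users (email);
    RESET statement_timeout;       -- or SET it back to the saved value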

[00:03:45] However, the better way to do it, which they mention here, is to add the constraint without validation. So basically any rows added going forward are validated, but not preexisting rows. Then, as a second step that is not going to block operations, you validate that constraint. Here's the process they go through: they save the old value of the statement timeout, they set the statement timeout to 5 seconds to keep it relatively short, and they add the foreign key with validation set to false. That operation should run relatively quickly, again in less than those 5 seconds. Once it's successful, you can then set the statement timeout to zero, or turn it off, and run the command that validates the foreign key for this table, so it goes through and checks all the values to ensure the constraint is valid. Then, as a final step, you revert the statement timeout to what it was before. So it's a little complex, but you can see the logic in doing it: first you create the foreign key in an invalid state, and then you validate it later. The next area they're talking about is adding a column. You can add a column like this, and normally that happens very quickly, but where you can run into problems is with a lock queue, because it still needs to get an access exclusive lock on that table. If there are other locks going on, what can happen is this gets put in a lock queue and then things behind it have to wait. They give a graphical representation of this, where basically operations that require a lock have to wait until this add column query completes before they can start processing, so everything starts getting backed up in the queue. The way to avoid this is having a lock timeout. So basically you say: if it times out due to lock waiting, go ahead and cancel that operation and don't move forward with adding the column. Of course, if that happens, you need a process to retry it to make sure it's eventually successful, and they don't show that in this particular example; they just set the lock timeout to a short value and then revert it. They also mention there's a way to add a reference, which is basically a combination of the things that were shown: adding a new column, adding an index and adding the foreign key. So there are multiple ways of doing it that can cause locking, and basically the way to get around this issue, instead of using this very simple DSL, is to do one step to add the column, one to add the index and then add your foreign key. Now, it would be great if Rails offered a way to do these very safe steps for updating the database; unfortunately, it doesn't. There may be other ORMs that offer a more sophisticated way to do it in this safe manner, and there have been a number of libraries created for Ruby for doing safer migrations. Well, they've actually written another library, or gem, called safe-pg-migrations that takes that standard, simple DSL syntax and does all of these things behind the scenes. So this post is definitely beneficial for those who are using Rails or Ruby to create safer migrations. However, even if you're not, following these steps and processes and thinking them through can help lead you to create better DDL statements for migrating your database as you're working with it. If you're interested in that, you can check out this post on Medium.
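Here is a minimal SQL sketch of the two safer patterns described above, assuming hypothetical orders and users tables; the exact timeout values are only illustrative.

    -- Foreign key in two steps: add it NOT VALID (brief lock, existing rows
    -- not checked), then validate separately while holding only a weaker lock.
    SET statement_timeout = '5s';
    ALTER TABLE orders ADD CONSTRAINT orders_user_id_fkey
        FOREIGN KEY (user_id) REFERENCES users (id) NOT VALID;
    SET statement_timeout = 0;
    ALTER TABLE orders VALIDATE CONSTRAINT orders_user_id_fkey;
    RESET statement_timeout;

    -- Adding a column: a lock_timeout makes the ALTER give up rather than
    -- sit in the lock queue blocking everything behind it (retry on failure).
    SET lock_timeout = '2s';
    ALTER TABLE orders ADD COLUMN note text;
    RESET lock_timeout;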

[00:07:15] The next piece of content is "Resources consumed by idle PostgreSQL connections". This is from the AWS blog, and they're talking about connections that really aren't doing anything and the resources that they use. Now, this is the first part of a post, and the second one goes into some of the performance impact, but this one takes a very test-driven, empirical approach to measuring the impact of these idle connections. First they do a test of connections with no activity: basically they open 100 connections and just leave those idle connections open for ten minutes and see what the impact is. They also try doing a DISCARD ALL statement to try to free up any resources, to see if that releases anything, and unfortunately it really doesn't in any of the tests shown here. You can see the number of database connections going from zero to 100 as tested, or maybe around 102, and you can see the freeable memory drops from about 5.27 to 5.12 GB, so that's about 150 megabytes. So each connection is using about 1.5 megabytes. In the next test, they actually did some work with temporary tables, creating one and dropping one, and then essentially left those connections idle for ten minutes to see what the impact was. Here you can see the number of connections going up and down, and then the second phase with the DISCARD ALL. With this one you can see that the memory dropped from 5.26 GB to 4.22 GB. So it definitely uses more memory, and those idle connections basically don't give anything back until they have been shut off; the idle connections are still using memory resources. Next, he took a look at some different select queries, running them and then waiting ten minutes in that idle state to see what happens. Again, you can see that the memory does drop and is only released back once those connections are dropped; for the ten minutes that they are idle, they're still using these resources and they are not released back to the operating system. Then he combines both the temporary table work and the select queries, and you see the same impact where both activities were done: it drops the memory down, and it is only released once those connections are dropped. A DISCARD statement still does not free things up until the point that those connections are stopped. And he says you even get a CPU impact from these idle connections. Here's an example where he compared CPU utilization across different test runs against how many connections there were. You can see it goes up to maybe a percent for 100 connections; at 500 connections it gets up to maybe 3% utilization with really nothing going on in terms of those connections; at 1,000 connections it gets up to about 5%; about 6 or 7% at 1,500 connections; and then at 2,000 connections it gets to around 8% utilization. So even connections just sitting around doing nothing still use CPU resources on the system. Now, the next post talks about the performance impact of this. He looks at measuring transactions, so he tested with 100 clients and then a test with 1,000 idle connections added. Just using a standard pgbench test, you could see that adding 1,000 idle clients drops the transaction throughput from 1249 to 1140 TPS, which is about an 8.7% drop. He then said you'll probably see more of a performance impact with select-only workloads.
So he did a select-only pgbench, again with a 100 client run, and then added 1,000 idle connections on top of those 100 clients, and here he saw an 18.2% drop in TPS performance. So that starts becoming really significant. Then he looked at a custom pgbench workload with very large queries, again with 100 clients, and then the same 100 clients but adding 1,000 idle connections to the database at the same time. When testing that, he saw a 46% performance drop in TPS; the transactions per second went from 378 to 206. So that's a huge performance drop from adding 1,000 idle connections. Then he gets down to the main point, which is that the way you get around this is by using a connection pooler. He tested using PgBouncer in this case to see what impact it had. He configured PgBouncer to allow 5,000 client connections and a maximum of 200 server connections, so 200 connections to the database, but up to 5,000 application clients could connect. Then he did a pgbench run with 100 clients, so you can see 100 clients and 100 server connections, to get a baseline. Then he opened up 1,000 connections and left them idle. Now, just adding those thousand idle connections doesn't really use any server connections, and when he ran that 100 client pgbench again, he got the exact same performance. So with PgBouncer, adding 1,000 idle connections really didn't have an impact on performance; it's the exact same transactions per second. For example, when he was doing that 100 client pgbench run with 1,000 idle connections, you could see the thousand idle connections under the client active column, and 100 of them are the ones running pgbench, using up the 100 active server connections. And again, that performance was identical whether you had 1,000 idle connections or not. So using a connection pooler definitely helps with your performance when you have a lot of idle connections. Then he tried an interesting thing: the system he was using only had two virtual CPUs, and running 100 processes in parallel could result in a lot of context switching. So he tried dropping the number of server-side connections down to 20 and ran his benchmark again, and this time he actually got an increase in throughput, to 426 transactions per second compared to 377. So this is an interesting case where you drop the number of server connections but you increase the level of performance, because you're getting less context switching. With two virtual CPUs trying to handle 100 connections at once, versus only having to handle 20 connections, they become much more efficient and can give you better transaction throughput. So something to keep in mind that he mentions here is that sometimes you think just adding connections will give you more performance; not necessarily. They close out the post talking about a tool available in AWS, which is RDS Proxy, essentially a tool like pgpool that you can work with. So definitely interesting insights and a lot of tests with regard to connections. If you want to learn more about this, I highly encourage you to check out these two blog posts.
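As a side note, if you want to see how many idle connections your own database is holding open, a quick look at pg_stat_activity will show them. This is a generic query, not one taken from the posts:

    -- Count client backends by state; 'idle' rows are connections that are
    -- holding memory and process slots while doing no work.
    SELECT state, count(*)
    FROM pg_stat_activity
    WHERE backend_type = 'client backend'
    GROUP BY state
    ORDER BY count(*) DESC;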

[00:14:14] The next piece of content is announcing the Crunchy Data PostgreSQL Security Technical Implementation Guide. Now, this was first released in 2017 in conjunction with the United States Defense Information Systems Agency, and they've come out with a new version as of, I think, November of 2020. So here's the link to the Security Technical Implementation Guide that you can use to help secure your PostgreSQL installations. If you want to learn more about this and get a link to the guide, go ahead and check out this post from Crunchydata.com.

[00:14:47] The next piece of content, also security related, is "PostgreSQL database security: what you need to know". Now, this is a very short post, but it says it's going to be the start of a larger series, so this is kind of the 50,000-foot view. They've basically broken security down into six areas, and more specifically what Postgres deals with in regard to authentication, identifying who is allowed to access the database; authorization, what they are allowed to see; and accounting, logging who has accessed what data, et cetera. They then break out the different features available within Postgres to handle that authentication, authorization and accounting. So if you're interested in this post, go ahead and check it out from Percona.com.

The next piece of content is "How to limit rows to at most N per category - fix". This is in reference to an earlier post on Depesz.com, where he limited the number of addresses allowed per user, and someone had noted that with a race condition, or multiple people adding addresses at once, you could actually overrun his previous implementation, because that implementation just did a query to find out how many addresses were there, and if there were already three, it would raise an exception. But if you have transactions going on, or things happening in a concurrent fashion, you could easily overrun this and end up with five or six or whatever. What you do to avoid that is introduce locking. The first implementation he shows uses a FOR UPDATE lock, which essentially locks that individual user row, but that locks it for everything going on, so you may not necessarily want to do that, because again, like we mentioned earlier, you can get stuck in a lock queue with things blocking behind it. The implementation he ultimately decided on uses an advisory lock that applies only to this specific purpose. So it's kind of considered an application-level lock, even though it is at the database level, and it's used a lot in application development. So here you have a specific type of lock for this check-address feature, applied with reference to the user ID that's being used. Then it does the count, and if the count is three or more, it doesn't do the insert and just raises the exception listed here. So if you're interested in that, you can check out this blog post.
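As a rough illustration of the advisory-lock approach, here is a generic sketch, not the exact code from the post; the addresses table, the function name and the limit of three are all hypothetical:

    -- Take a transaction-scoped advisory lock keyed on this user, so two
    -- concurrent inserts for the same user serialize on the count check.
    CREATE OR REPLACE FUNCTION check_address_limit() RETURNS trigger AS $$
    BEGIN
        PERFORM pg_advisory_xact_lock(hashtext('addresses'), NEW.user_id);
        IF (SELECT count(*) FROM addresses WHERE user_id = NEW.user_id) >= 3 THEN
            RAISE EXCEPTION 'user % already has 3 addresses', NEW.user_id;
        END IF;
        RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;

    -- Use EXECUTE PROCEDURE instead on PostgreSQL 10 and older.
    CREATE TRIGGER addresses_limit_check
        BEFORE INSERT ON addresses
        FOR EACH ROW EXECUTE FUNCTION check_address_limit();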

[00:17:22] The next piece of content is "Single-user mode and the -P flag". This is from fluca1978.github.io, and it covers the case where you have damaged system indexes in Postgres. There is a way to rebuild them, but the problem is that you can have trouble connecting to the database if there is this type of corruption. The solution is to start the cluster in single-user mode, and he has a link on how to do that here. Then you can start a standalone backend process ignoring the system indexes, and run the command REINDEX SYSTEM against the database that is having issues. Then you can restart the cluster to get it back. So if you run into this issue, it's definitely a blog post to consider checking out.
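For reference, the sequence looks roughly like this; the data directory path and database name are placeholders, so check the post and the PostgreSQL documentation before trying it on a real cluster:

    -- Stop the cluster, then start a standalone backend that ignores
    -- system indexes (the -P flag). From the shell, something like:
    --   postgres --single -P -D /path/to/data mydb
    -- At the single-user backend prompt, rebuild the system catalog indexes:
    REINDEX SYSTEM mydb;
    -- Exit the backend (Ctrl-D), then start the cluster normally again.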

[00:18:11] The next piece of content is "Using R in Postgres for logistic regression modeling". This is from Crunchydata.com, and it's another post in their series on doing data science with wildfires in California. They're at the point where they actually get to do the logistic regression. For that purpose, they're using an extension called PL/R, which makes R a procedural language within Postgres. So basically you can install this extension and it can run R. They built the model in R, and then they can create a function within Postgres to run that model and process the data that already exists in Postgres. So if you're interested in learning more about this, you can check out this post from Crunchydata.com.
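To give a very rough idea of what a PL/R function looks like, here is a generic sketch, not the model from the post; the function name is made up, and a real model would load a fitted glm() object and call predict() on the input values:

    -- Requires the PL/R extension to be installed on the server.
    CREATE EXTENSION IF NOT EXISTS plr;

    -- A trivial PL/R function whose body is R code; arguments arrive in R
    -- as arg1, arg2, and so on.
    CREATE OR REPLACE FUNCTION r_mean(float8[]) RETURNS float8 AS $$
        mean(arg1)
    $$ LANGUAGE plr;

    SELECT r_mean(ARRAY[1.0, 2.0, 3.0]);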

[00:19:00] The next piece of content is "Improved OpenStreetMap data structure in PostGIS". This is from Rustproof Labs. They're talking about some enhancements recently made to how OpenStreetMap data is loaded, allowing a more flexible output, and some improvements made to the general data structure. For example, some of the opinions or additions are: every table has a primary key; more columns get NOT NULL; prefer proper columns over JSONB, and prefer JSONB over hstore (in other words, as he says, goodbye hstore); units are cleaned up to match the OpenStreetMap default units; and conversion to other units should happen in generated columns, views and even materialized views. So if you're interested in learning more about these improved data structures, go ahead and check out this post.
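As a small example of that last idea, a unit conversion kept in a generated column might look something like this; it's a generic sketch, and the table and column names are made up rather than taken from the OpenStreetMap schema:

    -- Store the OpenStreetMap default unit (meters) and derive miles from it,
    -- so the conversion lives in the schema rather than in every query.
    CREATE TABLE roads (
        id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
        name text NOT NULL,
        length_m double precision NOT NULL,
        length_miles double precision
            GENERATED ALWAYS AS (length_m / 1609.344) STORED
    );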

[00:19:52] And the last piece of content: the PostgreSQL Person of the Week is Amit Langote. So if you're interested in learning more about Amit and his contributions to Postgres, definitely check out this blog post. That does it for this episode of Scaling Postgres. You can get links to all the content mentioned in the show notes. Be sure to head over to Scalingpostgres.com, where you can sign up to receive weekly notifications of each episode, or you can subscribe via YouTube or iTunes. Thanks.

