
Connection Queuing, Connection Pool Size, Simple Auditing, Fuzzy Matching | Scaling Postgres 154


In this episode of Scaling Postgres, we discuss connection queuing, estimating connection pool size, how to do simple auditing and how to run queries using fuzzy matching.


Transcript



[00:00:00] In this episode of Scaling Postgres, we talk about connection queuing, connection pool size, simple auditing, and fuzzy matching. I'm Creston Jameson and this is Scaling Postgres episode 154.

[00:00:22] I hope you, your friends, family and coworkers continue to do well. Our first piece of content is Connection queuing in PgBouncer: is it a magical remedy? This is from Percona.com, and it's talking about PgBouncer, and specifically that what it can do as a connection pooler is connection queuing, meaning that if you have a query that needs to run, but all of the server processes or server sessions are being used, it puts those connections in a queue to be run when a server connection is available. Now, of course, most people use PgBouncer so that they can get more connections without using a lot of memory and resources on the server. But this post talks about how the queuing component is equally beneficial. So if you are not using some sort of connection pooler, you could experience a number of problems. Problem number one: a spike in load can jam and halt the server. The reason is that there's a one-to-one mapping between active PostgreSQL sessions and the running processes on the box. So every new connection to Postgres spawns a new process. In addition, with parallel execution, you can actually get many more processes running than sessions that exist. So you can suddenly get a ton of different processes running on the box as you're adding more connections. And they simulated an example here where they had ten active connections on a two virtual CPU server, and the time it took to connect and run a command went up to 15 seconds with ten active sessions, compared to just having two active sessions on that two virtual CPU box. So that server is struggling to operate efficiently. And they say, quote, "we could see that as the number of active sessions approaches double the number of CPU cores, the performance penalty starts increasing heavily". So that's something to keep in mind as a problem that you can experience if you have a lot more connections than your CPU cores, for example. The second problem is "too many clients already" errors. So this is basically that you've run out of connections, you're trying to make a new connection, but it gives a fatal message: sorry, too many clients already. The third problem that you can experience is a big max_connections value and the overhead that it causes. There have been a number of posts that we've actually covered in Scaling Postgres, and they link to them here, about how just having a lot of max connections defined still consumes resources even though they're not being used. So you don't want to set your max_connections value in your PostgreSQL configuration too high all the time; you want it optimized at a particular level. Now, a solution to help you with all of this is using a connection pooler like PgBouncer. But what it can also do is assist with connection queuing, which is what this post is about. So for the example where you try to connect and there are too many clients and you've run out of connections, PgBouncer allows you to run more active client connections than exist on the server side. For example, here he's showing the pools of PgBouncer, and you can see that only one client is active, because they only have one active server connection, and four of the client connections are waiting. So this is the queuing that he's talking about: they're waiting for their opportunity to use this one existing server session. So that's what PgBouncer can do for you.
And if you actually bump up the number of server connections to two and you run those same five client connections, you can see that you get more active and running and fewer waiting. He then tried an example with 20 direct server connections, and you can see that the average load went up to about 17. So this is not using PgBouncer. And similarly, when trying to connect to that box using SSH, it took 17 seconds to connect. And I assume this is the two virtual CPU server? Yes. But when they implemented PgBouncer and set the pool size to four on the server side, the load average dropped to 1.73, so about ten times less load, and connecting took half a second via SSH. So it was dramatically better in terms of general performance. Now, then the question comes: does this actually affect your throughput, because you're having these client connections wait to do work? And he actually did a test of database throughput, and with the queuing it was actually 5% better. So it wasn't worse. This waiting time, allowing the server to run in an optimized state and just having PgBouncer pass off the new connections when the server is ready, actually resulted in better performance, not worse, even though you have some waiting going on. And then when he switched to transaction-level pooling, performance got even better. So it's just something to keep in mind that PgBouncer is not only for lowering your connections to save on memory in Postgres, but can actually give you more throughput by optimizing the number of sessions allowed to connect to the database versus how many CPU cores you have. So if you're interested in that, you can check out this post.
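
If you want to see this queuing on your own system, PgBouncer has an admin console you can connect to with psql (typically the special pgbouncer database on port 6432; your port and user will differ), and it accepts commands like these:

    SHOW POOLS;
    SHOW STATS;

In the SHOW POOLS output, cl_active is the number of client connections currently paired with a server connection, cl_waiting is the number of clients queued, and sv_active and sv_idle show how the server-side pool is being used. SHOW STATS gives cumulative query counts and average query and wait times per database.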

[00:05:29] The next piece of content: Estimating connection pool size with PostgreSQL database statistics. This is from Cybertec-postgresql.com, and he's talking about some new connection statistics that have been added to version 14 of Postgres, mainly to the pg_stat_database view. It covers things like the session time, which is the total time spent by sessions, the total active time of sessions, time spent idle in transaction, the total cumulative number of sessions, the number abandoned, the number fatal, and the number killed. So this is great information and more statistics to have. But where he actually finds this useful is in estimating how large you should make max_connections on the database server if you're doing transaction-level connection pooling. So exactly the problem the previous post was talking about: setting max_connections at an optimal value, taking into account CPU cores, and not setting it too large. Here he actually has a formula, where his recommendation is that the number of connections should be less than the maximum of the number of cores and your parallel I/O limit (how many queries can you run at the same time, given the I/O storage solution that you have), divided by the session busy ratio (how busy your sessions are) multiplied by the average parallelism you're doing. Some of these numbers are a little loose; I can imagine it being difficult to really nail the parallel I/O limit or the average parallelism. But he says you can actually calculate the session busy ratio using some of these new statistics with the query he gives, and it will give you that ratio to factor into this formula and produce an estimate of the optimum setting for your max_connections. So if you're interested in this, definitely check out this post.
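
This isn't the exact query from the post, but with the new PostgreSQL 14 columns in pg_stat_database you can sketch a session busy ratio something like the following. Whether idle-in-transaction time counts as "busy" is a judgment call; it's included here since those sessions still hold their connection.

    -- His rule of thumb, roughly:
    --   max_connections < max(number of cores, parallel I/O limit)
    --                     / (session_busy_ratio * average_parallelism)
    SELECT datname,
           (active_time + idle_in_transaction_time) / nullif(session_time, 0)
               AS session_busy_ratio
    FROM pg_stat_database
    WHERE datname IS NOT NULL;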

[00:07:21] The next piece of content: Simply auditing your database changes. This is from mydbanotebook.org, and this post is talking about a technique where you can do auditing of statements where you don't have to use triggers, you don't have to use a special extension, and you don't have to use third-party tools. You're basically just using the built-in functionality of Postgres to do this. I found this particularly interesting. So number one is you need to log the statements that you want to track. For example, they recommend setting the log destination to csvlog (and she said we'll see why later), turning on the logging collector, as well as setting which statements you want to log. So you could do DDL, or mod, or all, whatever you want to log; you need to set log_statement at that level. Then they ran a pgbench to get some data going into the logs, and then used a foreign data wrapper: so create the foreign data wrapper extension, create the server for the foreign data wrapper, and then create the foreign table, giving its definition and pointing it at the log file that's being produced. So essentially PostgreSQL is logging the statements to a file, and then you use the foreign data wrapper to read the log file that it's generating. And then for convenience, they add a number of views that let you look at the DDL or the DML. So they're basically going into the logs and pulling out particular data to be able to view it more conveniently. For example, when querying the DDL, you can look at a particular table in the database bench, in queries having to do with pgbench_tellers, and you can see all the operations that happened: there's a drop table, create table, truncate table, alter table, so all the DDL statements. Similarly with the DML, you can look at a particular database name and table and see all the different statements that are being run against that table. So that's a pretty interesting technique, being able to use the database itself to query and see what's happening in the logs of the database that you are logging. If you want to learn more about this, definitely check out this post.
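
The post's exact settings and views differ, but a minimal sketch of the idea looks something like this, assuming the built-in file_fdw foreign data wrapper and the csvlog column layout documented for PostgreSQL 13 (adjust the columns and the filename for your version and your actual log file name):

    -- postgresql.conf (then restart):
    --   log_destination   = 'csvlog'
    --   logging_collector = on
    --   log_statement     = 'mod'     -- or 'ddl' / 'all', depending on what you want to audit

    CREATE EXTENSION file_fdw;
    CREATE SERVER pglog FOREIGN DATA WRAPPER file_fdw;

    -- csvlog column layout as of PostgreSQL 13; newer versions add more columns.
    CREATE FOREIGN TABLE postgres_log (
      log_time timestamp(3) with time zone,
      user_name text,
      database_name text,
      process_id integer,
      connection_from text,
      session_id text,
      session_line_num bigint,
      command_tag text,
      session_start_time timestamp with time zone,
      virtual_transaction_id text,
      transaction_id bigint,
      error_severity text,
      sql_state_code text,
      message text,
      detail text,
      hint text,
      internal_query text,
      internal_query_pos integer,
      context text,
      query text,
      query_pos integer,
      location text,
      application_name text,
      backend_type text
    ) SERVER pglog
      OPTIONS (filename 'log/postgresql.csv', format 'csv');   -- point at your actual CSV log file

    -- A convenience view in the spirit of the ones in the post:
    -- statements logged by log_statement show up as 'statement: ...' messages.
    CREATE VIEW audit_statements AS
      SELECT log_time, user_name, database_name, message
      FROM postgres_log
      WHERE message LIKE 'statement: %';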

[00:09:30] The next piece of content is Fuzzy name matching in Postgres. This is from Crunchydata.com. To get started, they created a table with 50,000 names using the Fake Name Generator. They give an example of some of the data that's produced, and basically they want to do fuzzy matching against these names. The first example they came up with, to do a prefix search, is just using the LIKE operator, and they got results in about eleven milliseconds. They tried adding an index, but of course it still came back in about eleven milliseconds, because LIKE by default is not going to use that index. But if you index it using text_pattern_ops, then it will use the index, and I believe only for prefix matching. But there's also a case issue. You can do case-insensitive searches, but if you do that and want an index, you're going to have to use a function index. So you actually define an index with the lower function on it to be able to efficiently pull out those records. Now, getting on to the fuzzy matching, they used the fuzzystrmatch extension, added that in, and it does Levenshtein distance between strings. So they did a query using this levenshtein function to pull out records where one character was missing from the search term. But the problem was it took over 100 milliseconds, so pretty darn slow. A solution to that is using the soundex function. This algorithm reduces a word to a phonetic code. So for example, even though each of these Harrington variants is spelled differently, they all return the same phonetic code. And out of 50,000 records, doing a soundex check against Harrington only returned 46 records, so it helps you reduce the number of candidate records. To handle this, what you do is create a function index using this soundex function against the name. And when you use the same query, but with this new index in place and of course adding the soundex condition, it returns data in just one millisecond. So very efficient fuzzy matching. If you have a need to support fuzzy match searching, you may want to check out this post from Crunchydata.com.
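
As a sketch of the overall approach, with a hypothetical table standing in for the article's fake-name data:

    CREATE EXTENSION IF NOT EXISTS fuzzystrmatch;

    -- Hypothetical names table, not the one from the article.
    CREATE TABLE people (
      id      bigint GENERATED ALWAYS AS IDENTITY,
      surname text
    );

    -- Case-insensitive prefix searches can use a function index with text_pattern_ops.
    CREATE INDEX ON people (lower(surname) text_pattern_ops);

    -- soundex() reduces a name to a phonetic code, so an index on it narrows the
    -- candidate rows cheaply before the more expensive Levenshtein comparison runs.
    CREATE INDEX ON people (soundex(surname));

    SELECT surname
    FROM people
    WHERE soundex(surname) = soundex('Harington')        -- fast index lookup on the phonetic code
      AND levenshtein(surname, 'Harington') <= 1;        -- then allow one character of difference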

[00:11:47] The next piece of content: PostgreSQL: How to write a trigger. This is a basic post from Cybertec-postgresql.com that walks you through how to set up a trigger. They created a table for tracking temperatures, which has a few different columns. The next thing you need to do is define a function, and then create a trigger that's going to use that function. Now, before creating the function, they mention that triggers can fire either before or after: you can have as many before triggers as you want that run before the actual insert or update, and then as many triggers as you want after. And they actually run in a defined order, by the name of the trigger, so you want to name them appropriately so your triggers happen in a particular order. So, as mentioned, the first step is to create a function. They create a function called Ftemp, probably for "function temp", and basically, if the new value is less than zero, it sets that new value to negative one. So that's the basic function: it's going to change the value of the row being inserted if it's less than zero.
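
Here's a minimal sketch of that function together with the trigger described in the next paragraph; the table, column and trigger names are simplified stand-ins rather than the exact ones from the post:

    CREATE TABLE t_temperature (
      id        bigint GENERATED ALWAYS AS IDENTITY,
      tstamp    timestamptz DEFAULT now(),
      sensor_id int,
      value     numeric
    );

    -- Clamp impossible readings: anything below zero becomes -1.
    CREATE FUNCTION f_temp() RETURNS trigger AS
    $$
    BEGIN
      IF NEW.value < 0 THEN
        NEW.value := -1;
      END IF;
      RETURN NEW;      -- the (possibly modified) row that will actually be inserted
    END;
    $$ LANGUAGE plpgsql;

    -- BEFORE INSERT, FOR EACH ROW, so the function can rewrite each row on its way in.
    CREATE TRIGGER trig_temp
      BEFORE INSERT ON t_temperature
      FOR EACH ROW EXECUTE PROCEDURE f_temp();

    INSERT INTO t_temperature (sensor_id, value) VALUES (1, -196);   -- stored as -1
    INSERT INTO t_temperature (sensor_id, value) VALUES (1, 18.5);   -- stored as 18.5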

[00:12:51] And then you create a trigger and specify before whatever operation you want: before update, before delete, before insert. Here it's doing before insert, and for each row it executes the procedure that was defined up here, the Ftemp procedure. So with this trigger in place, he did some tests and the results came out as expected: when trying to insert a negative 196, it returned a negative one; otherwise it returned what was inserted. Now, they use the keyword NEW here, and that refers to the new row being inserted. But there also exists an OLD, and it depends on what operation is happening which keyword is available. NEW is present in inserts and updates, OLD is present in updates but also deletes, and neither is present in truncates. In addition to these keywords, there are some other variables you can use, like the relation ID, the table schema, the table name and the relation name, as well as the operation, to say whether the trigger is firing because of an insert, delete, update or truncate, plus other options that you can use in your triggers. So if you want to learn more about using triggers with Postgres, definitely check out this post.

The next piece of content is PostgreSQL logical replication: advantages and step-by-step setup. This is from Highgo.ca, and they're talking about the reasons why you would want to set up logical replication. They mention a few here: one is sending incremental changes in a single database, two is consolidating multiple databases into a single one, third is replicating between different major versions of Postgres (like maybe you want to migrate data or do an upgrade), and then sharing a subset of the database between various databases. So a lot of use cases for moving data around. And then they go through the process of setting up a basic logical replication setup, where you set up the server to be the publisher, set up the server to be the subscriber, create the publication on the publisher, and create the subscription on the subscriber in order to start syncing data over. So if you're interested in that, you can check out this post from Highgo.ca.
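
As a rough sketch of those steps (the hostnames, publication, subscription and table names here are placeholders; the publisher also needs wal_level set to logical and a pg_hba.conf entry allowing the subscriber to connect):

    -- On the publisher, after setting wal_level = logical and restarting:
    CREATE PUBLICATION my_pub FOR TABLE accounts;        -- or FOR ALL TABLES

    -- On the subscriber, where the table definition must already exist:
    CREATE SUBSCRIPTION my_sub
      CONNECTION 'host=publisher.example.com dbname=appdb user=replicator password=secret'
      PUBLICATION my_pub;

    -- Check progress on either side:
    SELECT * FROM pg_stat_replication;        -- on the publisher
    SELECT * FROM pg_stat_subscription;       -- on the subscriber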

[00:15:03] The next piece of content: Faster data migrations in Postgres. This is from Citusdata.com. He's talking about fast ways to migrate data from one database to another, primarily from Postgres to Postgres. He says the number one tool for that is pg_dump and pg_restore, because they let you back things up logically, and you can do individual tables or individual schemas to transfer data from one database to another. One of the keys to doing it efficiently is using the jobs option to specify the number of processes that can run in parallel to do that export and then import. But he says if you have very small tables that may not benefit from parallelism, you can actually pipe a pg_dump straight into a restore on the other side, and that can transfer data pretty quickly. I hadn't actually seen this technique, so that was interesting if you want to check out that method. But if you have a lot of large tables, another solution he has is parallel exporting and loading. There's a Python script he mentions, called parallel loader (and there may be versions in other languages), and basically it can take one large table and use multiple threads to execute COPY commands in parallel. So you can see it's using PostgreSQL's COPY to actually transfer the data. And using this parallel method of transferring data, it went from greater than one day with a pg_dump and restore down to 7 hours and 45 minutes. So it's definitely a lot faster being able to do it in parallel, because otherwise you just have one process that can work on a table at a time. With a large table, being able to break it up and run the process in parallel gives you a lot of performance. So if you're interested in learning more, definitely check out this post.
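
The parallel loader script itself is linked from the post, but the core idea can be sketched in plain SQL: split one big table on a key and run a separate COPY for each range, each in its own session, with the output fed into a matching COPY on the target. The table name and id ranges below are hypothetical:

    -- Each of these runs in its own connection against the source database at the same time,
    -- with the output piped into a COPY ... FROM STDIN on the target.
    COPY (SELECT * FROM big_table WHERE id >= 0        AND id < 10000000) TO STDOUT;
    COPY (SELECT * FROM big_table WHERE id >= 10000000 AND id < 20000000) TO STDOUT;
    COPY (SELECT * FROM big_table WHERE id >= 20000000 AND id < 30000000) TO STDOUT;

    -- On the target, each parallel worker loads its chunk:
    COPY big_table FROM STDIN;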

[00:16:56] The next piece of content is First review of partitioning OpenStreetMap. This is from Rustproof Labs, and it goes over his review of whether he should partition the OpenStreetMap data that he works with and imports into Postgres. He had been using a lot of schemas to manage this data, but he's wondering if he could partition it and make data management easier.

[00:17:19] Now, he covers some of the issues here, but the next post, Round two: partitioning OpenStreetMap, describes how he ultimately did it, and here are some highlights from his results. The bulk imports that he does on a regular basis now generate 17% less WAL, so that's good. And the bulk deletes generate almost 100% less WAL, because deleting data doesn't have to actually do a delete: you just drop the partition table.
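
To see why the bulk deletes generate almost no WAL, compare deleting rows with dropping the partition that holds them; the little range-partitioned table below is a hypothetical stand-in for his OpenStreetMap tables:

    CREATE TABLE osm_load (
      loaded_at date NOT NULL,
      payload   jsonb
    ) PARTITION BY RANGE (loaded_at);

    CREATE TABLE osm_load_2021_01 PARTITION OF osm_load
      FOR VALUES FROM ('2021-01-01') TO ('2021-02-01');

    -- Row-by-row delete: every removed row is written to the WAL.
    DELETE FROM osm_load WHERE loaded_at < '2021-02-01';

    -- Dropping (or detaching) the partition removes the same data almost for free.
    DROP TABLE osm_load_2021_01;
    -- or: ALTER TABLE osm_load DETACH PARTITION osm_load_2021_01;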

[00:17:47] And then they mention that a simple aggregate query runs 75% faster. Now, they did mention that not all queries are running faster; I believe he said, quote, partitioned tables are generally as fast, if not faster. So it looks like the speed is pretty good, or a little bit faster, but not everything is this blazing 75% faster; I believe that was one query that he mentioned. But he goes through a lot of his testing process and his thoughts about the best way to optimize this. And this is really how you have to handle partitioning, because it is a burden to manage, so you want to make sure that the benefits outweigh the costs, of course. So this has been a great series of posts, four I believe, that look at that decision process of should I partition or not. If you're interested in that, check out these blog posts.

[00:18:44] And the last piece of content: the PostgreSQL Person of the Week is Takayuki Tsunakawa. My apologies if that pronunciation was incorrect, but if you want to learn more about his contributions to Postgres, definitely check out this blog post.

[00:19:00] That does it for this episode of Scaling Postgres. You can get links to all the content mentioned in the show notes. Be sure to head over to Scalingpostgres.com, where you can sign up to receive weekly notifications of each episode, or you can subscribe via YouTube or iTunes. Thanks.
