
Cleaning Up, Function Performance, 11 Million IOPS, Change Data Capture | Scaling Postgres 151


In this episode of Scaling Postgres, we discuss cleaning up your database, function performance, 11 million IOPS and change data capture.


Transcript



[00:00:00] In this episode of Scaling Postgres, we talk about cleaning up your database, function performance, 11 million IOPS and change data capture. I'm Creston Jameson, and this is Scaling Postgres episode 151.

[00:00:22] Alright, I hope you, your friends, family and coworkers continue to do well. Our first piece of content is The Unexpected Find That Freed 20GB of Unused Index Space. This is from Hakibenita.com, and he's talking about how they were running out of disk space, so they would normally just add some more. But they started looking at where they could potentially clean up certain areas of the database to free up space. And they show this graph here, where essentially this is free space over time. So it's a little unusual that the higher it goes, the more free space it has. But basically they did a few operations that brought it up to here, had a little bit more growth, and then something additional they found let them go up this high, which is about 70GB of free disk space. So the first thing they did is they looked at unused indexes, and they include the query here. You could just take this query and run it to see if you have any indexes that aren't being scanned, with no tuples read or no tuples fetched. And the thing to keep in mind is that this uses Postgres statistics, so you may want to reset them periodically to make sure that you're including the most up to date data on whether indexes are indeed being used, and it gives a command that you can use to do that here. The next thing they looked at was index and table bloat. How you can address index bloat is by doing a reindex, and ideally you want to do this reindex concurrently. So you want to make sure you have this capability if you're going to do a reindex; it's Postgres 12 that added REINDEX CONCURRENTLY, so make sure you're on a version that supports that. Failing that, you can also use pg_repack to do it, but REINDEX CONCURRENTLY is the preferred way, of course, if it's available in your version. And again, with recent versions of Postgres you are going to get some benefits in terms of compacting the indexes, and they mention that in Postgres 13 there's a lot of deduplication that can happen if you're indexing duplicate values. Now, in the process of doing a reindex, the new indexes are marked with a _ccnew suffix as they're being created, and if there's any kind of failure, you may need to go in and clean them up. They provide a query for doing that, and they took a look at some of the deduplication capabilities in 13. Then they moved on to table bloat and how you address table bloat. One way is to manually recreate the table and switch to using it. So there are techniques, but it's a very manual process, and you probably have to use triggers to keep things up to date. The other option is to do a VACUUM FULL; the disadvantage is that it locks the entire table. Another tool you can use is the pg_repack extension to be able to repack tables. So that's another option, but keep in mind it can use more than twice the amount of space of the table that you're working on, potentially including the indexes as well. So that's something to be aware of. Now, in terms of the thing that they found to reduce their index size, it's that they were indexing a column that mostly contained nulls. For example, a cancelling user: there's a very small number of rows in their database where someone was cancelling, so most of the values were null. Well, Postgres indexes those, so how they were able to save a lot of space was to use a partial index, only indexing the rows where the column is not null, essentially. And in doing that, they said 99% of the dead weight was shaved off the index. So that's a pretty big reduction. It went from 769 megabytes down to five megabytes. And you can also use partial indexes if you, say, have a status column where you never query for particular statuses. Well, you can just index only the values that you are querying to save space. And they even give you a query to look for columns with a lot of nulls that could be candidates for a partial index like the one they created here. So if you want to learn more about how to help clean up your database, you can check out this blog post.
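As a hedged sketch of those two fixes (the table and column names here are hypothetical, not taken from the post), the partial index and the concurrent rebuild look roughly like this:

    -- "cancelled_by_user_id" is NULL for the vast majority of rows, so a full
    -- index spends most of its space indexing NULLs nobody ever queries.
    -- A partial index only covers the non-NULL rows.
    CREATE INDEX CONCURRENTLY bookings_cancelled_by_idx
        ON bookings (cancelled_by_user_id)
        WHERE cancelled_by_user_id IS NOT NULL;

    -- Once the partial index is in place, the old full index can go.
    DROP INDEX CONCURRENTLY bookings_cancelled_by_full_idx;

    -- Bloated indexes you still need can be rebuilt in place on Postgres 12+.
    REINDEX INDEX CONCURRENTLY bookings_user_id_idx;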

[00:04:12] The next piece of content is Cleaning Up Your Postgres Database. This is from Crunchydata.com, and he's talking about techniques you can use to monitor and then potentially clean up your database. The first thing he suggests looking at is your cache hit ratio, predominantly how often indexes are being hit and also how often tables are being used, and he gives the exact query to check your index hit rate as well as your table hit rate. The next thing he looked at is pg_stat_user_tables, to determine what percentage of the time an index is being used for queries, and he presents it by table. So if you see a small table where the index is not used that much, maybe you don't need to worry about it as much, but a very large table where the index isn't being used a significant amount of the time, that's probably something you want to check out and potentially add some indexes to. The other thing you can do is clean up unused indexes, and he has a query here to look for those that aren't being used, similar to the previous post. Now, the other thing he mentions is to keep certain things outside Postgres. He's proposing that if you have things like messages, logs, events, things of that nature, you don't put them in Postgres, because inevitably they're probably going to take over most of your storage. Now, I actually store things like that in Postgres, but I tend to use partitioned tables and then delete the partitions when they're no longer needed. I prefer doing this rather than running some separate data store, but this is a proposal that he has. And the last thing he mentions is to dig into performance with pg_stat_statements to determine how well your statements are running and whether anything needs to be optimized. So if you're interested in this content, you can check out this post from Crunchydata.com.
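As a rough sketch of those first two kinds of checks against the statistics views (he gives his own queries in the post, and what counts as a low hit rate or an unused index is a judgment call), something like this:

    -- Index cache hit rate across user indexes.
    SELECT sum(idx_blks_hit)::numeric
           / nullif(sum(idx_blks_hit) + sum(idx_blks_read), 0) AS index_hit_rate
    FROM pg_statio_user_indexes;

    -- Indexes that have never been scanned, largest first.
    SELECT schemaname, relname, indexrelname,
           pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
    FROM pg_stat_user_indexes
    WHERE idx_scan = 0
    ORDER BY pg_relation_size(indexrelid) DESC;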

[00:05:48] The next piece of content is Boost Your User-Defined Functions in PostgreSQL. This is from Ongres.com, and they're giving some advice with regard to using user-defined functions within Postgres in your application. Now, they're not necessarily saying you should or shouldn't put logic in the database, but if you do, you should follow these steps. First, they're talking about unnecessary use of PL/pgSQL functions for simple SQL statements. So, for example, you can take logic such as this section of a case statement and make it a function to call. And when you do, you want to use plain SQL to do it, because this is basically just SQL, as opposed to using PL/pgSQL when you don't need it, since the optimizer is able to inline and optimize SQL functions within queries. So it's pretty much doing the same thing, but the pure SQL version is about four times faster compared to the PL/pgSQL function. The second item is unnecessary usage of the SELECT INTO clause. So basically you should use the assignment operator in your functions as opposed to the SELECT INTO clause, because it gives you better performance, and again it's about four times better performance in the example they show here. The third is overusing the RAISE clause. So, like in any other language, exceptions and exception processing are a burden on the function's processing, and here you can see it's about four or five times slower to have that exception processing in there. Fourth is overusing a high-level programming coding style for SQL activities. So basically it's best to express your queries in SQL rather than trying to build your own. For example, in this comparison, using a for loop to get what you could have done with a LATERAL clause, you can see you improve performance by about two times by using that LATERAL statement as opposed to a for loop in a PL/pgSQL function. The next set of recommendations is with regard to function properties. The first one is to use PARALLEL SAFE whenever possible. This mostly applies to read-only queries that don't access sequences, and there are a number of other restrictions, but if you can do that, it's going to give you some better performance.
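To go back to that first recommendation with a made-up example (this is not the post's benchmark, just the shape of the advice), the same trivial logic can be written both ways, and only the plain SQL version can be inlined by the planner:

    -- PL/pgSQL version: a black box the planner has to call row by row.
    CREATE FUNCTION status_label_plpgsql(status int) RETURNS text
    LANGUAGE plpgsql AS $$
    BEGIN
        RETURN CASE WHEN status = 1 THEN 'active' ELSE 'inactive' END;
    END;
    $$;

    -- Plain SQL version of the same logic: the optimizer can inline it
    -- into the calling query.
    CREATE FUNCTION status_label_sql(status int) RETURNS text
    LANGUAGE sql AS $$
        SELECT CASE WHEN status = 1 THEN 'active' ELSE 'inactive' END;
    $$;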

[00:08:06] Secondly, use IMMUTABLE when possible. So if your function is truly immutable, in other words the same input will always result in the same output, go ahead and mark that function as IMMUTABLE. And the last recommendation is to monitor your functions' performance using pg_stat_user_functions, so you can check on how your functions are performing. So if you're interested in learning more, you can check out this post.
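A minimal sketch of those function properties and the monitoring view (the function itself is hypothetical):

    -- Same input always gives the same output and no data is read or written,
    -- so the function can safely be marked IMMUTABLE and PARALLEL SAFE.
    CREATE FUNCTION fahrenheit(celsius numeric) RETURNS numeric
    LANGUAGE sql IMMUTABLE PARALLEL SAFE AS $$
        SELECT celsius * 9 / 5 + 32;
    $$;

    -- pg_stat_user_functions is only populated when track_functions is
    -- set to 'pl' or 'all'.
    SELECT funcname, calls, total_time, self_time
    FROM pg_stat_user_functions
    ORDER BY total_time DESC;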

[00:08:29] The next piece of content is Achieving 11M IOPS and 66 GB/s IO on a Single ThreadRipper Workstation. Now, this is not database specific; he doesn't mention databases in his analysis at all. But what he is saying is that modern hardware has become super, super high performance, and you may want to consider a very recent, very fast set of hardware versus setting up a cluster of hundreds or thousands of lower-powered machines, because that may result in better performance. And I actually took a look at this from AWS's perspective: right now the maximum provisioned IOPS they offer is about 64,000, they have a preview where they're offering 256,000 IOPS, and you can attach about 30 volumes to one of their metal instances. But even with that, you're still not going to get near the 11 million IOPS from this single workstation. So it's an interesting read on what is possible. Now, this is a quite complex blog post that doesn't mention databases, but it's all about the hardware, its capabilities, and optimizing around different bottlenecks. In this era of distributed workloads and seeking to scale out versus scale up, here's a post that talks about how far you can scale up, essentially. So if you're interested in that, you can check out this post from Tanelpoder.com.

[00:10:00] The next piece of content is Change Data Capture in Postgres With Debezium. This is from Crunchydata.com, and essentially this enables you to use your write-ahead log to track changes. So you don't have to put any triggers on Postgres; it just reads the write-ahead log stream and sends those data changes off, say, to another database or another system that captures the data you're interested in. So this is a way you could potentially handle logs and messages and things of that nature: if you don't want to put them in Postgres, you can send them off to a separate data store. And this is using Debezium to do it, and they go through the process of setting it up. Now, they're setting this up with their Crunchy Bridge, but you can use this information to set up Debezium for your own Postgres instance. So if you're interested, you can check out this blog post.
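The Debezium side of this is connector configuration, but on the Postgres side this style of change data capture rests on logical decoding. A hedged sketch of the prerequisites (the publication, slot and table names are placeholders, and Debezium can also create the slot itself):

    -- Logical decoding must be enabled; changing wal_level needs a restart.
    ALTER SYSTEM SET wal_level = 'logical';

    -- A publication scoping which tables' changes get streamed out.
    CREATE PUBLICATION cdc_pub FOR TABLE orders, order_events;

    -- The replication slot a connector reads the change stream from.
    SELECT pg_create_logical_replication_slot('cdc_slot', 'pgoutput');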

[00:10:49] The next piece of content is PostgreSQL: What Is a Checkpoint? This is from Cybertec-postgresql.com, and it walks through the process of how inserts happen. Basically, when an insert needs to happen, the first step is actually to write what's being inserted to the WAL, so it's committed to disk that this change is going to happen. It then writes that information to the shared buffers, so it's in memory. On a periodic basis, a background process writes it to the actual data files, and that is the checkpointing process: it basically makes sure all the dirty buffers get written to the disks, and then that checkpoint is recorded in the WAL. And they mention the configuration settings involved. One is checkpoint_completion_target, which by default is 0.5. That means that, of whatever interval the checkpoint is supposed to happen within, it should be completed basically 50% into that time frame. But interestingly, you mostly bring this up to around 0.9. And he mentions a comment here, quote: "In PostgreSQL 14, this parameter will most likely not exist anymore. The hardcoded value will be 0.9, which will make it easier for end users." So that leaves two other variables you can adjust. One is checkpoint_timeout: by default it's five minutes, but normally you make that 20 minutes, maybe 30 minutes, an hour at most, probably. And then max_wal_size determines how much WAL you're going to keep. Basically these two parameters determine how often checkpoints happen. If they're happening faster than your timeout, you'll generally get a warning, or you should configure that warning; and if too much WAL is being generated, then maybe you'll want to increase your max_wal_size. But if you want to learn more about checkpoints, you can check out this post.
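As a rough sketch of the kind of tuning he describes (the exact values depend entirely on your workload and available disk space):

    -- Spread checkpoint writes across most of the checkpoint interval.
    ALTER SYSTEM SET checkpoint_completion_target = 0.9;

    -- Checkpoint on time every 20 minutes instead of the 5 minute default.
    ALTER SYSTEM SET checkpoint_timeout = '20min';

    -- Allow more WAL between checkpoints so they are not forced early.
    ALTER SYSTEM SET max_wal_size = '8GB';

    -- Log a warning when WAL-triggered checkpoints happen closer together
    -- than this.
    ALTER SYSTEM SET checkpoint_warning = '5min';

    SELECT pg_reload_conf();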

[00:12:28] Next piece of content is actually a YouTube channel. About 20 or so videos have been posted to the EDB YouTube channel, and they're from Postgres Build 2020. So there are all sorts of Postgres-related videos, approximately 30 minutes in duration, that have been posted there. If you are interested in video content, definitely check out this YouTube channel.

[00:12:50] The next piece of content is Load and Query Pi-hole Data From Postgres. This is from Rustprooflabs.com. Pi-hole is an application that lets you block ads at the network level, and it stores information about its operation in a SQLite database. Well, he actually wanted to analyze it. So what he did is export the database and then use the SQLite foreign data wrapper to be able to query that data from Postgres. He talks about his setup and getting it to the point where he could query the data. The thing he noticed is that he got a little bit of poor performance, so then he goes into actually loading the data from the specific tables he was interested in into Postgres. In other words, he used the foreign data wrapper to load the data into tables he created in Postgres. So if you're interested in that, you can check out this blog post.
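A hedged sketch of that approach, assuming the sqlite_fdw extension is available and the exported Pi-hole database was copied to a hypothetical path (the column list is trimmed down for illustration):

    CREATE EXTENSION sqlite_fdw;

    -- Point a foreign server at the copied Pi-hole SQLite file.
    CREATE SERVER pihole_sqlite FOREIGN DATA WRAPPER sqlite_fdw
        OPTIONS (database '/var/lib/postgresql/pihole-ftl.db');

    -- Expose one of its tables as a foreign table.
    CREATE FOREIGN TABLE pihole_queries_fdw (
        id bigint,
        "timestamp" bigint,
        domain text,
        client text
    ) SERVER pihole_sqlite OPTIONS (table 'queries');

    -- Querying the foreign table directly was slow, so load the data into a
    -- local table for analysis.
    CREATE TABLE pihole_queries AS SELECT * FROM pihole_queries_fdw;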

[00:13:41] The next piece of content is PostgreSQL Data Security: Authentication. This is from Percona.com, and it's the second in a series of posts on authentication. They're talking about Postgres's internal authentication, in other words the trust, reject, MD5, SCRAM and cert authentication methods. Trust and reject: basically, trust lets you come in without any password at all, while reject refuses the connection without even asking for a password. MD5 does an MD5 hash for password authentication, SCRAM uses SCRAM-SHA-256 authentication, and then there's using certificates. So if you want to learn more about setting up authentication with Postgres, definitely check out this post.
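Those methods are chosen per connection rule in pg_hba.conf; a hedged sketch of what such rules look like (the databases, users and addresses here are placeholders):

    # TYPE   DATABASE  USER       ADDRESS        METHOD
    local    all       postgres                  trust           # no password asked at all
    host     all       all        10.0.0.0/8     scram-sha-256   # salted challenge-response
    host     all       legacyapp  10.0.5.0/24    md5             # md5-hashed password
    hostssl  all       all        0.0.0.0/0      cert            # client certificate auth
    host     all       all        0.0.0.0/0      reject          # refuse without prompting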

[00:14:19] The next piece of content is Postgres and Crypto Supply Chain of Integrity. This is from Crunchydata.com, and they're talking about how you get PostgreSQL to use FIPS 140-2 crypto; basically, this is the Federal Information Processing Standard 140-2. So if you're interested in that, you can check out this post from Crunchydata.com.

[00:14:44] The next piece of content is Waiting for PostgreSQL 14: SEARCH and CYCLE clauses. These are clauses that are part of the SQL standard and can be used with recursive queries. He's using an example of a directory structure, and you can use the SEARCH keyword to order the results depth first, so it shows you the data laid out by depth. Or you can search breadth first, so it gives you the breadth of the directories and then goes further down. And then he also talks about CYCLE, and that's where you're trying to determine all the different routes between different cities, say, if you're planning a flight. Once you load in all the data and all the possibilities, you can then use a recursive query to find all the different paths you can take. And what's great about CYCLE is that it actually halts the recursion once it's about to revisit a row, so you don't end up looping over duplicate trips forever. So if you're interested in learning more about that, you can check out this post from Depesz.com.
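A minimal sketch of the new syntax, against hypothetical directories and flights tables:

    -- SEARCH orders the result depth first (BREADTH FIRST also works).
    WITH RECURSIVE tree AS (
        SELECT id, parent_id, name FROM directories WHERE parent_id IS NULL
      UNION ALL
        SELECT d.id, d.parent_id, d.name
        FROM directories d JOIN tree t ON d.parent_id = t.id
    ) SEARCH DEPTH FIRST BY name SET ordercol
    SELECT * FROM tree ORDER BY ordercol;

    -- CYCLE marks rows that would revisit a city and stops recursing there.
    WITH RECURSIVE routes AS (
        SELECT from_city, to_city FROM flights
      UNION ALL
        SELECT r.from_city, f.to_city
        FROM routes r JOIN flights f ON f.from_city = r.to_city
    ) CYCLE to_city SET is_cycle USING path
    SELECT * FROM routes WHERE NOT is_cycle;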

[00:15:43] The next piece of content is How PostgreSQL Inserts a New Record With the Help of the Table Access Method API and Buffer Manager. This is from Highgo.ca. If you want to learn about the internals of PostgreSQL and how it does inserts using the table access method API, definitely check out this blog post.

[00:16:02] The next piece of content is Using GitOps to Self-Manage Postgres in Kubernetes. This is from Crunchydata.com. So if you want to use GitOps to help manage your PostgreSQL database system in Kubernetes, definitely check out this post.

[00:16:17] Next piece of content is Various Ways to Retrieve Pgpool-II Statistics. So this is about monitoring Pgpool-II, and they present three ways. One is through SHOW SQL commands you can execute from a database connection. Another is using PCP commands, which, to my understanding, communicate directly with Pgpool-II to give you information back. Or you can use the pgpool_adm extension to do so. If you're interested in learning more about that, you can check out this post from B-peng.blogspot.com.
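For the first of those three ways, an example of the SHOW commands (run through psql connected to the Pgpool-II port rather than directly to Postgres):

    SHOW POOL_NODES;       -- backend status, roles and replication state
    SHOW POOL_PROCESSES;   -- pgpool child processes and their connections
    SHOW POOL_VERSION;     -- the running Pgpool-II version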

[00:16:48] The next piece of content is OverlayNG and Invalid Geometry. This is from Lin-ear-th-inking.blogspot.com, and he's talking about some improvements that were made in PostGIS 3.1 and GEOS 3.9. Referencing a previous post that we mentioned on Scaling Postgres, where PostGIS was still reporting errors with invalid geometry, he gives a bit of a clarification on why that is and the rationale behind it. So you can check out this post if you're interested in that.

[00:17:19] Next piece of content is pg_timetable: Asynchronous Chain Execution. This is an update to the pg_timetable extension for scheduling tasks in Postgres. If you're interested, definitely check out this blog post.

That does it for this episode of Scaling Postgres. You can get links to all the content mentioned in the show notes. Be sure to head over to scalingpostgres.com, where you can sign up to receive weekly notifications of each episode, or you can subscribe via YouTube or iTunes. Thanks.
