
Partition Migration, Like Performance, Best Fillfactor, In-Memory Tables | Scaling Postgres 124


In this episode of Scaling Postgres, we discuss how best to migrate to using partitions, like & ilike performance, determining the best fillfactor and the work towards in-memory tables.

Content Discussed

YouTube Video

Podcast Audio

Transcript



[00:00:00] In this episode of Scaling Postgres, we talk about partition migration, LIKE performance, best fillfactor, and in-memory tables. I'm Creston Jamison, and this is Scaling Postgres episode 124.

[00:00:23] I hope you, your family, and coworkers continue to do well. Our first piece of content is Partitioning a large table without a long-running lock. This is from 2ndQuadrant.com, and it's talking about an issue where you have a large table in a database that continues to grow and grow in size, and finally you realize, gee, I really should have partitioned this, or I need to partition it. Now, typically what you would have to do is do this offline or somehow make the table unavailable while you do it, but this blog post describes a technique you can use to actually partition the table while keeping it online. The first thing they do is create a table they just call original table, put some indexes on it, and insert about 100,000 rows into it, although here it says 10 million rows; well, some number of rows. Then they go ahead and create the partition structure: they create a table called part table as the parent table, partition it by range, and create four partition tables that are the child tables. Now, how they keep things in sync between the partitioned table and the original table is a trigger. So whenever there's an insert on the table, the trigger inserts the new data into the partitioned table. When there's a delete, it deletes the row from the partitioned table and deletes it from the original table. And if there's an update, in the else branch it first deletes that row from the original table; if the row was found there when deleted, it inserts the new data into the partitioned table, because it doesn't exist there yet, and if the row already exists in the partitioned table, it simply updates that data in place. So by using this trigger, everything can be kept in sync. To implement this, it all happens as part of this transaction here.
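As a rough illustration only, here is the general shape of such a sync trigger function. The table names (original_table_old, part_table) and the column list (id, created_at, payload) are made up for this sketch and are not the ones used in the 2ndQuadrant post:

```sql
CREATE OR REPLACE FUNCTION part_table_sync() RETURNS trigger AS $$
BEGIN
    IF TG_OP = 'INSERT' THEN
        -- new rows go straight into the partitioned table
        INSERT INTO part_table (id, created_at, payload)
        VALUES (NEW.id, NEW.created_at, NEW.payload);
    ELSIF TG_OP = 'DELETE' THEN
        -- deletes have to hit both tables, wherever the row currently lives
        DELETE FROM part_table WHERE id = OLD.id;
        DELETE FROM original_table_old WHERE id = OLD.id;
        RETURN OLD;
    ELSE
        -- UPDATE: if the row is still in the old table, move it over;
        -- otherwise update it in place in the partitioned table
        DELETE FROM original_table_old WHERE id = OLD.id;
        IF FOUND THEN
            INSERT INTO part_table (id, created_at, payload)
            VALUES (NEW.id, NEW.created_at, NEW.payload);
        ELSE
            UPDATE part_table
               SET created_at = NEW.created_at,
                   payload    = NEW.payload
             WHERE id = NEW.id;
        END IF;
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;
```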

[00:02:28] The first thing they do is rename the table to an "old" original table name. Basically, they're just renaming the table, so this transaction should execute pretty quickly. So, after renaming the original table to an old table, they alter that table to disable autovacuum, and then they create a view with the same name the table was renamed from, original table. The view selects all the data from the original table and unions it with the data from the new partitioned table, so basically this view gives you the full state of the data that exists across the partitioned table and the original table. Then, with the function that was mentioned up above, they create a trigger and apply it for each row against the view carrying the original table's name. Once this is done, things will be kept in sync and data will slowly move from the old table to the new partitioning structure. Then what you do is run an update on that table, essentially the view, for each ID that exists in the old table, and that slowly migrates the data from the old non-partitioned table to the new partitioning scheme. They have a script in Perl that does that; they did about 10,000 rows at a time to migrate the old data over. Once the old table is emptied, there are some cleanup commands to run: basically dropping the view and the function, altering sequence names, and renaming the table back to the original table name. So this is a really interesting technique, and I think I might actually be using it in the future. If you want to learn more about this technique, definitely check out this blog post from 2ndQuadrant.com.
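Here is a condensed, hypothetical version of that cut-over, reusing the placeholder names from the sketch above; the exact statements in the 2ndQuadrant post differ in detail:

```sql
BEGIN;
ALTER TABLE original_table RENAME TO original_table_old;
ALTER TABLE original_table_old SET (autovacuum_enabled = false);

-- the view keeps the old name, so the application sees one combined data set
CREATE VIEW original_table AS
    SELECT * FROM original_table_old
    UNION ALL
    SELECT * FROM part_table;

CREATE TRIGGER original_table_sync
    INSTEAD OF INSERT OR UPDATE OR DELETE ON original_table
    FOR EACH ROW EXECUTE FUNCTION part_table_sync();
COMMIT;

-- afterwards, a batch job "touches" old rows so the trigger moves them over,
-- for example 10,000 ids at a time:
UPDATE original_table SET id = id WHERE id BETWEEN 1 AND 10000;
```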

[00:04:13] The next piece of content is PostgreSQL: more performance for LIKE and ILIKE statements. This is from cybertec-postgresql.com, and it's talking about LIKE performance. They create a table that is basically filled with hashes, and they did it for about 50 million rows. Then they ran a query looking for a particular string, knowing there was only one row containing it, and looking for it anywhere within the column. Searching for that one row out of 50 million rows took a little over 4 seconds, which is quite a long time. An explain plan shows it's doing a parallel sequential scan, so it has to scan through all 50 million rows to find it. So the question is, how can you improve performance? What he did is use the pg_trgm extension, which breaks that column up into trigrams, and he shows how a value gets broken up into trigrams when you run it against one. The next step is to index it. First he used a GiST index, but that was huge, 2.5 times the size of the actual table, and the performance was really bad, over a minute and 45 seconds, so it's slower searching through the GiST index than just scanning the table. Clearly you would not want to use a GiST index here, and most text searches I've seen use GIN indexes, so that was the next thing to do. With a GIN index, searching for that one row using the LIKE query finished in 75 milliseconds, so much faster than, say, 4.7 seconds. But the thing to keep in mind is that it's not that efficient when it comes to looking for an exact match on a column. If you also need the ability to do an exact match, you would want to add a B-tree index for the column as well, because with a B-tree index, pulling out a single row returns in less than one millisecond. It's just something to keep in mind when you're doing text searching: a lot of times the GIN index can be much more efficient for things of that nature, like full text search or even JSON data. So if you're interested in the performance improvements you can get for LIKE and ILIKE, definitely check out this blog post.
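A rough reproduction of that approach; the table and column names here are made up for illustration and are not the ones from the CYBERTEC post:

```sql
CREATE EXTENSION IF NOT EXISTS pg_trgm;

CREATE TABLE hashes (id serial PRIMARY KEY, md5_val text);

-- see how a value is split into trigrams
SELECT show_trgm('abc12def');

-- a GIN trigram index lets LIKE '%...%' searches use the index
CREATE INDEX idx_hashes_trgm ON hashes USING gin (md5_val gin_trgm_ops);

-- a plain B-tree index is still the right tool for exact matches
CREATE INDEX idx_hashes_btree ON hashes (md5_val);

EXPLAIN ANALYZE SELECT * FROM hashes WHERE md5_val LIKE '%abc12def%';
```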
The next piece of content is What is fillfactor and how does it affect PostgreSQL performance? Fillfactor is how full you're going to make a table or index page, leaving space on the page available for new data to be inserted. So when would that data be inserted? Basically when an update happens; you're reserving some of that space so updates can happen on the same page, as opposed to splitting out and creating a new page. By default, the fillfactor is 100%, so the question is, is it advantageous to drop the fillfactor down? If you never have updates to a table, a fillfactor of 100% is basically where you want to be, but if you have a lot of updates, maybe you want to drop that fillfactor. He talked about the test setup he used and ran some tests to see what he could achieve, looking at fillfactors of 100, 90, 80 and 70. But because you are actually leaving some of each page empty when you decrease the fillfactor, you're going to run into diminishing returns, because your caching becomes less efficient: those pages are cached, and if they're not as full, some of the performance is going to drop. He looked at both transactions per second and the mean time to run the queries. And it looked like in his tests the sweet spot, at least for this sample size, was a fillfactor of 90%, and in some cases you could maybe drop it to 80%; basically, a slightly smaller fillfactor can give you a little bit of a performance boost. I believe he said, quote, on average a 10% boost when decreasing the fillfactor by 10 or 20%. So that seems to be about the sweet spot, but again, because of the diminishing returns, you want to take that into account when adjusting this. Adjusting this value seems very specific to your use case, so you definitely want to test it, but in the tests he did here a relatively minor fillfactor decrease seemed to work best for getting a bit of a performance boost. If you want to learn more and all the details about the tests and the conclusions, definitely check out this post from cybertec-postgresql.com.
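For reference, fillfactor is set per table (or per index); a minimal sketch with a made-up table:

```sql
-- leave roughly 10% of each page free so HOT updates can land on the same page
CREATE TABLE counters (id bigint PRIMARY KEY, hits bigint) WITH (fillfactor = 90);

-- it can also be changed later, but only newly written pages honour the new
-- setting, so existing pages need a rewrite (e.g. VACUUM FULL) to repack
ALTER TABLE counters SET (fillfactor = 90);
```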

[00:08:47] The next piece of content is Approaches to achieve in-memory table storage with PostgreSQL pluggable API. Now, there is no in-memory table storage available for PostgreSQL yet; this post talks about work potentially heading toward that, because with the release of PostgreSQL 12 they now allow, quote, custom table storage access methods to be developed. If you've heard previous episodes of Scaling Postgres, we talked about zheap, a new storage infrastructure for storing data that could potentially eliminate the need for vacuum. So they're actually using this API to ask: can we store data in memory only? This post goes through their thinking and what kind of changes would need to be made to develop a new storage method. They talk about a number of the open questions, as well as how to handle the buffer manager, because the buffer manager manages what's in memory and flushing it out to disk, and it's used to a great extent in the other table storage methods, such as zheap. But if you're going straight to memory, there's not really a reason to buffer it in memory; why would you buffer it in memory just to then save it to memory? So really you would want to go direct to memory, and it's essentially thinking about how to bypass the buffer manager, or what makes the most sense. So this is an exploratory post on how they're developing it, basically a work in progress. If you're interested in finding out more about it, definitely check out this post from highgo.ca.
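For context, this is the PostgreSQL 12 pluggable storage surface they're building on: a table can be created with an explicit table access method. The name my_inmem_am below is purely hypothetical; no in-memory access method ships with PostgreSQL today:

```sql
-- list the access methods installed in the cluster (amtype 't' = table)
SELECT amname, amtype FROM pg_am;

-- "heap" is the built-in default table access method
CREATE TABLE hot_data (id int, payload text) USING heap;

-- an in-memory access method would plug in the same way (hypothetical):
-- CREATE TABLE hot_data (id int, payload text) USING my_inmem_am;
```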

[00:10:25] The next piece of content is Webinar: Being committed, a review of transaction control statements, one through three, follow up. This is from 2ndQuadrant.com, and it's a webinar that talks about transaction control statements. They cover transactions and persistence of data, how to handle application retries, and get into a little bit about transactions and visibility. I would say this was a basic presentation on transaction control statements, so if you're interested in that, go ahead and check out this webinar from 2ndQuadrant.com.
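As a quick refresher on the kind of statements being discussed, here is a small self-contained example using a throwaway table:

```sql
CREATE TABLE accounts (id int PRIMARY KEY, balance numeric);
INSERT INTO accounts VALUES (1, 500), (2, 500);

BEGIN;
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
SAVEPOINT before_fee;
UPDATE accounts SET balance = balance - 5   WHERE id = 1;
ROLLBACK TO SAVEPOINT before_fee;   -- undo only the fee, keep the transfer
UPDATE accounts SET balance = balance + 100 WHERE id = 2;
COMMIT;                             -- only now is the change visible to other sessions
```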

[00:11:01] The next post is Monitoring system activity with the new system_stats extension for PostgreSQL. This is from the enterprisedb.com blog, and we had covered this extension in a previous episode of Scaling Postgres, posted, I believe, on the personal blog of Dave Page. This post again covers the extension in detail: basically, it exposes utilities that track CPU usage, disk usage, network activity, things of that nature, through the SQL interface. So it basically gives you new system relations you can query to get back detailed operating system information. If you're interested in learning more about this, how to install it, and what kind of OS statistics you can get from it, check out this blog post from enterprisedb.com.
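Usage looks roughly like the following; the function names here are recalled from the extension's documentation and should be treated as approximate, so verify them against the system_stats docs:

```sql
CREATE EXTENSION system_stats;

-- per-CPU utilisation and memory details, exposed through SQL
SELECT * FROM pg_sys_cpu_usage_info();
SELECT * FROM pg_sys_memory_info();
```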

[00:11:54] The next post, also from enterprisedb.com, is Maintaining PostgreSQL is more than just a maintenance plan. This is a general post, but they talk about common database maintenance tasks. They run through four here: basically, vacuum analyze every week to keep your stats updated, reindex your heavily updated tables every month, plan for vacuum full events when you need to, say, shrink tables, and monitor your connections and your load. They also have a framework for addressing maintenance on an ongoing basis: first examine what state your database is currently in and what system issues you're currently experiencing. Are queries waiting on a lock? Is there any kind of performance degradation? Then look at how to approach addressing some of those issues. So if you're looking for a bit of a framework on how to tackle ongoing maintenance of your PostgreSQL instance, you can check out this blog post from enterprisedb.com.
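The SQL behind those tasks boils down to a handful of commands; my_table here is just a placeholder, and the scheduling is the "plan" part:

```sql
VACUUM (ANALYZE) my_table;              -- reclaim dead rows and refresh planner statistics
REINDEX TABLE CONCURRENTLY my_table;    -- PostgreSQL 12+: rebuild bloated indexes without blocking writes
VACUUM FULL my_table;                   -- shrinks the table but takes an exclusive lock, so schedule a window

-- keeping an eye on connections and waiting queries
SELECT state, wait_event_type, count(*)
FROM pg_stat_activity
GROUP BY 1, 2;
```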

[00:12:54] The next piece of content is Unicode normalization in PostgreSQL 13. Unicode normalization is basically converting multiple different forms of a character into a consistent form. This post goes into more detail, but essentially it's about normalizing characters that look the same into the exact same Unicode representation, and this capability becomes available in PostgreSQL 13. They go ahead and show how you can determine whether a string is NFC normalized or NFD normalized; NFC is the default, and they say most of the world essentially uses NFC. So if you want to learn more about this normalization feature for Unicode in PostgreSQL, definitely check out this blog post from 2ndQuadrant.com.
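The PostgreSQL 13 syntax for this looks like the following (the post covers the NORMALIZED checks; normalize() is the companion function). U&'e\0301' builds the string "e" followed by a combining acute accent, which NFC collapses into the single precomposed character "é":

```sql
SELECT U&'e\0301' IS NFC NORMALIZED;                   -- false
SELECT U&'e\0301' IS NFD NORMALIZED;                   -- true
SELECT normalize(U&'e\0301', NFC) IS NFC NORMALIZED;   -- true
```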

[00:13:46] The next piece of content is How to monitor PostgreSQL 12 performance with OmniDB, part two. This one goes into building a performance monitoring dashboard for PostgreSQL. It's a very comprehensive and long blog post describing how to do it, along with a ton of graphs showing you what the output looks like. So if you're interested in building a monitoring solution using their OmniDB tool, check out this blog post from 2ndQuadrant.com.

[00:14:17] The next post, also from 2ndQuadrant.com, is Barman 2.11: barman-cloud-restore and barman-cloud-wal-restore. This shows you how, once you have a backup made with Barman, you can do the restore process, including both the database and the WAL. So if you're interested in using Barman for that purpose, definitely check out this blog post.

[00:14:39] The next piece of content is Snapshot isolation mode. This is from pgsqlpgpool.blogspot.com, and it covers a new feature coming to Pgpool-II version 4.2 that basically allows atomic visibility across multiple PostgreSQL instances. So it looks like this is a feature they're building toward, where essentially you have globally atomic visibility across multiple database servers, and they go into some of the advantages and how they're thinking about it. Basically, this would enable that atomic visibility no matter the version of PostgreSQL. So if you're interested in that, check out this blog post.

[00:15:18] And the last piece of content is that the PostgreSQL Person of the Week is Umair Shahid. If you want to learn more about Umair and his contributions to PostgreSQL, definitely check out this blog post.

[00:15:32] That does it for this episode of Scaling Postgres. You can get links to all the content mentioned in the show notes. Be sure to head over to scalingpostgres.com, where you can sign up to receive weekly notifications of each episode, or you can subscribe via YouTube or iTunes. Thanks.
