
Insert-Only Vacuum, Settings, Sequence Replication, Improvements | Scaling Postgres 111


In this episode of Scaling Postgres, we discuss insert-only vacuums, the Postgres settings hierarchy, sequence replication and desired improvements.

Content Discussed

YouTube Video

Podcast Audio

Transcript



[00:00:00] In this episode of Scaling Postgres, we talk about insert-only vacuums, settings, sequence replication, and improvements. I'm Creston Jameson, and this is Scaling Postgres, episode 111.

[00:00:21] All right, I hope you, your family, and coworkers are doing well. For this week, our first piece of content is "PostgreSQL version 13 new feature: tuning autovacuum on insert-only tables". Now, you may be thinking, why do you need to vacuum insert-only tables? Generally, vacuum is used to clean up dead rows that result from rows being deleted or updated. Postgres retains the old rows for a time to make sure they stay visible to any transactions from other users, and once they're no longer visible anywhere in the system, they are vacuumed up later. So basically an update results in a new row being added and the old one being retained and then vacuumed later. So why would you need to vacuum insert-only tables? The two reasons, mentioned in the little graphic here, are the transaction ID age, which has a limit that vacuum resets by freezing rows, and keeping the visibility map updated. Both of these matter for insert-only tables. That's essentially what this blog post runs through: this new feature being added to 13 that could be advantageous for insert-only, or essentially append-only, tables. They mention it adds a new column, n_ins_since_vacuum (the number of inserts since the last vacuum), to pg_stat_all_tables, and you can adjust an autovacuum insert threshold and an autovacuum insert scale factor to customize the behavior. Apparently there is even a setting for TOAST tables as well. Next, they talk about use case one, which I mentioned: handling transaction ID wraparound. There are roughly 2 billion transaction IDs available, so before you hit that limit you need to freeze old transactions so that more become available. What normally happens is that when a table's transaction ID age reaches 200 million, it triggers a special anti-wraparound vacuum to handle it. The problem you can run into with, say, insert-only tables is that this vacuum doesn't get done in time, which can mean taking the table offline and doing a manual vacuum. Also, a mandatory anti-wraparound autovacuum will not give up when it blocks a concurrent operation, so such a blocked operation will in turn block all other access to the table and processing comes to a standstill. They give you some configuration recommendations for using this feature in version 13. The second use case they mention is index-only scans. This goes to the visibility map, which an index-only scan needs to consult to determine whether it has to check the heap or not. So if you want more index-only scans that don't have to check the heap, you want to make sure the visibility map stays updated, and this new feature will do that for append-only tables. Now, they do say that if a table receives enough updates or deletes, so it's not strictly append-only, you can set the autovacuum scale factor relatively low, but with an insert-only table that's harder to do. The last case they mention is hint bits on insert-only tables. Basically, the first query that reads a newly created row has to consult the commit log to figure out whether the transaction that created the row was committed or not. If a lot of rows were recently inserted into a table, that can cause a performance hit for the first reader; subsequent readers won't have to check the commit log because they can just look at the hint bit.
Therefore, it is considered good practice to vacuum a table after you INSERT or COPY a lot of rows into it. Essentially, with this new feature, they say PostgreSQL automatically vacuums these insert-only tables after large inserts, so you don't necessarily have to worry about it. So definitely an interesting new feature, and if you want to learn more about it, definitely check out this blog post. I should add, this is from cybertec-postgresql.com.
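To make those settings concrete, here's a minimal sketch (the table name and threshold values are invented for illustration; autovacuum_vacuum_insert_threshold, autovacuum_vacuum_insert_scale_factor, and the n_ins_since_vacuum column are the PostgreSQL 13 names the post discusses):

```sql
-- Hypothetical append-only table used only for illustration.
CREATE TABLE sensor_readings (
    id         bigserial PRIMARY KEY,
    reading    numeric NOT NULL,
    created_at timestamptz NOT NULL DEFAULT now()
);

-- Trigger insert-driven autovacuum sooner than the global defaults,
-- so the visibility map stays current for index-only scans.
ALTER TABLE sensor_readings SET (
    autovacuum_vacuum_insert_threshold = 10000,
    autovacuum_vacuum_insert_scale_factor = 0.05
);

-- Check how many rows have been inserted since the last vacuum (PG 13+).
SELECT relname, n_ins_since_vacuum, last_autovacuum
FROM pg_stat_all_tables
WHERE relname = 'sensor_readings';
```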

[00:04:14] The next post is "Where do my Postgres settings come from?", and this is from mydbanotebook.org. They're talking about all the different places where you can make settings changes in Postgres, and they have a convenient graphic here arranged as a hierarchy, where the top takes precedence over those lower. At the bottom you have the default settings. Above that is the startup configuration for the postmaster. Above that is the postgresql.conf file. Above that is the postgresql.auto.conf file, which you change with an ALTER SYSTEM command. Above that, each database has configuration you can set through ALTER DATABASE. Roles have their own settings through ALTER ROLE. Sessions have their own settings, where you just use SET to make a particular change. And lastly, transactions, where you can use SET LOCAL within a particular transaction. So this is a very convenient graphic for understanding where different settings can be made and their hierarchy. You can also consult the pg_settings view, which tells you the source of each setting. Then, finally, the author presents a decision matrix to help determine where a particular setting is coming from. So if you're interested in that, definitely check out this blog post.
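For reference, here's a small sketch of setting the same parameter at several of these levels and then checking where the active value came from (the database name, role name, and values are just examples):

```sql
-- Cluster-wide: written to postgresql.auto.conf, overriding postgresql.conf.
ALTER SYSTEM SET work_mem = '64MB';
SELECT pg_reload_conf();

-- Per-database and per-role overrides (hypothetical names).
ALTER DATABASE analytics SET work_mem = '128MB';
ALTER ROLE reporting_user SET work_mem = '256MB';

-- Per-session and per-transaction settings take precedence over all of the above.
SET work_mem = '512MB';
BEGIN;
SET LOCAL work_mem = '1GB';
COMMIT;

-- pg_settings shows the current value and the source it came from.
SELECT name, setting, source, sourcefile
FROM pg_settings
WHERE name = 'work_mem';
```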

[00:05:29] The next piece of content is "Can sequence relation be logically replicated?", and this is from Highgo.ca. They're talking about logical replication, where sequences don't get replicated over: the data tables get replicated, but not the sequences. Now, first it addresses what a sequence is. Basically, when you create a serial column, or you have a primary key that has a sequence associated with it, the sequence is a counter that determines the next new value for that primary key, which is an integer or bigint or something of that nature. And these sequence values are not replicated. The reason is actually how they are saved to the WAL: the WAL logs sequences every 32 increments. So, for example, if you're incrementing by one and you are at 500, what gets logged to the WAL is 532, and then it waits until 32 more increments pass before it logs again, and again it logs 32 increments ahead. Now, the reason it does this is that sequences are not guaranteed to have no gaps. For example, you can see this if you do an insert into a table that fails: the next time you have a successful insert, you will have a gap, because the values from the failed inserts are never reused. And the reason they do this is performance, so you don't have to write to the WAL as much. So they say that, in theory, sequences could be logically replicated, and since gaps are okay, having the replicated sequence a bit far forward should be fine. They actually went through the code and investigated how to do it. I'm not going to go into the detail here, of course, but if you want to check this out, it shows how this could potentially be added. And apparently they may be looking to contribute an update to Postgres for this. So if this is of interest, definitely check out this blog post.
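Here's a rough sketch of the gap behavior described above, plus the common manual workaround people use today because sequences aren't logically replicated (the table and sequence names are invented for illustration):

```sql
-- Hypothetical table with a sequence-backed primary key.
CREATE TABLE orders (
    id     bigserial PRIMARY KEY,
    amount numeric NOT NULL CHECK (amount > 0)
);

-- A failed insert still consumes a sequence value, leaving a gap.
INSERT INTO orders (amount) VALUES (10);   -- id = 1
INSERT INTO orders (amount) VALUES (-5);   -- fails the CHECK; id 2 is burned
INSERT INTO orders (amount) VALUES (20);   -- id = 3

-- On a logical replica the table rows arrive, but the sequence does not.
-- A common manual workaround after failover is to bump it past the data:
SELECT setval('orders_id_seq',
              (SELECT COALESCE(max(id), 0) + 1 FROM orders),
              false);
```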

[00:07:23] The next piece of content is "Seven things that could be improved in PostgreSQL". Now, this is inspired by an article we covered in a previous episode of Scaling Postgres called "10 Things I Hate About PostgreSQL", and it's basically proposing improvements; this one mentions seven things that could be improved. The first one is rigid query planning. He's not necessarily talking about the absence of query hints mentioned in the previous blog post; this is about having a more adaptive planner. Basically, the planner just looks at the query, the table structure, and the existing statistics, but doesn't take previous query runs into account when deciding how to execute the query. He's saying having that kind of knowledge could help the planner adapt the plans it comes up with. That definitely seems interesting. Number two is some level of automatic tuning. For example, Postgres mostly does not look at the operating system at all, because it runs on so many different ones and tries to remain operating-system agnostic. But there could be benefits to setting some things based on what memory exists, what disk space exists, and what's written to the logs; could Postgres perhaps make automatic tuning changes from that? It sounds interesting on the surface. I'd be a little wary of it and would of course want to be able to turn any automatic tuning off, but it's definitely an interesting proposal. Number three is lost statistics during a pg_upgrade. When you do a pg_upgrade, you basically have to run ANALYZE again to repopulate the statistics. Since a lot of the data doesn't really change during an upgrade, it would be advantageous if they could be retained somehow. I would say this would be nice, but even for large databases it doesn't take that long just to regenerate the statistics. Number four is autovacuum and transaction ID wraparound. Big, huge issue. I definitely agree that making this easier to deal with would be a great improvement. There are things like the zheap project they're working on, where instead of just inserting a new row on an update and having to vacuum the old one up later, the old versions go to a separate storage area so that rollbacks and concurrent visibility of a particular row can be handled there. Definitely looking forward to seeing what this looks like once it's complete. Number five is disk footprint and bloat, so basically looking for better ways to optimize the storage of data on disk. They say zheap could help with this because you'd have less bloat. They also mention Zedstore, which could be beneficial for ultra-large data warehouses, and the cstore_fdw foreign data wrapper is another option to look at. This could be advantageous, but you can also store the data on a compressed file system if you so choose, so that's another way to get some of this benefit. Number six is the out-of-the-box vulnerability to brute-force password attacks. Here they mention that there's no built-in protection, but you can add the auth_delay contrib module to get it, as shown in the snippet after this section. So just be aware of that and add it to avoid this potential problem. And seven, no query cache. Now, he does mention this is incredibly hard to do, because anything dealing with caching is hard to do.
But he says it would generally be beneficial if, when you know that no data has changed, you could return the same results for the same query input. That could definitely be an advantage. So if you're interested in investigating these a bit more, check out this blog post from cybertec-postgresql.com.
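For item six, the auth_delay module mentioned above is enabled in postgresql.conf roughly like this (the 500 ms value is just an example; a server restart is needed for shared_preload_libraries to take effect):

```
# postgresql.conf
shared_preload_libraries = 'auth_delay'

# Pause after every failed authentication attempt, making
# brute-force password guessing much slower.
auth_delay.milliseconds = 500
```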

[00:11:16] The next post is "A beginner's guide to basic indexing", and this is from the Bannister Me website. Now, this is a very basic guide, and it's basically how this primarily-Laravel developer increased the performance of his application by 100 times. That's not surprising if you don't have an index: adding one gives you these kinds of huge performance wins. So if you're a developer wanting to learn more about databases and improving your application's performance, this is a good blog post to review to see how he did it. Essentially, he identified the indexes that were needed, added them, and even looked into whether a query could become an index-only scan or not. So again, it is on the basic side, but if you're a developer wanting to learn more about indexing and the performance gains it can bring, definitely check out this blog post. And I should mention it's not PostgreSQL specific.
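As a minimal sketch of the kind of change described (the table, column, and index names are made up here, not taken from the post):

```sql
-- Hypothetical table used only to illustrate the kind of win described above.
CREATE TABLE page_views (
    id        bigserial PRIMARY KEY,
    user_id   bigint      NOT NULL,
    viewed_at timestamptz NOT NULL DEFAULT now()
);

-- Without an index, this filter forces a sequential scan over the whole table.
EXPLAIN ANALYZE
SELECT viewed_at FROM page_views WHERE user_id = 42;

-- A matching index turns it into an index scan; including viewed_at in the
-- key also allows an index-only scan for this particular query.
CREATE INDEX page_views_user_id_viewed_at_idx
    ON page_views (user_id, viewed_at);
```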

[00:12:12] The next post is "Things I wished more developers knew about databases", and this is from Medium.com under Software Engineering. Again, this is not PostgreSQL specific, but talks more generally about different database concerns like ACID, network uptime, and auto-incrementing keys. It's more focused on developers than DBAs, and on considerations to take into account as you're developing applications. So if you're a developer, you'll probably gain some insight from looking over this post.

[00:12:45] The next post is "Stored procedures also have rights". This is from Highgo.ca, and it's talking about the rights you can assign to stored procedures. Primarily, you can give a procedure invoker rights or definer rights, so it runs either with the privileges of whoever defined it or with the privileges of whoever is calling it, and when you're creating them you need to take into account how you want that to operate. The post goes through an example with two different users and two different procedures, one of each kind, so you can see how it works and how you get an error when one has invoker rights but the caller doesn't have permission to query the underlying table. So if you're interested in learning more, definitely check out this blog post.
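Here's a minimal sketch of the invoker-versus-definer distinction (the roles, table, and function names are invented, not taken from the post; the same SECURITY clause applies to procedures):

```sql
-- Hypothetical roles and table purely for illustration.
CREATE ROLE app_owner;
CREATE ROLE app_user;
CREATE TABLE secrets (id int, value text);
ALTER TABLE secrets OWNER TO app_owner;

-- SECURITY INVOKER (the default): runs with the caller's privileges,
-- so app_user gets a permission error on the underlying table.
CREATE FUNCTION count_secrets_invoker() RETURNS bigint
LANGUAGE sql SECURITY INVOKER
AS $$ SELECT count(*) FROM secrets $$;

-- SECURITY DEFINER: runs with the owner's privileges,
-- so app_user can call it without direct access to the table.
CREATE FUNCTION count_secrets_definer() RETURNS bigint
LANGUAGE sql SECURITY DEFINER
AS $$ SELECT count(*) FROM secrets $$;
ALTER FUNCTION count_secrets_definer() OWNER TO app_owner;

GRANT EXECUTE ON FUNCTION count_secrets_invoker(), count_secrets_definer() TO app_user;
```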

[00:13:32] The next post is "Automated rapid switchover with a BDR database cluster in Kubernetes". This is particularly interesting: using Kubernetes, you have a cluster of PostgreSQL servers in conjunction with BDR, the bi-directional replication product from 2ndQuadrant.com. Basically, you have, say, three database servers and they are all masters, so it's multi-master, and the post shows how, when one server in the cluster dies, traffic can almost immediately switch to one of the active masters, which then proceeds to rebuild the missing one. It does this in conjunction with their Cloud Native BDR Operator, which manages the Kubernetes cluster. They have a little presentation here titled "Hard to Kill: Automated Rapid Switchover with a BDR Cluster in Kubernetes". So I find this a really interesting use case for Kubernetes. Now, of course, one thing you have to take into account with a multi-master setup is that your application development practices need to be different, so there are considerations that come with that. But definitely an interesting blog post.

[00:14:45] The next blog post is "Easy PostgreSQL 12 and pgAdmin 4 Setup with Docker". This is from Crunchydata.com, and it just runs through setting up Postgres 12 and pgAdmin 4 with Docker. So, potentially, maybe you want to use that with Kubernetes as well. The second post, related to Kubernetes and also from Crunchydata.com, is "Set up and Run PostgreSQL Operator Client (pgo) on Windows". It shows you how to do it on Windows, but it also covers how to use the client once you have the cluster up and running. So if you're interested in doing this on Windows, definitely check out this blog post.

[00:15:28] The next post is "Oracle to PostgreSQL: START WITH / CONNECT BY". This is from 2ndQuadrant.com, and they're talking about how Oracle has the START WITH / CONNECT BY syntax for producing hierarchical results, and they show how you can write the same type of query in PostgreSQL using a WITH RECURSIVE CTE. So if you're interested in migrating from Oracle to PostgreSQL, definitely a blog post to check out.
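As a rough sketch of the translation being described (the employees table and its columns are invented for illustration, not taken from the post):

```sql
-- Hypothetical employee hierarchy; manager_id references the parent row.
-- Oracle:  SELECT ... FROM employees START WITH manager_id IS NULL
--          CONNECT BY PRIOR id = manager_id;
WITH RECURSIVE org AS (
    SELECT id, name, manager_id, 1 AS level
    FROM employees
    WHERE manager_id IS NULL           -- START WITH: the root rows
    UNION ALL
    SELECT e.id, e.name, e.manager_id, org.level + 1
    FROM employees e
    JOIN org ON e.manager_id = org.id  -- CONNECT BY PRIOR id = manager_id
)
SELECT * FROM org ORDER BY level, id;
```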

[00:15:55] The next post is that the PostgreSQL Person of the Week is Mark Wong. So if you're interested in learning about Mark and his contributions to PostgreSQL, definitely check out this blog post.

[00:16:07] And the last blog post is "A chat with Luca Ferrari", and this is from Superhighway Dev. He is another person in the PostgreSQL community, so if you're interested in learning more about Luca, definitely check out this blog post.

[00:16:23] That does it for this episode of Scaling Postgres. You can get links to all the content mentioned in the show notes. Be sure to head over to Scalingpostgres.com, where you can sign up to receive weekly notifications of each episode, or you can subscribe via YouTube or iTunes. Thanks.
