
Postgres Releases, Useless Vacuum, Isolation Differences, WAL Compression | Scaling Postgres 101


In this episode of Scaling Postgres, we discuss the new Postgres releases, useless vacuuming, isolation differences between databases, and different ways to compress WAL files.


Transcript



[00:00:00] In this episode of Scaling Postgres, we talk about the new Postgres releases, useless vacuuming, isolation differences and WAL compression. I'm Creston Jameson, and this is Scaling Postgres episode 101.

[00:00:21] Alright, I hope you're having a great week. Our first piece of content is that new PostgreSQL versions have been released. This is from postgresql.org, and versions 12.2, 11.7, 10.12, et cetera, through 9.6 and 9.4 have been released. Now, this is the end-of-life release for PostgreSQL 9.4, so if you're still on version 9.4 or even earlier, you should definitely upgrade to one of the more recent versions. Now, there was one security issue mentioned, CVE-2020-1720, where ALTER ... DEPENDS ON EXTENSION is missing authorization checks. So that's yet another reason to upgrade, although you could probably take some mitigation steps to reduce the risk of it. The post also lists all the bug fixes and improvements for these versions, so definitely review this blog post if you want to learn more. And there was also a link to the PostgreSQL roadmap, so if you want to know when they're targeting the minor releases, they show it on the Developer Roadmap page, as well as give a projection of when the next major release, version 13, will be.

[00:01:37] The next piece of content is "Useless vacuuming". This is from Robert Haas' blog at rhaas.blogspot.com, and basically he's talking about a situation where vacuum is running, it completes its process, but there are still a lot of XIDs that haven't been cleaned up, so it appears to be doing nothing, and he's describing ways to recognize the problem. Of course, the worst case is the database not accepting commands due to a potential wraparound data loss issue, where you've literally run out of your XIDs, your transaction IDs, or you get a warning about a lot of transactions that need to be vacuumed. So basically he says this issue comes from about four different situations, and he goes into depth on XIDs, why they exist, and why there are 4 billion possible transaction IDs, because it's a 32-bit integer. The first way an XID can be in use is that the transaction with that XID is still running. So basically you have long-running transactions, which is generally a no-no for Postgres; you want to keep them as fast as you can. Another possibility is that another transaction that cares about the status of that XID is still running. So again, the long-running transaction problem. The third way this can arise is that the transaction with that XID has been prepared using PREPARE TRANSACTION, but neither COMMIT PREPARED nor ROLLBACK PREPARED has been executed yet. So this is a situation using two-phase commit. This is not like prepared statements; it's a prepared transaction, where you're going to be doing a two-phase commit. Generally this isn't used a lot in application development at all, only in specialized use cases, so I would hope you would not be running into that. And then of course the fourth way is that there's a replication slot whose xmin or catalog_xmin is that XID. So basically there was a replication slot, essentially it's probably orphaned now, and the replica that was using it is no longer retrieving information from that slot, so the XIDs keep accumulating. Now, he gives a few quick queries for how you can check for cases one and two, the long-running transactions, by looking at the age of backend_xid and backend_xmin in pg_stat_activity. If you're using prepared transactions, you can check the pg_prepared_xacts view, or you can check pg_replication_slots for essentially an orphaned slot or some other problem with replication. And he goes into monitoring: how you could set up some queries using these values, pick a particular threshold, and be alerted on it. So definitely valuable advice to follow. And then lastly, he follows up with how to recover from the situation. Basically, long-running transactions you'll need to cancel; if you have prepared transactions, you need to do the COMMIT PREPARED or ROLLBACK PREPARED for the ones with a high transaction age; or if you have a problem with an orphaned slot, you should go ahead and drop that slot. And then he advocates running a verbose vacuum, not necessarily a freeze yet, but VACUUM VERBOSE, to get things under control, and then you can move on from there. So definitely a really great blog post about useless vacuuming.
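For example, queries along these lines would surface each of the four cases he describes. This is a sketch, not the exact queries from his post:

    -- Cases 1 and 2: long-running transactions holding back the XID horizon
    SELECT pid, state, age(backend_xid) AS xid_age, age(backend_xmin) AS xmin_age, query
    FROM pg_stat_activity
    ORDER BY GREATEST(age(backend_xid), age(backend_xmin)) DESC NULLS LAST;

    -- Case 3: prepared transactions never committed or rolled back
    SELECT gid, prepared, age(transaction) AS xid_age
    FROM pg_prepared_xacts
    ORDER BY age(transaction) DESC;

    -- Case 4: replication slots (possibly orphaned) pinning old XIDs
    SELECT slot_name, active, age(xmin) AS xmin_age, age(catalog_xmin) AS catalog_xmin_age
    FROM pg_replication_slots;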

[00:05:03] The next post is "Isolation: repeatable read in PostgreSQL versus MySQL". He goes over the concept of isolation, how transactions that are occurring within the database are isolated from each other, and how there are four levels according to the SQL standard. There's read uncommitted, where a transaction can see the changes of other transactions before they're committed, and PostgreSQL doesn't implement this mode, he says. There's read committed, where a transaction sees changes from other transactions as soon as they're committed, and this is the default in Postgres. There's repeatable read, where when a transaction reads back a row that's already been read by a previous query, it must read the same values, even if the row was changed by another transaction that committed in the meantime.

[00:05:50] And from what he says, MySQL follows repeatable read as the default. And then lastly there's serializable, where a transaction cannot see or produce results that could not have occurred if other transactions were not concurrently changing the data. So he goes over some very interesting differences. Number one, Postgres's default is read committed, whereas MySQL's default is repeatable read, and he goes into examples of this. But even if you set repeatable read in both Postgres and MySQL, you get different results depending on how the queries are executed; the implementations of the engines are different, and he goes on to show further examples of these differences in a third and a fourth example. So if you're considering moving from MySQL to Postgres and you rely a lot on transactions, this is definitely a valuable post for looking at the differences in how Postgres and MySQL handle their implementations. So, a very interesting blog post to check out. And I should say, this is from the postgresql.verite.pro blog.
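As a quick illustration of trying this yourself in Postgres, here's a sketch with a hypothetical accounts table, not an example taken from his post:

    -- Session 1: start a repeatable read transaction; the snapshot is taken
    -- at the first query
    BEGIN ISOLATION LEVEL REPEATABLE READ;
    SELECT balance FROM accounts WHERE id = 1;

    -- Session 2 (separate connection): commit a change to the same row
    UPDATE accounts SET balance = balance + 100 WHERE id = 1;

    -- Session 1: still sees the original value; trying to UPDATE this row
    -- now would fail with a serialization error instead of using stale data
    SELECT balance FROM accounts WHERE id = 1;
    COMMIT;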

[00:07:07] The next post is "Compression of PostgreSQL WAL archives becoming more important", and this is from percona.com.

[00:07:15] Now, he's talking about literally compressing the WAL files using some sort of compression tool such as gzip, or I believe he talks about p7zip here as well, not so much the wal_compression setting in PostgreSQL. So there is a wal_compression setting you can turn on in Postgres that compresses the full-page image writes to the WAL, and it's definitely advocated to go ahead and set that: you burn a little bit more CPU in exchange for disk space savings. I've generally found that for large database systems, it's much better to have it enabled and take a very minor CPU hit, compared to the disk space savings as well as potential network transfer savings for WAL files, once you have a highly active database. Now, what he's also mentioning here is compressing the files as part of the archive_command. He says there are ways to do this using pgBackRest as well as WAL-G, but you can also do it yourself just using the gzip utility. So if you typically copy the file to another location, you can use gzip to compress it to a particular location, as well as 7za, because it does really high compression as fast as possible, and he shows where you can get it and install it for Ubuntu and CentOS. Then he talks about restoring: he looks at the different ways you can restore with WAL-G and pgBackRest, because you have to unarchive the file before using it again, as well as doing it with gzip and 7za. Now, with some of these compression methods, he was able to get a standard 16 megabyte WAL file down to 197 KB, so extremely high compression with 7za. So if you're wanting to conserve disk space and maybe make some of your WAL handling more efficient, definitely investigate this blog post to see if you can add some additional compression to how you're handling your WAL files.
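For reference, settings along these lines show both approaches. This is a sketch: the archive paths are hypothetical, and the exact commands in his post differ:

    -- Built-in compression of full-page images in the WAL stream
    ALTER SYSTEM SET wal_compression = on;

    -- Compressing archived WAL with gzip as part of archive_command,
    -- with the matching restore_command to decompress during recovery
    ALTER SYSTEM SET archive_command = 'gzip < %p > /mnt/wal_archive/%f.gz';
    ALTER SYSTEM SET restore_command = 'gunzip < /mnt/wal_archive/%f.gz > %p';

    -- Reload to apply (restore_command may need a restart on some versions)
    SELECT pg_reload_conf();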

[00:09:23] The next post is "Configuring work_mem in Postgres", and this is from pgmustard.com, and it basically talks about the work_mem setting. So it defines the amount of memory available to each connection for query operations. Now, it's not that each connection uses only up to that amount; a single query can use several multiples of work_mem, one per sort or hash operation, so you need to keep that in mind as you're configuring it. Now, by default, as he says here, work_mem is four megabytes, and generally that can be boosted up: depending on how many connections you have to your database and how much memory you have, you can adjust what it is. Ideally, you want as many query operations happening in memory as possible, and that's the reason you would want to increase it, but you don't want to increase it so much that you essentially run out of memory in your system, given how many connections you have and the complexity of the queries on those connections. So generally, if you have an online transaction processing database, you probably want to keep work_mem a little bit lower, because you probably have a lot of connections; whereas if you have more of an online analytical processing load, you'll probably have far fewer users and you can ramp up work_mem, because presumably the queries are going to be more complex. And he quotes Christophe Pettus, for example, suggesting that 16 megabytes is a good starting point for most people. But he also goes into ways you can try out different work_mem values for your session, as shown below. So if you're interested in that, a relatively short blog post to check out. The next post is "Quickly load CSVs into PostgreSQL using Python and Pandas". So basically it's showing you a quick way to load CSVs into Postgres, and he's using Python and this tool called Pandas. Now, it looks like Pandas basically makes it easier to create the table, and you can use it to actually load the data. But he has some options down here for working with larger data sets, and option two is the most interesting to me, because it just uses Pandas to inspect the data set and create the schema, and then it relies on the COPY command, which again is the most efficient way to load data into Postgres, to actually load the data into the table that was generated. So if you use Python, perhaps this is a blog post you'd like to check out.
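To make both of those concrete, here's a sketch with hypothetical table names and paths:

    -- Try a larger work_mem for the current session only, then compare plans
    SET work_mem = '64MB';
    EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM events ORDER BY created_at;
    RESET work_mem;

    -- COPY: the fastest built-in way to bulk-load a CSV into an existing
    -- table (from psql, \copy does the same without server file access)
    COPY events FROM '/tmp/events.csv' WITH (FORMAT csv, HEADER true);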

[00:11:49] The next post is "Why dropping a column does not reclaim disk space (or better, why is it so fast?)". This is from Luca Ferrari at fluca1978.github.io, and he's talking about how, when you drop a column from a table, it doesn't immediately reclaim the space. So he has an example here where he creates a one-column table with 346 megabytes of data in it. He then adds a new column with a large default, and the table goes up to almost double the size, 651 megabytes. Then he looks at some of the attributes in the system catalogs to see the status of the columns. He then drops the column, and it happens super fast, 20 milliseconds, so all that data clearly wasn't deleted; essentially Postgres just dropped the reference to it. You can no longer reference that column in queries, but he checks the size and it's essentially the same, and if he looks at the attributes, you can see pg.dropped for that particular column, so as far as the system catalogs are concerned, it's considered invisible at this point. And then how do you get the space back? You do a VACUUM FULL, and that drops the table back down to its original size, but if you look at the catalogs, you can still see that reference is there, so it looks like it essentially never goes away. And he says the only potential drawback is that the dropped attributes probably count as normal ones toward the column limit of the table. Now, I would think it would be quite rare for that to matter, but it's an interesting look at how Postgres works, so if you want to learn more about that, definitely a blog post to check out.
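Walking through the same steps yourself looks roughly like this (a sketch with a hypothetical table, not the exact commands from his post):

    -- Dropping a column only updates the catalogs, so it returns immediately
    ALTER TABLE t DROP COLUMN big_col;

    -- The on-disk size is unchanged...
    SELECT pg_size_pretty(pg_total_relation_size('t'));

    -- ...and the column lingers in pg_attribute as ........pg.dropped.N........
    SELECT attname, attisdropped FROM pg_attribute WHERE attrelid = 't'::regclass;

    -- Reclaim the space with a full rewrite (takes an exclusive lock)
    VACUUM FULL t;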

[00:13:28] The next post is "Migrating from MS SQL to PostgreSQL: uppercase vs. lowercase". So this is going from Microsoft SQL Server to PostgreSQL, and apparently Microsoft SQL Server typically has all caps for its table names and column names, which is typically not done with Postgres, but you can actually set up a couple of queries to change your schema over. For example, he's using a query against the system tables to build a statement that renames each table to the lowercase version of its name, and then using psql's \gexec command, which takes the query result you've built and runs each row as a statement, once per table. Now, he does mention that his first implementation has a SQL injection issue, and that using quote_ident avoids it, so you'd definitely want to use that version when working with these types of queries. That handles the tables, and then he shows how to handle the columns, giving you a query that goes through each column and renames it to lowercase. So if you have a use case for this, definitely a very interesting blog post.
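The table half of that, sketched out (restricting to the public schema is an assumption here):

    -- Generate a rename for every table whose name isn't already lowercase,
    -- quoting identifiers to avoid SQL injection, then run each row via \gexec
    SELECT 'ALTER TABLE ' || quote_ident(schemaname) || '.' || quote_ident(tablename)
           || ' RENAME TO ' || quote_ident(lower(tablename))
    FROM pg_tables
    WHERE schemaname = 'public'
      AND tablename <> lower(tablename)
    \gexec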

[00:14:49] The next post is "How to automate PostgreSQL 12 replication and failover with repmgr, part 2". So this is the second part; the first part is linked here, and it covers a three-node cluster, one primary and two standby PostgreSQL servers, plus a witness node elsewhere to handle the election of the primary. And this post goes over setting up replication and failover, and it goes into a lot of depth on doing it with repmgr. So if you're interested in setting up clusters this way and you want to investigate repmgr, definitely a blog post to check out. Now, related to that, there's another blog post covering repmgr: "Failover and recovery with repmgr in PostgreSQL 11". So the first one was version 12, and in this blog post I guess they're still on 11, and it goes through the process of how they set up repmgr for their use case. So, two blog posts to check out if you're interested in using repmgr for cluster management.

[00:15:58] The next post is "Creating a PostgreSQL procedural language, part 2: embedding Julia". And this is from 2ndquadrant.com. Now, part one covered the initial setup, creating it as an extension, and this part is actually embedding Julia, because Julia provides an API so that Julia functions can be called from C. So they're going to leverage that to execute Julia code from a function or stored procedure defined in Postgres. He has an example here of setting up this next phase. Again, it's quite short, but it's the second part of creating an extension that provides a Julia procedural language.
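Once an extension like that is complete, using it would presumably look something like this. This is purely illustrative; the extension and language name pljulia are assumptions, not taken from the post:

    -- Hypothetical: define and call a function in the embedded language
    CREATE EXTENSION pljulia;
    CREATE FUNCTION julia_add(a integer, b integer) RETURNS integer AS $$
      a + b
    $$ LANGUAGE pljulia;
    SELECT julia_add(1, 2);  -- 3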

[00:16:40] The next post is "Can PgBouncer survive transient network events?". And this is from enterprisedb.com, and it's the fifth part of their series on PgBouncer handling network outages, again using their EnterpriseDB IP manager to do it. So this covers the conclusions and a recap of what the series covered. So if you've been interested in this series, here is the last post.

[00:17:06] And the last piece of content is "How to map a PostgreSQL array to a Java list with JPA and Hibernate". So if you have an interest in doing that, mapping arrays to a Java list, it's definitely a blog post to check out.

[00:17:23] That does it for this episode of Scaling Postgres. You can get links to all the content mentioned in the show notes. Be sure to head over to scalingpostgres.com, where you can sign up to receive weekly notifications of each episode, or you can subscribe via YouTube or iTunes. Thanks.
