
Outage from TXIDs, Breaking Scale, fsync Gate, pg_stat_statements | Scaling Postgres 50


In this episode of Scaling Postgres, we review articles covering an outage from running out of TXIDs, breaking scale, that PG fsync issue and pg_stat_statements.


Transcript



[00:00:00] In this episode of Scaling Postgres, we talk about an outage from running out of TXIDs, breaking scale, the fsync gate and pg_stat_statements. I'm Creston Jamison and this is Scaling Postgres, episode 50.

[00:00:22] Alright, welcome to the 50th episode. And another milestone we recently crossed is over 700 subscribers on YouTube. So that's pretty good. It's a vanity metric, but interesting nonetheless. Now, there's a lot of content to go through, so I'm going to have to move pretty quickly through each piece. But the first article is actually from MailChimp, and they had an outage basically on Sunday, February 3. One of their physical Postgres instances saw a significant spike in writes. The spike in writes triggered a transaction ID wraparound issue. So this is definitely a public service message: as you scale, you definitely need to keep monitoring your transaction IDs to make sure they don't wrap around. Now, they said here it was due to a spike in writes, so that could be from inserts, that could be from updates. So if there was a large job that was churning through a lot of data, doing updates or even deletes, that could have caused this type of transaction ID wraparound issue. I believe this was part of an email that was sent to customers talking about important information about an ongoing Mandrill outage, because Mandrill is a service of MailChimp. And then there's a second email update on the Mandrill outage, service restored, so it talks a little bit about it. And there may be a more in-depth analysis forthcoming, but these are two links to get some information about it, as well as a Twitter status thread that talked about it in a little more depth. But definitely, as you're scaling, transaction IDs are a key thing you need to keep track of to make sure they don't bring your database down. Another article related to this is from thebuild.com, and it's titled "Do Not Change Autovacuum Age Settings". So definitely related. And it says PostgreSQL has two autovacuum age-related settings: autovacuum_freeze_max_age and vacuum_freeze_table_age.
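As a concrete example of the wraparound monitoring urged above, a common check looks something like this. This is a standard catalog query, not one taken from the articles; the alerting threshold is up to you:

```sql
-- Oldest unfrozen transaction ID age, per database.
-- Alert well before age approaches autovacuum_freeze_max_age
-- (default 200 million), and long before the ~2 billion hard limit.
SELECT datname, age(datfrozenxid) AS xid_age
FROM pg_database
ORDER BY xid_age DESC;
```

Alerting once that age crosses, say, half a billion leaves plenty of runway before the 2 billion limit.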
And by default, vacuum_freeze_table_age is set at 100 million and autovacuum_freeze_max_age is set at 200 million. So he said the recommendation he used to make was to increase these to prevent some autovacuums kicking off too frequently. And I've seen some recommendations up to a billion, sometimes a billion and a half, for things like autovacuum_freeze_max_age. But if you do this, he was suggesting doing manual vacuum freeze operations on the oldest tables during low-traffic periods. But he said the problem was people weren't doing this, and with some enhancements in 9.6 and on, some of the I/O penalty is not as severe. So right now he suggests just leaving it at 200 million. Because 2 billion is a limit, and you don't want to hit the limit. So if you bring it to a billion, now you've only got a billion left to play with. And I've seen some recommendations that bring it up to 1.5 billion. If you do that, you really need to keep close watch over it and have monitoring in place, particularly if it's a large table and you're doing a lot of inserts, updates and deletes, or if you have a job gone awry. So something like that may have happened on Mandrill, because the issue happened on Sunday. So did something happen on Friday or Saturday that led to an issue on Sunday? Don't know right now, but definitely his recommendation here is to actually leave them at their default, or maybe not push them as high as previously recommended. So the next article is "How Long Will a 64-bit Transaction ID Last in PostgreSQL?" So right now the transaction IDs are 32-bit, and that's part of the reason why the limit is 2 billion: basically 2 billion transactions in the past are visible, and you can run out of those because the transaction ID space is split in half. But if we were able to make them 64 bits, and I've seen this is potentially something they're considering, that would significantly increase the amount of IDs available.
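As a back-of-envelope sketch of that 64-bit arithmetic, you can run the math right in psql. This assumes a constant 1 million write transactions per second and, mirroring the current 32-bit split, only half the ID space usable:

```sql
-- 2^63 usable IDs, consumed at 1 million per second,
-- converted to years (60 * 60 * 24 * 365 seconds per year)
SELECT round(pow(2, 63) / 1000000 / (60 * 60 * 24 * 365)) AS years;
-- roughly 292471 years
```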
And they have a comparison in this blog post, if you want to check it out: if you're doing 1 million write transactions every second, you would hit the TXID wraparound in 292,000 years. So if they eventually did this, it would be a solution to avoid this issue in the future. Now, related to that, there is in the documentation for Postgres a section on routine vacuuming, and within it a section called "Preventing Transaction ID Wraparound Failures". And right there it has a few scripts, or I should say SQL statements, that tell you how to track what the oldest frozen ID is. So basically you don't want these to hit 2 billion. So this is a way for you to track it manually, or there is monitoring software that can do it for you. So definitely it's something you need to keep on top of. The next article is "Postgres Instances Open to Connections from the Internet", and this is from PostgreSQL Note, which is from Verity Pro. And so this is basically recognizing the fact that with the rise of database-as-a-service, there are a lot of Postgres instances going up where, typically due to misconfiguration, they are left open to the internet. And so this has a set of recommendations to follow to make sure that you're locking down your instance. Like, for example, listen_addresses: if you're going to be accessing the instance remotely, you need to limit the listen addresses that are used. And then he makes reference to a bug in 2013 where people could essentially bypass some of the configuration options set in pg_hba.conf to connect to the database without authentication. So really try to do defense in depth if you are trying to connect to your database across the internet, although the best practice is: don't do that. Just set up your own isolated network at your data provider and have your app servers talk to your database server within that environment. Don't even open it up to the internet.
And if you're even going to consider connecting across it, use SSH tunnels or IPsec or some other technique where you can't just connect straight up to Postgres. Use a defense-in-depth strategy. And then here they reference the Scarlett Johansson crypto-mining malware that was attacking PostgreSQL servers. And the last recommendation they're talking about here is forbidding non-SSL remote connections; basically, only use SSL remote connections. So definitely a blog post to check out if you're doing any of these practices currently, because you want to, of course, be safe on the internet. The next piece of content is actually also from thebuild.com, and it was a presentation at FOSDEM called "Breaking PostgreSQL at Scale". Now, it says breaking PostgreSQL, and this is from Christophe Pettus, but it's actually a set of recommendations on scaling and what you would need to do at each stage of scaling PostgreSQL. So I thought this was a great approach. I don't quite know how the scale was chosen, but basically he starts at 10GB: okay, what would you need to do in terms of scaling with that? And then he moves up to 100GB: what would you need to do here? What happens at a terabyte, ten terabytes and beyond? So at each of the stages he goes through his recommendations on what you would need to do to your database to be able to handle the data, from considerations of how you handle backups, to indexes, and the different issues associated with each stage. So, in terms of scaling Postgres, this is a great piece of content, and I highly suggest you check out this presentation. Now, they have started releasing videos from FOSDEM, so that may be coming up, and I may be posting the YouTube video of this next week. We'll have to see if they post it.
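The lockdown advice from the connections-from-the-internet post boils down to a few configuration lines. This is only a sketch; the addresses and subnet here are placeholders for your own network:

```
# postgresql.conf: don't listen on every interface
listen_addresses = 'localhost, 10.0.0.5'

# pg_hba.conf: require SSL for remote connections,
# and only accept them from your app subnet
hostssl  all  all  10.0.0.0/24  md5
```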

[00:08:18] The next piece of content is "The Most Useful Postgres Extension: pg_stat_statements". So this is basically a description of how to set up pg_stat_statements, including creating the extension, and then how to use it to track what queries are happening, with what frequency, and how much time their execution is taking in your database. So if you're not using it, definitely a blog post to check out. Now, this didn't mention it, but one thing you have to consider when you're using pg_stat_statements, because it is an aggregate count of things, is how often to reset it. And the function you can use for that is pg_stat_statements_reset(). So you're going to have to think about how often you would want to reset it, whether monthly or quarterly or on some other periodic basis, to essentially start from a new base to collect your statistics.
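A minimal setup-and-usage sketch of what's described above might look like this. It assumes pg_stat_statements is already in shared_preload_libraries, and uses the column names from Postgres 11 (newer versions renamed total_time to total_exec_time):

```sql
-- One-time setup in the database you want to observe
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

-- Top queries by cumulative execution time
SELECT query, calls, total_time, mean_time
FROM pg_stat_statements
ORDER BY total_time DESC
LIMIT 10;

-- Clear the accumulated statistics (e.g. monthly)
SELECT pg_stat_statements_reset();
```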

[00:09:09] The next post is "PostgreSQL Connection Pooling with PgBouncer", and this is from PG IO. It basically goes through how to set up PgBouncer on CentOS and Ubuntu, and how it's a connection pooler and a proxy for your database, letting you handle more client connections without using up a ton of connections in PostgreSQL. So this was a pretty easy walkthrough, and it even gives you some recommendations on where you would typically set up your PgBouncer and how to basically get started with it. So if you haven't gotten started with it, this is definitely a blog post to check out to start learning how to use PgBouncer. The next piece of content is a YouTube video from FOSDEM, and it's "PostgreSQL Goes to Eleven". So the first half of this talks about how committing works and how new features get added to Postgres. That's perhaps the first half, or first 20 minutes, and then the last half talks about the new features that came in eleven and what's potentially coming in twelve. So if that information is of interest to you, definitely check it out. The next video, also from FOSDEM, is "PostgreSQL vs. fsync". So this is talking about the fsync issue with PostgreSQL, where it relies upon the kernel of the operating system it runs on to actually flush data in buffers to disk, and if there's ever an error with that, the data kind of gets discarded as opposed to being retried later. So this is an issue that they're dealing with, and they're trying to come up with a solution, I believe they said, in PostgreSQL version twelve. Now, this is a super rare issue, or it has a low probability of occurring, but if you're not aware of it, this was covered in a previous episode of Scaling Postgres. It's definitely something to keep in mind and keep an eye on, because maybe you'll want to upgrade your version sooner than you normally would if a feature comes out in PostgreSQL version twelve that helps mitigate this issue.
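A minimal pgbouncer.ini along the lines the PgBouncer post describes might look like this. All the values here are illustrative, not taken from the post:

```ini
[databases]
; route "mydb" through the pooler to the local Postgres instance
mydb = host=127.0.0.1 port=5432 dbname=mydb

[pgbouncer]
listen_addr = 127.0.0.1
listen_port = 6432
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
; transaction pooling gives the biggest connection savings
pool_mode = transaction
default_pool_size = 20
```

Clients then connect to port 6432 instead of 5432, and PgBouncer multiplexes them over a small pool of real server connections.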

[00:11:13] The next post is "Using pg_repack to Rebuild PostgreSQL Database Objects Online".

This is from the Percona.com blog, and this is about using pg_repack. So depending on how you have Postgres configured, it generates a certain amount of bloat during its usage, because rows are not immediately removed when they are updated or deleted: an update inserts a new row version, and the old one is only cleaned up later by vacuum, so that can generate some bloat. And if vacuum isn't occurring frequently, that can cause bigger bloat problems. Now, trying to resolve a lot of bloat in one go requires doing a VACUUM FULL, but that locks the whole table for reads and writes. However, pg_repack is a separate extension that lets you do an online rebuild of a table without locking it. So if you're interested in using this extension, definitely a blog post to check out. The next post is "PostgreSQL with Passphrase-Protected SSL Keys under systemd". So this is talking about how you can enable SSL with PostgreSQL, and a lot of times you may not password-protect its SSL key, because that means when you try to restart the service, there's no way to put in the key and you can't start PostgreSQL. So a lot of times, if you use SSL, you leave the passphrase off so it can just restart easily. But if you do want to use passphrases, it gives a few recommendations in here on setting things up so that you can, using ssl_passphrase_command. So if you're wanting to use passphrases with your SSL keys, definitely a blog post to check out.
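For the passphrase setup, the key knob is ssl_passphrase_command in postgresql.conf (new in Postgres 11). The script path here is hypothetical; the command just has to print the passphrase to stdout:

```
# postgresql.conf
ssl = on
# %p is replaced with a prompt string; the command must write
# the passphrase to stdout and exit with status 0.
ssl_passphrase_command = '/usr/local/bin/fetch-key-passphrase %p'
```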

[00:12:53] The next post is "Implementing 'AS OF' Queries in PostgreSQL". So this is basically using time travel to look at the state of a table as it existed at a particular point in time. It's mentioned here that Oracle has this type of functionality but PostgreSQL does not; however, this is a way you could do it. It does require setting up a table in a certain way: they've demonstrated using timestamp ranges in an exclusion constraint, constructed a certain way, to be able to mimic this functionality. So if this is something you're potentially interested in, definitely a blog post to check out.
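The rough shape of the technique is sketched below, with hypothetical table and column names (the post's own schema differs): each row version carries the time range during which it was current, and an exclusion constraint forbids overlapping ranges for the same id.

```sql
-- btree_gist lets a GiST exclusion constraint compare the int id
CREATE EXTENSION IF NOT EXISTS btree_gist;

CREATE TABLE account_versions (
    id           int        NOT NULL,
    balance      numeric    NOT NULL,
    valid_during tstzrange  NOT NULL,
    -- no two versions of the same id may have overlapping ranges
    EXCLUDE USING gist (id WITH =, valid_during WITH &&)
);

-- "AS OF" a point in time: the version whose range contains it
SELECT id, balance
FROM account_versions
WHERE valid_during @> '2019-02-01 00:00:00+00'::timestamptz;
```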

[00:13:35] And lastly, if you're interested in more PostgreSQL internals, there's "SortSupport: Sorting in Postgres at Speed", and this is from brandur.org. It basically talks about the different sorting methods that PostgreSQL uses and goes into a lot of detail, specifically about SortSupport and how it accelerates sorting. So if you're interested in a lot of the internal details about that, this is definitely a blog post to check out.

[00:14:02] That does it for this episode of Scaling Postgres. You can get links to all the content mentioned in the show notes. Be sure to head over to ScalingPostgres.com, where you can sign up to receive weekly notifications of each episode, or you can subscribe via YouTube or iTunes. Thanks.
