
Index-Only Scans, Lock Table, Hot Standby Feedback, Large Backups | Scaling Postgres 54


In this episode of Scaling Postgres, we review articles covering index-only scans, locking tables, hot standby feedback and handling very large backups.


Transcript



[00:00:00] In this episode of Scaling Postgres, we talk about index-only scans, locking tables, hot standby feedback, and very large backups. My name is Kristen Jameson, and this is Scaling Postgres, episode 54.

[00:00:22] Alright, I hope everyone's doing well this week. Our first article is 2019 Database Trends: SQL vs. NoSQL, Top Databases, Single vs. Multiple Database Use, and this is from the ScaleGrid.io blog. This is a survey that asked hundreds of developers at DeveloperWeek about these different trends. So they are talking about SQL versus NoSQL, and they look at the ratio of what people are using. They look at the most popular databases, and it looks like MySQL is coming out on top, followed by MongoDB and then PostgreSQL. Then they reference the DB-Engines rankings with regard to who's increasing in popularity. And then they talk a little bit about single database versus multiple database use, and how those break out by SQL and NoSQL, as well as time spent between different database management tasks. So if you're interested in looking at some of this developer survey data, it's definitely a blog post to check out.

[00:01:29] The next post is Index-Only Scans in Postgres, and this is from the pgMustard blog, and it's talking about index-only scans. These are scans that only need to look at the index to answer the question being queried. Normally with an index scan, depending on how much data you need to pull back, Postgres scans through the index and then goes and pulls the data from the heap to be able to return the query results. However, if you're only asking for data that is already in the index, it can just do an index-only scan and return that data to you, so it can result in much faster queries. And they describe two conditions that must exist for this to work. First is a compatible index. B-tree indexes work. However, something like a GIN index, which you would normally use on a JSON field or a field you want to do full-text searching on, can't be used for an index-only scan, and if you're using other types of indexes, you should consult the documentation on whether they support it. But the standard B-tree indexes do work. The second condition is that the columns you're returning need to be part of the index. So for example, they're showing an index here that is just indexing the chef column. If you want to be able to return more data than just the chef column, you need to add more columns to that index. Now, a benefit of PostgreSQL version 11 is that it allows covering indexes. So you can still index only the column that you're interested in, but you can also bring back what I like to call a payload of extra columns that the index includes. So you can actually increase the proportion of index-only scans you get. Then it has this important caveat: when is an index-only scan not an index-only scan? Well, because PostgreSQL is a multi-version concurrency control (MVCC) database, it has the concept of visibility. In other words, which row versions can your transaction actually see? That visibility data is predominantly stored in the heap. So sometimes when you want to do an index-only scan, it actually needs to consult the heap to see what's visible. There is a visibility map that an index-only scan checks to see whether it needs to go to the heap. Hopefully it does not, but if the data page has changed, then it will need to consult the heap, and when you look at the query plan output, it actually tells you how many heap fetches were necessary. So this was a good blog post that explained index-only scans, and it is definitely a technique you can use to make your queries faster.
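
As a minimal sketch of the ideas above, assuming a hypothetical recipes table around the chef column the post uses as its example:

```sql
-- Hypothetical table for illustration
CREATE TABLE recipes (
    id   bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    chef text NOT NULL,
    dish text NOT NULL
);

-- A plain B-tree index on chef alone only covers queries
-- that return the chef column
CREATE INDEX recipes_chef_idx ON recipes (chef);

-- PostgreSQL 11+ covering index: dish is carried as a non-key
-- "payload", so queries returning it can stay index-only
CREATE INDEX recipes_chef_covering_idx
    ON recipes (chef) INCLUDE (dish);

-- With the covering index, this can run as an Index Only Scan;
-- the plan output also shows a "Heap Fetches" count recording how
-- often the visibility map forced a trip to the heap
EXPLAIN (ANALYZE)
SELECT chef, dish FROM recipes WHERE chef = 'Julia';
```

If Heap Fetches stays high, running VACUUM on the table refreshes the visibility map and usually brings that count back down.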

[00:04:07] The next post is LOCK TABLE Can Harm Your Database's Health, and this is from the Cybertec-postgresql.com blog, and it starts off just explaining table locks in general. Normally in the operation of the database, when you do a SELECT on a table, it will take an ACCESS SHARE lock automatically. You don't have to lock anything explicitly; the SELECT does that, and other types of locks are taken depending upon what data operation you're doing, for example an ACCESS EXCLUSIVE lock when you're going to truncate a table. But of course, it mentions you can explicitly lock a table with the LOCK TABLE statement. Now, they mention here that most people who use LOCK TABLE don't know that the default lock mode is ACCESS EXCLUSIVE, which blocks all concurrent access to the table, even read access. So that's probably not what you want to do to your database when you're trying to scale, and there are less restrictive ways to do locking. And this part was actually a great reference. If you don't want concurrent transactions to modify a row between the time you read it and the time you update it, use SELECT ... FOR UPDATE. If you want to perform several SELECTs on the table and want to be sure that nobody modifies it between your statements, use a transaction with the REPEATABLE READ isolation level. If you want to get a row from a table, process it, and then remove it, use DELETE ... RETURNING. If you want to implement a queue where workers should grab different items and process them, use SELECT ... LIMIT 1 FOR UPDATE SKIP LOCKED. And if you want to synchronize concurrent processes with database techniques, use advisory locks. Now, how locking tables can really cause you problems is with regard to autovacuum, since vacuum's SHARE UPDATE EXCLUSIVE lock conflicts with lock levels people commonly use, such as SHARE and ACCESS EXCLUSIVE. And here's an important point they made, quote: autovacuum is designed to be non-intrusive. If any transaction that wants to lock a table is blocked by autovacuum, the deadlock detector will cancel the autovacuum process after a second of waiting. It posts an error in the log, and then the autovacuum will start up again. But the problem is, if you regularly lock tables as a part of your application's operations, you could be canceling autovacuums frequently enough that they never finish. Then you will get table bloat, and your transaction IDs may start wrapping around, shutting down the database and requiring a restart in single-user mode with a manual vacuum. So you definitely don't want to get into that situation, and they phrase it as "the ugly end." They then ask, how can I avoid this problem? If you already have the problem, launch a manual VACUUM (FULL, FREEZE) on the table and wait until it's done. To avoid the problem, don't use LOCK on a routine basis, tune autovacuum to run more aggressively, and using PostgreSQL 9.6 or later also helps. So if you find yourself as a developer locking tables as part of your application's operations, maybe try to figure out a way to minimize that, or use some of the techniques mentioned in this blog post to make your application more performant and protect it from potential harm.
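
As a minimal sketch of those less restrictive patterns, assuming a hypothetical jobs table with id and status columns:

```sql
-- Read a row and keep others from modifying it until you update it
BEGIN;
SELECT * FROM jobs WHERE id = 42 FOR UPDATE;
UPDATE jobs SET status = 'done' WHERE id = 42;
COMMIT;

-- A work queue: each worker grabs a different pending item without
-- blocking on rows other workers have already locked
BEGIN;
SELECT * FROM jobs
WHERE status = 'pending'
ORDER BY id
LIMIT 1
FOR UPDATE SKIP LOCKED;
-- ... process the item, mark it done ...
COMMIT;

-- Synchronize application processes without locking any table at all
SELECT pg_advisory_lock(1234);
-- ... critical section ...
SELECT pg_advisory_unlock(1234);
```

None of these take anything stronger than a ROW SHARE or ROW EXCLUSIVE table lock, neither of which conflicts with autovacuum's SHARE UPDATE EXCLUSIVE lock, so autovacuum is never canceled on their account.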

[00:07:30] The next post is I Fought the WAL, and the WAL Won: Why Hot Standby Feedback Can Be Misleading. So this blog post is talking about hot standby feedback. If you have a primary database set up that is replicating to one or more replicas, those replicas are typically set up in a hot standby state, meaning they can be promoted to be the new primary database; therefore they are known as hot standbys. So you're doing streaming replication from the primary to the replicas. Now, a lot of times people send read-only queries to these replicas, but the problem is that sometimes those read-only queries get canceled, and you'll see something to the effect of "ERROR: canceling statement due to conflict with recovery." That's because, as they mention in this blog post, the job of the replica is to follow the primary database. If you have long-running queries while a table needs to be dropped, or a number of deletes happen, whatever changes are happening on the primary eventually need to be reflected in the replicas. And typically a read-only query is allowed about 30 seconds to finish before the replica forces itself to catch up with the state of the current primary. Now, one way you can address this is to use hot standby feedback, which basically lets each replica report its backend_xmin to the primary, so the primary knows what stage of replication its replicas are at and which old row versions their queries still need. Now, of course, the trade-off they mention here is that setting hot_standby_feedback to on can incur some table bloat on the primary, but often it's not significant, because basically you're asking the primary to hold off on vacuuming away old row versions until the replicas have finished the queries that are still using them. Quote: however, there are some cases of query cancellation that hot standby feedback cannot prevent. So if you have exclusive locks on the relation on the primary, flaky WAL receiver connections, or frequent writes on a small number of tables, and it goes on to explain more of the details of each of these areas to consider if you have hot standby feedback on. But basically, hot standby feedback is not a panacea. It will increase the probability that a replica will allow read-only queries to finish, but there are some cases where they may still be canceled. So you need to prepare for that in your application and be able to handle those canceled queries. It helps, but it doesn't help all the time. So if you're interested in using this feature, definitely a blog post to check out.
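
To make the mechanics concrete, here is a minimal sketch using the real parameter and view names; the primary/replica split is the only assumption:

```sql
-- On the replica, enable feedback (also settable in postgresql.conf):
ALTER SYSTEM SET hot_standby_feedback = on;
SELECT pg_reload_conf();

-- On the primary, each replica's reported xmin horizon shows up as
-- backend_xmin in pg_stat_replication:
SELECT application_name, state, backend_xmin
FROM pg_stat_replication;

-- The roughly 30-second grace period mentioned above corresponds to
-- max_standby_streaming_delay on the replica (default 30s):
SHOW max_standby_streaming_delay;
```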

[00:10:11] The next post is PostgreSQL Tools for the Visually Inclined. Now, I read this title and I was expecting to see a breakdown of the different visual tools for PostgreSQL, but basically it's talking about how there aren't really many, and you should actually embrace the text-only tools. So it's more of an endorsement of PostgreSQL's direction of primarily using, as the author puts it, "your best friend, psql." He talks about all the different capabilities it gives you, and he contrasts it with using some of the visual tools of, say, Microsoft SQL Server, and he goes through all the different ways you can do different DBA tasks. Primarily what he comes down to is that it's a matter of speed: a text-based tool should basically always be faster than a GUI tool. So if you want to find new and better ways to use text-based tools like psql with PostgreSQL, it's definitely a blog post to check out; a few example meta-commands are sketched below, after the next post.

The next post is Managing PostgreSQL Backups and Replication for Very Large Databases, Part Two, and this is from the leboncoin Engineering blog. Now, we covered the first part in a previous episode of Scaling Postgres. In this second part, he goes over the importance of restore tests and how, no matter the way your backup is set up, you definitely need to set up restore tests; they say they do them on a weekly basis. They also go into a little bit about how their backups are set up. They do logical backups, so pg_dump, to be able to have long-term data retention, and they talk about information with regard to GDPR. But of course, they also do physical backups that are used for restoration purposes, for restoring an entire database, and they use Barman for this, because they actually have multiple data centers, which Barman has the capability to back up. And then they talk about their physical backup retention, estimating the time to back up, and estimating the time to restore. So it really is a good overview of how to examine your environment and what kind of backup and restore protocols work best for you. So if you're interested in learning more, definitely a blog post to check out.
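
Following up on the psql post: a small sampling of the standard meta-commands that make the text-only workflow quick. These are all stock psql; the table name is hypothetical.

```psql
\timing on      -- print execution time for every query
\d recipes      -- describe a table: columns, indexes, constraints
\x auto         -- switch to expanded output when rows are wide
\watch 5        -- re-run the previous query every 5 seconds
\e              -- edit the current query buffer in $EDITOR
```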

[00:12:34] The next post is How ShiftLeft Uses PostgreSQL Extension TimescaleDB. So TimescaleDB is a PostgreSQL extension optimized for storing time-oriented data sets, and it basically does this by using the concept of partitioning, but partitioning across many, many tables, in a structure I believe they call hypertables. So this organization uses TimescaleDB, which runs on Postgres, and they talk about their rationale for using it and how they use it. So if you have a need for tracking time-series-oriented data and want to use PostgreSQL to do it, definitely a blog post to check out.
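
A minimal sketch of the hypertable concept, using TimescaleDB's documented create_hypertable API; the metrics table is hypothetical:

```sql
-- Requires timescaledb in shared_preload_libraries
CREATE EXTENSION IF NOT EXISTS timescaledb;

CREATE TABLE metrics (
    time   timestamptz NOT NULL,
    device text        NOT NULL,
    value  double precision
);

-- Convert the plain table into a hypertable partitioned by time;
-- TimescaleDB transparently routes inserts and queries to the
-- underlying chunk tables.
SELECT create_hypertable('metrics', 'time');
```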

[00:13:15] The next post is Settling the Myth of Transparent Huge Pages for Databases, and this is from the Percona.com blog. He's talking about a testing process where he enables transparent huge pages to see if he can get a boost in performance, or what the performance is like, because most recommendations he's observed have said not to use transparent huge pages. He goes through some benchmarks, and basically he comes to the conclusion that yes, you probably shouldn't use it. But what's even more interesting about this blog post is the discussion in the comments. So, for example, the first comment asks, are you planning to test non-transparent huge pages soon? And they said they did that and it gave really good performance, although there's some back and forth with regard to how much it improved. The reference for this is a post back from December, Benchmark PostgreSQL With Linux HugePages, and it talks about their process of enabling huge pages and the results. But I remember, and it actually may have been this same author, there was a presentation by Percona at, I think it was, FOSDEM, where they were saying some of those results with the Linux huge pages weren't conclusive either. So it's still very much up in the air for me how much Linux huge pages can help improve performance, and it may be dependent upon your workload and what you're doing. Is it more OLTP or is it more OLAP? It's probably the kind of configuration where you need to benchmark how your database operates currently, then try implementing huge pages in a test environment to see if you get better performance. And if you're interested in doing that, these are definitely two blog posts to check out.

That does it for this episode of Scaling Postgres. You can get links to all the content mentioned in the show notes. Be sure to head over to Scalingpostgres.com, where you can sign up to receive weekly notifications of each episode, or you can subscribe via YouTube or iTunes. Thanks.
