
Bloat vs zHeap, Composite Indexes, Unsafe Operations | Scaling Postgres 39


In this episode of Scaling Postgres, we review articles covering bloat vs. zheap, composite indexes, unsafe operations, and being a better developer.

Transcript



[00:00:00] In this episode of Scaling Postgres, we talk about bloat versus zheap, composite indexes, unsafe operations, and better developers. I'm Creston Jameson, and this is Scaling Postgres, episode 39.

[00:00:23] Alright, our first piece of content this week is "Dealing with significant Postgres database bloat: what are your options?". This is from the labs.contractually.com blog. Basically, they were having a lot of problems with bloat, I believe due to a lot of different update statements. The first thing this article goes into is: what is bloat? Essentially, due to Postgres's MVCC architecture, whenever a row gets updated a new row actually gets created, and eventually the old row needs to go away. If you're doing updates so fast that vacuum isn't cleaning those old rows up quickly enough, you essentially get table bloat: data that is no longer valid but is still filling up your table. Then they talk about what problems bloat causes, which is essentially poor performance. Their example is a table with 350 million dead tuples but only 50 million active rows, which the tool they use reports as a bloat score of seven. When the query planner tries to query this table, it can produce bad plans that lead to slow queries, because it does not have an accurate count of how many rows are actually in the table. In this scenario they had a query that was taking 14 seconds, but after they got rid of the bloat the query took 37 milliseconds, a 99.7% reduction in execution time. So bloat can really cause some problems. Then they talk about how database bloat typically gets cleaned up. That's vacuum's job, but what if vacuum is not keeping up, or you're updating so fast that vacuum struggles to keep up? One option is to not update as often: maybe switch some of those updates to inserts from your application's perspective. In the scenario they talk about here, the cause of the bloat may be an overzealous background job that's updating records far too frequently, or a lack of rate limiting; basically something in your application you need to resolve to avoid so many updates. In particular, if you're repeatedly updating the same record, number one you'll have tons of issues with locking, but you'll also create tons of bloated records in your table. Now, once you have a bloat problem, how do you deal with it? One option is VACUUM FULL, but that essentially locks the entire table while it runs, so it's only useful if you can shut off all access to the table. Then there are two other tools you can use: pg_repack and pgcompacttable. In the scenario they're discussing here, they used pg_repack. They talk a little bit about how it works, but they were also using logical replication to load data into Redshift, and because their table was so large they ran into out of memory errors and other issues while using pg_repack, so they basically had to drop the replication slot while doing the repack operation. So this is a great post that goes into bloat: what it is, where it comes from, and how to potentially deal with it. If you suspect that some of your databases are being impacted by this, it's definitely a blog post to check out.
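
As a quick way to gauge whether a table is suffering from this, you can look at the dead versus live tuple counts that the statistics collector tracks. This is just a rough sketch using the standard statistics views, not the bloat tool the article refers to:

```sql
-- Rough bloat check: tables with the most dead tuples relative to live ones.
SELECT relname,
       n_live_tup,
       n_dead_tup,
       round(n_dead_tup::numeric / greatest(n_live_tup, 1), 1) AS dead_per_live,
       last_autovacuum
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 10;
```

In their example, 350 million dead tuples against 50 million live rows would show up here as a ratio of about 7.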

[00:03:49] The next post is actually also related to bloat, and it's about zheap. This is a YouTube video from the PGCon YouTube channel, and the presentation is "zheap: less bloat, fewer writes, and just plain smaller" by Amit Kapila and Robert Haas, from PGCon 2018. I mentioned this YouTube channel previously, but this is one presentation I think you should definitely check out, because I think it probably represents the future of where storage is going for Postgres. zheap is basically a reimagining of how to store data in Postgres in order to avoid these bloat issues. In other words, in the scenario of an update you don't create a new row and then eventually delete the old one; you do more in-place updates. But to handle that, they had to implement essentially an undo and redo log to be able to roll back changes. This is a project they're working on; they don't know if it's going to land in version 12, or 13, or 14. But it kind of represents the future, and since it's related to bloat I wanted to mention it here. Now, this particular YouTube video is not great: they never show the screen, so I just used it for the audio and actually found the presentation slides separately to follow along. One thing I found super interesting (the whole presentation is interesting, so you should definitely check it out) is where they compare heap and zheap. "Heap" is just the standard heap as it exists now. With eight clients, a particular load was placed on it, and they show the size of a particular table as it grows. Because zheap doesn't store all of the transaction visibility information in the row the same way, the table with all its data starts out smaller, and as they run transactions against it, updates and deletes that cause modifications, you can see the bloat increasing in the standard heap and staying high, whereas with zheap it stays consistent. Then, bringing it up to 64 clients doing operations, again the bloat grows higher in the existing Postgres heap, whereas in zheap it stays consistent. So basically it's an opportunity to eliminate these bloat issues and potentially not have to use vacuum anymore, which would be super awesome. They also made note of performance changes. This is something in development, so it's going to change, but the TPS (transactions per second) of zheap was 40% higher than the regular heap in the above tests, and they've seen improvements of as much as 100% in other tests. So definitely something to keep an eye on and definitely a piece of content to check out.

[00:06:42] The next post is "How to use composite indexes to speed up time-series queries". This is from the timescale.com blog. It's talking about time-series queries, and Timescale is a time-series extension that you can add to Postgres, but I found this post relevant if you're working with any kind of time-series data, because typically you want to look at certain ID columns as well as a time column to examine that information over time. In their example they use a timestamp, a device ID, a device type, and a store ID, and you're going to want to query those in different ways. How you create your multicolumn index is very important based upon what questions you want to ask, so this post goes over the different questions you might want to ask and how best to structure your multicolumn indexes in terms of column ordering, because the ordering is very important. They do show some explain plans, but I think some of these relate to the chunking that TimescaleDB does, so that part may be a little less relevant. Still, if you want to make sure your indexes are being used appropriately, getting the order right is super important, and if you suspect you need to brush up on that, it's definitely a blog post to check out.

The next post is "Safe and unsafe operations for high volume PostgreSQL". This is an old post from back in 2016, but it's recently been updated for the changes in PostgreSQL 11. Basically, it lists the different schema changes you might want to make and whether they are safe to do when you have a lot of data and a lot of activity in your database. For example, adding a new column is safe in every version. Adding a column with a default, and adding a column that is not nullable (with a default), were unsafe before PostgreSQL 11, but with PostgreSQL 11's new feature both of these are now safe; they remain unsafe for prior versions, and the post has been updated to reflect that. So this is a good reminder in general about which operations are safe and unsafe in PostgreSQL, and if you need a refresher on that, it's definitely a blog post to check out.
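
Going back to the composite index post, here is a minimal sketch to make the ordering point concrete; the table and column names are illustrative assumptions, not the article's schema. The column you filter with equality comes first and the time column comes last, so one device's recent rows sit next to each other in the index and can be read in order:

```sql
-- Hypothetical time-series table.
CREATE TABLE readings (
    ts        timestamptz NOT NULL,
    device_id bigint      NOT NULL,
    value     double precision
);

-- Equality-filtered column first, time column last.
CREATE INDEX readings_device_ts_idx ON readings (device_id, ts DESC);

-- A query this index serves well:
SELECT *
FROM readings
WHERE device_id = 42
  AND ts > now() - interval '1 day'
ORDER BY ts DESC;
```

And for the safe versus unsafe operations post, the headline change in PostgreSQL 11 is that adding a column with a constant default no longer rewrites the table:

```sql
-- Safe on PostgreSQL 11+ (the default is stored in the catalog, no table rewrite);
-- on version 10 and earlier this same statement rewrites and locks the whole table.
ALTER TABLE readings ADD COLUMN region text NOT NULL DEFAULT 'unknown';
```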

[00:09:04] The next piece of content is "Being a better developer with EXPLAIN" by Louise Grandjonc. This is an older video; PG Day Paris actually took place in March, I believe, but the videos weren't posted until June. I've watched all of them, and I'm prioritizing what I thought were the best presentations there, and I would say this is my top one. It basically covers utilizing EXPLAIN and utilizing PostgreSQL logging in your development environment to track what your queries, and what your ORM, are actually doing, so you can understand and optimize them better. She covers how to enable the different logging settings in PostgreSQL if you don't want to use application-level query logging, and she goes through using EXPLAIN and how to read it. Then she did something pretty interesting: she wrote a Python version of a report, just pulling raw data from the database and assembling the result in Python, and got it running in 1.5 seconds. Then she built a query that uses the features of Postgres to pull that out, thinking it would be faster, but it actually ran in 48 seconds, over 30 times slower. She then used EXPLAIN to understand where the time was being spent and what was causing it to be slow. Once she added the necessary indexes to speed it up, that query completed in eight milliseconds, dramatically faster than the version without indexes, but also dramatically faster than the Python version that did the merges and joins in Python. So if you're a developer, I would say this is definitely a presentation to check out.
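
As a sketch of that workflow (the table, columns, and settings here are assumptions for illustration, not taken from the talk): turn on statement logging in your development database, then use EXPLAIN ANALYZE to see where the time actually goes.

```sql
-- Hypothetical table for illustration.
CREATE TABLE orders (
    id          bigserial   PRIMARY KEY,
    customer_id bigint      NOT NULL,
    created_at  timestamptz NOT NULL DEFAULT now()
);

-- In development, log every statement and its duration (0 ms threshold = log everything).
ALTER SYSTEM SET log_min_duration_statement = 0;
SELECT pg_reload_conf();

-- See the plan and the actual execution time for a suspect query:
EXPLAIN ANALYZE
SELECT *
FROM orders
WHERE customer_id = 123
ORDER BY created_at DESC;

-- If the plan shows a sequential scan here, a matching index is the usual fix:
CREATE INDEX orders_customer_created_idx ON orders (customer_id, created_at DESC);
```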

[00:10:54] The next piece of content is also from PG Day Paris: "Change data capture for a brand new world", from the PG Day YouTube channel. This goes into a bit of the history of exporting data from a database and loading it into a data warehouse or using it for other purposes. They talk about how the first tools available were ETL tools, nightly dumps of data, or using the COPY command in PostgreSQL to export data and import it elsewhere. Then that evolved into trigger-based mechanisms, where you apply triggers that capture changes into an audit table or a history table. Some of the disadvantages there are that triggers can be brittle if you're going to change the schema of the source tables, and you're essentially doubling your writes, because every write that goes into a table also has to go into the audit tables. Then they say that what most people use today is the logical decoding feature: basically creating a logical replication slot and reading it with a tool such as wal2json, which essentially reads the WAL stream and pulls out the information that's needed. They go over the different tools that let you do this and talk a little bit about Kafka for when you really want to get sophisticated. So if you have a need to extract change data from your PostgreSQL database, I would say this is definitely a presentation to check out for the different methods and tools available to do that.
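
For reference, the logical decoding approach they describe boils down to a replication slot plus an output plugin. A minimal sketch, assuming the wal2json plugin is installed and wal_level is set to logical; the slot name is made up:

```sql
-- Create a logical replication slot that decodes WAL through wal2json.
SELECT pg_create_logical_replication_slot('cdc_slot', 'wal2json');

-- Peek at pending changes as JSON without consuming them
-- (pg_logical_slot_get_changes would consume them instead).
SELECT data
FROM pg_logical_slot_peek_changes('cdc_slot', NULL, NULL);

-- Drop the slot when you are done, otherwise the server retains WAL for it indefinitely.
SELECT pg_drop_replication_slot('cdc_slot');
```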

[00:12:26] The next piece of content from PG Day Paris is "Constraints: a developer's secret weapon" by Will Leinweber. He talks about the importance, if you're an application developer, of not just using your database as a kind of dumb hash in the sky. For most applications the database schema changes very little relative to the application code, so if you can put the appropriate constraints on your tables, they can help you avoid application bugs that introduce data issues. For example, he talks about using NOT NULL where appropriate, potentially using unique indexes, and using the appropriate data types: not just text and integers for everything, but booleans or UUID types where they fit, basically using the data type for its intended purpose to help constrain the data. He also mentions that you can create specific check constraints, like making sure a price is greater than zero. It's a relatively brief video, but if you feel like you need to add some more constraints to your database, it's definitely a piece of content to check out.
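
A small sketch of the kinds of constraints he mentions, on a hypothetical table (the names and rules are illustrative, not taken from the talk):

```sql
-- Let the database enforce the rules the application relies on.
CREATE TABLE products (
    id          uuid    PRIMARY KEY,
    sku         text    NOT NULL UNIQUE,                  -- no missing or duplicate SKUs
    active      boolean NOT NULL DEFAULT true,            -- a real boolean, not a text flag
    price_cents integer NOT NULL CHECK (price_cents > 0)  -- "price greater than zero"
);
```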

[00:13:40] The next post is "Triggers: friends to handle with care". Again, this is another video from PG Day Paris, and it basically covers using triggers, how to set them up, and some caveats with regard to them. It doesn't necessarily have information that helps with scaling, but it's definitely an interesting watch if you want to learn more about triggers or use more of them in your PostgreSQL database.
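
For orientation (this is not an example from the talk), a trigger in PostgreSQL is a function plus a CREATE TRIGGER statement tying it to a table. This hypothetical audit trigger, reusing the made-up orders table from the EXPLAIN sketch above, is the kind of thing the "handle with care" caveats apply to, since it fires on every row change:

```sql
-- Hypothetical audit trigger: record every change to "orders" in an audit table.
CREATE TABLE orders_audit (
    changed_at timestamptz NOT NULL DEFAULT now(),
    operation  text        NOT NULL,
    row_data   jsonb       NOT NULL
);

CREATE FUNCTION orders_audit_fn() RETURNS trigger AS $$
BEGIN
    INSERT INTO orders_audit (operation, row_data)
    VALUES (TG_OP, to_jsonb(COALESCE(NEW, OLD)));  -- NEW is null on DELETE, OLD on INSERT
    RETURN COALESCE(NEW, OLD);
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER orders_audit_trg
    AFTER INSERT OR UPDATE OR DELETE ON orders
    FOR EACH ROW EXECUTE PROCEDURE orders_audit_fn();
```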

[00:14:08] The next post is "Preventing SQL injections", and this is from the tapoueh.org blog. He talks about some capabilities PostgreSQL has to avoid SQL injections. One way is basically to parameterize your queries: he talks about using PQexecParams, where you separate the query text from the parameters, and if you pass the parameters that way you're going to essentially avoid SQL injections. The other way is using server-side prepared statements: you prepare the statement you want, with placeholders for the parameters it can receive, and then execute it, passing in the parameters. Now, the thing to keep in mind, which someone actually mentioned in the comments, is that if you have scaled to the point where you're using PgBouncer, you can't really use prepared statements. So in that case, at a large scale, you basically have to rely on these types of parameterized queries. If this is a particular concern for you, definitely a blog post to check out.
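
In plain SQL, the server-side prepared statement approach looks like this minimal sketch (the table and values are hypothetical): the user-supplied input is only ever passed as a value, never spliced into the query text.

```sql
-- Hypothetical table for illustration.
CREATE TABLE users (id bigserial PRIMARY KEY, email text NOT NULL UNIQUE);

-- Prepare once, with a typed parameter placeholder...
PREPARE find_user (text) AS
    SELECT id, email FROM users WHERE email = $1;

-- ...then execute, supplying the value separately from the query text.
EXECUTE find_user('alice@example.com');

-- Even hostile input is treated as a plain value, never parsed as extra SQL.
EXECUTE find_user(''' OR 1=1 --');

DEALLOCATE find_user;
```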

[00:15:19] The next post is "Zero downtime upgrade PostgreSQL 10 to 11", and this is from the evolvemonkey.blogspot.com blog. With version 11 coming out, people are planning their upgrade from version 10, and this goes through the different upgrade paths that are available. Number one is pg_dump, but for a large database that would take an enormous amount of time. pg_upgrade is usually the accepted practice, and if you do the upgrade using hard links it can complete very quickly; he shows it finishing very fast for a 13 gigabyte database. The other option, thanks to the logical replication introduced in version 10, is using that method. Now, this is kind of new, and he has a caveats section; there's a particular process you can go through to do it, but it's a little bit of uncharted territory, so you definitely want to examine this method carefully, and maybe this blog post can help you make up your mind. Definitely a piece of content to check out.
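
At its core, the logical replication path is a publication on the old server and a subscription on the new one. A minimal sketch under assumed names; the post's full procedure also has to handle the schema copy, sequences, and the cutover itself:

```sql
-- On the old PostgreSQL 10 primary:
CREATE PUBLICATION upgrade_pub FOR ALL TABLES;

-- On the new PostgreSQL 11 server, after restoring the schema with pg_dump --schema-only:
CREATE SUBSCRIPTION upgrade_sub
    CONNECTION 'host=old-primary dbname=app user=replicator password=secret'
    PUBLICATION upgrade_pub;

-- Once the new server has caught up and the application is pointed at it:
-- DROP SUBSCRIPTION upgrade_sub;
```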

[00:16:23] The last post is "Five cool things I learned at the PostgreSQL Conference Europe 2018", and this is from the severalnines.com blog. He talks about shared buffers and how to think about setting them, based on one of the presentations, as well as some VM settings for the Linux distribution you've installed Postgres on. He talks about advanced logical replication, covering all the different enhancements, including using logical decoding for close-to-zero-downtime upgrades, which references the approach in the previous post; a discussion of what's new in PostgreSQL 11 and all its new features; then "zheap: an answer to PostgreSQL bloat woes", which relates to the presentation I mentioned previously; and lastly, parallel query in PostgreSQL and how not to misuse it, which was mentioned in our previous episode of Scaling Postgres. So if you want a different individual's opinion on which presentations were great, definitely a blog post to check out.

[00:17:22] That does it for this episode of Scaling Postgres. You can get links to all the content mentioned in the show notes. Be sure to head over to scalingpostgres.com, where you can sign up for weekly notifications of each episode, or you can subscribe via YouTube or iTunes. Thanks.
