LZ4 & ZSTD Compression, Avoiding Problems, Triggers Simplify, Indexes Can Hurt | Scaling Postgres 263
Get New Episodes Weekly!
In this episode of Scaling Postgres, we discuss LZ4 and ZSTD pg_dump compression, how to avoid problems, can triggers simplify and indexes can hurt.
- LZ4 and ZSTD pg_dump compression in PostgreSQL 16
- Nine ways to shoot yourself in the foot with PostgreSQL
- How Postgres Triggers Can Simplify Your Backend Development
- PostgreSQL Indexes Can Hurt You: Negative Effects and the Costs Involved
- In Defense of PostgreSQL MVCC and Vacuuming
- CREATE commands in PostgreSQL releases
- PGSQL Phriday #008 — pg_stat_statements
- Logical replication permissions in PostgreSQL 15
- Waiting for Postgres 16: Running EXPLAIN on any query (even with $1 parameters!)
- Solving N+1 Postgres queries for Ruby on Rails apps
- PostgreSQL Basics: Getting started with psql
- Structured Postgres Regression Tests
- PgOSM Flex for Production OpenStreetMap data
- Queues in Postgres
- Sébastien Lardière
- All About Roda With Jeremy Evans | Rubber Duck Dev Show 83
All right. I hope you, your friends, family, and coworkers continue to do well. Our first piece of content is "LZ4 AND ZSTD PG_DUMP COMPRESSION IN POSTGRESQL 16". This is from cybertec-postgresql.com. And they're talking about how LZ4 and ZSTD compression had been added to Postgres in previous versions for doing compression with toast storage. But now with Postgres 16, you can actually use that as options in pg_dump.
And this post is great because it shows you the different performances and the different compression sizes you can get with these different variants. And I've mentioned before on scaling Postgres that ZSTD has been phenomenal to use. I've been using it instead of Gzip for a couple of years now. And it's been fantastic in terms of how fast it works and how small the data sizes are. And this post shows that explicitly here.
So he tests pg_dump with LZ4, ZSTD, and Gzip and you can see the performance sizes. It looks like ZSTD is the fastest here. And these are with different compression levels. It looks like LZ4 wins here and again, these are just the speeds. And here ZSTD comes out on top again. but look at these compression sizes, where LZ4 is at 48 megabytes, and ZSTD is at 8 megabytes.
So that's a six times smaller file, it's three times smaller than the Gzip file. This is why I keep on saying how much I love ZSTD as a compression tool. So it's great that it's been added to pg_dumps. So now it will be built in, although I've just been piping it to ZSTD to do my backups. But if you want to learn more, definitely check out this blog post.
Next piece of content- "Nine ways to shoot yourself in the foot with PostgreSQL''. This is from philbooth.me. And this post does exactly what it's going to say: show you nine things you shouldn't do. Now I'm not going to agree with all of these and I'll actually mention some counterpoints for some of them, but the first point is to "Keep the default value for work_mem''. So basically you shouldn't do that.
And he says basically because as your database becomes larger and has more complex queries, you start running into disk merges if you have a small work memory. And that will definitely lead to a much lower performance for queries. And he gives some recommendations to try and check for that using a pganalyze, a paid-for tool, or PG Badger to look in your logs for things.
But generally, I don't agree with this first point. I actually like to keep my work memory at its default or maybe bump it up just slightly. But then monitor pg_stat_statements to see what queries are suddenly performing worse. You basically want to see if you're getting disk merges and then strategically determine if you want to up your work memory for that. Because a lot of times as you're growing or scaling your database, you're adding many more connections. And then that means that much work memory is being used for each connection.
So you don't want to run out of memory that way like a previous post, I think from thebuild.com, was talking about someone who had set two terabytes of work memory or something insane like that. So basically I like to empirically determine what it should be just by testing and running in real-world scenarios. And the other thing to keep in mind with work memories is that you can set it per session.
So if you have a database that does 90% real-time transactions, you could have a relatively low work memory. And if you have say 10% reporting, well then, create another user or set a certain session that has a larger work memory to be able to run those reports and not run into disk merges. So I don't a hundred percent agree with deciding what you want your work memory to be ahead of time. But to kind of test out what it should be.
The second point is "Push all your application logic into Postgres functions and procedures". Definitely agree that you should not be doing that. And we had a whole discussion on this in Episode 68 of the Rubber Duck Dev Show about how much business logic to put in your database. And basically, we came to the conclusion it should just be the business logic necessary to maintain your data integrity. So constraints, foreign key references, and maybe a little bit of doing like generated columns. That's probably the extent to which we would do it.
The next point is "Use lots of triggers". Again, you should definitely not be using a lot of triggers, or at least that's my opinion. Because I equate them to a kind of magic. So magical things happen when you insert a field or you update a field. Apart from doing things like audits, which I can see that use case for and I talked about that in a previous episode of Postgres, where we went over the trigger posts. But yes I definitely agree that we should definitely not use a lot of triggers.
The fourth point is "Use Notify heavily". I agree with this one too. I think there are just very specific use cases for that. And I actually haven't used that much in practice. And basically, I've just been working on a project where we were able to get really far just doing polling. Now this is a little bit different but you could get pretty darn far just using polling as opposed to waiting for something to push or notify.
So there are use cases for it but like he says here, don't use it heavily. In terms of building a queuing system that he mentions here, he showed you a way to do it just by pulling events off of an event queue and using FOR UPDATE SKIP LOCKED to be sure to just grab the next set.
Next point is "Don't use EXPLAIN ANALYZE on real data", which means you DO want to use it on real data. So ideally, a copy of production in some way. It's okay if it's delayed a few days but basically have that copy of production data when you are doing query analysis. Now you usually can't replicate active queries happening at the same time, but at least have similar data.
Point six is "Prefer CTEs over subqueries" and I think this was an issue with previous versions of Postgres but because with CTEs now, you can actually determine if you want a materialized or not. This reasoning doesn't really apply anymore. And he actually says this in his edit. Point seven "Use recursive CTEs for time-critical queries". So basically, he's saying you don't want to do that which I agree with because basically when something is recursive, essentially it's a loop.
It's in a for-loop. So whatever query that is, it's going to be running multiple times. So you don't want something that's going to take a lot of time because whatever query that is, it's going to be multiplied by how many times you're iterating over it. Eight is "Don't add indexes to your foreign keys". That's right, you absolutely SHOULD add them to foreign keys because if you have foreign key references or foreign key constraints, and you want to delete a row, well that needs to check the other tables.
So you definitely want indexes on your foreign keys. And nine "Compare index columns with IS NOT DISTINCT FROM". Now, I actually haven't used this statement but he gives an alternative that you can use instead that's better. So definitely recommend looking into that. But check this post out if you want to learn more.
Next piece of content- "How Postgres Triggers Can Simplify Your Backend Development". This is from themythicalengineer.com. And on this particular post, I'm actually going to disagree with the premise of it. And he says here quote, "By using triggers, you can offload some of the work that would normally be done in your backend service code to the database itself. This can simplify your code".
And a lot of times the developer is not going to know what's going on. And now you have to look at two different code bases to actually see what's going on. And personally, I prefer to explicitly do the transaction within the code so you see it there as opposed to relying on a trigger to do it. But there's this one perspective and at the bottom, he did include a number of comments that echo some of what I'm saying as well. So whereas triggers are a great tool in certain use cases, again like the last post I like to use them minimally. But you can check this post out if you want to learn more.
The next piece of content- "PostgreSQL Indexes Can Hurt You: Negative Effects and the Costs Involved". This is from percona.com. And this lists about 10 different points but I think some of these are repeated, so maybe there are about five issues with it that I'll cover. One is that, of course, anytime you add an index, it has to be maintained. So essentially, you are going to penalize your transactional throughput for each new index you add.
So basically, the benefits gained from faster queries need to be offset by that transactions per second hit for example. Now this is also going to increase memory usage, point number two, as you're using more indexes because they're going to be taking part of memory. And it may impact caching because now you're caching those indexes and memory and maybe less data. And third here, he talks about random writes which is that a lot of times when you're inserting data into the table, it's just dependent on the end. Whereas updating indexes, you're doing all sorts of random writes based on how many indexes are on that table.
Again, from all this activity, it results in more WAL that needs to be generated. So your system has to be able to handle that. It talks about more and more IO, again, the more writes every time you update, the more indexes that you have on it. There's an impact on vacuum and autovacuuming, and particularly, I've found the index maintenance time to be the longest period of time. So the more indexes on it, the longer it's gonna take that table to be vacuumed.
He talks about greater storage requirements. So again, more indexes, more storage, which mean less cached in memory but also a greater expense for storing things. And then indexes are more prone to corruption. Now I haven't really had just an index go corrupted. But I have had things here like, doing an operating system switchover and having the Glibc version change. Now you need to re-index all your indexes which means more indexes and that means it's going to take longer.
And plus there are a lot of point releases that change index functionality that requires reindexes. So that's something you need to watch out for and be aware of as you're doing point-release upgrades. But if you want to learn more about ways that indexes can hurt you, definitely check out this blog post.
Next piece of content- "In Defense of PostgreSQL MVCC and Vacuuming". This is from softwareandbooze.com. And he's discussing where he's heard people complaining about how Postgres uses MVCC. Now, not in the Postgres community but in other database communities saying this is bad. And he's saying that's not necessarily the case. Yes, it does cause issues because you have to deal with the table bloat because old versions are kept in the same table as current versions. You have transaction ID wraparound issues and overhead by the visibility map.
But he says one of the main reasons or the main advantages of MVCC is that basically nobody gets blocked, like readers don't block writers. I mean, if something's trying to write the same one, yes, locks are involved. But readers will never be blocked by someone trying to write something. Now in reading this, I was kind of thinking the biggest issue I think with MVCC is when you want to do a lot of updates. So maybe you have a single row you want to update 1,000-10,000 times. Well, you're generating 1,000-10,000 rows in that table when you do that.
And then those all have to be vacuumed up. So basically using Postgres for that use case might not be ideal and it's usually better to actually just do inserts and then delete the old ones later manually or some other process. Now there was a solution discussed for this called ZHeap. So it was a different way to do storage in Postgres. So they actually implemented a way to use different storage engines in a previous version of Postgres.
And it looked like ZHeap would be a way to have a separate redo log, not store the old versions in the same table. That could easily support updating the same row a million times in a table. All those extra rows will be generated in the redo log for example, as opposed to in the actual table. So you don't have to deal with bloat issues or vacuuming at least that up.
But unfortunately, it looks like ZHeap was transferred to another organization, I think, called cybertec-postgresql.com. They have been working on it and I haven't heard a recent status update on it, so I don't know the state of that. But personally, I think it would be great to have that kind of separate storage engine for that use case. Because a lot of what this blog post is about is that a lot of times, even though Postgres uses MVCC, it has some of those downsides.
95% or more of the time, it's fine, there are no issues. It's just certain use cases where it can cause issues. And if we had a different storage mechanism to handle that, then you could use Postgres for more purposes. And he also goes into this, talking about some changes that were made in the Yugabyte project, and some things that they did differently. Some things in the Orile DB project also. So definitely an interesting post if you want to check it out.
Next piece of content- "CREATE commands in PostgreSQL releases". This is from peter.eisentraut.org. And he did an interesting thing where he went by Postgres releases and said what new create commands have been added? So you could see no new create commands were added since version 12, which is a fairly long time. In version 11, you could see they added PROCEDURE. So they added the ability to create procedures. In version 10, they added the ability to create PUBLICATIONS and SUBSCRIPTIONS for doing logical replication as well as STATISTICS.
So these are for handling multi-com statistics. So it's just interesting, the history of what they've added in each version. And then the second table here is by what new system catalogs were added. So by each version, you could see what capabilities have been added by the system catalogs. So this is just a quick post, looking at things and how they've changed in different Postgres versions over time.
Next piece of content- "PGSQL PHRIDAY #008- PG_STAT_STATEMENTS". This is from pgmustard.com. And they're announcing the upcoming PGSQL Phriday is about pg_stat_statements, which is my number one extension for doing performance monitoring and assessments for Postgres. And if you're interested in participating in this month's blog topic, you need to get your blog post in by Friday, May 5th and they have the instructions on how to submit it. So definitely check this out if you're interested in that.
Next piece of content- "Logical replication permissions in PostgreSQL 15". This is from postgresql.fastware.com. And they're talking about how currently, only super users are allowed to create subscriptions. So if you want to do logical replication from one database to another you actually have to create that subscription as a superuser.
In addition prior to 15, quote "...if a SUBSCRIPTION was created by a superuser who was later demoted to a non-superuser, the subscription apply workers and tablesync workers would continue to apply the logical replication changes...". So basically, they considered this a security violation because someone no longer has permissions but yet the subscription keeps chugging right along, synchronizing data.
So a change was made in 15 to no longer allow this. So basically, if you demote a user to a non-superuser, you're no longer able to access that data. And this is in preparation to allow non-superusers to create subscriptions. So now that logical replication has a permissions-based concept, the idea is to now allow non-superusers to create subscriptions. So this is pretty great. They go into some of the changes with 15 and this kind of leads to the promise of allowing non-superusers to create subscriptions. But if you want to learn more, check out this blog post.
Next piece of content- "Waiting for Postgres 16: Running EXPLAIN on any query (even with $1 parameters!)". This is from pganalyze.com. And this is the next episode of "Five minutes of Postgres". And in this episode, he covers the explain generic plan that's new in Postgres 16 that cybertec-postgresql did. We discussed this last week but it basically allows you to get a generic plan for a parameterized query. So if you're interested in Lucas's perspective on that, check out this piece of content.
Next piece of content- "PostgreSQL Basics: Getting started with psql". This is from red-gate.com. And this is a very basic post about just how to get PSQL installed on your system to be able to communicate with Postgres. So they show how to install it on Linux, Windows, and Mac OS. Props to mentioning Linux first. But that's basically what those posts do, as well as explaining how to connect to a database.
Next piece of content- "Structured Postgres Regression Tests". This is from yrashk.com. And he's talking about a tool called pg_regress that does regression tests for Postgres. So basically, it takes an SQL file, sends it to Postgres, gets the output, and then on subsequent runs, it compares that output again to see if there are any regressions. So basically if you make a change, you want to make sure you haven't broken anything. And he said the issue we had with it is that the test looks like this. It just shows you the pure text output.
And ideally, you would want something more structured in your test, cause you don't want a space to be off here or something. So he actually chose to use YAML. So now with this much more structured YAML file for checking the results, he released a new tool called pg_yregress. So if you're interested in using this type of tool for doing regression tests, but using a more structured YAML as opposed to just a text output, definitely check out this new tool.
Next piece of content- "PGOSM FLEX FOR PRODUCTION OpenStreetMap DATA". This is from rustprooflabs.com. And this is a project that helps you load OpenStreetMap sets into PostGIS. And that's probably the extent of my knowledge. If you want to learn more about that, you can definitely check out this blog post.
Next piece of content. There was another episode of Postgres FM last week, this one was on "Queues in Postgres". So they discuss the pros and cons of, I guess, doing queuing in Postgres versus dedicated queuing tools and some tips for scaling. So if you're interested in that, you can listen to the episode or watch the YouTube video as well.
Next piece of content, the PostgreSQL person of the week is Sébastien Lardière. If you are interested in learning more about Sebastian and his contributions to Postgres, definitely check out this blog post.
And the last piece of content, we did have another episode of the Rubber Duck Dev Show this past Thursday afternoon. This one was on "All About Roda With Jeremy Evans". So Roda is a Ruby toolkit. He didn't call it really a framework but a toolkit for building web applications using Ruby. So if that type of content is of interest to you, we welcome you to check out our show.