The Worst Part, Pluggable Storage, A Busy System, Data Architecture | Scaling Postgres 264


In this episode of Scaling Postgres, we discuss the worst part of Postgres, the status of pluggable storage, what makes a busy system and methods for managing different data architectures.



All right. I hope you, your friends, family, and coworkers continue to do well. Our first piece of content is “The Part of PostgreSQL We Hate the Most”. This is from And I believe this might be a follow-on post to the one covered last week arguing that MVCC wasn't so bad. Whereas from their perspective, it's how Postgres does MVCC that they hate the most.

So they go into a lot of depth on design principles and how Postgres is designed, and do a bit of comparison with how MySQL and Oracle do it. Basically, Postgres does some things that those other two engines don't, which can result in your database being larger than it would be on those systems, as well as potentially cause some performance problems.

So they first describe what multi-version concurrency control is, at least how Postgres does it. Basically, when you do an update, or some other modification that's not a plain insert, Postgres creates an entirely new row in the table. It copies all the data into a new row, making the necessary modifications, and that is the new version of the row. Now the downside of this is that you could have a thousand columns in that table and only change one value in one column.

Well, now you've done all of that duplication. Other engines simply make the change in place and store a delta of what was changed; that's how they keep their multiple versions. They also talk a little bit about the version chains that are followed to find the different row versions that exist. And of course, with all of these new rows being created, you need a way to eliminate the old ones, and that is the purpose of vacuum.
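You can actually see this row-copying behavior for yourself by watching a row's hidden system columns across an update. This is a minimal sketch; the mvcc_demo table is made up:

```sql
-- After the UPDATE, the row's physical location (ctid) and creating
-- transaction ID (xmin) both change, showing that Postgres wrote a
-- brand-new row version rather than modifying the row in place.
CREATE TABLE mvcc_demo (id int PRIMARY KEY, val text);
INSERT INTO mvcc_demo VALUES (1, 'original');

SELECT ctid, xmin, val FROM mvcc_demo;   -- e.g. ctid = (0,1)
UPDATE mvcc_demo SET val = 'changed' WHERE id = 1;
SELECT ctid, xmin, val FROM mvcc_demo;   -- e.g. ctid = (0,2), new xmin
```

The old version at the original ctid is now a dead tuple waiting for vacuum.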

And then they go into the section about why PostgreSQL’s MVCC is the worst. One reason is version copying: copying the entire row as opposed to just deltas, like I mentioned before. This results in table bloat, where you have all of these dead rows in the database, and if it's not taken care of, that can impact things like full table scans.

And even after you vacuum those rows, it doesn't reclaim space, so your data files remain much larger. Now, you can recreate indexes to make those smaller, but for the data files themselves, like they say here, the only option is to basically rewrite the whole file. You can do that with VACUUM FULL, but that locks the whole table for reads and writes, or you could use the pg_repack extension to do it.
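In command form, the two rewrite options look roughly like this (my_table and mydb are placeholder names, and pg_repack must already be installed as an extension in the database):

```shell
# VACUUM FULL rewrites the table into a new, compacted file, but it
# takes an exclusive lock, blocking reads and writes for the duration:
psql -d mydb -c 'VACUUM FULL my_table;'

# pg_repack rebuilds the table online, holding only brief locks, at the
# cost of roughly double the disk space while it runs:
pg_repack --table=my_table mydb
```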

But again, you need a ton of disk space to be able to do that work. And then the third problem is secondary index maintenance. So all of these different versions have indexes pointing to them as well. And they point out this is part of the reason that Uber in 2016 switched from Postgres to MySQL, quote: “Their write-heavy workload was experiencing significant performance problems on tables with many secondary indexes”. So these are the non-primary key indexes.

And the fourth problem is vacuum management, and tuning it appropriately to make sure you don't have tons of dead rows in your tables. So this was a very well-written post and I really liked the perspective they gave. They didn't really offer any solutions; they just stated that this is essentially the state of things. But they did mention something that I covered last week.

What I brought up last week when covering the MVCC article was zheap. And here they say, quote, “There was an attempt to modernize PostgreSQL’s version storage implementation”, so as to make some of these problems better. “EnterpriseDB started the zheap project in 2013 to replace the append-only storage engine to use delta versions. Unfortunately the last official update was in 2021, and to the best of our knowledge the effort has fizzled out”. So that's the perspective of this post, and again I highly encourage you to check it out.

But the next post on this topic is “Next steps for Postgres pluggable storage”. This is from And they're covering this issue as well, from the perspective of pluggable storage: the APIs that allow different storage engines for Postgres, which is what zheap was ultimately going to be, another storage engine that you could choose for your tables.

And they mention the current progress for pluggable storage: version 12 introduced basic support for it, with the goal of adding zheap, so there's the zheap reference again. He says, quote, “ZHeap was released with promising results providing an undo log and solving some long-term table bloat issues in Postgres”. But like the last post, he says it quote “appears that the work on ZHeap and the table access method (TAM) API has stalled”.
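As a quick illustration of where this API surfaces in SQL (the zheap line is hypothetical, since no such access method actually ships with Postgres):

```sql
-- Since Postgres 12, a table can name its table access method;
-- "heap" is the built-in default:
CREATE TABLE events (id bigint, payload text) USING heap;

-- An alternative engine, if one were installed, would be chosen
-- the same way, e.g.:
-- CREATE TABLE events (id bigint, payload text) USING zheap;

-- List the table access methods available in this cluster:
SELECT amname FROM pg_am WHERE amtype = 't';
```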

So there are some current issues that need to be addressed before pluggable storage can be as good as, if not better than, the current storage technique. Now, one interesting thing they mention here is that OrioleDB started as a fork of Postgres to address many of what they perceive as Postgres's issues, and they have been quietly upstreaming code that would make pluggable storage a reality.

So they say that in Postgres 14, their patch set was 5,000 lines of code, representing the total number of changes required in their fork. For the upcoming Postgres 16 release, it is 2,000 lines of code. So basically, 60% of the code changes needed to achieve what they're trying to do with OrioleDB have already made it into the Postgres core. And they say, quote, “(Disclosure: Supabase started sponsoring OrioleDB in 2022.)”.

So I really like hearing this and hopefully different members of the community will continue to push forward and make a revamped storage system or at least storage option available. But I highly encourage you to check out both of these pieces of content.

Next piece of content- “What does ‘a busy system’ mean?”. This is from And she's talking about working with a database system that gets busier and busier, with the developers needing more and more connections. Well, now you may be running out of memory. And then the DBA looks at the database and essentially says, well, you've got all these idle connections, what's going on?

And she says quote “The first question we should ask is, 'What makes your system busy?’”. So she said she actually talked to their application developers, showing them the percentage of idle sessions and the resource consumption. And they dropped the average number of connections from 2,700 to a little over 1,000. And she kept insisting they continue to work on this.

And she says “...after one more week of digging into the application code, the average number of connections was reduced to sixty!”. So they went from 2,700 open connections to 60. So clearly, as she talks about here, it's an application architecture problem. Or there's something in the code causing this problem. And she says something down here that's pretty interesting- “tuning system parameters may improve performance up to ten times,...”.

I frankly usually say two times, or three times if you're lucky. “...tuning individual queries may make everything run ten times faster,...”. That's possible, I think, if the database system is being bogged down by certain queries. I mean, I've definitely seen something like that, and you can definitely get individual queries a thousand times faster, but overall that may result in a ten times faster system.

“And in optimizing the system architecture can make things run hundreds, up to thousands of times faster!”. So definitely it's clear to say there's always room for optimization. And you don't just necessarily have to keep getting bigger servers and more memory and more connections, et cetera.

Next piece of content- “Thinking Fast vs. Slow with Your Data in Postgres”. This is from And this is basically talking about setting up a better data architecture, kind of alluding to the previous post.

And he talks about some common workloads for Postgres, which are basically: the primary system of record, primarily an OLTP-based system; metrics or analytics type data that you're storing and want to do analysis of; and then logging and event-related data. So he thinks there are different ways you can set up each of these three use cases.

The primary system of record, or the OLTP system, typically has a smaller working set of data that often fits into memory. You have small, fast queries with single-digit-millisecond or less response times. And you ensure data consistency using primary keys, foreign keys, and data validity constraints.

The second is metrics or analytics type systems, and generally what you'll see is consistently growing data volumes over time. The large amount of data no longer fits into memory. Customer interactions require responsive queries, and older data is not read as frequently as new data. In terms of setting these up and maturing, typically you start with raw inserts, then move to multi-row inserts, and lastly micro-batch inserts using COPY as opposed to individual INSERT commands.
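The three ingestion steps he describes might look like this for a hypothetical metrics table:

```sql
-- 1. Raw single-row inserts: simplest, but one round trip per row.
INSERT INTO metrics (ts, device_id, reading)
VALUES (now(), 1, 9.7);

-- 2. Multi-row inserts: amortize round trips and parsing overhead.
INSERT INTO metrics (ts, device_id, reading)
VALUES (now(), 1, 9.7),
       (now(), 2, 3.2),
       (now(), 3, 5.5);

-- 3. Micro-batches via COPY: the fastest bulk-loading path.
COPY metrics (ts, device_id, reading) FROM STDIN WITH (FORMAT csv);
```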

Now, that's the ingestion, or insert, side. In terms of querying better, the first step is to start with partitioning, then move on to a limited number of targeted indexes, and thirdly, create roll-ups, or summary data, to be able to query. And lastly, talking about logging and event tracking, which is usually used for internal purposes. You do ingest data much like the metrics and analytics case.

It's definitely a large amount of data and it won't fit in memory, but your query response times can be a bit slower. So you don't need to be super fast with it; you just need to efficiently store all of that data. But definitely an interesting post with some recommendations on how you can handle different types of data architectures in Postgres.
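The query-side maturity steps for the metrics case (partitioning, targeted indexes, roll-ups) could be sketched like this, with made-up table names:

```sql
-- 1. Partition growing time-series data by time range:
CREATE TABLE events (ts timestamptz NOT NULL, payload text)
  PARTITION BY RANGE (ts);
CREATE TABLE events_2023_05 PARTITION OF events
  FOR VALUES FROM ('2023-05-01') TO ('2023-06-01');

-- 2. Keep a limited number of targeted indexes:
CREATE INDEX ON events (ts);

-- 3. Roll up older data into summary tables that are cheap to query:
CREATE TABLE events_daily AS
  SELECT date_trunc('day', ts) AS day, count(*) AS event_count
  FROM events
  GROUP BY 1;
```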

Next piece of content- “Running PostgreSQL on two ports”. This is from Someone had a situation where some clients could connect with TLS, whereas some could not, and they wanted to separate them by having them connect to the database using two different ports. Well, as he says here, “Postgres can only listen on one port”.

So what do you do about that? Well, the recommendation here is that you could set up PgBouncer with TLS turned on. Then the clients that have TLS configured can connect to PgBouncer, as opposed to the actual Postgres database, and the other clients can use the unencrypted port. They also talked about using stunnel to listen for TLS connections and route them to PostgreSQL.
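A minimal sketch of what the PgBouncer approach could look like; the ports, file paths, and database name here are made-up examples:

```ini
; PgBouncer listens on its own port and requires TLS from clients,
; while Postgres keeps serving plain connections on 5432.
[databases]
mydb = host=127.0.0.1 port=5432

[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
client_tls_sslmode = require
client_tls_cert_file = /etc/pgbouncer/server.crt
client_tls_key_file = /etc/pgbouncer/server.key
auth_type = scram-sha-256
auth_file = /etc/pgbouncer/userlist.txt
```

TLS-capable clients point at port 6432; everyone else keeps using 5432 directly.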

And actually, I have a third option. If these clients are separated by IP address space, you could configure different incoming IP address ranges to require a TLS connection or allow a non-TLS connection. So that's another way to handle this as well. But if you have a need to do this, you can check out this blog post.
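A sketch of that pg_hba.conf approach, with made-up address ranges: clients in one range must use TLS (hostssl), while the other range may connect unencrypted (hostnossl):

```
# TYPE      DATABASE  USER  ADDRESS        METHOD
hostssl     all       all   10.0.1.0/24    scram-sha-256
hostnossl   all       all   10.0.2.0/24    scram-sha-256
```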

Next piece of content- “DOCKER AND SUDDEN DEATH FOR POSTGRESQL”. This is from And they're talking about a situation where they were working with a client, and suddenly PostgreSQL in a Docker container was crashing, saying the server process was terminated by signal 13: broken pipe. So they're like, that's strange.

A broken pipe signal shouldn't be crashing Postgres. So this blog post goes into detail about how they look through and try to identify what the problem was. And basically what happened was that they set up their Docker container using a nonstandard way of doing it. So they set it up themselves and they set up the Postgres postmaster as the first process to get started.

So it starts with a process ID of one. Now, the problem with that is that any process which loses its parent gets PID 1 as its new parent. They actually had a psql job that would start up periodically, and it got reparented to the postmaster process; it then ran into issues, basically dying with signal 13. The postmaster monitors its child processes and takes Postgres into recovery mode if any of them fail, and so this failure brought down all of Postgres.

So a failure in a psql process, completely unrelated to the server processes and associated with them only because it lost its parent, crashes the whole database. In terms of avoiding this problem, they say, quote, “The official PostgreSQL docker image does it right”. And it does it right by having another process be the one that starts the container, not PostgreSQL.

So basically the postmaster process won't inherit processes that have lost their parents. And another recommendation is that you shouldn't be starting PSQL processes within the container itself; a container should basically be like a closed unit, as they say here.

It's only running that service and that's the only thing it does. So you could start a PSQL process outside of the container and connect to it for doing work but it shouldn't be done within the container. But if you want more information about this you can check out this blog post.
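To make the "closed unit" recommendation concrete, here is a minimal sketch (the image tag, password, and connection parameters are all made-up examples) of starting the official image and running psql from the host rather than inside the container:

```shell
# Start the official image; its entrypoint handles process startup
# so the container isn't set up in a nonstandard way:
docker run -d --name pg -p 5432:5432 \
  -e POSTGRES_PASSWORD=secret postgres:15

# Run psql from outside the container, connecting over the network,
# rather than spawning psql processes inside it:
psql "host=localhost port=5432 user=postgres password=secret" \
  -c 'SELECT 1;'
```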

Next piece of content- “FORCING A JOIN ORDER IN POSTGRESQL”. This is from And what they're talking about here is that basically Postgres doesn't have query hints. So the planner assesses the SQL it's received and it reorders and structures things to try and find the most efficient path to return your data.

So it's not going to maintain the table order for the joins that you have specified; it will frequently reorder them. So how can you get Postgres to do what you want? Now, there is an extension called pg_hint_plan. So basically, you could install this extension and use query hints, apparently in the style of Oracle.

But he offers some other techniques you can use. The first technique is using OFFSET 0 to enforce the join order. Now, this seems quite unusual to me and I don't know if I'd recommend it, and as they say here, quote, “It would not be hard to teach the optimizer to ignore this ‘useless’ clause,...”, which would disable this trick, or this hack as it were. The next method is using a common table expression to force the join order.

So basically, you use a CTE and you materialize it. I personally like this method better if you have a need to do this. And the other option is to lower join_collapse_limit: if you drop it down to one, the planner will not flatten your joins together and will follow the join order that you've specified. And it looks like you can do this at the session level, so it applies within a given session but not to other sessions on your database.

And they do have a recommendation, quote, to “run the query in an explicit READ ONLY transaction and use SET LOCAL to change the parameter,...”. That way, it reverts to its previous setting as soon as the transaction is done. But if you want to find out more about ways to give hints to the planner, you can definitely check out this blog post.
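Here's a rough sketch of the CTE and join_collapse_limit techniques, using made-up table names:

```sql
-- MATERIALIZED forces the CTE to be computed first, fixing that part
-- of the join order:
WITH recent_orders AS MATERIALIZED (
  SELECT * FROM orders WHERE created_at > now() - interval '1 day'
)
SELECT *
FROM recent_orders
JOIN customers USING (customer_id);

-- SET LOCAL inside a READ ONLY transaction keeps the planner from
-- reordering joins for this query only, and the setting reverts
-- automatically at COMMIT:
BEGIN READ ONLY;
SET LOCAL join_collapse_limit = 1;
SELECT *
FROM a
JOIN b ON a.id = b.a_id
JOIN c ON b.id = c.b_id;
COMMIT;
```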

Next piece of content- “Waiting for Postgres 16: Logical decoding on standbys”. This is from And in this episode of “Five minutes of Postgres”, Lukas discusses the post we talked about last week, where Postgres 16 looks like it will have the ability to do logical decoding on a standby. This also includes doing logical replication from the standby as well. So if you want to learn more about his perspective, you can check out this blog post.

Next piece of content- “Understanding PostgreSQL Parallel Query”. This is from And this explains, in a fair amount of detail, how PostgreSQL does its parallel querying and the different parameters that you can change to affect it.
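For reference, a few of the parameters involved look like this (the values here are purely illustrative, and big_table is a made-up name):

```sql
-- Session-settable knobs that affect parallel plans:
SET max_parallel_workers_per_gather = 4;   -- workers per Gather node
SET min_parallel_table_scan_size = '8MB';  -- size threshold for
                                           -- considering a parallel scan

-- max_worker_processes caps all background workers and must be set in
-- postgresql.conf (it requires a restart to change).

-- EXPLAIN shows whether a plan uses a Gather node with workers:
EXPLAIN SELECT count(*) FROM big_table;
```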

Next piece of content- “Postgres 15 on Encrypted ZFS and Ubuntu 22.04 LTS”. This is from And he said this is his first time setting up Postgres, so maybe more of the value of this post comes from setting up the ZFS pool. He goes into detail on how he sets it up in an encrypted fashion using lz4 compression, as well as installing Postgres.

Now, the only thing to be cautious of, which I have seen with some clients, is that at least the compression side of ZFS, and probably encryption as well, can really slow down some database operations. So just be aware of that, or test it as you're evaluating it.

Next piece of content. There was another episode of Postgres FM this week. This one was on “auto_explain”. So they talked about what it is, how it can help, and how to check its overhead. You can listen to the episode here or watch it on YouTube here.

And the last piece of content, the PostgreSQL person of the week is Hécate. If you're interested in learning more about Hécate and his contributions to Postgres, definitely check out this blog post.