
Postgres Releases, PostgreSQL Survey, Partitioning vs. Sharding, Bulk Loading | Scaling Postgres 277


In this episode of Scaling Postgres, we discuss new Postgres releases, taking the 2023 State of PostgreSQL survey, partitioning vs. sharding and the fastest way to do bulk loads.

Transcript

All right. I hope you, your friends, family, and coworkers continue to do well. Well, you take a week off and there is a deluge of posts, so there's a lot of content to go over for these past two weeks. But the first piece of content- "PostgreSQL 15.4, 14.9, 13.12, 12.16, 11.21, and PostgreSQL 16 Beta 3 Released!". This is from postgresql.org. These particular releases address two security issues. The first one is that extension script substitutions within quoting allow SQL injection.

The second issue was MERGE failing to enforce UPDATE or SELECT row security policies. So you may want to update for that reason. As well, there have been 40 bug fixes across these releases. The most significant one to be aware of is if you're using BRIN indexes: they have a fix for handling NULLs in BRIN indexes, and it will require you to reindex your BRIN indexes, so keep that in mind if you're going to update to this next version. And then lastly, there is a notice that PostgreSQL 11 reaches end of life on November 9th, 2023. So check this post out if you want to learn more.
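
If you're affected, here's a minimal sketch of the reindex step, with a hypothetical index name, plus a catalog query to find the BRIN indexes in your database:

```sql
-- Rebuild a BRIN index after updating to the fixed minor release.
-- CONCURRENTLY avoids blocking reads and writes (available since
-- Postgres 12); the index name here is hypothetical.
REINDEX INDEX CONCURRENTLY measurements_created_at_brin;

-- Find all BRIN indexes in the current database:
SELECT i.indexrelid::regclass AS index_name
FROM pg_index i
JOIN pg_class c ON c.oid = i.indexrelid
JOIN pg_am am ON am.oid = c.relam
WHERE am.amname = 'brin';
```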

Next piece of content- "The 2023 State of PostgreSQL Survey Is Now Open!". This is from timescale.com. And they did a survey last year where they show the results here but there is a link to take the survey here. So please go ahead and take out the surveys so we can see what the perspective of the community is like this year.

Next piece of content- "PGSQL Phriday #011". This is from pgsqlphriday.com and they announced the new blogging event which has already happened. The event was about partitioning and sharding. So this next post talks about partitioning versus sharding in PostgreSQL from tomasz-gintowt.medium.com. And this is essentially the challenge post about what it should cover. The next post related to the partitioning and sharding event is actually the one I relate to the most, meaning I kind of agree with a lot of what was mentioned here.

And this is "Partitioning: we need it!" from hdombrovskaya.wordpress.com. And she says quote "Like many others, I never did sharding 'for real', that is, in a production environment". And I too have not done that, I've only done partitioning. And just to level set where things are, partitioning is where you take a table and you break it up into multiple tables using typically declarative partitioning in Postgres. So it has a built-in capability to split out tables into multiple child tables.

Sharding is usually where you take a table and you want to separate it, or put certain tables on another database outside of your existing one. So you can basically think of Citus, where you're scaling out to multiple database shards. Essentially, you're sharding your database and putting the data on more than one database server. So she doesn't really have experience with sharding, but in terms of partitioning, she says quote "Whenever I suggest somebody partition their large tables, I must warn them that partitioning will not make queries run faster and that I will do my best to ensure the queries will not slow down...".

And this is my experience as well. So many times, people say they want to partition for query performance, whereas I've found that very hit or miss. Now it's not that queries never get faster, but it's very hit or miss, and sometimes you can make things worse. And this is even if you're still using the key the partitioning is based upon in your queries. So we haven't really figured out why some are better and others are not. I haven't seen anyone report on it or explain why that's the case. But that's been my experience, and we'll look at a post that shows this in detail.

But she says the reason to do it is for "maintenance and archiving", and I totally agree with that. It makes things easier to maintain when you have huge tables that you're working with and you need to archive them. If you're not going to need that data, it's much better to just drop an old partition than to actually try to delete the rows, potentially causing table bloat, index bloat, et cetera. So if you're going to be archiving data, definitely go ahead and partition your tables.
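
Continuing the hypothetical table from the sketch above, archiving by partition looks something like this:

```sql
-- Dropping an old partition is effectively instant and leaves no
-- bloat behind, unlike deleting millions of rows with DELETE.
DROP TABLE measurements_2022_01;

-- Or keep the data around (e.g., to archive it elsewhere first) by
-- detaching it from the parent instead:
ALTER TABLE measurements DETACH PARTITION measurements_2022_02;
```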

And she talks about the importance of changing all your queries on these partitioned tables to include the partition key, so that you can properly exclude the unnecessary partitions and only target the partition you need to get the data from. So like I said, I think this post most closely relates to my experience with partitioning in particular. So definitely feel free to check that out.
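
For example, a query that filters on the partition key lets the planner prune the partitions it doesn't need; this is a minimal sketch reusing the hypothetical table from above:

```sql
-- With a filter on the partition key, the planner only scans the
-- partitions that can contain matching rows.
EXPLAIN
SELECT avg(value)
FROM measurements
WHERE created_at >= '2023-08-01'
  AND created_at <  '2023-09-01';
-- The plan should show only measurements_2023_08 being scanned.
```

In terms of other contributions, softwareandbooze.com did thoughts on PostgreSQL partitioning and sharding.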

And the thing that was interesting about this one is that he mentioned some things that Timescale does. They partition a little differently for their time series data: they don't really use hash or list partitioning, it's basically by date, and they don't call them partitions, they call them chunks, as it says here. He also talked about Citus, which is, of course, the solution for sharding out your database, and about some of the advantages of partitioning, which are basically from a maintenance perspective, in my opinion. Also, some potential gotchas are mentioned in some of these posts.

And those include having to include your partition key in every query, essentially, to get optimum performance. The next one is that you can't use unique indexes, or rather there's no globally unique index for a set of partitioned tables. You can create individual unique indexes on the child tables, but a unique index can't exist within the scope of the whole partitioned table from the parent down unless it includes the partition key. And then thirdly, no concurrent index addition or removal. So basically, there is a process to get indexes added in a concurrent fashion: you add them to each child table concurrently and then attach them to an index on the parent table, as sketched below. But it's not as convenient as just being able to run a single command; there's work involved with it. So those are the downsides that I'm particularly aware of that some people mentioned.
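
Here's a sketch of that documented process, again with hypothetical names:

```sql
-- Build an index on a partitioned table without long locks.

-- 1. Create an (initially invalid) index on the parent only.
CREATE INDEX measurements_value_idx ON ONLY measurements (value);

-- 2. Build the matching index on each child concurrently.
CREATE INDEX CONCURRENTLY measurements_2023_07_value_idx
    ON measurements_2023_07 (value);

-- 3. Attach each child index; once every partition's index is
--    attached, the parent index becomes valid automatically.
ALTER INDEX measurements_value_idx
    ATTACH PARTITION measurements_2023_07_value_idx;
```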

Next post- "Understanding partitioning and sharding and Postgres and Citus". This is from citusdata.com. And they've linked to a very good article. I suggest you check out Brandur's advice on partitioning. They have a link here and they even have a quote that he gave here on using partitioning. So encouraging you to check that out as well as their perspective on sharding because they are the sharding database solution basically. So that's the reason to check out this post.

The next one- "PGSQL Phriday 011- Partitioning vs Sharding in PostgreSQL"  from engineering.adjust.com. This is from Andreas Scherbaum and this gives a real-world perspective on using partitioning and sharding and they actually use both. They don't use a solution like Citus, it seems like they do application sharding. So the application determines where certain data is stored. And they use shards for that purpose but they also partition larger tables within those shards. So you can check out this if you want to learn more about that.

The next post on sharding and partitioning is from andyatkinson.com, where he covers a lot of the same things, but he actually talks about application-level sharding. He has experience in Ruby on Rails, and he has some perspective on how Rails can handle some of this application sharding, since some features were added to Rails for that purpose.

The next related post is "PARTITIONING AND SHARDING IN POSTGRES". This is from pgmustard.com. And he has a very similar perspective when he says, quote "...query performance as one of the major benefits of partitioning. This feels like something I should be on board with, but honestly, I think that faster maintenance tasks is by far the bigger benefit". And I'd even say, similar to the first post we looked at, some of those query optimization benefits can be highly elusive, and maintenance is a bigger reason to do it.

The next post is "Partitioning as a query optimization strategy?". So this isn't part of PGSQL Phriday, but he actually is looking at this question about query optimization and how partitioning can help. And here you can see what I'm talking about when you look at query one which was discussed and query two. And you can see the yellow is the partition table using a partition-wise Join and this is a logarithmic scale of execution time.

So it is faster, and he says maybe about four times faster using partitioned tables compared to non-partitioned ones, which are in blue. But when you look at query two, maybe there's some difference, but it's very subtle. So the results you're going to get depend on the query and the tables you're partitioning. Sometimes it works, and sometimes it doesn't make much of a change at all. So it's just something to be aware of. Oh, and I should say this is from ashutoshpg.blogspot.com.
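
For reference, partition-wise joins are off by default in Postgres; here's a hedged sketch of turning them on, with hypothetical table names:

```sql
-- Partition-wise join lets the planner join matching partitions pair
-- by pair when both tables are partitioned on the join key. It's off
-- by default because the extra planning can be expensive.
SET enable_partitionwise_join = on;

-- Hypothetical example: both tables partitioned by account_id.
EXPLAIN
SELECT *
FROM events e
JOIN accounts a ON a.account_id = e.account_id;
```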

Next post is "Mastering PostgreSQL Table Partitioning". This is from fragland.dev. And this is an overall post talking about partitioning tables. And this post I wouldn't say necessarily helps you master table partitioning but it basically gives you an overview, showing you the different types and how to set them up. So it's more of a quick height how-to pose to get partitioning setup. So you can check out this blog post if you're interested in that.

Next piece of content- "BULK LOAD PERFORMANCE IN POSTGRESQL". This is from cybertec-postgresql.com. And this is a set of experiments. It's actually similar to a video I did, oh my gosh, a number of years ago looking at bulk loading with regard to Ruby. So essentially they did a lot of replication of what I did but added some new things as well but basically the same conclusion. And if you have experience in this, you won't find anything surprising. So we tested six techniques. The first technique was single INSERTs in autocommit mode.

So this is just doing insert statement after insert statement and committing after each one. The second one is single INSERTs in one transaction, so presumably doing a begin transaction, then all your INSERT statements, and then committing at the end. The third one is single INSERTs with a prepared statement in one transaction, so basically doing a PREPARE command and then the EXECUTE command for that statement, all in one transaction. Another is multi-line INSERTs in one transaction, so an INSERT with multiple rows of data in it, as shown here.

Another is doing that multi-line INSERT with a prepared statement. And then the last one is using COPY. And here are the performance results: basically, COPY destroys them all. The single INSERTs with a commit for each one are the slowest, immensely slow, at almost 9,000 seconds. They're followed by single INSERTs in one transaction, then single INSERTs with prepared statements, then the bulk INSERTs, meaning a multi-line INSERT; bulk INSERTs with prepared statements actually take a little bit longer. But definitely, COPY is the way to go if you want to bulk-load data, and failing that, use a multi-line INSERT.
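
Here are minimal sketches of the three fastest techniques; the table, columns, and file path are hypothetical:

```sql
-- Multi-row (multi-line) INSERT in one transaction:
BEGIN;
INSERT INTO items (name, price) VALUES
    ('apple',  1.00),
    ('banana', 0.50),
    ('cherry', 3.25);
COMMIT;

-- Multi-row INSERT via a prepared statement:
PREPARE bulk_insert (text, numeric, text, numeric) AS
    INSERT INTO items (name, price) VALUES ($1, $2), ($3, $4);
EXECUTE bulk_insert('apple', 1.00, 'banana', 0.50);

-- COPY, the clear winner. Note this form reads a file on the
-- database server; client-side loads use \copy in psql instead.
COPY items (name, price) FROM '/tmp/items.csv' WITH (FORMAT csv);
```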

Now related to this is "Optimizing bulk loads in Postgres, and how COPY helps with cache performance". This is from pganalyze.com. And this is the post that Lukas covers this week in his "Five minutes of Postgres". He also talks about cache performance and how COPY actually uses a ring buffer, which is different from the shared buffers that INSERT uses. So it basically helps preserve your shared buffers and doesn't monopolize them if you're doing a very large bulk load with COPY. So basically you can preserve your shared buffers for queries and things like that if you do your bulk loads using COPY. So definitely check out his post if you want to learn more about that.

Next piece of content- "WHY DOES MY PG_WAL KEEP GROWING?". This is from cybertec-postgresql.com. So pg_wal is basically the write-ahead log in that it's the directory that all the WAL files are written to maintain integrity. It is used in case the system needs to be restarted or used as a reference for sending WAL to replicas. And the most important thing in this post is quote "Never manually delete any files in pg_wal" because you will bring your system down. But in terms of what causes the pg_wal to grow, there are a few reasons that can happen. One is that there's a problem with your archiving.

So you have an archive_command set up, something's broken about it, and the files just keep growing in pg_wal because it can't archive them to the destination you've selected. The other thing that can happen is that archiving is too slow. Maybe you're generating more WAL files than your archive_command can process; maybe you're trying to compress them and that compression is slow, or you're trying to send them over the network to S3 or something and that transfer process is slow.

So speeding that up will help keep your pg_wal under control. Next, it could be a replication slot issue. Maybe you have a replication slot where the replica has shut down, or it can't communicate over the network, and those WAL files keep building up because that replication slot won't allow them to be deleted until they've been consumed. So definitely be aware of that situation. And then the last one is that your wal_keep_size is too high. So you've actually told it to keep too many WAL files, and that can cause this type of problem as well. But check out this blog post if you want to learn more.
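
Here's a quick check for each of those causes, sketched with standard catalog views:

```sql
-- Is archiving failing? Compare archived vs. failed counts:
SELECT archived_count, failed_count, last_failed_wal
FROM pg_stat_archiver;

-- Are replication slots holding WAL back? Inactive slots are the
-- usual culprit:
SELECT slot_name, active, restart_lsn
FROM pg_replication_slots;

-- How much WAL have you asked Postgres to retain?
SHOW wal_keep_size;
```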

Next piece of content- "JSON vs JSONB in PostgreSQL". This is from dev.to in the Francesco Tisiot section. This basically covers the difference in Postgres between JSON and JSONB. JSON basically just stores a text representation, a pretty much exact representation of the JSON, it doesn't alter it. Whereas JSONB stores a binary version and adds some alterations to it. Some of those include removing white space, merging duplicate keys, as he says here, and then ordering that the keys could be altered.

Another benefit of JSONB is that you can also index it, which makes it super fast for retrieving information from that JSONB column. And it's also much easier to edit data within a JSONB column. So basically, use JSON if you want to preserve the JSON exactly as it came in and don't want it altered in any way; for pretty much all other use cases, go with JSONB.
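
A small illustration of the difference, plus a hypothetical GIN index for fast containment queries:

```sql
-- JSON preserves the input text exactly; JSONB normalizes it.
SELECT '{"b": 2, "a": 1, "a": 3}'::json;
--> {"b": 2, "a": 1, "a": 3}
SELECT '{"b": 2, "a": 1, "a": 3}'::jsonb;
--> {"a": 3, "b": 2}  (whitespace removed, duplicate "a" merged
--                     keeping the last value, keys reordered)

-- JSONB supports GIN indexing; table and column are hypothetical.
CREATE INDEX events_payload_idx ON events USING gin (payload);
SELECT * FROM events WHERE payload @> '{"type": "login"}';
```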

Next piece of content- "PostgreSQL 16: Playing with pg_stat_io(1)- extends". This is dbi-services.com. This was a great post talking about the new view pg_stat_io coming in 16, he walks through each of the processes about how to understand how it works, how you can look at changing block sizes based upon data inserted into tables, and how to query this to understand what's going on in the file system. So definitely encourage you to check out this blog post if you're interested.

Next piece of content- "Slow & Steady- Database Constraints with Andrew Atkinson''. This is from andyatkinson.com. He's talking about a podcast he was on where he discussed different types of constraints such as Default constraints, Not Null, Primary Key and Foreign Key, Check, and Exclusion constraints. So if you want to learn more about that, you can check out this blog post.

Next piece of content- "The Cache that Wasn't: Optimizing Postgres for SELECT Performance". This is from enterprisedb.com. They're talking about a situation where you want to be able to cache something to give better SELECT performance. So they had a particular set of tables here, a particular query, where they were maxed out at 10,000 transactions per second. They wanted it to go faster so basically, they made an aggregate table updated by triggers. And then essentially creating this cashed table, their performance was able to increase by 2.5. So if you want to learn more about how they did that, definitely check out this blog post.

Next piece of content- "What do you consider Postgres best practices?". This is from hdombrovskaya.wordpress.com. And that's actually posted questions, but the answers are actually in the comments. So I thought this was particularly interesting. So feel free to check this out if you want to learn more about that.

The next piece of content- "When Did Postgres Become Cool?". This is from crunchydata.com. And this is actually more of a history post about Postgres. So it goes through all the different histories. The point at which he felt it became cool, he reveals in the conclusion here, is basically when Heroku made Postgres pretty much its own database. As well as Postgres getting JSON support about the same time. He feels that's the point at which Postgres became cool.

Next piece of content- "Comparing Postgres Performance across Cloud Providers". This is from blog.perfects.engineering. And he actually compared three different providers and did some query tests against them for different levels of transactions and TPS rates. And from the results he showed here, he showed AWS coming on top compared to DigitalOcean and Google Cloud. But then he says that AWS was highly variable and sometimes the results look like this. So whenever I see this, I definitely take it with a grain of salt.

I'm actually a former scientist, so for any type of variability like this, I would like to see standard deviations to determine where the numbers should actually sit. With such high variability, standard deviations across the attempts would give a better idea of what the results could be in real-world situations. So again, you probably want to take this with a grain of salt, but it's some information you may want to check out.

Next piece of content- "Introducing PGMQ: Simple Message Queues built on Postgres". This is from tembo.io. So this is a new extension that has come out. So if you're looking for a message queue, maybe you would like to check this out.

Next piece of content- "'Rusting; a new Postgres C extension that records subtransactions counters". This is from bdrouvot.github.io. And he's released a new extension called pg_subxact_counters. So this does a counter of how many subtransactions are happening within your system. So check this out if you want to learn more.

Next piece of content- "pgvector: Fewer dimensions are better". This is from supabase.com. And they're looking at different embedding models and looking at some benchmarks that have been set up by Hugging Face. And they're comparing different models and what they say is that each model's dimension size has little to no correlation with its performance. So even though you're using higher dimensions, that doesn't mean you're going to get faster results. And what they actually found is that smaller number of dimensions for some of these models actually resulted in greater performance. So if you're using AI, definitely check out this piece of content if you're interested.

Next piece of content- "XKCD Bad Map Projection with PostGIS". This is from crunchydata.com. And I'm not going to attempt to explain this but if you're interested in graphical information systems and Bad Map projections, definitely check out this pretty long post.

Next piece of content- "A Possible Way to Implement a Shift Function in PL/PgSql''. This is from fluca1978.gihub.io. So check this blog post if you're interested in that as well as his part two here.

Next piece of content- "Handling outliers in BRIN indexes with the new multi minmax operator class". This is from pganalyze.com. And we covered that in the previous episode of Scaling Postgres, so this is Lukas's perspective on that post.

Next piece of content: since I took a break, there have been two Postgres FM episodes. The first one is on "High availability", where they cover quote "...what it means, tools and techniques for maximizing it, while going through some of the more common causes of downtime". And their second episode is "Data types". So if you're interested in either of these episodes, you can listen to them or watch them on YouTube.
