TPS Benchmark, Partition-wise Join & Aggregate, Partitioning Billions, Posgres 16 Features

In this episode of Scaling Postgres, we discuss how pgbouncer can impact a TPS benchmark, partition-wise join & aggregate performance, partitioning a table with billions of rows and cool Postgres 16 features.

Content Discussed

YouTube Video

Podcast Audio

Transcript

All right. I hope you, your friends, family, and coworkers continue to do well. Well, I wanted to thank everyone who took the time to respond to my call for recommendations for Scaling Postgres because I was thinking about making some changes to the format of the show. I am always looking for feedback and a number of people did make comments on the video as well as sent thoughts and recommendations via the email list. So, thank you so much for that, I really appreciate it. And the recommendations ranged from, some of what I suggested, makes sense in terms of maybe having some interviews or maybe doing a question of the week type thing.

But then there were also people that said don't change anything. Because they really like the focused format of the show. So, basically, everything you need to know in a week in 10 or 15 minutes or so. So I'm definitely going to give this some thought, I may do a survey as well, but I'm not quite sure yet. But at this point, I'm thinking of keeping the core show as it is except reducing it down to just the top 10 blog posts. But keeping all the links of course for all the content I've found.

And maybe just incorporating a question of the week to get feedback from viewers as well. And if I do interviews, I might follow someone's recommendation to have that as a separate video. That way, I don't have to feel the burden of having to get an interview every week. It will be a kind of special event whenever I get one and I'll post it to the channel. But I'm still thinking about that and I'll provide more information on future episodes. But again, feel free to leave a note in the comments or reply to the email if you have any other thoughts with regard to that.

The first piece of content is "EDB Postgres Advanced Server 15 TPS Benchmark". This is from enterprisedb.com. Now this is their advanced server which, of course, is based on open-source Postgres. So it's definitely not Postgres, the community edition. But what I found interesting is that most of this post was actually talking about PgBouncer and how important it can be. So they are basically analyzing TPS performance with and without PgBouncer and a couple of different configurations. So they have the systems they set up here. They have the example database host and their PgBouncer host. Their configuration information and there are two different pieces of information plotted on each of these graphs.

So basically the blue points go together and the red and green go together. So the dark blue is TPS with PgBouncer. In their test here, you could see it maintained over 750 TPS for the particular test that they're doing. Whereas when they didn't have PgBouncer, the TPS looked like 200 or less. So it was four times faster with PgBouncer. Now you'll notice the top here, it says with connection overhead. So that means they're including the connection overhead with doing these transactions. So it's not just that a connection was established and then they just did transactions, they made the creation of the connection part of this test.

I should also mention that they're running PgBouncer in session mode. Now the red and the green plots are latency. The green is with PgBouncer and the red is without PgBouncer. So as you can see once you get up to a thousand connections, the latency without PgBouncer is huge. Again about four times the latency compared to PgBouncer. This definitely suggests that if you have a lot of connections with the database, PgBouncer definitely eases the burden on the database if you use it. Now this next plot does it without connection overhead. So presumably you have connections that are being maintained and it's just putting transactions through.

And here, the results normalize quite a bit where it may be so close, PgBouncer or not. It may be in the margin of error. The only exception is you have probably three times higher TPS throughput when you're not using PgBouncer going directly to the database. And you only have 50 connections. So at 50, 100, and 152 connections, it looks like not using PgBouncer gives you more throughput. But again, this is assuming you're not including connections with it. But if your application, for example, has its own Pooler and it's going to maintain those connections, then maybe this is something you will see without having a connection overhead.

But definitely an interesting piece of information. And then they just ran the tests for a duration of 10 minutes to see how things changed with connection overhead and without connection to overhead. And there's a little bit of randomness to the results but I didn't see anything else too out of the ordinary with the previous charts. Although that spike without PgBouncer when you're going directly to the database has disappeared after that 10-minute point. So maybe you can't really say that that is a sure thing you were going to find. And then the next chart actually looked at whether you should have PgBouncer on the database host or on its own server.

Now without knowing the CPU utilization, I'm not sure we can gain too much information from this because of how loaded the system was when PgBouncer was running on the database host. Clearly, on its own system, it's not going to run into contention with the database. But they plot those results here. So definitely running PgBouncer, even in a session-based mode, allows much higher transaction throughput seemingly due to the overhead of connecting to the database a lot. So if your application has that type of profile, you should definitely consider using a connection Pooler to help out and give you better transaction throughput.

Next piece of content- "How partition-wise joins and aggregates improve query performance". This is from pganalyzed.com. And this is the post we did cover, I think, two weeks back on Scaling Postgres. That was the episode in which I went through 24+ different blog posts. And this particular one that Lukas has covered, I blew through super quickly. But he goes into much better detail about what this post was covering. Specifically, the performance improvements that they were looking at were enabling partition-wise joins. And as Lukas says, this is disabled by default. Because when you enable it, you incur a relatively high impact on the planning time.

So if you have a very short query that doesn't need to do a lot of calculations, it seems the Postgres team has said "Well, let's not introduce this as a burden'. But it's an option you can turn on. He shows how you can turn it on basically per session and then run the queries that have that benefit from it. So overall, your query performance will be better. The planning time will be increased but the execution time, hopefully, will be reduced. But Lukas goes into a lot more detail about the blog post. Also some other information with regard to partition-wise joins and aggregates and I definitely recommend you check this out.

Next piece of content- "Presenting 'Partitioning Billions of Rows' at SFPUB August 2023". This is from andyatkinson.com. This is a little over an hour of YouTube video where he did a presentation for the San Francisco PostgreSQL users group and talked about his experience at his company. Partitioning a billion-row table while avoiding downtime of the application. Now for this particular one, they were using pgslice which is a utility that has some tools for people to do partitioning without downtime. The caveat with it from his presentation is that it only really works for a pend-only table.

So if your table is receiving deletes and updates, you can't really use this particular technique. This tool is not the best option to do your live migration partitioning. And I found it interesting, that at about the 20-minute mark, he makes the comment that so many people say that the reason they were partitioning was for better data management. So they were retaining a lot of data and they wanted to start only retaining, say, two years' worth of data. So again, it's much easier if you partition your data into multiple tables to be able to drop the oldest partition periodically.

So they were partitioning monthly. And then at the end of every month after two years, so the 25th month, got removed from the database by just being able to drop the table. But he did say he hoped for better performance.But that wasn't the main goal or the reason that they were partitioning. And what I thought was really great about this talk is that he talks about the successes but also the issues that they ran into because a lot of times, those have a much higher benefit for learning. But definitely encourage you to check this out if you're interested.

Next piece of content- "POSTGRES V16: 14 COOL NEW FEATURES". This is from cybertec-postgresql.com. And we'll just run through the features. The first one is everybody's favorite- you no longer need an alias for subqueries and FROM. So I definitely agree, I'm looking forward to that although it's mostly just an irritation. I usually just put an X at the end of it if I ever realize that's the issue. Two is improved performance for vacuum freezing. So it looks like 16 will be able to freeze all rows in a block. So that should lead to some better performance. Three is a new system view "pg_stat_io".

We've talked about this a lot in previous episodes of Scaling Postgres on better IO tracking with regard to what's going on with your database system. Four is a new predefined role "pg_maintain". So basically now you can easily add someone to this group to allow them to do vacuums or analyze, refresh materialized view, or do reindexes. Five is to make the CREATEROLE property of a user secure. So basically, it was easy for the user to become a super user with this capability. But now this has been adjusted so you can't do that. Six is first tentative support for direct IO via the debug IO direct parameter.

So this is kind of on the path to direct I/O via the "debug_io_direct" parameter. So this is sort of on the path to direct I/O. I guess it's a little cool but I don't think there's anything that a general user would use with this in version 16. Seven is regular expression matching for database enrolled names in "pg_hba.conf". That's a great feature. Eight is enhanced support for ICU correlations. So they continue to add more capabilities for more easily using ICU coalitions and the support just keeps growing with that.

Nine, streaming replication standby servers can now be logical replication publisher servers. So this just allows you greater flexibility in building out your logical application infrastructure. Ten, logical replication now allows cycles with disabled statements. So apparently, this is on the road to achieving some type of multi-master replication. It enables you to disable sending statements from logical replication to prevent, I guess, circular logic which they call an echo chamber problem. So that's interesting.

Eleven, in COPY FROM, you can specify a string that makes PostgreSQL insert the column default value, so that's good. Twelve, specify entity constants in hex, octal, or binary. Thirteen, allow the client library "libpq" to specify acceptable authentication methods. That's great for security. And then fourteen, use the lz4 and zstd compression methods for "pg_dump". So that's awesome, that's great. But if you want to learn more about that, definitely check out this blog post.

Next piece of content. There was another episode of Postgres FM last week. This one was on "Self-managing". So basically, this is what you need to be aware of when you're managing your own system. So you're not using RDS, you basically have brought up a server or you have a local server and you want to install Postgres on it. How do you manage it? So they discuss the practicalities of doing so as well as some of the managed service style tooling available. They have a whole recommendation down here of different ones. So you can listen to it appear or you can watch the YouTube video here.

Next piece of content- "PGSQL Phriday #012: What Excites You About PostgreSQL 16". This is from softwareandbooze.com and it looks like the next PGSQL Monthly blogging event is on Postgres 16. So check out this blog post and the guidance with regard to blogging for this event.

Next piece of content- "Introducing pg_later: Asynchronous Queries for Postgres, Inspired by Snowflake". This is from tembo.io. And this is a new extension that enables you to send a query to the database and the database processes it asynchronously. It says it uses PGMQ, which is a messaging queue system as a part of it as well. So it starts a background worker, produces the results and stores that in a JSONB column so it can be retrieved. They have an example of using it here. First, you install the extension, of course, and then you initialize it to create a job queue and some metadata tables. And then you queue the job you want to run using the syntax here. Later when it's complete, you can use this command to fetch the results back. And they have a whole set of features they're looking to potentially build out for this as well. So if that's of interest to you, definitely check out this blog post.

Next piece of content- "Postgres Subquery Powertools: CTEs Materialized Views, Window Functions, and LATERAL Join". This is from crunchydata.com. And really, these are all different tools I would say help you build subsets of data. So the first thing they cover is what a subquery or what a sub-select is. It's basically a select statement that you put within parameters, typically in your WHERE statement. So here you're selecting particular skews from the products table and you want to display product orders where the skew is in that list. Now you can also achieve this through Joins as well and sometimes, as we've seen in previous episodes, those can actually be more efficient depending on what you're doing in the subquery.

Next they cover Postgres VIEW. And a VIEW is basically a virtual table, it doesn't really exist. But it is the definition of a query that you can run, essentially calling on that virtual table. So if you can think of it kind of like doing a subquery-type role. And I should mention that each of these has its own perspective on when you would use these particular tools. Next is a materialized view. So as opposed to being virtual, it's an actual table. So you create this materialized VIEW. The benefit is that now you can add indexes to it. And you can refresh it on a periodic basis as they showed the command to do that here. Next is a common table expression. So these are labs like subqueries but it looks more like you're defining a function.

So with huge_savings, you could think of this as a function and then it's called down here to produce the results of that function. So I think, for this reason, a lot of programmers really like using CTEs. The next one is recursive CTE. So these are really beneficial when you are looking to deal with hierarchical data. So you can recursively have a CTE call itself to produce the data that you need. Next is a window function. So basically this is a function that operates on a specific window of data. So only the row above and below each row do a particular calculation based upon it. And this one here they're just doing a summary by the SKU. The last one is a LATERAL Join and as I like to think of it, it's a four-loop that runs for each row you're pulling back. So for each row, it's going to run this query against it to give you whatever calculation or whatever information you're looking for. So if you want to learn more about this definitely check out this blog post.

Next piece of content- "The Stare of the Database Landscape". This is from softwareandbooze.com. And this is another database survey, but it's not Postgres-specific. It's for all types of databases. So if you want to participate in this one, the link to the survey is right here.

And the last piece of content is "Configuring PostgreSql and LDAP Using StartTLS". This is from percona.com. And they're talking about how you can configure your Postgres server to authenticate with an LDAP service. Specifically, use TLS to secure that communication between Postgres and the LDAP server. So if you want to learn more about that, definitely check out this blog post.

TPS Benchmark, Partition-wise Join & Aggregate, Partitioning Billions, Posgres 16 Features | Scaling Postgres 279

Join Over 1,000 Engineers & Get New Episodes Weekly!

Content Discussed

YouTube Video

Podcast Audio

Transcript