Rust for Extensions, Timescale vs. Postgres, Uninterrupted Sharding, Data Flow | Scaling Postgres 234

In this episode of Scaling Postgres, we discuss using Rust for Postgres extensions, performance comparisons of TimescaleDB vs. Postgres, uninterrupted writes when sharding in Citus, and the Postgres data flow.

Transcript

All right. I hope you, your friends, family, and coworkers continue to do well. Our first piece of content is "PostgresML is Moving to Rust for our 2.0 Release". This is from postgresml.org, and I'm not that familiar with machine learning, but what I did find interesting about this post is that this is an extension for Postgres, and they're moving it to Rust from, I believe, Python. They chose this route because they did some testing to see how fast a dot product could be computed. If you look here, the performance of Rust compared to Python, PL/pgSQL, or SQL was astounding, and they show the different test scenarios they went through to come up with this. If you're curious, BLAS stands for Basic Linear Algebra Subprograms. Based on this performance, they went ahead and rewrote their extension in Rust as opposed to Python, and they are using a framework that I've covered in previous episodes of Scaling Postgres.

They're using pgx, which is a Rust framework and a great way to build, package up, and deploy Postgres extensions. But look at these performance improvements from version one to version two of this extension. You can hardly see the bars in this chart for version one, whereas the performance gain in 2.0 from this transition to Rust is immense. And that carries through on a lot of these other graphs as well. The improvement is quite remarkable. So if you use this extension, it's definitely something to look at. And if you're considering writing extensions, maybe you would want to write them in Rust too, because some of the functions you rely on may perform better. But if you're interested, definitely check out this blog post.
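
As a rough illustration of the kind of baseline they benchmarked against, here's what a naive dot product looks like in plain SQL. The array values are just made-up samples, not numbers from their benchmark:

    -- Naive dot product of two float8 arrays in plain SQL;
    -- unnest() with two arrays pairs up elements by position.
    SELECT sum(a * b) AS dot_product
    FROM unnest(
        ARRAY[1.0, 2.0, 3.0]::float8[],
        ARRAY[4.0, 5.0, 6.0]::float8[]
    ) AS t(a, b);
    -- Returns 32 (1*4 + 2*5 + 3*6). Compiled Rust does the same math
    -- far faster, which is the whole point of their comparison.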

Next piece of content- "PostgreSQL + TimescaleDB: 1,000x Faster Queries, 90% Data Compression, and Much More". This is from timescale.com and part of this I would call a marketing post for TimescaleDB, but it does have performance comparisons between stock Postgres and Postgres with the Timescale extension. And frankly, I would prefer to go this route vs. some other solution, you can only get in a hosted environment like AWS or Google maybe. I like the fact that they're using stock Postgres and developing Timescale as an extension to it. And I would say the other post that we're going to look at that does it the same way is Citus. They are using stock Postgres and they have an extension citus that allows them to do essentially their scale out. But it should be emphasized that this is for time series queries, this immense performance improvement over stock Postgres.

And it's not for all queries. One of them got up to 1,000x faster, while others saw smaller gains, and we'll take a look at that in a second. Now, they achieved this through a number of enhancements they've added to this extension. One is handling continuous aggregates, as opposed to doing something like a materialized view. Another is their compression, which means less data transfer, with up to a 90% or 95% compression ratio on certain data. Part of that may be due to the columnar format they can utilize, but if you're able to compress your data down sufficiently so that you're writing and reading a lot less from the disk, that'll give you better performance in and of itself. Now, I should say a lot of these features are optimized for time-series queries, so think of append-only data with a timestamp. If your data doesn't follow this pattern, maybe some of these features wouldn't be of benefit to you. And here's where their query latency deep dive is.

And you can see one of them was a 1,000-fold improvement, but there were also some 4x, 5x, and 7x performance improvements, so not everything was a thousandfold faster. One query was, but they found relative improvements across the board. Now, what I really found interesting, if you scroll down a little bit further, is the section on their ingestion rate. Timescale had done a post on this a number of years ago using a different version of Postgres, and it showed the ingestion rate of Postgres dropping off a cliff at a certain point, whereas now it seems relatively consistent. Postgres is showing a little bit less throughput than Timescale, but that's fine. Basically, it shows that Postgres eventually gets around to improving and coming close to matching the performance of a lot of these other add-on solutions like Timescale or Citus.

So presumably, they see what the market is doing and eventually move toward it. It's just that with open-source software, it takes a little bit longer than a single company can manage on its own, but I think this is expected. So if you need this kind of raw performance today, go ahead and get Timescale, Citus, or another solution. Eventually, it seems, the open-source version of Postgres will catch up to them in various areas. But if you're interested in learning more about that, definitely check out this blog post.
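
For a rough idea of what the continuous aggregate and compression features mentioned above look like at the SQL level, here's a minimal sketch using TimescaleDB's documented APIs. The conditions table and its columns are made up for illustration:

    -- A plain Postgres table for time-series readings.
    CREATE TABLE conditions (
        time        timestamptz      NOT NULL,
        device_id   int              NOT NULL,
        temperature double precision
    );

    -- Turn it into a hypertable partitioned by time.
    SELECT create_hypertable('conditions', 'time');

    -- A continuous aggregate: an incrementally maintained hourly rollup.
    CREATE MATERIALIZED VIEW conditions_hourly
    WITH (timescaledb.continuous) AS
    SELECT time_bucket('1 hour', time) AS bucket,
           device_id,
           avg(temperature) AS avg_temp
    FROM conditions
    GROUP BY time_bucket('1 hour', time), device_id;

    -- Enable columnar compression and compress chunks older than a week.
    ALTER TABLE conditions SET (
        timescaledb.compress,
        timescaledb.compress_segmentby = 'device_id'
    );
    SELECT add_compression_policy('conditions', INTERVAL '7 days');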

The next piece of content that I alluded to is "Citus 11.1 shards your Postgres tables without interruption". This is from citusdata.com, and they have added a number of new features to Citus 11.1 that basically allow you to move shards around without blocking writes to the database, so things can still operate normally. You can distribute Postgres tables without blocking writes. So that's basically targeting a table and saying 'Okay, now scale it out to multiple shards'. You can isolate tenants without blocking writes. My assumption with this is that you place a tenant in its own shard. If you have a really, really large tenant that's taking over a particular shard, you can dedicate a shard to it, I believe. The third is increasing the shard count by splitting shards without blocking writes. So again, this is moving shards around transparently, without blocking writes, as the cluster is running.

Next is rebalancing the cluster in the background without having to wait for it. And finally, a big part of what I find appealing about Timescale and Citus is that they're extensions, so it doesn't take them very long to catch up to the most recent version of Postgres. With this release, they are supporting PostgreSQL 15 (beta4), and they mention lower down that they will do a new release of Citus once Postgres 15 is finally released. And as a reminder, this is all open source; all of their features are open source now. Or you can of course get their cloud-hosted solution. Same thing with TimescaleDB, as far as I understand it. But if you want to learn more about either one of these solutions and their enhancements, I definitely encourage you to check out these blog posts.
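
To make the operations above a bit more concrete, here's roughly what they look like in SQL. The table name and tenant value are made up, and the tenant-isolation and background-rebalance function names are as I recall them from the Citus docs, so double-check them against the 11.1 release post:

    -- Distribute an existing table across worker nodes by a column.
    SELECT create_distributed_table('events', 'tenant_id');

    -- Move a single large tenant into its own shard.
    SELECT isolate_tenant_to_new_shard('events', 5);

    -- Kick off a shard rebalance that runs in the background.
    SELECT citus_rebalance_start();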

Next piece of content- "Postgres Data Flow". This is from crunchydata.com and this was a super interesting post where he watched the flow of a statement from the application layer all the way down to essentially the data on the disk and how many layers of caching it goes through. So he mentions all the different ones, but it basically comes down to this really complex diagram that he mentions at the end here, where you have all these different layers that it goes through and at each layer, pretty much, you can encounter caching. So there's browser with potential caching. There are edge, web cache, or proxy servers. There's the web server itself that can cache things.

Then there's the application layer, which can go through an application pooler, then a database pooler like PgBouncer, and then to the individual Postgres backends, which consult the shared buffers in memory. And if there's not a hit there, it could hit the OS cache; Linux uses whatever memory is still available as a disk cache. And then even once you get down to the actual hardware, you can have an on-device disk cache and then, eventually, the physical storage. So this was a great post that breaks down, from a high-level view, all the different levels that statements pass through as they're being processed in Postgres. So if you're interested in that, I definitely recommend checking out this blog post.
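
One practical way to see part of this caching story from inside Postgres is EXPLAIN with the BUFFERS option, which shows whether a query's reads were satisfied from shared buffers or had to come from below Postgres, i.e. the OS cache or the disk. The table name here is just an example:

    -- 'shared hit' blocks came from Postgres's shared buffers;
    -- 'read' blocks came from the OS cache or the disk underneath.
    EXPLAIN (ANALYZE, BUFFERS)
    SELECT count(*) FROM orders;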

The next piece of content- "POSTGRESQL: SEQUENCES VS. INVOICE NUMBERS". This is from cybertec-postgresql.com and they're talking about using sequence numbers potentially for invoice numbers. And their requirements for the invoice numbers are that they're constantly increasing and there are no gaps. Now of course, once you hear no gaps you definitely don't want to use a sequence because sequences frequently have gaps. So how do you avoid the gaps, that's what he covers here. And he goes through several different implementations. The first way he does not recommend doing it is doing an INSERT into the invoice, selecting the max_id from the invoice table, and then adding one to it. That'll give you the next invoice. Now, the issue with this implementation is, as he mentions here, concurrency. So if someone does the SELECT at the same time trying to insert the value, they could potentially insert a duplicate value which will, of course, fail because of the primary key constraint that's on the invoice ID.

So you wouldn't want to use this route unless you're prepared to rescue that error and retry. The next scenario, to avoid that type of problem, is to actually LOCK the table. But this is a super heavy lock, an ACCESS EXCLUSIVE lock, and it's going to hinder reads to the table, so you definitely don't want to do that; it'll definitely slow things down. But he has two ways that are viable solutions to this. The first solution is to use serializable transactions, which ensure that two conflicting transactions can't both commit. Basically, you begin a transaction with the isolation level set to SERIALIZABLE, and then you simply do your INSERT into the invoice table, selecting the max invoice number from that table. You won't end up with duplicates or gaps, although if two of these transactions do conflict, one of them will get a serialization error and needs to be retried.
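
Here's a minimal sketch of that serializable approach, with assumed table and column names:

    CREATE TABLE invoice (
        invoice_number bigint PRIMARY KEY,
        created_at     timestamptz NOT NULL DEFAULT now()
    );

    BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE;
    -- Gapless: the next number is always the current max plus one.
    INSERT INTO invoice (invoice_number)
    SELECT coalesce(max(invoice_number), 0) + 1
    FROM invoice;
    COMMIT;
    -- A concurrent transaction doing the same thing will get a
    -- serialization failure (SQLSTATE 40001) and must be retried.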

But he did mention another alternative, and that's creating a dedicated ID table so that you can minimize locking. You essentially have only one row in this table and you just update its value. Now, you could potentially run into lock contention because there is only one row in this table, so I don't know how that would perform versus serializable transactions, but it's another type of implementation. What you do is use a CTE: you update that value, return the new value, and use it in the INSERT into the actual invoice table. So this could be a viable implementation, although you may have locking issues. I guess my personal opinion is that I'd probably go with serializable transactions if you're looking for something gapless. But if you want to learn more, definitely check out this blog post.
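
And here's a sketch of the dedicated ID table variant, reusing the invoice table from the sketch above; the single counter row is what serializes the writers:

    CREATE TABLE invoice_number (last_value bigint NOT NULL);
    INSERT INTO invoice_number VALUES (0);

    -- The UPDATE takes a row lock on the single counter row, so
    -- concurrent inserts queue up behind each other with no gaps.
    WITH next_number AS (
        UPDATE invoice_number
        SET last_value = last_value + 1
        RETURNING last_value
    )
    INSERT INTO invoice (invoice_number)
    SELECT last_value FROM next_number;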

Next piece of content- "PostgreSQL 14 Internals, Part III". This is the third release of the free PDF book PostgreSQL 14 Internals. This one focuses on, as they see here quote, "...the diverse world of locks...". So if you're interested in getting this type of free content, you can download the ebook from here. And this is from postgrespro.com.

The next piece of content- "Fun with Postgres Functions". This is from crunchydata.com and they're covering a variety of functions that are available in Postgres that you may not be familiar with. He covers an assortment of date functions, so you can do date calculations within the database without having to worry about doing something at the application layer and he goes through several different use cases. He talks about text functions, how to manipulate text and how to combine first names and last names into a full name. Doing an INITCAP for the first letter of a particular name. It talks about aggregate functions and different ways you can aggregate the data to return what you need as well as different math functions. So if you're interested in learning more about that, you can definitely check out this blog post.

Next piece of content- "What is LATERAL, what is it for, and how can one use it?". This is from depesz.com and they're basically talking about using lateral joins and how I think about a lateral join. It's basically an internal for loop you can do to a separate table of the table that you are querying. So for example, he's querying an events table and this is essentially what's happening in an inner for loop. It's counting up how many rows exist for this given value in the separate table he's consulting. So if you want to learn more about that, definitely check out this blog post.

Next piece of content- "Column lists and logical replication publications- an overview of this useful PostgreSQL feature". This is frompostgresl.fastware.com. They're talking about an enhancement coming into Postgres 15 where you can essentially do filtering at the publication level for logical replications. So essentially you can send less data to these subscribers. Now, they talked about the WHERE feature that basically reduces the rows that are sent to a subscriber. This one focuses on the lists of columns that are sent to the subscriber and it's called a column-list. So basically you can define the specific columns that you actually want to send over logical replication for a given publication. So if you want to learn more about this feature, definitely check out this blog post.

The next piece of content- "Pipeline mode in Psycopg". This is from blog.dalibo.com, and they're talking about the release of Psycopg 3.1, which supports the pipeline mode of Postgres. It basically allows you to talk to Postgres in an asynchronous fashion. Normally, when you send a query, the client has to wait until a response comes back, but with pipeline mode, you can send multiple queries over the same connection and the results are returned to you later, when you're ready for them. And they've added this functionality to, essentially, the main Postgres client for Python. They talk about where this is particularly useful: when you're doing UPDATEs, DELETEs, or INSERTs that you're trying to stack up together. If you're doing a SELECT, it may not make that much sense because you need to do something with the data that you get back. However, if you're running multiple SELECTs to try to render some type of page, there could be an advantage to pipeline mode there. But this is a great enhancement, and if you want to learn more about it, definitely check out this blog post.

Related to that is "5mins of Postgres E36: Optimizing Postgres query throughput in Python using pipeline mode and psycopg 3.1". So Lukas covers this post in depth and goes over a lot more detail, so check his episode out as well.

Next piece of content- "Using the Timescale gem with Ruby". This is from ideia.ma, and if the Timescale post and all the performance gains are appealing to you and you're using Ruby or Ruby on Rails, maybe you would like to check out this post to see how you can get started using the Timescale extension for it. And he actually has done some live coding on Twitch and the episodes are here, so definitely check that out if you're interested.

Next piece of content- "How to make Pgpool-II Leader Switchover Seamless on AWS- Updating Route Table". This is from b-peng.blogspot.com. So definitely check out this blog post if you want to use Pgpool II for this purpose.

Next piece of content. There was another episode of Postgres FM. This episode was on "Index maintenance" where they say "...how do we know if or when we need it and what we need to do". You can listen to the episode here or click here to look at the YouTube video.

Next piece of content. The PostgreSQL person of the week is Peter Smith. If you want to learn more about Peter and his contributions to Postgres, definitely check out this blog post.

The last piece of content: we did have another episode of The Rubber Duck Dev Show this past Wednesday evening. This one was on "Rails & Postgres Scaling" with Andrew Atkinson. So if you're looking for additional scaling ideas and you're okay listening to our long-form podcast, we definitely welcome you to check out our show.
