
2 To 4 Times Faster With Vectorization | Scaling Postgres 291


In this episode of Scaling Postgres, we discuss Timescale's performance improvements from adding a query vectorization pipeline, doing blue-green deployments for databases, using reserved_connections, and two improvements to NULL handling.

Content Discussed

YouTube Video

Podcast Audio

Transcript

I don't normally choose a non-open source post for the first article that I cover on Scaling Postgres, but I had to make an exception in this case because I thought the engineering was quite interesting and educational. But before we get into that, I hope you, your friends, family, and coworkers continue to do well. Now, once again, at the end, I'm going to have some more information on my Postgres optimization course that's coming up, so be sure to stay till the end to learn more about that.

But the first piece of content is "Teaching Postgres New Tricks: SIMD Vectorization for Faster Analytical Queries". This is from timescale.com, and what they've done in TimescaleDB 2.12 is add a vectorized query pipeline that makes use of SIMD, or Single Instruction, Multiple Data, vectorization. So let's walk through what that means. Timescale has had columnar storage for a while, and I think in an earlier release this year they added high levels of compression to it as well. They also offer a hybrid solution where you get recent data at the row level, but older data is put in this compressed columnar storage, and they show an example of what they mean by columnar storage. A typical row-based layout on disk looks like this: you have a row with each column of information. But in column-based storage, you store data a column at a time. So all the device IDs are in, say, a dedicated file, and all of the timestamps are in a dedicated file, which makes it much harder to find all the information for one record in this format.

But it's great for doing aggregations, averages, counts, and other calculations that take place over a particular column. Now what's interesting, if you look at it, is that within each column the data type is the same: the device ID column is all one data type, and the same goes for the timestamp column. So if you're going to be doing any kind of operation on a particular column, it will be the same operation on every value. And because of that, they could leverage Single Instruction, Multiple Data. Maybe I should just call this SIMD; if you know what I should call it, put it in the comments, please. CPUs have had SIMD instruction sets since the late 90s, and they were predominantly used for media work with audio and video, or, of course, video games. As opposed to taking two values and conducting an operation to get a new value, the CPU can work with vectors. That's what it means by vectorized: you're converting multiple values into a single vector, or an array of values, and you're doing the same operation on all of those values in parallel.

And the CPU has a way to do this, which was needed for media processing and video games. So they wanted to develop a pathway to use this for their column-stored data, because it's all the same data type and they can conduct the same operation, just in parallel. So they developed this new vectorized query execution pipeline, and they go into more detail about it, which you can read up on if you want to learn more. But they're talking about the different stages. So they have an example query here. Basically, they devised ways to do vectorized decompression of the data, because it's typically already compressed in columnar storage; vectorized filtering, so applying the WHERE clause; vectorized expressions, which covers some of their time_bucket functions; and then vectorized aggregation, which is doing the sum of the value and adding it all up. They had the kinds of results that are shown here. So one query is just selecting the sum of a value from a metrics table. And the other one just adds a filter, a WHERE clause, where the metric is greater than some value.
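To make that concrete, the two benchmark queries are roughly of this shape (a sketch only; the actual table and column names in Timescale's benchmark may differ):

-- Plain vectorized aggregation over the compressed columnar data
SELECT sum(value) FROM metrics;

-- The same aggregation with a filter that can also be applied in vectorized form
SELECT sum(value) FROM metrics WHERE value > 90;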

So in the first example, for the sample of data they had, it took 2 seconds to do this with row-based storage; columnar compression dropped the same query down to 1.4 seconds. So compression did give an advantage, and then vectorizing dropped it all the way down to 353 milliseconds. So going from row-based storage to the vectorized aggregate, it was 5.8x faster, with the vectorized portion accounting for about a 4x speedup. And in the query below, having a filter does change things a bit: the row-based version is actually faster than it was without the filter, but the columnar and vectorized versions are somewhat slower than their unfiltered counterparts. Still, the vectorization was about twice as fast. So I found this super interesting, just the technology involved in doing it. Definitely encourage you to give it a read. And if you have a need for columnar storage and analytical queries, maybe you want to check out Timescale.

Next piece of content. There was another episode of Postgres FM last week. This one was on "Blue-green deployments". So Nikolay and Michael were talking about a blog post on AWS about how RDS and Aurora now offer a blue-green deployment procedure. And they have a link to that post right here. I personally kind of find blue-green to be a little bit of a misnomer, because looking at the blog post, they're essentially doing logical replication to do an upgrade. So basically creating a new instance, getting it logically replicated with the current instance, and then doing a switchover procedure to it, which is what I've done for a number of customers recently to do their upgrades.
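For reference, the core of a logical-replication-based upgrade like that is just a publication on the old instance and a subscription on the new one. Here's a minimal sketch, with made-up instance and object names, and with the schema assumed to already be loaded on the new instance:

-- On the old (blue) instance: publish all tables
CREATE PUBLICATION upgrade_pub FOR ALL TABLES;

-- On the new (green) instance, running the newer Postgres version:
CREATE SUBSCRIPTION upgrade_sub
    CONNECTION 'host=old-primary dbname=app user=replicator'
    PUBLICATION upgrade_pub;

-- Once the subscriber has caught up: stop writes on blue, let replication
-- drain, point the application at green, then drop the subscription.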

But now it looks like RDS and Aurora offer something like that out of the box, and Nikolay and Michael discussed that. They seemed a little bit iffy on the blue-green terminology as well. And I think Michael was also saying that normally you think about blue-green as supporting rollbacks, so if anything has happened, you're able to roll back, whereas with the solution being proposed, that doesn't seem as easy to do. And Nikolay gave some of his perspective on doing logical replication-based upgrades as well. So definitely encourage you to check out their podcast, which you can listen to here, or check out the YouTube video down here.

Next piece of content- "CONNECTION MANAGEMENT IN POSTGRESQL IN POSTGRESQL- RESERVED_CONNECTIONS". This is from cybertec-postgresql.com. They're talking about the new feature in 16 that allows you to reserve additional connections to be able to use the database for everyone else they've run out. So if you look at max_connections, you just join show max_connections and by default, it's set to 100. Well, for the longest time, there have been superuser-reserved connections and the default is three, meaning that the system keeps three connections around at all times to allow a superuser to connect to the database to do whatever they need to do. Now what that means is there are actually only 97 connections available for all other users, but the new feature is reserved_connections.

This allows you to set up reserved connections for users who are not superusers. By default it is set at zero, but if you set it to 10, well, now as opposed to 97 connections being available for everyone else, it will be 87. So you have to subtract both superuser_reserved_connections and reserved_connections from max_connections to see how many are normally available. And they show how to set that up here. Basically, you just create a user, in this case Joe, and then grant the pg_use_reserved_connections role to Joe. So now Joe has the ability to always connect to the database. But check this post out if you want to learn more.
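As a quick sketch of how those pieces fit together (the role name joe just mirrors the example in the post; the rest is standard Postgres 16 syntax):

-- Check the relevant settings
SHOW max_connections;                 -- 100 by default
SHOW superuser_reserved_connections;  -- 3 by default
SHOW reserved_connections;            -- 0 by default, new in Postgres 16

-- Reserve 10 connections for non-superusers that hold the right role
-- (like max_connections, this requires a server restart to take effect)
ALTER SYSTEM SET reserved_connections = 10;

-- Let a specific user take advantage of the reserved slots
CREATE USER joe;
GRANT pg_use_reserved_connections TO joe;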
Next piece of content- "Two Exciting PostgreSQL Features to Improve NULL Handling". This is from pgrs.net. The first thing it covers is NULL values in unique columns and that NULL != NULL. Therefore, a unique column can still have multiple NULL values. But what if you don't want that? Well, in Postgres 15 there's a new feature called UNIQUE NULLS NOT DISTINCT, and that allows only one NULL to be present in the column's values. The next feature is ensuring a single column has a value. Now what he's referring to here is when you have, say, a polymorphic table where you have a notifications table, and there are many types of entities that can have notifications. For example, a company, an order, a user, and ideally you just want one of these references populated for each row.

So this row is a company notification, or this one is an order notification. Now historically, you could add a constraint checking that the company ID is not NULL, or the order ID is not NULL, or the user ID is not NULL. But that just makes sure that at least one is populated; it doesn't prevent multiple ones from being populated, like an order ID and a user ID on the same notification, which in his case he didn't really want. But there are functions called num_nulls and num_nonnulls, and you can create a check constraint using the num_nonnulls function, include the different columns, and say the result must be equal to one. And now you get an error message if you try to insert values in all of these columns, so that row insertion would have failed. So if you want to learn more, definitely check out this blog post.
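Roughly what those two features look like in practice (a sketch; the table and column names here are just illustrative, not the ones from the post):

-- Postgres 15+: allow at most one NULL in a unique column
CREATE TABLE users (
    id    bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    email text UNIQUE NULLS NOT DISTINCT
);

-- Ensure exactly one of the reference columns is populated
CREATE TABLE notifications (
    id         bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    company_id bigint,
    order_id   bigint,
    user_id    bigint,
    CHECK (num_nonnulls(company_id, order_id, user_id) = 1)
);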

Next piece of content- "Top 8 PostgreSQL Extensions You Should Know About". This is from timescale.com and we'll just take a look at the list here. And their first is PostGIS, which I'm a bit surprised by. The number one that I am used to seeing is pg_stat_statements, but PostGIS is definitely one of the most popular extensions. Number two, they do have pg_stat_statements. Number three, they have pgcrypto. So I could see that I haven't used it in a lot of projects. And actually, there was a post maybe not in the last weeks, but a week before last that someone was complaining about pgcrypto. And I think some of the key management capabilities. I know what I tend to do is actually use encryption keys on the application servers and encrypt the data there before storing it in the database.

So I haven't used pgcrypto in my projects. Number four is pg_partman. A lot of people like using this for partitions; I tend to like to roll my own solution. Next is postgres_fdw. Yep, I could definitely see that for being able to access an external Postgres database. Next is pgvector. Definitely, with all the AI startups, this is super popular nowadays. Next is hstore, although ever since JSONB came out, why is anyone still using hstore? I don't know. And eighth is pgpcre, which I've never heard of and never used, but it provides Perl Compatible Regular Expressions. So I think the last two of these are definitely not super popular; maybe it should have been a top six. But one thing I'm surprised by is that it didn't include something like pg_trgm. That's something I know I've used a lot, and other clients have used it as well. But check out this post if you want to learn more about it.

Next piece of content- "GitLab's challenge with Postgres LWLock lock_manager contention". This is from pganalyze.com and this is Lucak's "Five minutes of Postgres", and he's covering a post from about seven months ago where GitLab was having some issues with the lock manager and basically I guess they were doing a lot of joins or had a lot of indexes or maybe even partitions and a lot of activity, which is probably preventing the fast path to happen in query planning and maybe execution. So he goes over their post and some insights that they found out. He even includes some potential solutions for dealing with this. So he says in Postgres 16, according to Andres, where some of the behavior was improved with LWLock lists, so that's faster. So moving to 16 could help alleviate some of this problem. The other thing he mentioned is using prepared statements more heavily because this issue tends to happen mostly at planning time. So if you could reduce your planning time by using prepared statements, that would help. So maybe if you use some of the newer versions of PgBouncer or other poolers like PgCat that support transaction pooling with prepared statements, that could help alleviate some of this contention too, if you're seeing it. But check out this blog post if you want to learn more.

Next piece of content- "Vector Indexes in Postgres using pgvector: IVFFlat vs HNSW". This is from tembo.io and this is a post about the pgvector extensions, talking about the different indexes of ivfflat for large language model work as well as the Hierarchical Navigable Small Worlds Index. So you could definitely review this part for general education. And then at the bottom, they do a comparison of them to give you kind of a decision table you could use that's shown here where if you're looking for short build times, definitely go with the ivfflat vs. HNSW. If you're looking for the smallest size, ivfflat tends to be smaller than the HNSW. If you're looking for speed, HNSW is definitely it, ivfflat is slower, and the change in recall. This is kind of a proxy for accuracy upon updates. It gets really far off with ivfflat, whereas HNSW tends to remain consistent in its recall. Also, from other posts we've looked at, the recall tends to be higher in HNSW too, but check out this post if you want to learn more.

Next piece of content- "A Complete Guide to Creating and Storing Embeddings for PostgreSQL Data''. This is from timescale.com. This is more pgvector AI-related content, but what I found interesting is the scenario that they were proposing here or the use case. So they had a blog and they really didn't want to change any of the table structure or alter it too much. So they came up with this architecture for creating a semantic search solution. You have your blog table and the recommendation would be to apply a trigger to it that updates a blog work queue table. So basically any change that happens here is recorded in this work queue table to know a change was made and you need to do something about it.

Then you would have an embeddings job that pulls work from the queue, creates the embeddings, and updates the blog embedding table. So I thought this was an interesting architecture for adding embeddings to existing applications without altering the way the existing application works. They're showing it in the same database, but there's no reason why you couldn't put this blog embedding table in its own dedicated database, which I've seen recommendations for. I think CrunchyData tends to say to put the embeddings in a separate database. But I thought this was interesting, and they go into the Python code to get it set up in this example.
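A minimal sketch of that trigger-plus-work-queue idea might look like the following; the table, function, and trigger names are invented for illustration, and the Timescale post implements this in more detail:

-- Queue of blog rows whose embeddings need to be (re)created
CREATE TABLE blog_embedding_work_queue (
    blog_id   bigint NOT NULL,
    queued_at timestamptz NOT NULL DEFAULT now()
);

-- Record every insert or update on the blog table in the work queue
CREATE FUNCTION enqueue_blog_embedding() RETURNS trigger AS $$
BEGIN
    INSERT INTO blog_embedding_work_queue (blog_id) VALUES (NEW.id);
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER blog_embedding_queue
    AFTER INSERT OR UPDATE ON blog
    FOR EACH ROW EXECUTE FUNCTION enqueue_blog_embedding();

-- A separate embeddings job then pulls IDs from the queue, calls the embedding
-- model, and upserts the results into a blog_embedding table.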

Now there's a related post to this, and that's this post- "How We Designed a Resilient Vector Embedding Creation System for PostgreSQL Data". They repeat a lot of the first post but then get into some of the actual database functions related to it. So if you want to learn more, definitely check out this blog post.

Next piece of content- "PostgreSQL 17: Reset shared statistics at once with pg_stat_reset_shared()". This is from dbi-services.com. Now pg_stat_reset_shared() has worked in versions before Postgres 17, but you have to put in what you want to reset. So here you're resetting the bgwriter or the archiver and the stats for each of these areas, the pg_stat_wal and the pg_stat_io get reset based upon that. Well now in 17, if you give it an empty parameter, so you just do that, it actually resets all the stats. So check this blog post out if you want to learn more.

Next piece of content- "Fun with Postgres Floats, Positioning, and Sequencing". This is from crunchydata.com. Now this contains spoilers about how he solved the 2022 Day 20's challenge of the Advent of Code. So this is quite a long time coming. If you want to check it out, feel free to review this post.

Last piece of content- "KUBERNETES: RUNNING THE CYBERTEC POSTGRESQL OPERATOR". This is from cybertec-postgresql.com and it seems crazy that there are so many operators nowadays for Postgres. So EDB has one they're supporting. There's the Percona operator, the CrunchyData operator, and the Cybertech operator. I'm sure there are other ones. This is an area I would kind of like to have less choice available in terms of running Postgres and Kubernetes, but they talk about how to get started setting up Minikube to get started with the operator. So definitely check out this blog post if you want to learn more about that.

And before we go, I do have a few things to mention about the Postgres optimization course I'm planning. There is now a courses section on the Scaling Postgres website that has the free course I set up, the PostgreSQL Performance Starter Kit. Well, starting on Black Friday, which is November 24, 2023, I will be posting the course page there, with information about it and how you can become an early adopter of the course for a 50% discount if you want. But I will also be sending out an email announcement on Friday, November 24, 2023. So if you want to receive that announcement, go ahead and sign up for the email list at scalingpostgres.com.

And just to reiterate a few things about the course, it is focused on performance optimization, so not so much on configuration, and I intend to create as massive a database as I can. So we're talking multiple terabytes with billions of rows for some of the tables to try to get a real-world sense of how to optimize queries in that type of environment. It won't just be logging into the course and watching videos and being done with it. It's going to be released one module at a time and there will be some live sessions associated with it, so I really want to make it as engaging as I can. So look for the announcement this coming Friday.
