222 Times Faster Analytical Queries With Columnar Storage | Scaling Postgres 283
Join Over 1,000 Engineers & Get New Episodes Weekly!
In this episode of Scaling Postgres, we discuss how to get 222 times faster analytical queries with columnar storage, a Postgres 16 feature review, the birth of a feature and fuzzy search.
- Hydra 1.0 Generally Available
- ClickBench — a Benchmark For Analytical DBMS
- Hydra on Github
- Citus: Row store vs. column store in PostgreSQL
- Discussing PostgreSQL: What changes in version 16, how we got here, and what to expect in future releases
- Active Active in Postgres 16
- JSON Updates in Postgres 16
- Postgres 16 is released!
- Major Features: Postgres 16
- Grouping digits in SQL
- “hired” vs. “fired” – fuzzy search in PostgreSQL
- Logical replication
- Enter the matrix: the four types of Postgres extensions
- Citus: 7 commonly used advanced SQL tools
- Determining the Optimal Postgres Partition Size
- PostgreSQL Performance Tuning, Part I: How to Size Your Database
- PostgreSQL Performance Tuning, Part II: Key Parameters
- Things you can, but probably shouldn’t do in PostgreSQL: Tables in tables, types in types, and arrays on top
- Why giving talks at Postgres conferences matters? Highlights from a podcast
- FOR loops automatically declared variables in PL/PgSQL
- Frustratingly simple ETL for Postgres
- Lantern — a PostgreSQL vector database for building AI applications
- Embedding near the edge: pgEdge Distributed PostgreSQL with pgVector
How would you like to have 222x faster analytical queries? This is something that Hydra has shown is possible. But before we get into that, I hope you, your friends, family and coworkers continue to do well. And also be sure to watch until the end because I actually changed my outro for the first time in over five years. But let's take a look at Hydra. So this is from hydra.so and this blog post is announcing its general availability. It is an open-source extension, kind of like Citus or TimescaleDB is now, but it focuses on column storage for analytical workloads. But it says it also supports hybrids as well, apparently. So this particular post goes over what columnar storage is. It basically stores columns first, meaning the same type of data is all stored together. This example, OLTP, is here, and it's easy to pull out one record of information. So if you want all the information about Joe, that's easy to get, all the information about Felisha, that's easy to get. However a columnar-based storage stores similar data types together.
So if you look down here, the order of the data shows everybody's name first and then their IDs, then the colors, et cetera. Now that makes it fast for counts, sums, and averages for particular columns of data. Plus you can also get really efficient at compressing that data, which can lead to further speed improvements. So, like I said, Hydra is an extension that you can use. By default, the tables it creates are columnar, but you can create standard heap tables using the USING heap command and you can convert them back and forth. I don't know how efficient this would be if you have hundreds and millions of rows, but it says you can do it. Now, this is the five minute video included on the page. I'm going to focus on one point where they selected all the rows from a heap table with 100 million rows in it and then did the same count from a columnar table. And the columnar table was 222x faster. But if you notice this query, I don't think there are any indexes on it as well as it's not narrowing down any data at all.
So I think this might be a very specific use case where you get this type of immense performance. And then he showed something that is about 600x faster, where he did a sum of one column, an average of another column, and just got a count. It was over 600x faster using the columnar storage compared to the heap storage. And in the blog post, they have a link to benchmarks. So this is a link from a benchmark. clickhouse.com. Looking at ClickBench, which is a benchmark for analytical DBMS. And for the comparison listed down here, they have Hydra TimescaleDB with compression, Citus, Aurora, TimescaleDB presumably without compression, and standard PostgreSQL. And at least for the queries that this benchmark is testing, Hydra beat every other solution here. Then if we look at the GitHub page, they have a quick start. So you could use their Hydra service in a free tier, or you can run Hydra locally using a docker container.
But I wanted to focus on the FAQs here where they say why is Hydra so fast? It's basically a columnar storage, query parallelization, vectorized execution, column-level caching, and tuning Postgres. So there's a lot of things they're doing to get all of that speed, and of course, there's compromises for doing that. But they say, what operations is Hydra meant for? It's basically great for doing aggregates (COUNT, SUM, AVG), WHERE clauses and bulk INSERTS, UPDATES, and DELETES. So the key is bulk. They intend for you to load data in a bulk fashion. This isn't to replace your transactional database. And they say here what is columnar not meant for? Frequent large updates, because those don't really work well with the column-oriented structure of the data and small transactions. So this is not going to replace your typical transactional database.
There are some unsupported features at this time, such as logical replication, and columnar tables don't typically use indexes. Now, as a contrast to this, cybertec-postgresql.com did a blog post on "CITUS: ROW STORE VS. COLUMN STORE IN POSTGRESQL", and they discussed the difference between row store, column store, the focus on analytical use cases, and by default, Citus creates heap tables. But you can specify using columnar to use their column-based storage. He is using the scale-out capabilities at Citus to create distributed tables and load in some time series data. And the other thing I should mention is that columnar storage does tend to save on space. So he witnessed 5x smaller storage for columnar-based data. Again, similar data types, I believe, can compress better than doing the heap row storage. The same size savings can be seen in Hydra and of course TimescaleDB as well.
Now, his counts aren't as impressive. He only sees about a 3x faster query between heap storage and columnar storage. But he wasn't using any WHERE clauses or anything like that. He was still just doing accounts from a table, the way Hydra was doing. Then on a single column, he could even count the value GROUP BY, ORDER BY, and it still gave better performance, although not as big as before. So I always have to take these performance improvements with a grain of salt and test out these solutions with your particular workload to see what kind of performance benefit you can get. But it's great to see this new option in the community to support analytical workloads. And if you do a lot of those types of queries, maybe you want to check out Hydra or even the capabilities of Citus or TimescaleDB, but let me know in the comments if this is something that would be of potential interest to you.
Next piece of content- "Discussion PostgreSQL " What changes in version 16, how we got here and what to expect in future releases". This is from postgresql.fastware.com. And out of all the Postgres 16 feature posts that have been coming out, I think this is the best one and covers the most. Now, he does a little bit of a historical review as you can see here, as well as looks at Postgres 17. But for the 16 features, he really covers everything. I mean, I'm not going to go into everything, but he talks about all the different changes to logical replication, enhancements to storage systems, wherein we got the potentially up to three times faster copy happening, as well as the pg_stat_io view, new SQL features, including adhering to the new SQL JSON standard, as well as the collation changes, additions to security and different privileges, as well as general miscellaneous performance improvements.
But of course, what also interests me about this post is what's potentially coming in Postgres 17. And there are a lot of exciting improvements coming to logical replication in particular. At least I hope they can get most of these in by version 17. The first is DDL Replication. So this is huge, being able to send table changes to a subscriber. Replication of sequences. So this is a long time coming, particularly if you want to try to use logical replication for upgrades, you always have to bring over the sequences. This is another huge one- Synchronization of replication slots to allow failover. This is another hugely important one because once you fail over, you kind of have to start logical replication from the start again. But there are some other interesting ones too, like incremental backup, just a lot of interesting changes. So if you're interested in that, definitely check out this blog post.
In addition, if you are enjoying this content, feel free to like and if you're not subscribed yet, hit the subscribe button.
Next piece of content- "Active Active in Postgres 16". This is from crunchydata.com and they're talking about what, historically, has been called multi-master replication. So basically having two primary databases running and synchronizing changes between each other. And this is using the logical replication changes in 16 that we discussed before, where you can now track the origin. Basically, only synchronize data that hasn't come from another replica. So that enables this Active Active replication to happen. So he talks about the origin filter that's now present in the communication between them to be able to discern where an origin comes from.
He sets up a sample environment, inserts some data in there, gets the logical replication set up with the publishers and the subscribers, and emphasizes origin none to be able to support that Active Active. Then a replication test, where he updated some data, and it updated appropriately in each system. But this is so new that you can't say we have multi-master replication today, so there's a lot of things to be careful about and confirm that it's working as intended. So he presents an example where you could actually get conflicts between the two that will not eventually be resolved. So basically you need to manually go and fix corrections such as these. So again, I like where this is going but it's definitely still early days with regard to it, but check out this blog post if you want to learn more.
Next piece of content- "JSON Updates in Postgres 16". This is from crunchydata.com and this covers changes to Postgres to match the new SQL JSON standard. So he shows things like checking if the value is a JSON object, looking at JSON_ARRAY, JASON_ARRAYAGG(), JSON_OBJECT, and JSON_OBJECTAGG(). He goes through examples of each of these and use cases for them. So I'm not going to go through all of this, but definitely a great resource if you are looking for the changes in Postgres 16.
Next piece of content is "Postgres 16 is released!". This is from pganalyze.com and this is Lukas's take on the changes to Postgres 16. So I encourage you to check out his piece of content with regard to that.
And lastly is the slides that Bruce Momjiam did for a presentation on major features in Postgres 16. So these are some additional resources to check out the new features in Postgres 16.
Next piece of content- "Grouping digits in SQL". This is from peter.eisentraut.org and this was a very interesting blog post where he covers, as he says here, the story of a feature. So this is how this feature came to be. The feature is where you can put a delimiter in your numbers to be able to make it easier to read. So for example, there are two underscores here, but Postgres still reads it as 10 million. He says normally, you would propose this patch to Postgres, but sometimes that's a bit of a struggle because they don't really want to make changes that are going to go against what may be coming in the SQL standard. So he actually turned it on its head and decided to quote "Get this into the SQL standard first".
So first they got this into the SQL standard and after that was done, he covered the issues of actually implementing the feature, where Postgres would do some crazy things. It would parse this- SELECT 123abc; as being SELECT 123 and call the column abc, which is a little bit crazy to me, but apparently, that's how the parser worked. So they actually had to make changes to adjust that as well as a number of other issues including having to deal with some SQL JSON issues. So definitely a fascinating post of how a feature got into Postgres. So check it out if you're interested.
Next piece of content- "'HIRED' VS. 'FIRED'- FUZZY SEARCH IN POSTGRESQL". This is from cybertec-postgresql.com. They're talking about fuzzy search and particularly some of the functions available in the fuzzy string match extension or the fuzzystrmatch extension. So he just set up a table, inserted five rows into it, and tested some different functions. The first one is soundex, so this is supposed to phonetically sound similar. And he demonstrates when you put it through a soundex function, you can get fired returned, even though you misspell it. So that's one way Fuzzy String Search will work, and this will work with an expression index. So you could generate the soundex as part of the index build. Then I looked at the Levenshtein function, which basically determines how many letters you have to replace between the word in the system and what was in the query. And he shows an example of doing that here. And then lastly he looked at metaphone search, which he says is similar to soundex, but you could define a length of output. So you could see that it takes a string and converts it into what it sounds like and you specify the length. So this is a four-character length of this phrase and this is a six-character length of this phrase. And then, of course, you can compare them with each other. But check out this blog post if you want to learn more.
Next piece of content, there was another episode of Postgres FM last week. This one was on "Logical replication". So they talked about logical replication in general. They didn't spend too much time talking about the Postgres 16 features but spent a lot of time talking about logical replication in upgrade scenarios to upgrade a database. And even called out this Twitter thread right here where they say, quote "GitLab upgraded multi-terabyte, heavily loaded clusters with zero downtime". Essentially using logical replication. So if you want to listen to that episode, you can check it out here or watch the YouTube channel down here.
Next piece of content- "Enter the matrix" the four types of Postgres extensions". This is from tembo.io, and he's talking about extensions that pretty much fall into four buckets. Either they require LOAD or they don't, or they require running CREATE EXTENSION or they don't. And there are different reasons for those use cases and how you can actually get and enable extensions typically. And apparently, they're working on some features to be able to load any extension in Postgres easily. So I don't know if the intent is some type of library installer. Like, for example, I use Ruby, and they have Gems. Installing a Ruby Gem is quite easy to install, using a toolkit called Bundler. So I don't know if that's kind of where they're going with this, but how I install extensions today is basically to ask if it needs to be in the shared preload libraries that they mentioned here? Put it there if it needs to be. You need to restart Postgres. And then secondly, do I need to run CREATE EXTENSION for it or not? That's basically how I install extensions now. But this goes into a lot of detail about extensions, so if you're interested in that, check out this blog post.
The last piece of content- "CITUS: 7 COMMONLY USED ADVANCED SQL TOOLS''. This is from cybertec-postgresql.com, so he identified seven commonly used tools for Postgres and how it works in Citus. The first one he talks about is naming Citus databases, and basically, it's difficult to create a database within Citus. And the takeaway he posts is quote "Just use the Postgres database and you'll be OK". Okay. Two is loading data using COPY and it works just fine to use COPY in Citus. Three, advanced SQL: using ordered sets. The use case for this is if you want to find the median, not necessarily the average or the mean, but you want to find the actual middle value, and for that, you have to order the set of data. This does work in Citus. Four are window functions, and he was able to get a window function working in Citus. Five were grouping sets and those really didn't work. He ran into an error, but he was able to accomplish the same thing using UNION ALL. Six is using triggers with Citus, and what he found out is that it's not possible to use triggers on distributed tables. So definitely some things to keep in mind if you're exploring using Citus.