
Better Plans With Statistics | Scaling Postgres 293


In this episode of Scaling Postgres, we discuss how CREATE STATISTICS can help you get better plans, issues with subtransactions, all about extensions, and best practices when indexing timestamps.

Content Discussed

Transcript

Most of the time when you have a bad plan, it's usually due to the statistics. Maybe they're out of date, maybe you need them at more granularity. Well, another case is when there are dependencies between columns that Postgres doesn't realize, and this week Lukas covers how to handle that. But I hope you, your friends, family, and coworkers continue to do well. I'd also like to thank and welcome everyone who decided to join the course, which was open for early adopters this past Black Friday through Cyber Monday. I'll provide more information in the coming weeks, but open enrollment for everyone else starts in January 2024, so I'll talk more about it then.

But the first piece of content is "How to use CREATE STATISTICS to improve a plan rows estimate, and why it matters". This is from pganalyze.com and is part of Lukas's "Five minutes of Postgres". He's talking about a situation he actually saw in his own database at pganalyze, and that issue was with statistics. The system automatically collects single-column statistics, but it doesn't do anything about dependencies that may exist between two different columns. You can add those using CREATE STATISTICS; they're called extended statistics. He had two columns that were dependent upon one another. Normally, I think the example use case is between a city and a zip code: there's an alignment between zip codes and cities, or between cities and states.

But in his case, I think it was something like servers and the organization that owns those servers, and they were getting a bad plan. Close to 500 rows actually existed when he ran EXPLAIN ANALYZE, but the plain EXPLAIN plan estimated just one row. The reason was that it was doing a bitmap AND, combining two indexes together, so it was looking at the statistics for each column independently, and that clearly gave an inaccurate estimate. To handle it, he ran a CREATE STATISTICS command. He also needed to increase the amount of statistics being collected for this table. After he did that, it gave a much more accurate estimate of how many rows the query would actually return. You'll have to watch the video to get more of the details behind it; I definitely encourage it.
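
As a rough sketch of that kind of fix (the table and column names here are hypothetical, not the ones from the episode):

```sql
-- Tell the planner that these two columns are functionally dependent.
CREATE STATISTICS servers_org_dep (dependencies)
    ON organization_id, cluster_id FROM servers;

-- Optionally collect more detailed per-column statistics as well.
ALTER TABLE servers ALTER COLUMN organization_id SET STATISTICS 1000;

-- Extended statistics are only populated on the next ANALYZE.
ANALYZE servers;
```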

He also explained how the statistics worked and how it was coming up with one row instead of 500. The reason this is important is that if these statistics are off relative to what you're querying, you could get a bad plan, meaning there exists another plan that's more performant. So if you're seeing some performance issues and you think they're related to the statistics being collected, it may be a case where you need to increase how many statistics you're collecting, or maybe you need to add some of these extended statistics to give the planner more information to pick a better plan. Check out this piece of content if you want to learn more.

Next piece of content. There was another episode of Postgres FM last week. This one was on "Subtransactions". These are essentially transactions nested within an overall transaction in the database. Usually the way you create them is to begin a transaction block and then create savepoints within it; that basically saves your work part of the way through. Now, the main advice was: don't use them, or try not to use them if you can avoid them, because once your database gets busy, you can run into problems with them. I think Nikolay mentioned one way he does use them: if he's doing DDL changes that need to grab locks, he may create savepoints during that process and retry a particular statement on, say, a busy table until it's able to grab its lock; once it has the lock and makes the change, it can continue. That's the only example use case I think they mentioned, but otherwise, try to avoid them if you can.
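
For reference, here's a minimal sketch of what a savepoint-based subtransaction looks like; the lock-retry idea with a hypothetical table is just an illustration, not the exact procedure from the episode:

```sql
BEGIN;

-- Each SAVEPOINT starts a subtransaction inside the outer transaction.
SAVEPOINT try_ddl;

-- Hypothetical DDL that needs a lock; keep the wait short.
SET LOCAL lock_timeout = '100ms';
ALTER TABLE orders ADD COLUMN note text;

-- If the statement failed on the lock timeout, the application would run
--   ROLLBACK TO SAVEPOINT try_ddl;
-- and retry. On success, release the subtransaction and finish.
RELEASE SAVEPOINT try_ddl;

COMMIT;
```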

They talked about four issues they've seen with regard to them, which are mentioned in some of the posts down here. One was that it burns through your transaction IDs: every transaction uses a transaction ID, but every subtransaction gets its own transaction ID as well, so a single transaction with multiple subtransactions uses up your transaction IDs faster. The second issue is too much nesting; I think there's a limit on how many subtransactions you can have within a transaction, so you just need to be careful of that. There could also be some issues with multixact IDs, particularly with SELECT FOR UPDATE, I think they were mentioning, so that's something to look out for. Then they also mentioned a potential replica issue that can happen as well. A great piece of content; I definitely encourage you to check it out. You can listen to the episode here or watch the YouTube video down here.

Next piece of content is actually a presentation PDF. This is "Postgres: Extensions Shape the Future" on rustprooflabs.com, created by Ryan Lambert. This is a really long 70-slide presentation, and because it's from RustProof Labs and Ryan, who's the author of Mastering PostGIS and OpenStreetMap, it definitely covers a lot of geographic information systems-related content in Postgres. But it talks all about extensions: their history, particular ones to use, how to create them in different ways, and using the pgrx framework. So definitely a lot of great information about extensions you may want to check out.

Next piece of content "Indexing TIMESTAMPS". This is from momjian.us and he has a scenario where you want to put an index on a timestamp in your table, but you want to be able to query it by the date. And maybe you have an idea- you want to cast the index to a date. But the problem is this won't work in this particular case because you're using a timestamp with a time zone. So presumably, this will work if you're using timestamps without time zones. But a way to get it to work is to give it a specific time zone when you're casting it to a date. Then it will allow you to create this functional index. But the most important thing he mentions is don't really even need to do that.

You can just have a basic index on the timestamp and then query it using only the date without the time component. As you can see here in the explain plan, the time portion defaults to all zeros when you include only the day portion, and you just use greater than or equal to and less than. This is something I do intuitively already, but apparently people were trying the previous possibilities. He said it even works to cast this particular one to a date and add intervals to it, and it still does the appropriate conversion to give you the answer you want. So this is the most flexible option, essentially. As he says here, this technique gives you unlimited granularity, so you can define how granular you want to be with your date or time, and it's time zone flexible. So check out this piece of content if you want to learn more.
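
Here's a minimal sketch of that range-query approach against the same hypothetical events table:

```sql
-- A plain index on the timestamptz column is all you need.
CREATE INDEX events_created_idx ON events (created_at);

-- Query one day with a half-open range; the date literals are interpreted
-- as midnight in the session's time zone.
SELECT count(*)
FROM events
WHERE created_at >= '2023-11-27'
  AND created_at <  '2023-11-28';
```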

Next piece of content- "What We Learned From Benchmarking Amazon Aurora PostgreSQL Serverless". This is from timescale.com, and apparently, they had some people who have been converting from Aurora to Timescale, so they decided to take a look at cost differences, performance differences, and things like that. Now, they did use a time series workload to test this, so you'll have to take that fact into account when looking at this. Using this type of workload that Timescale was designed for, they saw that it was 35% faster to ingest and the query times were roughly equivalent to a little over one to 16 times faster.

But two queries were actually slower, so Aurora won those. Timescale was also 95% more efficient at storing data, which is predominantly due to the column-store compression that Timescale offers. And they achieved pretty significant cost savings even with this better performance: 52% cheaper per hour for compute and 78% cheaper for data storage. Presumably, a lot of this is due to the compression that Timescale offers. But if you use Aurora and you have a mostly time series-related data set, maybe you want to check out using Timescale instead of Aurora for some cost savings.

Next piece of content- "Join the preview of Amazon Aurora Limitless Database". This is from aws.amazon.com. So this is a new offering by AWS and it looks to be their sharding solution. So basically it's one instance that can grow limitless, they say, and it's in preview now, so not generally available. So presumably you do have to generate sharded tables and shard keys. So this essentially seems to be what Citus is doing. It's a sharded database. So a scale-out solution and I will say what was pretty funny was that I did find this on Postgres weekly and this is the weekly newsletter talking about Postgres and under their heading, they mentioned "So not only can your database be limitless, your AWS bill can be too". So that's kind of funny.

Next piece of content- "Accelerate HNSW indexing and searching with pgvector on Amazon Aurora PostgreSQL-compatible edition and Amazon RDS for PostgreSQL". This is also from aws.amazon.com and they basically took the newest version of pgvector 0.5.1, compared a little bit to 0.5.0 as well, and their new Graviton2 and Graviton3 instances to see what their performance differences are and of course, used the new HNSW indexes. Scrolling down to the results here, first, they looked at ingestion and as the number of clients climbed, a more recent version of pgvector looked to be over three times faster.

So definitely, if you're using an older version, move up to the latest version of pgvector if you want better insertion performance. They also looked at insertion performance with regard to their instance types and saw generally a 20-30% speed-up, so the newer instances do better. Apart from insertions, they also looked at query performance, not pgvector 0.5.0 vs. 0.5.1, but with regard to their Graviton instances, and it looks like they saw a 40% to 60% performance improvement with those. But check this post out if you want to learn more.
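
For reference, a minimal sketch of creating an HNSW index with pgvector; the items table and the tiny three-dimensional vectors are just for illustration:

```sql
CREATE EXTENSION IF NOT EXISTS vector;

-- Toy table with a small embedding column.
CREATE TABLE items (
    id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    embedding vector(3)
);

-- HNSW index using Euclidean distance, available since pgvector 0.5.0.
CREATE INDEX ON items USING hnsw (embedding vector_l2_ops);

-- Nearest-neighbor search ordered by distance to a query vector.
SELECT id
FROM items
ORDER BY embedding <-> '[1,2,3]'
LIMIT 5;
```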

Next piece of content- "Creating a Fast Time-Series Graph With Postgres Materialized Views". This is from timescale.com. They created a scenario where you have an electrical provider that is tracking the demand for electricity for their service and they wanted to do some sort of dashboard to present the information. They had a particular query that did it so they materialized it into a view. They created a view that basically creates a real table behind it and of course, whenever you do this you're pre-generating the data that makes it easy to query upon it's much faster.

The downside is that you have to refresh that materialized view periodically: data is constantly flowing in, but you have to take the time to actually refresh the view. Now, you can do it concurrently, but if you have years of data within this view, it still has to recalculate everything. Timescale offers the ability to aggregate continuously instead. It takes a bit of a performance hit, but it means you don't have to rebuild all the data. Looking at what this does, though, I don't think there's anything necessarily stopping someone from rolling their own solution that works like these continuous aggregates in Postgres itself. But if you want to learn more, check out this blog post.
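
As a rough sketch of the plain-Postgres approach (the demand_readings table and its columns are hypothetical):

```sql
-- Pre-aggregate hourly electricity demand into a materialized view.
CREATE MATERIALIZED VIEW hourly_demand AS
SELECT date_trunc('hour', reading_time) AS hour,
       region,
       avg(demand_mw) AS avg_demand_mw
FROM demand_readings
GROUP BY 1, 2;

-- A unique index is required to refresh concurrently.
CREATE UNIQUE INDEX ON hourly_demand (hour, region);

-- Recomputes the whole view, but without blocking readers.
REFRESH MATERIALIZED VIEW CONCURRENTLY hourly_demand;
```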

Next piece of content- "Compression Faceoff: Postgres TOAST vs Timescale Compression". This is from timescale.com. Now this probably isn't a quite fair matchup because I think the primary responsibility of TOAST is just to move large data out of the main table area. So you have the huge JSON document or huge text document; you want to store all that data outside of the normal row storage. And I think compression wasn't top of mind when designing TOAST, necessarily. It does it, it's just not to the degree at which Timescale has focused on compression. Plus, there are only certain types of data that make it into the TOAST. So when they are looking at the example here, where presumably you have columns that aren't so TOASTable, there's essentially nothing in the TOAST. So essentially nothing's compressed versus 73% compression using Timescale. But once you store data that gets TOASTed, although looking at this chart, I'm not quite sure these are zeros for the TOAST size because it said they added a JSONB data type.

So I would expect the TOAST to be populated, and I don't quite understand that. But it still says they get about double the compression. Some of these wins make sense when you look at exactly how they do it: they're basically using a kind of column store. As you can see here, they batch up to, say, 1,000 rows into a single row. They take those thousand rows, put all the timestamps in one column as an array, all the device IDs in another column as an array, and status codes, temperatures, and similar data types together in arrays in single columns. Then they use compression techniques specialized for each data type to compress it as much as they can. With this solution, they get much higher compression rates, and that is what makes Timescale better for compression. But check out this blog post if you want to learn more.
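
To illustrate the batching idea conceptually in plain SQL (this is just a sketch of the layout against a hypothetical metrics table, not Timescale's actual implementation):

```sql
-- Collapse each device's readings for an hour into one row of arrays,
-- one array per original column, mimicking the batched layout described above.
SELECT device_id,
       date_trunc('hour', reading_time)               AS bucket,
       array_agg(reading_time ORDER BY reading_time)  AS times,
       array_agg(temperature  ORDER BY reading_time)  AS temperatures,
       array_agg(status_code  ORDER BY reading_time)  AS status_codes
FROM metrics
GROUP BY device_id, date_trunc('hour', reading_time);
```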

Next piece of content: it's time for PGSQL Phriday again. This is #014, and the invitation is on pgsqlphriday.com; this is the one from Cybertec. Basically, it's focused on PostgreSQL events. Now, these aren't events in PostgreSQL; they're events where you meet other people using Postgres, and the topics include reflecting on personal experiences, looking at particular highlights, comparing virtual vs. in-person experiences, networking opportunities, et cetera. So feel free to participate in this blogging event for December 1, 2023.

Next piece of content- "Fun with Postgres Looped Functions and Linear Progressions". This is from crunchydata.com. This is Day 21 of the advent of Codes 2022 Challenges, and I won't mention it further because it'll spoil how it's being done.

Next piece of content also from crunchydata.com is "Fun with Postgres Text File Mazes, Charts, and Routes". This is the Day 22 challenge, so you can check this out if you're interested in it.

And the last piece of content: "Database generated columns: Django & PostgreSQL". This is from paulox.net, and it focuses a lot on Django, but it does talk very briefly about the generated column capability within Postgres. I don't know if you necessarily need this feature in Django to use generated columns, because I know with Rails, as soon as the generated column feature was available in Postgres, I could start using it in my database; I didn't have to wait for a particular Rails feature. But if you want to learn more, definitely feel free to check out this blog post.
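
For reference, a minimal sketch of a Postgres generated column on a hypothetical table:

```sql
-- The total column is computed by Postgres and kept up to date automatically.
CREATE TABLE order_lines (
    id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    quantity integer NOT NULL,
    unit_price numeric NOT NULL,
    total numeric GENERATED ALWAYS AS (quantity * unit_price) STORED
);
```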
