100x Optimizations, Auditing, Document Schema Designs, Checksums | Scaling Postgres 206

In this episode of Scaling Postgres, we discuss 100x optimizations, how to audit table activity, designing schemas and how to set up table checksums.

Transcript

All right, I hope you, your friends, family, and coworkers continue to do well. Our first piece of content is "How we optimize PostgreSQL queries 100x". This is from towardsdatascience.com, and he's talking about some optimizations he did for his application that led to up to 100x performance improvements. Sometimes it was 10x, sometimes it was 2-5x. He goes through several different things he did to improve query performance. Now, he makes a note here that some of the queries he was optimizing had a lot of joins, 3-10, and yielded up to a million rows. So that's quite a lot. And he had to do it within specified UX time limits, because I believe this is a web application. So here are some of the things that he discovered. One is that in his cases, sometimes a LEFT JOIN can be more performant than an INNER JOIN, particularly when there are some columns that are correlated.

So just changing an INNER JOIN to a LEFT JOIN wasn't sufficient; he also had to include the correlated columns. So for example, the two tables being joined here both have a repository ID that is the same. And when he was able to add that condition in, the LEFT JOIN gave greater performance because the planner chose a different plan. The INNER JOIN, a lot of times, was using a nested loop, whereas the LEFT OUTER JOIN avoided that and did some things that could be cached better. So that was definitely a pretty good win. The next area he looked at is hashing values. So instead of using IN with a bunch of IDs, around 9,000 in his case, he used ANY with a VALUES list, and that resulted in a greater performance improvement. Sometimes this worked, sometimes it didn't, but when it worked he was able to get another 10 to 100-fold improvement.
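
To make those two rewrites concrete, here is a rough sketch with made-up table and column names, not the post's actual queries.

```sql
-- Sketch 1: give the LEFT JOIN the correlated column as an extra join condition.
SELECT pr.id, c.sha
FROM pull_requests pr
LEFT JOIN commits c
       ON c.id = pr.merge_commit_id
      AND c.repository_id = pr.repository_id;  -- logically redundant, but can steer the planner away from a nested loop

-- Sketch 2: replace a huge IN list with ANY over a VALUES list the planner can hash.
SELECT *
FROM commits
WHERE id = ANY (VALUES (101), (102), (103) /* ...thousands more... */);
```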

So this looked a little bit hit or miss, but it is possible. The next area he looked at was extended statistics. He had an example query where the repository_id was in one particular set of values and the merge_commit_id was in another particular set of values, and those two columns were highly correlated. So he created extended statistics on those two columns using CREATE STATISTICS, which you can do as long as you're on PostgreSQL 13 or later. That resulted in the planner doing hash joins as opposed to nested loops, again giving about a 10 to 100x speedup. The next area talked about is primary keys. Initially, they were using a natural primary key, but that key unfortunately came from an external source, and suddenly that source decided to change the format and they had to redo all of their keys.
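
As a rough sketch of the extended statistics technique, again with illustrative names rather than the post's exact ones:

```sql
-- Tell the planner these two columns are correlated, then refresh statistics.
CREATE STATISTICS commits_repo_merge_stats
    ON repository_id, merge_commit_id
    FROM commits;

ANALYZE commits;  -- the extended statistics are only populated on the next ANALYZE
```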

But what was interesting is that when they changed it from a text value to a simple integer, they noticed a 2-5x performance improvement on joins using those columns. Something we'll notice in a few other posts this week is that integers are generally ideal for performance when joining and when used as keys. The next area he covered was CLUSTER, a command that allows you to order the data in the table close to how it typically gets queried. Now, the unfortunate side effect of CLUSTER is that it locks everything out of the table, so it's not something you can run on a regular basis, clearly. I would probably only use it for a data mart or data warehouse purpose, not anything that's constantly being acted upon in a transactional manner.
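
A minimal sketch of the command itself, assuming a hypothetical table and index:

```sql
-- Physically rewrite the table in the order of one of its indexes.
-- CLUSTER takes an ACCESS EXCLUSIVE lock, so reads and writes are blocked while it runs.
CLUSTER commits USING commits_repository_id_idx;

-- New rows are not kept in this order, so the command has to be re-run periodically.
ANALYZE commits;
```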

I probably wouldn't use CLUSTER in this case, but it can give you a performance improvement by aligning the data with how it's being queried. He noticed that when they used CLUSTER it was a 2-5x improvement, so not as much as 100x, but it gives you a little bit. The next area he talks about is planner hints. He was using the extension pg_hint_plan, which basically works around what the planner decides to do: you give hints saying to do an index-only scan here, or just use a regular index scan, and a number of other hints to the planner as to how to execute the query. Now, interestingly, in this part he mentioned different changes that were made, but he never mentioned the performance improvement. So I imagine it wasn't that significant, but he didn't really say. But this is a pretty long blog post and I definitely encourage you to check it out to look at what he discovered in terms of his performance improvements.
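
For reference, pg_hint_plan hints go in a comment at the start of the query. This is a generic sketch rather than anything from the article, and the index name is made up:

```sql
-- Load the extension for this session (it can also go in shared_preload_libraries).
LOAD 'pg_hint_plan';

/*+ IndexOnlyScan(c commits_repository_id_idx) */
SELECT c.repository_id
FROM commits c
WHERE c.repository_id = 42;
```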

The next piece of content- "Postgres Auditing in 150 lines of SQL". This is from supabase.com and they're talking about setting up an auditing system. They have an example table here they called Members that just have an ID and a name. What they came up with was a relatively simple API where you can just call this function, enable_tracking, and it automatically starts auditing the transactions on this table. So basically, if something gets INSERTED, UPDATED, and DELETED, it will track that and it looks similar to the layout here. So you have a unique record identifier, they're using a UUID for this. When a record is INSERTED, the record ID will, of course, be there. When it's modified, both the record and the old record will be there. And when it's DELETED, you'll only see the old record reference in these columns.

The rest of the columns hold the operation (INSERT, UPDATE, or DELETE), a timestamp, and the table OID. They used the OID for more efficient querying because, again, an integer is more efficient than text for querying or indexing. Then there's the table schema and table name, as well as what's in the record and what was in the old record. So you can see the changes from an edit, what gets newly INSERTED, and then what gets DELETED. Now, in terms of constructing this, they created a separate schema to store it in called audit. Then they created the table that we just looked at; the results we just saw are what's stored in that table. Then they created an index on the timestamp, because they figured that's going to be queried pretty frequently: what time did a certain change happen?

And they actually used a block range index, which is great for timestamps that aren't going to be adjusted. Basically, it's kind of like an append-only table, which is what this is: each change in any tracked table just gets inserted constantly into that audit table. So they went with a BRIN index on the timestamp column, and then they did another index on the table_oid. They didn't do it on the table name, they did it on the OID, so that's how they expect you to query this table. Now, in terms of creating that unique record identifier, they actually used a UUID. The way they did that is by combining the table_oid and the primary key values to generate the UUID. That way the UUID will always be the same for a particular row, because the primary keys shouldn't be changing; if you do change the primary key, that row's identity changes.
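
Before getting to the functions, here is roughly what the storage side described so far could look like. This is a sketch pieced together from the description above, not the post's exact DDL:

```sql
CREATE SCHEMA IF NOT EXISTS audit;

-- Approximate shape of the log table described above.
CREATE TABLE audit.record_version (
    id            bigserial PRIMARY KEY,           -- surrogate key (my assumption; not described in the episode)
    record_id     uuid,                            -- stable identifier of the current row (NULL on DELETE)
    old_record_id uuid,                            -- stable identifier of the prior version (NULL on INSERT)
    op            text        NOT NULL,            -- 'INSERT', 'UPDATE', or 'DELETE'
    ts            timestamptz NOT NULL DEFAULT now(),
    table_oid     oid         NOT NULL,            -- an integer OID filters more cheaply than the table name
    table_schema  name        NOT NULL,
    table_name    name        NOT NULL,
    record        jsonb,                           -- new row contents
    old_record    jsonb                            -- previous row contents
);

-- BRIN suits an append-only, time-ordered log; a regular btree covers lookups by table OID.
CREATE INDEX record_version_ts ON audit.record_version USING brin (ts);
CREATE INDEX record_version_table_oid ON audit.record_version (table_oid);
```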

So this is the function they use to identify the primary keys, and then another function to actually consume those and build the UUIDs. You can look at this function here to see how that's done. Then they added indexes on the record and old record identifiers. Next is the generic function that works for any table: it basically reads the values of the changed row and inserts them into the record_version table in the audit schema. So that's the function that gets fired. But if you've used triggers, you know there's a trigger that actually fires a function. What we just described is the function; the triggers themselves get added by this enable_tracking function. So this is the actual API that turns tracking on or off. When it's on, it'll add those triggers, and they will fire and insert into the audit table.
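
Under the hood, enable_tracking presumably attaches row-level triggers along these lines; the trigger and function names here are illustrative, not the post's exact ones:

```sql
-- Fire the generic audit function after every row change on the tracked table.
CREATE TRIGGER audit_changes
    AFTER INSERT OR UPDATE OR DELETE
    ON public.members
    FOR EACH ROW
    EXECUTE FUNCTION audit.insert_update_delete_trigger();  -- the generic function described above (name assumed)
```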

When it's disabled, it'll remove those triggers; you can see those functions here, and that's pretty much it. And they said they did it with two lines of code to spare, so they did it in less than 150 lines. Now, one caveat they do mention about performance is that this will be a hit, because these are triggers, and any change to the table is going to fire them to do INSERTs into this other table. So you are going to have a performance hit by doing this. They do say that for higher write frequencies you can look at the extension pgAudit. And then lastly, they also have an extension they created called supa_audit. So basically, you're able to install this extension and then just start using the code here. But if you want to learn more, check out this blog post.
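
If you go the packaged route they mention, usage looks roughly like this. The function names come from the post's description, but treat the exact signatures, and the name of the "turn it off" call, as assumptions on my part:

```sql
-- Install the packaged extension (it has to be available on the server first).
CREATE EXTENSION IF NOT EXISTS supa_audit;

-- Turn tracking on for a table; subsequent inserts, updates, and deletes are logged.
SELECT audit.enable_tracking('public.members'::regclass);

-- Turn it back off when it's no longer needed (name assumed for the disable call described above).
SELECT audit.disable_tracking('public.members'::regclass);
```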

The next post is actually related to the previous one, because it was the article covered in this week's Five Minutes of Postgres: "5mins of Postgres E8: Postgres Auditing: Table triggers with supabase vs the pgAudit extension". He compares and contrasts it a little bit with the pgAudit extension, which relies on sending information to the log files of Postgres as opposed to a table, so there are no triggers. But the burden with pgAudit is the amount of logging that gets done. So you have to be prepared to handle that increase in log volume for your Postgres installation if you're going to be doing a lot of auditing with pgAudit. But I definitely encourage you to check out this week's episode as well.
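
For contrast, a minimal pgAudit setup looks roughly like this. pgaudit.log and pgaudit.log_relation are real settings, but the particular values are just an example:

```sql
-- pgaudit must already be in shared_preload_libraries for this to work.
CREATE EXTENSION pgaudit;

-- Audit all write and DDL statements; records go to the Postgres server log, not a table.
ALTER SYSTEM SET pgaudit.log = 'write, ddl';
ALTER SYSTEM SET pgaudit.log_relation = on;
SELECT pg_reload_conf();
```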

The next piece of content- "Database Design: Using Documents". This is from endpointdev.com, and he's talking about a scenario where you have a table of websites and you're going to be scanning them and you want to record the URLs for those pages that you've scanned. Now, he has an example schema here, and the goal of this article looks to be to find the smallest size that is possible for storing this information. So it doesn't really look at performance. So that's one caveat to think of. Now, generally, a smaller table will be faster to scan or smaller indexes will be faster to retrieve data, but it all depends on how you're querying. But in this example, he's looking at size. He says, 'Okay, let's take this schema and change it to a normalized condition where we're not using natural keys, but we're actually using integers that should allow our largest tables to be very narrow, so not recording a lot of information'.

So he did that. The next scenario was a composite type, so he created a composite type. You could have done this without a composite type and just used an array of text strings for the URLs; basically it's the website and then the page URLs go in an ARRAY. So I think you could have done it that way. Next, he looked at a JSON column and then a JSONB column to again compare the sizes. Looking at it, the original table with the natural keys was 436 megabytes. The normalized version he made was 387 megabytes. Going to the composite type with the ARRAY brought it down to 324 megabytes. The reason he gives for this size difference is the per-row overhead. Because if you think about it, the normalized table has many rows, and there's overhead for each of those rows. The composite type with the ARRAY has far fewer rows, but each row is much larger.
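
And here is a sketch of the plain array-of-text alternative, along with the query you would use to compare on-disk sizes; again, the names are made up:

```sql
-- One row per website, with every scanned URL collected into a single array column.
CREATE TABLE website_pages (
    website_id integer PRIMARY KEY,
    urls       text[]  NOT NULL
);

-- Compare total on-disk footprint (table plus indexes and TOAST) across the variants.
SELECT pg_size_pretty(pg_total_relation_size('website_pages'));
```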

Of course, then he looked at JSON and then the JSONB column, which is even a little bit smaller than that array, and I'm a little bit surprised by that. He was saying there is some overhead for using a composite type, so I wonder if a straight-up array would have been the smallest version. But it's an interesting investigation of different ways to set up a schema and the size differences between them. Again, to me, it all depends on what queries you're going to be running as to which format is more efficient. Based on how you're querying, would it be more efficient to have the normalized tables or the JSON format? To me, that's more important than the size differences. But this is an interesting piece of content and I definitely encourage you to check it out if you're interested.

The next piece of content- "POSTGRESQL: CREATING CHECKSUMS FOR TABLES". This is from cybertec-postgresql.com and they're talking about checksums which basically output this fingerprint based on a file. But instead of doing it for a file, like they're doing an md5sum checksum here, they're saying can we do it for a whole table or even particular rows? And that's what they did in this post. So the plan of what they're trying to do is they want to do a hash of each row and put them together. So in order to do that, they wanted to make sure they had things in a proper order. So they wanted to use an ordered set to do that. So they're showing an example of an ordered set here. This is a case where ordering becomes more important. When you're calculating an average, an order doesn't matter, but if you're trying to find a median, like the middle value, the order becomes more important.

But for what they're trying to do, where they're building a checksum upon checksums, ordering becomes absolutely critical. So they use the MD5 function that's available in Postgres and create a custom aggregate, which they call md5_aggregate, designed so that it basically hashes a row, takes the next row and hashes that onto the result, and so on until it runs out of rows. They created a test table, and using this new aggregate over the whole table they're able to get a single MD5 hash. Now, they didn't say how long this took to run; I imagine it might take a while to MD5-hash every row. They also show how you can get hashes for subsets of the data using this example query. So it's definitely an interesting technique for comparing, say, two different tables to see if they match. If you're interested in learning more about that, definitely check out this blog post.
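
A minimal sketch of that hash-chaining aggregate; this is my own simplified version under the approach described, not the article's exact code, and the table and column names are hypothetical:

```sql
-- State transition function: fold each row's text representation into the running MD5 hash.
CREATE OR REPLACE FUNCTION md5_agg_sfunc(state text, row_text text)
RETURNS text
LANGUAGE sql IMMUTABLE
AS $$ SELECT md5(state || row_text) $$;

CREATE AGGREGATE md5_aggregate (text) (
    SFUNC    = md5_agg_sfunc,
    STYPE    = text,
    INITCOND = ''
);

-- Checksum an entire table; the ORDER BY keeps the result deterministic.
SELECT md5_aggregate(t::text ORDER BY t.id) FROM my_table AS t;

-- Or checksum subsets, e.g. one hash per group.
SELECT t.group_id, md5_aggregate(t::text ORDER BY t.id)
FROM my_table AS t
GROUP BY t.group_id;
```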

The next piece of content- "PostGIS vs GPU: Performance and Spatial Joins". This is from blog.crunchydata.com. There was a post this week on someone using a GPU-assisted platform to plot out parking infractions in the city of Philadelphia, and they used a custom Python execution engine to do it. He then wondered how PostGIS with Postgres could compare calculating the same data. So that's what he did here. He set up his system, downloaded the publicly available data to load it in, set up his tables, and made appropriate changes to get them working, some indexes, and set up this query. But he noticed it wasn't using all of these parallel workers. So we actually adjusted the min_parallel_table_scan_size, and with that, he was able to get a parallel execution plan.

And running with four workers, he was able to process the data in 24 seconds. Now, the GPU-based custom solution did it in 8 seconds. So sure, he didn't beat it; it was three times as slow. But the thing is, that is a custom-designed solution, so it doesn't have all the overhead that PostGIS and Postgres add, and with that overhead you get a lot of power. For example, you can now come up with other queries against this data, publish it, and remotely query it using their pg_tileserv and pg_featureserv tools. So I thought that was a pretty interesting blog post and comparison.

The next piece of content- "CQL Filtering in pg_featureserv". This is from blog.crunchydata.com, and they're talking about an enhancement that they've done to pg_featureserv, which is an extension for accessing PostGIS data over the web. And they've add CQL capabilities, which is a common query language, I think, related to geographical information systems. So it lets you set this up and do querying with it. So if you're interested in that, you can check out this blog post.

The next piece of content- "Queries in PostgreSQL: 2. Statistics". This is from postgrespro.com, and this is a very in-depth document about statistics. So if you're interested in learning more about that, definitely check out this blog post.

The next piece of content: the PostgreSQL Person of the Week is Ibrar Ahmed. If you're interested in learning more about Ibrar and his contributions to Postgres, definitely check out this blog post.

The last piece of content: we did have another episode of The Rubber Duck Dev Show this past Wednesday. This one was on reviewing web application security risks 6 through 10 from the OWASP Top Ten. So if you're interested in that, I definitely encourage you to check out our show.
