
Index Merge vs Composite, Transparent Column Encryption, Trusted Language Extensions, WAL Archive Module | Scaling Postgres 244


In this episode of Scaling Postgres, we discuss merging indexes vs. a composite index, implementing transparent column encryption, developing trusted language extensions, and reviewing the WAL archive module.


Transcript

All right. I hope you, your friends, family, and coworkers continue to do well. Our first piece of content is "Index Merges vs Composite Indexes in Postgres and MySQL". This is from sirupsen.com. He's comparing two single-column indexes, one on each column, used to look up data filtered on both of those columns, vs. a single index that covers both columns, in other words a multicolumn or composite index. He looked at this for Postgres and MySQL, and he started with something called napkin math, which is actually a tool on GitHub that gives you reference numbers for how fast you would expect an operation to be. Based upon those estimates, he felt a composite index lookup could happen in about one millisecond, while an index merge solution using two indexes, one on each column, would take ten to 30 milliseconds.

Now, of course, this is a contrived example comparing integers, but it still has some relevance here. And when he actually ran the test, he found the composite index took about five milliseconds for both Postgres and MySQL. And I should say the query was doing index-only scans; it was just doing a count of all the rows. The index merge for MySQL took 30 to 40 milliseconds, so about the range that was predicted. The actual composite index was a little slower than the estimate, but he says it's normal for the estimates to be off within an order of magnitude. Postgres was, on average, a little bit slower still, but had a very wide variance, and the reason is that it wasn't actually doing an index-only scan when doing these index merges.

So basically, in terms of comparing two single-column indexes vs. a multicolumn index, the multicolumn index was about ten times faster. And he says, particularly with Postgres, as you get a higher number of records, it looks like that difference will become more prominent. And this matches what I've found in my own consulting: normally, as the base case, I stick with single-column indexes. But then as the data grows and it comes time for optimization, a multicolumn index is usually the first thing to reach for, because the application or the use case has matured at that point and you have a good sense of what queries are going to be running and what could take advantage of a multicolumn index. But if you want to learn more about that, definitely check out this blog post.
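To make the comparison concrete, here is a minimal sketch of the two indexing approaches being compared; the table and column names are illustrative, not the ones from the article.

```sql
-- Hypothetical table for illustration
CREATE TABLE events (id bigserial PRIMARY KEY, user_id int, type int);

-- Approach 1: two single-column indexes, relying on the planner to
-- combine them at query time (a BitmapAnd, i.e. an "index merge", in Postgres)
CREATE INDEX events_user_id_idx ON events (user_id);
CREATE INDEX events_type_idx    ON events (type);

-- Approach 2: one composite (multicolumn) index that covers the query directly
CREATE INDEX events_user_id_type_idx ON events (user_id, type);

-- The kind of query being measured: a count filtering on both columns
SELECT count(*) FROM events WHERE user_id = 123 AND type = 2;
```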

Next piece of content- "Transparent Column Encryption with Postgres". This is from supabase.com, and he's talking about a new enhancement to the extension called pgsodium. Now, this relies on the encryption library libsodium, and it does encryption in the database. Now, what it does is it does store a reference to keys in tables, but it doesn't actually store the key within the database. So the keys that do the decryption are not accessible from SQL, for example, nor are they placed in the logs or things of that nature. I believe you probably can place it in a file in the database, or you can point to any key management system where the keys are actually stored and integrate that with libsodium, as far as I understand.

So basically, the new version of pgsodium has this API-based key management in place, and it shows you how you can encrypt particular data and then decrypt that data as well. But the thing of most interest released in pgsodium 3.0 is transparent column encryption. So basically, it gives you a way to insert data into a table where it's automatically encrypted, and then to retrieve that information in a decrypted state. As for how it works, they're using security labels to set this up. Once you have the extension installed, they use the example of a credit card table where you're putting a credit card number in a text field. Then you apply a security label on that particular column that says to encrypt it with a particular key ID.

And again, this is a reference to the actual key, so pgsodium knows where to find the actual key to do the encryption, even though only the key reference is stored in a table in Postgres. Now, you can also do one key per row, so every individual piece of data has its own dedicated key. By default these are nonce-less, but you can do rows with nonce support, adding the nonce as a byte array as well. So pgsodium supports that, and you can even include associated data with the encryption. So there are definitely a lot of different features for setting up your encryption. Now, how it actually works is that you insert your data into the table as usual, so you specify the plaintext data you're inserting.

So, of course, you want to be cautious not to log this information in Postgres. Like, don't log your insert statements, for example. But then if you select directly from that table, you will see encrypted values. So you can see the credit card number as an encrypted value. Whereas it sets up a separate view for the table, in this case one called decrypted_credit_cards, where it actually shows you the decrypted credit card. So this is definitely an interesting enhancement to the existing pgsodium extension. And if you're interested in doing transparent column encryption, I definitely suggest you check it out.
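As a rough sketch of the flow described in the post (the key ID, table, and column names here are made up for illustration, so treat this as an approximation of the pgsodium 3.0 interface rather than copy-paste instructions):

```sql
CREATE EXTENSION IF NOT EXISTS pgsodium;

-- A table with a column we want transparently encrypted
CREATE TABLE credit_cards (
  id          bigserial PRIMARY KEY,
  user_id     bigint,
  card_number text
);

-- Label the column with the key ID to encrypt it with
-- (the UUID below is a placeholder for a key created via pgsodium's key management)
SECURITY LABEL FOR pgsodium ON COLUMN credit_cards.card_number
  IS 'ENCRYPT WITH KEY ID e348034b-0f97-48d1-8d41-54b76e41a0eb';

-- Inserts look like normal plaintext inserts...
INSERT INTO credit_cards (user_id, card_number) VALUES (1, '4242424242424242');

-- ...the base table stores ciphertext...
SELECT card_number FROM credit_cards;

-- ...and the generated view returns the decrypted value
SELECT decrypted_card_number FROM decrypted_credit_cards;
```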

Next piece of content- "New- Trusted Language Extensions for PostgreSQL on Amazon Aurora and Amazon RDS". This is from aws.amazon.com. They're talking about a new open-source extension that they have developed that allows you to create extensions using only trusted languages, so languages that exist within a sandbox. So this is an open-source project that they've released. In terms of trusted languages, what they mean is you can use SQL, PL/pgSQL, JavaScript, or Perl because these have security constraints where they can't write to the file system. Now, probably the reason why they created this is because they have a hosted infrastructure and they don't want you writing to the file system. So they've given you a convenient way to develop your own extensions and install them quite easily into their hosted infrastructure because nothing's going to be writing to disk, essentially. 
Now this should work with Postgres anywhere, of course, not just on Amazon's infrastructure. Once you go ahead and get this extension set up, you create it within your database and then you can define an extension with pgtle.install_extension, where you just put in the code for your custom-made extension. They did a sample use case here where they're actually checking for bad passwords. This new project also supports hooks into Postgres, so they're using the check_password_hook to be able to do password validation. So that's what this extension does, and they show you how to go ahead and install it and get it activated within the database. So this is a great enhancement to the open-source community.
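Roughly, installing a small extension with pg_tle looks something like the following sketch; the extension name and function body are invented for illustration, and the install_extension arguments are based on my reading of the pg_tle project, so double-check them against its documentation:

```sql
CREATE EXTENSION IF NOT EXISTS pg_tle;

-- Register a tiny custom extension from inside the database:
-- no files ever touch the server's file system
SELECT pgtle.install_extension(
  'my_hello_ext',                 -- extension name (hypothetical)
  '1.0',                          -- version
  'A trivial example extension',  -- description
$_pgtle_$
  CREATE FUNCTION hello() RETURNS text
  LANGUAGE sql AS $$ SELECT 'hello from a trusted language extension' $$;
$_pgtle_$
);

-- Then it can be created like any other extension
CREATE EXTENSION my_hello_ext;
SELECT hello();
```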

If you want to learn more about it, I definitely encourage you to check out "Creating custom extensions on Amazon RDS and Aurora with pg_tle" at pganalyze.com. Lukas goes into much more depth about different use cases for this new project. And furthermore, you can check out the code for the project on GitHub as well.

Next piece of content- "New WAL Archive Module/Library in PostgreSQL 15". This is from percona.com and they're talking about how you archive WALs. Now, for the longest time, we've had the archive command where you actually run an operating system command that takes active current walls and then archives them. Well, now you can use a programmable library to do it as of Postgres 15. So for example, you can set this specific archive library and do a reload to get it working, but you need to use an acceptable library.

Now, it does come with a sample archive module called basic_archive, but really, based on my reading of this post, there seem to be a lot of caveats. So it's definitely not something to use in production, but it's something whose code you can assess to essentially build your own if you want to do that. Alternatively, it's a reference for people who are writing archive solutions for Postgres. So for example, I believe things like pgBackRest may start to use this, or other backup and archiving solutions. But if you want to learn more about how this works and some of the benefits of it, I definitely encourage you to check out this post.
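For reference, a minimal sketch of switching to the sample module might look like this; the archive directory path is just a placeholder, and this assumes archive_mode is already on (that parameter needs a server restart, while archive_library itself can be changed with a reload as the post describes):

```sql
-- Point Postgres 15+ at an archive module instead of archive_command
ALTER SYSTEM SET archive_library = 'basic_archive';
ALTER SYSTEM SET basic_archive.archive_directory = '/mnt/wal_archive';  -- placeholder path

-- Reload the configuration to pick up the change
SELECT pg_reload_conf();
```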
Next piece of content, it's time for PGSQL Phriday. This is episode #003, where the community is covering "What is the PostgreSQL community to you?". The introductory post is from sqlasylum.wordpress.com, so you can read more there about why he wanted to discuss this particular topic.

The first post I saw was "PostgreSQL Community: Passionate, Knowledgeable, and Thoughtful". And this is on softwareandbooze.com, so definitely encourage you to check out this blog post.

The next one is "PGSQL PHRIDAY #3: WHAT IS THE POSTGRESQL COMMUNITY TO YOU?". And this is from scarydba.com where he discusses his feelings on community.

And finally there is "WHAT IS THE POSTGRESQL COMMUNITY TO YOU? - PGSQL PHRIDAY #003" from rustprooflabs.com, and he has a number of different linked resources for keeping up with the Postgres community.

Next piece of content- "PostgreSQL 14 Internals, Part IV". This is from postgrespro.com and this is the PostgreSQL 14 internals book that is being translated and released in different parts as free PDF downloads. So if you want in-depth coverage of Postgres Internals, definitely encourage you to download the next part of this where they cover the workings of the planner and the executor.

Next piece of content- "PARTITION POSTGRESQL: WHICH PARTITION DID I INSERT MY DATA INTO?". This is from cybertec-postgresql.com and they're talking about which partition a particular data has been sent to. So they just set up a basic partition table with three partitions and a default partition that they set up just using a list of values. And when they do the INSERT, they actually do a returning statement and they look at the tableoid and they cast it to a regclass. So that actually gives you the table name that it was inserted into. And I didn't know that this was possible to do. So that's pretty interesting because they're loading something that does not have a defined partition, so it gets placed in the default partition, which they call t_rest. And you can even do queries on this and include the tableoid cast to a regclass so you can look at particular data and see what table that resides in. So if you want to learn more about that, definitely check out this blog post.

Next piece of content- "Postgres Strings to Arrays and Back Again". This is from crunchydata.com. Here they're talking about a situation where you have a CSV in a CSV. So someone has formatted more than one value in a spreadsheet of some sort and has exported it to a CSV file. And now you have data that looks like this within a particular column. So the question is, how do you deal with that? And what he did is he just loaded it into a two-column table and used the COPY command to get it in there. So now your table essentially looks like this, and he uses the function string to array to actually break that array into individual values.

But then of course, what do you do with it at that point? You can now unnest it. And if you unnest that string_to_array result, it actually creates one row per value, each alongside its station, and you can expand the results in the post to see what the query returns. With this in place, you can now run different queries against it. Here he's using a CTE to look at, okay, what was the max_temp and the min_temp for a particular station. So if you want to learn more about that, definitely check out this blog post.
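Here is a minimal sketch of the string_to_array plus unnest pattern being described; the table, column names, and values are hypothetical stand-ins for the post's weather data:

```sql
-- One comma-separated string of temperature readings per station (hypothetical data)
CREATE TABLE weather_raw (station text, temps text);
INSERT INTO weather_raw VALUES
  ('KSEA', '50,55,61,58'),
  ('KPDX', '52,57,64,60');

-- Split each string into an array, then unnest to one row per reading
WITH readings AS (
  SELECT station,
         unnest(string_to_array(temps, ','))::numeric AS temp
  FROM weather_raw
)
SELECT station,
       max(temp) AS max_temp,
       min(temp) AS min_temp
FROM readings
GROUP BY station;
```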

Next piece of content- "Using Postgres as a task queue for rowdy Gophers". This is from ente.io, and he's talking about building a queuing system for Postgres. But the TLDR set up a polling system using UPDATE... SKIP LOCKED. Now, this post is quite substantial. He talks about different paths he went on that didn't work or didn't work as well, and how he basically ended up using FOR UPDATE with SKIP LOCKED. So if you want to learn more about that, you can check out this blog post.

Next piece of content- "GRANT VACUUM, ANALYZE IN POSTGRESQL 16". This is from cybertec-postgresql.com they're talking about in version 16, you will now be able to grant particular users the ability to run VACUUM or ANALYZE. Right now, only owners and superusers can run these commands, but now you can grant them to particular users. There will also be two new predefined roles. One is pg_vacuum_all_tables and pg_analyze_all_tables. So this is great for people who are doing Ops work but don't actually need super admin capabilities. So check this blog post out if you want to learn more about that.

Next piece of content- "PostGIS Day 2022". This is from crunchydata.com, and this is a review of PostGIS Day that happened on November 17. They talked about a number of talks they found interesting, as well as different tools that were showcased as a part of the event. So if you want to learn more about that, definitely check out this blog post.

Next piece of content, there was another episode of Postgres FM this week. This one was on "PostgREST", and they cover what it is as well as the pros and cons of using it. So basically, this is a standalone web server that turns your Postgres database into a RESTful API endpoint, so you can just write client code against it. So if you want to learn more about that, you can click to listen to the episode here or click here to watch the YouTube video.
Next piece of content. The PostgreSQL person of the week is Oleksandr Shulgin. If you're interested in learning more about Oleksandr and his contributions to Postgres, definitely check out this blog post.
The last piece of content. We did have another episode of The Rubber Duck Dev Show this past Wednesday evening. This was on "What Happens After RubyCritic With Ernesto Tagwerker". So RubyCritic is a tool that analyzes your code and gives you recommendations on how to make it better and more maintainable. And we covered how, after you've done that, you actually go about deciding what to change and how to do it. So if you're interested in that, we definitely welcome you to check out our show.
