background

Postgres Playground, PG14 Internals, DB Corruption, Anti-Join | Scaling Postgres 229

Join Over 1,000 Engineers & Get New Episodes Weekly!

In this episode of Scaling Postgres, we discuss a new Postgres playground, a book about PG14 internals, how to corrupt your database and using anti-joins.

Content Discussed

YouTube Video

Podcast Audio

Transcript

All right. I hope you, your friends, family, and coworkers continue to do well. Our first piece of content is "Learn Postgres at the Playground". This is from crunchydata.com, and they're talking about a new tool that runs Postgres in your local web browser. It uses WebAssembly from what they're saying and it takes particular data sets that are loaded in and it gives you tutorials to follow along. So it gives you a sandbox or a playground to play around in and get to learn Postgres. Some of the initial tutorials that they're covering are the basics of psql. So just using the psql client. Doing partitioning, doing some performance analysis, joins, indexing, PostGIS, as well as window functions and CTEs.

So if we take a look at the page, it looks like this at crunchydata.com/developer/tutorials, and it has different links that you can click on. For example, if I click on psql basics on the right pane, it actually starts a psql database presumably, and loads it with some sample data using WebAssembly. Then on the left pane, you can just follow along and enter commands to see what the results are. So as you can see, this is an actual database. You can look at the databases installed. So you just see Postgres and the templates and you can see what tables are available, and you can do SELECTs against it. So this is a pretty cool toolset, in my opinion.

Now, also related to this, there was a YouTube video recently released called "The past, present, and future of Postgres and the Postgres Playground". And this is on the Hasura YouTube channel. This is about an hour-long interview with Craig from Crunchydata. Whereas the first 20 minutes or so is just talking about Postgres in general, the latter half goes into more depth on the Postgres playground. So if you want to learn more about that, definitely check out these pieces of content.

Next piece of content- "PostgreSQL 14 Internals, Part II''. This is from postgrespro.com and maybe a month or two months ago we mentioned the first part was released. This is a free PDF describing the Postgres 14 internals. So they released part one, and this is now released as part two, where they cover the "design of the buffer cache and explains the need for write-ahead logging". So if you want to learn more about the internals of Postgres, definitely recommend checking out this free PDF.

The next piece of content- "HOW TO CORRUPT YOUR POSTGRESQL DATABASE". This is from cybertrc-postgresql.com. And this is the post that literally describes how to corrupt your database. So it lists about seven different methods. The first one is creating a corrupt database by setting fsync=off and then how you corrupt it is basically you put a load on it. They're using pgbench in all of these cases, then you cut the power to the database and eventually, you'll get a corruption error. The next one is creating a corrupt database from a backup and this has to do with not creating a backup label and he goes through the process of how to corrupt it this way. Another way is using pg_resetwal. So for example, if you do an immediate shutdown and then you reset the WAL, essentially the WAL is not available to restore the database. So now you have a corrupted database. The next is using pg_upgrade with a link option, but then turning on the old database before you've dropped that old database cluster.

Because they're using shared files because of the link that could cause corruption. You can just go in and manipulate the database files and edit them any way that you want. You're going to get corruption that way or even modify things in the Postgres catalog. So here they're deleting something from the pg_attribute table. So this is also more of a list of what not to do. So as he says here, don't mess with the system catalogs, don't modify things in the data directory, don't run with fsync=off, don't call pg_resetwal on a crash server, and don't delete or omit the backup label. The last three items are things you should do- remove the old cluster after an upgrade with pg_upgrade, run a supported version of Postgres and run on reliable hardware. So if you do these things, you'll definitely minimize your chance of corruption. But if you want to learn more, definitely check out this blog post.

Next piece of content- "Rise of the Anti-Join". This is from crunchydata.com and they're talking about anti-joins, where you want to join to a table and exclude everything that exists in the second table. Now they actually approach this in a few different ways, but to get started, they generated a series and created a table A with all the same values, B with all the same values, and then just deleted two values from the B table and says, 'Okay, how do we pull those out now?'. Really, they wanted to use this for a spatial query where they exclude everything that is not a county. So I'm assuming this is pulling all the data that exists in the water in this image I'm looking at here. So one way they tried to do it is to use a NOT IN and that just takes forever. You don't want to use that.

Another way is using the "except" keyword, but that's pretty slow as well. I mean, it gives you the right answer, but the fastest way to do it is an anti-join. There are two ways to write this. You could do a WHERE NOT EXISTS and then your query for matching them up or, this is the syntax that I always tend to use, you just do a join to the table where the values are equal. But then you look at table B where the value is null, and that's definitely the fastest way to do it. Then he says you can also apply this technique when doing a LEFT outer join and then checking if the second table's value is null as a way to pull out these points that are not a part of the county. So it looks like items that were in the water here, but if you want to learn more, definitely check out this blog post.

Next piece of content- "AWS RDS PostgreSQL Development with pgAdmin 4". This is from enterprisedb.com and this is pretty cool. I don't use pgAdmin, I tend to just use psql. But for those who use pgAdmin, you can now actually create instances in it on cloud platforms. It looks like they have one for Azure already for creating hosted database systems in Azure, now you can do it for Amazon RDS. Basically, you just put your keys in and it just uses the AWS API to provision a given RDS database instance for you. So definitely check out this blog post if you want to learn more about that.

Next piece of content- "Leverage a new Way to Import an Existing Postgres Database to Kubernetes". This is from enterprisedb.com and they are using their cloud-native PG operator for Kubernetes and they are going through a scenario where you're running Postgres 10 on RDS and how do you transfer that to a three-cluster Kubernetes managed Postgres 14 installation? They go through the YAML file that you would need for the operator to be able to set up that environment and go through the process of transferring the data over. So definitely check out this if you're interested in that.

The next piece of content- "PostgreSQL 16: part 1 or CommitFest 2022-07". This is from postgrespro.com. This is talking about Postgres 16, which is due to be released about 15 months from now. So over a year from now. But they've already started the CommitFest to see what features are going to be landing in Postgres 16. This goes over the notable features that were worked on during the CommitFest, and there's a lot of detail with regard to each one. So if you're interested in what the future holds, you could definitely check out this blog post.

Next piece of content- "ALIASES FOR SUB-SELECTS IN FROM CLAUSE". This is from cybertec-postgresql.com and we talked about that previously, but if you're not aware, when you use a subquery, you need to give it an alias. So this is a sub-query and it has no alias and it produces an error. But in Postgres 16, so here they're using 16 development, you no longer require that alias and it works as expected. So definitely something to look forward to in the next 15 months or so.

The next piece of content, the next episode of Postgres FM was published, and this episode covers a "Monitoring checklist". So they go over a checklist of all the different monitoring that you would want to do for your Postgres installation. So they're talking about TPS and QPS, latency connections for your sessions, longest transactions, tracking those tracking commits vs. rollbacks, transactions left until a transaction ID wraparound, replication lags, and the count of EALs waiting to be archived, which is an archiving lag. They also cover WAL generation rates, locks and deadlock counts, basic query analysis graphs such as top-n by total time or meantime, and basic weight event analysis. So I haven't had a chance to listen to this episode yet, but I'm definitely going to be doing that because all that stuff sounds pretty cool.

Next piece of content- "5mins of Postgres E31: Postgres security patch release, spotting vulnerable extensions and securing the public schema". This is from pganalyze.com. Last week, new versions of Postgres were released to handle a security bug as well as introduce some additional features. In this episode, Lukas covers that update, and he also talks about some additional extensions that you can use to identify, as he says, problematic vulnerabilities. So there's the pgspot extension that looks for potential vulnerabilities in extensions that you're using. He also talks about an extension called pghostile, which basically is great for pen testing that tests your database to see how vulnerable it is.

He also mentioned that with Postgres 15's new default, they're revoking PUBLIC CREATE from the public schema, and that should alleviate a lot of potential security issues that come up. Basically, it's just safer to revoke the PUBLIC CREATE from the public schema. Otherwise, as he says here, users can just go ahead and create objects like a table or a function in the public schema. So definitely for security, a great new default is coming, but everyone will definitely have to test their database on version 15 to make sure that all their permissions are set correctly to make sure they don't have any issues. But if you want to learn more, definitely check out his episode.

The next piece of content. The PostgreSQL person of the week is Hou Zhijie. If you want to learn more about Hou and his contributions to Postgres, definitely check out this blog post.

The last piece of content. We did have another episode of The Rubber Duck Dev Show this past Wednesday evening. This one was on "Code Quality With Ernesto Tagwerker". He has worked on a number of Ruby libraries that assess code quality, including RubyCritic and Skunk. So if you're interested in code quality as well as testing, I definitely recommend checking out our show.

episode_image