
Lateral Joins, Generating JSON, dbcritic, Fetch First Rows | Scaling Postgres 174


In this episode of Scaling Postgres, we discuss understanding lateral joins, generating JSON with Postgres, the dbcritic utility, and fetch first rows vs. limit.


Transcript

All right. I hope you, your friends, family, and coworkers continue to do well. Our first piece of content is "UNDERSTANDING LATERAL JOINS IN POSTGRESQL". This is from cybertec-postgresql.com and they're talking about lateral joins. Now, I really like this blog post for how they explain lateral joins. They take an example where you have a SELECT statement that selects some columns from a particular table called tab. Well, you can think of this as a for loop in a programming language: treat tab as an array of rows, loop over it, and grab the data from each element of the array. So now, what is a lateral join? Well, a lateral join is basically a nested loop.

So you want some set of data, and then for each row in that set, you want additional items of data from another table. They have the following example where they set up 1,000 products with particular prices, plus a wishlist table into which they inserted three individuals, each with a desired price for a product. They show what the 1,000 products basically look like and what the wishlist looks like. So, going back to thinking of a SELECT statement as a loop, essentially you're going to loop over each item in the wishlist, and for each one you're going to loop over the products, looking for products priced below that desired price and returning three of them. That's essentially what the lateral join does.

So you are selecting from the wishlist table and then doing a lateral join to a subquery against the product table, where the price is lower than the desired price on the wishlist, ordering by the price and limiting it to three. So each wishlist entry is shown along with three prices that are below the target price. Basically, the first three columns come from the wishlist and the remaining columns come from the product table. That's all a lateral join is, it's basically a nested loop. When you do an EXPLAIN of the plan, you can see that it is indeed doing a nested loop. Covering this post kind of makes me think: should they perhaps have given lateral joins a different name in the SQL standard?
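
To make that concrete, here's a rough sketch of that kind of query; the table and column names are assumed here rather than copied from the post:

    SELECT *
    FROM wishlist AS w,
         LATERAL (SELECT *
                  FROM product AS p
                  WHERE p.price < w.desired_price   -- references the outer row
                  ORDER BY p.price DESC
                  LIMIT 3) AS cheaper;

But if you want to learn more, check out this blog post.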

The next piece of content- "Cut Out the Middle Tier: Generating JSON Directly from Postgres". This is from blog.crunchydata.com and they're talking about using functions that exist within Postgres to actually return JSON to your application. So as opposed to just generating the normal output that you typically receive, you output it as JSON. So they have an example here where they have employees and departments. So employees belong to departments and the first function they show is called row_to_json. So you can give it a single row of data. Here they're selecting from where the employee ID is one and it will output in JSON each of the columns in that row.

Then it does something interesting with the geometry column, where it actually nests some JSON because they have a custom type set up with a cast to JSON, so this happens automatically. Now, what you're generally going to want to do is return multiple rows as JSON, essentially an array of JSON objects. That's what json_agg, the JSON aggregate function, does. So here, as a subquery, they're just selecting the employee ID and the name from the employees table where the department ID equals one. Then doing a JSON aggregate on that output gives you the JSON you see here, an array of two JSON objects.
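
A sketch of that aggregate, again with assumed names:

    -- Collapse several rows into one JSON array of objects.
    SELECT json_agg(e)
    FROM (SELECT employee_id, name
          FROM employees
          WHERE department_id = 1) AS e;

You can even nest the output.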

So here's an example using a CTE where they're selecting the data they want from the employees table, then joining it to the departments and including in the output an aggregate of the employees for each department. Then the last section of the CTE aggregates all of the departments that are being selected. So you can see it generates the JSON as you see here, where you have departments and each employee in those departments for each object. Now, the main benefit of doing this in the database rather than in your application is speed, because if you're using an application framework such as Django or Rails, those usually use ORMs, and there's a lot of overhead in the ORM instantiating objects for each returned piece of data before returning it to the user. Going directly to JSON is much faster and should give you better performance.
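
The general shape of that nesting looks something like the sketch below; it uses assumed table and column names and isn't the post's exact query:

    WITH emps AS (
        -- One JSON array of employees per department.
        SELECT department_id,
               json_agg(json_build_object('id', employee_id,
                                          'name', name)) AS employees
        FROM employees
        GROUP BY department_id
    )
    -- Wrap each department, with its employees nested inside, into one array.
    SELECT json_agg(json_build_object('department', d.name,
                                      'employees', e.employees)) AS departments
    FROM departments AS d
    JOIN emps AS e ON e.department_id = d.department_id;

So if you're interested in that, definitely check out this blog post.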

The next piece of content- "dbcritic: Constructively criticizing your Postgres schema (using Idris)". Now this is from channable.com. They're talking about how they wanted to help audit how they were setting up their database schema and give guidance on some best practices. So they created this solution called dbcritic. I've heard of other language critics such as Ruby critic to basically criticize how you're writing Ruby. So you run this against your code to give you an output of it. Well, this does it for your Postgres schema. Right now, they're only criticizing two areas. The first area they're mentioning is having an index on your foreign keys. Now the importance of this is that when you go to delete one ad, if it has a foreign key, it's going to need to look at ad_statuses by this ad_id to delete those as well. Without an index, that's going to be slow and it'll be really slow. If you need to delete multiple ads that each have multiple statuses, the time increases dramatically.

But if you have an index on this ad_id column, it should be able to delete those much faster. That's one of the things that dbcritic criticizes. The other one is mistakenly using timestamp when you want a timestamp that includes the time zone, so timestamptz. Basically, it audits your schema to make sure that all of your timestamps are timestamptz. So those are the areas that dbcritic covers now, and they have integrated it into their CI pipeline. Now, it's an interesting project, but it is written in Idris, which they say is a functional programming language similar to Haskell. I've never used it before, so I think that would probably be the only thing that would give me pause about using it at this time. However, these types of audits should be relatively easy to implement in other languages. So if you're interested in auditing your Postgres schema, maybe you can take some of the suggestions from this post and implement them yourself.
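
The fixes dbcritic nudges you toward look roughly like this; ad_statuses and ad_id come from their example, while the created_at column is just a hypothetical stand-in:

    -- Index the foreign key column so deleting an ad doesn't have to
    -- sequentially scan ad_statuses for matching rows.
    CREATE INDEX ON ad_statuses (ad_id);

    -- Store timestamps with a time zone so they name an unambiguous
    -- point in time (created_at is hypothetical).
    ALTER TABLE ad_statuses
        ALTER COLUMN created_at TYPE timestamptz;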

The next piece of content- "POSTGRESQL: LIMIT VS FETCH FIRST ROWS... WITH TIES". This is from cybertec-postgresql.com and they're talking about how Postgres allows you to use LIMIT to return only a certain number of rows from a query. But the actual SQL standard is to use FETCH FIRST ROWS. So for example, if you want to use LIMIT in this query with seven rows, you could say LIMIT three and it returns you three rows. You could also use FETCH FIRST three rows only and it will return you the exact same thing LIMIT would. So it is more verbose, but it is the SQL standard. Now they are also covered with TIES. So this is something you cannot do with LIMIT but you can do with FETCH FIRST ROWS. So when you specify FETCH the first three rows with TIES, what it will do is not just give you the first three rows, but it will include additional rows that would have been equal to the last one.

So if you look at the table they set up here, they have 3 appearing twice and 4 appearing twice. That's the data in the table. When you fetch only three rows without ties, it's going to fetch these first three, but when you fetch them WITH TIES, it will fetch four of them. Then if you actually add another number 2 row and do FETCH FIRST 3 ROWS WITH TIES, it will give you one, two, two, because it won't retrieve the three anymore. Now, as long as you're ordering by just the id, when you fetch the first four rows WITH TIES, you're going to get that repeated value as well, so five rows. But if you order by the id and then a random column, you're only going to get four rows even though you're using WITH TIES, because the extra column breaks the tie. So that's something to keep in mind: the ORDER BY clause is very important when using WITH TIES.
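
Here's a small sketch of the syntax difference; the table name is assumed, and WITH TIES needs Postgres 13 or later plus an ORDER BY clause:

    -- These two return the same three rows.
    SELECT * FROM t_test ORDER BY id LIMIT 3;
    SELECT * FROM t_test ORDER BY id FETCH FIRST 3 ROWS ONLY;

    -- This one also returns any further rows whose id equals the third row's.
    SELECT * FROM t_test ORDER BY id FETCH FIRST 3 ROWS WITH TIES;

This is another great blog post. Go ahead and check it out if you're interested in learning more.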

The next piece of content- "What happens to logical replication after running pg_upgrade?". This is from elephanttamer.net. They're talking about a scenario where they have a large active database that's doing logical replication to another database as well as maybe even doing some logical decoding. So it's very active. A lot of queries and statements are being run and you want to upgrade them to the next version. So the issue that you will encounter is that if you need to use pg_upgrade to do that upgrade because that's the fastest way to do it, generally, that pg_upgrade will destroy all the replication slots. So you're going to lose any logical replication going on. So it's critical to have those replication slots back in place before you allow any more traffic at all to the main database system. So he's worked out a process to do these types of upgrades. So the first step is to block all traffic to the primary database and he does this by altering the pg_hba.conf file or you could alter your firewall settings.

The second step is to wait till the replication lag is zero and then drop the subscription on the subscriber, which also drops the replication slot on the publisher. Then you do your upgrade on the subscriber, do the upgrade on the publisher, and then, back on the subscriber, create the subscription again. But you do it with copy_data = false because the data should already exist there. Then you do a smoke test to make sure that logical replication is working: you insert some data, make sure it gets replicated, and remove it again. Then you can create any other logical replication slots that are required, and finally you unblock the traffic again. So this is a process that worked for him. He did mention that there's also the ability to alter the subscription and set a new replication slot, but he had some issues getting that working with his upgrade procedure.
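
The subscriber-side commands look something like this; the subscription, connection, and publication names are placeholders, not taken from the post:

    -- Before the upgrade: dropping the subscription also removes the
    -- replication slot on the publisher.
    DROP SUBSCRIPTION my_subscription;

    -- After upgrading both servers: recreate it without copying data,
    -- since the subscriber already has the rows.
    CREATE SUBSCRIPTION my_subscription
        CONNECTION 'host=publisher-host dbname=mydb user=replicator'
        PUBLICATION my_publication
        WITH (copy_data = false);

So if you use logical replication and need to upgrade your database, maybe you want to check out this post and follow his suggested procedure.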

The next piece of content- "Secrets of 'psql'- a Great Tool". This is a YouTube video on the San Francisco Bay Area PostgreSQL Users Group. This is a great post talking about using psql, the command line tool. Because if you're going to be using a lot of Postgres, it's a great tool to get to learn all the ins and outs of. This presentation does a really good job of doing that.

The next piece of content- "How much data goes into the WALs?". This is from fluca1978.github.io. This post first covers a lot about the WAL, basically the write-ahead log of Postgres that basically records all data changes to Postgres. Then eventually they get written to the actual data files. He talks about the existence of the WAL files, how they're set up, and how they have log sequence numbers within them that reference different areas of the log, including some utilities you can use to inspect the log. So this is a good post explaining a lot of that, to begin with. But then he goes on to talk about how WAL gets generated. And when you insert data into the table, you can see how much WAL is generated given different types of inserts, and how if you have an unlocked table, you're basically not going to get any WAL generated from any changes made to particular tables. So if you want to learn more about the WAL and how it works, definitely check out this blog post.

The next piece of content- "Improved logging by libpq in PostgreSQL 14". This is from postgresql.fastware.com. They're talking about some enhancements to 14 with the client-server communications and to be able to trace what's going on. They have added some timestamps for you to be able to debug different issues in client-server communication. So if you're interested in that, definitely check out this blog post.

The next piece of content- "First contact with the pg_filedump". This is from highgo.ca and they're talking about using a utility to inspect Postgres's data files. So for example, he created a table, inserted three rows into that table, then identified what file exists by using the pg_relation_filepath function to identify where a file is, and then used pg_filedump to run against that file. So now you can inspect the header block for that table, look at individual data items and the different parameters, and how they work to inspect Postgres. So if you're interested in learning more about the internals of Postgres, you can definitely check out this blog post.

The next piece of content- "Logical Replication Of In-Progress Transactions". This is from amitkapila16.blogspot.com. In Postgres 14, they released some enhancements to be able to start sending data to a logical replication subscriber before a transaction is committed. So generally, in versions earlier than 14, you needed to wait till the commit happened for information to start transferring to subscribers. But if you have a lot of activity going on in a transaction, that's less than ideal. So they've started a way to be able to start streaming changes before they're actually committed in Postgres 14 and higher. This blog post walks through the description of the feature, all the changes they had to make in order to make it work in Postgres as well as some of the contributors to it. So if you're interested in that, you can check out this blog post.

The next piece of content, the PostgreSQL person of the week is Jean-Christophe Arnu. So if you're interested in learning more about Jean-Christophe and his contributions to Postgres, definitely check out this blog post.

The last piece of content. This past week, The Rubber Duck Dev Show gave a presentation on pair programming. So if you want to learn the whens, whys, and hows of pair programming, you can check out this episode.
