
Recursive Queries, Lateral Joins, Pagination, JSONB Arrays | Scaling Postgres 127


In this episode of Scaling Postgres, we discuss recursive queries, using lateral joins, paging through data sets efficiently, and working with JSONB arrays.


Transcript



[00:00:00] In this episode of Scaling Postgres, we talk about recursive queries, lateral joins, pagination, and JSONB arrays. I'm Kristen Jameson, and this is Scaling Postgres episode 127.

[00:00:21] Alright, I hope you, your family, friends, and coworkers continue to do well. Our first piece of content is that new PostgreSQL versions have been released: 12.4, 11.9, and 10.14, updates to 9.6 and 9.5, as well as beta 3 of version 13. Now, there are two security issues. One is an uncontrolled search path element in logical replication, and the second is an uncontrolled search path element in CREATE EXTENSION. So be sure to check those out to see if you need to upgrade sooner rather than later, depending on how you use Postgres. And then at the bottom, they mention all of the different bug fixes and improvements that were done, so definitely review and see when you want to upgrade your version of Postgres.

[00:01:07] The next piece of content is "Understanding recursive queries in PostgreSQL". So this is looking at recursive queries and how you do them in Postgres, which is via CTEs. First they review common table expressions and basically convey it as a view that only exists within the scope of a single query, because you can usually always write a CTE as a subquery. Then they cover the syntax of a recursive query: you use WITH RECURSIVE, the name of your CTE, and then its definition. The definition is two queries combined with a UNION or a UNION ALL. The first one is the non-recursive branch, and the second one is the recursive branch that references itself, so therein lies the recursion. And then lastly you select everything from the CTE you've defined. They go a little bit over how recursive queries are processed internally; it uses a working table, so it really works through things procedurally. Then it goes into an example. And of course, a lot of recursive examples use people managing people in a hierarchy, so they're using that here. You have this manager, or MGR, column that defines who the manager of each person is. What they want to find out is all the subordinates of person 7566. So the non-recursive branch basically selects the employee number where it's 7566, and the recursive branch selects from the table, joining it to the CTE where the manager ID equals the employee ID. Then they show the complete query building out the CTE, and the results, which are what you'd expect. Next they show how you can add generated columns as long as you use UNION ALL: at each step of the recursion they add one to a level column they're defining, so you can see a first-level subordinate and a second-level subordinate. Then they show defining the path, so, for example, being able to output that Jones supervises Ford, who supervises Smith. Basically, in the recursive query they append the next employee name onto this column that they've defined. And then lastly, they go into how this compares to doing it with Oracle. They do mention that Oracle does not conform to the SQL standard, but Oracle has its own way of doing this. They also cover designing a Fibonacci sequence using recursion. So if you want a refresher on recursive queries, definitely check out this post from Cybertec-postgresql.com.
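The hierarchy walk described above can be sketched with Python's built-in sqlite3 module, which supports the same WITH RECURSIVE / UNION ALL shape as Postgres. The table, rows, and the "level" and "path" columns here are illustrative stand-ins for the post's EMP example, not its actual data:

```python
import sqlite3

# Hypothetical employee table modeled on the classic EMP example:
# empno, ename, and mgr (the manager's empno).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE emp (empno INTEGER PRIMARY KEY, ename TEXT, mgr INTEGER);
INSERT INTO emp VALUES
  (7566, 'JONES', NULL),
  (7788, 'SCOTT', 7566),
  (7902, 'FORD',  7566),
  (7369, 'SMITH', 7902);
""")

# The non-recursive branch seeds the working table with employee 7566;
# the recursive branch joins emp.mgr to the CTE's empno, adding one to
# the level and appending to the path on each iteration.
rows = conn.execute("""
WITH RECURSIVE subordinates(empno, ename, level, path) AS (
    SELECT empno, ename, 0, ename FROM emp WHERE empno = 7566
  UNION ALL
    SELECT e.empno, e.ename, s.level + 1, s.path || ' -> ' || e.ename
    FROM emp e JOIN subordinates s ON e.mgr = s.empno
)
SELECT empno, ename, level, path FROM subordinates ORDER BY level, empno
""").fetchall()

for r in rows:
    print(r)
# (7566, 'JONES', 0, 'JONES')
# (7788, 'SCOTT', 1, 'JONES -> SCOTT')
# (7902, 'FORD',  1, 'JONES -> FORD')
# (7369, 'SMITH', 2, 'JONES -> FORD -> SMITH')
```

The same query text runs unchanged in Postgres, which is part of what makes recursive CTEs attractive over vendor-specific syntax like Oracle's CONNECT BY.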

[00:03:38] The next post is "Iterators in PostgreSQL with lateral join". A lateral join, as they say here, means the output will apply the right-hand part of the join to every record in the left-hand part of the join. They use an example with generate_series: you're generating a series of numbers from one to four, so in the left column you'll see 1, 2, 3, 4. But then they do a CROSS JOIN LATERAL to another generate_series that goes from one through whatever the value is in the first series. So, for example, when the left column is one, it's going to generate a series from one to one, so you've got one and one, because there's nothing other than the one. When the left value is two, it generates two rows, because the lateral join executes the right side again for that row, going from one up to two. When you have three in the left-hand column, it's going to generate three rows; when you have four, four rows; and so on. Now, they use functions as examples here, including one where they show movie recommendations based upon a user ID, returning a name and a rank. You could also do this as a subquery; here, for instance, they're picking the five most popular streams by zip code. So this was a relatively brief post giving a good explanation of lateral joins, and they even have a reference to an online class that Crunchy Data has put together to help understand it better. So if you want to learn more about lateral joins in PostgreSQL, check out this blog post from Crunchydata.com.
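The dependent-iteration semantics described above can be mimicked in plain Python: the lateral right-hand side is simply an inner loop whose bounds depend on the current left-hand row. The generate_series helper here is a stand-in for the Postgres function of the same name:

```python
# A lateral join evaluates the right-hand side once per left-hand row,
# with that row's values in scope -- exactly a dependent nested loop.
# This mirrors: SELECT i, j
#               FROM generate_series(1, 4) AS i
#               CROSS JOIN LATERAL generate_series(1, i) AS j;
def generate_series(start, stop):
    # Inclusive range, like the Postgres set-returning function.
    return range(start, stop + 1)

pairs = [(i, j) for i in generate_series(1, 4) for j in generate_series(1, i)]
print(pairs)
# [(1, 1), (2, 1), (2, 2), (3, 1), (3, 2), (3, 3),
#  (4, 1), (4, 2), (4, 3), (4, 4)]
```

Each left value i contributes i rows, which is why the row count grows as you move down the left column.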

[00:05:18] The next post is "Keyset pagination". So this is basically how to do keyset pagination, paging through a data set. Now, there was a previous post, and the way he writes, his blog posts appear on the same page; that earlier one is called "Pagination tips" and covers a couple of different ways to do it, but I would probably only choose keyset pagination. He goes into more detail here for performance reasons, because using OFFSET with a LIMIT performs really poorly as you get to larger and larger record counts. The way he does it is basically remembering the last product ID that was shown and then selecting records greater than that last product ID, with a limit. Now, the reason he uses a limit of eleven when he wants to display ten is to know whether there's a next page or not, so he can decide whether to show a next-page link. Then he talks about a method using a CTE to be able to page backwards, potentially. But there's usually a disconnect between good performance and actually doing things by pages, because it gets very complicated: what happens if a record you're looking at is deleted, or a new record is inserted in its place for whatever reason? How would you handle that in the paging? Do you need to cache values in the application? So there's a lot to consider when doing pagination. Now, I'm also going to reference two other blog posts that cover this in a lot of detail and have their own recommendations. One is a post from around this time last year called "Pagination with Relative Cursors", and essentially it's like keyset pagination. Basically, they're saying you want to avoid doing limit/offset when you have large offsets, because the performance is just horrible. They show that once you have an offset of 100,000, a query takes over 2 seconds, and it continuously gets worse the larger the offset. Now, they're calling it relative cursor pagination, but basically it's the same technique: you're using an ID and a limit.
So you're seeing what products are above a particular point. Now, they mention the disadvantage is that it's hard to go to a specific page. You can go forward and backward, but you aren't really able to jump to, say, page three or page six, because you're always moving forward or backward. And then they discuss how you can even sort and skip records by something other than the ID, and they have a technique for doing that. Now, there's even a third post related to this, which I think may be close to five-plus years old: "Paging Through Results" from Use-the-index-luke.com, which covers different databases. For PostgreSQL he again shows that the least efficient approach uses an OFFSET and a LIMIT. He uses the SQL-standard syntax FETCH NEXT 10 ROWS ONLY, but it's the same technique of selecting greater than or less than some particular date, with a limit. And he says you can use row values syntax in Postgres to consider both a date and an ID at the same time. He shows in a great graph at the bottom how the performance with OFFSET gets progressively worse as you go up in page numbers, whereas the performance using what he calls the seek method, or keyset pagination, stays pretty good no matter how large the page number gets. So if you're interested in designing an efficient way of paginating in your application, definitely check out these blog posts.
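The keyset technique described above, remember the last ID, select rows greater than it, and over-fetch by one row to detect whether a next page exists, can be sketched with Python's stdlib sqlite3; the same WHERE / ORDER BY / LIMIT shape applies in Postgres, and the products table here is made up:

```python
import sqlite3

# Hypothetical products table with 100 rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO products VALUES (?, ?)",
                 [(i, f"product {i}") for i in range(1, 101)])

PAGE_SIZE = 10

def next_page(last_id):
    # Fetch PAGE_SIZE + 1 rows: the extra row only tells us whether a
    # further page exists; it is never shown to the user.
    rows = conn.execute(
        "SELECT id, name FROM products WHERE id > ? ORDER BY id LIMIT ?",
        (last_id, PAGE_SIZE + 1),
    ).fetchall()
    has_more = len(rows) > PAGE_SIZE
    return rows[:PAGE_SIZE], has_more

page, has_more = next_page(0)       # first page: ids 1..10, has_more True
page2, _ = next_page(page[-1][0])   # next page starts after id 10
```

Unlike OFFSET, the WHERE id > ? predicate lets an index seek straight to the start of the page, so cost stays flat regardless of how deep you page.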

[00:08:40] The next piece of content is "Working with a JSONB array of objects in PostgreSQL". Now, this is an application developer describing how to work with JSONB in their data, and he uses a number of different JSONB functions to do it. His example is having purchases with an array of products in a JSONB field. Now, I wouldn't necessarily advocate this particular data model, so I'm not sure I would follow it myself, but as an example of how to manipulate the results using all the different JSONB functions, this was a great post to read through. So if you're interested, check out this post from levelup.gitconnected.com.
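As a rough illustration of that data model, an array of product objects stored in a single JSONB column, here is a plain-Python sketch using the stdlib json module. The field names and values are invented, and the list comprehension plays roughly the role that unnesting with jsonb_array_elements would in SQL:

```python
import json

# Hypothetical shape: each purchase row stores an array of product
# objects in one JSONB column. Prices are in cents.
purchase_jsonb = json.dumps({
    "purchase_id": 1,
    "products": [
        {"sku": "A1", "name": "widget", "price": 999,  "quantity": 2},
        {"sku": "B2", "name": "gadget", "price": 2500, "quantity": 1},
    ],
})

doc = json.loads(purchase_jsonb)

# Unnest the array and compute per-line and order totals, as the SQL
# version would with jsonb_array_elements plus an aggregate.
line_totals = [p["price"] * p["quantity"] for p in doc["products"]]
order_total = sum(line_totals)
print(order_total)  # 999*2 + 2500*1 = 4998... no: 1998 + 2500 = 4498
```

The trade-off the episode hints at is visible even here: anything relational you want to do with the products (totals, filtering, joins) requires unpacking the document first.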

[00:09:24] The next piece of content is "Running multiple PgBouncer instances with systemd", and this is from 2ndQuadrant.com. So PgBouncer runs as a single process, and it only uses one CPU core. If you have a multi-core machine, PgBouncer is not able to use all of the cores of that machine. Now, what people typically do is set up multiple PgBouncer services on different ports, so you have to run different services to be able to leverage all the CPUs. However, this post has a technique for running multiple PgBouncer instances without having to do that, using a common port. They cover an example of the unit file for the systemd configuration as well as the PgBouncer configuration that he's using, and there's a reuse-port setting you can enable to be able to run multiple instances. He modified the PgBouncer services using the systemd template system, and after making these modifications, you're able to start PgBouncer service one, start PgBouncer service two; he says they don't even have to be numbers, they could be some other identifier that you use. Now, he says the problem with this particular setup is that the Unix domain sockets can't be shared, but he has a solution for that: you can set up socket activation, which is supported in PgBouncer 1.14, and he shows the configuration for that. What that gives you is a common port for all services to connect to on that system, and then a per-instance TCP/IP and Unix domain socket for administration and monitoring. So basically, you start the services; each gets its own port for administration, but they can all use the same connection port to connect to the database. They show here that production traffic can just connect to a common port, whereas if you want to administer a particular PgBouncer instance, you use the port defined when you started the service.
So this is a pretty interesting technique that I'm definitely going to keep in mind if I'm going to be setting up multiple PgBouncer instances on a single server. So if you're interested in more details, definitely check out this post from 2ndQuadrant.com.
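The reuse-port mechanism the post relies on corresponds to the kernel's SO_REUSEPORT socket option, which lets multiple processes bind the same address and port while the kernel spreads incoming connections across them. A minimal, Linux-specific Python sketch of just that option (PgBouncer itself does the equivalent in C; the host and ports here are arbitrary):

```python
import socket

def make_listener(port):
    # SO_REUSEPORT must be set on every socket before bind(); with it,
    # several listeners can share one address/port and the kernel
    # load-balances new connections between them.
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    s.bind(("127.0.0.1", port))
    s.listen()
    return s

first = make_listener(0)              # let the OS pick a free port
port = first.getsockname()[1]
second = make_listener(port)          # a second "instance" on the same port
```

Without SO_REUSEPORT, the second bind() would fail with "address already in use", which is exactly why the traditional workaround was one port per PgBouncer process.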

[00:11:38] The next piece of content is "Partitioning Improvements in PostgreSQL 13". This is from Highgo.ca, and they first talk about the history of all the improvements that have been made to partitioning across the different versions. For 13, they cover partitionwise join improvements, partition pruning improvements, as well as BEFORE row-level triggers. So if you want all the details, and example queries exercising these improvements, definitely check out this post from Highgo.ca.

[00:12:11] The next post, also from Highgo.ca, is "PostgreSQL High Availability: the considerations and candidates". So they're looking at high availability and proposing what people generally want when they talk about a high-availability PostgreSQL solution. They came up with these requirements, or considerations, as well as what high availability means: for example, there should not be any single point of failure in the system, there should be continuous health monitoring of backend servers and systems, and there should be reliable failover in case of failure. They also consider four high-availability solutions against these requirements. The first is Pgpool-II. The second is PostgreSQL Automatic Failover, or PAF. The third is repmgr, the Replication Manager. And the fourth is Patroni. They give a very basic summary of each of these solutions, and in future blog posts they're going to examine how each one meets the high-availability needs of Postgres. So if you're interested, you can check out this post and look forward to the future ones covering each solution.

[00:13:19] The next post is "Consensus-based failover with Pgpool-II". This is talking about failover and how you can do it with Pgpool-II based on its watchdog feature, using consensus to decide, and it shows the different values you can configure for it. So if you're interested in doing that, definitely check out this blog post.

[00:13:39] The next post is "Building a recommendation engine inside Postgres with Python and Pandas". This was an interesting post where they actually built a basic recommendation system, but did it within PostgreSQL, using functions to do it. If you're interested in checking that out, see this post from Crunchydata.com.
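The post builds its recommender inside Postgres with PL/Python and pandas; as a toy illustration of the general idea, scoring unseen items by similarity-weighted ratings from other users, here is a pandas-free sketch. All names and ratings are made up, and this is not the post's actual algorithm:

```python
from math import sqrt

# Toy ratings matrix: user -> {movie: rating}. Entirely invented data.
ratings = {
    "alice": {"Alien": 5, "Blade Runner": 4, "Up": 1},
    "bob":   {"Alien": 4, "Blade Runner": 5, "Heat": 4},
    "carol": {"Up": 5, "Frozen": 4},
}

def cosine(u, v):
    # Cosine similarity over the movies both users rated.
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[m] * v[m] for m in common)
    return dot / (sqrt(sum(x * x for x in u.values())) *
                  sqrt(sum(x * x for x in v.values())))

def recommend(user, k=3):
    # Score each movie the user hasn't seen by summing other users'
    # ratings weighted by how similar those users are.
    seen = ratings[user]
    scores = {}
    for other, their in ratings.items():
        if other == user:
            continue
        sim = cosine(seen, their)
        for movie, r in their.items():
            if movie not in seen:
                scores[movie] = scores.get(movie, 0.0) + sim * r
    return sorted(scores, key=scores.get, reverse=True)[:k]

print(recommend("alice"))  # Heat ranks above Frozen: bob is more similar
```

Doing this inside Postgres, as the post does, keeps the computation next to the data instead of shipping the whole ratings table to an application server.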

[00:14:01] And the last piece of content: the PostgreSQL Person of the Week is Michael Brewer. So if you're interested in learning more about Michael and his contributions to PostgreSQL, definitely check out this blog post. That does it for this episode of Scaling Postgres. You can get links to all the content mentioned in the show notes. Be sure to head over to Scalingpostgres.com, where you can sign up to receive weekly notifications of each episode, or you can subscribe via YouTube or iTunes. Thanks.
