Recursive CTEs, DB Access Patterns, Dates & Times, Cluster | Scaling Postgres 106

In this episode of Scaling Postgres, we discuss how to use recursive CTEs, database access patterns, dates and times, and how to cluster your data.

Content Discussed

YouTube Video

Podcast Audio

Transcript



[00:00:00] In this episode of Scaling Postgres, we talk about recursive CTEs, database access patterns, dates and times, and cluster. I'm Creston Jameson, and this is Scaling Postgres episode 106.

[00:00:22] All right, well, with all the craziness going on, I hope you, your coworkers and family are safe, and I hope you're having a great week. Our first piece of content is "Recursive SQL queries with PostgreSQL". This is from the Towardsdatascience.com blog, and basically they're talking about recursive CTEs, or common table expressions. He first talks about why you would want to use a recursive query, and he gives a few examples. The first is where you have self-referential data. So, for example, you have an employee table and one of the columns indicates who the manager of that person is, or some other type of relationship. Or you have a category that belongs to another category, but it's referencing it within the same table. The next example is where you actually have a tree of information: some sort of taxonomy system, links between articles, and things of that nature. Now, what I really like about this post is that it starts off from a basic standpoint and explains everything going on. I would say it's one of the best recursive CTE posts I've seen in a while. First, they discuss what a CTE is, showing you the WITH statement to basically compartmentalize, or kind of do a subquery, to make it easier to reason about the queries you need to do. Then he shows the self-referential employee table, where a manager ID points to the primary key of the person who is that employee's manager, and he inserts some sample data. You can see the CEO has no manager ID, whereas all of the others are managed by other people in the chart. Then he goes into recursion, basically how you use a recursive CTE. He gives a basic example that has a base case and then does a UNION ALL for selecting the other rows, and when you run this query, you can see it prints out ten rows. Then he uses the employee table that he created to get a manager tree.
And so he uses the base case of Alice, the VP, and then unions it with a query that increases the level each time to see who that person manages. You can see the results of this recursive CTE, where you see the staff who are on the second level and those who are on the third level, and he even gives a convenient diagram so you can see what this output means graphically. He also does a test computing the degrees of separation between two employees. And lastly, he continues on, looking at graphs, defining the relationships that exist between multiple points. He does that in a recursive CTE and gets a recursive output defining all the paths between different points. So if you're wanting to learn more about recursive CTEs, I highly suggest checking out this blog post.
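As a rough sketch of the pattern the post walks through (the table and column names here are illustrative, not necessarily the exact ones the author used), a self-referential employee table and a recursive CTE over it might look like this:

```sql
-- Hypothetical self-referential table: each employee points at a manager
CREATE TABLE employees (
    id         serial PRIMARY KEY,
    name       text NOT NULL,
    manager_id integer REFERENCES employees (id)  -- NULL for the CEO
);

INSERT INTO employees (name, manager_id) VALUES
    ('Carol', NULL),  -- CEO, no manager
    ('Alice', 1),     -- VP, reports to Carol
    ('Bob',   2),     -- reports to Alice
    ('Dave',  2);     -- reports to Alice

-- Recursive CTE: the base case picks Alice, the recursive step
-- repeatedly joins back to find each person's reports
WITH RECURSIVE subordinates AS (
    SELECT id, name, manager_id, 1 AS level
    FROM employees
    WHERE name = 'Alice'                         -- base case
    UNION ALL
    SELECT e.id, e.name, e.manager_id, s.level + 1
    FROM employees e
    JOIN subordinates s ON e.manager_id = s.id   -- recursive step
)
SELECT * FROM subordinates;
```

Postgres evaluates the base case once, then keeps joining newly produced rows against the table until an iteration returns no new rows, which is how the level column ends up counting the depth of the management tree.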

[00:03:32] The next piece of content is "The Troublesome Active Record Pattern", and this is from Calpatterson.com. Now, there is a Ruby library called ActiveRecord, but what he's referring to here is the actual Active Record pattern, and he mentions Martin Fowler's definition: basically converting a whole programmable object into the primitives that are stored in a database table as a row. So he has issue with the pattern itself, not explicitly the Ruby on Rails Active Record library, although he does have issues with ORMs themselves, because he mentions issues with object-relational mappers. Now, one of the problems he sees is that, by default, it retrieves the entire author object. That is actually what happens with a SQL query if you're using SELECT * (select asterisk): you're going to be pulling all of the columns from that table, so you should narrow down what you retrieve. If you're doing it programmatically through an ORM, you should only choose the particular columns you need. Then he mentions that the entire book object has to be retrieved for each iteration, because he's doing it as a nested loop, but of course with some ORMs you can use joins and literally send queries that look like the hand-written SQL he shows. And he mentions a third problem: nested loops preclude much parallelism. So yes, there are a lot of potential problems, but the reason I brought this post up is that I still use an ORM, because it's efficient, but I know how to use it to generate SQL like this efficiently. So you don't want to just use an ORM blindly, but understand how to use it so that you can restrict how many columns you're pulling back, and so that you can do a join to a table and not pull more data than you need. Basically, in my opinion, you can use an ORM, but you need to use it intelligently; you need to know how the database works and use it efficiently.
Now, he also mentions the issues with transactions, where you get a record, change the number of copies in stock, and then save it. Of course, there has been no locking going on, so another process could have already updated or increased the stock, and now you're writing a wrong value based on the value retrieved earlier, because there was no lock on it. He has an example of how you would want to implement this; you can even do it in a single UPDATE query using raw SQL, and there's a way to do that through ORMs if you need to. Then he also discusses RESTful APIs as well. So while this post isn't PostgreSQL-specific, he mentions things you need to be aware of and use intelligently if you're using tools such as ORMs that make certain operations more convenient. You just want to know how to use them correctly to avoid performance and data problems.
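To make those points concrete, here is a small sketch in plain SQL (the table and column names are hypothetical, chosen to match the post's book-store flavor): selecting only the columns you need with a join instead of nested loops, and pushing the stock change into a single atomic UPDATE instead of a read-modify-write cycle.

```sql
-- Instead of fetching whole author and book objects in nested loops,
-- select just the needed columns in one joined query:
SELECT a.name, b.title
FROM authors a
JOIN books b ON b.author_id = a.id
WHERE a.id = 42;

-- Instead of reading copies_in_stock, changing it in the application,
-- and saving it back (a race between concurrent processes), do the
-- arithmetic inside the database so the update is atomic:
UPDATE books
SET copies_in_stock = copies_in_stock - 1
WHERE id = 7
  AND copies_in_stock > 0;   -- refuses to go below zero
```

Most ORMs let you express both of these (column lists, joins, and expression-based updates); the point is that you have to ask for them rather than rely on the defaults.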

[00:06:41] The next piece of content is "PostgreSQL Date and Time Data Types Explained". This basically goes over the date and time data types and how they work. It first covers timestamps and the difference between timestamp with time zone and timestamp without time zone, and how most people encourage using timestamp with time zone. It gives you the code here, because it handles time zone concerns for you and can make your programs easier to work with. Now, you don't always have to do a SET session setting here; you can define the time zone within the query itself, but for his demonstration purposes he just shows it this way. And if you have a timestamp with the time zone included, it can do the time zone translation for you. But there is also timestamp without time zone if you want to rely on, say, your application framework to do the time zone conversions. He covers all about timestamps and then goes into dates, which exclude the time. He shows how you can compare a timestamp to a date, in cases where it's true and cases where it's false, and how you can convert between them. Then he goes into times of day and intervals, following up with time ranges. So if you want a refresher on using dates and times with Postgres, it's definitely a blog post to check out.
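As a quick sketch of the behavior being described (the session time zone here is an arbitrary choice for illustration): timestamptz stores an absolute point in time and renders it in the session's time zone, while a plain timestamp stores the literal wall-clock value, and comparing a timestamp to a date treats the date as midnight.

```sql
-- Set the session time zone (can also be done per-query with AT TIME ZONE)
SET TIME ZONE 'America/New_York';

-- A timestamptz literal given in UTC is displayed in the session zone
SELECT TIMESTAMP WITH TIME ZONE '2020-03-16 12:00:00+00';

-- AT TIME ZONE converts a timestamptz to a plain timestamp in a given zone
SELECT '2020-03-16 12:00:00+00'::timestamptz AT TIME ZONE 'UTC';

-- Comparing a timestamp to a date: the date is promoted to midnight
SELECT TIMESTAMP '2020-03-16 00:00:00' = DATE '2020-03-16';  -- true
SELECT TIMESTAMP '2020-03-16 10:30:00' = DATE '2020-03-16';  -- false

-- Converting between them by casting drops the time portion
SELECT CAST(TIMESTAMP '2020-03-16 10:30:00' AS date);
```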

[00:08:01] The next piece of content is "CLUSTER: Improving PostgreSQL performance". This is from Cybertec-postgresql.com, and they're talking about CLUSTER, which is a command that reorders the data based upon an index. The reason you would potentially do this is that with physically ordered data on the disk, a lot of times lookups are faster and more efficient. For this, he created a test table using generate_series with 10 million rows. Then he made a second table that randomized the order of it. So the tables were the same size and included the exact same data; they were just in a different order. He created an index on each of the tables, and you can even see the index on the random table takes longer to generate because the data is mixed all around on the disk. He did a vacuum and then looked to see how long it would take to read the blocks of the database. Now, all of this was pretty much stored in memory, so you're not seeing some of the disk access costs here, but you can see that the query on the sequential table took about twelve milliseconds, whereas the one on the random table took about 13 milliseconds. Again, that's because it didn't necessarily need to go to disk; but look at the shared buffer hits. It only had to access about 130 buffers when the data was sequential, whereas it had to access over 18,000 when it was randomized. Now, if you then cluster based upon that index, so he ran the CLUSTER command, it's still not quite as fast, about 13 milliseconds, but the buffer reads are down to around 135. So basically, if you know your table is typically accessed by a particular index, it can be beneficial to use CLUSTER to order the rows on the disk. Now, the disadvantage of CLUSTER is that it doesn't maintain that order on the disk; if you do a lot of updates, the rows will start spreading around again and you'll need to run CLUSTER again. And it's also something that cannot be done while the table is active, since it needs a lock on the table.
But if you're looking to boost performance, maybe, as he mentions, more so for an analytics database or where you have more static data, then it could be beneficial.
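A rough reconstruction of the experiment might look like the following (table and index names are my own, and the exact queries the author ran may differ):

```sql
-- A sequentially ordered table, and a shuffled copy of the same data
CREATE TABLE t_seq AS
    SELECT id FROM generate_series(1, 10000000) AS id;
CREATE TABLE t_rand AS
    SELECT id FROM t_seq ORDER BY random();

CREATE INDEX idx_seq  ON t_seq (id);
CREATE INDEX idx_rand ON t_rand (id);  -- slower to build: rows are scattered
VACUUM ANALYZE t_seq;
VACUUM ANALYZE t_rand;

-- Compare buffer usage for a range scan; the shared-hit counts will
-- differ dramatically even when the timings look similar
EXPLAIN (ANALYZE, BUFFERS)
    SELECT * FROM t_rand WHERE id BETWEEN 10 AND 20000;

-- Physically rewrite the shuffled table in index order
-- (takes an exclusive lock, and the ordering decays under updates)
CLUSTER t_rand USING idx_rand;
ANALYZE t_rand;
```

Re-running the EXPLAIN after the CLUSTER is what shows the buffer count dropping, since the rows for the range now sit in a handful of adjacent pages.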

[00:10:22] The next post is "Is postgres running?", and this is from Mydbainotebook.org. It's just a post on what you can check to see whether Postgres is running, because if there's a problem where you can't connect, that's probably the first thing you're going to look for. He has a very quick ps command to look and see what the Postgres processes are. Once you know what processes are running, you can reference the data directory shown for the postmaster to find where the postgresql.conf file is, because to connect to it you're going to need to know the port. Now, you also need to check whether there are any other conf files included from postgresql.conf, as well as the postgresql.auto.conf file, so that's something to keep in mind. Then, for completeness, that part covers Red Hat and similar systems, but he also shows what the ps command looks like for Debian and others, and even a quick sed command to find what the port is in the conf file. But again, you need to make sure the setting isn't in an included file or in postgresql.auto.conf as well. And then, how do you start and stop Postgres? He references some of the Debian tools that are available, such as pg_lsclusters to list out clusters, as well as how to start, stop and restart them, and systemd as well. So, a pretty quick blog post, but if you work with different PostgreSQL instances, there are definitely some quick commands here to determine whether Postgres is running.
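The ps-and-sed approach described might look roughly like this. Note that the config file and its contents below are invented purely for demonstration; on a real system you would point sed at the postgresql.conf you located from the process listing.

```shell
# List any running postgres processes; the [p] trick stops grep matching itself
ps aux | grep '[p]ostgres' || echo "no postgres processes found"

# Demo file standing in for a real postgresql.conf (path and values made up)
cat > /tmp/postgresql.conf.demo <<'EOF'
listen_addresses = 'localhost'
port = 5433            # connection port
EOF

# Pull the port number out of the conf file with sed
sed -n 's/^[[:space:]]*port[[:space:]]*=[[:space:]]*\([0-9][0-9]*\).*/\1/p' \
    /tmp/postgresql.conf.demo
```

Keep in mind, as the post warns, that the effective port may instead live in an included file or in postgresql.auto.conf, so a match here is a starting point rather than a guarantee.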

[00:11:57] The next piece of content is actually a YouTube channel. It appears they've posted a number of videos from the Southern California Linux Expo on their YouTube channel, and there are about five or six videos that cover PostgreSQL. For example, there's "Using PostgreSQL for Data Privacy and Security", "Declarative Caching with Postgres and Redis", "No Excuses PostgreSQL Security", "PostgreSQL Worst Practices", "Enterprise Authentication with PostgreSQL", "Linux I/O Internals for PostgreSQL Administrators", "Welcome to Postgres Twelve", et cetera. So if you're interested in video content, definitely check out this YouTube channel.

[00:12:42] The next post is "Creating a PostgreSQL procedural language, Part 5: Returning Results". This is, of course, the fifth in the series that 2ndQuadrant.com has been working through, setting up Julia as a programming language to use within PostgreSQL. So if you've been following this series, the fifth post is available.

[00:13:02] The next piece of content is that the new PostgreSQL Person of the Week is Bruce Momjian. So if you want to learn about Bruce's contributions and work in PostgreSQL, definitely check out this blog post.

[00:13:15] The next post is "Spatial Tile Serving with PostgreSQL Functions". This is from Crunchydata.com, and again they're following along with the web services they've created to work with Postgres, showing how you can do spatial tile serving. So if you have an interest in this content, definitely check out this blog post.

[00:13:38] The next piece of content is "PostgreSQL GSSAPI authentication with Kerberos, Part 1: How to set up Kerberos on Ubuntu". This looks to be the first part of a three-part blog post talking about authenticating to PostgreSQL using Kerberos. Now, this first post doesn't cover PostgreSQL itself, but it covers setting up Kerberos on Ubuntu. So if you have an interest in that, you may want to check out this first post.

[00:14:09] And lastly, "pg_timetable: advanced PostgreSQL cron-like scheduler released". This is from Cybertec-postgresql.com, and they've released the second version of their pg_timetable tool, because they made some changes to how data is stored, and they explain how the new version works. So if you're interested, definitely check out this blog post. That does it for this episode of Scaling Postgres. You can get links to all the content discussed, sign up to receive weekly notifications of each episode, or subscribe via YouTube or iTunes. Thanks.
