IN vs ANY, Ghost Conditions, Percentage Calculations, Variadic Unnest | Scaling Postgres 243

In this episode of Scaling Postgres, we discuss IN vs. ANY performance, ghost conditions, percentage calculations using window functions and using variadic unnest.

Transcript

All right. I hope you, your friends, family, and coworkers continue to do well. We don't have as much content this week as normal, probably because of the Thanksgiving holiday last week, but there are ten pieces of content to go through. Our first piece of content is "5mins of Postgres E45: IN lists vs ANY operator: Bind parameters and performance". This is from pganalyze.com. It covers a post, which you can see highlighted here, that we discussed on last week's Scaling Postgres: "Postgres Query Boost: Using ANY Instead of IN". That post examined a Java use case where IN couldn't really be used and ANY was needed instead, because IN accepts a fixed list of values, whereas ANY accepts an array of any size. While Lukas covers this post and the points in it, he also brings up questions of performance by looking at previous posts that compared IN vs. ANY. He cites a benchmark from 2017 showing that IN lists were faster until you got to about 128 values in the list.

Then the ANY list started performing much better. He also talked about someone using thousands of integers in an IN list who was seeing poor performance; a patch that improved IN-list performance actually made it into Postgres 14. So he's advocating that if you use large IN lists, you should definitely move to Postgres 14 or later, or consider changing over to ANY, which may give you better performance. He also commented on some cases where a Postgres index was not used with ANY but was used with IN; this involved row type comparisons, but they were eventually able to get it to work by typecasting the values appropriately. So I definitely encourage you to check out his episode because it covers a lot of performance-related information with regard to IN and ANY.
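
As a rough sketch of the difference being discussed (using a hypothetical users table, not an example from the episode), IN takes a literal list of values, while ANY takes an array, so a prepared statement can bind a single array parameter no matter how many values you pass:

    -- IN takes a literal list of values:
    SELECT * FROM users WHERE id IN (1, 2, 3);

    -- ANY takes an array, so one bind parameter covers any number of values:
    PREPARE get_users(int[]) AS
      SELECT * FROM users WHERE id = ANY($1);
    EXECUTE get_users(ARRAY[1, 2, 3]);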

Next piece of content- "Ghost Conditions For Unindexed Columns". This is from sqlfordevs.com, and he's talking about something he's calling ghost conditions, which I haven't heard that term used before. It's basically where you have a table that has indexes and certain values, but what you want to search by is actually not indexed. So, for example, status and insurance are either not indexed or indexed individually, not as a composite index and how can you improve the performance of this without adding additional indexes? He suggested that there are other values that are indexed that this set of data fits within.

Well, you can add a condition on that indexed column to the query. He calls these ghost conditions because they aren't really needed to pull in the data, but they allow the planner to use an index on, say, a type column, narrowing the range of records to scan so the results come back faster. He walks through that in this particular blog post. So basically, if you're seeing poor performance, you can always add an index, of course, but maybe look at the other columns first: is there correlated data that's already indexed that you could use to return results faster? Definitely check out this blog post if you want to learn more about that.
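
As a minimal sketch of the idea, assuming a hypothetical claims table where status and insurance are unindexed but type has an index and happens to be correlated with the rows we want:

    SELECT *
    FROM claims
    WHERE status = 'open'
      AND insurance = 'acme'
      -- Ghost condition: logically redundant (all matching rows happen to
      -- have one of these types), but it lets the planner use the index:
      AND type IN ('health', 'dental');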

Next piece of content- "Percentage Calculations Using Postgres Window Functions". This is from Crunchydata.com and he's talking about how before Windows functions, it was quite hard to do different percentage calculations over all the rows in a particular table or subset of data in the table. And he's giving an example here of how you would calculate the percentage that someone earned over the rest of the people in the table. And basically, you have to do two scans of the table to collect that information. Whereas if you use a window function, you can actually get all the information you need in one scan and you do that by using a window function.

So in this case, he has the numerator looking at the earnings times 100, and he divides it by a denominator that uses the sum function in windowing mode. So he's summing all of the earnings over the entire table. If you want the percentage of earnings by band, that's even harder to do without a window function. But with one, instead of computing the sum over the whole table, as shown here with just empty parentheses, you can say PARTITION BY band and that breaks it out by band. So it's a quick post on how to do percentages using window functions, and I encourage you to check it out if you want to learn more.
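
A minimal sketch of both calculations, assuming a hypothetical earners table with name, band, and earnings columns:

    SELECT name, band, earnings,
           -- percentage of the whole table: an empty OVER ()
           round(earnings * 100.0 / sum(earnings) OVER (), 2) AS pct_of_total,
           -- percentage within each band: PARTITION BY
           round(earnings * 100.0 / sum(earnings) OVER (PARTITION BY band), 2) AS pct_of_band
    FROM earners;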

Next piece of content- "VARIADIC UNNEST". This is from postgresonline.com. He's talking about a situation where he received some data that was entered quite poorly into a spreadsheet. So he had a practice_name and an employee_name, but each of the employers was just separated by a new line in the data, so it wasn't in its own row. So this is three rows of data, but it includes five employees and a blank employee for one of the practices. So for their credentials, there was just a new line to align up the second credential with the second person listed and their specialty as well.

So he received this and said, 'Oh my gosh, how am I going to deal with this data?'. He used a technique where he converted the strings to arrays, splitting on the newlines so the values stayed separate, and he did that for each column so they would line up. Then he used the unnest function, which can be called in a variadic fashion: you can pass it multiple arrays and each one gets its own column, rather than producing a cross join, which he demonstrates in the first part. Using this technique, he got the data all nicely structured. So if you're interested in learning more about that, definitely check out this blog post.
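
A small sketch of the approach, assuming a hypothetical practices table whose employee_name and credential columns each hold newline-separated values:

    SELECT p.practice_name, t.employee, t.credential
    FROM practices AS p,
         -- Multi-argument unnest (allowed in the FROM clause) expands the
         -- arrays in parallel, pairing element 1 with element 1, and so on:
         unnest(
           string_to_array(p.employee_name, E'\n'),
           string_to_array(p.credential, E'\n')
         ) AS t(employee, credential);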

Next piece of content- "SQL or NoSQL? Why not use both (with PostgreSQL)?". This is from supabase.com and basically, he went over the scenario of where you have a meal planning application that tracks essentially what you eat and you could go the NoSQL route and just throw everything into a JSON document and store that. And that works much more easily than trying to break out the data into a relational database. But the problem comes when you want to try to analyze it. It becomes much more difficult to do that using JSON documents as opposed to doing a SQL query.

However, a fully relational design causes its own issues: normally you would want to set up a lot of tables and do all of those inserts, and maybe you don't want to do that. So he developed a kind of hybrid schema where there's a main table with a row per person per day, but exactly what they ate goes into a JSONB column covering food, water, and exercise amounts. Now, I wouldn't necessarily advocate this type of design, but it does give him a way to show how to analyze that kind of JSON data using the different JSON functions. So if you want to learn more about how to use JSON functions, definitely check out this blog post.
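
As a rough sketch of such a hybrid layout (the table and column names here are hypothetical, not the post's exact schema):

    -- One row per person per day; the details live in JSONB:
    CREATE TABLE daily_log (
      person_id int   NOT NULL,
      log_date  date  NOT NULL,
      details   jsonb NOT NULL,
      PRIMARY KEY (person_id, log_date)
    );

    -- Analyze the documents with standard JSON functions, e.g. pulling
    -- each meal's food out of a JSON array:
    SELECT d.log_date, meal ->> 'food' AS food
    FROM daily_log AS d,
         jsonb_array_elements(d.details -> 'meals') AS meal
    WHERE d.person_id = 1;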

Next piece of content- "POSTGRESQL ERROR: PERMISSION DENIED FOR SCHEMA PUBLIC". This is from cybertec-postgresql.com, and this happens, of course, when you upgrade to Postgres 15, create a user, and you start trying to create objects. You may run into this error, and this blog post covers that issue and how you could resolve it. Basically, you now need to grant permissions to that user on the public schema or, more appropriately, maybe to create its own dedicated schema and it would have rights to those. And how he resolved it he just ran GRANT ALL ON SCHEMA public TO demo. But if you want to learn more about that, you can definitely check out this blog post.

Next piece of content- "Global Index, benchmark with pgbench". This is from highgo.ca and they're talking about the global index on partition tables. So basically they're trying to create a unique global index that would apply to partition tables because right now you can't have a globally unique index and this is the proposal or their proof of concept on how to do it. This actually shows some performance results that you can see here. And they're looking at a different number of partitions from 1 to 1,000. As you can tell, it looks like the transactional performance of the global index, given the data that they have here, looks really good at the one partition and the ten partition level, but then it drops off significantly at the 100 table level and the 1000 table level. So I don't know if this is a long-term solution for this, but this is the stage that this work is at. So if you want to learn more about that, you can definitely check out this blog post.

Next piece of content- "PostgreSQL scary settings. data_sync_retry". This is from fluca1978.github.io. And he's talking about a setting you can make in your PostgreSQL comp file called data_sync_retry, and how you really shouldn't turn this on because there's a high probability you're going to corrupt your data. And he goes through all the process of why that is. But if you want to learn more about that, definitely check out this blog post.

Next piece of content, there was another episode of Postgres FM this week. This one was on "Materialized views", and they cover what they are, their pros and cons, and some areas where they could be improved. You can click this button to listen to the episode here, or you can watch it on YouTube from this link here.
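
For anyone who hasn't used them, a minimal example of the feature (with a hypothetical orders table); note that refreshes are manual, which is one of their commonly cited downsides:

    -- A materialized view stores the results of its query:
    CREATE MATERIALIZED VIEW monthly_totals AS
      SELECT date_trunc('month', created_at) AS month, sum(amount) AS total
      FROM orders
      GROUP BY 1;

    -- Refreshes are manual; CONCURRENTLY avoids blocking readers
    -- but requires a unique index on the view:
    CREATE UNIQUE INDEX ON monthly_totals (month);
    REFRESH MATERIALIZED VIEW CONCURRENTLY monthly_totals;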

Last piece of content. The PostgreSQL person of the week is Tobias Bussmann. If you're interested in learning more about Tobias and his contributions to Postgres, definitely check out this blog post.
