
How Often Does That Query Run? | Scaling Postgres 367


In this episode of Scaling Postgres, we discuss a Postgres VS Code extension, service definition files, undead tuples, and the benefits of finding out how often a query is run.

Content Discussed

YouTube Video

Podcast Audio

Transcript

In a lot of my consulting I'm asked to come in and take a look at a database and optimize some queries, or even look at the whole database, find the slowest queries, and optimize them to make the database run more efficiently. And it's easy to get into a cycle of: okay, here's a query, how can we make it faster, or what index can we add? But sometimes the question we should be asking is not how do we make this query faster, but why are we running this query so much? That's something we're going to cover at the end of the episode in my Consulting Corner segment. But I hope you, your friends, family and coworkers continue to do well.

Our first piece of content is “Announcing a new IDE for PostgreSQL in VS Code from Microsoft”. This is from techcommunity.microsoft.com, and I don't know if they had an existing VS Code extension for PostgreSQL, but there is a new one now. Some of the key features they mention are schema visualization, so you can look at an ERD diagram, and GitHub Copilot, which seems to be built into everything now, has some database awareness here. So you have some Copilot context menu options, for example to explain a query. They even have agent mode, which means having the AI do something for you; in this case it's actually creating a database in Azure. I'm not really sure how many of these features are Azure-database specific; I think a fair number of them work for general Postgres usage. They have a way to manage connections and do passwordless authentication with Entra, plus a database explorer to look at the different objects, as well as query history. So if you're a developer who uses Postgres and VS Code, you may want to check out this new extension.

Next piece of content: “PostgreSQL Service Connections”. This is from boringsql.com, and they mention a number of methods typically used when building connections. You can list out the individual parameters: host, port, database, username, password. You can do it in a URI-like format, essentially a connection string. You can use environment variables, or pass command-line arguments to various commands like psql or pg_dump. But you can also use predefined connection profiles stored in service connection files. These are basically INI files with the format you see here. I always forget about these because I haven't used them much, but they can be advantageous in certain use cases because you can store them system-wide in pg_service.conf or per user in ~/.pg_service.conf. And to connect using one of these predefined services, you just pass service= and the service name; in this example, something like psql service=mydb. So to me this is similar to an ssh config file, essentially, although I don't think you can put passwords in an ssh config file (I haven't tried). But once you set these up, it's super easy to switch your commands over to using them, and it makes things much more readable. You can just say pg_dump with a particular service as opposed to having a long connection string as part of the command. You can even use these service definitions in different types of connections, like with DSN keywords, within connection URLs, or even via an environment variable. And this is supported wherever libpq is supported, so if you have a programming language driver that uses libpq, these service definitions should work. He shows examples of it working in Go, Python, Java, PHP and Rust, but something he notes doesn't work is Node's pg module. These services even work for internally defined connections, like if you're using dblink, and I imagine when you're setting up a publisher and subscriber it should work there as well. You can also define just partial service definitions, so you can exclude the password from the file, like I was mentioning before, and supply it as an environment variable, for example. This is a feature to keep in mind; you may have a use for it in the future.
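As a concrete sketch (the service name and connection details below are made up, not taken from the post), a per-user service file and its usage might look like this:

```ini
# ~/.pg_service.conf (a system-wide pg_service.conf works the same way)
[mydb]
host=db.internal.example.com
port=5432
dbname=app_production
user=app_user
# password can be left out here and supplied via ~/.pgpass or PGPASSWORD
```

```sh
# All of these resolve the connection details from the [mydb] service entry
psql "service=mydb"
pg_dump --schema-only "service=mydb"
PGSERVICE=mydb psql
```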

Next piece of content: “When The Dead Won't Die”. This is from cybertec-postgresql.com. He's talking about a situation where you have some rows in a table, they've been deleted, and vacuum goes through and presumably cleans them up, but then suddenly they show up as dead again. The rows don't become visible again, but it looks like vacuum still has them to clean up. You might be thinking, as I was at first, what the heck is this talking about? It has to do with vacuum not always cleaning up indexes. In a normal 8-kilobyte database page, you have line pointers at the beginning of the page that point to the actual tuples, which are stored at the end. Now, when you're running vacuum, there are certain conditions under which the indexes are skipped; they're not cleaned up because that would take more time to do. There's a threshold at which it bypasses index cleanup, and that is when dead items affect fewer than 2% of the relation's pages. So the heap still gets vacuumed, but not the indexes. When this happens, vacuum marks these line pointers as LP_DEAD, indicating that they are still in use but considered dead and will probably be freed soon. In the second phase of a later vacuum, they will finally be set to LP_UNUSED. But when ANALYZE runs, it still recognizes these LP_DEAD line pointers as dead rows, which can cause them to suddenly reappear in the system statistics. He has an example here where he created a table with 100,000 rows and deleted 1,000. When he runs VACUUM, you can see the index scan is bypassed, because that's only about 1% of the table. If you look at the statistics after the vacuum, the number of dead tuples is zero, as opposed to 1,000. But if you run ANALYZE, it updates the statistics and now shows the thousand dead rows again. So vacuum couldn't fully free those tuples because the indexes still need to be dealt with, but ANALYZE runs and adds them back into the table statistics as dead rows. Anyway, if you're keeping a close watch on your statistics tables and you see something like this happen, now you know why.
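A rough sketch of that experiment, with illustrative table and column names (the exact page counts and whether the bypass kicks in depend on your version and row sizes):

```sql
-- Illustrative reproduction of the behavior described above.
CREATE TABLE vacuum_demo (id integer PRIMARY KEY, payload text);
INSERT INTO vacuum_demo
SELECT g, 'row ' || g FROM generate_series(1, 100000) AS g;

DELETE FROM vacuum_demo WHERE id <= 1000;   -- only ~1% of the table is dead

-- With so few affected pages, vacuum bypasses the index cleanup phase.
VACUUM (VERBOSE) vacuum_demo;

-- Right after the vacuum, the dead tuple counter reads zero...
SELECT n_dead_tup FROM pg_stat_user_tables WHERE relname = 'vacuum_demo';

-- ...but ANALYZE counts the LP_DEAD line pointers, and the dead rows reappear.
ANALYZE vacuum_demo;
SELECT n_dead_tup FROM pg_stat_user_tables WHERE relname = 'vacuum_demo';
```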

Next piece of content: “Short alphanumeric pseudo random identifiers in Postgres”. This is from andyatkinson.com, and he's imagining a scenario where you have an integer primary key and you don't want to make that integer visible, or you want some other pseudo-random identifier to use instead of the id, and you want to be able to convert back and forth, meaning you can use it as a public ID and still convert it back to the integer. He had particular properties in mind for this id: a fixed size of five characters in length, fewer bytes than a UUID, an obfuscated value, reversibility so it can be converted back into the original integer, using only Postgres capabilities with no extensions or other languages involved, and finally a non-math-heavy implementation. The way he does this is by converting integers using XOR (exclusive or) bitwise operations and modulo arithmetic. He wrote the code for this and created a simple function called obfuscate_id: you give it an integer and it outputs a five-character text string. He also has a reverse function where you give it that five-character text string and it outputs the integer back. With regard to implementation, you create a table where your integer primary key uses GENERATED ALWAYS AS IDENTITY, and then a public_id text column that is generated always using the obfuscate_id function, passing in the id that gets generated. So it automatically keeps the public ID up to date, and it looks like he added a unique index to that column as well as some constraints. Once you insert some data, it looks like this: you have the id, the public id, and you can use the reverse function to get the integer ID back. Now, he says with this implementation, inserting data is about three and a half times slower due to these function calculations, so that's something to keep in mind. But if you're interested in something like this, you can check out this blog post, and there's a link to the PL/pgSQL source code as well.
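A rough sketch of that table layout, assuming the obfuscate_id() function from the post's linked PL/pgSQL source has already been created (and marked IMMUTABLE so it can be used in a generated column); the table name and sample output here are made up:

```sql
-- Sketch only; obfuscate_id() comes from the post's PL/pgSQL source.
CREATE TABLE users (
    id        integer GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    public_id text    GENERATED ALWAYS AS (obfuscate_id(id)) STORED UNIQUE
);

INSERT INTO users DEFAULT VALUES;

-- id | public_id   (the actual five-character value depends on the function)
SELECT id, public_id FROM users;
```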

Next piece of content: “pg_dump speed across versions”. This is from depesz.com. He looked at pg_dump speed from versions 10 to 18, running a number of different variations of pg_dump to get a good comparison. He did this without any data included, so it was essentially just a schema dump, but there were 2.4 million relations in pg_class, comprising 288,000 tables and 1.7 million indexes, which is a pretty large schema. Out of these runs he noticed a pretty dramatic improvement from 14 to 15, where speed almost doubled. It got slightly faster in 16, but then it got slower again in 17 and 18. Now, 18 is still in beta, so that could change, but he's wondering what caused the regression from PG 16 to 17, and he doesn't have any insight into that yet. You can check this out if you're interested.
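For reference, a schema-only dump of the kind being timed looks roughly like this (an illustrative invocation, not the exact flags or timing harness from the benchmark):

```sh
# Illustrative schema-only dump; database name and flags are assumptions.
time pg_dump --schema-only --dbname=mydb --file=/dev/null
```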

Next piece of content: “Mini Summit 5 Transcript: Improving the PostgreSQL Extensions Experience in Kubernetes with CloudNativePG”. We covered a post, I think on Gabriele's website, where he discussed this. Well, this is the presentation and a video describing the enhancements that have been brought to CloudNativePG, as well as the greater Kubernetes ecosystem, to handle extensions more easily in Kubernetes environments. You can check this out if you want to learn more about that.

Next piece of content: “Data archiving and retention in PostgreSQL. Best practices for large datasets”. This is from dataegret.com, and we talked about a post on this topic last week. Well, this is the accompanying presentation and slides, so if you want more insight, you can definitely check out this piece of content as well.

Next piece of content: “Hacking Workshop for June 2025”. This is from rhaas.blogspot.com, and Robert will be hosting two or three discussions of Masahiko Sawada's talk “PostgreSQL meets ART: Using Adaptive Radix Tree to Speed Up Vacuuming”, which was presented at PGConf.dev 2024. They're going to be reviewing it, so if you're interested, you can look into how to join them.

And now it's time for my Consulting Corner. Like I mentioned, a lot of times I'm brought in to optimize a specific query or set of queries, or to look at the database, take the top 10, for example, and optimize those as much as possible. Well, more and more I'm looking at some of these queries and wondering: do you really want to run this query 500 times a second? If you're just looking at pg_stat_statements, it can be hard to tease that particular metric out, but tools like AWS Performance Insights or other monitoring tools can tell you how often a particular statement runs over a period of time. Because if you're running a query that pulls 10,000 or 100,000 rows, you probably don't need to run it multiple times per second. And where it gets interesting is when you look at wait analysis as well, because you may see that a particular query has 20 executions waiting at any one time, and those wait states are maybe CPU or disk access. What's happening is that it's a longer-running query, 10 to 100 milliseconds, but it's running so frequently that at any one time you have it being run on ten different Postgres backends. Sometimes the answer is to optimize the query and make it run faster, and then that's less load on the system. But sometimes the answer is "well, we need this data", and then the next question is: do you need it multiple times per second? Usually the answer is no, because maybe the best optimization is actually running that query less often, if you can. Just something to keep in mind.
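If pg_stat_statements is all you have, one rough way to surface this (a sketch, assuming the extension is installed on PostgreSQL 14 or later, which provides pg_stat_statements_info.stats_reset) is to sort by call count and divide by the time since the statistics were last reset:

```sql
-- Sketch: find the statements that run most often, not just the slowest ones.
SELECT s.calls,
       round(s.calls::numeric
             / GREATEST(extract(epoch FROM now() - i.stats_reset), 1), 2)
           AS calls_per_second,
       round(s.mean_exec_time::numeric, 2) AS mean_ms,
       s.rows / GREATEST(s.calls, 1)       AS avg_rows_per_call,
       left(s.query, 60)                   AS query
FROM pg_stat_statements AS s
CROSS JOIN pg_stat_statements_info AS i
ORDER BY s.calls DESC
LIMIT 10;
```

The rate here is an average since the last stats reset, so for a true moment-to-moment picture you would snapshot this view periodically and diff the calls column, which is essentially what monitoring tools typically do for you.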
 
