Debian / Ubuntu Packaging, ivfflat Indexes, In vs Any, View Logging | Scaling Postgres 273
In this episode of Scaling Postgres, we discuss how Debian & Ubuntu package Postgres, how ivfflat indexes work, IN vs ANY performance, and how to log view usage.
- Postgres Debian/Ubuntu Packaging
- Postgres Debian/Ubuntu Packaging(Part 2)
- Nearest Neighbor Indexes: What Are ivfflat Indexes in pgvector and How Do They Work
- TIL - IN is not the same as ANY
- Postgres performance with IN vs ANY
- View Access Logging – Fix it in PostgreSQL
- Automated index bloat management: How pg_index_watch keeps PostgreSQL indexes lean.
- PGSQL Phriday #010
- Joins 13 Ways
- The Postgres Trajectory
- Fun with PostgreSQL puzzles: Surface Area and 3D Slices
- pgBackRest differential vs incremental backup
- Goodbye etcd, Hello PostgreSQL: Running Kubernetes with an SQL Database
- What To Do When a Data File in PostgreSQL Goes Missing
- Announcing the General Availability of Percona Operator for PostgreSQL Version 2
- pg_upgrade: the tricky and dangerous parts
- Afsane Anand
All right. I hope you, your friends, family, and coworkers continue to do well. Our first piece of content is- "Postgres Debian/Ubuntu Packaging". This is from aklaver.org. This is very interesting because they're talking about the Postgres packages and the alterations that Debian and Ubuntu make to them, or at least the framework they set up for Postgres on these operating systems. Actually, this is my preferred operating system to run Postgres on. Now maybe that's just because it's the first one that I used, but I really like how they've laid things out compared to CentOS or Red Hat Enterprise Linux. So of course with Debian or Ubuntu, you get a particular version of Postgres that's locked according to that version of the operating system.
But you can always go to the Postgres repositories and get the sources list and repo key so you can install any version that you want. That's definitely the way to go to get the most recent version. So they show you how to do that, but this is also in the download section of the PostgreSQL website. You do an apt update and then install the particular version that you want. Now the packaging also offers a number of commands, and you can see the first one here, pg_lsclusters. This shows you all the clusters that are running on the box. Now I'm assuming they're using 22.04, so they have the 14 cluster running and then the 15 one that they just installed. And by default, it knows to increment the port number as you can see here.
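For reference, the PGDG repository setup he's describing usually looks something like this (a sketch only; the release codename and Postgres version here are examples, so check the instructions on the PostgreSQL download page for your system):

```bash
# Add the PGDG apt repository and its signing key (example for Ubuntu 22.04)
sudo sh -c 'echo "deb http://apt.postgresql.org/pub/repos/apt $(lsb_release -cs)-pgdg main" \
  > /etc/apt/sources.list.d/pgdg.list'
wget --quiet -O - https://www.postgresql.org/media/keys/ACCC4CF8.asc | sudo apt-key add -

# Refresh package lists and install the version you want
sudo apt update
sudo apt install postgresql-15
```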
Then it goes over and shows you where the different binaries are stored, where the data is stored, where the logging happens, and then, of course, all the configuration files under /etc. One thing you'll notice is that every time you do an install of a Postgres cluster, it namespaces it based on the version and also the cluster name. So you can use as many cluster names as you want to install as many instances of a Postgres 15 installation as you want. And of course, you can install many different versions on the same operating system as well. They give you an overview of what some of the config files look like and then just some general guidance on getting started with it. As well as pg_lsclusters, pg_ctlcluster allows you to do systemd-style commands.
Such as stopping the database, starting it, reloading it, and things of that nature. They show that here and they compare it to the systemd command of how to do it. And here's how this is namespaced: after 'postgresql', you put the @ symbol, the version, a hyphen, and then the cluster name. And that's not all, because the next blog post covers a lot more about the different commands that are available. So there are certain commands that run with the most recent version, but otherwise they run based upon particular cluster versions, and they show some examples of that here. And also a list of the commands that are unique to the packaging.
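As a quick illustration of the wrapper commands and the systemd naming he mentions (version 15 and a cluster named "main" are just example values):

```bash
pg_lsclusters                        # list every cluster on the machine
sudo pg_ctlcluster 15 main stop      # stop version 15, cluster "main"
sudo pg_ctlcluster 15 main start
sudo pg_ctlcluster 15 main reload

# The equivalent systemd unit is namespaced as postgresql@<version>-<cluster>
sudo systemctl status postgresql@15-main
```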
So you'll see pg_lsclusters, and you'll see pg_ctlcluster which controls the cluster. You can also create clusters, drop them, and rename them, and a few of these commands I haven't even used before. Like I've never used the backup cluster or restore cluster commands; I basically relied on pg_basebackup and did the restore process based on that. But if you are using Postgres on Ubuntu, I highly suggest bookmarking these two posts so you can see all the commands that are available, which I think makes managing Postgres much easier.
Next piece of content- "Nearest Neighbor Indexes: What Are ivfflat Indexes in pgvector and How Do They Work?". This is from timescale.com. And again, with the rise of artificial intelligence, everyone's talking about using vectors and the pgvector extension, and the index that serves that data is ivfflat, which stands for Inverted File Flat, an algorithm for approximate nearest neighbor search. A lot of this artificial intelligence work is about identifying similarities to something else, so the typical problem is finding the nearest neighbor, or the K nearest neighbors, that match a particular query. And that's what ivfflat is designed to do. Now they talk about dimensionality here, and they say there are some indexes that let you do K nearest neighbor.
So they talk about R-Trees, KD-Trees, and Ball Trees. But it says, quote "These approaches cease to be effective when dealing with data larger than 10 dimensions due to the 'curse of dimensionality'". And actually, a lot of the dimensions reach up into the thousands or multiple thousands with some of this machine learning and artificial intelligence. It says here, quote "...OpenAI's are 1,536 dimensions...". So basically, some new techniques had to be developed, and this post talks about ivfflat. Here is an example of a representation of vectors in a two-dimensional space, and you can see that those vectors have a certain clustering; certain vectors are clustered together. So how the index works is it finds these clusters and their centroids, which is basically the computed center of each cluster.
And then it divides those up as you can see here in this diagram. And I believe this is roughly equivalent to the lists parameter when you're building the index: you can define how many lists it's going to break the data up into. But once it has this data set up, you can then do a search on it. So for example, maybe you have a search that lands here. What it does is it tries to find, using similarity, the cluster that's closest to it. And then once it has that cluster, it looks at the individual entities to say 'Okay, what is this similar to?'. So that's basically how the index works, and of course, they go into a lot more detail than I have time to get into in this show.
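The mechanics described above can be sketched in a few lines of Python (a toy illustration of the inverted-file idea only, not pgvector's actual implementation; all the names and data here are made up):

```python
# Toy sketch of ivfflat: partition vectors into lists around centroids,
# then at query time probe only the nearest list(s) instead of scanning everything.
import math

def l2(a, b):
    # Euclidean (L2) distance between two vectors
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def build_lists(vectors, centroids):
    # Assign each vector to its nearest centroid ("inverted file" lists)
    lists = {i: [] for i in range(len(centroids))}
    for v in vectors:
        nearest = min(range(len(centroids)), key=lambda i: l2(v, centroids[i]))
        lists[nearest].append(v)
    return lists

def search(query, centroids, lists, probes=1, k=1):
    # Probe the `probes` nearest lists, then scan only their members
    order = sorted(range(len(centroids)), key=lambda i: l2(query, centroids[i]))
    candidates = [v for i in order[:probes] for v in lists[i]]
    return sorted(candidates, key=lambda v: l2(query, v))[:k]

# Two obvious clusters in 2D space
vectors = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids = [(0.3, 0.3), (10.3, 10.3)]
lists = build_lists(vectors, centroids)
print(search((9, 9), centroids, lists, probes=1, k=2))  # → [(10, 10), (10, 11)]
```

With probes=1 the search only ever scans the single closest list, which is why a query near a boundary can miss its true nearest neighbor, as discussed next.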
But they also talk about issues when you're actually searching on the edge of a boundary, where the true neighbor may be in the next list over. They said there are solutions for dealing with this: the index has the capability to do probes via the 'probes' parameter, and that way you can check the next closest centroid or the third closest centroid, consulting multiple areas to try and find the true nearest neighbors to your particular query. So here they're talking about the 'lists' parameter and the 'probes' parameter when building the index, and they show the multiple different operators that are possible when working with this index. They cover three different distance types- L2 (Euclidean), negative inner product, and cosine.
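In pgvector, that boundary problem is addressed at query time with the probes setting; a hedged example (the table, column, and vector values here are made up):

```sql
-- Consult the 10 nearest lists instead of just one (session-level setting)
SET ivfflat.probes = 10;

-- <-> is pgvector's L2 distance operator; use the operator you indexed with
SELECT id FROM items ORDER BY embedding <-> '[0.1, 0.2, 0.3]' LIMIT 5;
```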
And basically, how you index is how you're going to want to query. They give you the example command to use, and an example in Python of how to set this up and do some querying. The other issue you will have to deal with is data changes. The ivfflat index gets updated based on inserts, updates, and deletes, but the clustering centroids will not be updated. So you'll have to reindex concurrently periodically to make sure the centroids are kept up to date with your data. But this is a great post explaining how to use the ivfflat index and how it works in the pgvector extension. So I definitely encourage you to check this piece of content out.
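For reference, building the index and refreshing its centroids might look like this (a sketch; the table, column, and index names are my own assumptions):

```sql
-- Choose the operator class matching how you'll query:
-- vector_l2_ops (Euclidean), vector_ip_ops (inner product), vector_cosine_ops (cosine)
CREATE INDEX items_embedding_idx ON items
  USING ivfflat (embedding vector_l2_ops) WITH (lists = 100);

-- Periodically rebuild so the centroids reflect newer data
REINDEX INDEX CONCURRENTLY items_embedding_idx;
```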
Next piece of content- "TIL - IN is not the same as ANY". This is from kmoppel.github.io. And he's talking about IN vs. ANY. So this is like you're saying this ID is in this list, or this ID is represented in this particular array of data. A lot of times you can use them interchangeably, but sometimes one is faster than the other. Normally ANY has been faster, although there was a fix for this in Postgres 14 and higher. But he advocates using ANY because you'll get much less noise in pg_stat_statements: if you have IN lists of varying lengths, each one appears as its own query, whereas with ANY, it will just appear as one query, which is much easier to work with. And I speak from experience because pg_stat_statements is my number one tool for performance work. He also says it's safer against potential SQL injections with some frameworks, avoids potential plan cache saturation, and is more convenient with a bunch of text pattern-matching filters.
But he ran into a surprising issue when he was working with a client, where they saw a 90,000x difference using IN vs. ANY. And the issue was that there was only a single entity it was trying to find; it wasn't a list of entities. What the parser does in Postgres is it actually reduces IN queries with just one value to an equals, so project_id equals this. Whereas the same optimization, or adjustment, does not happen with the ANY query. So basically, a lot of times IN and ANY are interchangeable in most recent versions of Postgres, but you can get some instances where you get differences such as this between them. But then the question comes down to what exactly happened? Why was it a 90,000x difference?
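To make the difference concrete, here's a hedged sketch of the two forms (the table and column names are made up):

```sql
-- With a single element, the parser rewrites IN to a plain equality...
SELECT * FROM tasks WHERE project_id IN (1);               -- planned as project_id = 1

-- ...but the equivalent ANY over a one-element array is not rewritten
SELECT * FROM tasks WHERE project_id = ANY('{1}'::int[]);

-- The ANY form also normalizes to a single pg_stat_statements entry
-- regardless of how many values the application passes in
SELECT * FROM tasks WHERE project_id = ANY($1);
```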
And for that, this was covered in Lukas's "Five minutes of Postgres", "Postgres performance with IN vs ANY", on pganalyze.com. He covered this blog post, and what he mentioned is that the reason the equals was so much faster was that the query was actually ordering by that value. And when the planner saw that the column was equal to one value, there was no need to order by that particular column. So it was a no-op, a no-operation. That's what led to the enormous performance improvement. This does seem to be a very unique case, so I don't know if I would advocate switching back from ANY to IN. I think it's just something to be aware of as you're optimizing queries. Maybe consider using one or the other to see which gives you the better performance for your particular workload. But if you want to learn more, check out these two blog posts.
Next piece of content- "View Access Logging – Fix it in PostgreSQL". This is from cybertec-postgresql.com. This is a situation where someone has a database with 600 tables, 300 views, and 2 or 3 different naming conventions across them. So basically they wanted to clean things up, and they were focusing on the views. They wanted to create new view names, but then see which of the old ones are still being used and redirect those to the new names. But of course, as they say here, quote "But we don't want to break anything". So they chose the following solution to deal with it. Now they go into the whole process here; I'm not going to focus on that, but I'll focus on the key part of the solution they worked on.
First, they created the new views. Then they created views to replace the old ones, and what the old ones do is actually run a function, so they have a CTE here with a function, with the intent of logging to a separate table that the old name was being used. And then they simply select from the new view that the old one replaced. You can see they created this view access log table to primarily store the view name, and then use this code here to insert a row based on the view name that came in. So now they can track each case of an old view name being used, go back to the source application, and modify it there so that everything uses the new view names. And then they can delete those old views. So if you want to learn more about how to do that, definitely check out this blog post.
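A minimal sketch of that pattern might look like this (all the names are my own, not the article's exact code, and the CTE trick for forcing the logging call may need adjusting depending on your Postgres version and plan):

```sql
-- Table that records each use of an old view name
CREATE TABLE view_access_log (
    view_name text NOT NULL,
    used_at   timestamptz NOT NULL DEFAULT now()
);

-- Volatile function that logs a hit and returns a dummy value
CREATE FUNCTION log_view_use(vname text) RETURNS boolean
LANGUAGE sql AS
$$ INSERT INTO view_access_log (view_name) VALUES (vname) RETURNING true $$;

-- The old name becomes a logging shim over the renamed view
CREATE VIEW customers_old AS
  WITH logged AS (SELECT log_view_use('customers_old') AS ok)
  SELECT c.* FROM customers_v2 c CROSS JOIN logged;
```

Once the log table stops accumulating rows for a given old name, that view is safe to drop.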
Next piece of content- "Automated index bloat management: How pg_index_watch keeps PostgreSQL indexes lean". This is from dataegret.com. We mentioned this a week or two ago on Scaling Postgres; this tool allows you to keep the bloat on your indexes low by triggering a reindex of them. And how does it handle this? It basically does it by tracking the number of rows in the index vs. the index size. So when that ratio of index size to index tuples gets too high, it triggers a reindex. Now it says this is available for Postgres 12 and higher. But a question I have that I didn't see addressed in this post is that partitioned tables can't be concurrently reindexed, I believe, until Postgres 14. So I guess the question is whether it works with partitioned tables before Postgres 14 or not. That's maybe something you want to think about if you use partitioned tables and want to use this tool, or maybe just speak with the developers about it. But they go over how to install it and get started using it. So if you're interested in that, check out this blog post.
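The size-to-tuples ratio it tracks can be illustrated with a rough query like this (only an illustration of the idea; the tool's actual thresholds and logic live in pg_index_watch itself):

```sql
-- Approximate bytes per index tuple; a growing value suggests bloat
SELECT c.relname                                          AS index_name,
       pg_relation_size(c.oid)                            AS index_bytes,
       c.reltuples                                        AS index_tuples,
       pg_relation_size(c.oid) / GREATEST(c.reltuples, 1) AS bytes_per_tuple
FROM pg_class c
WHERE c.relkind = 'i';
```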
Next piece of content- "PGSQL Phriday #010 is coming up". This is from pgsqlphriday.com. And this particular one is a challenge from Alicja on writing a blog post about pgBadger. You can cover things like how it has improved the performance of your database. Do you leverage its reporting capabilities? Have you encountered any challenges? Do you have unique or innovative ways of using it? So all sorts of different blog posts covering pgBadger. That's this month's PGSQL Phriday.
Next piece of content- "Joins 13 Ways". This is from justinjaffray.com. Now this isn't Postgres specific, or even necessarily relational database specific, but it covers all sorts of different ways to join data together, and I found it quite interesting. If you want to check out these different ways, definitely check out this blog post.
Next piece of content- "The Postgres Trajectory". This is from momjian.us. He has a short presentation of only 14 slides where he covers where he thinks Postgres is going. It seemed to have a fair amount of focus on open source vs. closed source solutions. But if you want to check it out, the presentation is linked here.
Next piece of content- "Fun with PostgreSQL puzzles: Surface Area and 3D Slices". This is from crunchydata.com. This is continuing the Advent of Code challenge; this post covers day 18. So if you want to learn more about that, check out this blog post.
Next piece of content- "pgBackRest differential vs. incremental backups". This is from pgstef.github.io. So if you want to learn more about those differences in pgBackRest, check out this blog post.
Next piece of content- "Goodbye etcd, Hello PostgreSQL: Running Kubernetes with an SQL Database". This is from martinheinz.dev. And this is interesting. I didn't know you could use Postgres instead of etcd for a Kubernetes cluster. But he did say quote "It's intertwined and tightly coupled with Kubernetes". So there are some things you have to do, but apparently, it can work. So if you're interested in that, you can check out this blog post.
Next piece of content, there was another episode of Postgres FM last week. This one was on "pg_upgrade: the tricky and dangerous parts". This is from postgres.fm. And if you want to listen to this episode, you can click the button here or look at their YouTube video here.
And the last piece of content: the PostgreSQL person of the week is Afsane Anand. If you want to learn more about Afsane and her contributions to Postgres, definitely check out this blog post.