
Postgres 16 Beta 2, The Rise of Vectors, FDW Performance, Unused Indexes | Scaling Postgres 272


In this episode of Scaling Postgres, we discuss the release of Postgres 16 Beta 2, the rise of vectors and storing them, Foreign Data Wrapper performance and how to identify unused indexes.


Transcript

All right. I hope you, your friends, family, and coworkers continue to do well. Our first piece of content is "PostgreSQL 16 Beta 2 Released!" This is from postgresql.org. Beta 2 has been released, and there aren't too many changes, but the one that caught my eye was that the "default collation provider selected by initdb is changed back to libc". So clearly there were some issues with defining collations when starting out. But if you want to see all the new stuff in Postgres 16, you can try out this Beta 2.

Next piece of content- "VECTORS ARE THE NEW JSON IN POSTGRESQL". This is from jkatz05.com. And when I first looked at this title, I interpreted it as replacing JSON and I thought oh no, but actually, that's not the case. What Jonathan is describing is that vectors are in the place where JSON was a number of years back. Meaning that back prior to Postgres 9.-2, there was no JSON support in Postgres but with 9.-2, they introduced a text version of the format. And in terms of indexes, you had to rely on expression indexes if you wanted to query from it. But then a short time later in version 9.4, they added the JSON B type, which is the binary storage representation. And they added gen indexes to allow you to query it at a high-performance level. So he says vectors are pretty much in the same place.

So for years, Postgres has supported arrays, and you can store vectors in those arrays. And additionally, there's the cube data type, which can store vectors as well. But each of them has its own caveats. So for example he says, quote, "arrays can handle vector dimensionality", like up to a thousand different dimensions, "but not different types of operations" you want to do, like a nearest neighbor search. Cubes can handle those types of operations, like nearest neighbor, but they can't handle the thousands of dimensions that exist for artificial intelligence or machine learning requirements.
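As a rough sketch of those two older options (the table and column names here are hypothetical, not from the post): arrays can hold the numbers but have no distance operators, while the cube extension provides a Euclidean distance operator but caps dimensions at 100 by default.

```sql
CREATE EXTENSION IF NOT EXISTS cube;

-- Hypothetical tables, just to show the shapes involved.
CREATE TABLE array_embeddings (id bigserial PRIMARY KEY, embedding float8[]);
CREATE TABLE cube_embeddings  (id bigserial PRIMARY KEY, embedding cube);

INSERT INTO cube_embeddings (embedding) VALUES (cube(ARRAY[1.0, 2.0, 3.0]));

-- Nearest-neighbor ordering using cube's Euclidean distance operator.
SELECT id
FROM cube_embeddings
ORDER BY embedding <-> cube(ARRAY[1.1, 2.0, 2.9])
LIMIT 5;
```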

But there's a relatively new extension called pgvector that satisfies these conditions. In addition, it has an index type called ivfflat (IVF Flat) to do the indexing of those vectors. So he envisions that this is about the stage where JSON was when it was first introduced, and that now we're just looking at how we want to refine it: what features should be placed in the Postgres core and what enhancements should be added to the pgvector extension that's been created. So I found this very interesting. There's also the presentation that he did, "Vectors are the new JSON", that I would encourage you to check out as well.
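For reference, a minimal pgvector sketch following its documented usage pattern; the table name and the three-dimensional vectors are illustrative only.

```sql
CREATE EXTENSION IF NOT EXISTS vector;

-- Real embeddings usually have hundreds or thousands of dimensions;
-- three is used here just to keep the example readable.
CREATE TABLE items (
    id        bigserial PRIMARY KEY,
    embedding vector(3)
);

INSERT INTO items (embedding) VALUES ('[1,2,3]'), ('[4,5,6]');

-- IVFFlat index for approximate nearest-neighbor search.
CREATE INDEX ON items USING ivfflat (embedding vector_l2_ops) WITH (lists = 100);

-- Five nearest neighbors by Euclidean distance.
SELECT id FROM items ORDER BY embedding <-> '[2,3,4]' LIMIT 5;
```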

In addition, there's another piece of content- "Storing and querying vector data in Postgres with pgvector". This is on pganalyze.com, and this is Lukas's five minutes of Postgres. This is the blog post he covers, and he goes into more depth on the subject, as well as giving a lot of kudos to Andrew Kane, who is the lead developer of pgvector. Which is interesting, because he has also written a lot of Ruby libraries, so he's been very prolific in the Ruby community as well. Kudos to him, and I definitely encourage you to check out Lukas's piece of content with regard to this.

Next piece of content, also related to pgvector, is "Image recognition with Python, OpenCV, OpenAI, CLIP and pgvector". This is from dev.to, in Francesco Tisiot's section. Now, he's talking about the pgvector extension being supported in Aiven's Postgres service. But what I found interesting about this post is that it covers vectors from first principles. So it gives an introduction to vectors and embeddings to help you understand what they're all about. So if you're interested in learning more about that, I definitely encourage you to check out this blog post as well.

Next piece of content- "Performance Tips for Postgres Foreign Data Wrappers". This is from crunchydata.com. And as you use Foreign Data Wrappers, this one they're predominantly talking about, the Postgres Foreign Data Wrappers So, you're speaking to another server and pulling the data from it to potentially combine it with data that exists on your local database. They mentioned performance can be pretty good until you start trying to join across multiple servers. So you're joining a local table with a remote table, there lie some issues because if you do a simple join like this to a foreign table, what the foreign database sees is actually just a "select all" from the remote table. And they fetch it a hundred rows at a time, so basically this results in a very inefficient query.

Now they say here that local joins are going to perform fine; that just works as expected. Remote joins, where you're querying a remote server and asking it to join tables on that remote server, also work fine as long as you're on at least Postgres 11. But it's with cross-server joins that you run into issues. And the first recommendation for handling that is using a CTE. So basically you define in the CTE how you want to query that remote table so it doesn't bring back all the data, just precisely the data that you need, and then join it to the local table.

You can also use subqueries to do some of the querying as well, but they did say in the CTE here, where you're looking for a certain number of IDs, that you don't want to use "IN", you want to use "= ANY" and materialize the local table's IDs before sending them over. That allows a much more efficient query to be sent to the remote server to get the data back that you want. Another recommendation is increasing the FETCH count. So as opposed to the default of a hundred rows, there may be cases, depending on the query, where you want to fetch a thousand or ten thousand rows at a time. So they have a recommendation with regard to that.
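A sketch of how those two tips might look in practice, reusing the hypothetical names from the earlier setup; the literal IDs stand in for values gathered from the local side by the application or an earlier step.

```sql
-- Query the foreign table with a precise "= ANY" condition so only the
-- needed rows cross the wire, then join the result to the local table.
WITH needed_customers AS MATERIALIZED (
    SELECT id, name
    FROM customers                           -- foreign table
    WHERE id = ANY (ARRAY[101, 202, 303])    -- IDs gathered locally beforehand
)
SELECT o.id, nc.name
FROM orders o                                -- local table
JOIN needed_customers nc ON nc.id = o.customer_id;

-- Raise the number of rows fetched per round trip from the default of 100.
ALTER SERVER remote_server OPTIONS (ADD fetch_size '10000');
```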

And then also, once you get to really large data sizes, consider caching locally. This could be done through a materialized view or even just creating a cache table that is maintained using a merge command. So periodically, you merge new changes into that local table. And I kind of like what he says here, quote: "For Foreign Wrapper Performance, think like an application developer". So you really need to think, "How am I going to construct this query to make it the most efficient it can be?". Normally, you just query the data that you want, and Postgres is able to figure out the most optimized path to give it to you. But when you're crossing a server-to-server boundary, you need to take more responsibility for thinking about how to query the data in an efficient manner. But if you want to learn more, definitely check out this blog post.
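One simple way to do the materialized-view flavor of that caching, again with hypothetical names; the unique index is what allows the non-blocking CONCURRENTLY refresh.

```sql
-- Cache the foreign "customers" table locally.
CREATE MATERIALIZED VIEW customers_cache AS
SELECT * FROM customers;

-- A unique index is required for REFRESH ... CONCURRENTLY.
CREATE UNIQUE INDEX ON customers_cache (id);

-- Refresh periodically, e.g. from cron or pg_cron.
REFRESH MATERIALIZED VIEW CONCURRENTLY customers_cache;
```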

Next piece of content- "Finding Unused Indexes In PostgreSQL". This is from pgdash.io. And of course, the more indexes you add, the more you can boost select performance but at the expense of inserts and updates. So as an example they created a table and then added three indexes to it. And then on this other table, they had no indexes and you can see that inserting the million rows took four times as long with three indexes on it compared to no indexes. In addition with an update, updating that million rows took four times as long with three indexes versus having no indexes. So having additional indexes will definitely have a performance impact from an insert and update perspective even though maybe your selects are faster. So definitely something to keep in mind.

And then of course the last thing mentioned is that it's also going to take up more disk space: the more indexes you have, the more disk space you will need. But how do you identify indexes that can be removed? For that, you generally use the pg_stat_user_indexes view, because it tells you how many index scans have happened for each index. Now, a disadvantage is that it shows an aggregate count. So if it's been a while since your statistics were reset, it can be hard to tell whether an index that was used in the past has recently stopped being used. It will definitely tell you if an index has never been used, so this view is advantageous from that perspective.

They also have a recommendation of actually storing the data on a periodic basis, so you can see those changes. But the benefit with Postgres 16, which they mention here, is that they've added an additional column, "last_idx_scan". So it actually gives you the timestamp of when the last index scan occurred. That way you can easily see the last time an index was used. So that's great. But if you want to learn more about that, you can definitely check out this blog post.
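A sketch of the kind of query this enables; note that the last_idx_scan column only exists on Postgres 16 or later, and ordering by size is just one reasonable way to prioritize candidates.

```sql
-- Indexes that have never been scanned since statistics were last reset,
-- largest first.
SELECT schemaname,
       relname,
       indexrelname,
       idx_scan,
       last_idx_scan,                                   -- Postgres 16+
       pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
WHERE idx_scan = 0
ORDER BY pg_relation_size(indexrelid) DESC;
```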

Next piece of content- "MEMORY CONTEXT: PRIVATE MEMORY MANAGEMENT IN POSTGRESQL". This is from cybertec-postgresql.com and they're talking about how usually running queries have their own memory area and they describe memory contexts. This post goes into a lot of detail about the advantages, how they're organized, and I definitely recommend reading this to get a good perspective on that. But there is a way to view memory context usage with this view which is "pg_backend_memory_contexts", which shows all memory contexts held by the current session. So for each session or connection to the database, you can examine the memory contexts.

And there's also a function called "pg_log_backend_memory_contexts", and you pass it the process ID of a backend process. So basically you can find the process ID in pg_stat_activity of, say, a long-running query, and then use that to see how its memory is being used. They also talk about attaching a debugger, which I wouldn't necessarily do in production, and then they go into issues of actually running out of memory in Postgres.
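A short sketch of both of those, with a made-up process ID standing in for one you would look up in pg_stat_activity; the view and the function are available on Postgres 14 and later.

```sql
-- Memory contexts held by the current session, largest first.
SELECT name, parent, level, total_bytes, used_bytes
FROM pg_backend_memory_contexts
ORDER BY total_bytes DESC
LIMIT 10;

-- Ask another backend to dump its memory contexts to the server log.
-- 12345 is a hypothetical pid taken from pg_stat_activity.
SELECT pg_log_backend_memory_contexts(12345);
```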

Another way that information gets logged is that if you have appropriately set "vm.overcommit_memory to 2 (and tune your vm.overcommit_ratio)" in Linux, then when Postgres runs out of memory you'll get a regular out-of-memory error as opposed to potentially crashing, and quote: "PostgreSQL will write a memory context dump to the log file". So you can actually look in the log file to see those memory contexts as well. So if you want to learn more about the subject, I definitely recommend checking out this blog post.
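For reference, a sketch of those kernel settings; the file name and the ratio value are placeholders that need tuning for the machine's RAM and swap, not recommendations from the article.

```
# /etc/sysctl.d/postgres.conf (hypothetical file name)
vm.overcommit_memory = 2
vm.overcommit_ratio = 90    # placeholder value; tune to your RAM and swap
```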

Next piece of content "PGConf.DE 2023- The Last Hurrah of 'Don't Do This'?". This is from vyruss.org. And I believe we've covered this presentation before, but it's been updated for this month. And the presentation is "Don't Do This", so there's a number of things that you should not do in Postgres. And the actual PDF of all the slides are here, so this is something I think is good to review periodically to make sure you're not doing these things, so definitely recommend checking this piece of content now.

Next piece of content- "A look at PostgreSQL's journey over 5 years in Stack Overflow's Developer Survey". This is from stormatics.tech. And they've kind of pulled together some different surveys to show how Postgres has evolved over the last number of years. You can see the gold line is PostgreSQL that's been increasing pretty steadily, but that's kind of stabilized here, whereas a lot of the bigger players- the Oracle, MySQL, and Microsoft SQL servers seem to have been declining and more recently MongoDB going down more than usual. This is from the DB engines ranking, but they also show the most popular and, we covered this on last week Scaling Postgres, where Postgres has now eclipsed MySQL being the most popular and as well being the most admired/loved. So if you want to take a look at these charts you can check out this blog post.

Next piece of content- "The Do's and Don'ts of Postgres High Availability Part 3: Tools Rules". This is from enterprisedb.com, and they're talking about different tools you can use for your high-availability setup. They have a number of recommendations here. One is EDB Failover Manager or Repmgr, and it considers these more traditional or a little bit older systems for doing high availability.

Although these are also, I believe, products of EDB. Next, they cover Patroni, which from what I've heard is used a lot, as well as pg_auto_failover, which might be the simplest solution for doing high availability. And then lastly, the newest product by EDB, Postgres Distributed. In terms of backup tools, they recommend Barman, again a product of EDB, as well as pgBackRest. So check out this blog post if you want to learn more.

Next piece of content- "Using iCalendar RRule in Postgres". This is from aklaver.org and I thought this was interesting because there's actually a defined rule called RRule that the iCalendar specification uses for determining repeating events. And he talks about the rule and shows hows you some Python libraries where you can pass it in that rule's parameters to be able to display data. So, in general, you wouldn't want to display all this individual data. You would only want to use it at render time but actually, just store the rule.

And he shows how you would generally store that rule in Postgres. So here you can see he's defined a task title, a task description, the actual rule, and then the start date. And from this, you can generate all the repeating events; you don't have to store absolutely everything. He even wrote a PL/Python function to find the next occurrence of an event based upon a rule. So if you're interested in that, definitely check out that blog post.
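A minimal sketch of a table along those lines; the column names and the sample RRULE string are illustrative, not copied from the post.

```sql
CREATE TABLE tasks (
    id          bigserial PRIMARY KEY,
    title       text NOT NULL,
    description text,
    rrule       text NOT NULL,        -- e.g. 'FREQ=WEEKLY;BYDAY=MO,WE,FR'
    start_date  timestamptz NOT NULL
);

INSERT INTO tasks (title, description, rrule, start_date)
VALUES ('Team standup', 'Short sync meeting',
        'FREQ=WEEKLY;BYDAY=MO,WE,FR', '2023-07-03 09:00+00');
```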

Next piece of content- "Setup PostgreSQL development environment on MacOS". This is from highgo.ca. So if you have interest in doing that, you can check out this blog post.

Next piece of content- "OrioleDB beta has been released". This is from orioledata.com. And this is the new storage engine that they developed that has taken a first principles approach, trying to maximize "high transaction troughput, high volume of updates, high volume of in-memory operations, and avoiding lock bottlenecks". And I think of this as basically a Z heap alternative where they're using an undue log as opposed to storing updates as a new row in the actual heap table. So this is great that the beta has been released. Check this blog post out if you want to learn more.

Next piece of content, there was another episode of Postgres.FM last week. This one was on “UUID” and particularly "discussing the performance aspects of using UUID for primary keys". So you can listen to the episode or you can watch the YouTube video.

And the last piece of content, the PostgreSQL person of the week is Stephanie Janine Stölting. If you want to learn more about Stephanie and her contributions to Postgres, definitely check out this blog post.
