Easy Foreign Data Wrappers, JSONB Cheatsheet, Updating Cost Limit, Parallel Aggregate | Scaling Postgres 261

In this episode of Scaling Postgres, we discuss how to easily create a foreign data wrapper to consume an API, a convenient JSONB cheatsheet, changes to updating the vacuum cost limit, and new parallel aggregates.

Transcript

All right. I hope you, your friends, family, and coworkers continue to do well. Our first piece of content is "Making a Data Polyglot with PostgreSQL Foreign Data Wrappers". This is from richyen.com. This is a pretty simple post, but it shows how quickly you can build a Foreign Data Wrapper to interact with essentially any API. A Foreign Data Wrapper enables Postgres to contact an external data source.

It could be another Postgres server, an Oracle server, a spreadsheet, a flat file somewhere, or even an API. And it allows you to run select queries against that foreign table as if it were a local table. In this post, he's creating a Foreign Data Wrapper against a Cloudsmith API. And what he used to do this is something called Multicorn. The name doesn't really explain what it does, but it allows you to easily create Foreign Data Wrappers using Python.

So frequently you would use C to write extensions in Postgres, but this allows you to do it in Python. And he says right here, quote, "... over 50 of the FDWs listed on the PostgreSQL Wiki page are written for access via Multicorn". So once you install Multicorn, you can write your Foreign Data Wrapper using Python classes. He grabbed the MailChimp Foreign Data Wrapper as an example, modified some classes, and was able to get it working.

To actually use it, you just say CREATE EXTENSION multicorn, create your foreign server with its specific options, create a schema, and then create the foreign table in that schema that matches what the API returns. Include your API key, and now you can just select all the records from that foreign table and it pulls them in from the API. So this seems super easy to do, so if you have a use case for interacting with other sorts of APIs, maybe this is a route you could go. But if you want to learn more about that, check out this blog post.
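
To give a flavor of the SQL side, here's a minimal sketch. The wrapper class path, the api_key option, and the column list are hypothetical stand-ins for whatever your Python class and the API actually expect.

```sql
-- Minimal sketch of the SQL side of a Multicorn-based FDW.
-- The wrapper class, options, and columns are hypothetical; they depend
-- on the Python class you write and the API you are calling.
CREATE EXTENSION multicorn;

CREATE SERVER cloudsmith_srv
  FOREIGN DATA WRAPPER multicorn
  OPTIONS (wrapper 'cloudsmith_fdw.CloudsmithForeignDataWrapper');

CREATE SCHEMA cloudsmith;

CREATE FOREIGN TABLE cloudsmith.packages (
  name        text,
  version     text,
  uploaded_at timestamptz
)
SERVER cloudsmith_srv
OPTIONS (api_key 'YOUR_API_KEY');  -- hypothetical option passed to the Python class

-- Query the API as if it were a local table
SELECT name, version FROM cloudsmith.packages LIMIT 10;
```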

Next piece of content- "PostgreSQL JSONB Cheatsheet: Complete & Fast Lookup Guide". This is from dev.to in the Francesco Tisiot section. And he basically created a JSONB cheat sheet here and there's a high-resolution copy you can download. And this is great because I use JSONB but not every day and whenever I need to use it, I always have to Google or look up in the documentation how to do certain functions. And having this as a ready reference is awesome. So definitely check out this piece of content and download this resource if you're interested in that as well.

The next piece of content- "Updating the Cost Limit On The Fly". This is from our rhass.blogspot.com. And apparently, an enhancement was just added to Postgres 16 that enables you to modify the vacuum_cost_limit while a vacuum job is running. So let's imagine a scenario where you have a very large table in vacuum that is taking forever to run, from days to even a week. But the resources of the server are not fully utilized.

What that usually means is that you need to alter your vacuum or autovacuum configuration. And the number one value to increase, in my opinion, is the vacuum_cost_limit. Boosting that allows more work to be done per unit of time and will vacuum that table faster. But the problem prior to this enhancement was that changing it didn't affect already running vacuum jobs.

So if you're in the middle of vacuuming that huge table, altering this configuration won't alter that job; it only applies to new jobs that are started. But with this enhancement, it now updates the actively running job, so it will use more resources and therefore run faster. So this is awesome. If you want more details, definitely check out this blog post.
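
For reference, bumping the limit looks something like this; the value 2000 is just an illustrative number, and the behavior for already-running jobs is the Postgres 16 change the post describes.

```sql
-- Raise the vacuum cost limit cluster-wide and reload the configuration.
-- Per the post, on Postgres 16 an already running vacuum picks up the new
-- value on reload, rather than only newly started jobs.
ALTER SYSTEM SET vacuum_cost_limit = 2000;   -- default is 200
SELECT pg_reload_conf();

-- autovacuum_vacuum_cost_limit defaults to -1, meaning it inherits this value
SHOW vacuum_cost_limit;

-- Check on long-running vacuums
SELECT pid, relid::regclass, phase, heap_blks_scanned, heap_blks_total
FROM pg_stat_progress_vacuum;
```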

Next piece of content- "PARALLEL AGGREGATE- POSTGRESQL16- BETTER PERFORMANCE". This is from cybertec-postgresql.com. And they're talking about parallel aggregates where, basically, you're able to use multiple processes to do aggregation. This is things like sums or averages or finding the max or a min. And in Postgres 16, the capability to do aggregations in parallel has been done for the array_agg() function and the string_agg() function. So they show an example of that here where they create a table, insert some test data, and change the settings of doing parallel work to zero so they can ensure that they get a parallel plan.

They do an explain plan of a query against the table they created. You can see that the plan's total cost is about 117, and that it's doing a Gather Merge, so it's doing a parallel aggregate. When you compare it to a previous version, Postgres 13, the cost is 185. So not quite twice as fast, but a significant improvement, from a cost of 185 down to about 117, by doing that aggregate in parallel. So if you want to learn more about this feature, definitely check out this blog post.
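
Here's a rough sketch of that kind of test, not their exact script; the table name and row counts are made up.

```sql
-- Rough sketch of the kind of test in the post (not their exact script)
CREATE TABLE t_demo (id int, grp int, payload text);
INSERT INTO t_demo
SELECT i, i % 10, md5(i::text)
FROM generate_series(1, 1000000) AS i;
ANALYZE t_demo;

-- Make a parallel plan more attractive for the example
SET parallel_setup_cost = 0;
SET parallel_tuple_cost = 0;

-- On Postgres 16, string_agg()/array_agg() can appear under a
-- Partial/Finalize Aggregate fed by a Gather or Gather Merge node
EXPLAIN SELECT grp, string_agg(payload, ', ')
FROM t_demo
GROUP BY grp;
```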

Next piece of content is Lukas's "Five minutes of Postgres". This is from pganalyze.com. He's looking at vacuum cost limits and parallel aggregate improvements in Postgres 16. Basically, he reviews the two previous posts and what he's looking forward to in Postgres 16. So if you want his perspective on these changes, definitely check out his piece of content.

Next piece of content- "SQL Maxis. Why we ditched RabbitMQ and replaced it with a Postgres queue". This is from prequel.co. And this post describes exactly what they did and why. Now when they started talking about dropping RabbitMQ for a queue within Postgres, I had envisioned they were going to be using listen and notify to be able to process jobs for example. But actually, their jobs are really long-running. There are some that are on the order of seconds but some can be on the order of hours because they're transferring and processing a massive amount of data.

So really it seems like a case of large jobs that run relatively infrequently. For that, it looks like they just created a jobs table in Postgres and kept it updated with each job's status. So it was just pure SQL and tables doing the queue, and RabbitMQ was a bit heavyweight for that. I always enjoy these retrospectives on real-world uses of technology. So if you're interested in learning more about this, definitely check out this blog post.
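
As a general illustration (not their actual schema), a Postgres-backed queue for long-running jobs often looks something like this, with workers claiming jobs via SELECT ... FOR UPDATE SKIP LOCKED. The table and column names here are hypothetical.

```sql
-- Hypothetical sketch of a Postgres-backed job queue
CREATE TABLE jobs (
  id          bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
  kind        text NOT NULL,
  payload     jsonb NOT NULL DEFAULT '{}',
  status      text NOT NULL DEFAULT 'pending',   -- pending | running | done | failed
  created_at  timestamptz NOT NULL DEFAULT now(),
  started_at  timestamptz,
  finished_at timestamptz
);

-- A worker claims the next pending job; SKIP LOCKED keeps multiple
-- workers from grabbing the same row.
UPDATE jobs
SET status = 'running', started_at = now()
WHERE id = (
  SELECT id FROM jobs
  WHERE status = 'pending'
  ORDER BY created_at
  LIMIT 1
  FOR UPDATE SKIP LOCKED
)
RETURNING id, kind, payload;
```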

Next piece of content is actually a set of posts, all about the PGSQL Phriday post submissions on triggers. The first one was from mydbanotebook.org, and she was the sponsor of this month's PGSQL Phriday. Basically, she leans toward developers not using triggers at all and equates them to "sharp knives given to toddlers" because she sees so many performance problems with triggers.

And she struggles to find legitimate use cases for them. She says a lot of people suggest auditing, but she doesn't really think triggers should be used for that either. Instead, use extensive Postgres logging, say to CSV format, to collect all of that information.

The next post on the topic is called "Triggers: A Love/Hate Relationship". This is from softwareandbooze.com. This is a pretty substantial post, so I'll just cover a few points here. One, he says a problem with triggers is the order of execution; in other words, it's sometimes hard to determine when a given trigger will fire. In Postgres, triggers on the same table and event fire in alphabetical order by name.

So just make sure, if you have multiple triggers, that they're named so they fire in the order you want (there's a small sketch of this below). Two, triggers are easy to mess up and can potentially mess up your data. Three, triggers impact performance. Absolutely. And then, as a bonus, triggers hide logic. This is why I really don't like using a lot of triggers, because it's kind of magic behind the scenes, at least to some developers. But in terms of benefits, he likes using them for audit tables.

He also likes using them to propagate changes and when building a queue. I'm not so sure about those ones; I think I'd rather do what needs to be done explicitly within a transaction as opposed to in a trigger. And then automatically updating a column, which I can see use cases for, although if you can use generated columns, I would definitely choose those over a trigger. But there's definitely a lot of perspective there you can check out.
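
To make the firing-order point concrete, here's a small sketch. The table and trigger names are made up, but the alphabetical ordering is how Postgres behaves, so a numeric prefix makes the order explicit.

```sql
-- Triggers on the same table and event fire alphabetically by name
CREATE TABLE orders (id bigint PRIMARY KEY, total numeric);

CREATE FUNCTION log_trigger_name() RETURNS trigger AS $$
BEGIN
  RAISE NOTICE 'firing %', TG_NAME;
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER trg_010_normalize BEFORE INSERT ON orders
  FOR EACH ROW EXECUTE FUNCTION log_trigger_name();

CREATE TRIGGER trg_020_validate BEFORE INSERT ON orders
  FOR EACH ROW EXECUTE FUNCTION log_trigger_name();

-- The NOTICEs show trg_010_normalize firing before trg_020_validate
INSERT INTO orders VALUES (1, 100);
```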

The next post on this is "STOP! TRIGGER TIME". This is from pgmustard.com. He emphasizes that triggers are actually used for some internal purposes, mainly keeping foreign key constraints accurate. And he gives an example where he creates an authors table and a books table with a foreign key. He inserts a lot of records into authors and a lot of records into books, then deletes one of the authors. As a consequence of that foreign key trigger, the execution time is 14 milliseconds.

Of that, almost all of it, 13.9 milliseconds, is spent searching the books table to make sure that referential integrity is maintained. So this indicates a potential problem, and the way you solve it is to put an index on your foreign keys. Generally, that's my recommendation when indexing a table: of course you have a primary key, and you want to index your foreign keys as well. Once that index is added and he deletes another author, you can see the execution time is now 1.2 milliseconds, so definitely faster. But check out this post if you want to learn more.
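
Here's an approximate recreation of that kind of test, not the post's exact script; the table sizes and IDs are just for illustration.

```sql
-- Approximate recreation of the kind of test in the post
CREATE TABLE authors (id bigint PRIMARY KEY, name text);
CREATE TABLE books (
  id        bigint PRIMARY KEY,
  author_id bigint REFERENCES authors (id),
  title     text
);

-- Authors 10001 and 10002 have no books, but deleting them still
-- makes the FK trigger check the books table.
INSERT INTO authors SELECT i, 'author ' || i FROM generate_series(1, 10002) i;
INSERT INTO books   SELECT i, (i % 10000) + 1, 'book ' || i FROM generate_series(1, 1000000) i;

-- Without an index on books.author_id, the FK trigger scans books;
-- EXPLAIN ANALYZE reports that time under "Trigger for constraint".
EXPLAIN ANALYZE DELETE FROM authors WHERE id = 10001;

-- Index the foreign key column and the same kind of delete becomes cheap
CREATE INDEX ON books (author_id);
EXPLAIN ANALYZE DELETE FROM authors WHERE id = 10002;
```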

The next piece of content on this topic is "A word about triggers". This is from hdombrovskaya.wordpress.com. This is a bit of a shorter response, but she says triggers definitely shouldn't be banned, and she has had some use cases where they've been beneficial, particularly some quite complex permission setups she had to implement.

Next post in the series is "AUDIT DATA WITH TRIGGERS". This is from rustprooflabs.com. This is another vote for using triggers for audit purposes. He writes a whole post about how to build an auditing capability in your database using triggers.

Next post on the topic is "Triggers for tracking changes in a table". This is from andreas.scherbaum.la. And he's again talking essentially about auditing.

Next blog post on the subject is "the art of the trigger". This is from Dian M. Fay at di.nmfay.com. And she makes a very important point here: triggers let databases encode automatic behaviors and responses, and those are infamously hard to understand. She covers a lot of other points that you should definitely review. But in terms of my perspective on triggers, this is the number one reason why I don't want to use a lot of them: it's a bit like magic. You go to do an update, an insert, or a delete, and then some side effect happens that maybe you're not expecting.

They're a bit like callbacks in programming: you update a record and then something else gets changed that you're not expecting. I'd rather be much more explicit. For example, if you want to update two other tables, go ahead and put that in a transaction and explicitly do those updates, as opposed to relying on a trigger. But there are three areas where I do like using triggers. One is for DBA operations. Like if we're having to change an int column to a bigint and we want to keep that column in sync, or keep a whole table in sync, triggers are the way to go.

They're a temporary tool for that purpose. The second is updating columns when generated columns won't work. I had a use case where I had to search on names, and it was important to remove the accents from the names before doing a full-text search. Based on how the unaccent extension worked, I believe we couldn't use generated columns, so we had to use a trigger. That's another case where a trigger had to be used.
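
Here's a minimal sketch of that kind of unaccent trigger, assuming a hypothetical people table; this isn't the actual code from that project. The reason a generated column won't work is that unaccent() isn't immutable.

```sql
-- Minimal sketch: maintain an unaccented tsvector column with a trigger,
-- since unaccent() can't be used in a generated column.
CREATE EXTENSION IF NOT EXISTS unaccent;

CREATE TABLE people (
  id          bigint PRIMARY KEY,
  name        text NOT NULL,
  search_name tsvector
);

CREATE FUNCTION people_search_name() RETURNS trigger AS $$
BEGIN
  NEW.search_name := to_tsvector('simple', unaccent(NEW.name));
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER trg_people_search_name
  BEFORE INSERT OR UPDATE OF name ON people
  FOR EACH ROW EXECUTE FUNCTION people_search_name();

CREATE INDEX ON people USING gin (search_name);

-- Accent-insensitive search: 'Jose' matches 'José'
INSERT INTO people (id, name) VALUES (1, 'José');
SELECT name FROM people
WHERE search_name @@ plainto_tsquery('simple', unaccent('Jose'));
```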

And the third is auditing. Now, if all you need to do is audit and review something internally, you could get by with using the PostgreSQL logs. But if you want to present something to your stakeholders or your users from within the database itself, it can make a lot of sense to implement auditing on a subset of the tables. And I really don't think you get yourself into too much danger by simply doing inserts into a separate audit table. Yes, there will be an impact, but inserts are generally pretty fast. So those are the use cases where I would tend to use triggers.
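
And a simple sketch of that audit-by-insert approach, assuming a single shared audit table and a hypothetical invoices table; adjust to taste (per-table audit tables, extra columns, and so on).

```sql
-- Audit-by-insert sketch: one shared audit table plus an AFTER row trigger
CREATE TABLE audit_log (
  id         bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
  table_name text NOT NULL,
  operation  text NOT NULL,
  changed_at timestamptz NOT NULL DEFAULT now(),
  changed_by text NOT NULL DEFAULT current_user,
  old_row    jsonb,
  new_row    jsonb
);

CREATE FUNCTION audit_row_change() RETURNS trigger AS $$
BEGIN
  IF TG_OP = 'INSERT' THEN
    INSERT INTO audit_log (table_name, operation, new_row)
    VALUES (TG_TABLE_NAME, TG_OP, to_jsonb(NEW));
  ELSIF TG_OP = 'UPDATE' THEN
    INSERT INTO audit_log (table_name, operation, old_row, new_row)
    VALUES (TG_TABLE_NAME, TG_OP, to_jsonb(OLD), to_jsonb(NEW));
  ELSE  -- DELETE
    INSERT INTO audit_log (table_name, operation, old_row)
    VALUES (TG_TABLE_NAME, TG_OP, to_jsonb(OLD));
  END IF;
  RETURN NULL;  -- return value is ignored for AFTER row triggers
END;
$$ LANGUAGE plpgsql;

-- Attach it to whichever tables you want audited (invoices is hypothetical)
CREATE TABLE invoices (id bigint PRIMARY KEY, amount numeric, customer text);
CREATE TRIGGER trg_invoices_audit
  AFTER INSERT OR UPDATE OR DELETE ON invoices
  FOR EACH ROW EXECUTE FUNCTION audit_row_change();

INSERT INTO invoices VALUES (1, 99.50, 'acme');
SELECT operation, new_row FROM audit_log;
```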

Next piece of content- "Postgres 16 highlight- require_auth for libpq". This is from paquier.xyz. And he's talking about an enhancement where, now, the libpq library can request certain authentication modes. So for example, now you can say when you do a connection that you only want to use SCRAM-SHA-256. Now this helps prevent downgrade attacks where maybe the server says no I'm only going to use MD5 or no I'm only going to use trust authentication or something like that. This ensures from the client level that also ensures some amount of security. So if you want to learn more about that, definitely check out this blog post.

Next piece of content: we seem to have a new connection pooler called Supavisor, which is, as they say, a Postgres connection pooler. Now, this isn't a blog post, it's a tool, which I don't normally mention. But given a couple of posts in the last few episodes about pgcat being a new connection pooler, I thought I'd mention Supavisor here as well. You can check out the code on github.com.

Next piece of content- "Fun with PostgreSQL puzzles: Finding shortest paths and travel costs with functions". This is from crunchydata.com. And this is actually the 16th day of the advent of code challenge. So I won't go over this post here because that would essentially spoil it. But if you're interested in learning more about a traveling sales problem essentially, you can definitely check out this blog post.

Next piece of content- "How collation of punctuation and whitespace works". This is from peter.eisentraut.org. And this follows up on the post he did previously, talking about how collation works and this specifically focuses on punctuation and whitespace. So if you're interested in that, you can definitely check out this blog post.

Next piece of content- "Setting Up PostgreSQL Failover and Failback, the Right Way!". This is from highgo.ca. They're talking about setting up Postgres to be able to failover, which is usually just using the promote command from a pg_ctl. Or failing back. You can use pg_rewind for that purpose. So this blog post explains how to do that and set it up.

Next piece of content- "PostgreSQL Schema: Learning PostgreSQL with Grant". This is from red-gate.com. And he's talking about using schemas, which are essentially namespaces, for your objects in the database. How you can set them up and organize them, as well as how you can control the default search path for being able to navigate between different schemas. So if you want to learn more about that check out this blog post.

Next piece of content. There was another episode of Postgres FM last week. This one was on "Partitioning". They cover, as I see here, what it is, why and when it's helpful, and some considerations for your partition key. So if you're interested in that, definitely listen to their episode or check out their YouTube video.

Next piece of content, the PostgreSQL person of the week is Joël Cattin. If you're interested in learning more about Joël, definitely check out this piece of content.

And the last piece of content, we did have another episode of the Rubber Duck Dev Show this past Thursday afternoon. This one was on "Developer Documentation with Grant Willcox". So we discussed different ways you can do documentation in your code as well as for your product in general. If you're interested in that sort of content, we welcome you to check out our show.
