
Aggregate Filters, Insert-Only & Vacuum, Out-of-sync Sequences, Functional Indexes | Scaling Postgres 180


In this episode of Scaling Postgres, we discuss how to use aggregate filters, the importance of vacuum with insert-only tables, how to fix out-of-sync sequences and faster functional index options.


Transcript

All right. I hope you, your friends, family, and coworkers continue to do well. Our first piece of content- "Fast, Flexible Summaries with Aggregate Filters in Windows". This is from blog.crunchydata.com, and they're talking about the situation where, say, you have some sales data that's broken out into three categories, A, B, and C, in different columns, and you want to know what percentage of sales is attributable to different values within those columns. Now, the first way they did it was using a CTE, but that has to access the table three times to find out what percentage of sales came from rows where the category is B and what percentage from rows where the value of A is greater than 900. So that's not a very efficient way to do it, and it's quite long as well. The other way to do it, which I have tended to use in the past, is CASE statements. So you say: when this column is B, return the value, otherwise return zero, and then you sum it up.
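
As a rough sketch of that CASE approach (the table and column names here are hypothetical, not taken from the post), it looks something like this:

SELECT
    100.0 * sum(CASE WHEN c = 'B' THEN amount ELSE 0 END) / sum(amount) AS pct_c_is_b,
    100.0 * sum(CASE WHEN a > 900 THEN amount ELSE 0 END) / sum(amount) AS pct_a_over_900
FROM sales;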

So that's a pretty simple way to think about it and it can give you the answer pretty efficiently. However, what they're talking about here is that you can also attach a FILTER clause to any aggregate function. So they take the sum of the values, but filter it where column C equals 'B' to do essentially the same thing the CASE statement is doing, or they filter where A is greater than 900. It's a very efficient way to write it and a very clear way to read it. The post says it runs faster than the other solutions; I don't know how much faster it would be than the CASE statement, but they said it does run faster, so that's another benefit of this technique. They say this applies to all aggregate functions; for example, they use the FILTER clause with a standard deviation aggregate. So using aggregate filters is a great way to make your SQL code more readable and potentially more performant. You can check out this blog post if you want to learn more.
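
For comparison, here is what the FILTER form of that same hypothetical query might look like, with the condition attached directly to each aggregate (again, table and column names are made up for illustration):

SELECT
    100.0 * sum(amount) FILTER (WHERE c = 'B') / sum(amount) AS pct_c_is_b,
    100.0 * sum(amount) FILTER (WHERE a > 900) / sum(amount) AS pct_a_over_900,
    stddev(amount) FILTER (WHERE c = 'B') AS stddev_where_c_is_b
FROM sales;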

The next piece of content- "Insert-Only tables and Autovacuum Issues Prior to PostgreSQL 13". This is from blog.crunchydata.com, and they're talking about a scenario where a client had a query that was slowing down a lot on the primary and even more on a replica. Normally the query returned in less than a millisecond, but on the primary it got up to around ten milliseconds and on the replica up to 300 milliseconds, and it was resolved by simply running a vacuum. So they wanted to investigate why that was the case. Now, because they were using physical streaming replication, all the statistics should be the same; pretty much everything should be identical with regard to the data. What would be different is that the shared buffers would contain different things on the replica compared to the primary, because it's presumably receiving different queries.

So that might account for some of the slower response times on the replica. But the reason for the slowness that a vacuum resolved is that this query was apparently using an index-only scan, and index-only scans rely on the visibility map being up to date. One of the jobs of vacuum is to update the visibility map, which records which pages are all-visible and is what makes index-only scans super efficient. So then the question came: okay, why wasn't vacuum running? What was happening was that this was an append-only table; they essentially never deleted or updated data in it. In that case, autoanalyze still kicks off to keep your statistics up to date, but autovacuum doesn't run because there are no dead rows to clean up. In version 13, that changed so that insert-only activity also triggers a vacuum.
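
If you suspect one of your own append-only tables is in this situation, a general-purpose check (not something from the post) is to compare the insert counters against the vacuum timestamps in pg_stat_user_tables; the table name here is hypothetical:

SELECT relname, n_tup_ins, n_tup_upd, n_tup_del,
       last_autovacuum, last_autoanalyze
FROM pg_stat_user_tables
WHERE relname = 'my_append_only_table';
-- Lots of inserts, no updates or deletes, a recent last_autoanalyze and a
-- NULL last_autovacuum is exactly the pattern described here.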

So a benefit of being on 13 is that it handles these insert-only use cases where vacuum otherwise isn't running, and you can even adjust the behavior with a parameter called autovacuum_vacuum_insert_threshold. They then mentioned solutions to this type of problem. The first option is, of course, to upgrade to Postgres 13. The second option is to run vacuums manually on a periodic basis; you could use cron or any other scheduling tool to do that. The third option is to alter the table storage parameters. They mentioned modifying autovacuum_freeze_max_age for the table so an autovacuum gets forced sooner. Note that the usual scale factor and vacuum threshold settings count dead rows, so on their own they won't trigger a vacuum for a table that only sees inserts prior to version 13. But basically, if you run into this type of issue with append-only tables, you can do the upgrade, run vacuum more frequently, or alter some individual table parameters. So if you want to learn more, you can check out this blog post.
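
A sketch of what the second and third options might look like, with a placeholder table name and illustrative numbers rather than anything from the post:

-- Option 2: run a periodic manual vacuum, for example nightly from cron:
-- 0 3 * * * psql -d mydb -c 'VACUUM (VERBOSE) my_append_only_table;'

-- Option 3: lower the per-table freeze age so an autovacuum is forced sooner:
ALTER TABLE my_append_only_table SET (autovacuum_freeze_max_age = 10000000);

-- On Postgres 13+, the insert-driven vacuum can also be tuned per table:
ALTER TABLE my_append_only_table SET (autovacuum_vacuum_insert_threshold = 100000);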

The next piece of content- "FIXING OUT-OF-SYNC SEQUENCES IN POSTGRESQL". This is from cybertec-postgresql.com and they're talking about cases where the IDs in a table are out of sync with its sequence. When you create a table with a serial column, or a bigserial (serial8, as they write it here), Postgres also creates a sequence object and makes that ID auto-incrementing. So when you insert data into the table, it automatically fills in the correct ID based upon what the sequence says. If you insert a second row, just specifying the payload column, it automatically places the ID in there, ID 2. But you can also insert data manually and specify the ID explicitly, setting it to whatever value you want.

Now, at this point, you are out of sync with the sequence: the sequence thinks the next value should be 3, but you already inserted a 3. So when you go to insert the next row with just the payload column, it tells you, hey, there's a duplicate key violation because key 3 already exists. Now interestingly, and I didn't see this in the blog post, that's just a single failure, because the failed insert still consumed 3 from the sequence, so the sequence now believes the next value should be 4. You could easily retry the insert and it should succeed, but it would be inserted with ID 4. So in terms of getting out of sync, the sequence kind of handles it itself: it tries a value, fails, and moves on to the next number the next time.
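
A condensed sketch of that scenario (the table here is hypothetical; the post uses its own example):

CREATE TABLE t_demo (id bigserial PRIMARY KEY, payload text);

INSERT INTO t_demo (payload) VALUES ('a');         -- gets id 1 from the sequence
INSERT INTO t_demo (payload) VALUES ('b');         -- gets id 2
INSERT INTO t_demo (id, payload) VALUES (3, 'c');  -- explicit id, sequence not advanced

INSERT INTO t_demo (payload) VALUES ('d');
-- ERROR: duplicate key value violates unique constraint "t_demo_pkey"
-- The failed insert still consumed 3 from the sequence, so retrying
-- succeeds and the row gets id 4.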

And of course, the way to avoid it is not to specify an ID when you're doing an insert; just specify the other columns and let the sequence fill in the ID. But if for some reason you want to fix these, they actually created a new extension called pg_sequence_fixer. What it basically does is look at each of your tables, find the max ID in it, and set the sequence to that max ID. You can even add a safety margin; in their example they set it to 1,000 more than the maximum ID in the table. I've never had a really big issue with sequences being out of sync because they're kind of self-correcting: they'll eventually find the next value to use, though they produce some failures along the way. But if you want a tool to resolve it, you could check out this blog post and their pg_sequence_fixer.
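
If you'd rather fix a single sequence by hand than install the extension, the usual approach (a generic sketch, not the extension's code) is to set the sequence from the table's current maximum, optionally with a safety margin:

SELECT setval(
    pg_get_serial_sequence('t_demo', 'id'),
    (SELECT coalesce(max(id), 1) FROM t_demo) + 1000  -- 1,000 as a safety margin
);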

The next piece of content- "Index Only Scan on Functional Indexes". This is from dev.to in the AWS Heroes section, and they're talking about a scenario where you want an index-only scan to work with a functional index. A functional index is for when you query the data through a function, say looking up the username with an upper function. A normal index on just the username won't be used for that query because the query applies a function; you literally have to create the index with that function in it for the index to be used. But what if you just want the username back? So you just want an index-only scan to retrieve the username, but Postgres actually won't do one. As you can see, the query here doesn't use an index-only scan because Postgres looks for the username column in the index and it doesn't exist; the index only stores upper(username).
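
To make that concrete, a hypothetical schema (not the post's exact example) would look like this:

CREATE TABLE users (id bigserial PRIMARY KEY, username text NOT NULL);
CREATE INDEX users_username_upper_idx ON users (upper(username));

-- This query can use the functional index, but not as an index-only scan:
-- the index stores upper(username), not username itself.
SELECT username FROM users WHERE upper(username) = 'ALICE';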

But what you can do is create a covering index, basically including the username column as a payload. So here they created the index and then included the username as a payload, and when you run that same query looking for just the username, you get an index-only scan. So that's a convenient way to query for just this data more efficiently. He also said that in Postgres 12 you can use the new generated columns feature: you create a separate column of uppercased usernames that the system itself keeps up to date, essentially like a built-in trigger mechanism, and then you can just use a basic index on it. You don't have to create a functional index to avoid this type of problem. So this was a pretty brief post on how you can get more index-only scans when you want to query based upon a function.
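
Sketches of both fixes on that same hypothetical users table (INCLUDE requires Postgres 11+, generated columns require 12+):

-- Covering index: keep the expression as the key and carry username
-- along as a payload so an index-only scan becomes possible.
CREATE INDEX users_username_upper_covering_idx
    ON users (upper(username)) INCLUDE (username);

-- Or, on Postgres 12+, a stored generated column kept up to date by the
-- system, so queries can filter on a plain column instead of a function:
ALTER TABLE users
    ADD COLUMN username_upper text GENERATED ALWAYS AS (upper(username)) STORED;
CREATE INDEX users_username_upper_col_idx ON users (username_upper);
SELECT username FROM users WHERE username_upper = 'ALICE';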

The next piece of content- "PostgreSQL Database Security: OS - Authentication". This is from percona.com and they're talking about operating system-based authentication. In a previous post, they talked about internal authentication, where Postgres manages the passwords itself and you can create and change them there. This one talks about credentials that exist within the operating system and how the database can use them to do authentication. The first option available is Ident. That uses an ident server running on the client's machine, listening on a particular TCP/IP port, which Postgres contacts to verify the connecting OS user. The second option is PAM, or Pluggable Authentication Modules, which you can set up to authenticate against the operating system. The third option is Peer, and this is the one I tend to use most frequently. It uses a Unix socket connection to map the operating system user to a database user so it can log in and run queries. So if you want to learn more about operating system-based security for Postgres, definitely check out this blog post.
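
In pg_hba.conf terms, rough examples of the three methods might look like this (the addresses and option values are placeholders, not taken from the post):

# peer: local Unix-socket connections, OS user mapped to a database user
local   all   all                     peer

# ident: TCP connections verified against an ident server on the client machine
host    all   all   192.168.1.0/24    ident

# pam: authenticate against the operating system via a PAM service
host    all   all   10.0.0.0/8        pam pamservice=postgresql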

The next piece of content- "How to get advisory lock in shell?". This is from depesz.com and he's talking about a scenario where he has a lot of maintenance jobs running against a database server, maybe a repack operation, particular dumps of the database, or other data migrations, and he doesn't want them to overlap. So he wants some form of locking to make sure they don't step on each other's toes. His plan was to use the advisory lock feature. Basically it's an application-level lock: you specify a particular number as the lock, and nothing else can take that lock while it's held. Now, with his particular jobs, he wanted to hold the lock open for a period of time, so he developed a simple script that takes the lock and keeps the session open using the \watch functionality, all from within a shell script. So if you want to learn more about how he implemented this, you can definitely check out this blog post.
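
His script keeps the locking session open with \watch; as a simplified sketch of the same idea (the lock key and job path here are placeholders, and this is not his actual script), you can also take the lock and run the job from inside one psql session, so the lock is held for the job's duration and released when the session exits:

psql -X --set ON_ERROR_STOP=1 <<'SQL'
-- Blocks until no other session holds application lock 12345.
SELECT pg_advisory_lock(12345);
-- Run the maintenance job via psql's shell escape; the session-level
-- advisory lock is held until this psql session ends.
\! /usr/local/bin/nightly_maintenance.sh
SQL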

The next piece of content- "Announcing PG_DBMS_JOB in PostgreSQL for Oracle DBMS_JOB Compatibility". This is from migops.com, and according to this post, Migops does a lot of conversions from Oracle to Postgres and they like to use open-source tools to do it. They've used various scheduling tools such as pg_agent, pg_cron, and pg_timetable, but Oracle has its own DBMS_JOB scheduling tool, and there were some features of DBMS_JOB that none of these other tools could replicate. So they created their own extension, called PG_DBMS_JOB, that mimics a lot of those features, which allows an Oracle to Postgres migration that uses DBMS_JOB to simply switch to this extension. If you want to learn more about its capabilities and how it could potentially help you with an Oracle migration, you may want to check out this post.

The next piece of content- "Tuning the Pgpool-II Load Balancer". This is from highgo.ca and they're talking about using Pgpool not only as a connection pooler, but also as a load balancer. You can specify session-based load balancing, where once a client establishes a session, it always goes to a particular standby. So for example, the green client always goes to standby one and the red client always goes to standby two, based on when that session was created. Of course, I should mention this is for read load balancing; the writes always go to the primary. Alternatively, you can choose statement-level load balancing, where the SELECT statements are distributed across the existing standbys. This post talks about these two different forms of load balancing and which makes the most sense based on your usage. So if you're interested in that, you can check out this blog post.
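
In pgpool.conf terms, the relevant settings look roughly like this (the values are illustrative, not taken from the post):

# Enable read load balancing across the standbys
load_balance_mode = on

# off = session-level: each client session sticks to one standby
# on  = statement-level: each SELECT is distributed individually (Pgpool-II 4.1+)
statement_level_load_balance = off

# Relative read weights for the primary and two standbys
backend_weight0 = 0
backend_weight1 = 1
backend_weight2 = 1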

The next piece of content- "Advanced PostgreSQL Features: A Guide". This is from arctype.com and this blog post talks about a number of different advanced features of Postgres. So it's a bit of a simpler post, a very basic introduction to a variety of features. They cover the table inheritance capabilities that Postgres offers; non-atomic columns, meaning arrays or even JSONB fields where you can store more than one value within a column; window functions, which of course a lot of relational database systems support; its support for JSON data; its full-text search capabilities; its view capabilities, also available in many other relational databases; and lastly, geospatial data, which you can handle with the PostGIS extension. So if you want to learn more about these different types of advanced features, you can check out this blog post.

The next piece of content- "TimescaleDB Compression by example". This is a YouTube video that's on the TimescaleDB YouTube channel, and it's specifically talking about compression as it exists within TimescaleDB. Now, TimescaleDB does compression as a part of its column store capabilities, and this presentation, over an hour in length, goes over how this works, how it's implemented, and how you could use it for different use cases and some examples. So if you're interested in potentially using TimescaleDB, I highly encourage you to check out this presentation.
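
For reference, enabling compression on a hypertable in TimescaleDB 2.x looks roughly like this (the hypertable and column names are placeholders, not from the presentation):

-- Enable compression, segmenting compressed data by device_id
ALTER TABLE conditions SET (
    timescaledb.compress,
    timescaledb.compress_segmentby = 'device_id'
);

-- Automatically compress chunks older than seven days
SELECT add_compression_policy('conditions', INTERVAL '7 days');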

The next piece of content, the PostgreSQL person of the week is Daniele Varrazzo. So if you're interested in learning more about Daniele and his contributions to Postgres, definitely check out this blog post.

The last piece of content, we did have another presentation of The Rubber Duck Dev Show this past Wednesday. In this episode, we talked about "Application Infrastructure Performance". So if you're looking for more developer-related content in a long-form format, maybe you would like to check out this episode.
