
Shaped Sample Data, Version Changes, Missed Bottlenecks, Indexes for Newbies | Scaling Postgres 199


In this episode of Scaling Postgres, we discuss how to generate shaped sample data, track changes between Postgres versions, identify missed bottlenecks and use different index types.


Transcript

All right, I hope you, your friends, family, and coworkers continue to do well. Our first piece of content is "How to shape sample data with PostgreSQL generate_series() and SQL". This is from timescale.com. This is the third post in a series they're doing about generate_series, and it's about generating data you can use for testing purposes without having to, say, use a copy of production data. This part focuses on moving beyond just the random function and instead generating all sorts of different types of, I'll call them, waveforms. So, for example, producing data that follows a particular shape, using techniques like sine waves and mathematical variations to get the type of waveform you want to simulate. I found this to be a very interesting post. If you want to generate more realistic data, then definitely check out this blog post and find some techniques for how to do that.
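As a rough illustration of the idea (my own sketch, not code from the post), you can mix generate_series() with sin() and random() to produce a shaped, wavy series instead of pure noise:

-- One reading per hour, shaped as a daily sine wave with a little random noise:
SELECT ts,
       50 + 20 * sin(2 * pi() * extract(epoch FROM ts) / 86400)  -- daily cycle
          + 5 * (random() - 0.5)                                 -- noise
           AS reading
FROM generate_series('2022-01-01'::timestamptz,
                     '2022-01-07'::timestamptz,
                     '1 hour') AS ts;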

The next piece of content- "Configuration changes across Pg versions". This is from depesz.com, and he's talking about his site why-upgrade.depesz.com. It's basically a site that allows you to put in two different Postgres versions and it shows you what has changed between those versions. It shows you things like parameters that have changed or default values that have changed for parameters. For example, when you put in 13.5 to 14.1, it shows you that two parameters were removed and are no longer there. There are 17 new parameters that you can configure and three parameters have had their default values changed. So to me, this is a super useful tool. If you're ever wondering what's changed between versions, maybe try using this tool to help you get a better sense of it.

The next piece of content- "Five Easy to Miss PostgreSQL Query Performance Bottlenecks''. This is from pawelurbanek.com. The first area he mentions is when you're searching by a function call or basically, you have a query that's using a function. Well, that's not going to use any index that exists on it. You actually need to use a functional index. He says he's not really a fan of doing that, but there are some other ways you can do it other than just making a functional index. You could change all the data to ensure that the database always just has lowercase data in it. You could do the functional index. But his solution is using the CITEXT extension.

Basically, that lets you use a different data type, citext, which is case-insensitive text. So you would change the email column to citext, and then when you do a search, it can use the index. The next bottleneck he mentions is searching by a pattern, that is, using LIKE or ILIKE to find partial matches within a field. In his example, he wanted to see all emails ending in @example.com. Now, that's not going to use an ordinary index, and he has a particular technique using some extensions to make it work. But say you only wanted a left-anchored search. If you're looking for email addresses that start with "a", so a pattern like 'a%', you can do what is mentioned in a comment on the post: create the index on that email column with text_pattern_ops, and the index can then be used for those left-anchored LIKE searches.
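To make those concrete, here's a rough sketch of both approaches, assuming a users table with an email column (hypothetical names, not necessarily the post's exact schema):

-- Option 1: case-insensitive matching via the citext extension
CREATE EXTENSION IF NOT EXISTS citext;
ALTER TABLE users ALTER COLUMN email TYPE citext;
CREATE INDEX users_email_idx ON users (email);
SELECT * FROM users WHERE email = 'User@Example.COM';  -- can use the index

-- Option 2: if email stays a plain text column, left-anchored LIKE via text_pattern_ops
CREATE INDEX users_email_like_idx ON users (email text_pattern_ops);
SELECT * FROM users WHERE email LIKE 'a%';  -- left-anchored, can use this index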

But that won't work for a right-anchored pattern, so it won't help in this case; the technique he uses here does, though. Now, he mentions adding the btree_gin extension, and I don't know why, because it shouldn't be needed here; that's for when you want to combine a search on, say, an ID column and a JSON field. I did some tests and was able to get this working without it, so I'm not quite sure why it's there. What you do need is to add the trigram extension and then create a GIN index using the gin_trgm_ops operator class. Once that is done, you can do a right-anchored or even, I believe, an unanchored search on an email and it will use the index.
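A minimal sketch of that trigram approach, again assuming a users table with an email column (hypothetical names):

CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE INDEX users_email_trgm_idx ON users USING gin (email gin_trgm_ops);
-- Right-anchored (or unanchored) pattern searches can now use the index:
SELECT * FROM users WHERE email ILIKE '%@example.com';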

The next bottleneck he mentions is ordering with NULLS LAST, because if you don't have that ordering in your index, the query isn't going to be able to use it. You can tell the query switches to a sequential scan once he adds NULLS LAST to the ORDER BY, which he does because he doesn't want users with NULL email addresses showing up in the ten results he's limiting to. Now, you can add NULLS LAST to the index itself. He says that adding custom indexes on a per-query basis is a bad practice; to me, it depends on performance. If this is a main, core query that's going to be used a lot, it would make sense to add it. His alternative is doing two queries, and those two queries could have worse performance than just adding another index, plus it requires two round trips to the database, unless, of course, you're doing a UNION, for example.
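If you did decide the query warrants its own index, a sketch of matching the index to the ORDER BY might look like this (hypothetical table and column names):

CREATE INDEX users_email_nulls_last_idx ON users (email ASC NULLS LAST);
-- The matching query can then use the index for its sort:
SELECT * FROM users ORDER BY email ASC NULLS LAST LIMIT 10;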

Again, it depends on the use case; I might well want to add NULLS LAST to the index. The fourth bottleneck is bloated null indexes. This is the case where you have an indexed column that contains a lot of NULL values. Generally, you probably don't want to index all of those NULLs; you'll probably want a partial index covering only the rows where the value is NOT NULL, which gives you just the values you care about. You end up with a much smaller index that should be faster to search, use less space, and stay cached more of the time because it's smaller.
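A sketch of that partial index idea, using a hypothetical users table with a mostly-NULL referrer_id column:

-- Only index the rows that actually have a value:
CREATE INDEX users_referrer_id_idx ON users (referrer_id)
  WHERE referrer_id IS NOT NULL;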

The last bottleneck he mentions is the update transaction scope. Now, I wasn't really sure what this title meant, but I think he's just saying that when you're doing a lot of changes, a lot of updates or deletes, it's important to batch them, because you're going to be taking an awful lot of row locks, as he shows here with RowExclusiveLock, which you can check for. Whenever you're deleting or updating a lot of records, each of those rows has to be locked, so it's best to do that work in batches. But if you want to learn more about these bottlenecks, you can check out this blog post.
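As a rough sketch of batching those changes (my own example, not taken from the post), you can delete or update in fixed-size chunks and repeat until nothing is left to touch:

-- Assumes a hypothetical events table; each statement locks only the
-- 10,000 rows it touches, so run it in a loop until it affects 0 rows.
DELETE FROM events
WHERE id IN (
    SELECT id
    FROM events
    WHERE created_at < now() - interval '90 days'
    LIMIT 10000
);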

The next piece of content- "Postgres Indexes for Newbies". This is from blog.crunchydata.com and this is an introductory post about some of the main indexes. In Postgres, they talked about B-tree indexes, which are your general index type. Also discussed were BRIN indexes, which are block range indexes and are great for indexing over blocks of ranges. So this is good for date ranges or timestamp ranges. It's not as fast pulling out specific values, but for doing ranges it can be a really good index for that use case. Talking about GIST indexes are typically used for spatial type indexes as well as some text searching. Although I tend to use gen indexes for that. The GIN indexes are useful when you have multiple values in a single column. So that could be text search, that could be JSON or JSONB fields, and the post goes over these different index types and how to use them. So if you want to learn more, you can check out this blog post.

The next piece of content- "Waiting for PostgreSQL 15- Introduce log_destination=jsonlog". This is from depesz.com and they have added JSON output. I'm not a huge fan of having the log output be JSON because a lot more has to be written for every record. So essentially the column of the data will be written in every single record. So I'm not a huge fan of that. It'll probably take more space, but I can see how this becomes much easier to consume for monitoring applications. But if you want to learn more about this feature coming in Postgres 15, definitely check out this blog post.

The next piece of content- "Surviving Without A Superuser - Part Two". This is from rhaas.blogspot.com and this is a follow-up to the post where he's talking about potentially the need to have a type of user that is below a superuser but actually has control over a fair amount of schemas and objects within a segment of the database. So this could be useful for service providers where the service provider running the database on your behalf would have the superuser access. But you could be granted this secondary set of permissions to be able to manage different objects under your purview and he talked about some things we'll want to take into account, like allowing these special types of users to set up logical replication, or allowing them to set up event triggers or even some alter system commands. Not all of them, but a subset of them. So this blog post describes some of the things that would potentially need to be done to make this happen.

Next piece of content- "Why I enjoy PostgreSQL - Infrastructure Engineer's Perspective". This is from shayon.dev, and he was talking about the difference between MySQL and PostgreSQL in terms of how infrastructure engineers actually prefer MySQL. I did not know that. But he's mentioning some things that he likes about PostgreSQL, and he's talking about making schema changes and talking about how you can create an index or drop an index concurrently to avoid locking during that change. He also mentions you can apply foreign key constraints without significant locking by using the NOT VALID command.

So you would add the constraint using NOT VALID, and then at some later point you could validate the constraint; that validation can run without blocking in the same way. He also mentions being able to add NOT NULL or other constraints, or even a default value. Now, since Postgres 11, I believe, you can do that without locking, so I think this procedure would only be relevant for versions below 11, to my knowledge. And then the last area he covers is extensibility: there are definitely tons of extensions you can add to change the different features of Postgres. So if you're interested in learning more, you can check out this blog post.
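As a rough sketch of the two locking-friendly patterns he mentions, with hypothetical table and constraint names:

-- Build or drop an index without blocking writes:
CREATE INDEX CONCURRENTLY orders_user_id_idx ON orders (user_id);
DROP INDEX CONCURRENTLY orders_user_id_idx;

-- Add a foreign key without a long validation lock, then validate later:
ALTER TABLE orders
    ADD CONSTRAINT orders_user_id_fkey
    FOREIGN KEY (user_id) REFERENCES users (id) NOT VALID;
ALTER TABLE orders VALIDATE CONSTRAINT orders_user_id_fkey;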

Next piece of content- "PostgreSQL bi-directional replication using pglogical". This is from aws.amazon.com. They're talking about trying to get bi-directional replication going with pglogical and they walk through the steps of how to do it and get it running. Now, the only way it is set up to handle conflicts is basically the last write wins. So even though this is doing some bi-directional replication, if there is a conflict, the last one to update basically wins. But if you want to learn more about how to do this, you can check out this blog post.

The next piece of content- "DATA NORMALIZATION IN POSTGRESQL". This is from cybertec-postgresql.com and they're talking about normalization in general in the 1st, 2nd, and 3rd normal forms as well as the Boyce-Codd normal form. So if you are interested in learning more about normalization in Postgres in particular, definitely check out this blog post.

The next piece of content. The PostgreSQL person of the week is Kuntal Ghosh. So if you're interested in learning more about Kuntal and his contributions to Postgres, definitely check out this blog post.

The last piece of content, we had another episode of The Rubber Duck Dev Show this past Wednesday. This episode was on "The Dark Side of Open Source". Now, this is mostly focused on some Ruby and JavaScript libraries, but given Postgres is an open-source project, you may find this particular content of interest. So this is a longer form, developer-based discussion. But if you're interested in that, we encourage you to check out this blog post.
