Filter Clause, Hypothetical Index, Non-Relational Data, Using TOAST | Scaling Postgres 251
In this episode of Scaling Postgres, we discuss how to use the filter clause, create hypothetical indexes, store non-relational data and manage TOAST.
- Using Postgres FILTER
- Will Postgres Use My Index? Hypothetical Indexing for Postgres
- Postgres is Relational Plus
- PGSQL Phriday #005: Relational and Non-relational Data
- Storing blobs in Postgres for our saved plans feature
- Relational and Non-relational Data: PGSQL Phriday #005
- Unlocking the Secrets of TOAST: How To Optimize Large Column Storage in PostgreSQL for Top Performance and Scalability
- PostgreSQL supply chain
- PostgreSQL Constraints: Learning PostgreSQL with Grant
- Intro to Postgres Custom Data Types
- How to Solve Advent of Code 2022 Using Postgres - Day 15
- Tracing Locks in Postgres using eBPF programs
- An Argument for Logical Failover Slots
- PostgreSQL Database Security: External Server-Based Authentication
- PostgreSQL 15 and beyond
- Transaction ID wraparound: a walk on the wild side
- Stateful Postgres Storage Using Kubernetes
- Null and Empty string in Oracle vs PostgreSQL vs SQL Server
- When Proj Grid-Shifts Disappear
- Default configuration
- Luca Ferrari
- Rubber Duck Dev Show Episode 72 | Game Development in DragonRuby With Amir Rajan
All right. I hope you, your friends, family, and coworkers continue to do well. There is a firehose of content this week. It seems like over the past several months, there's been a lot more content released about Postgres, which is great, but it makes these shows a little bit harder to do. Our first piece of content is "Using Postgres FILTER". This is from crunchydata.com. And they're talking about the FILTER clause, which is basically something you can use instead of a CASE statement to determine what data appears in a particular column. In this post, they want to split out funds into collected and uncollected based on status. Now, the first way they did it is by using a CTE. It's interesting, I see a lot of programmers go straight for a CTE, probably because they're used to composing their own functions.
And it just makes intuitive sense to construct a query in this manner, but look at how long this CTE is. Getting a similar result using a CASE statement is just this here, which to me is much easier to understand. And if you actually use FILTER to do it, I think it makes it even more readable. So to me, this argues for getting to know the different syntax in your database, and FILTER is a great thing to know how to use, as well as CASE statements, because they can typically get the data you want much more efficiently in terms of the amount of code used compared to something like building out a CTE. But definitely check out this blog post if you want to learn more about that.
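To make the comparison concrete, here's a minimal sketch of the two approaches, using a hypothetical payments table with status and amount columns (names assumed for illustration; the post uses its own schema):

```sql
-- Splitting sums by status with CASE inside the aggregate:
SELECT
  sum(CASE WHEN status = 'collected' THEN amount ELSE 0 END) AS collected,
  sum(CASE WHEN status <> 'collected' THEN amount ELSE 0 END) AS uncollected
FROM payments;

-- The same result with FILTER, which reads more directly:
SELECT
  sum(amount) FILTER (WHERE status = 'collected')  AS collected,
  sum(amount) FILTER (WHERE status <> 'collected') AS uncollected
FROM payments;
```

Both return one row with two columns; FILTER simply moves the condition out of the aggregate's expression and into a clause of its own.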
The next piece of content- "Will Postgres Use My Index? Hypothetical Indexing for Postgres". This is from crunchydata.com. And they're talking about a Postgres extension called HypoPG that enables you to place a hypothetical index. It doesn't really create an index, it just makes Postgres believe one is there, so that when you run an explain plan, it can give you the plan based upon that index being present. It currently works for B-tree, BRIN, and hash indexes, but not GIN or GiST indexes. So they have an example here. They ran an explain plan and it is doing a sequential scan on the table because there are no indexes. So they create the HypoPG extension and then they create a hypothetical index using this syntax.
And now when they do the explain plan, you can see that it is choosing a bitmap index scan on the hypothetical index that was created. So that tells you Postgres will use this index if you create it. This can save a lot of time if you have billions of rows in a table: being able to add a hypothetical index and ask 'Hey, will this really help the plan, or will it really use this index?' is one way to help you figure that out. And he says you can even check the size of it as well by using pg_size_pretty and hypopg_relation_size, so you can get a sense of how large that index will be. So definitely something to check out if you're interested.
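As a rough sketch of that workflow (the orders table and customer_id column here are hypothetical, not taken from the post):

```sql
CREATE EXTENSION IF NOT EXISTS hypopg;

-- Create a hypothetical index; nothing is written to disk.
SELECT * FROM hypopg_create_index('CREATE INDEX ON orders (customer_id)');

-- A plain EXPLAIN (without ANALYZE) will consider the hypothetical index
-- when it plans the query.
EXPLAIN SELECT * FROM orders WHERE customer_id = 42;

-- Estimate how large the index would be if actually created.
SELECT pg_size_pretty(hypopg_relation_size(indexrelid))
FROM hypopg_list_indexes();
```

Note that EXPLAIN ANALYZE actually executes the query, so hypothetical indexes only apply to plain EXPLAIN.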
The next series of blog posts are all related to PGSQL Phriday, talking about relational and non-relational data. The first post is "POSTGRES IS RELATIONAL PLUS" from rustprooflabs.com. And a lot of the non-relational data he's talking about is his PostGIS data, so a lot of spatial data. Now, while there are data types designed to store spatial data, it seems like a lot is being handled in his use case with JSONB, so a lot of non-relational data is being stored there. And in terms of things that may not work well, once you get up to really, really large sizes and try to do joins, there's a big risk of running out of disk space when you're working with so much of this huge non-relational data.
Now, in terms of the biggest challenges of dealing with this data, it's basically: what is its structure? So he's saying documentation is super important. When you have non-relational data, it's by definition unstructured to a certain extent, or has a flexible structure; there is usually no schema being enforced upon it. So having some sort of documentation that explains how it's structured and laid out is super important. Check out this blog post if you want to learn more about his insights.
The next post is "PGSQL Phriday #005: Relational and Non-relational Data" from hdombrevskya.wordpress.com. And he talks about the decision of whether to store non-relational data in the database or not. Because if you're talking about a lot of data, you may just want to store it in a file system and store a pointer in the database or maybe you want to store it in some sort of object storage like S3. You may only want to store this data in the database if you want to do, say, text searching on it using Postgres full-text search. But if you don't need to do any searching for the data, maybe keeping it offline is the better path to go. But you can check out his blog post to learn more about that.
The next post is "STORING BLOBS IN POSTGRES FOR OUR SAVED PLANS FEATURE". This is from pgmustard.com. And from their perspective, they were storing some of their non-relational data in a bucket in Google Cloud and just storing the references in the database. But with a new feature where they're adding the ability to save plans, they are actually storing it as a blob in the database. They have some reasons they chose to go this route, and he gives another insight into this decision process. You can check out his blog post if you want to learn more.
The next piece of content related to this is from andreas.scherbaum.la. And he talks about his use of JSONB. So check out his post if you're interested as well.
The next piece of content related to that is about TOAST, and it is "Unlocking the Secrets of TOAST: How To Optimize Large Column Storage in PostgreSQL for Top Performance and Scalability". This is from percona.com. And if you're going to be storing data in a non-relational manner, typically that data is going to be rather large and will likely be stored using TOAST. TOAST stands for The Oversized-Attribute Storage Technique. Because PostgreSQL uses fixed page sizes, any data that would exceed the page size typically gets TOASTed, meaning it's stored using this technique. When you start storing large data that's going to be TOASTed, you need to make sure you're using the right strategy and handling space management effectively. TOAST offers four different strategies, depending on whether it tries to store the data in line or out of line, compressed or uncompressed. Those permutations are reflected in the different strategies here- PLAIN, EXTENDED, EXTERNAL, and MAIN. And for a given data type, if it's TOASTable, which is typically things like text columns or byte arrays, you can change the strategy for how it's stored depending upon your use case. This blog post goes into making those decisions as well as some things to watch out for: if you don't VACUUM on a regular basis it could impact your database, too much data being stored there can diminish query performance, and of course it increases the amount of space being used. But if you're wanting to learn more, you can check out this blog post.
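As a small sketch of how you'd inspect and change a column's TOAST strategy (the table and column names here are hypothetical):

```sql
-- See current strategies: attstorage is 'p' (PLAIN), 'e' (EXTERNAL),
-- 'm' (MAIN), or 'x' (EXTENDED, the default for TOASTable types).
SELECT attname, attstorage
FROM pg_attribute
WHERE attrelid = 'documents'::regclass AND attnum > 0;

-- Store the column out of line without compression, which can make
-- sense for data that is already compressed (e.g. images):
ALTER TABLE documents ALTER COLUMN body SET STORAGE EXTERNAL;
```

Changing the strategy only affects newly stored values; existing rows keep their current representation until they are rewritten.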
Next piece of content- "PostgreSQL supply chain". This is from peter.eisentraut.org. And he's referencing this image that I've seen elsewhere on the internet that depicts modern digital infrastructure. It looks like a set of irregular-sized Jenga blocks, and there's one little block holding up the whole structure that is, quote, "A project some random person in Nebraska has been thanklessly maintaining since 2003". So he's basically talking about this perspective with regard to PostgreSQL, and how maybe PostgreSQL is one of these bigger boxes somewhere in the middle here.
But keep in mind that PostgreSQL has its own dependencies. He talks about things like Perl, Python, ICU, LLVM, systemd, and even things like OpenSSL. Then there are lz4 and zstd for compression, there's zlib, and a number of others too. He actually highlights a few that could probably use a little bit more love in terms of maintenance, and how we really should think about these dependencies as part of the overall PostgreSQL ecosystem and about how we can shore up support for these different libraries. Definitely an interesting perspective that I encourage you to check out.
Next piece of content- "PostgreSQL Constraints: Learning PostgreSQL with Grant". This is from red-gate.com. This is just a very long comprehensive post with code examples describing all the different ways that you can set constraints in the database. He's talking about not-null constraints, unique constraints, primary keys, foreign keys, check constraints, and exclusion constraints. So if you want to shore up your knowledge of any of these areas, definitely check out this blog post.
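As a quick taste of what that post covers, here's a hedged sketch combining several of those constraint types on made-up tables (the exclusion constraint example assumes the btree_gist extension):

```sql
-- Requires: CREATE EXTENSION btree_gist; (for int equality in GiST)
CREATE TABLE rooms (
  room_id  int  GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
  name     text NOT NULL UNIQUE,              -- not-null + unique
  capacity int  CHECK (capacity > 0)          -- check constraint
);

CREATE TABLE bookings (
  booking_id int GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
  room_id    int NOT NULL REFERENCES rooms,   -- foreign key
  during     tstzrange NOT NULL,
  -- exclusion constraint: no two bookings for the same room may overlap
  EXCLUDE USING gist (room_id WITH =, during WITH &&)
);
```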
Next piece of content- "Intro to Postgres Custom Data Types". This is from crunchydata.com. And they're talking about domains, which are custom data types you create that have constraints already applied to them. So they have an example here where you create a date_of_birth data type, which is considered a domain, and say that the value must be greater than the year 1930. Or you could create a valid_email data type with a constraint that ensures it's a valid email. Then, to use the domain, you just use it as a data type, the same way you'd typically use a timestamp or a text field.
They also talk about CREATE TYPE, which can be used for composite types. So in this case they're showing a type called physical_package, and they're giving the three different dimensions as numerics. So it's a composite of multiple data types. And typically when you're using it, you may need to cast it to a particular value, as they say here. CREATE TYPE is also used for ENUMs, and they give some examples here. However, they do mention that a lot of people prefer check constraints over ENUMs. Finally, they discuss range types. So just a quick little primer on the different custom data types that are possible in Postgres.
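Here's a brief sketch of the domain and composite type ideas described above (the email regex and table names are illustrative, not taken verbatim from the post):

```sql
-- Domains: data types with constraints baked in.
CREATE DOMAIN date_of_birth AS date
  CHECK (VALUE > DATE '1930-01-01');

CREATE DOMAIN valid_email AS text
  CHECK (VALUE ~* '^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$');

-- Domains are used exactly like built-in types:
CREATE TABLE people (
  name  text,
  dob   date_of_birth,
  email valid_email
);

-- A composite type holding three dimensions:
CREATE TYPE physical_package AS
  (length numeric, width numeric, height numeric);

-- Accessing a composite field needs parentheses around the column:
-- SELECT (dimensions).length FROM parcels;
```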
Next piece of content- "How to Solve Advent of Code 2022 Using Postgres- Day 15". This is from crunchydata.com. Definitely check out this blog post if you want to learn a Postgres solution to this problem.
Next piece of content- "Tracing Locks in Postgres using eBPF programs". This is from pganalyze.com and it's part of Lukas's "Five minutes of Postgres". Here, he's talking about a couple of different tools, pg_lock_tracer and pg_lw_lock_tracer, and from his description, they help you get more insight into what's going on with, say, pg_locks. But as opposed to pg_locks giving you one point in time, these tools actually aggregate statistics so you can trace what's going on with particular statements or queries. So if you want to learn more about that, definitely check out his piece of content.
Next piece of content- "An Argument for Logical Failover Slots". This is from percona.com. They're talking about the scenario where you have a primary doing asynchronous streaming replication to a replica, which is very typical, but it also has logical replication going to a third, logical replica. The issue comes if you need to failover from the primary to the replica: the logical replication slot that's on the primary doesn't failover. The replica knows nothing about it. So if you do the transition, that replication slot does not exist and all your logical replication fails.
So basically you need to go through the process of making sure the logical replica is not talking to the primary at all and that the primary is indeed down. You then need to create the slot on the promoted replica and have the logical replica start consuming from it. They talk a little bit about that process here, but because this is a problem, particularly with synchronization, they're saying we really need to get this process in place. And they do mention, and we've covered this in previous episodes of Scaling Postgres, that Patroni has a built-in way to do replication slot transfers to a promoted system. But it would be great to have a native Postgres solution to this as well. So check out this blog post if you want to learn more.
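As a rough illustration of those manual recovery steps (the slot, subscription, and host names are made up, and the real procedure involves more safeguards around not losing or duplicating changes):

```sql
-- On the newly promoted replica: recreate the logical slot that
-- existed on the old primary.
SELECT pg_create_logical_replication_slot('my_slot', 'pgoutput');

-- On the logical replica (subscriber): repoint the subscription at
-- the promoted server and resume.
ALTER SUBSCRIPTION my_sub CONNECTION 'host=new-primary dbname=app';
ALTER SUBSCRIPTION my_sub ENABLE;
```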
Next piece of content- "PostgreSQL Database Security: External Server-Based Authentication". This is from percona.com. And they're talking about all the different ways that you can authenticate with Postgres. One set of ways is internal authentication, where the database itself identifies the person logging in. You can use trust authentication, reject authentication, md5, SCRAM, or certs. You can also do OS authentication, which relies upon things like PAM, peer, or ident. But there's also external authentication, something outside of the system, which could be GSSAPI, SSPI, LDAP, or RADIUS. And that's what this blog post covers. He goes over all the different external authentication methods that you can set up with Postgres and discusses each briefly. So check out this blog post if you want to learn more about that.
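For flavor, here's a hypothetical pg_hba.conf fragment mixing internal, OS-based, and external methods (the hosts, databases, and LDAP settings are invented for illustration):

```
# TYPE  DATABASE  USER      ADDRESS       METHOD
local   all       postgres                peer
host    app       app_user  10.0.0.0/24   scram-sha-256
host    all       all       10.0.1.0/24   ldap ldapserver=ldap.example.com ldapbasedn="dc=example,dc=com"
```

Rules are matched top to bottom, so more specific entries should come before broader ones.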
Next piece of content- "PostgreSQL 15 and beyond". This is from postgresql.fastware.com. And this is a review of all the features that were added to Postgres 15, which was released last fall. So a lot of this may be a review for you, but what I found interesting was where he talks about Postgres 16 and beyond. Because it's been a number of months since 15's release, work on 16 has already started, and he mentions the different features currently being worked on. In addition, there's this cool graphic that shows major features released in each version, along with what he projects will get into 16 by the time it's released in the fall of 2023. So check out this blog post if you want to learn more about that.
Next piece of content- "TRANSACTION ID WRAPAROUND: A WALK ON THE WILD SIDE". This is from cybertec-postgresql.com. And this really is the wild side: he's basically trying to find a way to force his database to run out of transaction IDs and see if he can get data corruption. The short story is he couldn't, but it's definitely an insightful look at all the different hurdles he had to clear to try to hit transaction ID wraparound. So if you're interested in that, you can check out this blog post.
Next piece of content- "Stateful Postgres Storage Using Kubernetes". This is from crunchydata.com. And they go through the process of how you would take the typically stateless Kubernetes and make it so that you can run Postgres in a stateful fashion on it. He talks about using persistent volumes and persistent volume claims with storage classes. So basically, he describes how you would typically set this up. But then at the end, he says you can sidestep creating all of this if you just want to use their Crunchy Data Postgres Operator to do it for you. So I did find this interesting as an educational post about ways you can run Postgres on Kubernetes, but check it out if you want to learn more.
Next piece of content- "Null and Empty string in Oracle vs PostgreSQL vs SQL Server". This is from migops.com. And they're talking about, when you're looking to migrate between these different databases, how they handle empty strings and nulls differently. So you can check out this blog post if you want to learn more about that.
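A quick way to see the Postgres behavior for yourself (this is standard Postgres semantics, not code taken from the post; Oracle, by contrast, treats the empty string as NULL):

```sql
SELECT '' IS NULL                 AS empty_is_null,       -- false
       '' = ''                    AS empty_equals_empty,  -- true
       NULL IS DISTINCT FROM ''   AS null_differs;        -- true
```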
Next piece of content- "When Proj Grid-Shifts Disappear". This is from cleverelephant.ca. He's discussing a PostGIS issue where upgrading from one version to another resulted in a vertical shift between the different coordinate systems they were using. And this blog post goes ahead and explores that.
For the next piece of content, there was another episode of Postgres FM this week. This one they talked about the "Default configuration" that Postgres comes with. So if you want to listen to their show or watch it on YouTube, you can definitely get the link here to do that.
Next piece of content, the PostgreSQL person of the week is Luca Ferrari. If you're interested in learning more about Luca and his contributions to Postgres, definitely check out this blog post.
And the last piece of content, we did have another episode of The Rubber Duck Dev Show this Thursday afternoon. This one was "Game Development in DragonRuby With Amir Rajan". So whereas one would never think of using something like Ruby typically for game development due to its speed, they've done some interesting things with DragonRuby to make it efficient for doing 2D games. So if you're interested in that, we definitely welcome you to check out our show.