
Int to BigInt, Resistance is Futile, Optimizer Statistics, Advanced Query Optimization | Scaling Postgres 153


In this episode of Scaling Postgres, we discuss how to convert an integer primary key to a bigint, how container resistance is futile, how the optimizer uses statistics and how to set manual statistics.

Content Discussed

YouTube Video

Podcast Audio

Transcript



[00:00:00] In this episode of Scaling Postgres, we talk about int to bigint primary keys, Resistance is Futile, optimizer statistics, and advanced query optimization. I'm Creston Jameson, and this is Scaling Postgres episode 153. Alright, I hope you, your friends, family and coworkers continue to do well. Our first piece of content is actually a YouTube video, and it's called "Changing your huge tables' data types in production". This is from FOSDEM 2021 in the PostgreSQL dev room, on the channel of Jimmy Angelakos. He's talking about changing data types on huge tables in a production database, so doing it without impacting production. The example he uses for a lot of the scripts shown in this presentation is where you have an integer primary key and you need to make it a bigint because you're about to run out of integers. So he goes through the whole process and shows you these scripts. Basically, you first add a new bigint column and then use triggers to start copying the data from the integer column to the bigint column. Then you set up a job or process to backfill the data in that bigint column to make sure it matches what's in the int column. Once everything is in sync, then as part of one transaction you do a series of DDL statements to rename the column, drop the old column, set up the new primary key, et cetera. This technique can, of course, be used for other data types; he just uses the example of an integer primary key that's about to run out. So if you want to learn more about the technique he's describing, definitely check out this YouTube video.
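For context, here is a minimal sketch of that trigger-and-backfill approach, assuming a hypothetical table named orders with an integer id column. It is not the presenter's exact script: it omits details the talk covers, such as building the new unique index concurrently before the swap and re-pointing the owned sequence.

```sql
-- 1. Add the new bigint column and keep it in sync with a trigger.
ALTER TABLE orders ADD COLUMN id_new bigint;

CREATE OR REPLACE FUNCTION orders_id_sync() RETURNS trigger AS $$
BEGIN
  NEW.id_new := NEW.id;   -- copy the int value into the bigint column on every write
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER orders_id_sync_trg
  BEFORE INSERT OR UPDATE ON orders
  FOR EACH ROW EXECUTE FUNCTION orders_id_sync();

-- 2. Backfill existing rows in batches (repeat over successive id ranges).
UPDATE orders SET id_new = id
WHERE id_new IS NULL AND id BETWEEN 1 AND 100000;

-- 3. Once everything is in sync, swap the columns in one transaction.
BEGIN;
ALTER TABLE orders DROP CONSTRAINT orders_pkey;
ALTER TABLE orders RENAME COLUMN id TO id_old;
ALTER TABLE orders RENAME COLUMN id_new TO id;
ALTER TABLE orders ADD PRIMARY KEY (id);
ALTER TABLE orders DROP COLUMN id_old;
COMMIT;
```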

[00:01:48] The next piece of content is "Deep PostgreSQL Thoughts: Resistance to Containers is Futile". This is from Crunchydata.com. In the first part of the post he talks about how some people think containers basically aren't ready for primetime, particularly for deploying your databases, and this post argues that trying to resist this trend is probably futile and that we're heading down that path anyway. He talks about what a container is, and notes that on some operating systems a running PostgreSQL is already placed in cgroups, which is essentially what containers use to do some of their containerization. So you're kind of already doing that; using a container just makes it more formal. Now, he does list some considerations you need to address when you're dealing with containers and running Postgres. Number one is the out-of-memory killer; we covered the post he presented on that at Crunchydata.com last week. The next is storage, because a lot of containers are designed to be ephemeral, so how do you handle persistent storage? How do you handle restarts and promotions? This is basically about the orchestration layers and the work being done in that area, including the Crunchy Data operator they have developed. Then there are custom deployments, where someone has set up an automated container orchestration but then wants to make tweaks to it, and sometimes those tweaks cause issues. Now, personally, I haven't deployed PostgreSQL in a container. For my use cases, or for a lot of the customers I work with, I don't see a reason to use containers. It seems like the push towards containers comes from organizations that are already using them, say for their application deployment, so they just want to normalize everything. But personally, I don't see a great pressing need for using containers with Postgres. Like anything, though, it depends on your use case. So if you're considering using containers with PostgreSQL, you may want to check out this post from Crunchydata.com.

[00:03:45] The next piece of content is "PostgreSQL: ANALYZE and optimizer statistics". This is from Cybertec-postgresql.com, and he's talking about statistics and how they are used when completing queries. He goes through the general process of processing SQL: it goes through a parser, then to a traffic cop that sends utility commands one way but passes queries down to the rewrite system, which then prepares the query for the optimizer. The optimizer consults statistics in order to determine the best path for the executor. He has an example where he creates a table using generate_series and uses an explain plan to see what the optimizer is planning to do. Then he looks at the pg_stats view, which is where these statistics are exposed. He shows you all the different columns, then shows an example of what some data looks like, and these statistics are kept per column. So looking at the id column here, it gives a histogram of bounds as well as other information the optimizer uses to determine the best way to query this column. Then there's a name column, and it stores something a little bit different: it doesn't have histogram bounds, because pretty much every value is 'hans' in this column, so instead it presents the most common values and their frequencies from the statistics that were analyzed. So if you want to learn more about statistics and how PostgreSQL uses them, you can definitely check out this post.
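As a rough illustration of that workflow, here is a minimal sketch (a hypothetical table, not the exact example from the post) of generating data, analyzing it, and inspecting the per-column statistics the planner consults.

```sql
-- Build a test table: id is unique, name is almost always the same value.
CREATE TABLE t_test (id int, name text);

INSERT INTO t_test
SELECT x, CASE WHEN x % 100 = 0 THEN 'paul' ELSE 'hans' END
FROM generate_series(1, 1000000) AS x;

-- Collect optimizer statistics for the table.
ANALYZE t_test;

-- See what row count the planner expects for a range query on id.
EXPLAIN SELECT * FROM t_test WHERE id < 1000;

-- Inspect the per-column statistics: histogram bounds for id,
-- most common values and frequencies for name.
SELECT attname, n_distinct, most_common_vals, most_common_freqs, histogram_bounds
FROM pg_stats
WHERE tablename = 't_test';
```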

[00:05:15] The next piece of content is "PostgreSQL: CREATE STATISTICS, advanced query optimization". This is also from Cybertec-postgresql.com and continues the discussion of statistics. This is where you can use the CREATE STATISTICS command to capture dependencies between columns. Normally statistics are kept per column, but CREATE STATISTICS lets you define relationships across columns. Some of those relationships could be between a country and the language spoken there, which are highly correlated, or between zip codes and a state, or states or provinces within a country; those are correlated. In his example, he used generate_series and defined a correlation between the x and y columns, where the y value is always just 50,000 greater than the x value. Then he ran a query, and when doing an explain it expected to get a million rows, but when he actually ran explain analyze it only pulled out 10,000 rows. So there is a correlation between x and y, and when statistics are assessed per column and the expected row counts are calculated from them, you're going to get differences. But if you then use CREATE STATISTICS and give it a name (there are a number of different statistics kinds you can create; he's using ndistinct between these two columns), then when you do an explain you get the properly estimated 10,000 rows, because it did identify the correlation between the columns. So if you want to learn more about how to do that, you can check out this post.
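Here is a minimal sketch of that idea, with a hypothetical table rather than the post's exact example. The ndistinct statistics kind improves estimates for grouping on correlated columns; other kinds (dependencies, mcv) target different clause types.

```sql
-- Two perfectly correlated columns: y is always x + 50000, and there are
-- only 10,000 distinct (x, y) pairs across a million rows.
CREATE TABLE t_corr AS
SELECT n % 10000 AS x, n % 10000 + 50000 AS y
FROM generate_series(1, 1000000) AS n;

ANALYZE t_corr;

-- With only per-column statistics, the planner treats x and y as independent
-- and wildly overestimates the number of groups (roughly the full row count).
EXPLAIN ANALYZE SELECT x, y, count(*) FROM t_corr GROUP BY x, y;

-- Declare the relationship between the columns.
CREATE STATISTICS t_corr_stats (ndistinct) ON x, y FROM t_corr;
ANALYZE t_corr;

-- The group estimate should now be close to the actual 10,000.
EXPLAIN ANALYZE SELECT x, y, count(*) FROM t_corr GROUP BY x, y;
```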

[00:06:57] The next piece of content is "Querying JSON Data in PostgreSQL". This is from Aaronbos.dev, and he's talking about using JSON data in Postgres. Now, this is a relatively simple post. He has a basic set of JSON he's working with, and he goes through how you can pull values out as part of a select using the arrow (->) operator, which returns JSON, and then the double arrow (->>) operator, which returns text instead of JSON. With that you can write where clauses that check for particular values in the JSON, and he shows how you can chain the operators to navigate through nested JSON and return exactly the data you want in a select clause. He then covers the containment operator, which you can use in a where clause to ask whether a record contains a particular piece of JSON, as well as the question mark operator, which tells you whether a given key exists within a set of JSON. So it's a pretty basic post, but if you're wanting to get started using JSON with PostgreSQL, definitely check it out.

The next piece of content is "Query optimization in Postgres with pg_stat_statements". This is from Crunchydata.com, and this is another fairly basic post, this time about pg_stat_statements. It talks about how to get it set up: it's part of the contrib modules, so as long as you have those installed, you need to add pg_stat_statements to your shared_preload_libraries and then create the extension in each database you want to use it in. He shows what the view looks like, and once you get it up and running you can start tracking the statements being run against Postgres. He gives an example of a report so you can look at the queries taking the most time to run or being run very frequently. He also covers how you can reset the statistics at a particular point if you want fresh data after you've created an index or made some schema changes, and how you can configure it by setting how many statements will be recorded in pg_stat_statements as well as the types of statements you want to track. Of course, once you've identified a statement you want to improve, you would then use explain, explain analyze, et cetera to figure out how to optimize that query. He also mentions you can enable I/O statistics so you get values for the block read and block write times. This is not enabled by default, so you need to turn on the track_io_timing parameter in the postgresql.conf file, but be aware that this could be a performance burden on your system, so you want to make sure it's not going to negatively impact you. And he talks about a utility I actually hadn't heard of before, pg_test_timing, which can look at your timers and help you determine whether you're likely to be impacted by enabling this parameter on your system. So if you want to learn more about pg_stat_statements and how it works to analyze your queries, definitely check out this post.
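To make the JSON operators concrete, here is a minimal sketch with a hypothetical table and document (not the post's example); the containment and key-existence operators shown here apply to jsonb columns.

```sql
-- Hypothetical table and document, just to illustrate the operators discussed.
CREATE TABLE people (id serial PRIMARY KEY, data jsonb);

INSERT INTO people (data) VALUES
  ('{"name": "Sam", "address": {"city": "Boston"}, "languages": ["sql", "go"]}');

-- -> returns JSON, ->> returns text; chain them to navigate nested documents.
SELECT data -> 'address'             AS address_json,
       data -> 'address' ->> 'city'  AS city_text
FROM people;

-- ->> in a WHERE clause compares against text values.
SELECT id FROM people WHERE data ->> 'name' = 'Sam';

-- @> checks whether the document contains the given JSON.
SELECT id FROM people WHERE data @> '{"address": {"city": "Boston"}}';

-- ? checks whether a top-level key exists.
SELECT id FROM people WHERE data ? 'languages';
```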
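And here is a minimal sketch of the pg_stat_statements setup and a typical "slowest queries" report. The configuration values are illustrative, and the timing column names shown are from PostgreSQL 13 and later (earlier releases use total_time and mean_time).

```sql
-- postgresql.conf (requires a server restart):
--   shared_preload_libraries = 'pg_stat_statements'
--   pg_stat_statements.max = 10000       -- how many statements to keep
--   pg_stat_statements.track = top       -- which statements to track
--   track_io_timing = on                 -- optional; check pg_test_timing first

-- In each database where you want to query the view:
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

-- Example report: queries consuming the most total execution time.
SELECT query,
       calls,
       round(total_exec_time::numeric, 2) AS total_ms,
       round(mean_exec_time::numeric, 2)  AS mean_ms
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;

-- Reset the statistics after a change, e.g. adding an index.
SELECT pg_stat_statements_reset();
```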

[00:09:52] The next piece of content is "Why partition OpenStreetMap data?". This is from Rustproof Labs, and he's talking about a decision he's working through on whether to partition the data he loads from OpenStreetMap into his Postgres database. He's using PostGIS to analyze it, and right now he's juggling different schemas as he loads data in to do analysis, so he's wondering whether converting this to partitions would make things easier. This first post covers how he works now: basically loading data and identifying it by a particular schema, then potentially having to change those schemas, and sometimes wanting to combine data, which requires unions and brings its own issues. So he wants to look at partitioning to see if it can improve his workflow. The second post is "Partition OpenStreetMap data in PostGIS", which goes into more detail about how he's planning to do it and what the implementation would look like. Now, he hasn't decided he's 100% going to do this; these posts walk through how he's considering doing it and whether it would make sense for him. A third post is coming that isn't ready yet; perhaps by next week's episode we'll have it and learn whether he's decided to go forward or not. But these two posts are a great review of what to weigh when you decide to do partitioning: it is a burden to set up and maintain, so you want to make sure the benefits outweigh the cost. So definitely check out these posts if you're interested.
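For a rough picture of the kind of change being weighed, here is a hypothetical sketch of declarative list partitioning replacing a schema-per-load layout; the table, column, and partition names are made up and the actual posts discuss the real design and trade-offs.

```sql
CREATE EXTENSION IF NOT EXISTS postgis;

-- One parent table partitioned by the source region, instead of a separate
-- schema per OpenStreetMap load.
CREATE TABLE osm_roads (
    region  text   NOT NULL,
    osm_id  bigint NOT NULL,
    name    text,
    geom    geometry(LineString, 3857)
) PARTITION BY LIST (region);

CREATE TABLE osm_roads_colorado PARTITION OF osm_roads FOR VALUES IN ('colorado');
CREATE TABLE osm_roads_wyoming  PARTITION OF osm_roads FOR VALUES IN ('wyoming');

-- Querying the parent replaces the UNIONs across schemas; the planner prunes
-- to the matching partition when the region is specified.
SELECT name FROM osm_roads WHERE region = 'colorado';
```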

[00:11:28] The next piece of content is "ArcGIS Feature Service to PostGIS the QGIS Way". This is a post about GIS and working with different services to get data integrated into PostGIS, so if you're interested in that, you can check out this post from Crunchydata.com. And the last piece of content: the PostgreSQL person of the week is Valeria Kaplan. So if you're interested in learning more about Valeria and her contributions to Postgres, definitely check out this blog post.

That does it for this episode of Scaling Postgres. You can get links to all the content mentioned in the show notes. Be sure to head over to Scalingpostgres.com, where you can sign up to receive weekly notifications of each episode, or you can subscribe via YouTube or iTunes. Thanks.
