
Hash Indexes, Int Float Numeric Types, Postgres 14 Features, Data Science | Scaling Postgres 148


In this episode of Scaling Postgres, we discuss hash indexes, the int, float and numeric data types, features coming in Postgres 14 and data science in Postgres.


Transcript



[00:00:00] In this episode of Scaling Postgres, we talk about hash indexes, the int, float and numeric data types, Postgres 14 and data science. I'm Creston Jamison and this is Scaling Postgres, episode 148.

[00:00:21] All right, I hope you, your friends, family and co-workers continue to do well. Our first piece of content is "Re-Introducing Hash Indexes in PostgreSQL". This is from hakibenita.com, and they're talking about hash indexes. Now, it says reintroduced, and they've had them for a while, but as of version 10 they actually became much more usable. So the first thing this post describes is what a hash index is. You basically take a hashing function and apply it to a value, and that hashing function places the value within a discrete number of buckets. They have an example where the hash function takes the modulo, or remainder, of a number. You put in the values one through ten, and whatever the remainder is when dividing by three, the value is placed into that bucket. So one goes into bucket one, but four also goes into bucket one, because four divided by three has a remainder of one. The divisor determines how many buckets you have. Now, the hash indexing function within Postgres looks like it maps values to a 32-bit integer, so about 4 billion possible hash codes. Essentially, those would be the buckets.

Now, because there are a fixed number of buckets, collisions do occur, meaning different values land in the same bucket; you can see that with three buckets you get a lot of collisions, and presumably that happens far less often with 32 bits of possibilities. The post discusses how they are handled. They have what they call hash index overflow pages that are searched to find the actual value. So it's not sufficient to just go to the bucket; you need to check what the exact value is. It also mentions splitting, where a bucket needs to be split into two buckets, so that's also a possibility.

Then they get into actually testing these out. They're doing a test using a URL shortener, so they have an integer ID, a key, which is essentially the short URL, and then the full URL. They're placing B-tree indexes on both the key and the URL, as well as hash indexes. The first thing they looked at was the index size at different points of inserting data, and the first thing you'll notice is that the hash index usually seems to be about half the size of a B-tree index. So it has space efficiency on its side. And because it stores a hashed value, if the values you're indexing are very large, it probably makes more sense to use a hash index versus a B-tree index. Another thing you can notice is that it doesn't grow in a linear fashion like the B-tree index, but in a stepped fashion, and that's the hash splitting that happens. They also mention the hash index size is not affected by the size of the indexed value. That's what I mentioned before: when you have a column with large or long pieces of information, a hash index can be more efficient. And lastly, it's not affected by the selectivity of the indexed value. Now, one thing they mention here: the difference between these B-tree indexes was actually due to the deduplication that version 13 introduced. If you look at version 12, you'll see an identical B-tree index size across the row values tested, and also that the hash index is still about half the size of the B-tree index.
Now, they did some reindexing to see if it would make a difference, and it does minimize the size difference between the two indexes, but it's still the case that the hash index is smaller. Then they talk a little bit about the hash index fill factor, which you can adjust, although they say it defaults to 75, and between the values of 50 and 100 there doesn't seem to be much variance, so I would probably just stick with the default. They looked at insert performance to determine how fast inserts can happen, and while they were both similar, the hash index had a little bit of an edge: their tests came in at about 9.7 versus 10.8, so maybe about a 10% difference. Then they looked at select performance, selecting out different rows, and here the difference seemed to be a little bit larger, although they did say it's a very slight difference. Looking at the timing, it looks to be about a 30-35% difference: the hash index performed these selects in 0.59, whereas it was 0.92 for the B-tree index. So the hash index looks a little bit smaller and a little bit more performant for select queries, but it does of course have limitations. You can't create a unique index as a hash index, you can't index across multiple columns the way you can with a B-tree index, and you can't create sorted indexes. So it really works best for looking up individual values. You can't use it for range queries, and you can't use it in ORDER BY clauses, again due to that sorting issue. So their conclusion is that the hash index is usually smaller than a corresponding B-tree index, its select and insert performance can be better than a B-tree index, it had many of its restrictions removed in PostgreSQL 10 and is now safe to use, but it still has many restrictions that limit its use to very specific use cases. So if you're interested in learning more about hash indexes, and potentially using them in your projects, check out this blog post.
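To make the comparison concrete, here is a minimal sketch in the spirit of the post's URL shortener test; the table and index names are illustrative, not taken from the article:

    -- Illustrative URL shortener table
    CREATE TABLE shortened_urls (
        id  serial PRIMARY KEY,
        key text NOT NULL,  -- the short token
        url text NOT NULL   -- the full destination URL
    );

    -- The same column indexed both ways
    CREATE INDEX shortened_urls_key_btree ON shortened_urls USING btree (key);
    CREATE INDEX shortened_urls_key_hash  ON shortened_urls USING hash (key);

    -- Compare the on-disk size of the two indexes
    SELECT pg_size_pretty(pg_relation_size('shortened_urls_key_btree')) AS btree_size,
           pg_size_pretty(pg_relation_size('shortened_urls_key_hash'))  AS hash_size;

    -- Hash indexes only help with equality lookups like this one
    SELECT url FROM shortened_urls WHERE key = 'abc123';

Note that the hash index can only serve the equality lookup at the end; a range predicate or an ORDER BY on key would have to fall back to the B-tree index or a sequential scan.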

[00:05:42] The next piece of content is "PostgreSQL: int4 vs. float4 vs. numeric".

[00:05:48] And this is from cybertec-postgresql.com, and it's looking at these different numeric data types. So they created a table with the different values in it and then did some comparisons. Basically, a lot of this post talks about working with money. Of course, you would never want to use a float, because it does its calculations using the floating point unit of your CPU, so it's not going to give you an exact value, but more of an approximation. You'll always want to use integers or numerics for any kind of money you're working with. Now, one consequence of using numerics as opposed to floats is a performance impact. Averaging the integer in their example took about 255 milliseconds. Averaging the float, which is presumably using the floating point unit of the CPU, took 267 milliseconds, whereas the numeric took about 368 milliseconds. So much more of a difference compared to the previous two. And then they give you a little chart showing the differences. So if you're calculating any type of currency, generally you would want to stick with an int, or use a numeric if you want to maintain decimals. So if you want to learn more about these data types and their performance, you can check out this post.
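As a quick illustration of the approximation issue (my own sketch, not from the post), you can see the difference directly in psql:

    -- Float arithmetic is approximate: this comparison returns false
    SELECT (0.1::float8 + 0.2::float8) = 0.3::float8 AS floats_equal;

    -- Numeric arithmetic is exact decimal: this comparison returns true
    SELECT (0.1::numeric + 0.2::numeric) = 0.3::numeric AS numerics_equal;

    -- A benchmark in the spirit of theirs: time these with \timing in psql
    CREATE TABLE t_numbers AS
        SELECT i::int4 AS i, i::float4 AS f, i::numeric AS n
        FROM generate_series(1, 1000000) AS i;

    SELECT avg(i) FROM t_numbers;  -- int4
    SELECT avg(f) FROM t_numbers;  -- float4
    SELECT avg(n) FROM t_numbers;  -- numeric, typically the slowest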

[00:07:08] The next piece of content is "Waiting for PostgreSQL 14 – Report progress of COPY commands". This is from depesz.com. They're talking about a new view for monitoring COPY progress called pg_stat_progress_copy. So if you start a COPY command, you can watch the copy progress, and it will show the number of bytes processed and the number of lines processed for the COPY operation. So this looks to be a great addition to Postgres 14. The next post, also from the same site, is "Waiting for PostgreSQL 14 – Add idle_session_timeout". So this is adding an idle session timeout. Now, there already exists idle_in_transaction_session_timeout: if you're idle inside a transaction, it will cancel that session for you once it hits the timeout. This new setting, by contrast, covers any connection. If something is holding a connection that isn't actively being used, you have an idle session, and you can use this timeout to kill those sessions. So this is a great way to release resources back to the database. It looks like this will be added in Postgres 14.
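A minimal sketch of both features, assuming a Postgres 14 or newer server; the table name and file path are hypothetical:

    -- Session 1: start a long-running load
    COPY big_table FROM '/tmp/big_table.csv' WITH (FORMAT csv);

    -- Session 2: watch its progress, including bytes and tuples processed
    SELECT * FROM pg_stat_progress_copy;

    -- Disconnect sessions idle outside a transaction for more than 10 minutes
    ALTER SYSTEM SET idle_session_timeout = '10min';
    SELECT pg_reload_conf();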

[00:08:16] The next piece of content is "R Predictive Analytics in Data Science Work using PostgreSQL". This is from crunchydata.com, and it looks like this is the final post in the data science series that's been done by Crunchy Data. This is the one where they use the model to make predictions. So they're talking about their models and using R to predict the probability of fire. So if you want to see data science analysis covering all of these items, all running within Postgres, definitely check out this post and the previous series on using data science with PostgreSQL.

[00:08:53] The next piece of content is "The mapping of OID and relfilenode in PG". This is from highgo.ca, and they're talking about how the OID of an object in the pg_class catalog maps to its relfilenode, which identifies the underlying data file, and also how those can change, along with the architecture behind them. So if you're interested in learning how Postgres works under the covers, as it were, you can check out this post.
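As a hedged illustration (my example, not from the article), you can inspect the mapping for any table; the table name here is hypothetical:

    -- OID and relfilenode often start out equal, but operations that rewrite
    -- the table (VACUUM FULL, TRUNCATE, CLUSTER) assign a new relfilenode
    SELECT oid, relfilenode FROM pg_class WHERE relname = 'my_table';

    -- The data file's path relative to the data directory
    SELECT pg_relation_filepath('my_table');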

[00:09:17] Another post, also from highgo.ca, is "How PostgreSQL Executes Sequential Scans with the Help of Table Access Method APIs". So in Postgres 12 they released the table access method API so that others can use it, and with it this post analyzes the functions used for doing sequential scans. So if you're interested in learning more about that, you can check out this post.
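Table access methods are visible at the SQL level too; a small sketch (my own, with an illustrative table name):

    -- List installed table access methods; 'heap' is the built-in default
    SELECT amname FROM pg_am WHERE amtype = 't';

    -- Since Postgres 12, a table can name its access method explicitly
    CREATE TABLE scan_demo (id int) USING heap;

    -- A sequential scan on this table goes through heap's scan callbacks
    EXPLAIN SELECT * FROM scan_demo;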

[00:09:42] The next piece of content is "Sysbench: in-memory Postgres, Postgres is boring". This is from smalldatum.blogspot.com, and here they're talking about an analysis of Postgres 11 through 13 that basically reports no regressions for the items tested, hence, I believe, why he calls it boring. But if you want some more detailed analysis of performance for some lower-level workloads, you can check out the results of his tests here.

[00:10:12] The next piece of content is "How to run some tasks without user intervention, at specific times?". This is from depesz.com, and essentially the first item he mentions is just using cron. Using cron, you can start a psql session and send any commands you want into it to, say, query the database, create a table, do a backup, or whatever. So essentially cron is the main one he goes toward, and he talks about how he uses it with the Postgres scripts he wants to run. Now, some people don't have access to the underlying operating system to use cron and are looking for a scheduling tool, so he also mentions the newer scheduler tool called pg_timetable. That's an option, and he discusses using it and some use cases for it. Another one that I'm also familiar with is pg_cron, which I believe is being added as a built-in extension on, say, Amazon RDS instances, so it would be readily accessible for you to use to schedule tasks. If you want to learn more about scheduling tasks in Postgres, definitely check out this blog post.
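If pg_cron is available, the whole scheduling workflow stays in SQL; a minimal sketch, assuming the extension is installed and preloaded, with an illustrative job:

    CREATE EXTENSION IF NOT EXISTS pg_cron;

    -- Run a nightly VACUUM at 03:00 using standard cron syntax
    SELECT cron.schedule('0 3 * * *', 'VACUUM');

    -- Inspect and remove scheduled jobs
    SELECT jobid, schedule, command FROM cron.job;
    SELECT cron.unschedule(jobid) FROM cron.job WHERE command = 'VACUUM';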

[00:11:19] And the last piece of content is: the PostgreSQL Person of the Week is Gunnar "Nick" Bluth. So if you want to learn more about Gunnar, or Nick, go ahead and check out this blog post. That does it for this episode of Scaling Postgres. You can get links to all the content mentioned in the show notes. Be sure to head over to scalingpostgres.com, where you can sign up to receive weekly notifications of each episode, or you can subscribe via YouTube or iTunes. Thanks.
