
Indexing Documents, GIN Indexes, Data Science, Generated Columns | Scaling Postgres 71


In this episode of Scaling Postgres, we discuss indexing documents, how GIN indexes work, using Postgres for Data Science and generated columns.


Transcript



[00:00:00] In this episode of Scaling Postgres we talk about indexing documents, GIN indexes, data science and generated columns. I'm Creston Jamison and this is Scaling Postgres episode 71.

[00:00:21] Alright, I hope you're having a great week. Our first piece of content is indexing documents for full text search. This is from Dave Page's Postgres blog at pgsnake.blogspot.com. So here he said his boss asked about the new full text search in MongoDB that apparently they've recently released, and he was saying maybe we should do something similar for Postgres. And he said, well, we can do that already. So he basically did, in relatively few lines of code. He says 36, but of course there's comments and some other things, but basically it shows you relatively easily how to index documents for full text search. The first thing he does is create a table, and this is specifically for indexing HTML documents. So he has a path, title, body, and a tsvector for doing the searches, and he creates a GIN index on the column he's going to be searching, the tsvector. He then creates a trigger that's going to generate the tsvector whenever data is changed. So upon an insert or update, it will update this column. And then he has some Python code here that basically goes through a list of files, extracts the data and inserts it into the table he created. And then lastly, he does a search using this query to look for the keyword trigger within the documents that he indexed. And he also ordered it by the rank and then limited it to 20 results. And you can see those here. So not a lot of code to create a full text search when you're indexing a number of documents.
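The setup he walks through can be sketched roughly like this; the table and column names here are illustrative, not necessarily the exact ones from his post:

```sql
-- Illustrative names; the original post's schema may differ.
CREATE TABLE documents (
    path  text NOT NULL,
    title text NOT NULL,
    body  text NOT NULL,
    tsv   tsvector
);

-- GIN index on the column being searched.
CREATE INDEX documents_tsv_idx ON documents USING gin (tsv);

-- Regenerate the tsvector whenever the row is inserted or updated.
CREATE TRIGGER documents_tsv_update
    BEFORE INSERT OR UPDATE ON documents
    FOR EACH ROW
    EXECUTE PROCEDURE tsvector_update_trigger(tsv, 'pg_catalog.english', title, body);

-- Search for the keyword "trigger", ordered by rank, limited to 20 results.
SELECT path, title, ts_rank(tsv, query) AS rank
FROM documents, to_tsquery('english', 'trigger') AS query
WHERE tsv @@ query
ORDER BY rank DESC
LIMIT 20;
```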

[00:02:03] The next post is PostgreSQL's indexes: GIN, and this is from louisemeda.com, and it's basically describing the GIN index, or the generalized inverted index. And of course the first thing she mentions here is that they're generally used to index arrays, JSONB fields and tsvector fields for the full text search, as we've just seen. So this explains how they work at a pretty detailed level. She uses crocodiles and teeth for her examples. But the key thing to note about the GIN structure is that it is basically a B-tree index for getting all the keys and values, but it doesn't store every instance of a value the way a B-tree does. What it's storing is essentially the key values, and it stores each of those only once. So one reference for every value, and then from there it creates a list, or even an additional B-tree index, to find where all those occurrences happen. So for example, she has a diagram here and there's a value of six. Well, that six only exists in one place; there's no other six. In a normal B-tree index, if you had ten different sixes, you would have, in the leaf pages, one index entry per row in the table, whereas here it just stores a single instance of it, plus an additional list or tree that identifies all those instances. So here's an example of it where you have a value that goes to the leaf in the posting tree here, or posting list. So this is what makes it very effective for identifying the occurrences of a word in a document or multiple documents, or finding the occurrence of a value within an array column. So if you want to learn more about how a GIN index is structured, definitely a blog post to check out. Now, related to that is another blog post at louisemeda.com called PostgreSQL's indexes: GIN algorithms. Now, this covers how the GIN index is used, so how it does searching, from the phases of scanning keys, scanning the pending list, scanning the main index, and then updating the bitmap.
And then she also goes into the process of inserting into the GIN index and what's required for doing that, as well as deleting from the GIN index. So again, if you want to get more in depth about how a GIN index works, definitely another blog post to check out.
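The array use case she describes looks roughly like this; the schema is a sketch in the spirit of her crocodile examples, not taken from her posts:

```sql
-- A GIN index stores each distinct element once, with a posting list
-- or posting tree pointing at every row containing that element.
CREATE TABLE crocodiles (
    id    serial PRIMARY KEY,
    teeth integer[]
);

CREATE INDEX crocodiles_teeth_idx ON crocodiles USING gin (teeth);

-- "Which crocodiles have a tooth value of 6?" is answered from the
-- single key entry for 6 and its posting list, not a full scan.
SELECT id FROM crocodiles WHERE teeth @> ARRAY[6];
```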

[00:04:36] Continuing our theme of indexes, the next piece of content is Deep dive into PostgreSQL indexes webinar Q&A. This is from the percona.com blog. So they did a presentation, or a webinar.

[00:04:51] This first post has some of the questions that occurred during the webinar, but they have a separate link here with the slides and the recording, and it actually has a PDF. And they cover things such as all the different index types, from B-tree, hash, BRIN, GIN, GiST, et cetera, as well as talking about partial indexes, expression indexes, as well as things like index only scans. So if you want a more general review of indexing, definitely a post to check out. And they have this slide here, slide 22 of the presentation, where they're talking about the different index types and use cases for each. Now, what I found interesting is that they don't list GIN as a full text search option, which is generally the one that I tend to use. I tend to use GIN as opposed to GiST. And even looking at the PostgreSQL documentation for version 11, where they're talking about the GIN and GiST types, they've actually started listing GIN indexes as the preferred text search index type; this wasn't always the case. But they do talk about there being some advantages and certain use cases for using GiST, so just something to keep in mind as you're choosing your index types.

The next post is PostgreSQL meets data science and AI. So this is an opinion piece, but I actually think it's pretty relevant. So in the first section here, they're talking about, for data science or AI purposes, are you going to use something like PostgreSQL versus CSV files versus commercial databases? And they've run into issues with certain clients where the data quality within a CSV file is really bad. For example, he quotes a case where "3%" was stored as a string. So with something like that it's quite hard to compare different data. And he also included this quote here: if you're doing artificial intelligence, 80% of your effort has to go into data preparation. Only 20% is the real model.
And he continues: in short, you cannot seriously expect a model to learn something about interest rates if the raw data does not even ensure that the interest rate is actually a number. So in general, if you're going to be cleaning up data, you might as well put it into a relational database system such as PostgreSQL, where you can ensure data integrity and data consistency. And he talks about what system is better for sorting data: all the experience that PostgreSQL has, versus writing your own Python scripts. What about filtering? Again, tailor made for relational databases, as opposed to trying to develop your own script to go through CSV files. So basically what he says makes a lot of sense, and he actually goes into classical SQL analytics versus machine learning. So many people are using the term machine learning where all they're wanting to do is just add up numbers or make a small prediction about what will happen, and basically you can look at historical data and extrapolate; you don't need to use some of these very complex machine learning tools. But he does go over some considerations if you are wanting to do data science, machine learning or AI, and some tools you can potentially use in conjunction with PostgreSQL. So, definitely an interesting blog post I encourage you to check out.
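The interest-rate point can be made concrete with a small sketch; the table and column names here are hypothetical, not from the post:

```sql
-- In a CSV, a value like '3%' can sit in a "numeric" column unnoticed.
-- A relational schema rejects it at load time instead of at model time.
CREATE TABLE loans (
    id            serial PRIMARY KEY,
    interest_rate numeric(6,4) NOT NULL CHECK (interest_rate >= 0)
);

INSERT INTO loans (interest_rate) VALUES (0.03);  -- accepted
INSERT INTO loans (interest_rate) VALUES ('3%');  -- rejected:
-- ERROR: invalid input syntax for type numeric: "3%"
```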

[00:08:16] The next post is How the CIS benchmark for PostgreSQL 11 works. This is from crunchydata.com. The CIS is the Center for Internet Security, and they have a benchmark for establishing that your PostgreSQL instance, when running on CentOS, is secured following a set of best practices. So they've updated the benchmark for eleven, and they talk about the various areas that it covers in terms of installation and patches, directory file permissions, logging, monitoring and auditing, user authentication, access controls and authorization, connection and replication, as well as PostgreSQL settings and special configuration considerations.

[00:08:58] Some of the updates they did for eleven are actually in reference to the ssl_passphrase_command, where you can define where to get the SSL passphrase for the private key when starting PostgreSQL on the server. So if you're interested in securing your PostgreSQL setup, definitely a blog post and a benchmark to check out.
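A minimal sketch of that setting in postgresql.conf; the script path here is hypothetical:

```sql
-- postgresql.conf fragment (PostgreSQL 11+).
-- Fetch the private-key passphrase from an external command
-- instead of prompting at server start; %p is replaced by a
-- prompt string describing the key being unlocked.
-- ssl = on
-- ssl_passphrase_command = '/usr/local/bin/get-ssl-passphrase %p'
-- ssl_passphrase_command_supports_reload = off
```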

[00:09:21] The next post is Generated columns in PostgreSQL 12, and this is from 2ndquadrant.com. And they're talking about the new feature called generated columns, where you can define a column that will be generated based upon existing data in the table. In the example here, they used a concat function to combine an address into a delivery address that you can just simply use, and it will output the address as follows here, basically broken out appropriately. Now, they actually had to write their own concat function due to immutability requirements, because some string functions are locale dependent. So they had to create their own function to do this type of concatenation. But he gives another use case for it: perhaps you want to generate, in a data warehouse example, separate columns for month, day, year, quarter, day of the week, things of that nature, all from a single date. And generated columns are a good use case for that. So whenever the raw data is updated, that generated column is regenerated, and he says you can't directly update the data, but of course you can index it. And then finally he has a very comprehensive set of functions for splitting out this type of data in a data warehouse use case. So if you're interested in using generated columns in this way, definitely a blog post to check out.
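The data warehouse idea can be sketched like this; table and column names are illustrative, not the ones from the post:

```sql
-- PostgreSQL 12+: columns derived from a single date, recomputed on
-- every insert or update. extract() on a plain date is immutable,
-- so it is allowed in a generation expression.
CREATE TABLE sales (
    id      serial PRIMARY KEY,
    sold_on date NOT NULL,
    year    int GENERATED ALWAYS AS ((extract(year    FROM sold_on))::int) STORED,
    month   int GENERATED ALWAYS AS ((extract(month   FROM sold_on))::int) STORED,
    quarter int GENERATED ALWAYS AS ((extract(quarter FROM sold_on))::int) STORED
);

INSERT INTO sales (sold_on) VALUES ('2019-06-24');
-- The generated columns cannot be written directly,
-- but they can be indexed:
CREATE INDEX sales_year_idx ON sales (year);
```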

[00:10:53] The next post is PostgreSQL interval, date, timestamp and time data types, and this is from 2ndquadrant.com, and basically they're covering all the different data types as they relate to dates and times. So they cover the basic data types, also talking about ranges, the different styles that you can use for defining your dates and times, and performing arithmetic on them. Basically a very comprehensive post showing different ways of using dates, times and intervals. So if you're interested in that, definitely a blog post to check out.
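A few standard examples of the kind of arithmetic the post covers:

```sql
-- date - date yields an integer number of days.
SELECT date '2019-06-24' - date '2019-01-01';                        -- 174

-- timestamp - timestamp yields an interval.
SELECT timestamp '2019-06-24 12:00' - timestamp '2019-06-24 09:30';  -- 02:30:00

-- Intervals can be added to timestamps and scaled.
SELECT now() + interval '2 weeks';
SELECT interval '1 day' * 3;                                         -- 3 days
```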

[00:11:30] The next piece of content is actually a PgBouncer update called Afraid of the World.

[00:11:39] The main feature added in PgBouncer 1.10 is support for enabling and disabling TLS 1.3, as well as a number of bug fixes. So if you use PgBouncer, definitely an upgrade to check out.

[00:11:55] The next piece of content is actually a Vimeo video called Database as API with PostgreSQL and Massive.js by Dian Fay. So this talks about using PostgreSQL in a Node environment. Now, many Node users use MongoDB, but if you are wanting to use PostgreSQL, Massive.js is a library you can use for interfacing with it. So if you're a Node user, this is a potential video to check out to see if you want to use PostgreSQL with it.

[00:12:29] The last piece of content: a ton of videos in the last week or so have been posted to the PGConf Russia YouTube channel. Now, a lot of these are in Russian, including the presentations, but they do have English dubbed versions for some of them, and I believe some of them are in English, like this presentation, for example. So if you're interested in additional video content, perhaps this is a channel you want to check out to see if there are any videos you'd like to review.

That does it for this episode of Scaling Postgres. You can get links to all the content mentioned in the show notes. Be sure to head over to scalingpostgres.com, where you can sign up to receive weekly notifications of each episode. Or you could subscribe via YouTube or iTunes. Thanks.
