
Index Include Clause, Exporting CSV, JSON, Zedstore | Scaling Postgres 62


In this episode of Scaling Postgres, we review articles covering the index include clause, exporting data to CSV, using JSON in Postgres and Zedstore.


Transcript



[00:00:00] In this episode of Scaling Postgres, we talk about the index INCLUDE clause, exporting CSV, JSON and Zedstore. I'm Creston Jamison, and this is Scaling Postgres episode 62.

Alright, I hope everyone's having a great week. Our first piece of content is "A Close Look at the Index Include Clause", and this is from the use-the-index-luke.com blog. Now, he covers more than just PostgreSQL, so he's also talking about Microsoft SQL Server, IBM Db2 and of course Oracle, but the INCLUDE clause is something that was added recently, in PostgreSQL 11. To start off, he goes back and gives a recap of B-tree indexes: basically, you have the binary tree that you walk down to find a particular key, and then there's also a doubly linked list. Now, this is usually what you consider part of the index, and then there's the table data, or the heap as it's sometimes called, which the index entries point to. So the operations are: you search through the B-tree, then the doubly linked list, and then you pull out the table data to present the rows to the client.

He also does a recap of an index-only scan. In this case, he looks at an index on the sales table covering a subsidiary ID and the euro value, used in a query that takes the sum of the euro value from the sales table by subsidiary ID. Because the euro value is part of the index, you can do an index-only scan for this calculation: you go through the B-tree and then through the doubly linked list, and since the values you need are in the index, you do not have to go to the table data at all. That is what is meant by an index-only scan.

What the INCLUDE clause does is let you key the index only on the key columns, so the B-tree itself just contains the keys, but in the leaf level, the doubly linked list, you can include additional values. So when you actually find the particular key or keys you're looking for, all the values are right there and you don't have to go to the table data. Yet the vast majority of the index does not need to store the extra value, because it's only included in the leaf nodes of the B-tree, as it were. Now, he says here, quote, "compared to the original index definition, the new definition with the include clause has some advantages": the tree might have fewer levels because there's less data to store, and the index should be slightly smaller. I would have expected that to be a bigger percentage, but he says it's around 3%, because the leaf node level is what takes up the most space anyway. The other benefit is that it documents the index's purpose: the included column is not being searched on, it's just carried along with the index.

Now, one consideration he mentions is that if you needed to add a use case where you also wanted to sort by the ts column, which he's referring to as a timestamp, you would have to think about your column order to make searching efficient. However, if you had this INCLUDE and added an additional key column, yes, it does make the index bigger, but a use case that needs to search on subsidiary ID as well as the timestamp can now be handled by modifying this one index, without having to go through the debate you'd have without INCLUDE about the best column order, or whether you need two of these types of indexes. One index with an included value can handle many more use cases.
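
To make this concrete, here is a minimal sketch of the two index definitions being compared, using the sales, subsidiary_id and eur_value names from the article (PostgreSQL 11 or later):

    -- Conventional two-column index: both columns are part of the B-tree keys.
    CREATE INDEX sales_sub_eur ON sales (subsidiary_id, eur_value);

    -- With INCLUDE, eur_value is stored only in the leaf nodes, not in the
    -- tree's keys, yet it still enables an index-only scan for this query:
    CREATE INDEX sales_sub_inc ON sales (subsidiary_id) INCLUDE (eur_value);

    SELECT subsidiary_id, SUM(eur_value)
      FROM sales
     GROUP BY subsidiary_id;
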
He also talks about filter expressions that you typically can't search with a B-tree index, such as LIKE patterns; you could potentially apply that kind of filter on INCLUDE columns. However, he does note this is not the case with PostgreSQL, so there is some limitation there, given its implementation of MVCC. He also considers cases where unique indexes can work with the INCLUDE clause. Now, the reason PostgreSQL can't filter on the INCLUDE columns this way is row visibility: the xmin and xmax values needed to determine whether a row is visible to a particular client live in the table data, not within the index. PostgreSQL does keep a visibility map for the table, but it's not very granular, so a lot of the time, checking the visibility map means you still need to go to the table data. Maybe this implementation can change with the new storage solutions being talked about for PostgreSQL, now that an API has been added to allow different table storage engines, but at this time the limitation still stands. So if you want to go into more depth about how the new INCLUDE clause can be used with your indexes, it's definitely a blog post to check out.
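
As a sketch of the unique index case he mentions: uniqueness is enforced only on the key columns, not on the included ones (hypothetical table and column names here):

    -- The constraint applies to id alone; email is merely carried in the
    -- leaf nodes so lookups by id can return it via an index-only scan.
    CREATE UNIQUE INDEX users_id_email ON users (id) INCLUDE (email);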

[00:05:25] The next post is "Exporting CSV from PostgreSQL", and this is from Philipe's blog. The first part of the post talks about how you export data from a psql session: you can use the COPY command, and he notes that you have to use absolute paths with it. If you need to do it from the client side, there's also the backslash copy command, and you can specify the format, like CSV, and add headers. So it goes over the basic COPY command and the features available for it. He also shows how you can use a psql command to run this COPY statement and output the values. He found himself doing this so frequently that he made a utility called psql2csv. It's a thin wrapper around psql, and it basically saves a bit of typing when running these queries. It also has the advantage of handling variables, so, for example, he can pass in variables to vary the output of a particular report. So if you have a need for something like this, definitely a blog post to check out.
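
Here is a quick sketch of the commands being described (the table, query and file paths are placeholders):

    -- Server-side COPY runs as the server process, so it needs an absolute path.
    COPY (SELECT * FROM orders) TO '/tmp/orders.csv' WITH (FORMAT csv, HEADER);

    -- Client-side \copy in psql writes the file where the client runs instead.
    \copy (SELECT * FROM orders) TO 'orders.csv' WITH (FORMAT csv, HEADER)

And from the shell, the same thing can be scripted with a one-off psql call:

    psql -d mydb -c "\copy (SELECT * FROM orders) TO 'orders.csv' CSV HEADER"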

[00:06:38] The next post is "Having lunch with PostgreSQL, MongoDB and JSON", and this is from the Ongres.com blog. He states the purpose of the post up front: this is basically Alvaro Hernandez's opinion on the topic of Postgres, JSON, developer productivity and MongoDB.

[00:07:00] So it's a counterpoint to a previous blog post that was done by Buzz Moshetti. Basically, he counters the different points that were made claiming MongoDB has the advantage, and his opinion is that Postgres is pretty good in these respects. Now, I don't want to get into the debate itself, but this blog post does go over some JSON use cases and some practices you may want to put in place. So it's a good overview of different ways you can use JSON and some suggestions on how to work with it and structure it within your application. Putting the debate aside, it's a good learning post for JSON in PostgreSQL. So if you're interested in doing that, definitely a blog post to check out.
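
For reference, a minimal sketch of the kind of JSON usage under discussion, with a hypothetical events table and a jsonb column:

    CREATE TABLE events (
        id      bigserial PRIMARY KEY,
        payload jsonb NOT NULL
    );

    -- A GIN index keeps containment queries over the documents fast.
    CREATE INDEX events_payload_idx ON events USING gin (payload);

    INSERT INTO events (payload)
    VALUES ('{"type": "signup", "user": {"id": 42, "plan": "pro"}}');

    -- @> tests containment; -> and ->> extract fields.
    SELECT payload->'user'->>'plan' AS plan
      FROM events
     WHERE payload @> '{"type": "signup"}';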

[00:07:48] The next post is "Waiting for PostgreSQL 12: Allow VACUUM to be run with index cleanup disabled", and this is from Depesz. It does exactly what it says: in PostgreSQL 12 you potentially won't have to run the index phases of a vacuum. Now, this could be a huge benefit for some databases I work with, because they have a lot of indexes on some large tables, and it's the index phase that takes so long to do. Particularly with the enhancement also coming in 12 that lets you reindex indexes concurrently, I could see certain use cases where you vacuum just the table, the heap, and leave the indexes alone, since maybe you just want to reindex those instead. That could have some advantages. So it's definitely an interesting feature, and this blog post goes through and looks at some of it. If you're interested in that, definitely a blog post to check out.
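
A minimal sketch of the two commands involved, with hypothetical table and index names (both the INDEX_CLEANUP option to VACUUM and REINDEX CONCURRENTLY arrive in PostgreSQL 12):

    -- Vacuum the heap but skip the index vacuuming phases.
    VACUUM (INDEX_CLEANUP FALSE) big_table;

    -- Rebuild an index without taking a long write-blocking lock.
    REINDEX INDEX CONCURRENTLY big_table_created_at_idx;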

[00:08:47] The next post is the second "Waiting for PostgreSQL 12" post: "Add SETTINGS option to EXPLAIN, to print modified settings". Basically, this post describes the feature and shows what it looks like. The advantage is that if you're sharing an explain plan, it can output the non-default settings, so whoever reads it knows, for example, that sequential scans were turned off for that particular explain plan. So I could see this having some use cases, and potentially this is coming with PostgreSQL 12.
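
As a sketch of what this looks like, assuming a session where a planner setting has been changed from its default:

    SET enable_seqscan = off;

    -- With the SETTINGS option, the plan output gains a line along the
    -- lines of: Settings: enable_seqscan = 'off'
    EXPLAIN (SETTINGS) SELECT * FROM big_table WHERE id = 42;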

[00:09:18] The next post is "Zedstore: compressed in-core columnar storage", and this is from the pgsql-hackers mailing list on Postgresql.org. Basically, with the advent of the new table access method API, which makes it possible to replace the storage engine of PostgreSQL, these individuals have started an implementation of something they call Zedstore, a compressed, in-core (meaning part of core PostgreSQL, I believe) columnar storage. So it stores data by columns as opposed to by rows. And they show some performance numbers here: for example, taking the average of a single column, because the data is stored by columns, can go from around 4,600 milliseconds to 380 milliseconds. That looks like over a tenfold improvement, which is the kind of thing columnar storage can do. So this is definitely very interesting, alongside the new table access method API that was recently committed to PostgreSQL, and potentially zheap coming down the line as well. If you're interested in this, definitely a post to check out.
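
The patch in the thread plugs into the new table access method syntax; as a sketch, assuming a build with the zedstore patch applied:

    -- A table backed by the columnar access method instead of the heap.
    CREATE TABLE measurements (
        station_id integer,
        reading    numeric,
        taken_at   timestamptz
    ) USING zedstore;

    -- An aggregate over a single column only has to read that column's data,
    -- which is where the tenfold speedup in the thread comes from.
    SELECT avg(reading) FROM measurements;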

[00:10:45] The next post is "One-to-one relationship in PostgreSQL for real". Basically, they're talking about defining a one-to-one relationship where you can't enter a record in one table unless the matching record already exists in the other. He says that years ago they did an implementation that didn't quite work, because it allowed a zero-to-one relationship, not a true one-to-one. So he tries having the two tables reference each other with foreign keys, but that doesn't really work either; it gives an error, since you can never insert the first row. What you can use instead is deferrable constraints: define the two tables, have them reference each other in terms of foreign keys, and make those constraints DEFERRABLE INITIALLY DEFERRED. That kind of implementation will work. So if you want to do this type of constraint enforcement, definitely a blog post to check out.
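
A minimal sketch of the deferred approach with hypothetical table names; the UNIQUE on each referencing column is what makes this one-to-one rather than one-to-many:

    CREATE TABLE users (
        id         integer PRIMARY KEY,
        profile_id integer NOT NULL UNIQUE
    );

    CREATE TABLE profiles (
        id      integer PRIMARY KEY,
        user_id integer NOT NULL UNIQUE
    );

    ALTER TABLE users ADD FOREIGN KEY (profile_id)
        REFERENCES profiles (id) DEFERRABLE INITIALLY DEFERRED;

    ALTER TABLE profiles ADD FOREIGN KEY (user_id)
        REFERENCES users (id) DEFERRABLE INITIALLY DEFERRED;

    -- Both rows go in within one transaction; the foreign key checks are
    -- deferred until COMMIT, so the mutual references can be satisfied.
    BEGIN;
    INSERT INTO users (id, profile_id) VALUES (1, 1);
    INSERT INTO profiles (id, user_id) VALUES (1, 1);
    COMMIT;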

[00:11:33] The next post is "Masquerade: a Postgres proxy to mask data in real time", and this is from the Tonic.ai blog. What they're doing is shown right in the post: when you go through the proxy they have set up, it obfuscates the data. You can see the name here is Glenn Raymond, and on the other side it's been scrambled, and it looks like even the birthdays have been. So in real time, it alters the data as you're viewing it: a proxy sits between the client you're using and the PostgreSQL database. The use cases they have in mind are things like a dev or QA person who wants to test against production data that you don't necessarily want to reveal to them, where you could use this type of obfuscation, or letting someone generate reports without seeing particularly sensitive data. They now have this in a Docker container to set up and test out, and there's also a second post that goes into some of the detail about how it functions. Now, of course, I was looking at this and wondering how it works. Is it inspecting the protocol across the wire? It kind of looked like it was. I'll have to look into it more, but this is potentially very interesting to me. It doesn't support SSL yet, which is something I would want: if you have data you want to obscure like this, you're going to want to keep it secure in transit anyway, at least between these two points. But this definitely looks like a project I'm going to keep track of, and if I hear more about it, I will share. So if you're interested, definitely a project to check out.

[00:13:17] The next post is "Indexes in PostgreSQL: GIN", and this is from Habr.com; I believe it's translated from the Postgres Professional site in Russia. This is a super in-depth description of the GIN index and how it works. So if you want to get into the technical details of how GIN indexes work and their particular use cases, definitely a blog post to check out.
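
For a small taste of where GIN applies: it indexes the individual elements inside composite values, such as array members or words in a document (hypothetical table here):

    CREATE TABLE documents (
        id   bigserial PRIMARY KEY,
        body text,
        tags text[]
    );

    -- GIN over an array column: fast "contains these elements" queries.
    CREATE INDEX documents_tags_idx ON documents USING gin (tags);

    SELECT id FROM documents WHERE tags @> ARRAY['postgres'];

    -- GIN is also the usual choice for full-text search.
    CREATE INDEX documents_fts_idx
        ON documents USING gin (to_tsvector('english', body));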

[00:13:47] The next post is actually a YouTube video, and it comes from the channel PG Casts. These seem to have stopped a while ago, I believe, but this one came up in my feed recently. It's called "Altering Databases", and it goes over a few ALTER DATABASE commands for PostgreSQL. It's only about three minutes in length, so it's super short, very brief, and it is on the simple side, but I'll keep track of whether they keep producing them and share them when they're available.
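
For flavor, a few of the kinds of ALTER DATABASE commands such a walkthrough covers (database and role names are placeholders):

    -- Rename a database (no one else can be connected to it at the time).
    ALTER DATABASE sales_dev RENAME TO sales_staging;

    -- Change its owner.
    ALTER DATABASE sales_staging OWNER TO report_admin;

    -- Set a per-database default for a configuration parameter.
    ALTER DATABASE sales_staging SET work_mem = '64MB';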

[00:14:18] The last post is "Chicago open rideshare data set: getting started", and the subtitle is "How to get started with mapping GIS data". This is from the CMC 1213 github.io blog. It goes over setting up PostgreSQL and PostGIS, as well as a JavaScript library called Leaflet, to present the data on a map. Now, this is a fairly long post, but he goes through it in a lot of detail. He goes so far as to show you: here's how you set up your DigitalOcean server, here's how you install PostgreSQL and PostGIS, here's how you extract the data and load it into PostgreSQL, and then how you set up the JavaScript library Leaflet to present it. Some of it is a little bit basic, but if you want to get started with PostGIS, perhaps this is a pretty interesting way to do it.
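
A sketch of the PostGIS part of such a setup (table and column names are hypothetical; the coordinates are downtown Chicago):

    -- Enable PostGIS in the database.
    CREATE EXTENSION postgis;

    CREATE TABLE trips (
        id     bigserial PRIMARY KEY,
        pickup geometry(Point, 4326)
    );

    -- Store a longitude/latitude point.
    INSERT INTO trips (pickup)
    VALUES (ST_SetSRID(ST_MakePoint(-87.6298, 41.8781), 4326));

    -- Trips that picked up within roughly one kilometer of a point.
    SELECT id
      FROM trips
     WHERE ST_DWithin(pickup::geography,
                      ST_MakePoint(-87.6298, 41.8781)::geography,
                      1000);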

[00:15:16] That does it for this episode of Scaling Postgres. You can get links to all the content mentioned in the show notes. Be sure to head over to Scalingpostgres.com, where you can sign up to receive weekly notifications of each episode, or you can subscribe via YouTube or iTunes. Thanks.
