
External Compression, Parallel Vacuum, JSON Aggregates, JSON & Arrays | Scaling Postgres 130


In this episode of Scaling Postgres, we discuss external compression, vacuuming in parallel, working with JSON Aggregates and the JSON & Array datatypes.


Transcript



[00:00:00] In this episode of Scaling Postgres, we talk about external compression, parallel vacuum, JSON aggregates, and JSON and arrays. I'm Creston Jameson, and this is Scaling Postgres episode 130.

[00:00:22] All right, I hope you, your friends, family, and coworkers continue to do well. Our first piece of content is "Need for external compression methods in PostgreSQL". This is from Amit Khan at PG Blogspot.com, and he's talking about a means of doing compression within Postgres using external methods. There are some utilities in Postgres that offer the ability to do compression, but it's usually just gzip, and there are some internal ways of doing compression, like with the oversized attribute storage technique, or TOAST, tables. But this author indicates there could be a need to use external compression mechanisms to compress a table, or to compress data, to get better performance. As he says here, the idea is that a smaller data size means fewer data pages to scan, which means less disk I/O and faster data access. And of course, the decompression methods need to be fast enough not to hamper the query performance gains you would get. Some of the compression methods that would be good to utilize in some way are zlib, LZ4, ZSTD, Snappy, gzip, et cetera. Now, he also mentions that another benefit of putting this feature in would be the ability to utilize hardware acceleration for compression and decompression, and he mentions a few of those here; it looks like a few of them are on the ARM64 platform. Now, he does say that PostgreSQL uses its own built-in compression algorithm, based on LZ, for TOAST table compression. But he says, what if you could choose that compression algorithm, what if you could configure the compression level, and what if it could use hardware acceleration? And he says that the community is working on something exactly like that. So there is a proposed feature in the making, with an experimental patch available. So he took a look at it. He created two tables, a zlib tab table and an LZ tab table. The zlib one is, of course, compressed with zlib at a level of four, and he just used pglz compression on the other one. Now, what's interesting is that with each of these you get different behaviors and different levels of performance, depending on what you're doing. So, for example, with a COPY, where you're inserting data, the zlib table was almost twice as fast as the LZ one. In terms of table size, the LZ table was smaller, looks like maybe 1.2 GB versus 1.7 GB. But in terms of SELECT performance, the LZ table was twice as fast as the zlib one. So what's interesting is that it seems as if you could choose your compression algorithm and get different levels of performance. Maybe you want to have the fastest queries, even though the table is a little bit larger or it takes longer to insert. Or maybe you're just interested in raw insert performance; well, you can choose an algorithm that is best at that. Now, he mentions you could even define your own compression algorithms by defining a particular compression handler, and he gives an example of that. So I found this blog post super interesting in terms of how we could potentially use compression and decompression of data in Postgres to get better performance, and in being able to define the type of performance you're looking for. Are you looking for a way to save space? Are you looking for fast queries? Are you looking for fast inserts? So I definitely encourage you to check out this blog post.
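To make the idea concrete, here is a rough sketch of the kind of per-table compression setup the post describes. This is based on the experimental patch discussed there; the syntax is not in released Postgres, and the exact form in the patch may differ, so treat the names and options below as assumptions.

```sql
-- Assumed syntax from the experimental compression patch (NOT available in released Postgres);
-- table and column names are illustrative only.
CREATE TABLE zlib_tab (t text COMPRESSION zlib WITH (level '4'));  -- external zlib at compression level 4
CREATE TABLE lz_tab   (t text);                                    -- default built-in pglz compression
```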

[00:03:44] The next piece of content is "Parallelism comes to VACUUM". This is from 2ndQuadrant.com. Now, first they cover the different phases of vacuum as they exist today. So you've got the heap scan, the index vacuum phase, the heap vacuum phase, the index cleanup phase, and the heap truncation phase. Now, of all of these, the vacuuming of the indexes takes the longest, and that is exactly what they're looking to make happen in parallel with version 13 of Postgres. So by default, when you run a manual vacuum, so VACUUM some table, the parallel option will be enabled. So it's on by default. However, and they say here, quote, "parallel vacuum is always disabled in autovacuum". So that's a little bit of a disappointment, because I would really love to have it run there. But I assume they may be enabling this in a future version of Postgres, because normally you see new features such as this come out where they're not the default, so they won't potentially impact the vast majority of Postgres users, but then maybe in a later version it's deemed safe enough to be the default. We'll just see what happens in future versions. And then if you do specify the PARALLEL clause, you can define how many workers you're going to be using, or you can disable parallel vacuuming by setting a parallel setting of zero, so no workers. Now, he also mentions down here that there are also different behaviors based upon the indexes, and that you're not always going to get a parallel vacuum even though you ask for one. So if a B-tree index is small enough, it still may not be processed in parallel, because it would take more time to set up and coordinate those workers than it would to just do the work with one process. But otherwise, you do get B-tree indexes being vacuumed in parallel. He also mentions that there is support for the different phases for the other index types mentioned down here, such as hash, GIN, GiST, BRIN, et cetera, and they may or may not be processed in parallel depending upon the phase. Now, in terms of performance, he set this up with a six gigabyte table with eight three-gigabyte indexes and then ran the vacuum with different numbers of parallel workers. So as you can see here, there's a dramatic performance gain by going to eight workers, which is the number of indexes that are on the table. And what he mentions here is, quote, "the execution time of the index vacuum accounted for more than 95% of the total execution time. Therefore, parallelization of the index vacuum phase helped to reduce the vacuum execution time much." So definitely a great improvement coming in Postgres 13. And again, like I mentioned, I look forward to future versions where they may enable autovacuum to do this in parallel as well.
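As a quick illustration of the manual vacuum options discussed here, a minimal sketch (the table name is an assumption for the example):

```sql
-- Postgres 13 parallel vacuum of a table's indexes; the table name is illustrative.
VACUUM (PARALLEL 4) big_table;  -- request up to 4 parallel workers for index vacuuming
VACUUM (PARALLEL 0) big_table;  -- explicitly disable parallel index vacuuming
VACUUM big_table;               -- parallel index vacuuming is used by default where it applies
```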

[00:06:26] The next piece of content is "Postgres JSON aggregate functions are pretty cool". This is from the blog at knut.eu, and he basically has a set of tables where a query looks like this: you have a username, an age, and a group name, and he wanted the API to return something that looks like this in terms of JSON, where things are grouped together, so different users and different groups presented in this JSON format. So first he shows the different tables and how the data is inserted to be able to get a query that looks like this. Then he converts it to JSONB by combining this information here. He removes the ID column, because it wasn't necessary in the output, which gives the results shown here. He then uses the JSON aggregate function, or json_agg, to put each group's rows into its own row, so you can see here you have two records inside of this, grouped on the group name. Then lastly, he does a JSON object aggregation, which combines this into JSON again, into the final output that he was looking for. So this is a pretty quick post, but it shows you how you can get Postgres to output JSON in a format that could be directly served by an API. Now, we've covered different articles that show this, and what they mention is that they usually get some big performance gains having Postgres do this work through queries, as opposed to using, say, your application framework to do the manipulation. So maybe you lose some flexibility, but you generally gain some performance. So if you're interested in learning more about how to do this, definitely check out this blog post.
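For reference, here is a minimal sketch of the two aggregation steps described, using a hypothetical users/groups schema rather than the post's exact tables:

```sql
-- Hypothetical schema: groups(id, name) and users(id, username, age, group_id).
-- jsonb_agg collects each group's members into a JSON array;
-- jsonb_object_agg then keys those arrays by group name for an API-style payload.
SELECT jsonb_object_agg(group_name, members) AS api_payload
FROM (
    SELECT g.name AS group_name,
           jsonb_agg(jsonb_build_object('username', u.username, 'age', u.age)) AS members
    FROM groups g
    JOIN users u ON u.group_id = g.id
    GROUP BY g.name
) AS grouped;
```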

[00:08:06] The next piece of content is "Webinar: JSON & ARRAY – Contemporary PostgreSQL Data Types [Follow Up]". This is a webinar given by 2ndQuadrant.com; you can click here to watch it, and there's a ten or eleven minute preview down here. It basically reviews the JSON data types as well as arrays and how to work with them, to avoid having to follow any kind of entity-attribute-value pattern, which can give really poor performance. So it takes what was in the previous post and goes into those functions and more manipulations that you can do, in more detail, in a webinar format. So if you're interested in learning more about JSON and manipulating it in Postgres, definitely check out this webinar.
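As a small sketch of what the webinar is getting at, you can store attributes in a jsonb column and tags in an array instead of an entity-attribute-value table; the schema below is an assumption for illustration, not the webinar's own example:

```sql
-- Illustrative schema: jsonb and array columns instead of an entity-attribute-value table.
CREATE TABLE products (
    id         bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    name       text NOT NULL,
    attributes jsonb NOT NULL DEFAULT '{}',
    tags       text[] NOT NULL DEFAULT '{}'
);

INSERT INTO products (name, attributes, tags)
VALUES ('widget', '{"color": "red", "weight_g": 150}', ARRAY['sale', 'new']);

-- Filter on a JSON attribute and on array membership in one query.
SELECT name
FROM products
WHERE attributes @> '{"color": "red"}'
  AND 'sale' = ANY (tags);
```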

[00:08:51] The next piece of content is "How to get the best out of PostgreSQL logs". This again is from 2ndQuadrant.com, and they're talking about setting up logging for Postgres. Some of their recommendations: don't make manual changes to postgresql.conf, basically use some sort of configuration tool like Ansible or Puppet or Chef. They recommend using the logging collector; some people choose not to use it and just use the system logging of their operating system, but they advocate using the logging collector. They cover how to set your log destination to a particular format, along with their recommendations for that, how to configure your log file names, how to configure your log_line_prefix, and the other configuration settings for setting up your logging. And then they cover some final things, like using tools to analyze the logs, such as pgBadger, and other management applications to be able to track log changes. So if you're interested in learning more about that, definitely check out this blog post.
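As an example of the kind of settings the post covers, here is a minimal sketch using ALTER SYSTEM; the specific values are assumptions for illustration, not the post's exact recommendations:

```sql
-- Illustrative logging settings; values are examples only.
ALTER SYSTEM SET logging_collector = 'on';        -- takes effect after a server restart
ALTER SYSTEM SET log_destination   = 'csvlog';    -- machine-readable format for tools like pgBadger
ALTER SYSTEM SET log_filename      = 'postgresql-%Y-%m-%d_%H%M%S.log';
ALTER SYSTEM SET log_line_prefix   = '%m [%p] %q%u@%d ';  -- timestamp, pid, user@database
SELECT pg_reload_conf();                          -- reloads the settings that don't need a restart
```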

[00:09:53] The next piece of content is "PostgreSQL versus PostgreSQL versus postgres versus oracle versus all the rest". This is from cybertec-postgresql.com, and they're talking about searching within Postgres where you're doing a kind of similarity search. Now, it's not purely about similarity searches; it covers several ways to find information through text searching. The first thing they cover is the citext data type, which is available as an extension; citext is case-insensitive text, so basically you can do searches without having to worry about case, and it will find the appropriate records if you use this data type. Next they go into LIKE queries and how you can search for defined portions of records using LIKE and ILIKE; ILIKE, again, is case insensitive. Then they cover similarity search using pg_trgm trigrams, and they give some examples of using that. Then they go into full text search, which does an exact search, but only on certain words that exist within the rows you're searching, and finally a phrase search, where you can require that certain words follow others as a phrase. So if you're interested in learning more about these types of searches, definitely check out this post from cybertec-postgresql.com.
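Here is a minimal sketch of the search techniques mentioned; the table and data are assumptions for illustration:

```sql
-- citext, ILIKE, trigram similarity, and full text phrase search in one illustrative example.
CREATE EXTENSION IF NOT EXISTS citext;
CREATE EXTENSION IF NOT EXISTS pg_trgm;

CREATE TABLE people (name citext, bio text);
INSERT INTO people VALUES ('Postgres', 'a relational database'), ('Oracle', 'another database');

SELECT * FROM people WHERE name = 'postgres';                 -- citext: case-insensitive equality
SELECT * FROM people WHERE bio ILIKE '%DATABASE%';            -- ILIKE: case-insensitive pattern match
SELECT name, similarity(name, 'postgre') FROM people
ORDER BY name <-> 'postgre' LIMIT 5;                          -- pg_trgm similarity / distance
SELECT * FROM people
WHERE to_tsvector('english', bio) @@ phraseto_tsquery('english', 'relational database');  -- phrase search
```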

[00:11:09] The next piece of content is "Oracle to PostgreSQL: cursors and ltrees". So this follows on from a post we discussed in a previous episode of Scaling Postgres, where they talked about how, when you move from Oracle to Postgres, you don't have to use as many cursors. The last one covered working with hierarchical data using recursive CTEs. This one covers ltree, a label tree extension for Postgres that helps you work with very complex hierarchical data. So, for example, they have a taxonomy list here that goes 21 levels deep with over 600,000 records, and the post shows you how you can use ltrees to work with this type of data, as sketched below. So if you're interested in that, definitely check out this post from 2ndQuadrant.com.

The next piece of content is "Three easy things to remember about postgres indexes". This is from Crunchydata.com. First, indexes can speed up other operations too, meaning not just searching in the WHERE clause; when you're doing joins, or you're looking to group and sort data, having indexes in place can help those operations and speed them up. Second, indexes aren't always used; depending upon your table and how you query the data, for example using the LIKE syntax here, there are cases where an index will not be used. And thirdly, indexes come at a cost: every time you insert or update data, those indexes have to be updated, and there's a performance impact for that. So it's a pretty simple post, but if you're interested in learning more, definitely check out this one from Crunchydata.com.
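As promised above, here is a minimal ltree sketch; the taxonomy rows are made up for illustration, not the post's 600,000-record dataset:

```sql
-- ltree stores hierarchical labels as dot-separated paths and can be indexed with GiST.
CREATE EXTENSION IF NOT EXISTS ltree;

CREATE TABLE taxonomy (
    id   bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    path ltree NOT NULL
);
CREATE INDEX ON taxonomy USING gist (path);

INSERT INTO taxonomy (path) VALUES
    ('Animalia'),
    ('Animalia.Chordata'),
    ('Animalia.Chordata.Mammalia');

-- All descendants of Animalia.Chordata (the <@ "is descendant of" operator).
SELECT path FROM taxonomy WHERE path <@ 'Animalia.Chordata';
```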

[00:12:47] The next piece of content is "An overview of trusted extensions in PostgreSQL 13". So prior to Postgres 13, you needed to be a superuser in order to create extensions on databases. But with 13, certain extensions are classified as trusted and others as untrusted. So something like hstore, which they mention here, is considered trusted in 13, so you just need certain permissions, like the CREATE privilege on the database, to be able to add it, and you don't need Postgres superuser permissions. Whereas other extensions, like the file_fdw foreign data wrapper, are considered untrusted, and therefore it says you must be a superuser to create the extension. So it's an interesting adjustment coming with 13, and they mention there are 24 trusted and 24 untrusted extensions. So basically it helps you give more power to users who have access to the system, if they have certain privileges, to enable some of these extensions for the databases they work with. If you're interested in learning more, check out this post from Severalnines.com.
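A small sketch of the difference in practice; the role and database names are assumptions:

```sql
-- Run once as a superuser: let a regular role create objects (and trusted extensions) in its database.
GRANT CREATE ON DATABASE appdb TO app_owner;

-- Then, connected to appdb as app_owner (not a superuser):
CREATE EXTENSION hstore;     -- allowed in Postgres 13: hstore is marked trusted
CREATE EXTENSION file_fdw;   -- still fails: file_fdw is untrusted, so a superuser is required
```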

[00:13:56] The next piece of content is "PostgreSQL Backups". This is from Highgo.ca, and they cover all the different ways you can back up Postgres, from logical pg_dump and pg_dumpall backups, to file-system-level backups using manual backup methods, such as starting the backup, backing up all the files, and then stopping the backup, to using pg_basebackup. So if you're interested in looking at different backup solutions for Postgres, check out this post.
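As one example from that list, the manual file-system-level flow they describe looks roughly like this in Postgres 13; the backup label is an assumption, and pg_dump and pg_basebackup are separate command-line tools run from the shell:

```sql
-- Non-exclusive low-level backup in Postgres 13; the label is illustrative.
SELECT pg_start_backup('nightly', false, false);  -- enter backup mode (label, fast checkpoint?, exclusive?)
-- ...copy the data directory with your file-system tools while this session stays open...
SELECT * FROM pg_stop_backup(false);              -- leave backup mode; returns the WAL position and backup file info
```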

[00:14:25] The next piece of content is "Combining pgBackRest and Streaming Replication, a Postgres 13 update". It shows you how to use pgBackRest with streaming replication and how to get that all set up for version 13. So if you're interested, check out this post from pgstef.github.io.

And the last piece of content is that the PostgreSQL Person of the Week is Marco Slot. So if you're interested in learning about Marco and his contributions to Postgres, definitely check out this blog post.

That does it for this episode of Scaling Postgres. You can get links to all the content mentioned in the show notes. Be sure to head over to ScalingPostgres.com, where you can sign up to receive weekly notifications of each episode, or you can subscribe via YouTube or iTunes. Thanks.
