
Columnar Storage, Docker Containers, Notify & Listen, Tuning RHEL | Scaling Postgres 156


In this episode of Scaling Postgres, we discuss Citus 10 columnar storage, using Postgres in Docker containers, using notify & listen for real-time apps, and tuning RHEL.


Transcript



[00:00:00] In this episode of Scaling Postgres, we talk about columnar storage, Docker containers, notify and listen, and tuning Red Hat Enterprise Linux. I'm Creston Jamison, and this is Scaling Postgres episode 156. Alright, I hope you, your friends, family and coworkers continue to do well. Our first piece of content is "Citus 10 brings columnar compression to Postgres". This is from citusdata.com. They're talking about the newly released open source Citus 10 and its support of columnar storage. So typically, Postgres uses a heap access method, but Citus has implemented a columnar access method using the Postgres 12 table access method API. So to get started, the prerequisites are having Postgres version 12 or later and having the open source Citus version 10 or later, although I'm sure you could use the non-open-source version. And they're using the psql client. So the first thing you need to do is add Citus to the shared_preload_libraries of Postgres, make sure it's enabled, and restart. Then you need to create the extension in any database you want to use it in. For their example here, they created a simple row table using the standard heap storage, and then a simple columnar table using the columnar storage. Then they generated a series of 100,000 records in each and ran an average over each of them. So no indexes have been created so far. And they're saying here you can see that the table access method is listed as heap for the row table and listed as columnar for the columnar table. Now, they have a video that runs through this whole blog post, and I encourage you to check it out as well. You can just click here and it brings you to a YouTube video that essentially reiterates the blog post. So in terms of the benefits of columnar storage: number one, it reduces storage requirements, because they're using compression to store this data, so it takes much less space, and I think we'll see that in an example here. Second, because it's compressed, there's less I/O needed to actually scan the table, so not as many blocks need to be read in order to read what's in the table. The next benefit is projection pushdown, meaning that queries can skip over the columns they don't need. So if you have a table of, say, 20 columns, and you only need to return or look through two columns, it will only focus on those columns. That's where columnar storage has huge benefits, in that you only access the columns that are of interest. And the fourth benefit is chunk group filtering. It's a way to skip over groups of data, and they say it can skip past a lot of the data quickly without even decompressing it. So I guess it stores references to the ranges of values that exist within the data, so it knows which chunks of data to access or not. So they did an example here of using row storage and columnar storage to look at compression and being able to skip over unneeded rows. Although I'm not sure how much of a use case this will be, because eventually you would want to query against it, and would you want to store wide rows in a column-based table? I'm not really sure how much of a benefit that is. I can definitely see the compression, but in a lot of analytics systems I've looked at, table width is kept minimal for that reason. So I'm not sure how advantageous this is in the real world.
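To make that concrete, here's a minimal sketch of the kind of setup the post walks through; the table names here are illustrative rather than copied from the post:

```sql
-- Citus must also be listed in shared_preload_libraries in postgresql.conf.
CREATE EXTENSION IF NOT EXISTS citus;

-- A conventional table using the default heap access method.
CREATE TABLE simple_row (i INT8);

-- The same shape of table using Citus 10's columnar access method.
CREATE TABLE simple_columnar (i INT8) USING columnar;

-- Load 100,000 rows into each, then run the same aggregate against both.
INSERT INTO simple_row SELECT generate_series(1, 100000);
INSERT INTO simple_columnar SELECT generate_series(1, 100000);

SELECT avg(i) FROM simple_row;
SELECT avg(i) FROM simple_columnar;
```

Describing the tables in psql with \d+ is what shows heap versus columnar as the access method.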
But looking at all of these integers being stored, you can see that the compression ratio is eight, meaning the columnar storage takes up eight times less room than the row-based storage. So that's how good the compression is. Now, when actually querying the data, they're only using three out of the ten columns, so again, you're going to get a big difference in performance out of this. I'm not sure how applicable this is to the real world, because why would you put all that data into a table you're hardly querying just because it has to be in columnar storage? So, again, I'm not sure on the use case for this, but you can see with this example they're getting a 25x speedup, because again, you're only accessing the columns that you need to. And they're also showing you the number of buffers read is over 5 million for the row storage, whereas it's only 27,000 for the columnar storage. So that's a big reason for this speedup. And again, remember, there are no indexes involved with this. Now let's get into the limitations. Number one, as of right now, there's no update or delete support, so it's best for append-only tables or an append-only solution. And they said even if they eventually implement update and delete, it's not going to be great performance, because it's using column-based storage. Right now there's no index support, and there's no logical replication or logical decoding support. There are a few more limitations they don't mention in here, but they're in the README. Now, with regard to being able to update and delete, they said there is one solution you can use: a hybrid columnar and row table storage using range partitioning. So in the example they used, they created an events table and partitioned it by a date range. The older partitions can use the column-based storage, and the newer partitions can use row-based storage so that you can do updates and deletes if you want. Then essentially you can rewrite the row-based storage into column-based storage once you're moving on to a new partition. So they're saying this kind of gives you the best of both worlds, and they're using a special function they developed, called alter_table_set_access_method, that's included in the Citus extension to be able to do this. And they give some query examples and the size improvements you can get by doing this. Now, they also mention this is based upon a previous extension called the cstore foreign data wrapper, but that had a lot of disadvantages compared to the current one. So the old extension, cstore_fdw, did not have transactional integrity, so for example it did not do rollbacks, it did not support write-ahead logging, and you couldn't do physical replication with it. With the Citus columnar storage you can do physical streaming replication, and the newer one does support pg_upgrade, which is great news, and that's probably because it's based upon the table access method API in Postgres 12. So this is a great addition, and if you think your workload could benefit from column-based storage, maybe you want to check out this extension to see what it can do for you.
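As a rough sketch of that hybrid pattern (the schema here is illustrative, assuming Citus 10 is installed), older partitions get converted to columnar while the current one stays on heap:

```sql
-- An events table partitioned by a date range.
CREATE TABLE events (
    event_id   BIGSERIAL,
    event_time TIMESTAMPTZ NOT NULL,
    payload    JSONB
) PARTITION BY RANGE (event_time);

-- The current partition stays on heap so it still supports updates and deletes.
CREATE TABLE events_2021_03 PARTITION OF events
    FOR VALUES FROM ('2021-03-01') TO ('2021-04-01');

-- An older, closed-out partition.
CREATE TABLE events_2021_02 PARTITION OF events
    FOR VALUES FROM ('2021-02-01') TO ('2021-03-01');

-- Rewrite the old partition into columnar storage using the
-- function included in the Citus extension.
SELECT alter_table_set_access_method('events_2021_02', 'columnar');
```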

[00:06:33] The next piece of content is "Running Postgres in Docker. Why and how?". This is from cybertec-postgresql.com, and I think this post is a response to the other one that said resistance to containers is futile, where they were saying that Postgres is essentially going to be running on containers a lot. This one, I think, has a balanced view of it. They talk about how Docker and containers in general were basically built as immutable ways to transport code, but essentially with a database it is mutable: you have data constantly changing in a database. So is it kind of a square peg in a round hole, using containers with databases? Because containers are supposed to be stateless, and databases are all about the state. Now, there are ways to get around this with containers, essentially setting up separate volumes so that the state is stored outside of the container. So one of the first questions the post asks is, should I use Postgres with Docker, or containers in general, particularly for production workloads? And their advice is, first, if you do, you should live fully on a container framework such as Kubernetes or OpenShift. And secondly, know that you're going to have to depend on some sort of third-party software projects not affiliated with the PostgreSQL Global Development Group, and you may have to maintain your own Docker images to include extensions that you need, or other scripts to handle things like upgrading to major versions. So it's possible, but there will be work involved, and I imagine that over time this will continue to improve. But as of right now it's still early days, and you're probably going to have to do a bit more work for it. And this post takes a measured approach that is similar to my thinking: yes, you can do it, but you need to be aware of the ramifications of doing it. One area where they do advocate for using Postgres with containers is in testing. They say, quote, it's a tester's dream. So, for example, he says he personally has every version of Postgres as a container he can just spin up to do work in, which is great. And then the post rounds out by discussing how to actually use Docker with Postgres and how he uses it. He gives the different commands you can use, takes a look inside the different images and the Docker containers he uses, and covers how to set up volumes and some different configurations. So overall the post says that yes, you can use containers with Postgres, but quote, you had better be aware of the pitfalls. One, containers are meant for immutable state, so you need a way to handle that. Two, containers won't give you any automatic, magical high availability capabilities; for that you have to look to container frameworks such as Kubernetes. Third, they say life will be relatively easy only when you go all in on some container management framework like Kubernetes, and additionally some operator software such as Zalando's or Crunchy Data's Postgres operators. So basically, another post talking about using Postgres with containers and the issues you need to be aware of. I thought this was a very balanced post, and I agree with all of their points, so I definitely encourage you to check out this post from cybertec-postgresql.com.
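As a minimal sketch of the kind of commands he covers (the image tag, volume name, and password here are my own illustrative choices), running Postgres in Docker with the data directory on a named volume looks roughly like this:

```bash
# Create a named volume so the database state lives outside the container.
docker volume create pgdata

# Run the official Postgres 13 image with the data directory on that volume.
docker run -d --name pg13 \
  -e POSTGRES_PASSWORD=secret \
  -v pgdata:/var/lib/postgresql/data \
  -p 5432:5432 \
  postgres:13

# Connect with psql from inside the container.
docker exec -it pg13 psql -U postgres
```

Because the data lives on the pgdata volume, you can remove and recreate the container without losing the database.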

[00:09:39] The next piece of content is "Postgres Notify for real time dashboards". This is from arctype.com, and they're talking about using the Postgres features notify and listen. So the use case they were designing for was a real-time dashboard to keep track of production in a manufacturing operation. They had a backend where employees were inputting the work that they did; when a particular step was complete, that inserted a row into Postgres, which would fire a trigger that eventually notified a Node.js and WebSockets frontend to retrieve more data and present it. So they showed how they actually built this. They gave you the base schema they developed for tracking the production changes, and then they created a function. And what that function did is simply PERFORM pg_notify with an order progress event type and then some sort of payload. Now, they went with a simple payload; they didn't want to actually transfer the data as part of this, although you could do that. They just wanted the clients to receive the notification and then retrieve the new data from the database. So basically they were just using it as an event notification system. Then you place this function in a trigger on a particular table, so before a row got inserted, it would fire this procedure. Then they went on to the listen syntax, and again, like I mentioned before, the client would receive this notification, but then it would go do its own query against the database for the data it needed. So they developed a view for that purpose: whenever the client received a notification that something had been updated, it would just query this view. And this is essentially the JavaScript that was used. At the point the client connects, it runs a query to listen for this event type in Postgres, and then when a notification occurs, it calls a callback called event callback, and that event callback queries the view and presents the data to the client. And then they have a little picture of how it presents that data. So this is a very simple, great example of using Postgres listen and notify for real-time notifications. If you're interested in that, check out this blog post.
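Here's a minimal sketch of that pattern; the table, channel, and column names are illustrative rather than taken from the post:

```sql
-- Trigger function that publishes a lightweight notification.
-- Clients re-query a view for the actual data instead of
-- receiving it all in the payload.
CREATE OR REPLACE FUNCTION notify_order_progress() RETURNS trigger AS $$
BEGIN
    PERFORM pg_notify('order_progress_event', NEW.order_id::text);
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

-- Fire the function whenever a production step is recorded.
CREATE TRIGGER order_progress_notify
    BEFORE INSERT ON order_steps
    FOR EACH ROW EXECUTE FUNCTION notify_order_progress();

-- A connected dashboard client subscribes to the channel with:
LISTEN order_progress_event;
```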

[00:11:49] The next piece of content is "Tuning the Red Hat Enterprise Linux family for PostgreSQL". This is from enterprisedb.com. They're talking about configuring Linux, or Red Hat Enterprise Linux specifically, for the Postgres use case, and in this example they're using the tuned package to do it. That enables you to set things like sysctl parameters, CPU settings, and virtual memory settings. And they show some of the configuration changes they've made specifically for running Postgres. They explain all the different parameters and how they determined those were the ones to set. Then they went into optimizing the file system in terms of doing things like setting noatime, and they closed out talking about huge pages: how to determine the best configuration for huge pages, how to enable them, and how to make sure that Postgres is using them. So if you're interested in that, you can check out this blog post.
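As an illustrative sketch (these are not the exact values from the post), a custom tuned profile for a dedicated Postgres host might live in /etc/tuned/postgres/tuned.conf and look something like this:

```ini
[main]
summary=Example profile for a dedicated PostgreSQL server
include=throughput-performance

[cpu]
governor=performance

[vm]
# Postgres generally does better with explicit huge pages than transparent ones.
transparent_hugepages=never

[sysctl]
# Favor keeping the page cache over swapping out backend memory.
vm.swappiness=10
vm.dirty_background_ratio=3
vm.dirty_ratio=10
```

You would then activate it with tuned-adm profile postgres.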

[00:12:40] The next piece of content is "Best practices for Amazon RDS for PostgreSQL major upgrades and replicas". This is from aws.amazon.com. They're talking about the things to take into account when you're upgrading your Postgres RDS databases. Basically, you can upgrade the primary and its replicas together, although it does the primary serially first and then does the replicas in parallel. But you can also choose to only upgrade the primary and then create a new replica, and with RDS there are some trade-offs to that. If you choose to do the concurrent upgrade, the upgrade will take longer, so you have longer downtime using this method. With the new-replica method, the downtime is shorter, but depending on the size of your database, you could be running without a replica for quite a while, since it takes time to actually rebuild it. So there's a trade-off: you have to weigh full downtime of the application using the database against continuing operations without an existing replica. And they go through the process of how you can do this, as well as give advice on different testing scenarios you can use to make sure the upgrade works before you actually do the implementation. This also has some good advice you can apply to Postgres upgrades in general. Now, it works a little differently in RDS, because as far as I know, they don't use standard PostgreSQL physical streaming replication for their replicas; it's more of a disk-based replication, to my knowledge. So you can do independent upgrades like this with a replica, but if you're upgrading Postgres not on RDS and using physical streaming replication, there is a different method you can use to upgrade the replicas. When I've done upgrades for clients, we used pg_upgrade to upgrade the masters, and then a specific technique that's included in the Postgres documentation to upgrade the streaming replicas as well. But definitely check out this blog post if you want to learn more.
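As a rough sketch of the in-place path (the instance identifier and version numbers are placeholders), a major version upgrade through the AWS CLI looks something like this:

```bash
# Check which engine versions this Postgres version can be upgraded to.
aws rds describe-db-engine-versions \
  --engine postgres --engine-version 11.10 \
  --query 'DBEngineVersions[].ValidUpgradeTarget[].EngineVersion'

# Kick off the major version upgrade; the instance is unavailable while it runs.
aws rds modify-db-instance \
  --db-instance-identifier my-postgres-db \
  --engine-version 12.5 \
  --allow-major-version-upgrade \
  --apply-immediately
```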

[00:14:40] The next piece of content is "Speeding up pgbench using COPY FREEZE". This is from pgsqlpgpool.blogspot.com. They're talking about how, when you're setting up pgbench and loading data for the first time, the part that takes the most time at the start of a pgbench run is actually the vacuum operation. They talk about why vacuum is so slow. Well, vacuum takes a while because it does a number of different things, such as updating the hint bits on each tuple, updating the visibility map, and updating the freeze map, basically doing things to make sure the table is optimized so queries run as fast as possible. But there is an alternative, and that is COPY FREEZE; however, it only does the first required thing that vacuum does, not the others. According to this post, though, COPY FREEZE will be enhanced in Postgres 14 to do all of these things. So this post talks about updating pgbench to use COPY FREEZE for loading as opposed to relying on vacuum, and what they've noticed is a dramatic improvement in the startup time. They imagine this pgbench enhancement will actually get into Postgres 15. But if you use pgbench, definitely some enhancements to look forward to.
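For reference, the FREEZE option only works when the table was created or truncated within the same transaction that does the COPY; here's a minimal sketch, with an illustrative file path:

```sql
BEGIN;

-- FREEZE requires the table to have been created or truncated
-- in the same transaction as the COPY.
TRUNCATE pgbench_accounts;

COPY pgbench_accounts FROM '/tmp/accounts.csv' WITH (FORMAT csv, FREEZE);

COMMIT;
```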

[00:15:56] The next piece of content is "PostgreSQL GitHub Actions continuous integration". This is from cybertec-postgresql.com, and they're talking about using GitHub Actions in relation to Postgres. So if you have an interest in learning how to use GitHub Actions, particularly working with Postgres, definitely check out this blog post.

[00:16:16] And the last piece of content: the PostgreSQL person of the week is Guillaume Lelarge. So if you're interested in learning more about Guillaume and his contributions to Postgres, definitely check out this blog post. That does it for this episode of Scaling Postgres. You can get links to all the content mentioned in the Show Notes. Be sure to head over to Scalingpostgres.com, where you can sign up to receive weekly notifications of each episode, or you can subscribe via YouTube or iTunes. Thanks.
