
Citus Open Source, Time Series Performance, Subscripting Updates, target_session_attrs | Scaling Postgres 155


In this episode of Scaling Postgres, we discuss Citus 10 open source, time series performance in native Postgres, using subscripting for updates and new target_session_attrs.


Transcript



[00:00:00] In this episode of Scaling Postgres, we talk about Citus open source, time series performance, subscripting updates, and target session attributes. I'm Creston Jameson, and this is Scaling Postgres episode 155.

[00:00:23] All right, I hope you, your friends, family and coworkers continue to do well. Our first piece of content is Citus 10: Columnar for Postgres, rebalancer, single node and more. This is from Citusdata.com, and they basically released a new version of Citus open source, version 10, and it has a lot of great features to it. The first feature is columnar storage. So basically it sets up a column store, as opposed to a row store, which is the default. That enables very high compression of your data and speeds up analytical-type queries, like when you want to average or sum a single column. That's what column-based storage speeds up for you. You can now also shard on a single node. So basically, if you want to get prepared early for scaling out your database, you can now do it on a single node and then be able to easily move to multiple nodes to scale out. The next feature now available in the open source version is the shard rebalancer. So now you can rebalance your shards as you add or remove nodes. The next feature is joins and foreign keys between local PostgreSQL tables and Citus tables. The fifth is functions to change the way tables are distributed, and more. So, looking at the first feature, the columnar storage, you basically create a table using columnar, once you have the Citus extension installed, of course, and it uses Zstd compression to achieve a 60% to 90% reduction in data size, which is huge. But it does have a number of significant limitations. First, it doesn't support indexes, so basically it relies on the small table size to be able to scan that table very quickly when doing calculations. Secondly, updates and deletes are not supported, so it's an append-only table.
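As a rough sketch of what the columnar feature looks like, assuming the Citus extension is installed and preloaded (the table and data here are made up, not from the post):

```sql
-- Minimal sketch of Citus 10 columnar storage; assumes shared_preload_libraries = 'citus'.
CREATE EXTENSION IF NOT EXISTS citus;

CREATE TABLE events_archive (
    event_time timestamptz NOT NULL,
    device_id  int NOT NULL,
    payload    jsonb
) USING columnar;  -- column store instead of the default row (heap) storage

-- Compression works best on batch loads; single-row inserts, UPDATE and DELETE
-- are the patterns to avoid with columnar tables in this release.
INSERT INTO events_archive
SELECT now() - (i || ' seconds')::interval,
       i % 100,
       '{"v": 1}'::jsonb
FROM generate_series(1, 100000) AS i;
```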

[00:02:09] And then third, it's best to avoid single-row inserts. So basically, since the compression works well in batches, they say this is the kind of table where you want to batch-insert your data. And one suggested way to do it is to use the standard row storage for more recent data and then archive into the columnar storage. The next thing they talked about is being able to start out with Citus on a single node so that you can scale it when you're ready. And again, like I mentioned, you just basically add Citus to your preload libraries, create the extension, and you can create a table and then create a distributed table based upon a particular shard key. The next item is the shard rebalancer, so it allows you to rebalance your load across multiple shards; that's now available in the open source version. Joins and foreign keys now work between Citus tables and Postgres tables. And they give an example here of where you can apply Citus to only particular tables. So for example, your main PostgreSQL database has publishers, advertisers and campaigns, but you have a very large table, like clicks, which is huge. Well, you would distribute that with Citus, and you would make the ads table, which references the connections between these, a reference table, and then the rest of these would stay stock Postgres tables, so you can get into Citus slowly. They've added some enhancements to be able to alter the sharding and a lot of other new features. So this was particularly interesting because all of this has now been added to the open source version of Citus. So if you wanted to try this out for free, you could just download the open source version and give it a try.
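Here is a minimal sketch of that single-node workflow; the clicks table follows the example's naming, but the exact statements are my assumption rather than a copy of the post:

```sql
-- Sketch of sharding on a single node with Citus 10; assumes shared_preload_libraries = 'citus'.
CREATE EXTENSION IF NOT EXISTS citus;

CREATE TABLE clicks (
    ad_id      bigint NOT NULL,
    clicked_at timestamptz NOT NULL,
    user_id    bigint
);

-- Distribute by a shard key; all shards stay on this node until worker nodes are added.
SELECT create_distributed_table('clicks', 'ad_id');

-- Later, after adding worker nodes (e.g. with citus_add_node()), the now open source
-- rebalancer can spread the existing shards across them:
-- SELECT rebalance_table_shards();
```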

[00:03:52] The next piece of content is Designing high-performance time series data tables on Amazon RDS for PostgreSQL. This is from Amazon.com, and even though this post talks about Amazon RDS, you can do this with stock Postgres. What they are proposing here is basically trying to use stock PostgreSQL for time series data. And they have an example where they are collecting almost append-only data and needing to load it into Postgres. So they have a readings table and a few different indexes on it. And apparently this was probably a customer, because the post walks through them making particular transitions, changing column types to make things more efficient. So for example, one of the first things they cover is data type considerations. So it's basically determining the smallest data type to use for something. So for example, with regard to the ID, is this best to be an integer, a smallint or a bigint? So they have a discussion about that. They also discussed whether double precision makes the most sense, and they actually decided to go with the smaller real data type. So basically, making the size of the data as small as possible by choosing the most optimal data type size, and doing that reduced the data size by about 5%. Then they looked at the indexes. So they did have a standard B-tree index on the time field. But looking at the use case, particularly with time series data, they said a BRIN makes more sense, and a BRIN is a block range index. So it doesn't index each specific value like a B-tree does, but it references ranges. And they chose a particular number of pages per range, and that significantly dropped the size down. And now the table and indexes went from 126GB down to 101GB, so about a 20% space savings by moving to BRIN indexes versus the B-tree. Now, you have to be aware that you may lose some query performance, but it depends on your use case. If you're looking for individual values, you want to stick with a B-tree, but a BRIN works best for ranges of values. And then they looked at partitioning, because they were doing a metric ingestion test where they did a parallel load using multiple threads into the database, I believe about half a billion records, and they were seeing this performance. So it started off pretty well, then it started dropping down, and then it started getting very irregular. Now, what they said is that around the 300-second mark is when the checkpoints started kicking off, and that generated WAL and full-page writes, which caused the insert performance to struggle. And then here it's basically running out of memory due to the data size increase, so once the data exceeded the space available in memory, it was having to go to disk. So they actually mitigated this problem by using partitioned tables, and they created around 360 partition tables for all of this data, and they showed how they did it, and it resulted in a dramatic increase in performance: the ingestion stayed at a consistent 1.6 million metrics per second over the duration of the test. Now, what's interesting is that they don't really specify why partitioning helped, because, I mean, you're inserting the same amount of data and presumably everything eventually has to get to disk, but they didn't explicitly say why partitioning was such a great improvement. Perhaps it has to do with them doing the load in parallel, and they were doing a lot of random I/O that was changing pages a lot more with single indexes, whereas it was much more efficient when broken up. Or maybe, like in a previous episode that talked about the vicious circle of WAL writes, autovacuum was a component, and because you have to autovacuum the whole table, maybe that was bogging down some of what we see here. Whereas when you have 365 partitions, they're a lot smaller and it's easier to vacuum them. But they didn't really specify why this worked. I would be interested to know how this looks with a different number of partitions. So if you had more partitions, would you see any difference? If you had fewer partitions, would it start dropping down, and at what point? And then how is that related to loading the data in parallel? They didn't specify how many threads they used, but does that impact this performance as well? So this is a significant result, but I would be interested to know exactly why this helped the performance so much. But if you want to learn more about how you can use Postgres for time series data without using an extension such as TimescaleDB, you can check out this blog post.
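As an illustrative sketch of those ideas (the table, column names and pages_per_range value are assumptions, not the exact schema from the AWS post):

```sql
-- Illustrative only: small data types, a BRIN index on time, and range partitioning.
CREATE TABLE readings (
    sensor_id int NOT NULL,          -- int rather than bigint if the ID range allows it
    ts        timestamptz NOT NULL,
    value     real NOT NULL          -- real instead of double precision to save space
) PARTITION BY RANGE (ts);

-- One partition per day; in practice these are created in a loop or by a tool such as pg_partman.
CREATE TABLE readings_2021_03_01 PARTITION OF readings
    FOR VALUES FROM ('2021-03-01') TO ('2021-03-02');

-- A BRIN index stores a summary per block range instead of an entry per row,
-- which is much smaller than a B-tree when data arrives in time order.
CREATE INDEX readings_ts_brin ON readings USING brin (ts) WITH (pages_per_range = 32);
```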

[00:08:34] The next piece of content is How many engineers does it take to make subscripting work? This is from erthalion.info, and basically he's indicating that there's a patch that's been committed, which will probably be in Postgres 14, where you can actually use subscripting syntax to set keys in JSONB as well as query them. So for me this is huge. I love this developer convenience, and I don't know if it's just me, but every time I need to update a JSONB column, I need to check the docs on how to do it. So this would be so much easier to work with compared to now. So kudos to everybody who was involved with this. And actually this post is very long. That was the TL;DR, but the post describes the whole process of getting this feature into Postgres. So there's a lot of backstory, different developer comments, and different things to be aware of when doing Postgres patches. So if you're interested in the backstory, you can definitely check out this post.
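As a rough sketch of the new syntax, assuming PostgreSQL 14 and a hypothetical table:

```sql
-- Hypothetical table demonstrating the JSONB subscripting added in PostgreSQL 14.
CREATE TABLE accounts (
    id       int PRIMARY KEY,
    settings jsonb NOT NULL DEFAULT '{}'
);

INSERT INTO accounts VALUES (1, '{"theme": "dark", "flags": {"beta": false}}');

-- Before version 14 this update would typically be written with jsonb_set();
-- with subscripting it reads like an ordinary assignment.
UPDATE accounts SET settings['flags']['beta'] = 'true' WHERE id = 1;

-- Subscripting works in queries as well.
SELECT settings['theme'] AS theme FROM accounts WHERE id = 1;
```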

[00:09:35] The next piece of content is New target_session_attrs settings for high availability and scaling in PostgreSQL v14. This is from Cybertec-postgresql.com, and they're talking about target_session_attrs. So basically, when you connect up with psql using libpq, you can specify more than one server and it will connect to one; if that's not available, it will connect to another. You can also specify target_session_attrs, which says whether that particular target must be read-write or not, otherwise classified as any. So this enables you to target just the primary server, kind of, but in Postgres version 14 they've added a lot more attributes that you can use. So now you have primary, you have standby, you have read-only, you have read-write, and then prefer-standby. Primary is very similar to read-write, the only difference being whether default_transaction_read_only is set or not. Same thing with read-only and standby, they just vary by this parameter as well. And prefer-standby basically tries to connect to a standby; if none is available, it just connects to whatever else is available. He also mentions that the libpq library is of course used by a lot of different languages for Postgres connections, so for example they mention Python, Go and Rust, and I know Ruby uses it as well, so a lot of them get this. So you can use this syntax today, and it will be even better in version 14. So if you're interested in that, you can check out this blog post.
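As a rough sketch of what this looks like on the client side (the host names and database here are made up, not from the post):

```
# Connect to whichever listed host is currently writable (works today):
psql "host=pg1.example.com,pg2.example.com dbname=app target_session_attrs=read-write"

# With the additional PostgreSQL 14 values you could, for example, prefer a standby for reads:
psql "host=pg1.example.com,pg2.example.com dbname=app target_session_attrs=prefer-standby"
```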
The next piece of content is Starting with PG: where/how can I set configuration parameters? This is from Depesz.com, and he had previously written about locating where the config files are, but this one talks about all the different places you could make changes to them. There is, of course, the base postgresql.conf file. However, within that, at the bottom, you can include additional files. So you can include files, which gives an error if the file is missing, or include them only if the file exists, so it won't give an error. It also usually by default includes a directory called conf.d; that's personally where I stick all of my configuration, and I usually don't change the postgresql.conf file because it just makes upgrades and whatnot easier. The other place where you can find configuration is in the data directory, and it's the postgresql.auto.conf file. This file you should not manually edit; it should only be altered with the ALTER SYSTEM command. So from a Postgres prompt you can say ALTER SYSTEM SET a parameter to a particular value, and it will change it in that file. Now, it won't immediately take effect, because you may need to do a reload or a restart. Then he goes into how you determine which parameters take effect with a reload and which require a restart. He says you can actually query pg_settings in Postgres, and it will tell you for each value whether it just requires a reload, such as the archive_command, or requires a restart, such as the actual archive_mode. The next thing he covers is that all the items in the postgresql.conf file are global, and even the postgresql.auto.conf file is global. However, you can set certain parameters for particular users, and they're specified per user, for example. You can also set them per database, or per user per database, and he goes through and shows an example of how that hierarchy is established (there's a rough sketch of these levels after this paragraph). And the last thing he covers is that you can also set them per session. So per connection to Postgres, you can change certain parameters for the session you're in, and he lastly covers different ways to reset those values to their defaults. So if you're interested in learning more about configuration and the different places you can do it, definitely check out this blog post. The next piece of content is Starting with PG: where are the logs? This is also from Depesz.com, and he's talking about where you can find the logs in your Postgres system. He considers the different places where things can be logged: in Windows you can log to the event log, in Linux you can log to syslog, you can also log to standard error, or you can use the logging collector or not, as well as log to the CSV log. So if you want to learn more about logging in Postgres, definitely check out this blog post.
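To illustrate the configuration levels that post walks through, here is a minimal sketch; the parameter values, role name and database name are made up for illustration:

```sql
-- Cluster-wide change, written to postgresql.auto.conf (never edit that file by hand):
ALTER SYSTEM SET archive_command = '/bin/true';
SELECT pg_reload_conf();  -- applies reloadable changes without a restart

-- Check whether a parameter needs only a reload ('sighup') or a full restart ('postmaster'):
SELECT name, context FROM pg_settings WHERE name IN ('archive_command', 'archive_mode');

-- Per-user and per-database overrides (assumes this role and database exist):
ALTER ROLE reporting_user SET work_mem = '256MB';
ALTER DATABASE analytics SET statement_timeout = '5min';

-- Per-session, for the current connection only:
SET work_mem = '64MB';
RESET work_mem;
```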

[00:14:05] The next piece of content is actually a YouTube channel, and it's EDB's YouTube channel. In the last week they've posted a lot of different webinars on Postgres. So if you're interested in video content, definitely check out this YouTube channel.

[00:14:19] The next piece of content is Security and containers in cloud native PostgreSQL. This is from Enterprisedb.com, and they're talking about their Cloud Native PostgreSQL offering. Basically, they are bundling up Red Hat Linux in a container with Postgres all configured for you, to be used in things like Kubernetes. They go through the security process they're using to deliver secured images via this means, and also how they've actually configured the container itself, following the principle of least privilege during execution. And then they covered some points about why immutability is a benefit in this scenario: you generally would not install patches on the server, you would just get a new container, install that, and do a failover process to a standby server to upgrade it, and that's the process you would go through for container upgrades. So if you're interested in that, you can check out this blog post.

[00:15:14] The next piece of content is PostgreSQL: Getting Started on Ubuntu. And this is a very simple post talking about the best way to install Postgres on Ubuntu, especially if you're looking for a specific version and you don't want the Postgres version bundled with your particular version of Ubuntu. So if you're interested in that, you can check out this post from Cybertec-postgresql.com. The next piece of content is How PostgreSQL handles subtransaction visibility and streaming replication. So if you're interested in how Postgres works under the covers to accomplish this, definitely check out this post from Highgo.ca.

[00:15:49] Also from Highgo.ca is How to create a system information function in PostgreSQL. So again, if you want to get more detailed Postgres information, you can check out this blog post.

[00:16:01] The next piece of content is Pgpool-II's Clustering Modes. This is from B-peng.blogspot.com, and she discusses all of the different clustering modes that are available in Pgpool-II, how they work, and how to enable them. So if you're interested, you can check out this post.

[00:16:18] The next piece of content is Regression analysis in PostgreSQL with TensorFlow. Part one: Getting Started. This is from Enterprisedb.com, and this is the first post in a series that just gets TensorFlow set up with Postgres and Python 3 so that you can start using TensorFlow for machine learning. So if you're interested in that, you can check out this blog post.

[00:16:41] And the last piece of content is that the PostgreSQL Person of the Week is David Wheeler. So if you're interested in learning more about David and his contributions to Postgres, definitely check out this blog post. That does it for this episode of Scaling Postgres. You can get links to all the content mentioned in the Show Notes. Be sure to head over to Scalingpostgres.com, where you can sign up to receive weekly notifications of each episode, or you can subscribe via YouTube or iTunes. Thanks.
