
Terabyte Scaling, Postgres 11, Connections, Backup | Scaling Postgres 35


In this episode of Scaling Postgres, we review articles covering terabyte scaling, PostgreSQL 11, managing connections and backup of terabyte databases.


Transcript



[00:00:00] In this episode of Scaling Postgres, we talk about terabyte scaling, Postgres 11, managing connections and terabyte backup. I'm Creston Jameson and this is Scaling Postgres episode 35.


[00:00:21] Alright, our first piece of content this week is a notification that PostgreSQL 11 has been released. So as expected, it was released on October the 18th, and this post talks about all the new features that we've gone over in previous episodes. Basically, increased robustness and performance for partitioning. So they've really fleshed out the declarative partitioning that existed, with the intent to make it even easier to set up and use, and with more features, in PostgreSQL 11. The introduction of stored procedures that allow you to do transactions, kind of like within a function. Improvements upon the types of queries that you can run in parallel. Doing just-in-time compilation for certain expressions, as long as you turn it on. And they list general user experience improvements, talking about being able to add a column with a default and not have it rewrite the whole table. That's a huge improvement. Adding covering indexes, where you can essentially include additional payload in the index, some additional functionality related to window functions, and then being able to just type quit and exit from the psql command line. So if you want to know a little bit more about PostgreSQL 11, you can check out this announcement.
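As a quick illustration of two of those features (the table, column and procedure names here are made up for the example, not from the announcement), a covering index and a stored procedure with transaction control might look something like this:

```sql
-- Hypothetical names; a sketch of two PostgreSQL 11 features mentioned above.

-- Covering index: order_total is carried in the index as extra payload,
-- so a query on customer_id that only needs order_total can use an
-- index-only scan.
CREATE INDEX orders_customer_idx ON orders (customer_id) INCLUDE (order_total);

-- Stored procedure: unlike a function, it can COMMIT inside its body.
CREATE PROCEDURE purge_old_orders() LANGUAGE plpgsql AS $$
BEGIN
  DELETE FROM orders WHERE created_at < now() - interval '5 years';
  COMMIT;
END;
$$;

CALL purge_old_orders();
```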

[00:01:33] Now, related to that, what happens soon is that the final release for PostgreSQL version 9.3 is projected to be November 8th, and I believe at that point, or a period of time after that, it will be considered unsupported. So if you're on 9.3, you should probably consider going ahead and upgrading. Now, given that this is a new release, there was an interesting post on the Momjian.us blog talking about a three-year cycle. Basically this is just a quick little post that he did, but he's talking about how the types of problems that the PostgreSQL team is addressing take a long time to solve, and because there's a release every year, sometimes these features take multiple years to complete. So that's why you tend to get essentially maybe a quarter of what you expect in terms of features, or half of what you expect. So parallelism kind of started in 9.6, more was added in 10, more was added in 11, and presumably more will be added in 12. The same thing with partitioning. Partitioning has kind of been there for many releases, but declarative partitioning came in 10, and it's been made much better in 11. I'm sure they're going to continue on that path in 12. But basically, when looking at the PostgreSQL project, he says to consider it as essentially being on a three-year cycle to get certain features to what they perceive is a complete state. So just an interesting insight into the development cycle of PostgreSQL.

[00:03:05] The next post is Adding New Table Columns with Default Values in PostgreSQL 11, and this is from the 2ndQuadrant.com blog. Now, we've mentioned this in multiple episodes and described it, but this short post was actually made by the author of the feature, and he goes into a little bit of the detail about how it works. So, a great resource. But what's interesting here is he says the default value doesn't have to be a static expression, it can be any non-volatile expression, but volatile expressions such as random() will still result in a table rewrite. So you still have to be a little bit careful if you're trying to use functions as the default. So you might want to check out this post if you tend to do that, to see under what conditions you're going to get a full table rewrite when you're adding a column with a default, and when you're not.
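As a rough illustration (the table and column names are made up), the difference looks something like this:

```sql
-- Hypothetical table; a sketch of when PostgreSQL 11 can skip the rewrite.

-- Non-volatile default: no table rewrite. The default is recorded in the
-- catalog and filled in on the fly when existing rows are read.
ALTER TABLE events ADD COLUMN status text NOT NULL DEFAULT 'pending';

-- Volatile default: random() could produce a different value for every row,
-- so the whole table still gets rewritten.
ALTER TABLE events ADD COLUMN sample_key double precision DEFAULT random();
```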

[00:03:56] The next post is Ideas for Scaling PostgreSQL to Multi-Terabyte and Beyond, and this is from the CyberTec-postgresql.com blog. Basically, he goes over things you should keep in mind when you want to scale your database up to terabyte scale. The first area he discusses is lightweight, special-purpose indexes, and he particularly focuses on partial indexes that allow you to index only a subset of the data. Now, I myself have been able to get huge wins using partial indexes, and I would say this is probably my favorite performance-based feature, because if you have a status column and there's only a finite number of statuses, you can create partial indexes on those statuses to get much higher query performance in a lot of use cases. Now, he also mentions the GIN index. GIN is typically used for columns that you want to do a text search on, or maybe JSONB columns, but one thing he mentions is that, quote, it can be perfect for indexing columns where there are a lot of repeating values. So for example, if you have a column with a very finite set of values like Mr., Mrs. and Miss, indexing it this way could actually create a very small index, because, quote, GIN only stores every unique column value only once, as opposed to what a B-tree would do. So this could be a particular use case if you wanted to save storage and potentially get better query performance. He talks about BRIN indexes, and again, this can be a good use case, again in terms of space savings, but on very well-ordered values like time series data. So that could potentially be a use case for it. And he also mentions Bloom indexes, but they're relatively exotic and actually I haven't had experience using those as of yet. But basically, the indexes are his number one place to start, because they don't require application changes and you can get performance boosts simply by using the right index for the right use case.

The other area he goes into is table partitioning. So this is basically where you have a really large table and maybe you don't tend to access the historical data in it. That's a perfect use case for partitioning, like perhaps by date/time, so you can set up multiple partitions by year, by month, by day. That will enable you to keep in the cache perhaps only the recent hot data, and the data that is accessed infrequently won't be in the cache, but it enables you to get high performance with the more recent things in the cache. And again, depending on how you set it up, it doesn't require any application change. So that's another consideration when you're wanting to achieve terabyte scale. The next area he mentions is tablespaces. So basically, moving your cold data that is accessed infrequently to slower, more inexpensive storage, and keeping your recent or hot data on SSD drives. That's a way to help scale your database, or you can even spread data across multiple tablespaces to get more performance. And he talks about using multi-process features maximally. So that's basically utilizing the new parallel query improvements that PostgreSQL has offered in 9.6, 10 and version 11. And then he talks about query load balancing with replicas. So basically, creating multiple replicas and sending your read-only queries to those replicas. And he does note that he talks about a mirrored mode, and what he means is that you can set up your replication in a synchronous fashion and only have commits complete on the master once the replicas have committed as well. So basically you get one unified view across your master and however many replicas you have; you won't have a delay where some data has been updated on the master but not on the replicas yet. Now, you're going to have a performance hit doing that, so you'll have a little bit of a delay in writes, but at least you'll have a consistent view across all the replicas in the cluster if you wanted to set it up that way. And as he says, this works well only if the read queries are purely OLTP, online transaction processing, basically very fast. So he basically says that these are the areas you first want to tackle.
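Before moving on, here is a rough sketch of a few of the index types he mentions (the table and column names are made up, not from the post):

```sql
-- Hypothetical tables; lightweight special-purpose indexes discussed above.

-- Partial index: only rows in a given status are indexed, so the index stays
-- small even if the table holds years of completed rows.
CREATE INDEX orders_pending_idx ON orders (customer_id)
  WHERE status = 'pending';

-- GIN on a low-cardinality column (needs the btree_gin extension); GIN stores
-- each distinct value once, so it can be much smaller than a B-tree here.
CREATE EXTENSION IF NOT EXISTS btree_gin;
CREATE INDEX orders_status_gin ON orders USING gin (status);

-- BRIN: a tiny block-range summary index, a good fit for well-ordered data
-- such as an append-only, time-ordered events table.
CREATE INDEX events_created_brin ON events USING brin (created_at);
```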
However, you can get into presumably more exotic approaches. One he talks about is foreign tables. So basically, this is using foreign data wrappers, where you can communicate with other database systems such as PostgreSQL or even MySQL or Oracle, or set up compressed data in files on a different file system that keeps it compressed, and then use a foreign data wrapper to be able to access that data. So it's basically kind of a way to scale out, where you're just pulling data from this foreign data source using a foreign data wrapper, and then getting into, essentially, sharding. Now, PostgreSQL doesn't have sharding built into it, but you can start to use foreign data wrappers to be able to kind of achieve that process. And he talks about bringing in table partitioning and having child tables residing on remote nodes. So these are just some ideas in this post, and I highly suggest checking out this one if you're considering how to potentially scale your terabyte database.
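As a very rough sketch of the foreign table idea (the server, database, credentials and table definition here are all made up), pulling data from a remote Postgres node with postgres_fdw might look something like this:

```sql
-- Hypothetical names throughout; scale-out with a foreign data wrapper.
CREATE EXTENSION IF NOT EXISTS postgres_fdw;

-- Point at the remote node that holds the cold or archived data.
CREATE SERVER archive_node
  FOREIGN DATA WRAPPER postgres_fdw
  OPTIONS (host 'archive.internal', dbname 'archive');

CREATE USER MAPPING FOR CURRENT_USER
  SERVER archive_node
  OPTIONS (user 'app', password 'secret');

-- Only the definition lives locally; the rows stay on the remote node.
CREATE FOREIGN TABLE events_2015 (
  id bigint,
  created_at timestamptz,
  payload jsonb
) SERVER archive_node OPTIONS (schema_name 'public', table_name 'events_2015');

SELECT count(*) FROM events_2015;
```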
The next post is How to Manage Connections Efficiently in Postgres, or Any Database, and this is from brandur.org. Basically, he's talking about how you set up a database for your application, things are running fine, and then suddenly you get a fatal error: remaining connection slots are reserved for non-replication superuser connections. So basically, you've run out of connections in Postgres. So what do you do? Generally, people just start increasing the number of connections, but essentially there are background processes that get created for each of these connections, so they have a fair amount of cost. And what he notes here is that there's still a limiting factor, where there are certain processes that use shared memory that kind of become bottlenecks. So apart from the memory, you're potentially going to run into these bottlenecks. And he has a graph here that shows the performance of a simple task degrading as the number of active connections in the database increases. So presumably there's not necessarily a memory limit here, but just contention in the system as it grows from one to 1,000 connections, and you can see the delay in seconds starting to increase as you get up to the 1,000 connection level. And of course, he says what to do at this point is basically bring in a connection pooler. And he talks about how applications a lot of times provide connection pooling. So he's mentioning Rails here, where for Active Record you can configure a connection pool and how large you want it to be. But normally when you hear about connection poolers, it leads to things such as PgBouncer. So even though some of the applications offer connection poolers, you're probably going to want to get to PgBouncer once you're at the hundreds-of-connections stage. And he goes over the basics of PgBouncer and the types of pooling it can do. It can do session pooling, which is not necessarily useful for handling a large number of connections, but he talks about transaction pooling and statement pooling, and how typically your PgBouncer resides in between the nodes accessing your database and the database itself. So again, this was a really good post that kind of talks through connections and ideas to have in mind, not just for Postgres but for any application developer, on how best to work with your connections.
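If you want to see how close you are to the limit he's describing, a quick, generic check from SQL (this isn't from the post itself) looks like this:

```sql
-- How many connections are allowed versus what's currently in use, and how
-- many of those are idle, which is a common hint that a pooler such as
-- PgBouncer would help.
SHOW max_connections;

SELECT state, count(*)
FROM pg_stat_activity
GROUP BY state
ORDER BY count(*) DESC;
```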

[00:11:20] The next post is PgBouncer Monitoring Improvements in Recent Versions, and this is from the okmeter.io blog. Now, this is kind of a follow-on from a post we covered recently, which was USE, RED and real world PgBouncer monitoring. In this one he talks about some things that are new in version 1.8. There are some new statistics you can get, like total transaction time. So this is the total number of microseconds spent by PgBouncer when connected to PostgreSQL in a transaction, either idle in transaction or executing queries. And you can use it together with the query time metric to see how often something is essentially idle in transaction, and you can graph it. So you can essentially get an idling percentage using these new stats, to see how often transactions are open but no real database work is happening. So essentially, these are connections that are just hanging out there, not being used, and perhaps in your application you could do things to make this more efficient. They also talk about total query count and total transaction count, so again, two more metrics that you can get access to. And he demonstrates how to use them to get average queries per transaction, and then you can even get the average query time and the average wait time. So if you run PgBouncer, this is definitely a great blog post to check out in terms of your monitoring, for looking at some new or different ways to potentially monitor its behavior.
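Roughly speaking, and assuming PgBouncer 1.8 or newer, those counters come from the admin console's SHOW STATS output, and the derived numbers he describes would look something like this:

```sql
-- Run against PgBouncer's admin console, e.g. psql -p 6432 pgbouncer.
-- The column names referenced below assume PgBouncer 1.8+.
SHOW STATS;

-- Derived metrics along the lines the post describes:
--   idling percentage          ~ (total_xact_time - total_query_time) / total_xact_time
--   average queries per xact   ~ total_query_count / total_xact_count
--   average query time         ~ total_query_time / total_query_count
```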

[00:12:50] The next post is Managing PostgreSQL Backup and Replication for Very Large Databases. So this post basically goes over why you would need to do backups, and he lists some items here. He goes over what I tend to call logical backups, using pg_dump, or pg_dumpall to dump a whole database cluster, and you can then use the pg_restore command to restore them. Now, from my perspective, a logical backup is a great thing to do when your database is relatively small. However, once you start scaling, you're going to want to move to essentially physical backups that utilize point-in-time recovery. Generally this will be faster for larger databases and you can restore to a point in time, whereas with logical backups you can essentially only restore to the point in time at which the dump was started. And he talks about setting up point-in-time recovery backups, where you're using pg_basebackup to take a base backup of the data files, but then you also need to be copying over the WAL files as well. And you could copy over the WAL files using an archive command, or you could use the streaming protocol with pg_receivewal. And I have a few tutorials on this if you want to check out scalingpostgres.com. But it goes over some of the advantages of point-in-time recovery and some of the potential disadvantages. Now, he advocates using both strategies for backing up your database, but once you get up to really large sizes, the logical backup takes a ton of resources and is really hard to do, so generally you're kind of left using the physical backup process. Now, you can use just the tools that are provided by community Postgres, which are pg_basebackup and pg_receivewal, or using the archive command. However, there are other tools, again open source tools, like Barman that help coordinate this process, or pgBackRest. So those are two other choices you can explore to help coordinate this process, and they say they're using Barman at their organization. So if you're rethinking your backup and restore process, this is definitely a blog post to check out.
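As a minimal sketch of the WAL archiving side (the archive path is made up, and in practice you'd likely use pg_receivewal or a tool such as Barman or pgBackRest rather than a bare copy command):

```sql
-- Enable WAL archiving so a base backup plus archived WAL allows
-- point-in-time recovery. Changing archive_mode requires a server restart.
ALTER SYSTEM SET wal_level = 'replica';
ALTER SYSTEM SET archive_mode = 'on';
ALTER SYSTEM SET archive_command = 'cp %p /mnt/wal_archive/%f';

-- The base backup itself is then taken from the shell, for example:
--   pg_basebackup -D /mnt/base_backup -Ft -z -P
```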

[00:14:58] The next post is Commenting Your Postgres Database. So this actually refers to a previous, similar post that I reported on last week about the COMMENT command. Now, first he's talking about being sure to comment your queries, particularly if it's a very long query, just using single-line comments with the two hyphens here. But you can also comment a schema using the COMMENT command. So you can say COMMENT ON TABLE and give the name of the table, or COMMENT ON COLUMN and give the name of the column, and give it a description. And when you describe that object, it will show the description here. So again, like I mentioned last week, I'm not sure I would necessarily use this COMMENT command in my own database systems, because some of that documentation is elsewhere, but I can imagine for a large database with a large number of users, this feature can be beneficial. So if you're interested, definitely a blog post to check out.
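As a quick illustration (the table and columns here are made up), the two kinds of comments he describes look like this:

```sql
-- A single-line comment inside a query, using two hyphens:
SELECT id, email
FROM users
WHERE deleted_at IS NULL; -- exclude soft-deleted accounts

-- Schema comments, which show up when you describe the object in psql
-- (for example with \d+ users):
COMMENT ON TABLE users IS 'Application user accounts, one row per sign-up';
COMMENT ON COLUMN users.email IS 'Primary contact address, must be unique';
```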

[00:15:54] The next piece of content is actually one that I've mentioned previously, but they just keep adding videos. So the Postgres Open SV 2018 YouTube channel has added many more videos to check out about Postgres. So definitely a great resource, and at some point I will get an opportunity to watch them to see which ones I particularly like. Now, related to that, there's another YouTube channel that has just posted a whole slew of videos, and this is for PostgresConf South Africa, which happened this month. So they've posted about 20 different videos on PostgreSQL. So definitely another resource to check out for recent PostgreSQL presentations.

[00:16:32] And the last piece of content is that, along with Postgres 11, Pgpool-II 4.0 has been released. So if you use Pgpool, or are considering using it, definitely a blog post to check out.

[00:16:44] That does it for this episode of Scaling Postgres. You can get links to all the content mentioned in the show notes. Be sure to head over to Scalingpostgres.com, where you can sign up to receive weekly notifications of each episode, or you can subscribe via YouTube or iTunes. Thanks.
