
Sysadmin Concerns, Power of Indexing, pgbouncer Monitoring | Scaling Postgres 32


In this episode of Scaling Postgres, we review articles covering sysadmin concerns, the power of indexing, pgbouncer monitoring and pg_prewarm.

Transcript



[00:00:00] In this episode of Scaling Postgres, we talk about sysadmin concerns, the power of indexing, PgBouncer monitoring, and pg_prewarm. I'm Creston Jameson, and this is Scaling Postgres episode 32. Alright, I hope you're having a great week. Our first piece of content is actually an announcement. Actually, the first two are announcements. PostgreSQL 11 Beta 4 has been released, and the announcement mentions the changes that have occurred since Beta 3. Most of them seem to be general bug fixes across different features. However, just to call out JIT compilation, which I've mentioned in previous episodes, it says, quote, "JIT compilation is disabled by default. To enable JIT compilation, you must set jit = on in your configuration file or in a session." So it looks like you don't have to enable it at build time, which I believe was something referenced in previous information I had seen about JIT compilation. In other words, it's something you can turn on and then either do a reload of the whole server config or set it within a particular session. So definitely something interesting to check out. But of course you can consult the link to review all the changes since Beta 3.
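As a quick illustration (not quoted from the release notes), here is roughly what enabling JIT looks like in PostgreSQL 11; the cluster-wide setting lives in postgresql.conf, and the session form works from any client:

    -- In postgresql.conf (cluster-wide):
    --   jit = on
    -- Then reload the configuration without a full restart:
    SELECT pg_reload_conf();

    -- Or enable it only for the current session:
    SET jit = on;
    SHOW jit;   -- confirm the current setting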

[00:01:24] The next announcement is that PostGIS 2.5.0 has been released, and it mentions that although this release will work for PostgreSQL 9.4 and above, to take full advantage of what it offers you should be running PostgreSQL 11 Beta 4 and up. So if you use PostGIS, it looks like it's ready for working with PostgreSQL 11.

[00:01:47] The next article is "Running a database on EC2? Your clock could be slowing you down", and this is from the heapanalytics.com blog. This is an interesting case where they saw a significant portion of CPU time being taken up checking the system clock, which they discovered was due in part to the virtualization AWS was using on EC2. At the very end of the post they mention that AWS EC2 is moving to KVM-based virtualization, which largely bypasses this problem. Still, the post is interesting because it gives you insight into how an organization is scaling a petabyte-scale PostgreSQL database, or rather a series of databases. Any content produced by the Heap Analytics blog is something I generally want to be familiar with, because they're really scaling PostgreSQL. One of the insights they share is that they tend to store their data on i3 instances. So they're not using AWS Elastic Block Storage (EBS); they're using the NVMe storage on each instance because of its significantly higher I/O capabilities. Of course, there are downsides to that, because to my understanding that storage is ephemeral: if the instance has to be restarted, it just vanishes. So I'm not sure you would want permanent data storage sitting there, but it does give you really high I/O for particular use cases. Throughout this blog post they link to other articles the Heap Analytics blog has written about how they're scaling Postgres, and all of them are, I would say, pretty much must-reads if you're looking to scale Postgres. Again, this particular issue, where reading the clock took a lot of resources, is probably going away with time, but if you suspect that type of problem you should definitely consult this post. Really, it's an opportunity to learn more about how one team is dealing with a petabyte-scale PostgreSQL database.

[00:03:52] Now, related to that, there was a post on Hacker News that you also might want to consider, in which a database engineer at Heap Analytics discusses why they use EC2 as opposed to RDS. They say they do use RDS, but they run Postgres on EC2 for their primary data store. The first reason he goes into is cost: it's much cheaper not to use the RDS platform.

[00:04:17] They're also able to get higher performance, again because of the NVMe performance they get on the i3s compared with Elastic Block Storage in AWS, along with more flexibility of configuration. He mentions they're actually using ZFS, which enables them to compress their data roughly twofold, and the setup also gives them the ability to do some introspection. So again, another post that gives you insight into how someone is running what is essentially a petabyte-scale set of PostgreSQL database clusters. I would say both of these are essentially must-reads if you're looking to scale Postgres.

[00:04:54] The next post is "Power of Indexing in PostgreSQL", on the 2ndQuadrant.com blog, and essentially it's a YouTube video, I believe about 50 or 55 minutes long. It's an excellent talk that explains, better than I've found elsewhere, why Postgres does some of what it does. He goes in depth into indexes: how they work, how they interact with the heap as well as B-tree indexes, and why Postgres does certain things. He basically takes some of the best practices you hear people repeat and explains why they are best practices, how things work internally, and why certain approaches work better. So if you're a developer or a DBA, knowing how to use indexing enables you to get the most performance out of your database. If you're looking to scale, I highly suggest watching this YouTube video, as it was definitely informative for me on a lot of different levels.
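As a quick, hypothetical illustration (not taken from the talk; table and column names are made up) of the kind of B-tree indexing it discusses:

    CREATE TABLE measurements (
        id        bigserial PRIMARY KEY,
        sensor_id integer,
        recorded  timestamptz,
        value     numeric
    );

    -- A composite B-tree index matching a common query pattern:
    CREATE INDEX measurements_sensor_recorded_idx
        ON measurements (sensor_id, recorded);

    -- With the index in place, a selective lookup can use an index scan
    -- instead of a sequential scan over the whole heap:
    EXPLAIN SELECT *
    FROM measurements
    WHERE sensor_id = 42
      AND recorded >= now() - interval '1 day';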

[00:05:54] The next post covers USE, RED, and real-world PgBouncer monitoring, and this is from the okmeter.io blog. In it he covers two monitoring concepts: one is USE, which is an acronym for Utilization, Saturation and Errors, and the other is RED, for Rate, Errors and Duration. He essentially uses these frameworks to look at monitoring PgBouncer and how Okmeter goes about it. Now, in my experience the PgBouncer documentation is okay, but it's relatively hard to find in-depth articles on it, so pretty much any PgBouncer article I find, I'm going to put up as something to consult. This one goes a little into the internals of how PgBouncer works, then looks at some of the methods you could use to monitor it, and talks a bit about what Okmeter does in terms of monitoring. So if you use PgBouncer, or are considering using it in the future, this is definitely a blog post I suggest checking out.
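As a rough sketch (not from the article), these PgBouncer admin-console commands expose the kind of rate, saturation and duration numbers the USE and RED frameworks look at; the host, port and user shown are assumptions about a typical setup:

    -- Connect to the PgBouncer admin database, for example:
    --   psql -h 127.0.0.1 -p 6432 -U pgbouncer pgbouncer
    SHOW STATS;   -- per-database query/transaction counts, bytes, and query/transaction times
    SHOW POOLS;   -- per-pool client/server connections, including cl_waiting (a saturation signal)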

[00:07:00] The next article is about autoprewarm, a new piece of functionality in pg_prewarm, and this is from the Postgres Rocks enterprisedb.com site. It covers pg_prewarm, which got a new enhancement in PostgreSQL 11: a feature that automatically warms the shared buffers with the same pages they held before the last server restart. So it basically pre-warms your cache and keeps that page list up to date, and the article shows you the commands to set it up. Essentially it creates a background process that periodically records information about the pages in the shared buffers, so the cache can be rebuilt in the case of a restart. He shows an example where he runs a query immediately after a restart with prewarm set up and gets shared buffer hits for the aggregate and the sequential scan, whereas without prewarming it had to do reads.

[00:08:01] It also goes into some additional tests showing that with prewarm on, transactions per second goes up immediately after a restart, whereas without it you have a longer ramp-up time to reach the same state. So if you think this could be an advantage, or you want to take advantage of it in PostgreSQL 11, definitely a blog post to check out.
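For reference, a minimal sketch of enabling autoprewarm in PostgreSQL 11; the settings shown are real pg_prewarm parameters, while the table name in the manual call is just a placeholder:

    -- postgresql.conf (changing shared_preload_libraries requires a restart):
    --   shared_preload_libraries = 'pg_prewarm'
    --   pg_prewarm.autoprewarm = true            -- launch the autoprewarm background worker
    --   pg_prewarm.autoprewarm_interval = 300s   -- how often to dump the shared-buffer block list

    -- The extension can also warm a relation manually:
    CREATE EXTENSION IF NOT EXISTS pg_prewarm;
    SELECT pg_prewarm('my_table');   -- 'my_table' is a hypothetical table name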

[00:08:23] The next post is "On Rocks and Sand", and this is from the 2ndQuadrant.com blog. They're talking about column order in tables, and the fact that Postgres actually adds alignment padding based on which columns are next to each other. When your column types are distributed in a random order in the table, there's a chance that a lot of padding will be introduced, which causes the table to be larger than the data it actually contains. If you're able to plan your column order based on the data types, you can achieve space savings in the data being stored. Now, they have a comment here, because it says, quote, "it's not uncommon for column orders to be dictated by a hurried dev, simply jotting down attributes, or an ORM that generates output from an arbitrarily hashed key position." In my experience doing development, new features need to be supported and new columns get added, and that causes havoc with any kind of ordering. You can establish an ordering, but in the course of development it's "oh, you need feature X, you need feature Y, you need feature Z", and it's hard to maintain any column order in that case. So some of the advice in this post is hard to follow if you have more of a transactional database that new features are being added to on a consistent basis. However, if you have a more analytical database, where your data model doesn't change that significantly over time, you could get some space savings by following the advice in this post to order your columns by data type as they suggest: basically putting bigints first, then timestamps, then ints, smaller ints, booleans, and following up at the end with numerics and/or text fields. In the example they were looking at, they saw a 21% space saving. So again, with a rapidly changing application I think this will be kind of hard to do, but if you're working with a more analytical database, where you have more flexibility with column order or can at least keep it consistent, following this advice could potentially get you some space savings.
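As an illustrative sketch (table and column names are made up), this is the idea of ordering columns from the widest fixed-size types down to variable-length ones to reduce alignment padding:

    -- A boolean immediately before a bigint forces 7 bytes of alignment padding per row:
    CREATE TABLE events_unordered (
        active     boolean,
        user_id    bigint,
        flags      smallint,
        created_at timestamptz,
        note       text
    );

    -- Widest fixed-size columns first, variable-length columns last:
    CREATE TABLE events_ordered (
        user_id    bigint,
        created_at timestamptz,
        flags      smallint,
        active     boolean,
        note       text
    );

    -- After loading the same rows into both, compare on-disk sizes:
    SELECT pg_size_pretty(pg_relation_size('events_unordered')) AS unordered_size,
           pg_size_pretty(pg_relation_size('events_ordered'))   AS ordered_size;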

[00:10:38] The next post is "Fun with SQL: Unions in Postgres", from the citusdata.com blog, and it's a relatively simple and short post; their SQL series tends to be like that. It covers UNION and UNION ALL. Basically, UNION enables you to merge the results of two queries together and automatically does a distinct across the rows that are selected, whereas UNION ALL returns all rows without removing duplicates. So if you think you might have some use cases that could benefit from a union, definitely a blog post to check out.
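A tiny, hypothetical example (the table names are made up) showing the difference:

    -- UNION removes duplicate rows from the combined result:
    SELECT email FROM customers
    UNION
    SELECT email FROM newsletter_subscribers;

    -- UNION ALL keeps every row, duplicates included, and skips the dedup step:
    SELECT email FROM customers
    UNION ALL
    SELECT email FROM newsletter_subscribers;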

[00:11:10] The next post is "Understanding Check Constraints in PostgreSQL", and it basically goes over check constraints at the column and table level and how to set them up. In terms of scaling, it shows how you can add constraints as not valid, as a way to add them gracefully to an existing table. So if you want to learn more about check constraints, definitely a blog post to check out.
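A brief sketch (table, column and constraint names are illustrative) of column-level and table-level check constraints, plus adding one as NOT VALID and validating it later:

    CREATE TABLE orders (
        id       bigserial PRIMARY KEY,
        quantity integer CHECK (quantity > 0),   -- column-level constraint
        price    numeric,
        discount numeric,
        CHECK (discount <= price)                -- table-level constraint
    );

    -- On an existing, busy table, NOT VALID skips the immediate full-table scan;
    -- only newly inserted and updated rows are checked right away:
    ALTER TABLE orders ADD CONSTRAINT price_nonnegative CHECK (price >= 0) NOT VALID;

    -- Later, validate the existing rows (this takes a lighter lock than a normal add):
    ALTER TABLE orders VALIDATE CONSTRAINT price_nonnegative;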

[00:11:34] That does it for this episode of Scaling Postgres. You can get links to all the content mentioned in the show notes. Be sure to head over to scalingpostgres.com, where you can sign up to receive weekly notifications of each episode, or you can subscribe via YouTube or iTunes. Thanks.
