
Global Indexes, Caching Aggregates, Vacuum Processing, Effective Cache Size | Scaling Postgres 91


In this episode of Scaling Postgres, we discuss global indexes, ways to cache aggregates, how vacuum processing works and the purpose of effective cache size.


Transcript



[00:00:00] In this episode of Scaling Postgres we talk about global indexes, caching aggregates, vacuum processing, and effective cache size. I'm Creston Jameson and this is Scaling Postgres episode 91.

[00:00:21] Alright, I hope you're having a great week. Our first piece of content is Proposal for Global Indexes in PostgreSQL. This is from the Percona.com blog, and basically PostgreSQL does not have global indexes for partitioned tables. The indexes actually exist on each partition table, and for explanation purposes they're calling these local indexes in this post. But according to this post, they've sent a proposal to the community for global indexes. So this is, as they describe it, a one-to-many index, in which one index maps to all the different partition tables. Now, one area where I think this is incredibly important for data integrity is being able to have a unique constraint that crosses all partitions, because I've run into a situation where I wanted a unique constraint that is actually not part of the partition keys, and you can't really do it. You can set a unique constraint on the local partition, but you can't really have a global one. If there were a global index, presumably that would be supported. The other area they're considering is that when you want to query across partitions, you could potentially get better performance with a global index. They talk a little bit about the proposed syntax and some potential changes that would be needed. Now, one disadvantage they mention is that you're probably going to take a hit on writes once this is implemented. But this is an interesting feature that seems to be in discussion, so we'll see what comes of it.

The next post is Pulling the Trigger: How do you update counter caches in your Rails app without Active Record callbacks? Because it's talking about Active Record and Rails, this is about Ruby on Rails, but a lot of it pertains to the PostgreSQL database and triggers and locking and things of that nature. So they have a scenario where they have an orders table and users, and they want to collect aggregates on them, such as how many total orders per user, for example.
And they filled up a database with 10,000 users and 100,000 orders, and basically the performance when just querying it is not what they would hope. So their idea was to essentially cache, or create, a user stats table that, by user, stores the total orders amount and the total orders count. Now, with Ruby on Rails, one method you could use is Active Record callbacks or counter caches to do it, but as they say here, that really won't work for the amount portion. So they wanted to use a trigger. Now, they're using a Ruby gem called HairTrigger to actually implement this, but the raw SQL code is here; the gem just makes things a little bit easier to do. So they're just going to insert into the user stats table, summing the orders amount and counting the orders, and if there is a conflict with the user ID, meaning that record is already there, it's going to do an update and set what the new orders amount and orders count are. But they ran into a problem with this because they had some race conditions. So they created a job that ran on four separate threads and tried to update, and they hit race conditions. Now, they mention they could change the isolation level, but what they actually wanted to do is use a transaction-level advisory lock. These are application-level locks, and all they had to do was take an application lock using the user ID. That was able to do these inserts or updates and avoid the race condition they were encountering before. And they have another way they did it, which they're calling the lock-free alternative, where they use deltas. So essentially they're doing an insert, or on update they're just adding the new order's amount and count to the existing totals. And they say that as long as you don't use any subqueries, race conditions should not be possible. So these are two possible implementations to cache aggregates in your PostgreSQL database.
Oh, and I should mention, this is from the EvilMartians.com blog.
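To illustrate the delta-based, lock-free approach, here's a minimal sketch using Python's sqlite3, which also supports the INSERT ... ON CONFLICT DO UPDATE upsert syntax. The table and column names are made up for the example, and the original post implements this as a PostgreSQL trigger rather than in application code:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, amount INTEGER);
CREATE TABLE user_stats (
    user_id INTEGER PRIMARY KEY,
    total_orders_amount INTEGER NOT NULL,
    total_orders_count  INTEGER NOT NULL
);
""")

def record_order(user_id, amount):
    # Insert the order, then apply a delta to the cached aggregates:
    # add this order's amount and bump the count by one. No subquery
    # re-reads the orders table, which is what avoids the race.
    conn.execute("INSERT INTO orders (user_id, amount) VALUES (?, ?)",
                 (user_id, amount))
    conn.execute("""
        INSERT INTO user_stats (user_id, total_orders_amount, total_orders_count)
        VALUES (?, ?, 1)
        ON CONFLICT (user_id) DO UPDATE SET
            total_orders_amount = total_orders_amount + excluded.total_orders_amount,
            total_orders_count  = total_orders_count + 1
    """, (user_id, amount))
    conn.commit()

record_order(1, 100)
record_order(1, 250)
record_order(2, 40)
print(conn.execute(
    "SELECT total_orders_amount, total_orders_count FROM user_stats WHERE user_id = 1"
).fetchone())  # (350, 2)
```

Because each write only adds a delta instead of recomputing the aggregate with a subquery, concurrent writers can't overwrite each other's totals, which is the property the post relies on.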

[00:04:40] The next post is An Overview of Vacuum Processing in PostgreSQL, and this is from the Severalnines.com blog. It gives a general overview of vacuum and why it exists. So the first thing they mention here is that PostgreSQL does not do in-place updates. When there's a delete, it actually marks the row as dead, and it must be vacuumed later. On an update, the current row is marked as dead and a new row is inserted in its place, essentially like a delete and then a new insert. Because of this, these rows must be vacuumed up once they're no longer visible to any other sessions connected to the database. Along with it, they describe the visibility map, which tracks which pages contain only tuples visible to all transactions. They talk a little bit about the free space map and the importance of freezing old transaction IDs. Then they cover vacuum's responsibilities: scanning all the pages of tables to find the dead tuples, freezing old tuples, removing the index tuples pointing to the dead tuples, removing the dead tuples from the pages of a specific table and reallocating the live tuples on the page, updating the free space map and the visibility map, truncating the last page if possible, and updating all corresponding system tables. So that's a fair amount of work vacuum has to do. Now, they also mention that this does not, of course, actually free up space on disk. It makes space available for new inserts within the database system itself, but you're not going to save disk space. However, a full vacuum, which rewrites the data, does free up disk space. The disadvantage, of course, is that it takes an exclusive lock on the relation, so it prevents any selects, updates, inserts, or deletes from happening. So generally you don't want to do this on a running production database.
There are certain specialized tools or extensions you can use to do the equivalent of a vacuum full. Then they discuss the purpose of autovacuum and go into a little more detail about vacuum and full vacuum. So if you're interested in learning a bit more about vacuum and its processing, definitely a blog post to check out.

The next post is Effective Cache Size: What It Means in PostgreSQL, and this is from Cybertec-postgresql.com. This is a setting in your postgresql.conf file that gives the database insight into how much memory on the system is potentially available for file system caching, because the greater that amount, the more likely an index will be in memory, and the more likely PostgreSQL will use an index scan. So they have a scenario they imagine here where you have a database system with 100 GB of RAM, and they say okay, maybe 2 GB is for the operating system and 3 GB for PostgreSQL itself. Then you set the shared buffers, in their example, to 24 GB, and pretty much whatever's left, at least on a Linux system, would typically be used for the file system cache. Then they show the actual comments from the C code about how this plays out as you get into larger memory sizes. The benefit of effective cache size is essentially that it increases the probability that an index scan will be used, because with larger memory sizes more of the file system cache can be used for caching, and this setting is a way to communicate to PostgreSQL roughly how big that cache is. So if you're interested in learning more about this setting, definitely a blog post to check out.
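The arithmetic in that scenario can be sketched like this. The numbers come from the episode's example, and the guideline of setting effective_cache_size to roughly shared_buffers plus the file system cache is a common rule of thumb, not something PostgreSQL enforces:

```python
# Hypothetical memory budget from the scenario above, in GB.
ram = 100
os_overhead = 2        # reserved for the operating system
postgres_overhead = 3  # PostgreSQL processes themselves
shared_buffers = 24    # postgresql.conf: shared_buffers

# On Linux, whatever is left over typically ends up as file system cache.
fs_cache = ram - os_overhead - postgres_overhead - shared_buffers
print(fs_cache)  # 71

# Common guideline: effective_cache_size is roughly shared_buffers plus
# the OS file system cache. It is only a planner hint; PostgreSQL never
# allocates this memory itself.
effective_cache_size = shared_buffers + fs_cache
print(effective_cache_size)  # 95
```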

[00:08:18] The next piece of content is actually a YouTube channel, and they have just released a number of videos, about 20 or so, from PostgresConf South Africa, which happened in October. This is on the PostgresConf South Africa YouTube channel. So definitely some additional content to check out with regard to PostgreSQL.

[00:08:41] The next post is It's All in Database, and this is from Rafia Sabih's blog on Blogspot.com. It talks about sizing and how to determine the size of your different tables or relations in PostgreSQL. The post covers the \dt+ table-name command to give you an indication of a table's size. You can also use the system functions pg_table_size and pg_relation_size, and they do note that pg_table_size won't give you the size of the indexes; for that you have to use pg_total_relation_size. The post also mentions an extension called pgstattuple that gives you several additional metrics if that's of interest to you, particularly if you're trying to find bloat or things of that nature. So if you're interested in different methods of determining the size of your database, definitely a blog post to check out.

[00:09:33] The next post is Twelve Common Mistakes and Missed Optimization Opportunities in SQL, and this is from Hakibenita.com, and the post describes exactly that. The twelve are: be careful when dividing integers; guard against division by zero errors (both of those seem like general programmatic things to be aware of in any language); know the difference between UNION and UNION ALL, basically that UNION excludes duplicates; be careful when counting nullable columns; and be aware of time zones.

[00:10:07] Then: avoid transformations on indexed fields. Basically, if you apply a function like lower or upper to a column, the index may not be used unless it's an expression index. Use BETWEEN only for inclusive ranges. To avoid this issue, I generally only use greater than or less than, and I usually don't use the BETWEEN syntax. Add faux predicates, so basically look for indexed columns that could help you get better performance. Inline CTEs, which is something they recently changed the default for in PostgreSQL 12. Fetch only what you need, which is generally good advice. Reference the column position in GROUP BY and ORDER BY, so you can use integers to reference the column position as opposed to the exact name. And format your query. Now again, that one is a personal preference, but it's something that is mentioned. So a lot of these items are on the simpler or more basic side, but if you want to become more familiar with some of them, definitely a blog post to check out.
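A few of these pitfalls are easy to demonstrate. Here's a minimal sketch using Python's sqlite3; note that SQLite, unlike PostgreSQL, silently returns NULL on division by zero rather than raising an error, so the NULLIF guard is shown purely for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Integer division truncates: 7 / 2 is 3, not 3.5.
print(conn.execute("SELECT 7 / 2, 7 / 2.0").fetchone())  # (3, 3.5)

# Guard against division by zero with NULLIF. PostgreSQL raises an
# error on x / 0; NULLIF turns the divisor into NULL so the whole
# expression yields NULL instead.
print(conn.execute("SELECT 10 / NULLIF(0, 0)").fetchone())  # (None,)

# Counting nullable columns: COUNT(*) counts rows, COUNT(col) skips NULLs.
conn.executescript(
    "CREATE TABLE t (v INTEGER); INSERT INTO t VALUES (1), (NULL), (3);"
)
print(conn.execute("SELECT COUNT(*), COUNT(v) FROM t").fetchone())  # (3, 2)

# UNION removes duplicate rows; UNION ALL keeps them (and is cheaper).
print(conn.execute("SELECT 1 UNION SELECT 1").fetchall())      # [(1,)]
print(conn.execute("SELECT 1 UNION ALL SELECT 1").fetchall())  # [(1,), (1,)]
```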

[00:11:06] The next post is Similarity in Postgres and Rails Using Trigrams. Now, a lot of this is actually based on Rails, but they discuss trigrams and the pg_trgm extension, and they give a rough Ruby implementation before going into doing trigrams in Rails. They enable the pg_trgm extension and then essentially use a select to get the similarity of an entered word against cities, like Toronto versus the cities table. So if you want to learn more about the pg_trgm extension, definitely a blog post to check out.
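As a rough illustration of how trigram similarity works, here's a small pure-Python sketch mimicking pg_trgm's approach: each word is padded with two leading spaces and one trailing space, broken into three-character windows, and similarity is the ratio of shared trigrams to total distinct trigrams. The real extension also handles word boundaries and non-alphanumeric characters, which this sketch ignores:

```python
def trigrams(word):
    # pg_trgm-style extraction: lowercase the word, pad it with two
    # leading spaces and one trailing space, then take every
    # three-character window as a trigram.
    padded = "  " + word.lower() + " "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}

def similarity(a, b):
    # Shared trigrams divided by total distinct trigrams, the same
    # ratio pg_trgm's similarity() function is based on.
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)

print(similarity("toronto", "toronto"))   # 1.0
print(similarity("toronto", "torronto"))  # 0.7
print(similarity("toronto", "london"))    # a small value near 0
```

A query like the one in the post then just ranks rows by this score and keeps the matches above some threshold.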

[00:11:43] The next post is Enhancing PostgreSQL 12 Security with the CIS Benchmark. This is from Crunchydata.com, and this is a Center for Internet Security benchmark that they've been publishing, and it has now been released for version 12. So if you want to find out more information with regard to that, definitely check this out.

[00:12:04] The next post is PostgreSQL Development and Configuration with Puppet. So if you're looking for a configuration management tool to be able to deploy your PostgreSQL instances, definitely a blog post to check out. Personally, I use Ansible, but Puppet is also a viable alternative. And the last post is OGR FDW Spatial Filtering, and the headline is that the OGR FDW now pushes spatial filters down to remote data sources. So this is definitely related to Postgres, and it's an area I'm not too knowledgeable about, but I know they're talking about foreign data wrappers, and OGR is apparently a data model. So there is a way to push down queries to the data source you're connecting to, which presumably would result in better performance. So if you're interested in that, definitely a blog post to check out.

[00:12:58] That does it for this episode of Scaling Postgres. You can get links to all the content mentioned in the show notes. Be sure to head over to Scalingpostgres.com, where you can sign up to receive weekly notifications of each episode, or you can subscribe via YouTube or iTunes. Thanks.
