
Scaling Real-Time Analytics, Covering Indexes, 1,500 Upgrades | Scaling Postgres 8


In this episode of Scaling Postgres, we review articles covering real-time analytics at scale, covering indexes in Postgres 11, 1,500 Postgres upgrades and PostgreSQL vs. MySQL.


Transcript



[00:00:00] In this episode of Scaling Postgres, we discuss real-time analytics at scale, covering indexes, 1,500 Postgres upgrades and Postgres versus MySQL. I'm Creston Jameson, and this is Scaling Postgres episode eight.

[00:00:25] Welcome to Scalingpostgres.com. As a reminder, we prioritize developer-related content first and then follow up with more database administration type topics. So our first article is Building Real-Time Analytics APIs at Scale, and this is from the algolia.com blog.

[00:00:47] And in it they have a search service and they were looking to change the data store that they were using. So they were using Elasticsearch and they started looking for a different data store that would meet their needs better, particularly for their analytics. They investigated Redshift, BigQuery and ClickHouse, but eventually they settled upon using Citus Data. So Citus is an extension for PostgreSQL, and they have a service that they work with as well.

[00:01:20] And they also made note of several extensions that were useful to them from an analytical perspective, namely HyperLogLog and TopN. Now, what I find very interesting about this blog post is not necessarily their usage of Citus, but how they've structured it to achieve a very high level of performance with the data they're using. So it's the technique I was interested in, not as much the sharding, but in terms of aggregating the data to deliver very fast results.
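As a rough sketch of what the HyperLogLog extension gives you (the table and column names here are hypothetical, not taken from their post), you can compute approximate distinct counts cheaply:

```sql
-- The hll extension from Citus provides HyperLogLog sketches.
CREATE EXTENSION IF NOT EXISTS hll;

-- Approximate count of distinct users per day over a hypothetical events table:
SELECT date_trunc('day', created_at) AS day,
       hll_cardinality(hll_add_agg(hll_hash_bigint(user_id)))::bigint AS approx_users
FROM events
GROUP BY 1
ORDER BY 1;
```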

[00:01:55] The aggregation technique starts under the section Creating near real-time analytics. And this discusses how they ingest raw events into the system, periodically aggregate these events into dedicated rollups, and then the actual queries hit the rollup tables to get the results in milliseconds. So they're doing rollups at a five-minute resolution as well as, I believe they mentioned, a day resolution. And they have some of the queries about how they've created these rollups and how they're pretty much refreshed every five minutes. And these rollups allow them to have very fast query performance. And here's a graphical representation in terms of the data streaming into the raw tables. Then they have a five-minute rollup and then a daily rollup. And again, they mentioned the API targets the rollup tables, never the raw tables.
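To make the technique concrete, here is a minimal sketch of a five-minute rollup. The table and column names are hypothetical (the post shows its own queries); this is just the general shape of raw table, rollup table, and the periodic aggregation job:

```sql
-- Raw events land here; the API never queries this table directly.
CREATE TABLE events_raw (
    app_id     bigint      NOT NULL,
    event_type text        NOT NULL,
    created_at timestamptz NOT NULL DEFAULT now()
);

-- Pre-aggregated five-minute buckets; the API queries this instead.
CREATE TABLE events_rollup_5min (
    app_id      bigint      NOT NULL,
    event_type  text        NOT NULL,
    bucket      timestamptz NOT NULL,
    event_count bigint      NOT NULL,
    PRIMARY KEY (app_id, event_type, bucket)
);

-- Run periodically (e.g. every five minutes) to fold new raw events into
-- the rollup; the upsert keeps the job safe to re-run.
INSERT INTO events_rollup_5min (app_id, event_type, bucket, event_count)
SELECT app_id,
       event_type,
       to_timestamp(floor(extract(epoch FROM created_at) / 300) * 300) AS bucket,
       count(*)
FROM events_raw
WHERE created_at >= now() - interval '5 minutes'
GROUP BY 1, 2, 3
ON CONFLICT (app_id, event_type, bucket)
DO UPDATE SET event_count = events_rollup_5min.event_count + excluded.event_count;
```

Queries then hit events_rollup_5min (or a daily rollup built the same way from the five-minute one), which is why results come back in milliseconds.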

[00:02:50] And they have another example of a query about how those are done.

[00:02:55] And they have a big picture overview of the analytics API and the SQL queries that are hitting the Citus database, and then the ingestion pipeline for populating that Citus database. And they talk a little bit about the architecture and how they designed it. So this is a really great blog post to demonstrate how to structure your data to give you the best performance. So I definitely highly suggest checking out this article. It's really great.

[00:03:23] The next article is Postgres 11 highlight: covering indexes.

[00:03:29] So this is a new feature that is hopefully going to be coming in Postgres 11, and this is by Michael Paquier. What this feature will do is let you use index-only scans more frequently. Take the example he uses here, where you have a table with columns A, B and C, and you have a unique index on A and B.

[00:03:57] When you do a select where A and B exist in the WHERE statement or in the columns returned, you can get an index-only scan, which is very efficient. You don't have to go to the heap to pull the data for the table because everything you need is right in the index.
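Here is a minimal sketch of that scenario, together with the new INCLUDE syntax the post highlights (discussed next); table and column names are illustrative:

```sql
CREATE TABLE tab (a int, b int, c int);

-- Postgres 10 and earlier: the unique index covers only the key columns.
CREATE UNIQUE INDEX tab_a_b_idx ON tab (a, b);
-- SELECT a, b ... can use an index-only scan; SELECT a, b, c cannot.

-- Postgres 11: carry c in the index as a non-key payload column.
DROP INDEX tab_a_b_idx;
CREATE UNIQUE INDEX tab_a_b_idx ON tab (a, b) INCLUDE (c);

-- Now this query can also be an index-only scan (given a well-vacuumed table):
EXPLAIN SELECT a, b, c FROM tab WHERE a = 1 AND b = 2;
```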

[00:04:14] So today with Postgres 10 and lower, if you're actually going to select three columns, including a column C that is not in the index, you're going to get a standard index scan, which is still fast, but it's going to have to go to the heap to pull more data. However, they're offering an INCLUDE keyword. So here we're creating an index still just on A and B, but we're including the column C. When you do a query that is just pulling C as an output column, not in the WHERE statement, C will be available and you can get an index-only scan, which will make these queries much, much faster. So C is essentially not a key column of the index, just an included column. That's definitely something to be aware of and track, to see if this feature makes it into PostgreSQL 11.

The next article is idle_in_transaction_session_timeout: terminating idle transactions in PostgreSQL. Now, in terms of PostgreSQL, idle transactions can be very, very bad because they cause problems with vacuum, and they can cause locking when you're trying to alter a table. So basically you want your transactions to happen as quickly as possible, and if something is idle in transaction, that's usually a bug or some type of mistake.

[00:05:58] Not always, but particularly if it's held for an extended period of time. So this post goes over this particular feature: you can set a timeout such that if these idle transactions are detected, Postgres is going to go ahead and close the connection.

[00:06:12] And it discusses a little bit about the problems of bloat and how to simulate it by starting a transaction, doing a select and then doing nothing. But then in a secondary session, you're updating, deleting and inserting, and that's going to cause the table to bloat. Essentially, it's not going to be able to free the rows marked for deletion as a result of these operations.
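The simulation he describes looks roughly like this (my_table and its columns are placeholders):

```sql
-- Session 1: open a transaction, run a query, then do nothing.
BEGIN;
SELECT count(*) FROM my_table;
-- ...no COMMIT or ROLLBACK; this session now shows "idle in transaction".

-- Session 2: meanwhile, churn the table. VACUUM cannot reclaim the dead
-- row versions while session 1's transaction is still open, so the table bloats.
UPDATE my_table SET id = id;
DELETE FROM my_table WHERE id % 2 = 0;
```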

[00:06:35] And this setting essentially sets a session timeout such that Postgres will go ahead and close that connection if it's left open, which helps protect your database from long transactions. Now, he does make a note here: consider setting this not at the database level, but perhaps at a user or a session level. So definitely a setting to be aware of that can help protect and scale your system.
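The parameter itself is idle_in_transaction_session_timeout, available since PostgreSQL 9.6. Following the post's advice, it can be set per session or per role rather than instance-wide (app_user below is a placeholder role name):

```sql
-- For the current session only:
SET idle_in_transaction_session_timeout = '5min';

-- Or persistently for a specific role:
ALTER ROLE app_user SET idle_in_transaction_session_timeout = '5min';
```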

[00:07:02] The next post is from the same blog again, the Cybertec-postgresql.com blog: Get Rid of Your Unused Indexes. So this is a pretty short post. It talks about why you should get rid of them: indexes use up space, they can slow down data modifications such as inserts and updates, and they prevent HOT updates (heap-only tuple updates), which basically means certain updates are more efficient when the column being updated is not included in an index. He goes over, of course, all the benefits and how indexes are used, but still, if they're not used, it is a best practice to go ahead and remove them. And he includes a query to try to identify the unused indexes in your system. So definitely a blog post to check out.
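His query is in the post itself; a typical query of this kind (not necessarily his exact one) checks the statistics views for indexes that have never been scanned and don't enforce a constraint:

```sql
-- Candidate unused indexes: never scanned, not unique, not a primary key.
SELECT s.schemaname,
       s.relname        AS table_name,
       s.indexrelname   AS index_name,
       pg_size_pretty(pg_relation_size(s.indexrelid)) AS index_size
FROM pg_stat_user_indexes s
JOIN pg_index i ON i.indexrelid = s.indexrelid
WHERE s.idx_scan = 0
  AND NOT i.indisunique
  AND NOT i.indisprimary
ORDER BY pg_relation_size(s.indexrelid) DESC;
```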

[00:07:54] The next post is called PostgreSQL Bloatbusters, and this is from the dataegret.com blog. And he goes over the importance of vacuum and how it works to help remove obsolete rows after delete and update statements. But when you're actually running vacuum or autovacuum, it doesn't necessarily free the space in the table: it makes space available for reuse, but it doesn't compact the table once vacuum is run on it. To do that, you would need to use the command VACUUM FULL, but that will lock out the entire table for selects and updates, essentially all activity, while it's run. But it does compact the table. And he mentions two tools you can use that can help you compact the table if this is something you need to do. One tool is called pg_repack and the other is pgcompacttable. So if this is a particular issue you're dealing with and you don't want to run VACUUM FULL, maybe check out these two utilities he mentions in this blog post.
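For reference, the trade-off looks like this (my_table is a placeholder):

```sql
-- Rewrites the whole table into new, compacted storage, but takes an
-- ACCESS EXCLUSIVE lock: all reads and writes block until it finishes.
VACUUM FULL my_table;

-- Plain VACUUM (or autovacuum) marks dead space reusable without shrinking
-- the file, and only takes a lightweight lock.
VACUUM my_table;
```

pg_repack and pgcompacttable, by contrast, run as separate utilities and achieve a similar compaction while the table stays online, with only brief locks.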

[00:08:58] The next post is PostgreSQL Data Types: Date, Timestamp and Time Zones. So there's actually a series of blog posts in the first link here, PostgreSQL Data Types, where he goes over all the different data types. I just picked the one that I found the most interesting, because doesn't everybody love time zones? And I know there's a bit of, I guess I'll say, a difference of opinion on whether time zones should exist in the database or in the application layer, or at least the setting and manipulation of them. For example, I tend to use Ruby on Rails as an application framework, or Phoenix, and they tend to not store the time zone in the timestamp field, whereas it seems a lot of database-focused individuals advocate storing the time zone in the timestamp fields. So I felt this blog post was a good viewpoint about how to use dates, times and timestamps in PostgreSQL, so you may want to check that out to see if you want to move some more of these use cases into the database, as well as check out the other data type blog posts he's done. And again, this is from the tapoueh.org blog.

The next article is a PostgreSQL versus MySQL comparison blog post. And this is from the Dianemphae.com blog, and it is biased towards PostgreSQL, but I felt it was interesting to look through all the different advantages that are listed for Postgres versus MySQL. I know years ago I made this decision myself when I was evaluating which database system I was going to be using for my applications. I had predominantly been using Oracle prior to starting my company, and after evaluating both of these open source databases, I chose PostgreSQL and haven't looked back. But it's a very interesting overview of some of the differences, biased though it may be, and an interesting post to check out.

[00:11:14] The next article is titled, very simply, Postgres 10 upgrade. However, it is a massive post about how they upgraded 1,500 Postgres clusters to Postgres 10, and previously they were on version 9.3. And they go into extensive depth on all the planning for all of these databases and doing the upgrades in an automated fashion with a downtime window of just 15 minutes, with 30 minutes as the outside maximum. And they go through each of the steps and each of the commands to run. So it is a really comprehensive post, and if you're embarking on something similar to this or your own upgrade, I definitely suggest checking it out to see the techniques they used and see if any of those could be used in your upgrade planning that may be coming up. Definitely a great one to check out.
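Their post covers their specific automation; at its core, a 9.3-to-10 upgrade of a single cluster with pg_upgrade looks roughly like this (all paths below are placeholders for your own layout):

```bash
# Run once with --check first to validate the clusters without changing anything.
# Stop both clusters, then run pg_upgrade as the postgres user.
# --link uses hard links instead of copying data files, which is what makes
# short downtime windows feasible on large clusters.
pg_upgrade \
  --old-bindir=/usr/lib/postgresql/9.3/bin \
  --new-bindir=/usr/lib/postgresql/10/bin \
  --old-datadir=/var/lib/postgresql/9.3/main \
  --new-datadir=/var/lib/postgresql/10/main \
  --link
```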

[00:12:15] The next blog post is How to Perform Failover with the PostgreSQL Kubernetes Operator. So this is from the Crunchydata.com blog, and they're discussing a product that they have that uses container technology to do failover across multiple PostgreSQL instances. And essentially it states that it enables users to provision and manage thousands of PostgreSQL databases at scale. So I haven't investigated this particular product, but in the interest of scaling Postgres, this is definitely another approach to take in terms of containerizing Postgres and managing many instances of it. So if this is something you're considering, definitely check out this blog post.

[00:13:09] And the last article is PostgreSQL Replication Failback with pg_rewind. So this is an in-depth video tutorial I set up that talks about doing replication failback using pg_rewind. I had done a previous tutorial talking about easy failback: if you promote a replica and the old primary does not receive any writes after that point in time of promotion, you can easily fail back to it just by getting it to follow the new primary database. However, if you have writes to that old primary, it's not going to be able to follow the new primary. So one solution is to delete the database cluster and do a full restore and resynchronize with the new primary. Another option is to use pg_rewind, which essentially rewinds the WAL to get the cluster back to the state it was in at the time of the promotion, so that old primary can now follow the new primary database cluster. The thing to keep in mind, of course, is that with pg_rewind you may be rewinding transactions that had occurred on that old primary, so it's just something to be aware of. And I go into detail about how to set this up and how to do it, and the errors that you run into. So if you think this might be useful, I welcome you to check it out.
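The pg_rewind invocation itself, run on the stopped old primary, looks roughly like this (paths and connection details are placeholders):

```bash
# The old primary must be shut down cleanly, and the cluster needs
# wal_log_hints=on (or data checksums enabled) for pg_rewind to work.
pg_rewind \
  --target-pgdata=/var/lib/postgresql/10/main \
  --source-server='host=new-primary port=5432 user=postgres dbname=postgres'
# Afterwards, configure recovery to follow the new primary and start it up.
```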

[00:14:40] That does it for this episode of Scaling Postgres. You can get all the links to the content presented in the show notes. Be sure to head over to Scalingpostgres.com, where you can sign up to receive weekly notifications of each episode. Or you could subscribe via YouTube or iTunes. Thanks.
