
Insert vs. Update, Select * Performance, Debezium Set Up, Standardizing Data | Scaling Postgres 142


In this episode of Scaling Postgres, we discuss inserts vs. updates, select * performance, how to set up Debezium and methods to standardize data.


Transcript



[00:00:00] In this episode of Scaling Postgres, we talk about insert versus update, select star performance, Debezium setup and standardizing data. I'm Creston Jamison, and this is Scaling Postgres, episode 142. Alright, I hope you, your friends, family and coworkers continue to do well.

Our first piece of content is Insert-only data modeling to smooth peaks on slow disks. This is from Cybertec-postgresql.com. They're talking about a situation they ran into with a client who had a database server with slow disks that wasn't that performant, but needed a lot of updates to happen within a particular period of time. With an update, essentially, you create a new row and leave the old row behind to be vacuumed up later. So they discussed moving to an insert-only model so that they don't have as much random I/O from updates and also don't have to deal with the vacuum consequences of doing that. They talk about the benefits of choosing to do more inserts versus updates. For example, you get more sequential I/O patterns when you're doing inserts. You get a lot fewer full page writes after a checkpoint, so less WAL is being generated. You get faster single transactions compared to an update. You also get the action history for free, essentially, because it's all there. And the other thing not mentioned is locking: with an insert you usually don't have to take a row lock, whereas with an update you always do. He also noted that in his example they weren't using fast modern SSDs, which would make the random I/O of these updates easier to handle, and the active data set did not fit well into shared buffers. Now, he mentioned the downsides of doing this: you need a more complex schema, because now you're inserting all the data changes into a table and eventually you're going to want to update or backfill your main table with those data values. You get a bigger total disk footprint because, again, you're storing more data; every update has its own row. You need to schedule some maintenance work over non-peak hours, basically, to update that table. And you can't benefit from heap-only tuple (HOT) updates. Then he goes through his testing process, where he tested 100 million rows and did some random updates against those 100 million rows. He shows his test system here: four virtual CPUs, eight gigs of RAM, and he's using just hard disk drives, so not SSDs, which is going to impact things even more. But you can see that the update method took three milliseconds, whereas the insert-only method took just one millisecond, which constitutes about a 70% performance improvement. And the standard deviation is wider too: eleven milliseconds for updates and six milliseconds for the insert-only queries, which constitutes about a 50% reduction in the standard deviation. So the inserts definitely won out over the updates.

Now, I see a fair number of applications where developers essentially need to keep a counter of something, so it constantly gets updated whenever any data changes. But you need to be very careful with that, because if that row gets updated, say, 1,000 times a day, you've generated 1,000 dead copies of that row, since an update does an insert and then flags the old row for eventual vacuum. So essentially, however many times you're updating it, you're duplicating that full row that many times, and then all those rows have to be vacuumed up. So sometimes it is better and more performant to insert all the new data coming in and then periodically roll an update into the main table that you're interested in. But definitely an interesting blog post if you want to compare the difference between using an insert-only method to maintain your data versus updates.
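To make that pattern concrete, here is a minimal sketch of an insert-only setup along those lines. The table and column names (account_balance, balance_change) are hypothetical, not taken from the post.

-- Instead of updating a counter row in place thousands of times a day...
-- UPDATE account_balance SET balance = balance + 10 WHERE account_id = 42;

-- ...append each change as its own row: sequential I/O and no dead tuples
-- piling up in the main table.
CREATE TABLE balance_change (
    account_id bigint      NOT NULL,
    delta      numeric     NOT NULL,
    created_at timestamptz NOT NULL DEFAULT now()
);

INSERT INTO balance_change (account_id, delta) VALUES (42, 10);

-- Then, during non-peak hours, roll the accumulated changes up into the
-- main table in one pass (and clean out the applied change rows as part of
-- the same maintenance job).
UPDATE account_balance ab
SET    balance = ab.balance + c.total
FROM  (SELECT account_id, sum(delta) AS total
       FROM   balance_change
       GROUP  BY account_id) c
WHERE  ab.account_id = c.account_id;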

[00:03:55] The next piece of content is reasons why select star (asterisk) is bad for SQL performance. This is from Tanelpoder.com. Now, this is not a Postgres-specific post; most of the examples here are based on Oracle, but he's talking about the reasons why you want to avoid select star if you can. And he goes into very specific reasons, looking at bytes processed and a lot of analysis within Oracle itself, on why you should actually be selecting only the columns that you're interested in. The areas he covers are: it generates increased network traffic, because however many rows you're pulling over, pulling every single column is going to be a lot larger than only choosing specific columns to send from the server to the client. You get increased CPU usage on the client because it has to accept all of this data and process through it. Some query plan optimizations are not possible. You're also going to increase server-side memory usage, and of course increase the CPU usage on the server side because, again, there are more columns to deal with. Hard parsing and optimization takes more time (this may be Oracle-specific), and cached cursors take more memory in the shared pool; well, that definitely takes more memory in the shared buffers in Postgres. So it's an interesting post describing all the reasons why you really should target the columns you want to get the best performance out of your relational database system. So if that is of interest to you, you can check out this blog post.

The next piece of content is Setting up PostgreSQL for Debezium. This is from elephanttamer.net, and he goes through the process of setting up Debezium, which is a change data capture solution for Postgres. So basically you use Postgres logical decoding to read the WAL stream and use Debezium to grab the updates that are happening, and he goes through the process. First, you want to make sure your wal_level is set to logical. You want to pick a specific Debezium version, and he's advocating, because of a pretty bad bug, for something greater than 1.3.0. Be sure to check your database logs to make sure that it is connecting successfully. He advocates creating the slot and publication manually, as opposed to letting Debezium do it, because he's had issues where it could silently block all writes to tables that don't have a primary key, since the publication defaults to essentially replicating everything. He goes through some of the configuration for Debezium and then, of course, advocates monitoring closely, because you don't want consumption from the slot to stop; otherwise you're going to have WAL buildup on the primary. So you're going to want to make sure that that logical slot is being consumed on a regular basis by Debezium, and that it doesn't fail, or restarts if it does fail. So definitely a good blog post about how to handle Debezium on PostgreSQL.
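As a rough sketch of the manual setup he recommends, assuming a Debezium connector that uses the built-in pgoutput plugin, it might look something like the following; the slot, publication and table names are placeholders.

-- wal_level must be logical (changing it requires a restart).
ALTER SYSTEM SET wal_level = 'logical';

-- Create the replication slot yourself rather than letting Debezium do it.
SELECT pg_create_logical_replication_slot('debezium_slot', 'pgoutput');

-- Create a publication limited to the tables you actually want captured,
-- instead of the default behavior of publishing every table, which can
-- block writes to tables that lack a primary key.
CREATE PUBLICATION dbz_publication FOR TABLE public.orders, public.customers;

-- Monitor slot lag so an idle or failed consumer doesn't cause WAL buildup
-- on the primary.
SELECT slot_name,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM   pg_replication_slots;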

[00:06:46] The next piece of content is Using Postgres for Statistics: Centering and Standardizing Data. Now, this is a follow-up to a number of posts about data science that were done at Crunchydata.com, and here he is talking about generating the z-score for a particular set of data. This is a means of standardizing or normalizing the data for analysis, and he was wondering how to come up with this calculation for each piece of data. At first he thought about using a generated column, so that right next to the column of interest you could generate this z-score. But the problem is that you need to take the mean and the standard deviation of the whole column, and a generated column can't reference data outside of its own row. Then he covers a number of different options he came up with, including transforming columns in the original table, creating a whole new table for z-scores, and creating a view. The approach he seemed to like was creating a new table just for the aggregates being generated, and then using a trigger on the original table to update the averages and standard deviations that exist in that separate table, so that doing the z-score calculation would be pretty fast. But he ultimately chose another path, where he did all the calculations in a new column on the table. So if you want to learn about different techniques to set up a z-score for analysis of your data, maybe you want to check out this post from Crunchydata.com.
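As a small illustration of the z-score idea (not the exact code from the post; the measurements table and reading column are made up), the one-off calculation can be done with window aggregates, and his preferred aggregate-table approach keeps the per-row math cheap:

-- One-off: compare each value against the mean and standard deviation of
-- the whole column using window aggregates.
SELECT id,
       reading,
       (reading - avg(reading) OVER ())
         / NULLIF(stddev_samp(reading) OVER (), 0) AS z_score
FROM   measurements;

-- The aggregate-table approach: keep the statistics in a small side table
-- that a trigger on measurements refreshes, so the z-score itself stays a
-- cheap per-row calculation.
CREATE TABLE measurement_stats (
    avg_reading    numeric,
    stddev_reading numeric
);

SELECT m.id,
       (m.reading - s.avg_reading) / NULLIF(s.stddev_reading, 0) AS z_score
FROM   measurements m
CROSS  JOIN measurement_stats s;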

[00:08:09] The next piece of content is Postgres: the batteries included database. This is from Crunchydata.com, and he's talking about the idea that essentially everything is included in Postgres, so the batteries are included, and he covers this in a few different areas. You have a solid database foundation, meaning it has MVCC at its core. It has a variety of data types, such as range types, timestamps and JSONB columns, to handle any sort of data you can throw at it. It has expressive SQL, with support for things like common table expressions and window functions, and lets you define your own functions. It also has full text search bundled into it that is, as he phrases it from another post, "just good enough". Then there's the extensive geospatial support when you're using it with PostGIS, which is an extension that handles that. Lastly, he covers the different extensions that help enhance the PostgreSQL database, from the frequently used pg_stat_statements, to track statements that are running in your database, to pg_partman for doing partitioned tables, foreign data wrappers, and a data science one called MADlib, as well as others. So basically this is a quick post on all the different features that make Postgres a batteries-included database. So check this out if you're interested.
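As a tiny example of that expressive SQL, here's a common table expression feeding a window function; the orders table is hypothetical, not from the post.

WITH monthly_totals AS (
    SELECT date_trunc('month', ordered_at) AS month,
           sum(amount)                     AS total
    FROM   orders
    GROUP  BY 1
)
SELECT month,
       total,
       total - lag(total) OVER (ORDER BY month) AS change_from_prior_month
FROM   monthly_totals
ORDER  BY month;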

[00:09:29] The next piece of content is PostgreSQL High Availability Setup and Uptime Considerations. This is from Highgo.ca. Now, they're not talking about a specific high availability solution, but rather what the components of high availability are. Number one, you want a way to establish a quorum and avoid a split-brain situation when you have a cluster of multiple PostgreSQL servers, because one is going to be the primary, and the cluster potentially has to decide which replica takes over if there's a problem with the primary or it loses its connection. So you're going to want to define a quorum size, generally three or more nodes with an odd count, as well as the process to promote a standby. And in connection with promoting that standby, if the old primary comes back, you need to be able to handle that situation. That is STONITH or SMITH and fencing. STONITH is "shoot the other node in the head", which means bringing down another instance, while SMITH is "shoot myself in the head", meaning that you were the primary, you determine that another primary exists, so you should not continue any longer. And fencing is a way to keep the old primary out of the cluster and prevent it from rejoining. He then talks about indirection, which is how you connect to the primary; you need to be able to flip that easily. That could be domain name reassignment to a new IP, you could use load balancers, or you could assign the old primary's virtual IP to the new primary's database system. He also talks about configuring PostgreSQL so that you don't have to do configuration after a switchover. In other words, you want all the parameters to be up and ready to go on the replica, so that switching over won't require any reconfiguration or restart of the database. So if you want to learn more about the components of high availability, you can check out this blog post.
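On that last point, one practical check, a general Postgres rule rather than something from the post, is that a handful of settings must be at least as large on a standby as on the primary, so comparing them ahead of time means a promotion won't force a restart:

-- Run on both the primary and the standby and compare the results; these
-- settings must be at least as large on the standby as on the primary.
SELECT name, setting
FROM   pg_settings
WHERE  name IN ('max_connections',
                'max_wal_senders',
                'max_worker_processes',
                'max_prepared_transactions',
                'max_locks_per_transaction')
ORDER  BY name;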

[00:11:20] The next piece of content is Making Postgres stored procedures 9x faster in Citus. This is from Citusdata.com, and they're talking about a new enhancement they added to Citus, which is an extension for Postgres that allows you to scale out your Postgres solution. Now, they already had the concept of distributed tables, where you can scale out your tables to multiple nodes. Now they also support distributed functions, so you can delegate a function call to run on a particular node. And with that, using a benchmarking workload called HammerDB, they were able to get a nine times performance improvement using these distributed functions within Citus. So if this is of interest to you, you can check out this post from Citusdata.com.
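As I understand the Citus API, using the feature looks roughly like this; the table, column and procedure names are made up for illustration, and it assumes a procedure apply_payment(p_account_id bigint, p_amount numeric) already exists.

-- Distribute the table on account_id.
SELECT create_distributed_table('accounts', 'account_id');

-- Distribute the procedure on the matching argument and colocate it with
-- the table, so each call is delegated to the worker holding that account.
SELECT create_distributed_function(
           'apply_payment(bigint, numeric)',
           distribution_arg_name := 'p_account_id',
           colocate_with         := 'accounts');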

[00:12:06] The next piece of content is Combining pgBackRest dedicated repository host and streaming replication. This is from pgstef.github.io, and he describes the process of setting up a dedicated backup repository host for your backups using pgBackRest. So if you're interested in setting that up, definitely check out this blog post.

[00:12:26] The next piece of content is that PgBouncer 1.15.0 is released. This primarily contains enhancements and bug fixes, including more secure authentication, failure reporting and less log spam from monitoring systems. So if you use PgBouncer, maybe you want to consider upgrading to this release. The next piece of content is Waiting for PostGIS 3.1: Grid Generators. This is a new feature coming in PostGIS 3.1, so if you're interested in doing grid generation, you can check out this blog post from Crunchydata.com.
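For a quick taste of the grid generators, and assuming the new set-returning functions are ST_SquareGrid and ST_HexagonGrid as I understand the 3.1 release, a call looks roughly like this; the envelope coordinates are arbitrary.

-- Generate a 10x10 grid of 1-degree squares covering the given envelope.
SELECT i, j, geom
FROM   ST_SquareGrid(1.0, ST_MakeEnvelope(0, 0, 10, 10, 4326)) AS squares;

-- ST_HexagonGrid works the same way for hexagonal tiles.
SELECT count(*)
FROM   ST_HexagonGrid(1.0, ST_MakeEnvelope(0, 0, 10, 10, 4326)) AS hexes;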

[00:12:58] And the last piece of content is that the PostgreSQL Person of the Week is Pavel Stehule. So if you're interested in Pavel and his contributions to Postgres, definitely check out this blog post.

That does it for this episode of Scaling Postgres. You can get links to all the content mentioned in the show notes. Be sure to head over to Scalingpostgres.com, where you can sign up to receive weekly notifications of each episode, or you can subscribe via YouTube or iTunes. Thanks.
