background

Reducing WAL, What is Null?, Understanding SQL, TPC-H Benchmarking | Scaling Postgres 104

Join Over 1,000 Engineers & Get New Episodes Weekly!

In this episode of Scaling Postgres, we discuss how to reduce the size of your WAL files, what is null, how to better understand SQL and exploring TPC-H benchmarks.

Content Discussed

YouTube Video

Podcast Audio

Transcript



[00:00:00] In this episode of Scaling Postgres, we talk about reducing wall. What is null?Understanding SQL and TPCH benchmarking. I'm Kristen Jameson and this is Scaling Postgres,episode 104.

[00:00:22] All right, I hope you're having a great week. Our first piece of content is reduce wall byincreasing checkpoint distance. And this is from Cybertechn postgresql.com. And to start off, hejust creates a basic postgres twelve cluster, sets the max wall size relatively low and just turnsoff synchronous commit to be able to have the tests run quickly. Next, he creates a PG benchtest database with 10 million rows and then does a PG bench benchmark ten times 1 milliontransactions. And it processed about 13,700 transactions per second. And doing that calculationof wall generated about 77gb of wall files. So basically the lower max wall size causes it tocheckpoint more often because it can't keep that much wall around, it needs to go ahead andcheckpoint save those changes that are occurring to the data files. And with that it generated alot of write ahead log wall 77gb. Next, he increased the max wall size to 150GB, which is over1000 times as much, changed the checkpoint timeout period to one day, still left synchronouscommit off. And now in running it, it only generated 5GB. So 5.3 compared to 77gb. So whywould the same set of transactions cause ten times the amount of wall to be generated? And heexplains it here. Basically, after a checkpoint, anytime that page or block is altered, they need towrite the full contents of it to the wall to ensure consistency. Now, once that whole block hasbeen committed to the wall, subsequent changes won't require that to be done. But it needs tostart with a consistent starting point to make sure that after a checkpoint, any alterations, thosepages or blocks are committed to the wall files. Now, an older post talking about this from 2016is actually called on the impact of full page writes. And this is from a second quadrant.com. Andhere they're talking about full page rights or the storing of full page images in the wall files. Andas an example, they are looking at size differentials in terms of different workloads comparing aserial or a big serial compared to UUID. Now, because the serial is essentially sequential,whereas UUID is randomized, whenever it needs to make alterations, you're going to get morefull page images stored in the wall. Therefore, using UUIDs causes a great deal more wall to begenerated normally. But this is a great post that goes into more in depth on the reason why theneed for having these full page writes as well. So just something to keep in mind if you want toresearch how to potentially reduce your wall size as your database increases in size and somepotential solutions to save space and make it more efficient.

[00:03:25] Next post is what is null, and this is from procona.com and they're talking about theconcept of null has different definitions based upon different languages. Like for example, theytalk about C and c plus plus. Here that null is defined as zero in the standard definition headerfile. Whereas when you're talking about Java, null means variable references do have value andit can be tested against null with equality operators. But if you're using PostgreSQL, nullbasically means an unknown value. So they have a quick comparison here where you do oneequals one is true, does foo equal foo for text? Yes, it's true. Does null equal null? No. And thereason is because you can't use this equality operator, an unknown value and equal an unknownvalue is unknown. Similarly you can't say is this unknown value not equal this unknown value?You don't know, it's not known. And how you need to work with nulls in PostgreSQL is use the isor is not. So when you say null is null then you'll get a true or null is not null, then you'll get afalse. And similarly you can use the coalesce function and it will return the first set of not nullarguments and you can test this expression, but normally you're going to use null is null. Nowthey also mentioned null if, which I personally haven't used, but it basically returns a null if thesetwo values passed in or equal otherwise return the first variable and it goes into a couple ofquick examples of how you can use null. And they have some example data here where nullsare present you can't say is it blank? Because it's not blank, it's actually null. So you need toactually say look where --the s naming the surname is null. So this is a relatively brief post that talks about null as itrelates to PostgreSQL and just some things to be aware of if you are relatively new to postgres.

[00:05:26] The next piece of content is SQL. Distinct is not a function and this is from theblog JO Oq.org. Basically he's saying he's seen instances of someone writing some SQL where theyuse distinct and parentheses around a column. But really this distinct applies to this whole selectstatement. It's not a function that you would use to execute or have column as a parameter. Sothese params are unneeded, they don't do anything because the select applies to the wholething. And he goes through and explains a little bit why this is the case and how to work throughit. Now, one exception he says, is the distinct on keyword, which is something that postgres hasan extension to the SQL standard. So distinct on does require parenz around a parameter that'spassed in describes how distinct on is required. Now, related to this, there is a further post hedid the next day called Five Ways to Better understand SQL by adding optional parentheses.Now the first part of this he goes through all sorts of different posts that goes into additionallessons on SQL. So if you want a refresher or you need more education with regard to that,definitely check out some of these posts. But the rest of the post is talking about the placementof parentheses and how it can help you potentially understand what's going on with SQL. Now,he first talked about a row value expressions where you can write your SQL this way, but youcan also put parenz around it. It doesn't mean anything and you can even use a function calledrow with parenz around it or a keyword row and still it gives you the same result. Now you caneven put parenz around two different columns, so you can say with the first name followed bythe last name, and you can look for that here to get the same set of data. And you can even usethe N command or a sub query to say looking for the first name and the last name where it's inthe set and it will only return the records that match the set as well as doing it with a sub query.Next, he moves on to joins and explains how joins work and how you can use parens with that.But he also goes over some things, explaining some precedents you could potentially do. Butpersonally I just like doing joins this way without any parens to understand what's going on. Andhe agrees this is a nice and clean SQL, but he's using these parens as a reason to try and helpunderstand what's kind of going on under the covers with SQL. And then like the previous post,he talks about distinct and how it applies to the whole section as opposed to a single column. Hegoes over a union intersect and accept and how using parens can help you better understandwhat's going on with that. And finally follows up with sub queries along with some conclusions.So if you want a refresher in SQL or you want to increase your knowledge of it, then definitely ablog post to check out.

[00:08:32] The next piece of content is learnings from Benchmarking PostgreSQL on TPCH.

[00:08:38] This is from the it's all in database blog on Rafisabee blogspot.com. And here theyare benchmarking PostgreSQL using the TCPH benchmark, which is a more analytical or anOLAP based benchmark because doing a lot of testing with parallel queries. And he covers thelessons learned in terms of parallel scans needing to use indexes tuning related parameterssuch as your random page costs, generally dropping that down to one if you're using SSDsproperly, setting the effective cache size to get better performance. He talks about parallel joinsand the importance of workmen. Now, I was shocked to see where he was saying this workman20% of the Ram size, which that's huge. So clearly this is for an analytical database where youhave fewer larger queries running, because I've never seen this for online transaction processingdatabase and he talks about a parallel merge join and then shared parallel hash joins as well assome other conclusions. So if you're interested in optimizing for parallel processing of ananalytical database, maybe that's a blog post you'll want to check out.

[00:09:57] The next post is what to look for if your PostgreSQL replication is lagging. Now, thistalks about streaming replication from one primary database to a replica, or from a replica to areplica. Now, first they go over al --l the different types of replication, but this is primarily talking about streaming replication andit's talking about essentially monitoring your replication. And the number one way to do that isthe PGSTAT replication view and they cover what all the different parameters are, what you canuse on the standby node, what can you use on the primary node, and then how to identify thelag and the source of it. So if you're interested in monitoring your replication lag or you have aproblem you'd like to explore, definitely check out this blog post from several nines.com.

[00:10:44] The next post is creating a PostgreSQL procedural Language part Four handlingInput Parameters So this is part four where they're creating a procedural level language usingJulia. And this is the part four where they're actually passing off and executing code in Julia. Soif you want to continue on with this series, the next post is available from secondquadrant.com.

[00:11:08] The next post is developing PostgreSQL for Windows part two. So this is the secondpart where they're discussing actually developing on PostgreSQL on, say, a Linux environment,but you want to ensure that it works on a Windows environment as well. Well, this series ofposts walk through that and this is a part two for those posts. The next piece of content is thePostgreSQL person of the week is Joe Conway. So it's an interview with Joe Conway and hiscontributions and work in PostgreSQL. So if you're interested in checking out this, you can checkout PostgreSQL Life website for the Joe Conway post.

[00:11:47] The next series of posts are primarily all about tools and updates that are available.So this is postgres AI, and there is a product called Database Lab, where basically, as it sayshere, it clones multi terabyte database in seconds. So it has a terabyte database and thenessentially it does diffs as you're working with it to give you your own separate slices of thedatabase for doing different experiments. Now with that, it also has a SQL performance advisorcalled Joe. So this lets you run those experiments through a slack chat interface to be able tocheck the performance of your queries using this database lab. So it's an interesting set of toolsto be able to explain, analyze on production data and analyze the best queries. So if you'venever heard of them before, maybe you'll want to check out the tools that they have available.

[00:12:46] The next post is Crunchy spatial, it's PostGIS for the web. So basically, this is a toolthat Crunchy data is working on. They're calling it Crunchy Spatial. And basically they're buildinga web interface leveraging PostGIS to the data within PostgreSQL. So they're developing aseries of tools or microservices that are web interfaces into PostGIS. And the PostgreSQLdatabase and the number of services they want to set up are a Vector Tile server, a Vectorfeature server, a data import service, a routing service, geocoding service and a base mapservice. To start off, they have two tools available of this feature set a PG Tile Serve and a PGFeature Serve. Now, related to this, they have another post that's crunchy Spatial Tile servingand they go into more in depth on PG Tile Serve about how it works. So it's a microservice thatcommunicates with the data layer. It basically provides a web user interface into the Spatial databeing produced. And they have relatively short number of steps to go ahead and spin up yourown example of this. So if you're interested in using PostGIS for these purposes, definitely checkout these set of blog postgres.

[00:14:09] The next piece of content is PG. Backrest auto select Backup. So this is a post aboutPG Backrest and how a new version was released that has the ability to do an auto selection ofthe proper backup set when doing a restore based upon a target time. So if that's of interest toyou, maybe you won't want to check out this blog post. And the last post is Cytus 9.2 speeds uplarge scale HTAP workloads on Postgres. Now, HTAP actually is a hybrid transactional analyticalprocessing database and thus covers all of the new features that have been introduced in Cytus9.2, as well as a number of different performance improvements. So if you have any interest inCitis, definitely a blog post to check out.

[00:14:57] That does it. For this episode of Scaling Postgres, you can get links to all the contentmentioned in the show Notes. Be sure to head over to Scalingpostgres.com --where you can sign up to receive weekly notifications of each episode, or you can subscribevia YouTube or itunes. Thanks. --

episode_image