Data Directory, Vacuum Features, Binary Performance, Network Latency | Scaling Postgres 114


In this episode of Scaling Postgres, we discuss the Postgres data directory, new vacuum features, binary storage performance and the impact of network latency.



[00:00:00] In this episode of Scaling Postgres, we talk about the data directory, vacuum features, binary performance and network latency. I'm Kristen Jameson, and this is Scaling Postgres episode 114.

[00:00:21] Alright, I hope you, your family and coworkers continue to do well throughout these times. Our first piece of content is that new PostgreSQL releases 12.3, 11.8, 10.13, 9.6.18 and 9.5.22 are out. The only security issue relates to the Windows installer, which was running executables from uncontrolled directories. So if you run on Windows, maybe you want to upgrade faster. Otherwise there are over 75 bug fixes and improvements, so I would look through this list to see if there's any reason you want to upgrade relatively quickly. And this is from the website. There's also a post on Devrim Gündüz's PostgreSQL blog noting that for yum users, some -devel RPMs require a new repository. He says if you're using Red Hat Enterprise Linux or CentOS 7, then the development subpackages in versions 11 and 12 require an extra repo enabled for LLVM and Clang. So on CentOS or Red Hat Enterprise Linux you need to run the command he shows. Just something to keep in mind if you use those versions.

[00:01:36] The next piece of content is Don't manually modify the PostgreSQL data directory, and this is from Robert Haas's blog. And this is a big no-no: unless you have actually done PostgreSQL development, I would never modify anything in the data directory. What's interesting is that when he gives this recommendation, a lot of people say, really? What if I do X, Y and Z? Don't do it, just don't do it. He says in 100% of cases his answer is that it's not safe. But going back to "hey, what if I do this?", he goes through in this post what happens if you do. For example, if you remove postmaster.pid, there's a chance you end up with multiple postmasters running against the same PostgreSQL cluster, and that could lead to who knows what kind of effects. And he says if you suspect the file is present without postgres running, assume that you're wrong, because in his 20 or so years he's never really seen it happen that way. So definitely something not to do. Removing WAL files is another big no-no. In older versions of PostgreSQL the WAL files were stored in the pg_xlog directory, so people assumed, hey, these are logs, I can delete them. No, they're part of the WAL, and the directory has actually been renamed to pg_wal in recent versions. You don't want to delete these files because they are needed for crash recovery and for maintaining the consistency of your database, and there are all sorts of ill effects he goes through that can happen here. But he says instead of removing files manually from pg_wal, consider why the system is keeping so many of them around. Maybe you have a failing archive command. Maybe you're just not keeping up with the rate of WAL generation; I've had slow disks cause this type of thing. Do you have an existing or orphaned replication slot that is preventing the WAL segments from being removed?
Maybe you have long-running transactions, or maybe your max_wal_size or wal_keep_segments settings are really high, preventing WAL from being removed. Again, these are things you can fix without having to change anything in the data directory. Next he covers removing files from the pg_xact directory, then the pg_multixact directory, and he talks about trying to do single-page restores, which carries a large degree of risk, and all the types of things that can go wrong. So if you're interested, I definitely suggest checking that out. Then there's a section on whether manual modification of the data directory is ever justified. If you are in a corrupt state, there are certain cases, very few, where this is perhaps warranted, but again I would leave that to an expert, someone who has done PostgreSQL development and knows exactly what they're doing. He makes a comparison to surgery: essentially you want to find a surgeon to do this. You don't want to just go in willy-nilly and try changing things, because you'll probably make things worse. So this was a great post. If you've ever considered modifying the data directory, basically just don't do it. Definitely a post I suggest you check out.
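For reference, if you're trying to work out why WAL is piling up, a few queries along these lines can help narrow it down (this is a sketch against recent Postgres versions, not something from the post itself):

```sql
-- Replication slots: an inactive slot with an old restart_lsn pins WAL
SELECT slot_name, active, restart_lsn
FROM pg_replication_slots;

-- Archiving health: a climbing failed_count suggests a broken archive_command
SELECT last_archived_wal, last_failed_wal, failed_count
FROM pg_stat_archiver;

-- Retention-related settings mentioned above
SHOW max_wal_size;
SHOW wal_keep_segments;
```

If a slot shows active = false and an old restart_lsn, that orphan slot is the usual culprit.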

[00:04:52] The next post is Improved Vacuum and Autovacuum in PostgreSQL 13, and this is from Amit Kapila's blog. Improvement number one is that vacuum will be allowed to process indexes in parallel. This is huge, because in my experience the indexes take the longest part of a vacuum, and you can use various configuration settings, namely max_parallel_maintenance_workers and min_parallel_index_scan_size, to configure it so that the indexes are vacuumed in parallel. It looks like this will be the default, which is good for autovacuum as well, but you can also specify the number of workers when you do a manual vacuum. So this is a great improvement coming in 13. Improvement number two allows inserts to trigger autovacuum activity. By default, when you just have inserts, there's nothing really to vacuum up in terms of deleted or updated rows, so the first vacuum the table gets is an anti-wraparound vacuum, which will probably take a really long time, potentially use more resources, and can't be canceled. So there are some disadvantages to that. But with some configuration settings they've introduced, a vacuum will be triggered once so many inserts are done into a table, so that you can do some general cleanup such as freezing rows and ensuring heap pages are marked all-visible, which also allows index-only scans to skip heap fetches. So being able to do this vacuum is important for those reasons. Improvement number three allows autovacuum to display additional information about the heap and index in case of an error, which is good because it can actually tell you where particular errors are if vacuum runs into a problem.
Now, knock on wood, I haven't experienced this, but this reporting would be good to have in case it happens. Improvement number four: autovacuum will now log WAL usage statistics along with other information. This is great for understanding how many full-page images are used and things of that nature. Improvement five makes vacuum buffer counts 64 bits wide to avoid overflow of buffer usage stats, so the stats stay accurate rather than becoming invalid; that's a good benefit. And number six adds a wait event, VacuumDelay, to report on cost-based vacuum delay. He says this will help us monitor autovacuum throttling. So a lot of great vacuum features are coming in 13, especially the parallel one. If you want to learn more, definitely check out this blog post.

[00:07:24] The next piece of content is Binary data performance in PostgreSQL. This post covers the performance of storing binary data in PostgreSQL, and they cover three ways you can do it. One is storing the data in files and then storing a reference to the file in the database. The disadvantage here is that consistency is not guaranteed: the file write and the database record can get misaligned. The benefits are that the database stays small and performance can usually be very fast, because reading directly from the file system is faster. The next option is storing the data in large objects. This is an older way of storing binary data in PostgreSQL; it's the slowest, but it has the advantage that you can store arbitrarily large data, and the large object API supports streaming, so you can read and write large data in chunks. The third option is storing the data in a bytea column, which is the newer way of storing binary data. The main disadvantages are an absolute limit of 1GB, all the data having to be handled in memory, and no streaming support. With bytea, PostgreSQL basically uses TOAST to store the data separately from the table. Now, because PostgreSQL normally compresses large values, you can avoid that by setting the storage type to external for your bytea column. Setting it to external skips the compression, so if your data is already compressed, use external to avoid trying to compress it again. Next they talk about a Java implementation they wrote to test the speed of these different solutions, and they come up with a table at the bottom showing results for two different data sizes. You can see that using the file system directly is the fastest, but again the disadvantage is inconsistency; large objects are pretty darn slow, but have support for streaming.
And the bytea data type is faster than large objects, but nowhere near the speed of the file system. Then in the summary they go over the advantages and disadvantages of each approach. So if you're wanting to store binary data in PostgreSQL, definitely check out this post so you can make the best choice for what you're doing. And I should mention, this is from Cybertec.
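For illustration, setting a bytea column to external storage looks something like this (a sketch; the table is hypothetical):

```sql
-- bytea column for binary payloads; large values are TOASTed out of line
CREATE TABLE blobs (
    id      bigserial PRIMARY KEY,
    payload bytea
);

-- EXTERNAL keeps out-of-line TOAST storage but skips compression,
-- which helps when the payload is already compressed (images, archives, ...)
ALTER TABLE blobs ALTER COLUMN payload SET STORAGE EXTERNAL;
```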

[00:09:53] The next post, also from Cybertec, is PostgreSQL: network latency does make a big difference. Here they wanted to test performance with increased latency. They use a tool called tcconfig to introduce latency on Linux and then run transaction tests with pgbench. Running at full speed with no latency introduced, you get 8,813 transactions per second. Introducing ten milliseconds of delay and running the same benchmark, it drops to 461 transactions per second, a huge drop-off. Essentially the clients are waiting for their results to come back before they send the next transaction, and ten milliseconds of latency is enough to cause that. Bump it up to 50 milliseconds and your TPS is now at 88; basically it dropped by over 100 times. So network latency is hugely important. They mention this specifically for online transaction processing: because you have a lot of small, short transactions, any latency between the client and the server is going to significantly decrease the transactions you can execute per second. So a definitely interesting blog post showing how latency can impact your queries.

[00:11:14] The next post is Postgres DISTINCT ON, and this is from John's blog. This is an interesting post because he's a developer and he walks through a performance problem he was having and how he wanted to fix an N+1 query to get better performance. It covers a lot of Ruby as well as the SQL he used to solve his problem. Basically, the solution he chose as the most efficient one uses DISTINCT ON to pull out the first record that he needed, and then he goes through the Ruby implementation of how he introduced this query into his code to get the best performance. So if you're interested in a developer's perspective on seeing a problem and coming up with a solution using DISTINCT ON, definitely check out this blog post.
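The general shape of a DISTINCT ON query, picking one row per group, looks like this (the schema here is made up for illustration, not the one from the post):

```sql
-- One row per device: within each device_id group, the first row in
-- ORDER BY order wins, so this returns each device's most recent reading
SELECT DISTINCT ON (device_id)
       device_id, recorded_at, value
FROM   readings
ORDER  BY device_id, recorded_at DESC;
```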

[00:12:01] The next post is Overcoming Imposter Syndrome: working with PostgreSQL's SSL modes. He's talking about all the different SSL modes available in PostgreSQL. What determines whether SSL is used, from the server's perspective, is the pg_hba.conf file: whatever you set in that first column determines whether SSL is used. Local uses a Unix domain socket, so it won't use any SSL at all. With host, a connection could use SSL or not. With hostssl, it only uses SSL. And hostnossl matches only connections that don't use SSL. A convenient way to reject all non-SSL traffic is to put a line at the top of your pg_hba.conf with hostnossl for all databases, users and IP addresses, with reject as the method, so all non-SSL connection attempts are rejected, and then put in hostssl lines for the clients you want to allow. That's how you enforce it on the server side. There's also an sslmode setting on the client that dictates how the client connects: it can prefer SSL, require it or disable it, and there are also certificate-verification modes. So he goes through the different ones, having set up the server as described, rejecting all non-SSL connections and only allowing SSL connections with a password. In the first attempt, where the client's sslmode is disable, the connection is rejected: the client has said I don't want to use SSL, but the server says no, I'm only going to accept SSL, so the connection fails. Connections are permitted through when the client's sslmode is allow, prefer or require. However, you'll notice that verify-ca fails, as does verify-full, and these are where you're asking the client to verify the certificate used by the server.
This will only work if you have set up and manually installed certificates on the server, and the clients trust the certificate authority who signed those certificates. verify-ca basically means verify the CA who signed the certificate of the server you're connecting to. verify-full means you're also verifying that the common name or the subject alternative name in the certificate matches the database server you're trying to connect to. So this is the strongest way of verifying authenticity, ensuring you're not connecting to a server that isn't what you expect. If you want to learn more about these settings, definitely check out this post.

[00:14:47] The next piece of content is Benefits of an external key management system over the internal one, and how it could help secure PostgreSQL. This is talking about the benefits of an external key management system. Why are we talking about this? Because this organization, Highgo, per this blog post, is working on a transparent data encryption feature, where data gets encrypted at rest by the PostgreSQL server. In order to do that, they mention that the encryption itself is relatively simple; what is not simple is key management. So they're working on this transparent data encryption feature, potentially for version 14, and they're considering some internal key management features for it, but this post goes through the benefits of an external key management system instead. If you're wanting to learn more about the benefits of this, you may want to check out this post.

[00:15:43] The next post is Keep your libraries updated and nobody gets hurt. They're talking about the importance of keeping all your libraries up to date even when running PostgreSQL, because PostgreSQL relies on a number of libraries, and if those aren't kept up to date, you can run into problems. They note that not upgrading may give you a false sense of stability because things seem fine, and you may have a fear of upgrading, but eventually you're probably going to run into issues, usually related to an old software version that hasn't been updated. They go into some actual scenarios: in one, someone was using pg_dump but discovered that the pgaudit extension hadn't been updated in more than a year, and it was causing an error when trying to restore the database. That's a big issue. In the next scenario, they had a huge performance difference when running a Postgres query and tracked it down to an old glibc version; once that was updated, the performance went back to normal. So it's basically communicating the importance of updating your whole system, not just your PostgreSQL versions, but also the extra libraries PostgreSQL is using.

[00:17:00] The next post is Multi-column partitioning, and this is a way of partitioning tables using multiple columns. I actually didn't know this was possible, where you can create a table partitioned on different columns with different ranges. If you look at the example here, it's ranges like from 1 to 20, from 110 to 200, and from 50 to 200. The post goes through all the possibilities, things to watch out for, and how queries work when using multiple columns for your partitioning. Now, I haven't encountered a use case for this, but it was definitely an interesting blog post you may want to check out in case you have a use case for it.
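A minimal sketch of what multi-column range partitioning looks like (table and values are invented for illustration; bounds are compared row-wise, like tuples):

```sql
CREATE TABLE measurements (a int, b int)
    PARTITION BY RANGE (a, b);

-- Bounds are lexicographic on (a, b): FROM is inclusive, TO is exclusive
CREATE TABLE measurements_p1 PARTITION OF measurements
    FOR VALUES FROM (1, 110) TO (20, 200);
CREATE TABLE measurements_p2 PARTITION OF measurements
    FOR VALUES FROM (20, 200) TO (50, 200);
```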

[00:17:41] The next post is pgBadger: X-ray vision for your queries. They're basically talking about using pgBadger to analyze your Postgres logs, and they go through some of the things you may want to additionally configure in order to make sure you're collecting the most information. It also walks through how you can use pgBadger to analyze your logs. So if you're interested in doing that, check out this blog post.

[00:18:10] The next post is Key parameters and configuration for streaming replication in Postgres 12. In this example, they're looking at setting up a primary server with two replicas, which you can choose to make synchronous or not. They're using replication slots, and they describe the settings that need to be made on the primary and the replicas. So if you're interested in setting this up for your system, you can check out this blog post.
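For orientation, the primary-side configuration tends to look something like this (a sketch with placeholder names; see the post for the full set of parameters on both sides):

```sql
-- On the primary (Postgres 12); several of these are defaults, shown explicitly
ALTER SYSTEM SET wal_level = 'replica';
ALTER SYSTEM SET max_wal_senders = 10;
ALTER SYSTEM SET max_replication_slots = 10;

-- Optional: wait for this standby to confirm commits (synchronous replication)
ALTER SYSTEM SET synchronous_standby_names = 'replica1';

-- One physical slot per replica so the WAL it still needs isn't recycled
SELECT pg_create_physical_replication_slot('replica1_slot');
```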

[00:18:37] The next post is Fuzzy searching with PostgreSQL, and this is from the dev.to website. It's talking about trigrams, basically using the pg_trgm extension to do fuzzy or similarity searches in your database. It goes over just a very basic implementation and how you could use it. It doesn't cover things such as adding indexes, but if you want a good basic description of using the pg_trgm extension, definitely check out this blog post.
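A minimal pg_trgm example along those lines (the table and data are hypothetical, and the index at the end goes a step beyond what the post covers):

```sql
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- Trigram similarity between two strings, a value between 0 and 1
SELECT similarity('postgres', 'postgras');

-- Fuzzy match with the % operator (cutoff set by pg_trgm.similarity_threshold)
SELECT name FROM people WHERE name % 'jhon';

-- A GIN trigram index accelerates % and LIKE '%...%' searches
CREATE INDEX people_name_trgm_idx ON people USING gin (name gin_trgm_ops);
```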

[00:19:09] The next piece of content is Advanced SQL and database books and resources. We had previously talked about other books presented as educational resources, and here he goes into several different online courses and books to boost your knowledge of SQL. Again, this is not necessarily PostgreSQL-specific, but a number of his resources do cover PostgreSQL. The next post is Build PostgreSQL and extensions on Windows, and it tells you how you can build Postgres along with extensions on Windows. If you have a need to do that, check out this post from Highgo. And the last post is that the PostgreSQL Person of the Week is Fabrízio de Royes Mello. Forgive me if I pronounced your name incorrectly. If you want to learn more about Fabrízio and his contributions to PostgreSQL, definitely check out this blog post.

[00:20:04] That does it for this episode of Scaling Postgres. You can get links to all the content mentioned in the show notes. Be sure to head over to our website, where you can sign up to receive weekly notifications of each episode, or you can subscribe via YouTube or iTunes. Thanks.