
Collation Issues, Fast Loading, Recovery, SCRAM Authentication | Scaling Postgres 72


In this episode of Scaling Postgres, we discuss collation issues, how to load data quickly, recovery changes in Postgres 12 and moving to SCRAM authentication.


Transcript



[00:00:00] In this episode of Scaling Postgres we talk about collation issues, fast loading, recovery and SCRAM authentication. I'm Creston Jamison, and this is Scaling Postgres episode 72.

[00:00:19] Alright, I hope you're having a great week. The first piece of content is "ICU collations against glibc 2.28 data corruption", and this is from cybertec-postgresql.com. They're basically talking about how changes in glibc version 2.28 could cause corruption of data, mostly in how things are ordered, and it could cause issues with indexes. For example, unique indexes could end up storing more than one entry for the same value, in other words no longer being unique, based upon the collation changes, or you could get ordering differences once your OS upgrades to 2.28. For example, I checked Ubuntu and the most recent LTS version was at 2.27, so not quite at 2.28. So they reiterate here why it is a problem: basically it affects the results of ORDER BY clauses in SELECT statements and the order of keys in btree indexes. That could lead to incorrect query results, as they say here, or to duplicate entries in unique indexes. Now, you can do a REINDEX command. Thankfully version 12 coming up has REINDEX CONCURRENTLY, but we don't have that in our versions today. What this post talks about is potentially moving to ICU collations as opposed to relying on the glibc ones. Now, more of this was discussed in a previous episode of Scaling Postgres with the post "Beware of your next glibc upgrade", which covers this issue and demonstrates some of it with two different versions of glibc. If you're interested, I also encourage you to check out the comments, because there they're talking about where they had issues with duplicate key violations on unique constraints with some of their indexes, and kind of what they had to do to work through it. So basically this is a problem to be aware of, and this post says, hey, maybe you want to change to ICU collations because they would be a little more independent of glibc changes. They list several advantages of moving to ICU, such as: they work the same on all operating systems, you can create your own collations, you can use case- and accent-insensitive collations, and they are versioned. But the unfortunate fact of the matter is, he says here, quote, you cannot use ICU collations as the default collation for a database or a cluster, so you have to explicitly specify them for all table columns.
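
To make the remediation concrete, here's a minimal sketch of the reindexing step; the index and database names are illustrative, not taken from the article.

```sql
-- After a glibc upgrade that changes collation order, btree indexes on
-- affected text columns must be rebuilt. Illustrative names:
REINDEX INDEX customers_name_idx;

-- Or, more coarsely, rebuild every index in the database you are connected to
-- (REINDEX ... CONCURRENTLY only arrives in PostgreSQL 12):
REINDEX DATABASE mydb;
```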

[00:02:50] This restriction may be lifted at some point in the future. So basically this would be a pain to do, but if you wanted to go ahead and move to ICU collations, he goes over and describes kind of how they're laid out, how to create them, how you can update the versions using ALTER COLLATION ... REFRESH VERSION, and how to actually switch to ICU collations. Again, you have to do it per table, and of course you have to reindex your indexes. But of course, the issue that he also talks about is how to do this without downtime, and it's basically using logical replication; the post finishes by describing how you could potentially do that. So definitely this is something that's coming down the pike for different OS versions where you have PostgreSQL installed that you need to be aware of. It looks like it's mostly just an ordering change, but it could cause greater problems if you have unique indexes on text columns that are impacted by these collation changes coming in glibc 2.28. So it's just something to be aware of, and two posts for you to check out.
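
And here's a rough sketch of what the per-column switch to ICU looks like, assuming PostgreSQL 10 or later built with ICU support; the collation, table and column names are just examples, not the article's.

```sql
-- Create an ICU collation (name and locale are illustrative):
CREATE COLLATION german_icu (provider = icu, locale = 'de-DE');

-- ICU collations cannot (yet) be the database default, so each text column
-- has to be switched explicitly:
ALTER TABLE customers
    ALTER COLUMN name TYPE text COLLATE german_icu;

-- After an ICU library upgrade, record the new collation version:
ALTER COLLATION german_icu REFRESH VERSION;
```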

[00:03:54] The next post is "Fastest Way to Load Data Into PostgreSQL Using Python", and the subtitle here is "From two minutes to less than half a second". This is from hakibenita.com. Now, basically they retrieved data and, using Python, went through a number of different steps to find the fastest way to load hundreds of thousands of rows. This is a very long post, pretty Python-centric, not so much PostgreSQL, although the ideas are still there. Basically, single inserts are usually slower and multi-row inserts are faster, although interestingly that difference wasn't that significant in his tests. At the very bottom he has a result summary of the different iterations that he did, and execute_many actually didn't improve things by that much. But of course the fastest one is using the COPY command: he had a way of parsing the data out and feeding it to COPY, and that's what got it down from 128 seconds doing one insert at a time to less than half a second. So definitely, if you're a Python developer, there's some insight to be gained here on how you can load data quickly. The general rule is: single inserts one at a time are the slowest, a multi-row insert gives you better performance, and the COPY command is the fastest. This is similar to what I found in a video tutorial I did about Ruby, "Fast PostgreSQL Data Loading Using Ruby", and again it showed the same results: single inserts are the slowest, multi-row inserts are faster, and the COPY command is the fastest. So if you're a Python or Ruby developer, you could potentially check out these two pieces of content.
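
To illustrate that general rule in plain SQL (the article itself benchmarks these approaches from Python), here's a minimal sketch with an illustrative table and file path.

```sql
CREATE TABLE staging (id int, name text);

-- 1. One INSERT per row: slowest, one statement (and round trip) per row.
INSERT INTO staging VALUES (1, 'alpha');
INSERT INTO staging VALUES (2, 'beta');

-- 2. Multi-row INSERT: fewer statements, usually noticeably faster.
INSERT INTO staging VALUES (3, 'gamma'), (4, 'delta'), (5, 'epsilon');

-- 3. COPY: streams the whole data set in a single command and is the fastest.
--    (Server-side COPY FROM a file needs appropriate privileges; from a
--    client you would use COPY FROM STDIN or psql's \copy instead.)
COPY staging FROM '/tmp/staging.csv' WITH (FORMAT csv);
```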

[00:05:51] The next post is "Big News in Databases — Summer 2019", and this is from winand.at. This is his newsletter, and kind of the subtitle for the page is SQL standard extensions and the cloud war. So basically he talks about how the SQL standard continues to grow, and this is something this individual regularly monitors: he tracks changes to SQL across Oracle, Microsoft SQL Server, PostgreSQL, Db2, all the relational databases, and how each is keeping up with the standard. They're talking about introducing things like multidimensional arrays, an actual graph query language (GQL), and also some streaming-data-related features. And then with reference to the cloud wars he mentions, basically, this quote: new add-on features, options and management packs are only available in Oracle Cloud and on engineered systems. This means only on Oracle hardware, which is interesting, so you can no longer get them and install them on your own hardware. And he has the subtitle here, "Cloud War: Strange Allies Working Against Your Own Hardware". So again, this is another great thing about PostgreSQL: it's open source and you can take it and run it wherever you want, or run it in a cloud environment and pretty much get the same thing. He also goes into discussions about technology and science, new releases for the different database systems that he tracks, as well as new content he's created and some other areas. So if you're interested in more news-related items for the summer of 2019, definitely a blog post to check out.

[00:07:31] The next post is "PostgreSQL and recovery.conf", and this is from Luca Ferrari at fluca1978.github.io. He's referring to the fact that with version 12 of Postgres the recovery.conf file is removed, and he says here that the server will literally not start with it. So they've moved all of the things that you configured in recovery.conf into the main postgresql.conf file, or files included from it, and there are also some new trigger files that get created: standby.signal and recovery.signal. Basically, with standby.signal the server comes up as a hot standby, and with recovery.signal it recovers to the point where it is ready to take on primary duties. And he makes the point that one of the reasons this was done is, he says, quote, moving them into postgresql.conf makes them a good candidate to be changed by an ALTER SYSTEM statement. So basically, when you upgrade to version 12, you need to make sure that your processes, or the backup and recovery software you're using, take this into account to handle the absence of this file, as well as putting the configuration in the main postgresql.conf config file.
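
As a rough illustration of what that looks like on PostgreSQL 12 (the connection string and archive path here are illustrative, not from the post):

```sql
-- On a standby you also create an empty standby.signal file in the data
-- directory (for example: touch $PGDATA/standby.signal), then set the
-- former recovery.conf options as ordinary server parameters:
ALTER SYSTEM SET primary_conninfo = 'host=primary.example.com user=replicator';
ALTER SYSTEM SET restore_command  = 'cp /var/lib/pgsql/wal_archive/%f %p';

-- Reload the configuration (some of these parameters still require a
-- full restart in version 12):
SELECT pg_reload_conf();
```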

[00:08:47] The next post is "How to upgrade your PostgreSQL passwords to SCRAM". With PostgreSQL 10 they added SCRAM authentication as an option, which is a much more secure and standardized way of doing password management for your users. They give a brief overview of SCRAM and what it is; it's basically an acronym for Salted Challenge Response Authentication Mechanism. And they go through the upgrade steps. First, determine if you can upgrade to SCRAM, because some things can't use it. For example, to my knowledge, PgBouncer by default does not support SCRAM. I think there may be some GitHub patches that allow it, but by default I don't think it does at this time. The next step is to validate your pg_hba.conf settings, and they suggest having md5 for the users that are connecting. Then change PostgreSQL's password authentication method, so basically set password_encryption to scram-sha-256. Determine who needs to upgrade: they give you a script so you can find the users that still need to move to SCRAM. Then upgrade each user's password; each password needs to be reset, and they give you two methods for how to do this. One thing they mention you want to avoid is ALTER ROLE username PASSWORD 'newpassword', because the plain-text password could end up being logged, which kind of defeats some of the security improvements of SCRAM. And lastly, update your pg_hba.conf to use only scram-sha-256. Because you change this per line in the pg_hba.conf file, if you do have utilities that don't support SCRAM yet, you could have, say, general users that connect interactively use SCRAM, but leave PgBouncer on md5, for example. So if you're interested in moving to SCRAM, here's a process you can use (there's a minimal sketch of these steps after this paragraph).

The next post is actually a YouTube channel, and this is the ITPUG Italian PostgreSQL Users Group. They recently had PGDay Italy 2019, so they posted a number of videos. The first set is for the individual speakers, basically who they are and what they're talking about in their presentations, and they're relatively short, one to three minutes. Then they have the full presentations that occurred at the PGDay. Now, I did notice some of them are in Italian or another language that was not English, but if you're interested in video content, there may be some presentations here you'd like to check out.
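
Here's that minimal sketch of the SCRAM upgrade steps, assuming PostgreSQL 10 or later; the role name is illustrative, and this is a simplification of the post's process rather than its exact scripts.

```sql
-- 1. Hash newly set passwords with SCRAM from now on:
ALTER SYSTEM SET password_encryption = 'scram-sha-256';
SELECT pg_reload_conf();

-- 2. Find roles whose stored password is still an MD5 hash (needs superuser):
SELECT rolname FROM pg_authid WHERE rolpassword LIKE 'md5%';

-- 3. Reset each affected password. Prefer psql's \password (or an equivalent
--    client-side prompt) over ALTER ROLE ... PASSWORD 'plaintext', so the
--    clear-text password never reaches the server log.
-- \password app_user

-- 4. Finally, switch the relevant pg_hba.conf lines from md5 to
--    scram-sha-256 and reload the configuration.
```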

[00:11:27] The next post is "Implementing autonomous transactions in Postgres", and this is from cybertec-postgresql.com. He asks, what is an autonomous transaction? Basically, you want to send off, fire and forget, some subtransactions that are not connected to the main transaction. So you're in a transaction, you're doing some sort of work, and you want to record something; he says here it's mostly logging, auditing, or progress tracking into some table. Basically, this information should persist even if the transaction rolls back. Now, he shows there's a hacky way to do it where someone literally uses the COPY command to essentially write something to a file, but that has some downsides: the structure is cryptic, COPY TO PROGRAM has a superuser requirement, and you need physical access to the database server to read the log. Another alternative is using dblink, which creates a separate connection to the database and inserts into a particular table. And another alternative is using PL/Python and a Python driver, again to make a connection and do an insert into the database. Now, interestingly, I wonder, if you had a requirement for this, whether there's another alternative: potentially using the asynchronous notification mechanism, LISTEN and NOTIFY. I wonder if these types of notifications would work and still be essentially fire and forget when the transaction gets rolled back. So if you have a requirement for something like this, definitely a blog post to check out.
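
To make the dblink variant concrete, here's a minimal sketch with an illustrative audit table; it assumes the dblink extension is installed and that a loopback connection to the same database will authenticate.

```sql
CREATE EXTENSION IF NOT EXISTS dblink;
CREATE TABLE audit_log (logged_at timestamptz DEFAULT now(), message text);

BEGIN;
    -- ... main work of the transaction ...
    -- Fire-and-forget write over a second connection:
    SELECT dblink_exec(
        'dbname=' || current_database(),
        $$INSERT INTO audit_log (message) VALUES ('reached step 1')$$
    );
ROLLBACK;

-- The audit row survives despite the rollback, because it was committed
-- by the separate dblink connection:
SELECT message FROM audit_log;
```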

[00:13:06] The next post is "The database deficit: how full-stack developers' blind spot harms agile teams". Now, this is not explicitly related to PostgreSQL, although he does talk about relational databases in general, and it is an opinion piece. Basically he's arguing that with full-stack developers, and particularly with the rise of Node.js and being able to use one language on the client and the back-end server side, and with everything those developers have to keep track of, the database side usually gets the short end of the stick in terms of the skill set of a, quote, full-stack developer. So it's an interesting piece talking about this database deficit and what some examples look like, such as not using a migration strategy for schema changes. And this is something I've often thought about: developers who basically don't spend a lot of time focusing on the database side and focus on everything else. But it's an interesting opinion piece and I definitely encourage you to check it out if this type of content interests you.

[00:14:11] The next post is "PostgreSQL WAL retention and cleanup: pg_archivecleanup". This is a utility that you can use to clean up WAL archives, and he describes this built-in command that's provided with PostgreSQL. You can use it in a dry-run mode to see what files would potentially be removed, and then run it in delete mode. But what's critical is: do not run this against your main PostgreSQL data directory, like the pg_wal directory or the pg_xlog directory. Do not do that; only run it against archive destinations, because otherwise you'll corrupt your database. And of course, in using this utility, the other consideration is that you need to make sure not to delete WAL that you still need. They go through four different criteria here that are very important to follow in terms of being able to restore your database to a point in time. So if you're going to be using this utility for cleaning up your WAL archives, definitely check out this blog post and follow their criteria to ensure that you're only deleting the WAL files that you intend to.
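
As a related sketch (this is my own illustration, not something the post necessarily covers): on PostgreSQL 12 and later, where the standby settings are regular server parameters, the same binary can also be wired into archive_cleanup_command so a standby trims its archives automatically. The paths are illustrative, and the manual dry-run invocation is shown as a comment.

```sql
-- /var/lib/pgsql/wal_archive is the archive destination here, never the live
-- pg_wal directory. %r expands to the oldest WAL file still required, so
-- only older archived segments get removed.
ALTER SYSTEM SET archive_cleanup_command =
    'pg_archivecleanup /var/lib/pgsql/wal_archive %r';
SELECT pg_reload_conf();

-- For a manual dry run, the utility can be invoked from a shell as:
--   pg_archivecleanup -n /var/lib/pgsql/wal_archive 000000010000000000000010
```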

[00:15:20] The next post is "Generate primary keys almost automatically", and this is from Luca Ferrari at fluca1978.github.io. He was recently looking at a quite large database with a lot of tables, and essentially almost all of the tables did not have a primary key. So basically he went through and developed a couple of different ways that you could query the database, looking at the system tables, and generate statements that add a primary key automatically. He used a couple of different methods. First he used a CTE to generate the statements, so you could run the resulting file against the database. Then he used a function that accepts different parameters to control how the SQL generation works. And then he actually did everything within a procedure that executes the ALTER TABLE commands. So, three different ways to do it. If you ever need to run a lot of different commands using information that's located in the system tables, definitely a blog post to check out.
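
For a flavor of the approach (a minimal sketch, not Luca Ferrari's exact queries), here's a catalog query that emits an ALTER TABLE statement for every table in the public schema lacking a primary key; the generated "id" column name is an arbitrary choice.

```sql
SELECT format('ALTER TABLE %I.%I ADD COLUMN id bigserial PRIMARY KEY;',
              n.nspname, c.relname) AS ddl
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE c.relkind = 'r'              -- ordinary tables only
  AND n.nspname = 'public'
  AND NOT EXISTS (SELECT 1
                  FROM pg_constraint con
                  WHERE con.conrelid = c.oid
                    AND con.contype = 'p');  -- no primary key constraint
```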

[00:16:29] The last post is about serving dynamic vector tiles from Postgres. Now, again, I don't know that much at all about PostGIS, but if you're interested in content where you need to serve these dynamic vector tiles, here's a new blog post that covers the techniques for doing it. That does it for this episode of Scaling Postgres. You can get links to all the content mentioned in the show notes. Be sure to head over to scalingpostgres.com, where you can sign up to receive weekly notifications of each episode, or you could subscribe via YouTube or iTunes. Thanks.
