
Postgres Migration, Bulk Data Loading, Updates vs. Upgrades, UUID Benchmarks | Scaling Postgres 144


In this episode of Scaling Postgres, we discuss a Mongo to Postgres migration, the best way to bulk load data, running an update vs. an upgrade and benchmarks for UUIDs.

Transcript



[00:00:00] In this episode of Scaling Postgres, we talk about a Postgres migration, bulk data loading, updates versus upgrades and UUID benchmarks. I'm Creston Jamison and this is Scaling Postgres, episode 144.

[00:00:23] All right, I hope you, your friends, family and coworkers continue to do well. Our first piece of content is Seamless MongoDB to PostgreSQL migration. This is from the Coinbase blog at coinbase.com, and basically a lot of the major transactions that they track within Coinbase were migrated from MongoDB to Postgres. It looks like this happened in 2019, and they go through the process they used to do it. Now, they wanted to do it very accurately because it involves transactions, but they also wanted to do a lot of iteration, so they made it a very repeatable process for what they were testing out. And this is the general process they used. They had their legacy system and the clients against it. What they chose to do is continue to read and write from this legacy database, but then do parallel writes to the new Postgres system. During the period that they were parallel writing, between essentially phase two and phase three, they backfilled the data. In the phase two stage, they were only reading from the legacy database, but they were writing to both. In phase three, they were still writing to both, but reading from the new database system. And then finally, phase four indicates the cutover process. Now, they described their repeatable backfill process. Basically, they were exporting from MongoDB, sending files to S3, then importing them into the Postgres RDS system from there. And because it took a while to do the backfill, they actually loaded the data into temporary tables and recreated all of the indexes. Then at the cutover, they used a transaction to essentially rename the tables. They also mentioned they used triggers that tracked the inserts, updates and deletes to keep the data in sync between the live and the backfill tables, so that's something else they did. And they had a list of the steps they went through to do the backfill and transfer process. They said the automated process to do this data transfer took about four to six hours, but the process they worked out here was done without any downtime of their system at all. So if you're interested in checking out how they did this, definitely check out this blog post, because I found it very insightful.
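
As a rough illustration of the table-swap cutover described above, the rename step can be done in a single transaction so readers never see a partially loaded table. This is just a sketch with hypothetical table names (transactions and transactions_backfill), not the actual schema from the Coinbase post:

BEGIN;

-- move the live table out of the way
ALTER TABLE transactions RENAME TO transactions_old;

-- promote the fully backfilled and indexed table
ALTER TABLE transactions_backfill RENAME TO transactions;

COMMIT;

The triggers they mention would keep the backfill table current while the live table is still receiving writes, which is what makes a swap like this nearly instantaneous.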

[00:02:37] The next piece of content is Bulk loading into PostgreSQL: Options and comparison. This is from highgo.ca, and they compared a few different ways of bulk loading data. They described the system they were using, and they used a CSV file that had 5 million rows and 14 columns and was about 624 megabytes in size. Then they show you the schema. The first thing they used, of course, was the COPY command. They also tried the client-side \copy command that's part of psql, a file_fdw foreign data wrapper, as well as the pg_bulkload tool. Now, I haven't used this tool, but it says it skips shared buffers and WAL logging to be able to load data fast. And here are the results. The rows are the method used: COPY, client-side \copy, the foreign data wrapper and pg_bulkload, and the top of the table is a standard table without an index. As you can see, pg_bulkload wins out in every instance. So bypassing shared buffers and the WAL is a very efficient way of loading data if you need the most speed. But I would still probably resort to just using the general COPY command myself, particularly if you care about your data and you want to be able to handle any sort of failure. Now, they also tried this with an unlogged table, and interestingly it shaved a little bit off of the times for the COPY commands and the foreign data wrapper, but it was actually a little bit longer for pg_bulkload, so it didn't make much of a difference. Then he compared what it's like to load data with even just one simple index, I believe on the country column, and you can see it's suddenly over five times slower. Even with an unlogged table and one index, there still wasn't an appreciable difference. But the well-known process of removing indexes from a table, loading the data and then applying the index came in at around 25 seconds, which was double the time of loading without an index, but less than half the time of leaving the index in place. So the well-accepted standard of dropping indexes, loading the data and then reapplying the indexes is still the fastest way, even from this analysis. But it looks like if you want the most performant way to load data, it is pg_bulkload, which bypasses shared buffers and WAL logging. So, definitely an interesting blog post, and if you want to find out more about the numbers and the process used, definitely check out this post from highgo.ca.
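
To make that comparison concrete, here is a minimal sketch of the drop-index, load, re-index pattern the post lands on. The table, index and file path are hypothetical placeholders, not the schema from the post:

DROP INDEX IF EXISTS idx_people_country;

-- server-side load; use psql's \copy instead if the CSV lives on the client machine
COPY people FROM '/tmp/people.csv' WITH (FORMAT csv, HEADER true);

CREATE INDEX idx_people_country ON people (country);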

[00:05:12] The next piece of content is Upgrading and updating PostgreSQL. So they're making a distinction between upgrades and updates. An update they're classifying as moving to the next point release, and to do that, all you have to do is upgrade your binaries. You don't need to alter the data files at all, so you don't need to do a literal upgrade process. What they're calling an upgrade is where you're moving to a major version release. So if you're going from, say, 9.5 to 9.6, or to 10, 11, 12 or 13, then you need to go through an upgrade process. The simplest one is to just do a pg_dump and then reload the data, but on a large database that takes a long time. The next technique you can use is pg_upgrade, and that basically copies the data, as they say here, on a binary level. But if you use the hard link option, you're able to keep all the data files in place and the upgrade happens very fast; a multi-terabyte database can be done on the order of seconds. Now, one thing they did mention here is, quote, what is important to note here is that pg_upgrade is never destructive. If things go wrong, you can always delete the new data directory and start from scratch. But there is a caveat with the --link option, because if you're using it, you essentially don't have a copy of the data files. And what the Postgres documentation says is that if you use link mode for pg_upgrade, the upgrade will be much faster and use less disk space, but you will not be able to access your old cluster once you start the new cluster after the upgrade. So things definitely change and you can't go back. Generally I do use link mode when I do pg_upgrades, but I always have a replica in place. So essentially I'm upgrading that replica, or I upgrade the master and keep the replica offline as a backup to be able to fall back to if I need to. And then the rest of the post gives an example of going through the process of doing an upgrade using pg_upgrade. So if you're interested in that, you can check out this post from cybertec-postgresql.com.

[00:07:15] The next piece of content is a benchmark of version 4 UUID generation in Postgres. They're describing two different functions that exist for generating UUIDs in Postgres. One is uuid_generate_v4, which is available via the uuid-ossp extension, but as of version 13 there is a gen_random_uuid function that's built in. He actually did this test using version 12 of Postgres, because he used the pgcrypto extension to get gen_random_uuid, and he measured how fast you could generate 10,000 UUIDs across a set of different hardware and platforms. What he found is much better performance for the gen_random_uuid function. So that's definitely the one you're going to want to reach for, particularly since it's built into Postgres 13. In particular, the Windows server was awfully slow with the uuid_generate_v4 function, so if you run Postgres on Windows, you definitely want to be using gen_random_uuid. If you want to learn more about this, feel free to check out this blog post from sushan.info.

The next piece of content is Using PL/pgSQL to Calculate New Postgres Columns. This is from Crunchydata.com, and they're talking about analyzing information from a data science perspective, where generally you want to calculate z-scores right next to the data you're analyzing. To do this, he generated a PL/pgSQL function to create those extra columns containing the z-scores. He has a link to the full function on GitHub, and then he walks through each portion of it. So he has a function where he specifies which schema is going to be changed, the table name, a prefix to prepend to the new column names and an array of the columns he wants to do the calculations for, as well as an indicator of what the primary key is, to be able to update the data easily. The function does a foreach loop through that array of column names, generates what each new column name should be, alters the table to add that new column as a numeric, and then populates the column using a statement that contains two CTEs to do the z-score calculations.
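
Going back to the UUID benchmark for a moment, the comparison can be reproduced roughly like this (timed with psql's \timing; these statements are an illustration, not the author's exact test script):

SELECT count(uuid_generate_v4()) FROM generate_series(1, 10000);  -- requires the uuid-ossp extension
SELECT count(gen_random_uuid())  FROM generate_series(1, 10000);  -- pgcrypto on Postgres 12, built in from 13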

[00:09:36] So if you're interested in using functions to have this ability to assign z-scores and do calculations on particular tables, definitely check out this blog post from Crunchydata.com.
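
As a rough illustration of the kind of function described above, here is a simplified sketch. It is not the author's exact function; it omits the schema and primary-key handling he includes, and the table and column names in the example call are made up:

CREATE OR REPLACE FUNCTION add_zscore_columns(
    p_table   text,
    p_prefix  text,
    p_columns text[]
) RETURNS void
LANGUAGE plpgsql AS $$
DECLARE
    col text;
BEGIN
    FOREACH col IN ARRAY p_columns LOOP
        -- add the prefixed numeric column if it is not already there
        EXECUTE format('ALTER TABLE %I ADD COLUMN IF NOT EXISTS %I numeric',
                       p_table, p_prefix || col);
        -- populate it with the z-score of the original column
        EXECUTE format(
            'WITH stats AS (
                 SELECT avg(%2$I) AS mean, stddev_pop(%2$I) AS sd FROM %1$I
             )
             UPDATE %1$I SET %3$I = (%2$I - stats.mean) / NULLIF(stats.sd, 0)
             FROM stats',
            p_table, col, p_prefix || col);
    END LOOP;
END;
$$;

-- Example call on a hypothetical table:
-- SELECT add_zscore_columns('measurements', 'z_', ARRAY['height', 'weight']);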

[00:09:48] The next piece of content is What's special about the logical level of PostgreSQL WAL? This post is asking the question: when you have a wal_level of replica and move it to a wal_level of logical, it's said to record more information, but what is that extra information? That's what this blog post covers. Predominantly, it talks about differences in what gets recorded, particularly whether a WAL record is a full-page image or not, and full-page images have to happen after a checkpoint to make sure there is all the necessary information to do a recovery. So if you're interested in learning the differences, you can check out this post from highgo.ca.
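
For reference, switching the level is just a configuration change (this is the standard mechanism, not something specific to the post), and it requires a server restart to take effect:

SHOW wal_level;                          -- typically 'replica' by default
ALTER SYSTEM SET wal_level = 'logical';  -- written to postgresql.auto.conf
-- restart the server, then confirm:
SHOW wal_level;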

[00:10:30] The next piece of content is Waiting for PostGIS 3.1: GEOS 3.9, and this is from Crunchydata.com. It's talking about PostGIS and the additional libraries it works with. The one they're mentioning here is GEOS, which has had some enhancements added, specifically to the overlay engine, that allow for greater capability and more performance; they mention it being two times faster than the original. So if you use PostGIS, and specifically GEOS, you may be interested in this content.
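
For context, the overlay engine sits behind calls like the one below, so queries of this shape are the ones that benefit. The table names here are purely hypothetical:

SELECT p.id, ST_Intersection(p.geom, f.geom) AS overlap_geom
FROM parcels p
JOIN flood_zones f ON ST_Intersects(p.geom, f.geom);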

[00:11:04] The next piece of content is Hands-on with osm2pgsql's new Flex output. This is talking about OpenStreetMap data and being able to get it into Postgres. Apparently osm2pgsql has a new output format that can serve as an input for loading into Postgres, and it has a lot more flexibility; you can even use JSONB fields. So if you work with OpenStreetMap data, perhaps you want to check out this post from Rustprooflabs.com.

[00:11:34] The last piece of content is that the PostgreSQL Person of the Week is Carol Arnold. So if you're interested in Carol and her contributions to Postgres, definitely check out this blog post. That does it for this episode of Scaling Postgres. You can get links to all the content mentioned in the show notes. Be sure to head over to Scalingpostgres.com, where you can sign up to receive weekly notifications of each episode, or you can subscribe via YouTube or iTunes. Thanks.

