UUID vs Bigint Battle!!! | Scaling Postgres 302
Join Over 1,000 Engineers & Get New Episodes Weekly!
In this episode of Scaling Postgres, we discuss when and how you should use UUIDs or not, how to optimize space with column order and arrays, whether it is better to tune parameters or queries and what devs should know about Postgres with Andrew Atkinson.
- Postgres Performance Demystified Webinar
- UUID Benchmark War
- UUID in Postgres: PgSQLPhriday #015
- PGSQL Phriday #015: Primary keys: UUID, CUID, or TSID?
- Postgres.AI bot interview for PGSQL Phriday #015
- Reducing table size with optimal column ordering and array-based storage
- Tuning Parameters vs Tuning Queries: presentation and recording
- What Developers Should Know About Postgres With Andrew Atkinson | Rubber Duck Dev Show 114
- Modern SQL
- Postgres again elected DBMS of the Year in 2023, but I'm worried
- PostgreSQL is Enough Gist
- Podcast highlights on benchmarking Postgres performance, from Path To Citus Con Episode 11
- Benchmarking Postgres Vector Search approaches: Pgvector vs Lantern
- Pgvector vs Lantern part 2 - The one with parallel indexes
- Contemplating Decentralized Extension Publishing
- How to load JSON data in PostgreSQL with the COPY command
- Changing a Column from Integer to Boolean in One Transaction
- Load StackOverflow's StackExchange data in Postgresql
- Disabling “alter system” in PostgreSQL?
- Data Navigation with PostgreSQL Cursors – Part I
- Data Navigation with PostgreSQL Cursors – Part II
- Big News in Databases — Review 2023
- Introduction to PostgreSQL in OpenShift
- PGXN Tools Docker Image Updated
- pgenv gains a new command (and contributor!)
- Building the PgConf.Dev Programme
- FOSDEM 2024 in Brussels
- Storing IoT Data: 8 Reasons Why You Should Use PostgreSQL
- Reducing Amazon Aurora Costs
- Crunchy Bridge: Announcing Postgres Insights in Your CLI
- Empowering Control: Production and Developer Support for Self-Managed TimescaleDB
Before we get started, I want to remind you that my Postgres performance webinar is coming up the week that this episode is being released, and there will be another webinar next week as well. The webinar is "Postgres Performance Demystified. Learn A Simple Framework To Achieve 1,000 Improvements". It's entirely free and I welcome everyone to attend. The link to it will be the first one in the episode for this week, but I hope you, your friends, family, and coworkers continue to do well.
The actual first piece of content this week is "UUID Benchmark War". This is from ardentperf.com. There is a series of blog posts due to PGSQL Phriday being about UUIDs and whether you should use them or not, and especially should you use them as primary keys, although it seems like a lot of the blog posts were focusing on should you use them for primary keys or not. But this was the most interesting post of the bunch because he ran some benchmarks. So he looked at using a text field to store UUIds, using the native UUID data type, using UUID version 7, which has a timestamp order in the first part of the UUID, and compared those to bigints. He did the benchmark by, I think, creating a table of 20 million rows and then inserted 1 million rows as fast as they could.
He says, quote "...the results successfully demonstrate how each option is better than the previous... until you get to bigint". So, I think this part should have been bolded as well, because he says, quote "...UUIDv7 performance is so good, it basically matched bigint". So here are some of the results you could see. The time to insert 1 million rows for the text UUID was 410 seconds, UUID v4 was 375, for UUID v7 was 290, and for bigint it was 290. So identical. So not only is there a speed difference, but there's also a size difference. So UUID v4 is 63% smaller than the text version, UUID v7 is 7% smaller than the UUID v4 version, and bigints is 25% smaller than the UUID v7 version. So that's one disadvantage of the UUIDs compared to bigints is that they're twice the size in terms of bigints being eight bytes, whereas the UUID is 16 bytes.
So he shows the graph here of the performance. He compared the table size here and broke it down by index and primary key. The records showed his benchmark stats showed a time graph performance which is transaction per second across seconds. He thinks this drop here is due to checkpointing. He shows the table size differences tracked over time. So definitely what to get out of this. Never use a text version of a UUID if you can definitely use the built-in data types with Postgres when you can. He shows how much more data reading goes on with the text in the UUID v4 compared to the other two, as well as caching efficiency. It's really bad for the text and the v4 UUIDs. So I thought this was fascinating and definitely encouraging for UUID v7, so can't wait for that to land in Postgres. Hopefully, it will be in 17.
Next post related is "PGSQL Phriday #015: Primary keys: UUID, CUID, or TSID?". This is from mydbanobook.org. Basically, her TLDR is "Please don't". But she says, well yeah, it depends, but really her opinion is you should really never use UUIDs for your primary key, and you should only be using integers. Now I haven't actually heard of CUID, but apparently, it's been deprecated because it was supposed to be even more unique than a UUID, and a TSID is basically a timestored unique identifier. So that's basically what UUID v7 is going to be. But she definitely likes Postgres auto-generated integers and not really UUIDs. But you can check out this blog post if you want to learn more.
The next UUID-related post is "POSTGRES.AI BOT INTERVIEW FOR PGSQL PHRIDAY #015". This is from cybertec-postgresql.com and here they use the new Postgres.AI bot. If you don't know what that is, check last week's scaling Postgres episode where I covered it there to interview it, asking it the UUID question. Basically, what it is and how it should be used in their systems. I'm not going to really cover this, but you may find it interesting to take a glance at what AI thought.
Next piece of content- "Reducing table size with optimal column ordering and array-based storage". This is from pganalyze.com, and this is Lukas's "Five minutes of Postgres", and it's actually his 100th episode, so congratulations, that's great. Now in this episode, he actually covers some older posts talking about a particular topic. Because of the way that Postgres handles data types, it pads smaller ones, so there's actually a way to order them to save space on disk. I've never really used that because, with the applications that I tend to work with where I'm called in to consult on, they're typically changing fairly regularly.
But if you have a table you know is going to be static and you know it's going to be a huge table, then it can be an advantage to optimize your column order in that table to make the table as small as possible. So he covers talking about that and also some of the advantages of array-based storage. So that's another way to get more compressed data in the table as opposed to storing many rows of data stored in a single column in an array. He demonstrated this in his own application, pgAnalyze, where they saw a 5x size improvement by moving some data into arrays. So that was pretty interesting. So if you want to learn more, definitely check out his piece of content.
Next piece of content- "Tuning Parameters vs Tuning Queries: presentation and recording". This is from hdombrovoskaya.wordpress.com and she posted a presentation as well as the slides here, talking about the differences between optimizing parameters versus optimizing queries, and how a lot of people say, hey, I want to really tune my database. Whereas tuning your parameters of the database gets you, as she says here, maybe 10 or 20% improvements. Maybe 50% improvements if something's really misconfigured. But tuning queries can give you 10-fold improvements or 100-fold improvements. So if you're looking to optimize performance, doing it at the query level, how the data is laid out, that will give you the most bang for the buck. But check out this presentation if you're interested.
Next piece of content- "What Developers Should Know About Postgres With Andrew Atkinson". This is from The Rubber Duck Dev Show, and this is my interview with Andrew Atkinson. We were talking about his book- High Performance PostgreSQL for Rails, and also, of course, what developers should know about Postgres. So this is a longer piece of content. It's an hour-long podcast essentially, but if you're interested you can watch it on YouTube or your favorite podcasting provider. He did include a discount link for his book, I think about 35% off. So I'll include that link in the show notes as well.
The next piece of content, there was another episode of Postgres FM this week. This one was on "Modern SQL", and in it, Michael interviewed Marcus Winand, who runs use-the-index-luke.com and modern-sql.com and is the author of SQL Performance Explained. Basically, he talks about modern SQL, which to him is basically the SQL post, the 92 standard. So the 92 standard is what so many people use. It's what most ORMs are using. But the versions that have come out after 92 cover JSONB and window functions and a lot of different nonrelational concepts. So that's essentially what they discussed. They discussed some of the new things coming to the SQL standard. He also mentioned the number one thing to learn about modern SQL if you haven't learned it yet, is window functions because of the power that they offer. But if you want to learn more, you can definitely listen to the episode or watch it on YouTube down here.
Next piece of content- "Postgres again elected DBMS of the year in 2023, but I'm worried". This is from bloganayrat.info and he's talking about Postgres being elected the database of the year and all the accolades that have come with Postgres over the years and how it looks to be in a really strong position. But his fear, or the reason that he says he's worried, is not because of the state of Postgres, but it's basically the state of the DBA and how basically DBAs are going extinct. If that happens, what are we going to be missing if organizations no longer have any DBAs, and what knowledge is not going to be available anymore? Because a lot of developers just bring up a database on their cloud hosting provider and they just get to coding. So I thought this was an interesting read and if you want to learn more you can definitely check it out.
The next piece of content is actually a gist that references what someone wrote before about Postgres being enough like "Just Use Postgres for Everything". I definitely endorse that. I prefer using Postgres for doing search, and using Postgres for queues when it makes sense, but I don't think I actually use it for everything. The second post here "Simplify: move code into database functions", which to me, sets off alarm bells as a developer. I would much rather have the business logic in my application code than in the database. There are very few cases where I would want some of that handled by the database. Then they talk about background and Cron jobs. I tend to just use Cron on the instance itself, but if you're hosted, using one of these extensions is definitely something you could do. But you can check out this gist if you want to learn more.
Next piece of content- "Podcast highlights on benchmarking Postgres performance, from Path To Citus Con Episode 11". This is from citusdata.com. I haven't had a chance to listen to this yet, but you can definitely check out this episode if you want.
Next piece of content- "Benchmarking Postgres Vector Search approaches: Pgvector versus Lantern". This is from tembo.io and apparently, there's another extension called Lantern that lets you work with LLMs or AI in Postgres. They compared what the differences are in them because they are taking a slightly different tact. I think Lanterns, they said it's based on USearch and what they found is that Lantern does do faster index generation, but pgvector was better with performance, so seems like some bit of trade-offs going there.
Then they have a second article talking about, quote "The one with parallel indexes" where they tested version 6 of pgvector, but the build time still exceeded Lantern because you can do parallel index builds in Lantern as well. But check this out if you want to learn more about that.
The last piece of content is "Contemplating Decentralized Extension Publishing". This is from justatheory.com, and this is talking about the extension network that David Wheeler is working on as a part of tembo.io. Typically, you have to register to be able to push some sort of package or library to some of these library ecosystems. With Go, it's decentralized. So basically he did an in-depth study of how it actually works because he's contemplating potentially using this technique for creating the Postgres extension network. But if you want to learn more, definitely check out this blog post.