
NVME Wins? | Scaling Postgres 388
In this episode of Scaling Postgres, we discuss PG17 vs. PG18 benchmarks across storage types, more about Postgres locks, sanitizing SQL, and whether a faster software and hardware environment can cause performance problems.
Content Discussed
- Benchmarking Postgres 17 vs 18
- #PostgresMarathon 2-006: Mysterious max_locks_per_transaction
- #PostgresMarathon 2-007: Should we worry about pg_blocking_pids()'s observer effect?
- #PostgresMarathon 2-008: LWLock:LockManager and prepared statements
- Sanitized SQL
- Long Queries and the Art of Full Scan
- What's Our Vector, Victor? Building AI Apps with Postgres
- PostGIS Performance: Indexing and EXPLAIN
- Keep Calm - TDE for PostgreSQL 18 Is on Its Way!
- The Fundamental Interconnectedness of All Things with Boriss Mejías
- Waiting for PostgreSQL 19 – Add IGNORE NULLS/RESPECT NULLS option to Window functions.
- Cost Optimization for PostgreSQL: Practical Tips for Technical Teams
- From Text to Token: How Tokenization Pipelines Work
- JIT: so you want to be faster than an interpreter on modern CPUs…
- Understanding and Setting PostgreSQL JDBC Fetch Size
- Exploring PostgreSQL 18's new UUIDv7 support
- Don’t Skip ANALYZE: A Real-World PostgreSQL Story
- Understanding Disaster Recovery in PostgreSQL
- Three Interviews
- Getting Ready for PGConf.EU 2025
- Join us for a Postgres Women Breakfast
- The PostgreSQL Village
- Seattle Postgres User Group Video Library
- How we used pg_query to rewrite queries to fix bad query plans
- Introducing the Next Generation of Heroku Postgres – Unlocking Performance, Scale, and Zero-Friction
- PGConf NYC 2025 Recap: A Clearer View of Postgres
YouTube Video
Podcast Audio
Transcript
The more I hear about the performance differences between Postgres on local NVMe drives versus network storage such as AWS's EBS volumes, the more I seriously consider moving to local NVMe drives for the performance. Now, I know data durability would be impacted by this, because network-attached storage is generally much more durable than, say, a single SSD on an instance, but the performance benefits of NVMe are pretty amazing, as we'll talk about this week. Also, be sure to stay until the end, where in my Consulting Corner segment we're going to talk about a case where upgrading to a faster system might have actually led to a performance problem. But I hope you, your friends, family and co-workers continue to do well.
Our first piece of content is “Benchmarking Postgres 17 vs 18”. This is from planetscale.com, and they did a whole slew of benchmarks comparing 17 to 18, including all of the different asynchronous I/O settings in Postgres 18 such as worker and io_uring, and they used different storage backends as well. They used an r7i.2xlarge on AWS with 8 virtual CPUs, 64 GB of RAM, and four different types of disks. One was a gp3 with 3,000 IOPS, which is basically what you get by default when you're using EBS. Then they looked at one with 10,000 IOPS and a throughput of 500 megabytes per second. Then they looked at the io2 disk type, which is much more performant, with IOPS set at 16,000. And finally they looked at local NVMe storage on an i7 instance, where the IOPS was around 300,000, so much higher. They used sysbench for these tests and tested different numbers of connections: one connection, 10 connections, 50 connections, so not a ton of connections. They also looked at different range sizes, which is how many rows are going to be returned, because they only used the OLTP read-only benchmark, since async I/O only works with selects at this point in time.

So checking out a single connection, this is what the performance looked like for those four different storage types. One thing you'll notice right off the bat is that on the storage systems with more latency, which is basically the three network-attached storage types, Postgres 18 with both sync and worker outperformed 18 with io_uring and also Postgres 17. And of course, on the NVMe drive there was basically no difference; all of them performed pretty equivalently. That's one thing we'll notice: io_uring frequently underperforms across a lot of these settings. Then he took a look at a single connection with a range of 10,000 rows being returned, and the performance starts to improve and normalize a little bit, getting closer to the NVMe. So as you pull back more rows, the performance starts to normalize. Then he checked out higher concurrency with 50 connections, and here the differences between Postgres 17 and 18 basically went away. Even with async I/O, everything pretty much normalized at 50 connections. But the stark difference that is still clear is that the performance scales with how many IOPS you can essentially put through the volumes. It's not necessarily a direct correlation, because CPU utilization may be a factor as well, but clearly the NVMe outperforms the other storage solutions. Now, when you bring it up to a wider range of rows returned, the performance differences start to normalize, but they were saying that's mostly because they're starting to hit CPU bottlenecks, although they don't chart that. So I'm assuming that if this was tested on a machine with more CPUs, maybe you would still see this type of graph where it scales based upon the disk IOPS, because you can tell here that queries per second is around 12,000 for NVMe when pulling back around 100 rows at a time. If you increase that to 10,000 rows at a time, so 100-fold more rows being returned, the queries per second drop below 4,000 for the NVMe, and again, they said this was mostly due to some CPU bottlenecking. They also looked at the cost comparison down here, showing how all of these systems are around the $500 per month mark, except for the io2 one with 16,000 IOPS, which is around $1,500 a month.
Now, some of this, of course, may be biased, because PlanetScale has their PlanetScale Metal on Postgres, where they advocate using NVMe drives and emphasize the performance differences, so that's kind of what this blog post is also doing. But still, the performance looks pretty impressive. The other thing he mentioned is that he was surprised that io_uring did not do better. He said there was one instance where it did win: for example, here at 50 threads with a range of 10,000 rows returned, it did quote-unquote win, but with statistical variation, did it really win? Not really sure. But because of that, he said, well, maybe workers is the way to go now. Then again, it probably also depends on your workload, so you'll definitely want to test that out. And in this particular benchmark he did not change the number of I/O workers at all; it was left at the default of three, just because that's a question I had. But if you want to learn more about this, I definitely encourage you to check out this blog post.
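If you want to poke at these settings yourself, here's a minimal sketch of the Postgres 18 knobs this benchmark was varying. The values are illustrative, and io_method can only change at server start:

```sql
-- Postgres 18 async I/O settings compared in this benchmark (illustrative values):
ALTER SYSTEM SET io_method = 'worker';   -- alternatives: 'sync' (closer to 17-style behavior) or 'io_uring'
ALTER SYSTEM SET io_workers = 3;         -- the default, which this benchmark left unchanged
-- io_method is a postmaster-level setting, so restart Postgres, then re-run the
-- read-only workload (they used sysbench's oltp_read_only) against each storage type.
```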
Next piece of content: we're continuing on with the #PostgresMarathon series at postgres.ai. The next post is talking about the “Mysterious max_locks_per_transaction”. Basically, you usually hit this setting if, say, you're doing a pg_dump and you have a lot of objects; you may hit the limit as you're exporting because the default of 64 is insufficient. Or you may get errors in the log that say "out of shared memory. Hint: you might need to increase max_locks_per_transaction." Now, as he says here, "out of shared memory" is not about the buffer pool at all. The setting basically helps, quote, define the size of the main lock table that resides in shared memory. He shows an example of the source code here, but then he proceeds to actually test things out and empirically show how Postgres works, to better understand it. Unfortunately, what he was left with was more questions than answers. This is a pretty long post, so if you're interested in the mystery of max_locks_per_transaction that he discovered, feel free to go ahead and check it out.
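As a rough sketch of where you typically encounter and adjust this setting (not tied to his specific experiments):

```sql
-- The shared lock table holds roughly
--   max_locks_per_transaction * (max_connections + max_prepared_transactions)
-- entries, so a pg_dump that touches thousands of relations can run it dry.
SHOW max_locks_per_transaction;                      -- 64 by default

ALTER SYSTEM SET max_locks_per_transaction = 256;    -- requires a server restart

-- While a big dump or DDL-heavy transaction runs, you can watch usage with:
SELECT count(*) AS heavyweight_locks FROM pg_locks;
```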
Next piece of content, also from the #PostgresMarathon series, is “Should we worry about pg_blocking_pids()'s observer effect?” So pg_blocking_pids() basically allows you to give it a process ID and determine what is blocking that particular process. If you have a process that's running and it's hanging, you can pass its PID into this function and it will tell you what is blocking it, what's preventing it from running. Now, he says you want to be careful with this, because the docs say frequent calls to this function could have some impact on database performance, because it needs exclusive access to the lock manager's shared state for a short time. A lot of times, the way you may use this is to query some rows from pg_stat_activity and call this function for every PID. So you want to be careful about doing something like that, because you're going to hit this essentially observer effect: the fact that you're trying to track something will actually cause performance issues. He did do some tests with regard to this, and he basically found a negligible effect in what he was testing; for example, he noticed maybe a 10% impact on throughput. But still, it's just something to be cautious about and aware of.
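To make the pattern he's warning about concrete, this is the kind of monitoring query people often run (a sketch; run it sparingly, for exactly the observer-effect reason discussed):

```sql
-- For every backend that something is blocking, list the PIDs holding it up.
-- Each pg_blocking_pids() call briefly needs exclusive access to the lock
-- manager's shared state, so polling this frequently is what causes the
-- observer effect.
SELECT pid,
       pg_blocking_pids(pid) AS blocked_by,
       wait_event_type,
       wait_event,
       state,
       query
FROM pg_stat_activity
WHERE cardinality(pg_blocking_pids(pid)) > 0;
```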
The next post in the #PostgresMarathon series is “LWLock:LockManager and prepared statements”. Last week on Scaling Postgres, we covered his Marathon post 002, where he was talking about how, if you have queries with a lot of partitions, or tables with a lot of indexes, or a lot of joins, you're probably going to have to do slow-path locking, and that can really impact planning performance. Well, one way to get around that is to use prepared statements, because they essentially do planning once, or a set number of times as he shows here, to basically lock in an execution plan so you don't have to keep doing that planning. He talks about the process of Postgres running a custom plan five times and then switching to a generic plan on the sixth execution. But in his empirical analysis it was a little confusing what was going on, and he finally discovered what was actually happening. On the first five executions it is doing a custom plan, so it builds a plan with the params, and the planner takes all the different locks: the table plus the five indexes. On the sixth execution it uses the generic plan type; this is where it actually builds the generic plan, but the planner still does its locking, so you still have the six locks, the table plus the five indexes. And then finally, the seventh and higher executions use the cached generic plan, and it only locks at execution time; it only needs to lock the table during this stage. But if you want to learn more about that, definitely check out these blog posts.
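Here's a minimal sketch of that custom-plan to generic-plan lifecycle. The table name is made up, and plan_cache_mode is just shown as a way to look at both plan types yourself:

```sql
-- Hypothetical table with several indexes on it.
PREPARE get_order (bigint) AS
  SELECT * FROM orders WHERE id = $1;

-- Executions 1-5: a custom plan is built each time, and the planner locks the
-- table plus its indexes. Execution 6: the generic plan gets built (the planner
-- still locks everything). Executions 7+: the cached generic plan is reused and
-- locks are only taken at execution time.
EXECUTE get_order(42);

-- You can force either plan type to compare them:
SET plan_cache_mode = 'force_custom_plan';
EXPLAIN EXECUTE get_order(42);
SET plan_cache_mode = 'force_generic_plan';
EXPLAIN EXECUTE get_order(42);   -- the generic plan shows $1 instead of 42
```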
Next piece of content: “Sanitized SQL”. This is from ardentperf.com, and he's posting about a desire people have had to be able to sanitize SQL, particularly in logs, because it would be great to give developers or engineers more access to logs to diagnose issues, but the problem with that is potentially leaking protected personal information. He set out some requirements, developed a PL/pgSQL function to do it, and then ran through a whole set of tests to verify its output. So if you have a need to sanitize SQL, maybe you'd like to check out this blog post and the function he created.
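Just to illustrate the general idea, this is not his function, but a crude hypothetical version that strips literals out of a statement before it ever reaches a shared log might look like this:

```sql
-- Hypothetical illustration: replace quoted string literals and bare numbers
-- with '?' so values like emails or IDs never appear in shared logs.
SELECT regexp_replace(
         regexp_replace(
           $$SELECT * FROM users WHERE email = 'a@b.com' AND id = 42$$,
           '''[^'']*''', '?', 'g'),                   -- string literals
         '\m\d+\M', '?', 'g') AS sanitized;
-- => SELECT * FROM users WHERE email = ? AND id = ?
```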
Next piece of content: “Prairie Postgres Birthday Meetup”. This is from hdombrovskaya.wordpress.com, and what's interesting here is the embedded presentation, “Long Queries and the Art of Full Scan”. This is talking about techniques to use when you're not really using an index for a query because you're pulling back so much: how can you efficiently pull back data, particularly when you're joining other tables and have a lot of data to return? You can check this out if you're interested in that.
Next piece of content: “What's Our Vector, Victor? Building AI Apps with Postgres”. This is from pgedge.com, and it's mostly focused on using the pgvector and pgvectorize tools within Postgres to basically build RAG-like apps, where you take a query intended for an LLM, convert it into an embedding, and compare that to embeddings you have already stored in your Postgres database. He says if you use a tool like pgvectorize, you can avoid a lot of application code for this type of functionality and just have it run within Postgres. Now, it's not too hard to build a RAG app in an application language as well, but he walks through using the vectorize extension for that purpose, and he has a link to his recorded webinar on it here as well.
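To show the core pattern a tool like pgvectorize is wrapping, here's a hand-rolled sketch using pgvector directly; the table name, the tiny 3-dimension vectors, and the sample rows are all just for illustration:

```sql
CREATE EXTENSION IF NOT EXISTS vector;

-- Real embedding models use hundreds or thousands of dimensions; 3 keeps this readable.
CREATE TABLE documents (
  id        bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
  content   text,
  embedding vector(3)
);

INSERT INTO documents (content, embedding) VALUES
  ('postgres scales reads with replicas', '[0.11, 0.92, 0.40]'),
  ('vacuum keeps bloat under control',    '[0.85, 0.10, 0.52]');

-- Embed the user's question outside the database, then pull the nearest chunks
-- and hand them to the LLM as context (the retrieval step of RAG).
SELECT content
FROM documents
ORDER BY embedding <=> '[0.10, 0.90, 0.35]'   -- <=> is pgvector's cosine distance
LIMIT 5;
```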
Next piece of content: “PostGIS Performance: Indexing and EXPLAIN”. This is from crunchydata.com. I personally haven't analyzed any GIS information, but this post talks about asking a pretty typical question, like how many populated places are there within each country, using a particular data set that has this type of information. When he queries it, it took over two seconds to run. But he says you can analyze it and see where the high costs are; basically, this nested loop is where an immense amount of the cost is coming from. You can actually place an index on the geometry of the populated-places table, and now when you run that query it returns in 200 milliseconds, so basically 10 times faster. So it's a very quick, short blog post on how to add indexes to geometry, specifically using a GiST index, to get better performance.
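The fix he describes boils down to something like this; the table and column names here are placeholders rather than the exact ones from his data set:

```sql
-- Spatial GiST index on the populated-places geometry column.
CREATE INDEX pop_places_geom_idx
  ON pop_places
  USING gist (geom);

ANALYZE pop_places;

-- The countries-to-places join can now use the index for the containment test.
SELECT c.name, count(*) AS populated_places
FROM countries c
JOIN pop_places p ON ST_Contains(c.geom, p.geom)
GROUP BY c.name
ORDER BY populated_places DESC;
```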
Next piece of content: “Keep Calm - TDE for PostgreSQL 18 Is on Its Way!”. This is from percona.community, and this is about transparent data encryption. Again, this is only available in the Percona distribution for PostgreSQL, so basically they have to add patches to it. The intent is for this to eventually be an independent extension, but at this point it isn't. They're still doing the tests on 18, and it will be ready essentially when it's ready.
And the last piece of content is “The Fundamental Interconnectedness of All Things with Boriss Mejías”, and it looks like he works at OpenAI, so he discusses Postgres and its operation.
And now it's time for my Consulting Corner. I had a client who recently upgraded their system. They didn't change the major version; they went a few point releases higher, maybe six or eight point releases, I don't remember off the top of my head, but it was the same major version. The hardware, however, was improved by two or three generations, and the operating system version was similarly a three- or four-version jump. As soon as the migration was done, everything seemed fine. The performance was much better than the old system; even though it had about the same number of CPUs, the performance seemed twice as good. But then they suddenly ran into an issue where an ORDER BY ... LIMIT query started causing huge performance problems. Now, we quickly mitigated it using a technique I think I discussed in a previous Scaling Postgres, where we were able to just increase the amount of statistics collected on the columns in the WHERE clause, and that resolved it. The other way you could resolve it is to append essentially a no-op to whatever the ORDER BY expression is. So if you're ordering by an integer, add a zero; if you're ordering by text, append an empty string; if you're ordering by a timestamp, add zero seconds, or something like that. Basically, make Postgres think the ORDER BY is a function, and then it won't try to use an index in a reverse scan to satisfy that query (there's a sketch of both techniques below). But what's interesting is that this query with this performance problem was happening on the old system and never caused an issue, and it was actually happening on the new system for a number of days. But then something triggered a cascading series of events, probably related to locking and the amount of activity, that caused a performance problem. And again, what is odd is that it's the same major version that was running on the old hardware, which never experienced a problem. So the only theory I have at this point is that perhaps the slower environment, the slower hardware, was actually keeping in check some type of potential race condition, like different locks happening in fast succession, or some tipping point that was encountered on the fast hardware, where the slow hardware protected against this race condition by making it harder to happen. At least that's my theory right now; I don't have another good answer for it. But if you have any theories about moving a database from a slower system to a faster system and suddenly running into locking contention issues that weren't happening before, I'd be interested to know.
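Here's a rough sketch of both mitigations; the table and column names are invented for illustration:

```sql
-- 1. Gather more detailed statistics on the WHERE-clause columns, then re-analyze,
--    so the planner stops choosing the bad ORDER BY ... LIMIT index strategy.
ALTER TABLE events ALTER COLUMN account_id SET STATISTICS 1000;
ANALYZE events;

-- 2. Or turn the ORDER BY column into an expression (a no-op), so the planner
--    won't walk an index backwards just to satisfy ORDER BY ... LIMIT.
SELECT *
FROM events
WHERE account_id = 42
ORDER BY created_at + interval '0 seconds'   -- timestamp column: add zero seconds
LIMIT 10;
-- integer column:  ORDER BY id + 0
-- text column:     ORDER BY name || ''
```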