OrioleDB More Free? | Scaling Postgres 383
In this episode of Scaling Postgres, we discuss the free availability of an OrioleDB patent, pgEdge going open source, pg_duckdb hitting 1.0, and methods to resolve slow ORDER BY LIMIT queries.
Content Discussed
- OrioleDB Patent: now freely available to the Postgres community
- pgEdge goes Open Source
- Announcing Pg_duckdb Version 1.0
- When not to use Postgres
- Introducing Direct Compress: Up to 40x Faster, Leaner Data Ingestion for Developers (Tech Preview)
- FOSSY 2025 and RAGtime with Postgres
- Can Collations Be Used Over citext?
- PostgreSQL Case-Insensitive Search: Handling LIKE with Nondeterministic Collations
- Three Key Attributes of Postgres
- A "TPC-C"-like "extension-pack" for pgbench
- 3 Features I am Looking Forward to in PostgreSQL 18
- Best Practices for Achieving High Availability in PostgreSQL with Patroni and Pgpool
- A Trip To Vienna With Surprises
- My experience at PGIbz 2024 by Jesús Espino
- Contributions for the week 36
- Postgres Ibiza 2025: October 15-17th
- PostgreSQL Conference Europe 2025 Schedule Published
- Automating vector embedding generation in Amazon Aurora PostgreSQL with Amazon Bedrock
- Group database tables under AWS Database Migration Service tasks for PostgreSQL source engine
- Introducing Phased Rollouts for Safer TimescaleDB Upgrades
- The Evolution of Logical Replication in PostgreSQL: A Firsthand Account
- Analytics query goes 6x faster with EDB Postgres Distributed's new analytics engine
- A simple clustering and replication solution for Postgres
- AI Agent Development and Optimization Series (Part 1): Creating First Agent
YouTube Video
Podcast Audio
Transcript
I keep hoping that OrioleDB will eventually become a storage engine option in Postgres, where you can just install an extension and be able to utilize it, maybe only for update-heavy workloads. Well, with some news this week, maybe we're a bit closer to that. Also, be sure to stay till the end for my consulting corner, where I talk about a query optimization that seems more and more necessary for later versions of Postgres as opposed to earlier ones. But I hope you, your friends, family and co-workers continue to do well.
Our first piece of content is “OrioleDB Patent: now freely available to the Postgres community”, this is from supabase.com. I was not aware of this, but OrioleDB, or at least some components of it, has a patent. Supabase, which acquired OrioleDB, says they have now wrapped up all legal activities and fully own the US patent with the number here. Now what they're doing is making a non-exclusive license available to all OrioleDB users, including proprietary forks, in accordance with the OrioleDB license, which I think is Apache 2, but I'm not quite sure. And they say, what is OrioleDB? Basically, it's a new storage system for Postgres that's designed to take advantage of modern hardware and cloud infrastructure. But the most important point for me is having an undo log as opposed to storing all the row versions in the heap itself. So this should make update-heavy workloads perform much, much better. And they say from the benchmarks they've seen, OrioleDB is 5.5 times faster than the heap in a TPC-C benchmark with 500 warehouses. Now they do say the OrioleDB license is based on the Postgres license, although I do have a question: why did they make it available as a non-exclusive license? Couldn't they have just abandoned the patent? They also say “we believe the right long-term home for OrioleDB is inside Postgres itself”. So this is great news, and maybe this patent is one reason why some of the upstream changes haven't been made to Postgres to accommodate it. I'm not quite sure, but check this out if you want to learn more.
Next piece of content, “pgEdge goes Open Source”, this is from pgedge.com. pgEdge is a distributed database: it basically does multi-master replication, and they're focused primarily on the enterprise, I believe. They did have a source-available license for their product, but it was not open source. Well, that's changed, and they said all the core components of pgEdge Distributed Postgres, along with any other pgEdge repositories that previously used the pgEdge Community License, have been relicensed under the permissive PostgreSQL License. So basically, if you want to try a multi-master distributed Postgres, you can go check it out now.
Next piece of content, “Announcing Pg_duckdb Version 1.0”, this is from motherduck.com. DuckDB is basically a vectorized analytical database engine, and the pg_duckdb extension puts it right inside a PostgreSQL process. Now it doesn't transform Postgres into a data warehouse, as they're saying here, but it offers users a path to speed up some analytical queries. And here they're looking at some performance benchmarks with TPC-H. They did the analysis two ways: once with PostgreSQL indexes created, so there are a fair number of indexes to satisfy the queries, and once with PostgreSQL having only primary keys. With all the indexes, the speedup for queries was maybe about four times, which they say is not astounding, but still something. But against the PostgreSQL engine with only primary keys, pg_duckdb is much faster. For example, a query that would time out after 10 minutes now completes in 10 seconds with pg_duckdb. So it looks like pg_duckdb is really good against non-indexed data you want to retrieve. So the scenario would be: maybe you have a bunch of read replicas and you can just use pg_duckdb on them to answer all sorts of queries without having to create an index specific to each one to give good performance. It just gives overall elevated performance without having to add additional indexes. So basically your primary would have all the indexes necessary for your OLTP workload, and then your replicas could use pg_duckdb to answer analytical queries. Now it's not going to be as fast as DuckDB with its compression and column store, since this is still reading against the row store, so there are limits on how fast it can go. And they show an example of how you can use the DuckDB engine to execute a query from within Postgres and even read an external Parquet file, which is a column-oriented data format. With pg_duckdb you can now read from S3, Google Cloud and Azure in all sorts of different file formats, such as CSV, JSON, Parquet, Iceberg and Delta, and even combine that data with active tables in your database. So they show a join here comparing customers to event data that exists in S3, and then they talk about MotherDuck and what kind of services they offer to increase the performance of your analytics. So you can check that out if you're interested. Definitely some great announcements of things happening this week.
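As a rough sketch of that join pattern, here is roughly what a query could look like, assuming pg_duckdb's documented read_parquet() table function and its subscript syntax for accessing columns; the bucket, table and column names here are invented, not from the post.

```sql
-- Hypothetical sketch: all names are illustrative.
-- read_parquet() lets pg_duckdb read external Parquet data from object storage;
-- its columns are accessed with the r['column_name'] subscript syntax.
SELECT c.name,
       count(*) AS event_count
FROM customers c
JOIN read_parquet('s3://my-bucket/events/*.parquet') r
  ON r['customer_id'] = c.id
GROUP BY c.name
ORDER BY event_count DESC;
```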
Next piece of content: there was another episode of Postgres FM last week. This one was on “When not to use Postgres”, and the first thing they mentioned is basically analytics, because the row store is a hindrance, although pg_duckdb mitigates that a little bit. But if you want the best analytical performance, you're going to want to use a column store, and some solutions they mentioned are of course DuckDB, or maybe you want to use ClickHouse, and even Timescale with its hybrid OLTP and analytical capabilities is an option. The next thing they mentioned not to use Postgres for is embedded solutions. Now, you can get Postgres pretty small, but both Nikolay and Michael basically said they'd probably go with SQLite for something like that, although there is also PGlite, so that could potentially be used in embedded circumstances. They also mentioned that storing a bunch of blobs, or binary large objects, in Postgres isn't a great use case. For example, huge videos or audio files you'll definitely want to store elsewhere; maybe you would store a reference to them in the database. They actually mentioned vectors as well. There is the pgvector extension, but Nikolay was mentioning he's hitting some limitations in his work at Postgres AI with the size of HNSW indexes, where the build times and latencies start to really impact performance. He says these types of indexes can work well for millions of vectors, but not necessarily billions, and you're probably going to have to look for a more optimized solution for that use case. They did mention turbopuffer, which I had not heard of, but there is a link down here; it's basically a proprietary scale-out way to handle vectors. He also mentioned that Amazon just recently introduced S3 Vectors, which is a way to store your vectors, index them and answer queries within S3, so that was something he was going to look at. They also mentioned queue-like workloads. Maybe you don't want to use Postgres for those, but they admitted they'd still probably use it as long as you optimize it appropriately, and I think engines like OrioleDB could potentially be better for queue-like workloads. They also mentioned that if you're at the limits of extreme OLTP and you're using something like Vitess with MySQL, you should probably stick with that until additional sharding solutions like PgDog, or Multigres, the Postgres version of Vitess, become more mature. And they also mentioned time series, but with Timescale, or Tiger Data now, that would seem to be a viable solution to use as opposed to community Postgres. But if you want to learn more, you can listen to the episode or watch the YouTube video down here.
Next piece of content, speaking of Timescale: “Introducing Direct Compress: Up to 40x Faster, Leaner Data Ingestion for Developers (Tech Preview)”, this is from tigerdata.com. This is an enhancement to their platform where they now compress the data directly before it even gets into Postgres. This particular feature is in version 2.21 for COPY operations, so as the data is copied, it gets compressed in memory during ingestion to give this much better performance. And if you like charts, look at this chart: no compression is just the small blue line down here, whereas direct compression at a 1,000 or 10,000 batch size is incredibly performant once you ramp up the number of threads you're using. So if you want to learn more about this, definitely check out this blog post.
Next piece of content, “FOSSY 2025 and RAGtime with Postgres”, this is from vyruss.org. I've always liked Jimmy Angelakos's presentations, and he recently gave one at FOSSY 2025 about retrieval-augmented generation with Postgres, so you can check it out here.
And the last piece of content, “Can Collations Be Used Over citext?”, this is from cybertec-postgresql.com. He decided to look at both using case-insensitive ICU collations and using the citext extension to handle searches without regard to case. The first thing he checked was the equals operator: when using an index, the performance was relatively the same; however, when not using an index, citext was four times slower compared to a case-insensitive collation. Then he checked the less-than and greater-than operators, as well as their or-equals variants, and again saw better performance from the custom collation compared to citext. Then he checked out the LIKE operator, but hit an issue because nondeterministic collations are not supported for LIKE. However, they are in Postgres 18, so he checked those out, and again that performance exceeded the citext extension as well. The problem is you can't create an index on nondeterministic collations if you want to try to speed up LIKE searches, so that is one area where the citext column has an advantage: you can put an index on it. But that may change in the future. If you want to learn more, definitely check out this blog post.
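As a minimal sketch of the two approaches he compared, assuming an ICU-enabled build of Postgres (the table and column names here are just illustrative):

```sql
-- Case-insensitive, nondeterministic ICU collation (pattern from the PostgreSQL docs)
CREATE COLLATION case_insensitive (
    provider = icu,
    locale = 'und-u-ks-level2',
    deterministic = false
);

-- Option 1: a plain text column using the case-insensitive collation
CREATE TABLE users_collation (
    email text COLLATE case_insensitive
);

-- Option 2: the citext extension
CREATE EXTENSION IF NOT EXISTS citext;
CREATE TABLE users_citext (
    email citext
);

-- Both approaches treat these values as equal:
SELECT 'Foo@Example.com' COLLATE case_insensitive = 'foo@example.com';  -- true
SELECT 'Foo@Example.com'::citext = 'foo@example.com'::citext;           -- true
```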
Now it's time for my consulting corner. For the last two or so versions of Postgres, I've noticed that if you have an ORDER BY with a LIMIT and an index on the column you're ordering by, the planner tends to choose that index over a more selective index that may exist on a column in your WHERE clause. Unfortunately, this can really cause some performance problems and a lot of times shows up as a performance regression. So maybe you're on version 13 or 14 and you upgrade to, say, 16 or 17, and suddenly the query planner changes: it starts trying to use the indexed column in the ORDER BY, no longer uses that more selective WHERE clause index, and the performance just tanks. Now, one way I know to get around it is to turn the column you're ordering by into an expression. So if it's a timestamp, you can add an interval of 0 seconds or something; if it's an integer, you can add zero; or if it's text, you can append an empty string, something that doesn't change the actual data. Postgres now sees an expression rather than the bare column, so it won't consider that index during planning, but that seems to be a bit of a hack. You could probably also try to figure out some other indexes to add to mitigate it. But I'm curious, have you seen these types of performance regressions in more recent versions of Postgres, and what have you done to mitigate them or work around them? Let me know in the comments.
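A minimal sketch of that no-op expression workaround, with an invented table and columns just for illustration:

```sql
-- Assume these two indexes exist (names invented):
--   CREATE INDEX idx_events_account_id ON events (account_id);
--   CREATE INDEX idx_events_created_at ON events (created_at);
--
-- The planner may walk idx_events_created_at for the ORDER BY ... LIMIT
-- instead of using the more selective index on account_id.
-- Wrapping the sort key in a no-op expression hides the plain created_at
-- index from the planner for ordering, nudging it back to the selective index:
SELECT *
FROM events
WHERE account_id = 42
ORDER BY created_at + interval '0 seconds' DESC
LIMIT 50;
```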