
Hello Babelfish, Planner Deconstruction, Exist & Not Exist, Fun With SQL | Scaling Postgres 190


In this episode of Scaling Postgres, we discuss the open sourcing of Babelfish, deconstructing the Postgres planner, when to avoid exist & not exist and having fun with SQL.


Transcript

All right. I hope you, your friends, family, and coworkers continue to do well. Our first piece of content is "Announcing Open Source Babelfish for PostgreSQL: An Accelerator for SQL Server Migration". This is from babelfishpg.org. This is a follow-up announcement where AWS has developed Babelfish, which is a way for Postgres to basically talk like a Microsoft SQL Server. They've now open-sourced the code and created this site to explain more about it. So basically, it is wire-compatible with Microsoft SQL Server. It supports the SQL syntax of Microsoft SQL Server and the T-SQL language, so all the different constructs like stored procedures, variables, exceptions, and things of that nature, as well as TDS, which is SQL Server's wire protocol.

So it basically makes a Postgres server act like a Microsoft SQL Server. Now, to get started with it, you have to apply a patch against PostgreSQL, which they have on GitHub, as well as install different extensions: one for the T-SQL language, one for the TDS wire protocol, another one for new data types found in T-SQL, as well as a specific one for money support. There is also a tool, I think called Compass, that lets you do an audit of your Microsoft SQL Server to confirm that you can indeed migrate to using Babelfish on PostgreSQL. This is a great addition for people who are working with Microsoft SQL Server and want to potentially migrate to PostgreSQL, so I definitely encourage you to check this out.
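As a rough idea of what that extension setup might look like, here's a minimal sketch, assuming a PostgreSQL build that already carries the Babelfish patch; the extension names are the ones published by the Babelfish project, and the exact steps on babelfishpg.org should be treated as authoritative:

```sql
-- Minimal sketch on a Babelfish-patched PostgreSQL server. The project's extensions are:
--   babelfishpg_common (extra T-SQL data types), babelfishpg_money (money support),
--   babelfishpg_tsql (the T-SQL language), babelfishpg_tds (the TDS wire protocol).
CREATE EXTENSION IF NOT EXISTS babelfishpg_tds CASCADE;
-- CASCADE is assumed to pull in the other Babelfish extensions as dependencies;
-- if it doesn't in your build, create babelfishpg_common, babelfishpg_money and
-- babelfishpg_tsql explicitly first.
```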

There's also another post by AWS itself called "Goodbye, Microsoft SQL Server. Hello Babelfish". This post talks about how to get this up and working on Aurora, because that's currently the only service where AWS offers Babelfish. They don't offer it for their RDS version of Postgres, which is a little bit ironic because the first page of the post talks about how "Many of our customers are telling us they want to move away from proprietary database vendors to avoid expensive costs and burdensome licensing terms". The irony is that you're moving to Aurora, which is a proprietary database. Now, it's Postgres compatible, but it's definitely not open source and you can only get it in one place, which is AWS. So a little bit of irony there, but I assume eventually they're going to make this work with their other PostgreSQL hosting options as well. The post walks through how to create an Aurora instance and get it up and working so it can talk like a Microsoft SQL Server. So if you're interested in that, you can check out these two blog posts.

Next piece of content- "How we deconstructed the Postgres planner to find indexing opportunities". This is from pganalyze.com and this is a very interesting post where, for their service, they actually went through the PostgreSQL open-source code and extracted the planner, because they want to get to the point where they can do automatic index recommendations with their tool. They also open-sourced, I think, a part of that tool where you can give it a schema and it gives some index suggestions. So basically, they utilized a tool called libclang to extract the planner code from the Postgres source.

For the planner, it was almost half a million lines of Postgres source, and they basically got it working as an independent library and built it into their separate tool. So it runs the planner utilizing information about the schema, and some statistics as well (which we'll talk about), to give index recommendations. They have an example of a relatively complex query here. They ran it against a Postgres server, and it came out with a particular plan. Then they ran it against the pg_plan function that they developed.

The plan estimation was within 1% of the actual production estimate, so that seems pretty close. And basically, they want to use this in order to do index recommendations. The other thing that they're adding to this, which we noted was missing before, is a measure of selectivity by utilizing the pg_statistic table. So with that information, plus information about the schema, they should be able to get more accurate index recommendations. This is definitely something they're actively working on and improving, and it seems like they're making some pretty good progress on it. So if you want to learn more about this, definitely check out this blog post.
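For a sense of where that selectivity information lives, here's a minimal sketch that reads the human-readable pg_stats view (a wrapper over pg_statistic); the orders table name is just a hypothetical example:

```sql
-- Per-column statistics the planner uses for selectivity estimates.
SELECT attname,
       null_frac,          -- fraction of NULLs in the column
       n_distinct,         -- estimated distinct values (negative = fraction of rows)
       most_common_vals,
       most_common_freqs
FROM pg_stats
WHERE schemaname = 'public'
  AND tablename  = 'orders';   -- hypothetical table
```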

The next piece of content- "Three Cases Against IF NOT EXISTS/IF EXISTS in Postgres DDL". This is from postgres.ai. When you create a table, you can say create it only if it doesn't already exist. Now, he's saying where not to use this is in migrations, and a migration is basically an application framework sending the database commands to alter the schema. If you want to create a new table or add a new index, typically those are done through migrations that send commands to the database to move it to the next version of the schema as far as the application framework is concerned. Usually, you can roll those back if there's a problem. The problem with using IF NOT EXISTS in this type of migration is that you're now in a relatively unknown state, because you don't know whether the table already existed when you ran it.

So basically he's saying this is a no-no, and I definitely agree. Personally, in my applications, I've never used IF EXISTS or IF NOT EXISTS when creating tables, because again, you want to know if there's a failure, and if there's a failure, figure out why. If you need to roll something back, there's always the rollback functionality in the migration tools. The next one he says not to do is DROP TABLE IF EXISTS. So again, this can cause definite problems if you're doing it as part of an application framework migration toolset. Now, I do use CREATE TABLE IF NOT EXISTS or DROP TABLE IF EXISTS all the time if I'm doing tests on temporary tables, but that's not for an application framework migration.

But if I'm working essentially outside of that framework and I'm creating temporary tables to maybe do some other DBA work, I use DROP TABLE IF EXISTS or CREATE TABLE IF NOT EXISTS frequently. Typically, though, those are temporary tables that only exist for a span of time for particular work I'm doing. The other area he talks about is CREATE INDEX CONCURRENTLY IF NOT EXISTS. Now, this is a big no-no, because if there's an error while creating an index concurrently, it leaves the index behind in an invalid state. If you then rerun the statement with IF NOT EXISTS, it sees the index name already exists and silently skips it, so you can be left with invalid indexes without realizing it. So it's definitely something you wouldn't want to do, but these are definitely some great recommendations. And if you want to learn more, you can definitely check out this blog post.
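As a small illustration of that pitfall (the table and index names here are hypothetical): if a concurrent build fails, the index is left behind marked invalid, and a rerun with IF NOT EXISTS silently skips it. You can spot and rebuild such indexes like this:

```sql
-- Suppose this concurrent build fails partway (lock timeout, cancellation, etc.);
-- the index remains, but flagged invalid:
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_orders_customer_id
    ON orders (customer_id);

-- Rerunning the same statement now does nothing, because the name already exists.
-- Find any invalid indexes left behind:
SELECT c.relname
FROM pg_index i
JOIN pg_class c ON c.oid = i.indexrelid
WHERE NOT i.indisvalid;

-- The explicit fix: drop the invalid index and rebuild it without IF NOT EXISTS.
DROP INDEX CONCURRENTLY idx_orders_customer_id;
CREATE INDEX CONCURRENTLY idx_orders_customer_id ON orders (customer_id);
```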

The next piece of content- "Fun with SQL in Postgres: Finding Revenue Accrued Per Day". This is from blog.crunchydata.com and they had a previous post where they developed a query to show the monthly recurring revenue for a subscription service. In this one, they wanted to look at revenue accrued per day: as payments come in, the running total of revenue increases. This blog post works through how to do that, shows the query changes that are necessary, and explains how it works. So if you want to learn more about working with SQL, you can definitely check out this blog post.
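Just to illustrate the running-total idea (not the exact query from the post), here's a minimal sketch assuming a hypothetical payments(paid_at, amount) table:

```sql
-- Daily revenue plus a cumulative (accrued) total, using a window over the grouped sums.
SELECT paid_at::date AS day,
       SUM(amount) AS revenue_that_day,
       SUM(SUM(amount)) OVER (ORDER BY paid_at::date) AS revenue_accrued
FROM payments
GROUP BY paid_at::date
ORDER BY day;
```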

The next piece of content- "Exploring PL/pgSQL part two: implementing a Forth-like interpreter". So this is from eatonfield.com and this post is a follow-up to a previous one where he takes it to the next level in terms of working with PL/pgSQL and developing an interpreter. So if you want to learn more about that, you can check out this blog post.

The next piece of content- "The PostgreSQL Timeline Concept". This is from highgo.ca and he's talking about the concept of the timeline when working with WAL files. There's a timeline ID built into each WAL file name. Whenever you do a point-in-time recovery restore or promote a replica, a new timeline is started, and the post discusses that, as well as the history file that Postgres uses to keep track of timelines. So if you want to learn a little bit more about how Postgres works in terms of the timeline ID, definitely check out this blog post.
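If you want to see this on your own system, the timeline ID is encoded in the first eight hex digits of every WAL file name, and on a running server you can read it with pg_control_checkpoint() (available since Postgres 9.6):

```sql
-- Current and previous timeline IDs as recorded in the control file.
SELECT timeline_id, prev_timeline_id
FROM pg_control_checkpoint();
```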

The next piece of content- "Patroni & etcd in High Availability Environments". This is from blog.crunchydata.com and they're talking about how one way to do high availability is with the Patroni and etcd tools; working together, they help you create a cluster of PostgreSQL servers that can detect failures and promote certain replicas. Now, this talks about particular failures that can happen with this kind of setup, because typically you have a Patroni service running on each replica as well as an etcd service. One of the things they talk about is that, because of the heartbeat process that constantly checks that the Postgres database is up and running, you're going to want to ensure that your network has very high availability and essentially no downtime.

Otherwise, you're going to start getting failures in the high availability setup and you'll get failovers when you aren't expecting them. The other issue is that disk writes to etcd, or any kind of disk issues, can impact this as well. So they go into detail about things you need to watch out for, such as under-resourced file systems, IO contention, networking delays, et cetera, and also how to diagnose the system by looking at different logs to find the cause of potential high availability issues. So if you want to learn more, you can check out this blog post.

The next piece of content- "Can't Resize Your Postgres Kubernetes Volume? No Problem!". This is from blog.crunchydata.com and they're talking about how to do volume resizes with regard to Kubernetes and how this works when running Postgres, particularly with their Postgres operator. So if you want to learn more, you can check out this blog post.

The next piece of content- "Vehicle routing optimization with Amazon Aurora PostgreSQL-Compatible Edition". This is from aws.amazon.com and they're talking about essentially the traveling salesman problem, where you have a particular salesman or vehicle that needs to go to certain destinations and you want the most efficient route to do that. Now, this uses Aurora, but because it relies on PostGIS and pgRouting along with OpenStreetMap data, you can use Postgres for this. So anywhere they're using Aurora, you should be able to use PostgreSQL as well. So if you want to learn more about how to do that, you can check out this post.
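As a minimal sketch of the routing building block (not the full vehicle-routing workflow from the post), here's a pgr_dijkstra call assuming a hypothetical ways edge table prepared from OpenStreetMap data:

```sql
-- Shortest path between two graph vertices using pgRouting's pgr_dijkstra.
SELECT seq, node, edge, cost, agg_cost
FROM pgr_dijkstra(
       'SELECT gid AS id, source, target, cost, reverse_cost FROM ways',
       1234,   -- start vertex id (hypothetical)
       5678,   -- end vertex id (hypothetical)
       directed := true
     );
```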

The next piece of content, the PostgreSQL person of the week is Tatsuro Yamada. If you want to learn more about Tatsuro and his contributions to Postgres, definitely check out this blog post.

The last piece of content, we had another episode of The Rubber Duck Dev Show this past Wednesday. We discussed how much time you should spend planning versus doing in terms of your coding work. So if you're interested in that content, you can check out this episode.
