
Build A GPT In SQL | Scaling Postgres 299


In this episode of Scaling Postgres, we discuss how you can build a GPT in 500 lines of SQL code, how to optimize extension loading, the best way to set passwords, and the ability to change generated columns.


Transcript

You know, frequently I cover blog posts on Scaling Postgres that use SQL for one type of programming task or another. Think of the Advent of Code series where someone from Crunchy Data solved all the tasks using SQL. But this week we're going to be talking about a GPT that was written in 500 lines of SQL code. Before we get into that, I hope you, your friends, family, and coworkers continue to do well. Our first piece of content is "Happy New Year: GPT in 500 lines of SQL". This is from explainextended.com, and it's funny, he says he asked ChatGPT whether it could implement a large language model in SQL, and ChatGPT said it can't actually be done.

It's not suitable for implementing large language models, et cetera, et cetera. To which he wrote, "It just proves that if you want something done right, you have to do it yourself". Or, another way to say it in today's parlance: here, hold my beer. So even though this is not a performance-based post, with AI coming to the fore, I really felt this was the strongest post this week and wanted to highlight it on Scaling Postgres, because there's a lot of information here about AI and large language models that I haven't seen described in this particular way. He really starts from the bare bones and builds up to what a GPT, a generative pre-trained transformer, actually is and how you build one, because he's going to build it in SQL.

For example, he shows code with functions for each of the particular steps to build this, and what we're basically building, he says, is GPT-2. But just to forewarn you, your brain may start to explode as you get deeper into it and hit the different function equations, matrices, math, and things of that nature. I still maintain this is a very good read to try and put more of the pieces together of how AI works. Now, the actual resulting source code is at the bottom, where you can expand the source, and this is all 500 lines of SQL code to get it working. As a further endorsement, four different commenters gave him kudos for writing this. So if you're interested at all in learning more about AI or even some SQL, I definitely encourage checking out this blog post.

Next piece of content- "OPTIMIZING POSTGRESQL FUNCTIONS WITH PRELOADED LIBRARIES". This is from cybertec-postgresql.com. He's talking about a situation where he's using some PostGIS code, and on the first invocation, it takes about ten milliseconds, but on a subsequent run of it, it took 0.6 milliseconds, so less than a millisecond. The reason is that when a database connection is first established, this extension, PostGIS, is not available yet. It has to be loaded the first time it's used inside of a connection. Now of course this has implications when you do a restart, because these libraries aren't loaded yet, and all the connections have to be established to try and load up the different extensions as well.

But you could also see some of these performance issues if you're not using a pooler like pgBouncer, for example, because for every new connection, that first query is going to be slow. The way around it is to load the libraries early, and there's a setting for that. They say it's a fairly old feature in Postgres, session_preload_libraries, which lets you define libraries to preload when a session starts. Now, the first call is still a little slow at 2.8 milliseconds, but definitely faster than without it. So this is something to consider if you're seeing performance problems potentially related to this.
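As a rough sketch of how you might set this, here is one way to preload the PostGIS library for new sessions. The library name 'postgis-3' and the database name are assumptions on my part; check what your PostGIS packaging actually installs.

```sql
-- Preload the PostGIS shared library for all new sessions
-- (requires superuser; applies to sessions started after the reload):
ALTER SYSTEM SET session_preload_libraries = 'postgis-3';
SELECT pg_reload_conf();

-- Or scope the preload to a single database or role instead:
ALTER DATABASE gis_db SET session_preload_libraries = 'postgis-3';
```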

Next piece of content- "Postgres password encryption without leaking credentials". This is from launchbylunch.com. He's talking about different ways that you can set a password for a user in Postgres. If you just create a user, or alter a user and set the password to a string, you're basically sending the password in the clear over the network if you're connecting across the network to the database. So he looks at some solutions to that. Now, first off is Postgres encryption. So he says up until version 10, there were only MD5 options available, and by default, when you create a password, it will go ahead and encrypt it, if that's the default encryption method. Or for Postgres versions after 10, you could use SCRAM-SHA-256. Basically, everyone should be using the SCRAM-SHA-256 as opposed to MD5 now.

So basically, if you create a user and set the password to a literal string, the database will take it and hash it with whatever the default encryption method is, MD5 or SCRAM-SHA-256. But you can also send passwords that are already hashed. You can send an MD5 hash as the literal string, and it will store that in Postgres's password store. Or you can even send a SCRAM-SHA-256 verifier as the literal string, and it will just store that and use it for the password comparisons. So the way you would do this is to hash the password before creating the user, or before altering the user and changing the password. You could use external tools to do this, but there are also methods for it in libraries: libpq has one, the Java driver has them as well, and he also mentions a Node library. By pre-hashing the password, you avoid sending it in cleartext over the network or having it potentially end up in a log somewhere. So this is the ideal way to set passwords in Postgres. But check out this blog post if you want to learn more.
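To make the difference concrete, here's a minimal sketch. The user name is made up, the MD5 format is shown only because it's easy to reproduce by hand, and the hash itself is a placeholder you'd compute client-side; in practice you'd use SCRAM-SHA-256 via one of the libraries he mentions, or psql's \password command, which also hashes on the client.

```sql
-- Sends the password in cleartext over the wire (and possibly into logs):
ALTER USER alice PASSWORD 's3cret';

-- Sends only a pre-computed verifier. For the legacy MD5 scheme the stored
-- value is 'md5' || md5(password || username), computed on the client,
-- e.g.:  echo -n 's3cretalice' | md5sum
ALTER USER alice PASSWORD 'md5<hash computed client-side>';
```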

Next piece of content- "Waiting for PostgreSQL 17 - ALTER TABLE command to change generation expression". This is from depesz.com and he's talking about generated columns and apparently, in version 17, you can now update them. So you can alter the table to change what expression it uses. So in this case he created the column by extracting the year from a timestamp, but then he altered the table and set the expression as this new one, basically making a two-digit year, and it worked just fine. The disadvantage of this is it has to do a whole table rewrite, which is definitely not ideal, but at least you can do it now as of 17. The thing he really didn't like about this is now suddenly when you alter a table, you have SET EXPRESSION as expression, but then you also have SET DEFAULT expression. So it just starts to get confusing the different terms here in the help file, but a good addition to have in the upcoming Postgres 17.

Next piece of content- "Postgres Postmaster File Explained". This is from crunchydata.com and the postmaster.pid file gets created whenever Postgres is started in the data directory, and it gives some important information that he lists out here. So the first line is the PID or the process ID of the parent Postgres process. So it's just an integer value here that corresponds to the process running on the system. The second line is the data directory for Postgres. The third is the epoch value of the Postgres start time. The fourth line is the port Postgres is listening on. The fifth line is the uni socket directory. The 6th line is the TCP IP interfaces being listened on. So for example, asterisks just mean all of them. Line seven is the shared memory key and ID, and lastly, the current status, which is ready, and the rest of the blog post just goes into detail about some of those different pieces of information. So if you want to learn more, definitely check out this blog post.

There was another episode of Postgres FM last week. This one was on "pgBadger", and in this episode Michael was joined by Alicja, who is a program manager for Azure Database for PostgreSQL at Microsoft. They talked all about pgBadger, which is primarily a log analyzer, although it does more than that: you point it at your Postgres log files and it helps you analyze them. I hate to say it, but I haven't really used pgBadger yet, because I just tend to analyze the logs on the systems themselves, and rarely do people actually send me logs. But if you're running your own systems, maybe you'd like to check out this tool and listen to the episode here, or check out the YouTube video down here.

Next piece of content- "Understanding PostgreSQL Aggregation and Hyperfunctions' Design". This is from timescale.com and the first part of this post is definitely Postgres related. The second part is more relevant for Timescale, but still, I really enjoyed how it explained things in this blog post. So he talks about aggregates versus functions. Basically functions transform an input into an output, generally operating one row at a time. Aggregates do an operation across many rows. He has a lot of great visual aids explaining how the process of aggregation works, and even some animations here as you can see how he's explaining how an average works. 

Basically, you add up the values of the column you want to average, count the rows, and divide the sum by the count to get the average. Then he also talks about how this works in parallel: two different processes each go through the averaging calculation on their portion of the rows, and then a final function combines them to give the result. From the Timescale perspective, he talks about them using two-step aggregation in their hyperfunctions. So basically, when you do an average, there's a transition function that accumulates state, and then the outer call does the final calculation. I thought this blog post was pretty interesting, explaining how Postgres operationally does what it does, and also covering Timescale at the end.
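As a toy illustration of the sum-and-count idea behind an average (the orders table and amount column are invented for the example):

```sql
-- The built-in avg() effectively keeps a running (sum, count) state as its
-- transition step and divides the two in its final step; doing it by hand:
SELECT sum(amount)::numeric / count(amount) AS avg_by_hand,
       avg(amount)                          AS avg_builtin
FROM orders;
```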

Next piece of content- "Introducing pg_query for Postgres 16- Parsing SQL/JSON, Windows support, PL/PpgQL parse mode & more". This is from pganalyze.com. They released another version of their open-source library called pg_query which basically takes SQL queries and turns them into a syntax tree. They actually are using some of the Postgres source code as part of this library to be able to extract this information. I didn't know it, but this has apparently been around for ten years, so that's quite a long time. So this of course talks about the library and all the enhancements added to the most recent version. So check this blog post out if you want to learn more.

Next piece of content- "How to speed up COUNT(DISTINCT)". This is from eversql.com. He talks about many different ways to speed up COUNT in general, but also COUNT(DISTINCT). So this is quite a long blog post, but they were saying if you can handle some inaccuracies, meaning you're okay with an estimate, you could do things like HyberLogLog if you want something more accurate. If you could use a covering index for the particular column that you want to run DISTINCT on, that can give you some better performance as well as some additional techniques. So check out this blog post if you want to learn more.

The last piece of content is "Introducing a new community wiki page 'Operations cheat sheet'". This is from postgres-road.blogspot.com and he includes a link to the Postgres wiki where there's now an operations cheat sheet. So this gives a whole lot of information on how to manage Postgres from an operational perspective. So definitely check out this resource if you want to learn more.
