
SQL Data Analysis, pgBouncer Setup, Understanding Deadlocks, Tuning & Optimization | Scaling Postgres 163


In this episode of Scaling Postgres, we discuss doing data analysis with SQL, sophisticated pgBouncer setups, understanding how to avoid deadlocks, and an intro to tuning and optimization.

Transcript

All right. I hope you, your friends, family, and coworkers continue to do well. Our first piece of content is "Practical SQL for Data Analysis". This is from hakibenita.com and it has the subtitle "What you can do without Pandas". So this is a tour de force blog post. It's almost like a little ebook. It is so long and they even have a table of contents for it. Basically, they're talking about Pandas, which, to my understanding, is the Python library used in data science to do data analysis. This post talks about how you can use SQL right in the database, particularly Postgres, to answer a lot of the same questions you would normally answer with Pandas. Now, this post is so long I'm going to primarily just look at the table of contents and speak briefly about it.

But even if you don't read this full post, I'm going to be saving it as a reference to be able to look up different techniques for doing data analysis. Now, in the first part, where he's comparing SQL versus Pandas performance, he's talking about how when you do analysis with Pandas, you have to pull all the data in from some data source and process it there, whereas it's usually more efficient to do some of that work within SQL. So for example, he talks about only pulling the columns you need, only pulling the rows you need, and doing grouping in the database itself as opposed to within Pandas or within an application, so it's applicable to application development as well.
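Just to make that concrete, here's a minimal sketch of what pushing that work into the database looks like. The sales table and its columns are hypothetical names used for illustration, not something from the post:

```sql
-- Let the database project, filter, and aggregate instead of pulling every row
-- into Pandas. Table and column names here are hypothetical.
SELECT customer_id,
       date_trunc('month', sold_at) AS month,
       sum(amount)                  AS total_amount
FROM sales
WHERE sold_at >= '2021-01-01'                         -- only the rows you need
GROUP BY customer_id, date_trunc('month', sold_at)    -- grouping done in the database
ORDER BY month, customer_id;
```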

So he talks a little bit about that and also about memory performance, because when you're pulling down a whole copy of the data, it's much more efficient, again, to just pull what you need. Now, he goes into some different tools that Postgres or other database systems offer, such as common table expressions to iteratively build a query. He talks about how you can generate data, select random data, and do sampling with TABLESAMPLE. Then he goes into things like rollups, cubes, and grouping sets, and gets into window functions and how you can do conditional expressions within a SELECT. Then he also goes into things such as linear regressions, interpolation, and binning and how to do that type of analysis just using SQL. So this is a very comprehensive post. I highly suggest you check it out.
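As a flavor of the techniques covered, here's a small sketch of a few of them using a hypothetical orders table (again, the names are placeholders, not the post's):

```sql
-- GROUPING SETS: several aggregation levels in one pass.
SELECT status, channel, count(*)
FROM orders
GROUP BY GROUPING SETS ((status), (channel), ());

-- A window function plus a conditional expression inside the SELECT.
SELECT id,
       amount,
       CASE WHEN amount >= 100 THEN 'large' ELSE 'small' END AS size,
       rank() OVER (ORDER BY amount DESC)                    AS amount_rank
FROM orders;

-- Binning with width_bucket: place amounts into 10 equal-width buckets from 0 to 1000.
SELECT width_bucket(amount, 0, 1000, 10) AS bucket, count(*)
FROM orders
GROUP BY bucket
ORDER BY bucket;
```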

The next piece of content- "pgBouncer Setup, Extended Architecture, Use-cases and leveraging SO_REUSEPORT". This is from enterprisedb.com and similar to the first post, this is another tour de force blog post covering a lot of different items with regard to PgBouncer. Now, they do reference this post that was done late last year that we also covered in a previous episode of Scaling Postgres about spinning up multiple instances of PgBouncer using the same port. Now, the reason this is of interest is because PgBouncer is single-threaded. If you want to run this on the server and have a dedicated PgBouncer machine, well, nowadays you probably have two CPU cores or four or eight. So if all you're running is PgBouncer and you're only running one copy of it, you're not utilizing all of the machine's resources and you're going to want to run multiple instances of it.

Basically, if you're able to use this capability to reuse the same port, then you can kind of do load balancing between them, because connections come into that port and can be handled by any of the processes servicing it. So they talked about this in a previous post, but this one goes into more depth on different ways to set it up, including installing from source, because they mentioned there were some edge cases the previous post maybe didn't cover. Then, even if you can't build from source, they present another technique where you can essentially do the same thing: reuse the same PgBouncer port and set up multiple PgBouncers on an instance to leverage more than just one CPU core. They go through the whole process of setting that up.

And that's just the first half of the post. The second half is about innovative ways you can use this. So first, reusing the port helps you scale, because you'll have multiple PgBouncers running. It also gives you load balancing, because incoming connections get handed off to whichever PgBouncer process picks them up. You can also do read-write or read-only routing by setting up particular databases in PgBouncer. They discuss that here and give an example of a setup configured to do just that. Then they talk about running PgBouncer in what they call a hostile environment, which means you could potentially get a lot of connections hitting it, and ways to head off denial of service attacks. So they covered that here as well, finishing up with an architecture for setting up PgBouncer for high resiliency. So, another example of a really comprehensive post talking about PgBouncer. This is another post I highly suggest you check out.
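For orientation, here's a minimal pgbouncer.ini sketch of those two ideas, port reuse and per-database routing. The hostnames and database names are placeholders, this is not the configuration from the post, and so_reuseport requires pgBouncer 1.12 or later:

```ini
[databases]
; read-write traffic routed to the primary (hypothetical hosts and names)
myapp_rw = host=primary.example.com port=5432 dbname=myapp
; read-only traffic routed to a replica
myapp_ro = host=replica.example.com port=5432 dbname=myapp

[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
; with so_reuseport, several pgbouncer processes can listen on the same port
; and the kernel spreads incoming connections across them
so_reuseport = 1
pool_mode = transaction
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
```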

The next piece of content- "POSTGRESQL: UNDERSTANDING DEADLOCKS". This is from cybertec-postgresql.com and the first thing they cover is how a deadlock happens. Basically, you have two transactions that are happening at the same time and you say you have two rows in a table. One row starts an update so it's locking this row. Another transaction starts an update and locks the second row. Then this transaction tries locking the second row and it says it's now waiting for this lock to release. It continues to wait. Now this transaction tries to update row one. So they're trying to update each other's lock and essentially that starts with our column here, the deadlock timeout. So by default, this is 1 second and then if it doesn't resolve within that 1 second, one of these transactions is chosen to win. So this one succeeds in doing the update and this one gets an error that it was a deadlock and basically, it's going to stop that transaction.

The winning one gets committed, the other does not. Then they talk about how to fix and avoid deadlocks. Basically, I agree with him here that there is no magic configuration parameter. The most important thing to do is to order your operations consistently. So for example, imagine that you have 100 transactions happening concurrently, which in large installations can definitely happen. You don't want a case where one transaction has a set of rows it needs to update and another transaction has a set of rows it needs to update, but they touch those rows in different orders.

So one transaction updates row one and then also needs to update row two, in that order, whereas the other one updates row two and then updates row one. What you want is for the order to be consistent across your transactions. So when each transaction does its updates, you want it to update the rows in, say, order of the ID. Then you never get the case where they're updated out of order. So even if these transactions are updating the same rows, as long as they do it in the same order, you will never get a deadlock. That's the most important thing to know about deadlocks: the order matters. Now, you may get delays while one transaction waits for another to complete, but as long as things happen in the same order, you won't get a deadlock. If you want to learn more about that, you can definitely check out this post.
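To make that concrete, here's a minimal sketch using a hypothetical accounts table. The first part shows the interleaving that produces the deadlock, and the second part shows the fix of taking locks in a consistent order, for example by id:

```sql
-- Session A:
BEGIN;
UPDATE accounts SET balance = balance - 10 WHERE id = 1;  -- locks row 1

-- Session B:
BEGIN;
UPDATE accounts SET balance = balance - 10 WHERE id = 2;  -- locks row 2

-- Session A (now waits on B's lock on row 2):
UPDATE accounts SET balance = balance + 10 WHERE id = 2;

-- Session B (waits on A's lock on row 1; after deadlock_timeout one session is aborted):
UPDATE accounts SET balance = balance + 10 WHERE id = 1;

-- The fix: every transaction takes its locks in the same order, e.g. by id.
BEGIN;
SELECT id FROM accounts WHERE id IN (1, 2) ORDER BY id FOR UPDATE;  -- lock rows in id order
UPDATE accounts SET balance = balance - 10 WHERE id = 1;
UPDATE accounts SET balance = balance + 10 WHERE id = 2;
COMMIT;
```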

The next piece of content- "An Introduction to PostgreSQL Performance Tuning and Optimization". This is from enterprisedb.com and this is yet another tour de force blog post that again has its own table of contents because of all the content it covers. So they focus on what to do with the bare metal configuration, how you can tune the operating system, all the different configuration parameters in Postgres itself, and then how to analyze different queries and workloads. Again, a great reference post to keep around when you have questions about configuration in regards to tuning.

The next piece of content- "Implementing Incremental View Maintenance for PostgreSQL (Part I)". This is from yugonagata-pgsql.blogspot.com and he's talking about a feature that they've started working on. So nothing's final yet, but the feature they are working on is something they call IVM or Incremental View Maintenance and it's specifically with regard to materialized views. So a view is just a virtual table, a materialized view. It's no longer virtual but it makes it into essentially its own physical table and it has the physical rows that it references. But the thing about that is that it is a cached view and periodically you will need to refresh it from the base tables and they're talking about the refresh materialized view plan.

Now if you have a lot of data behind this view, refreshing it could take a very long time. But what they're looking into is a way to do Incremental View Maintenance where the materialized view stays updated based upon changes in the base tables and it does this without you having to manually set up triggers or anything like that. Apparently, it may be using triggers under the covers that we can't see, but basically, when you do an update of the base table, the materialized view gets updated if it's defined as being incremental. Now it may be a while before this hits Postgres, but it's definitely a great feature that they're working on. If you want to learn more about it, you can check out this post.
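For what it's worth, the patch being discussed proposes syntax roughly along these lines; since it's a work in progress, this could change or never be committed:

```sql
-- Proposed (not-yet-committed) syntax from the IVM work; treat this as a sketch.
CREATE INCREMENTAL MATERIALIZED VIEW customer_totals AS
SELECT customer_id, count(*) AS orders, sum(amount) AS total
FROM orders
GROUP BY customer_id;
-- Once defined as incremental, changes to "orders" are applied to the view
-- automatically, without a manual REFRESH MATERIALIZED VIEW.
```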

The next piece of content- "Choice of Table Column Types and Order When Migrating to PostgreSQL". This is from blog.crunchydata.com. They're talking about the data types you should choose when migrating to Postgres and basically, when you're choosing to migrate numerical types, they basically have recommendations around what type you should convert it into. The first types you should generally use are INTs and bigints. So if it's just a number, if it's just an integer, go ahead and just use the Int or the bigint. If you have something that's a decimal, then you want to use the numeric or the decimal type. But there are also floats and reels available if you don't need something that's more accurate.

So they discuss that through the post, and then they also go into a little bit about alignment: if you want to conserve space, the best way to lay out a table is to put your largest fixed-size types first, followed by your smaller fixed-size types. So for example, bigints go first and then ints second. That's because of alignment padding: you can actually end up with wasted space if you order things differently. So first you want your large fixed-size types, then your smaller fixed-size types, and then your variable-length types thereafter, say text strings and things of that nature. So if you want to learn more about that and how that works, you can definitely check out this post.
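Here's a minimal sketch of a table laid out that way. The table and columns are hypothetical, but the ordering follows the rule just described:

```sql
CREATE TABLE events (
    id          bigint      NOT NULL,  -- 8-byte fixed-size columns first
    occurred_at timestamptz NOT NULL,  -- 8 bytes
    account_id  integer     NOT NULL,  -- 4-byte fixed-size columns next
    status      integer     NOT NULL,  -- 4 bytes
    amount      numeric,               -- variable-length columns last
    description text
);
```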

The next piece of content- "CHAR(1) to Boolean transformation while migrating to PostgreSQL". This is from migops.com. They're talking about when you're converting from Oracle to Postgres. Apparently, Oracle doesn't really have a binary data type and how you should do that conversion when working with Postgres. They also talk about ways you can set up check constraints in order to filter out rows that are coming in from Oracle. So this post goes into how you can transform from a CHAR(1) into a Boolean. So if you have an interest in converting data from Oracle to Postgres, definitely check out this post on a technique to do that.

The next piece of content- "Dramatical Effect of LSE Instructions for PostgreSQL on Graviton2 Instances". This is from akorotkov.github.io. He's talking about the Graviton2 processors that are on EC2 instances. Just so you know, they're using the development branch of PostgreSQL 14. So this is not a live branch. This is something that's in active development, and looking at different instructions that have been enabled in each of these branches, and then looking at the transaction performance across the number of clients. As you can see exactly, it says here when you have LSE instructions enabled for the Graviton processors in the development branch of Postgres 14, you get dramatically consistent transactions per second across high numbers of clients when using the LSE and not so much when using the base development branch. So definitely interesting results for the Graviton2 processors. I guess as we move forward, more people may start using these types of ARM processors as opposed to just Intel or AMD. So if you're interested in that, definitely check out this blog post.

The next piece of content- "Getting Started with PostgreSQL Set Operators". This is from arctype.com, and they're talking about different ways of doing unions. So combining two different queries so you can do a UNION, you can do a UNION ALL, which doesn't remove duplicates. You could do intersections of two different tables, you can do exceptions, or use the EXCEPT keyword to accept different clauses as well as ORDER the resulting result set. So if you want a simple post on how to do set operations with Postgres, definitely check out this blog post.

The next piece of content- "Pgpool-II Logging and Debugging". This is from b-peng.blogspot.com. They're talking about doing logging and debugging in Pgpool II. So if you're interested in configuring that, you can check out this blog post.

The next piece of content- "TLS: DEMYSTIFYING COMMUNICATION ENCRYPTION IN POSTGRESQL". This is from cybertec-postgresql.com. Now, this post just covers TLS. There's really not too much of a mention to Postgres, but if you want a background on TLS and how it does its encryption, this is definitely a post you can check out.

The last piece of content, the PostgreSQL person of the week is Corey Huinker. So if you're interested in learning more about Corey and his contributions to Postgres, definitely check out this blog post.
