Fast Random Rows, Fast Distinct Queries, Temp & Unlogged Tables, Terminating Connections | Scaling Postgres 164

In this episode of Scaling Postgres, we discuss selecting random rows quickly, faster distinct queries, how replication handles temp and unlogged tables and how to terminate connections.

Transcript

All right. I hope you, your friends, family, and coworkers continue to do well. Our first piece of content is "Getting random rows faster. Very much faster". This is from Magnus Hagander's blog at hagander.net, and he talks about ways you can get fast random rows from a table. One simple way to do it is to order by the random() function and use a LIMIT of one. That should get you what you need, a random row. However, this is probably the slowest way. He also mentions that Jonathan Katz posted a different way to do it on Twitter that's faster but more complicated. But then he presents his own idea for getting fast random rows, and that's using the TABLESAMPLE capability of Postgres. Now, normally TABLESAMPLE returns a portion of a table, such as, he says, 10%. However, you can add the tsm_system_rows extension, which allows you to return a specific number of rows instead.
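
As a rough illustration of the two approaches being compared, here is a minimal sketch; the table name is hypothetical and not from his post.

    -- Slow but simple: sort the whole table by a random value.
    SELECT * FROM mytable ORDER BY random() LIMIT 1;

    -- Fast: pick a random disk block and return one row from it.
    CREATE EXTENSION IF NOT EXISTS tsm_system_rows;
    SELECT * FROM mytable TABLESAMPLE SYSTEM_ROWS(1);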

He shows the syntax it uses and then gives some speed indications. Using his example with 5 million rows, the ORDER BY random() approach returns in about 2 seconds and Jonathan Katz's method returns in about 1 second, but the TABLESAMPLE method using this extension returns in 0.5 milliseconds. So, super fast. Then, looking at 10 million rows, the time essentially doubles, or at least close to it, for both of the first two examples, but it's still about the same speed using TABLESAMPLE with this extension. Now, how does this magic work? Basically, it chooses a random disk block in the table. As a consequence, if you're pulling more than one row, it won't be that random.

So, for example, with the slow ORDER BY random() method you do get three genuinely random rows, whereas if you use TABLESAMPLE and ask for three rows, it chooses one random disk block and then reads what's available in that block. So you get a random starting point but contiguous rows from there. His point is that the performance gains are so huge you can work around this by calling it more than once, so that each of the rows you select is chosen randomly. He has a CTE he developed that gets you close to doing that. I thought this was an interesting post, and it gives you a great way to pull a random sample of rows from even huge tables.
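
His CTE isn't reproduced here, but as a minimal sketch of the idea under the same assumptions (a hypothetical table name, and accepting a small chance of duplicates if the same block is picked twice), you could combine several single-row samples yourself:

    -- Each TABLESAMPLE reference picks its own random block,
    -- so every row comes from a different random starting point.
    SELECT * FROM mytable TABLESAMPLE SYSTEM_ROWS(1)
    UNION ALL
    SELECT * FROM mytable TABLESAMPLE SYSTEM_ROWS(1)
    UNION ALL
    SELECT * FROM mytable TABLESAMPLE SYSTEM_ROWS(1);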

The next piece of content- "How we made DISTINCT queries up to 8000x faster on PostgreSQL". This is from blog.timescale.com and they're talking about their extension, TimescaleDB, for Postgres and how they were looking to make improvements on distinct queries. Now their focus is on time series data, so they're probably looking at a number of devices that are returning readings over a period of time and that's the example they're using here. Now they talk about distinct queries and why they're slow because basically when it scans an index, Postgres needs to go through every item in that index to look for unique values. However, there is a concept in other databases called a skip scan. So basically it skips over redundant values.

So if you have 10 million data points from a device and you're pulling data from that table, it can jump from device to device instead of having to look through all 10 million entries in the index. Now, they did say that it looks like this skip scan capability may make it into PostgreSQL 15, which is due in the fall of 2022. But if this is something that's important for you now, it is available in the TimescaleDB extension, and that's pretty much what this post talks about. The other thing to keep in mind is that you can use a recursive CTE to get similar performance. It may not be quite as fast, but essentially the CTE walks the index, pulling out the next distinct value at each step, which gives you something equivalent to a skip scan. So if you want to make your DISTINCT queries fast, you could implement this solution instead of using the extension or waiting until Postgres 15.
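
The exact CTE from the post isn't shown here, but the well-known "loose index scan" pattern looks roughly like this; the readings table and device_id column are hypothetical names for illustration:

    WITH RECURSIVE distinct_devices AS (
        -- Grab the smallest device_id from the index.
        (SELECT device_id FROM readings ORDER BY device_id LIMIT 1)
        UNION ALL
        -- Then repeatedly grab the next device_id greater than the last one,
        -- skipping over all of its repeated index entries.
        SELECT (SELECT r.device_id
                FROM readings r
                WHERE r.device_id > d.device_id
                ORDER BY r.device_id
                LIMIT 1)
        FROM distinct_devices d
        WHERE d.device_id IS NOT NULL
    )
    SELECT device_id FROM distinct_devices WHERE device_id IS NOT NULL;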

So basically, as I mentioned, the skip scan skips over the repeating values, looking only for the unique values in the index, and then they have some benchmarks they developed. Of course, it's usually going to be faster the fewer devices you have, or the lower the cardinality, because there are fewer distinct values it needs to find in that index. The scenario is a set of devices collecting data, with a certain number of rows per device, and they tested five different queries looking for specific things, which I'll mention in a second. You can see the improvement ratios they're seeing range from 26 times up to over 8,000 times.

So the scenarios they looked at were: what is the last reported time of each device in a paged list; what is the time and the most recently reported set of values for each device in a paged list; what was the most recent point for all devices that reported in the last five minutes; which devices reported at some time today but not within the last hour; and which devices reported yesterday but not in the last 24 hours. Each of these examples rates the performance based on the number of devices. Now, the thing to keep in mind is that this technique has a few requirements. Number one, the query must use the DISTINCT keyword on a single column.

So it has to be a single column. That column must be first in the index, followed by any other ORDER BY columns. For example, in their query it's DISTINCT ON (tags_id), so the index needs to have tags_id first, followed by time, to give you the highest level of performance. The other thing to keep in mind is that the index needs to be a B-tree index, and it of course needs to match the ORDER BY in your query. So even though this is describing a feature of TimescaleDB, they do give you some insight into what's coming in Postgres 15, as well as a recursive CTE you could use in lieu of it to get probably similar performance. If you're interested in that, definitely check out this blog post.
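
As a sketch of those requirements, assuming a hypothetical readings table with tags_id and time columns (not necessarily their exact schema):

    -- The DISTINCT ON column comes first in a B-tree index,
    -- followed by the ORDER BY column.
    CREATE INDEX ON readings (tags_id, "time" DESC);

    SELECT DISTINCT ON (tags_id) tags_id, "time"
    FROM readings
    ORDER BY tags_id, "time" DESC;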

The next piece of content- "To WAL or not to WAL? When unlogged becomes logged...". This is from fluca1978.github.io and he's talking about the difference between logged and unlogged tables and temporary tables and asking the question what gets passed over to a replica? So we have a primary database that is doing streaming replication to a replica. What shows there? Now, in terms of his testing, he created a normal table, an unlogged table, and a temporary table. He inserted a number of rows in there. You can see in terms of the insertion time that the normal table took 4.7 and the other two took about 1.7-1.8. Why? Because these are not logged to the WAL. The thing to keep in mind is the WAL is the technique that is used to transfer information to replicas. So it's how they keep in sync.

Essentially, it's streaming the WAL to the replica. Now, he set up replication and wanted to see which tables exist on the replica. Looking at the replica, you'll see that there is no temporary table. Temporary tables only exist on the primary, and only for the session you're in. If you close out that session, that temporary table is released. So nothing is included in the WAL; it gives you great performance, but nothing is passed over to the replica, which makes sense. Similarly, with an unlogged table nothing is written to the WAL, but it's interesting that it actually shows up on the replica. What shows up is the existence of the table, but it doesn't have any data.

So you can see that when he does a SELECT count on the replica from the unlogged table, it says "ERROR: cannot access temporary or unlogged relations during recovery". So you're not going to be able to access that unlogged table, even though you can see it. The definition of the table is passed over to the replica, but not the actual data itself. In other words, the system catalog changes are replicated, but not the data contents of unlogged tables. Then he goes through the process of actually switching a table from logged to unlogged and back again and looking at the performance. Basically, switching back to logged takes quite a long time, similar to a fresh INSERT, because all of that data has to be written to the WAL so the replica can capture it. I thought this was a pretty interesting blog post. If you want to learn more, definitely check out this post.
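
As a minimal sketch of the commands involved, with hypothetical table names rather than his exact setup:

    -- Unlogged and temporary tables skip the WAL entirely.
    CREATE UNLOGGED TABLE events_unlogged (id int, payload text);
    CREATE TEMPORARY TABLE events_temp (id int, payload text);

    -- Switching an unlogged table back to logged rewrites it into the WAL,
    -- which is why it takes roughly as long as reloading the data.
    ALTER TABLE events_unlogged SET LOGGED;
    ALTER TABLE events_unlogged SET UNLOGGED;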

The next piece of content- "TERMINATING DATABASE CONNECTIONS IN POSTGRESQL". This is from cybertec-postgresql.com. He's talking about the two different ways you can cancel queries. The first one is pg_cancel_backend and you give it a PID. Or pg_terminate_backend and give it a PID. So the _cancel cancels whatever query is running but not the connection and _terminate actually terminates the query and the whole connection. Now, the first thing to be aware of is what you want to kill. How do you get the PID of a running query? The answer to that is using pg_stat_activity. So they mentioned a few things you're going to want to do here. You're going to want to narrow it usually by the datname field, which is the database name.

So you want to pick the right database. You want to look in the query column for the query you want to cancel. You also need to be aware of the state column, because you probably only want to kill something that's actively running; if the state is idle, the query column just shows the last query that connection ran. The final thing to be aware of is that because you can have parallel queries, you want to look for the leader PID and cancel that, as opposed to one of the parallel workers. Then, to do it, you just run SELECT pg_cancel_backend() and give it the PID to be canceled. pg_terminate_backend works the same way: you identify the PIDs in a similar way and run pg_terminate_backend to terminate the backend. He closes out the post talking about what to do if you want to kick out every single user who is not you.

Well, you can do SELECT pg_terminate_backend and pass it the PIDs, but this time you're reading them from the pg_stat_activity view and filtering to rows where the PID is not your own PID; pg_backend_pid() gives you your own backend's PID. So it's basically "kill everything but mine". You also ignore connections that don't have a database name, because those are usually Postgres processes like the background writer he mentions and other required processes that aren't dedicated to a database. And don't worry about the parallel children; just use the leader_pid. So this is an interesting blog post showing you how to cancel queries and terminate connections.
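
A minimal sketch of that kind of query, assuming you want to keep only your own session (this is an illustration, not necessarily the post's exact statement):

    -- Terminate every client backend except the current session.
    SELECT pg_terminate_backend(pid)
    FROM pg_stat_activity
    WHERE pid <> pg_backend_pid()   -- keep your own connection
      AND datname IS NOT NULL       -- skip background processes
      AND leader_pid IS NULL;       -- skip parallel workers, target leaders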

The next piece of content- "Creating PostgreSQL Test Data with SQL, PL/pgSQL, and Python". This is from tangramvision.com and they're talking all about generating test data. Now they talked about doing it in a docker container that they've provided so you can follow along. Basically, it goes through the post first using raw SQL and using functions such as generate_series and random to generate random data, including using some of the built-in dictionaries of Postgres to choose random words to generate test data. He has a full set of different SQL queries you can run to generate different types of data in the data set he's chosen. He says not only that, but you can also create functions within the database where you can just call those functions as opposed to running these scripts every time. Then the post moves on to talking about PL/pgSQL.

This is, of course, similar to the SQL commands, but now you have the capability of using conditionals (IF/ELSE), looping within procedures, looping over the rows of a query, and raising notices, so he goes on to build a procedure for this purpose. Then he follows up by talking about using Python, and again this is Python within Postgres: you can create the plpython3u extension, for example, to write Python directly in Postgres to generate your test data. Finally, they cover the faker package available in Python. There's also a faker package in Ruby, so depending on your language you may be able to find something similar. It's a pretty comprehensive blog post on generating test data in Postgres using a variety of techniques, so if you're interested in that, check out this blog post.
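
As a small illustration of the PL/pgSQL constructs mentioned, here is a sketch of my own (not the post's procedure) using an anonymous DO block:

    DO $$
    DECLARE
        i int;
    BEGIN
        -- Loop and branch, raising a notice for each generated value.
        FOR i IN 1..5 LOOP
            IF i % 2 = 0 THEN
                RAISE NOTICE 'value % is even', i;
            ELSE
                RAISE NOTICE 'value % is odd', i;
            END IF;
        END LOOP;
    END;
    $$ LANGUAGE plpgsql;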

The next piece of content- "JSONB Multi-Column Type Casting". This is from momjian.us and he's talking about type casting JSON that you're returning from the database and a lot of times you would cast it this way for each value. So this value is text, this value is an integer. He said there's also a function called jsonb_to_record and you can give it a whole row of data and specify how it should be converted out. So for example, make A text and B integer and you can see that he's outputting the types here. The other benefit of it is that it then assigns this essentially as a column you can query, so you can say where B is less than equal to 4 or B is greater than 4. So perhaps this is a new way you want to try to do multi-column type casting.

The next piece of content- "Jsonpath: ** Operator and Lax Mode Aren't Meant to Be Together". This is from Alexander Korotkov's blog, and the ** operator, or the asterisks operator, as he mentions here, explores arbitrary depth finding the values that you're searching for essentially everywhere within that JSON path. Lax mode is kind of a relaxed way of working with the JSON and basically, you can get some unexpected results from this. So basically based on how these work, he's suggesting using the strict mode when using this operator and if you do it, you get the value expected. So, very simple post, but you can check it out if you want to learn more.

The next piece of content- "Using Kubernetes? Chances Are You Need a Database". This is from blog.crunchydata.com and they're basically covering how you can use Kubernetes to set up Postgres database systems. This post is pretty short, but it has a lot of links to different content that's covered on getting Postgres up and running in a Kubernetes environment. So if you're interested in that, you can check out this blog post.

The next piece of content- "(The Many) Spatial Indexes of PostGIS". This is from blog.crunchydata.com, and he's talking about spatial indexes. So if you have a certain number of points that you need to say, all the things inside this or all of the things near this other thing, you can use bounding boxes to find those particular points. How you generally do that is you use a spatial index to help with performance and he gives an example of creating a GIST index on a particular geometry. But there's more than just this index, there are other indexes that exist that you can use. So for example, based upon the spatial geometry, whether it's 2D, 3D, or 4D, as well as other than gist, there's also the little bit newer spgist as well as the BRIN or the block range index. He goes over and discusses some of the different benefits in performance and looks at how they compare in the different sizes. So if you're interested in this, you can definitely check out this blog post.

The next piece of content- "OPENSTREETMAP TO POSTGIS IS GETTING LIGHTER". This is from rustprooflabs.com and by lighter, they've dropped the memory requirements of OpenStreetMap. To him, what's even more beneficial is that you can now easily calculate how much memory you will need for doing your analysis. So he tested this out and it resulted in exactly what he expected dramatic savings in memory, showing the new low RAM middle compared to the legacy middle. So if you want to learn more about some of these space savings, definitely check out this blog post.

The next piece of content- "PUBLISHING MAPS WITH GEOSERVER AND POSTGIS". This is from cybertec-postgresql.com. If you want to know how to get Geoserver up and running and presenting data in a map, definitely check out this blog post.

The last piece of content. The PostgreSQL person of the week is Dave Page. So if you want to learn more about Dave and his contributions to Postgres, definitely check out this blog post.
