
Use Bigint, High Availability, Autocommit, Synchronized Sequential Scans | Scaling Postgres 169


In this episode of Scaling Postgres, we discuss only using bigint, high availability, Postgres autocommit and synchronized sequential scans.

Transcript

All right. I hope you, your friends, family, and coworkers continue to do well. Our first piece of content is "USE BIGINT IN POSTGRES". This is from rustprooflabs.com, and basically, this is a blog post describing how he typically used INT, but then around two to three years ago converted to only using BIGINT. The post goes through a bunch of tests to see what the actual performance difference between INT and BIGINT is. Basically, the conclusion I came to looking through this is that it's maybe about a 6-7% difference, based upon some of the things he's tested. Now, of course, the big question is why default to BIGINT, because it's about twice the size. So depending on your data, you could be storing more, and the performance will differ because you have a bigger data type to work with and more data to move around.

But he says the main reason to go with BIGINT is that you avoid having to convert an INT to a BIGINT later, which is a pretty big inconvenience to deal with, particularly if you're talking about a primary key. So with primary keys today, if you have any thought that the database is going to get any kind of usage, I would definitely go with BIGINT. Even if it weren't, I would still probably consider BIGINT because, as he says here, with fast SSDs and sufficient RAM and CPU, the performance difference should be relatively negligible. Now, it looks like he was testing this on an eight-core CPU with 16 GB of RAM on Ubuntu 20.04 and Postgres 13, and used pgbench to run some different performance tests. Some of the SELECTs were a little bit slower with BIGINT compared to INT, and then he tests some indexes; frequently he says the test results are negligible, but some of the differences are 6-7%.

I probably wouldn't notice that kind of difference either. So I think this has big benefits if you're using it for primary keys, because there's essentially no risk of running out of values with a BIGINT, unlike with an INT. But he says there may be some reasons you don't want to use BIGINT. For example, if you know only a small set of values will ever be stored in a field, say a foreign key that will only ever reference ten different integers, then it doesn't make sense to make it a BIGINT. He also suggests cases where you have low-power devices, an embedded use case, or limited network bandwidth. But overall, I definitely agree with the recommendation to choose BIGINT over INT. If you want to learn more about his testing procedures, go ahead and check out this blog post.
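Just to give a flavor of what that recommendation looks like in practice, here's a minimal sketch; the table and column names are hypothetical, not from the post:

```sql
-- Hypothetical table: starting with a bigint identity primary key
-- avoids a painful int-to-bigint migration later.
CREATE TABLE orders (
    id          bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    customer_id bigint NOT NULL,
    created_at  timestamptz NOT NULL DEFAULT now()
);

-- For comparison, converting an existing int primary key later requires
-- something like this, which rewrites the table under a heavy lock:
-- ALTER TABLE orders ALTER COLUMN id TYPE bigint;
```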

The next piece of content- "PostgreSQL HA with Patroni: Your Turn to Test Failure Scenarios". This is from percona.com, and they did a presentation at Percona Live Online looking at testing failover scenarios using HA with Patroni in Postgres. This blog post goes through the entire setup of doing it. So they set up a three-node etcd cluster, configured a software watchdog, installed PostgreSQL, and got Patroni up and running to manage the Postgres instances. Then set up HAProxy to be able to direct connections either to a write port, which would be the primary, or a replica port, which would be the reads. Then tested workloads using a specific tool that they developed in Python to do their testing.

Then they ran a bunch of experiments with different failure scenarios, such as losing network connections from either the replica or the primary, unplugging the network cable from both, simulating a power outage, simulating a segfault by killing the postmaster process, killing Patroni, saturating the CPU, as well as just doing a manual switchover, to see how the cluster recovered and approximately how long that took. So if you're interested in setting up your own high-availability environment, maybe you want to check out this post as well as the accompanying talk.

The next piece of content- "DISABLING AUTOCOMMIT IN POSTGRESQL CAN DAMAGE YOUR HEALTH". This is from cybertec-postgresql.com, and they're talking about autocommit. That is the process that Postgres wraps every statement you send to the database in a transaction. Say you're at a psql prompt and you send a statement, it's automatically wrapping that statement within a transaction. Now, you can manually define the transaction boundaries by using BEGIN and then conclude with a rollback or commit to actually commit that transaction or roll it back. But if you're just doing a quick query or a quick INSERT statement, that is automatically wrapped in a transaction that autocommits it. Now, that's a little bit different from Oracle, where you actually have to explicitly give the commit or rollback statements. Now, there's nothing in the database that allows you to, at the server level, turn off this autocommit. Apparently, it was done back in 2002, but they quickly removed that feature in 2003 due to the issues.

But it is still the case that you can do it on the client. So with the client that you're using, you can actually turn autocommit off, and he mentions various places you can do it. In psql, you can run \set AUTOCOMMIT off. The JDBC driver has an autocommit setting you can change, as do psycopg2, pgAdmin 4, and other tools. You can turn off this autocommit, but you want to be very careful with doing this, particularly because of how PostgreSQL handles its MVCC (MultiVersion Concurrency Control) implementation. The number one problem is that you end up with sessions sitting idle in transaction, and long database sessions in that idle in transaction state can cause a number of problems with Postgres. The first problem is locking in the database, particularly for DDL commands. Even a single SELECT takes an ACCESS SHARE lock on the table and holds it until the transaction ends, which can block DDL statements that want to run. The other issue is that vacuum cannot do its job.
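To make that failure mode concrete, here's a minimal psql sketch; the orders table is hypothetical:

```sql
-- In psql, turn autocommit off for this client session only.
\set AUTOCOMMIT off

-- The next statement implicitly opens a transaction that stays open
-- until you explicitly COMMIT or ROLLBACK.
SELECT count(*) FROM orders;   -- hypothetical table

-- From another session, the first one now shows up as "idle in transaction",
-- still holding its ACCESS SHARE lock:
SELECT pid, state, xact_start
FROM pg_stat_activity
WHERE state = 'idle in transaction';
```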

And he says here, quote "...autovacuum cannot delete any row versions that are younger than the start of your transaction...". So you could be creating a big bloat problem for yourself if you have a lot of these long-running idle transactions preventing autovacuum from cleaning up. You definitely want to try to avoid this if you can. Now, there are some protections you can add on the server side to, as he says here, defend against AUTOCOMMIT off. The first one is to set idle_in_transaction_session_timeout to a value greater than zero, which automatically ends transactions that have sat idle for too long. The other parameter you can change is old_snapshot_threshold: set it to something greater than -1 and, as he says here, that allows vacuum to clean up tuples that have been dead longer than that time. But of course, the issue with both of these is that they do send an error to the client when either of those scenarios is encountered. So this was a great post talking about autocommit, and if you're interested in learning more, definitely check out this blog post.
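As a minimal sketch of those two server-side defenses, with the specific values being just examples:

```sql
-- Terminate sessions whose transaction has been idle longer than this.
ALTER SYSTEM SET idle_in_transaction_session_timeout = '10min';

-- Allow vacuum to remove tuples dead longer than this, even if an old
-- snapshot could still see them; queries relying on such a snapshot
-- will then fail with a "snapshot too old" error.
ALTER SYSTEM SET old_snapshot_threshold = '1h';

SELECT pg_reload_conf();  -- picks up the timeout; old_snapshot_threshold needs a restart
```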

The next piece of content- "DATA WAREHOUSING: MAKING USE OF SYNCHRONIZED SEQ SCANS". This is from cybertec-postgresql.com. They're talking about a feature that was a little new to me, where Postgres has the ability to synchronize concurrent sequential scans that are happening against a table. Now he's first talking about, say, you have a database environment and you have, say, one to ten users, and your database system has a particular throughput it can handle. Well, if one user is doing a sequential scan on a table that may take 16.6 minutes to run, well, if you add a second user, it's going to double that time. If you have four users, it's going to double again. Basically, your performance keeps dropping the more users you add to the system.

Now this happens assuming that each query to the system is entirely independent and there are no dependencies between them. Basically, each of these ten users has to do a full sequential scan of each table. But there's a feature in Postgres called synchronized sequential scans. Basically, let's say you have user 1 who has started a sequential scan on a ten-terabyte table. Then at this point here, when it's about halfway done, a second user comes in asking for a sequential scan of the same table. Well, it doesn't have to start at the beginning, it actually joins up and starts using the same IO request to collect its information.

So it continues on till the end, and then it starts over at the beginning to finish up its own scan and return data to the user. The same thing happens with user 3: if the scan is around 75% complete when user 3 starts a sequential scan of the same table, it can piggyback on the existing I/O to start filling out what it needs to send to user 3. Then, once that scan is done, users 2 and 3 can start another sequential pass to fill in the portions they missed. There's actually a parameter for this called synchronize_seqscans, and by default, it is on, so you can turn it off if you need to. But definitely an interesting feature. If you want to learn more about it, definitely check out this post.
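A minimal sketch of checking and toggling that setting:

```sql
-- Check the current value (on by default).
SHOW synchronize_seqscans;

-- Turn it off for the current session only, for example if you depend on
-- the (otherwise unguaranteed) row order a sequential scan happens to return.
SET synchronize_seqscans = off;
```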

The next piece of content- "Better Range Types in Postgres 14: Turning 100 Lines of SQL Into 3". This is from blog.crunchydata.com, and they're talking about a new feature that allows multirange types. So more than just having a range type, you can have multiple ranges within a type. So for example, the data once stored looks something like this, where you have a range from, looks like, June 8th to June 10, and then June 13th to June 16th. You can store that within the same data type. Then you can do queries against it, like does it contain, using this operator, certain dates that you're looking for to see if they exist or not. He shows how you can do that here.

So the same categories of range types that exist for normal ranges are available as multiranges: integer, bigint, and numeric ranges, as well as timestamps (with and without time zones) and dates. Then they walk through a real implementation where they need to check a set of non-overlapping ranges to see whether a given range falls within it. Something that took a great many lines of SQL before can now be done in basically three, which is shown here. This is basically, quote, "finding what dates I'm available for an employment within a given month". So of course this makes it easier to understand and maintain, and it actually uses less storage. So if you're interested in learning more about this feature, definitely check out this blog post.
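Just to illustrate the idea, here's a minimal sketch on Postgres 14; the dates are only examples:

```sql
-- Two separate date ranges stored as a single multirange value.
SELECT datemultirange(
         daterange('2021-06-08', '2021-06-10'),
         daterange('2021-06-13', '2021-06-16')
       ) AS availability;

-- Containment check: is a given date inside any of the stored ranges?
SELECT datemultirange(
         daterange('2021-06-08', '2021-06-10'),
         daterange('2021-06-13', '2021-06-16')
       ) @> '2021-06-09'::date;   -- true
```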

The next piece of content is actually a YouTube video: "FETCHING LARGE AMOUNTS OF DATA". This is from the CYBERTEC YouTube channel, and what they're actually talking about is cursors. So let's say you have a ten-terabyte table: if you select all the data from that table without a WHERE clause and pull it down, you could potentially crash your client. Now, normally you would never do this; you would want to use a WHERE clause for better performance. But another option is to use cursors. A cursor allows you to fetch one or more rows from the database starting at a particular point. You can think of it like a cursor on a screen: you can move it to a particular location in the data set based upon a query, fetch one row, ten rows, or a thousand rows, and work with that locally on your client, which is much less data to pull down. So if you want to learn how to work with cursors in Postgres, this is a great video to check out.
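A minimal sketch of batched fetching with a cursor; the table name is hypothetical:

```sql
BEGIN;

-- Declare a cursor over the full result set without pulling it all down at once.
DECLARE big_scan CURSOR FOR
    SELECT * FROM measurements;   -- hypothetical large table

-- Fetch the rows in manageable batches and process them client-side.
FETCH 1000 FROM big_scan;
FETCH 1000 FROM big_scan;
-- ...repeat until FETCH returns no more rows...

CLOSE big_scan;
COMMIT;
```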

The next piece of content- "PostgreSQL as a Microservice". This is from tapoueh.org. The author was on a podcast recently and they were talking about the concept of Postgres as a microservice and well, I tend to just think of Postgres as a service and they were thinking about it as a storage service. But he says, you know, it's really much more than that. It's not just a very basic source of data here and then be able to retrieve it. It also does it with high concurrency, meaning many, many different users can be accessing pulling data down, inserting it, deleting it, updating it, and providing, depending on your configuration, a consistent view as well of the different data in the database due to its implementation. It's really a great concurrency microservice in his opinion. He goes into a little bit about the isolation and locking available in Postgres that you would not want to build yourself. So if you're interested in that, you can check out this blog post.

The next piece of content- "Ignore nulls in Postgres". This is from patternmatchers.wordpress.com and he's talking about a problem where Postgres does not support the IGNORE NULLS clause, which Oracle does, for example. So he imagines the scenario where you have this table here and you want to generate this column here that shows the last value that's NOT NULL. So, for example, A is NULL, and A is NULL. Okay, it's two. So now this is going to be set at two. The next row is two, two because these are NULLS. Then when it changes again, it's okay, now it's a six, and now it's a seven. So what is the last value that's NOT NULL? Now, to do it in Oracle, it's relatively simple. You just do this window function where you're ignoring NULLS, but with Postgres, it's a little bit more difficult. Now there is this implementation you can do here using a subquery essentially, but there's an actual better way to do it in that you can create your own aggregates.

So aggregates are things like sum or average; I typically just use the built-in ones, but you can create your own. To handle this scenario, he first created his own coalesce-style function and then used it to define a custom aggregate called find_last_ignore_nulls that takes an element. So now, with that in place in Postgres, you can just SELECT from the table and call find_last_ignore_nulls on the column you're basing it on as a window function, and you get the same results. I found this very interesting. He also looks at the performance considerations, and it looks to be about twice as fast as the subquery implementation. So if you want to learn more, definitely check out this blog post.
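His exact definitions are in the post, but as a rough sketch of the approach (a state function that keeps the last non-null value, wrapped in an aggregate), something along these lines works; the function and table names here are hypothetical:

```sql
-- State transition function: take the new value if it is not null,
-- otherwise carry the previous state forward.
CREATE FUNCTION last_ignore_nulls_sfunc(state anyelement, val anyelement)
RETURNS anyelement
LANGUAGE sql IMMUTABLE AS
$$ SELECT coalesce(val, state) $$;

-- Aggregate that returns the last non-null value it has seen.
CREATE AGGREGATE find_last_ignore_nulls(anyelement) (
    SFUNC = last_ignore_nulls_sfunc,
    STYPE = anyelement
);

-- Used as a window function over the ordered rows (table t is hypothetical).
SELECT id, a,
       find_last_ignore_nulls(a) OVER (ORDER BY id) AS last_not_null
FROM t;
```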

The next piece of content- "Postgres 14 highlight- Memory dumps". This is from paquier.xyz and he's talking about new features to be able to look at what's going on with memory and Postgres. Now, it's actually not a dump of something that has crashed, but it's something that you can look in real-time into a session as to what is going on. So there's a new view called pg_backend_memory_contexts that within a given session you can look at the memory context by using a query such as this one, and it outputs the information for your individual session. It also has a feature where you can request the memory context for a specific process ID and it will actually log it to the Postgres logs. It doesn't show it on the screen, but it does log it to the Postgres logs. Basically, this is a superuser-only function. Now, I'm not quite sure why it only logs it to the logs and not to the screen, but it's another feature that's been added to Postgres 14.

The next piece of content- "PostgreSQL on Linux: Counting Committed Memory". This is from blog.crunchydata.com and this talks about Postgres's use of memory, particularly the comparing commit limit to commit to an as, and shows how those vary when the databases started and stopped, and gives insight into how you can look into this different memory. So if you're interested in that, you can check out this blog post.

The next piece of content- "Grafana Dashboards for pgSCV". This is from lesovsky.medium.com and pgSCV is a new metrics exporter for Prometheus that works with Postgres, PgBouncer, and others. Well, this is using Grafana dashboards to be able to use those, but I found that this particular post is interesting with what metrics he wanted to monitor with Postgres and which ones he felt were important, such as looking at the RED metrics requests, errors, and durations. Also looking at activity logs, statements, lock and wait events, replication, WAL, vacuum maintenance, background writes, disk space usage, tables, WAL archiving, system resources usage, as well as areas for PgBouncer. So if you want to learn about what metrics he feels are important, it's definitely a great blog post to do that.

The next piece of content is "Time Zone Abbreviations". This is from momjian.us. So this is a very short post that actually shows two views I didn't know existed. One is pg_timezone_names which outprints the different time zone names within Postgres. As well as pg_timezone_abbrevs, which gives you the abbreviations of the different time zones in Postgres. So if you want to check those out and learn more about it, you can check out this blog post.

The next piece of content is actually a YouTube channel. This is the Percona YouTube channel, and over the last two weeks they've posted a great many videos from Percona Live, a number of which, of course, have PostgreSQL content. So if you want some more video content, definitely check out this YouTube channel.

The last piece of content: the PostgreSQL person of the week is Tom Kincaid. So if you want to learn more about Tom and his contributions to Postgres, definitely check out this blog post.
