Data Loading Speed, View Dependencies, Users & Roles, H/A Clusters | Scaling Postgres 81

In this episode of Scaling Postgres, we discuss data loading speeds, view dependencies, users & roles and high availability clusters.

Transcript



[00:00:00] In this episode of Scaling Postgres we talk about data loading speed, view dependencies, users and roles, and high availability clusters. I'm Kristen Jameson, and this is Scaling Postgres episode 81.

[00:00:22] Alright, I hope you're having a great week. The first article is about data loading performance of Postgres and TimescaleDB, and this is from the blog at ceoho.net called Stuff About Things. This is basically a performance investigation comparing insert performance between Postgres and TimescaleDB, and he mentions that Simon Riggs at 2ndQuadrant prompted him to look at the performance of loading a lot of data into Postgres versus TimescaleDB. And I believe it's in reference to this article on the TimescaleDB website, where they see degraded insert performance over time: as you insert millions of rows into the database, approaching essentially a billion rows, they show TimescaleDB does not degrade significantly, whereas PostgreSQL does. So this blog post actually investigates that independently, I guess you might say, to see if this can be reproduced. It looks like he used two different instance types, a high-memory one (an r5) and a high-CPU one, compared version 11 of PostgreSQL and the 12 development version, as well as TimescaleDB 1.2.2 and 1.3, and looked at the performance of loading 4 billion rows into both standard and partitioned tables. What he actually got on the high-CPU instance was 320,000 rows per second for Postgres and 225,000 rows per second for TimescaleDB, and this is an ingest rate of about 100 gigabytes per hour, he indicates here. So he didn't really see the drop-off indicated in the TimescaleDB article. Now, that may have been a different version; maybe they were using version 10 of PostgreSQL. I don't quite remember, and I don't see where they say what version it is. But he goes through the test scenario, then shows how he did the configuration, and then shows the performance here.
So you can see TimescaleDB 1.2.2 comes in around an average of 183,000 rows per second, whereas the different PostgreSQL variations, whether it's a standard table or one partitioned monthly or weekly, on both the 12 development version and version 11, are all north of 200,000, reaching as high as 268,000 rows per second. And then looking at the high-CPU instance, it looks like he just used version 11 of PostgreSQL: standard tables got up to 325,000 inserts per second, whereas the partitioned tables were a bit lower, and TimescaleDB was around 220,000. So I thought it was very interesting that he did not see the same thing. Now again, the tests are different, the methodology may be different, and of course there's more than just ingest rate to consider if you're working with time-series data. But if you're evaluating different database systems and wondering whether you want to use something like TimescaleDB, maybe it makes sense to go ahead and do your own benchmarks with PostgreSQL versus the other solutions you're evaluating, to see whether that gives you better performance and what the difference is. So definitely an interesting blog post to check out.
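If you want to run a rough ingest benchmark of your own, a minimal sketch looks like the following. The table and column names here are hypothetical, not taken from the article; timing the statement (for example with psql's \timing) gives you a ballpark rows-per-second figure on your own hardware.

```sql
-- Hypothetical time-series table for a quick insert benchmark.
CREATE TABLE readings (
    time      timestamptz      NOT NULL,
    device_id integer          NOT NULL,
    value     double precision
);

-- Generate and insert one million synthetic rows in a single statement.
-- Dividing 1,000,000 by the elapsed time gives a rough rows/second rate;
-- bulk methods like this (or COPY) are far faster than row-by-row INSERTs.
INSERT INTO readings
SELECT now() - (i || ' seconds')::interval,  -- spread timestamps out
       i % 1000,                             -- 1,000 fake devices
       random()
FROM generate_series(1, 1000000) AS i;
```

For loading from files, COPY (or psql's client-side \copy) is the usual bulk path and is what serious load tests tend to exercise.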

[00:03:38] The next post is Tracking View Dependencies in PostgreSQL. So this is talking about when you have a table you've created and you create a view on it. If you try dropping the table, it's going to give you a dependency error, because it cannot drop the table while other objects, namely the view, depend on it. And you can't alter the table and drop columns from it either, because you can't drop a column that the view depends on. So in the scenario they mention here, where you want to go from an integer to a bigint, you actually have to remove the view, make the change, and then add the view back. And they mention some very good best practices for views. They say views are good for two things. One, they allow you to keep a recurring SQL query or expression in one place for easy reuse. And two, they can be used as an interface to abstract from the actual table definition, so that you can reorganize the tables without having to modify the interface. And they describe two patterns they see people sometimes use that tend to cause problems: nesting views within views within views, which makes it really hard to understand what's going on, or denormalizing the whole database into one big view and using that for all the queries, where sometimes you get things that just don't work as you expect when you use different WHERE conditions. So basically, keep it simple and don't nest your views in general. Then they go into how views are stored in PostgreSQL. A view is not stored like a function, they mention, where the text is stored; it actually stores a query parse tree. So it does some work ahead of time at the CREATE VIEW stage, for example resolving the search path at the point the view is created, and the objects referred to are tracked by their object ID.
So you can rename objects or columns used in a view definition, as long as you don't actually change the data type, try to drop them, or things of that nature. And they talk about where the dependencies are stored, and then they give you example queries for how you can actually look for dependencies on a view. So if you need to make a change to a table, these queries will help you identify which views depend on it, so that you can adjust those in order to make the changes you need. So if you want more information about views, their dependencies, and how they work, this is definitely a great blog post to check out.
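As a sketch of the integer-to-bigint scenario described above, plus a simplified catalog query in the spirit of the ones in the article (table and view names here are made up for illustration):

```sql
-- A hypothetical table with a dependent view.
CREATE TABLE accounts (id integer PRIMARY KEY, balance numeric);
CREATE VIEW positive_accounts AS
    SELECT id, balance FROM accounts WHERE balance > 0;

-- This fails with a dependency error, because the view uses the column:
-- ALTER TABLE accounts ALTER COLUMN id TYPE bigint;

-- The workaround: drop the view, alter the column, recreate the view.
DROP VIEW positive_accounts;
ALTER TABLE accounts ALTER COLUMN id TYPE bigint;
CREATE VIEW positive_accounts AS
    SELECT id, balance FROM accounts WHERE balance > 0;

-- Finding views that depend on a table via the system catalogs:
-- view dependencies are recorded through the view's rewrite rule.
SELECT DISTINCT v.relname AS view_name
FROM pg_depend  d
JOIN pg_rewrite r ON r.oid = d.objid
JOIN pg_class   v ON v.oid = r.ev_class
WHERE d.refobjid = 'accounts'::regclass
  AND v.relkind  = 'v';
```

The article's queries are more thorough (they handle column-level dependencies, among other things); this is just the basic shape.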

[00:06:07] The next post is Managing PostgreSQL Users and Roles, and this is from the AWS amazon.com site. Now, they have information related to running PostgreSQL on AWS, but as a general overview of roles and users, and of using groups (which are essentially just roles), it has a pretty easy-to-understand methodology that I kind of liked. So first they clarify the relationship between users, groups, and roles, and that all of them are essentially roles. Generally, roles that are users can log in. So, for example, creating a user with its password is the equivalent of creating a role WITH LOGIN, so it can log in; it's a role that can log in. That's the only difference between those two statements, and that's essentially what a user is. Now, in terms of their recommendations, they also talk about the public schema and the public role, and things you would probably want to revoke, like revoking CREATE on the public schema from PUBLIC, as well as revoking ALL on the database from PUBLIC. That's a certain recommendation they have. And then you go into granting permissions from that point. They mention that when you're creating a role for accessing objects, you need to grant privileges on the database, then the schema, and then the individual objects. So in this scenario, when you want a read-only role, first you create your role called readonly. Then you grant CONNECT on the database to this readonly role. So you grant at the database level, then you grant USAGE on the schema, whatever the schema name is, to the readonly role. And at that point you can grant SELECT on the individual tables to the readonly role. Now, if you don't want to do individual tables, they say you can do ALL TABLES. And then, if you want to make sure this also happens for future tables created in the schema, you need to alter the default privileges in the schema for this role.
And then they go over the same process for creating a read-write role that can create objects if you want, as well as INSERT, UPDATE, DELETE, and SELECT on the tables. And once you've done that, for creating users you simply create the user and then grant it a particular role, or revoke a role from it. So I kind of like the methodology that's laid out here, and if you're wanting to update how you're using users and roles, it's definitely a blog post to check out.
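The read-only role walkthrough above can be sketched like this. The database, schema, and user names are placeholders, not from the AWS post:

```sql
-- Tighten the defaults on the public schema and the database itself.
REVOKE CREATE ON SCHEMA public FROM PUBLIC;
REVOKE ALL ON DATABASE mydb FROM PUBLIC;

-- Grant privileges top-down: database, then schema, then objects.
CREATE ROLE readonly;
GRANT CONNECT ON DATABASE mydb TO readonly;
GRANT USAGE ON SCHEMA myschema TO readonly;
GRANT SELECT ON ALL TABLES IN SCHEMA myschema TO readonly;

-- Cover tables created in this schema in the future as well.
ALTER DEFAULT PRIVILEGES IN SCHEMA myschema
    GRANT SELECT ON TABLES TO readonly;

-- A "user" is just a role with LOGIN; grant it the group role.
CREATE USER alice WITH PASSWORD 'change-me';
GRANT readonly TO alice;
-- And to take it away later: REVOKE readonly FROM alice;
```

Note that ALTER DEFAULT PRIVILEGES, as written here, only applies to objects later created by the role that ran it; the AWS post discusses this nuance.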

[00:08:33] The next post is How We Built a Scalable On-Premises PostgreSQL Cluster. This is from Medium, on the Enigma blog. They're talking about how they needed to create a PostgreSQL cluster at, I believe, a customer's on-premises site. So they wanted to be able to build it up and have it be highly available. From the application, traffic goes through HAProxy, and they wanted to have a leader or primary database with two replicas, with the whole thing highly available. So they're using Patroni and Consul as the means of leader election. And they go through all the different components and how they set them up. Now, they don't give every single command for how they set it up, but they discuss how they went through and built this for their use case. So if you're interested in developing a high availability solution, maybe this blog post is of interest. I'll also mention that they note at the bottom that going from a single PostgreSQL server to a cluster is not very straightforward, and that's definitely true. They've chosen particular tools, but one option that may be a bit easier was covered in a previous episode: there's an extension called pg_auto_failover, and it seems like a pretty simple solution. Now, it's maybe not as robust as the previous solution, but it does give you a relatively simple way to get some HA capabilities. So again, if you're considering that, maybe look back at that post, which was done in May.

[00:10:09] The next post is SQL Concepts from A to Z. So this was a pretty interesting blog post that basically goes from A to Z, choosing a particular topic for each letter. This isn't PostgreSQL specific; it's about the SQL language, covering a particular concept for each letter of the alphabet. So if you want to check out a pretty interesting blog post, and kind of make sure you know what each of these areas is, A to Z, it's definitely one to check out.

[00:10:40] The next post is Postgres Optimization Tips, Part One: Explain and Analyze. Now, this is what I would call a beginner-level blog post. So if you already use EXPLAIN and ANALYZE, I don't think you'll get too much out of it. But if you're still learning query optimization, it's a good example. They talk through a query that they're using; in this example they're using Ruby, it looks like, with ActiveRecord, so Ruby on Rails, and how they needed to optimize a query. They used EXPLAIN ANALYZE to see what the query was doing, and found it was spending a great deal of time doing a sequential scan. By looking at the query and rewriting it, not necessarily adding a new index, they found a way to get the same information they wanted, and it improved the execution time dramatically. So if you want a little bit more on how you could potentially use EXPLAIN and ANALYZE, it's definitely a post to check out.

The next post is How to Deploy and Manage PostgreSQL on OpenShift Using the Robin Operator. Now, this is outside my area of expertise, because I've never used OpenShift, but if you're looking to deploy on OpenShift, it goes through the whole process to do it. So if this is of interest to you, it's definitely a piece of content to check out.
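The EXPLAIN ANALYZE workflow mentioned a moment ago can be sketched as follows; the table, column, and index names are hypothetical, not from the post being discussed:

```sql
-- EXPLAIN ANALYZE actually executes the query and reports the real plan,
-- row counts, and timings, which is how you spot an expensive
-- sequential scan.
EXPLAIN ANALYZE
SELECT count(*) FROM orders WHERE customer_id = 42;

-- Without a suitable index, the plan will typically include a line like:
--   Seq Scan on orders  (cost=...)  (actual time=... rows=... loops=1)
--     Filter: (customer_id = 42)
-- After adding an index, or rewriting the query, the plan may switch to
-- something like:
--   Index Only Scan using orders_customer_id_idx on orders ...
```

One caution worth knowing: because EXPLAIN ANALYZE runs the statement for real, wrap data-modifying queries in a transaction you roll back.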

[00:11:59] And the last post is Waiting for PostGIS 3: ST_Transform and PROJ 6. Again, PostGIS is not an area of expertise of mine, but PostGIS 3 is coming, and Crunchy Data here continues to produce blog posts about it. So if this feature is of interest to you, it's definitely one to check out.

[00:12:21] That does it for this episode of Scaling Postgres. You can get links to all the content mentioned in the show notes. Be sure to head over to scalingpostgres.com, where you can sign up to receive weekly notifications of each episode, or you can subscribe via YouTube or iTunes. Thanks.
