
Microsoft SQL Server Compatible, Time Series Performance, Feature Casualties, BLOB Cleanup | Scaling Postgres 143


In this episode of Scaling Postgres, we discuss Microsoft SQL Server compatibility, the performance of time series DBs, DB feature casualties, and how to clean up BLOBs.


Transcript



[00:00:00] In this episode of Scaling Postgres, we talk about Microsoft SQL Server compatibility, time series performance, feature casualties, and BLOB cleanup. I'm Creston Jamison, and this is Scaling Postgres episode 143.

[00:00:23] All right, I hope you, your friends, family and coworkers continue to do well. Our first piece of content is AWS announces Babelfish: open source Postgres with SQL Server compatibility. This is a post from Ongres.com, and they're talking about an announcement that happened at AWS re:Invent, where they announced Babelfish for Aurora PostgreSQL. Basically, what it does is provide wire protocol compatibility with Microsoft SQL Server. So it interprets the different SQL features that have been added to Microsoft SQL Server, as well as T-SQL, which is what its stored procedures are written in. They have a little diagram posted here, from the Babelfish github.io repository, where an application with a PostgreSQL driver speaks the PostgreSQL protocol directly to PostgreSQL, but an application with a Microsoft SQL Server client driver can speak T-SQL over TDS, the wire protocol for Microsoft SQL Server, to Babelfish, which then talks to Postgres. So if this works as advertised, you could essentially use PostgreSQL as a drop-in replacement for Microsoft SQL Server without having to change anything in your application, and then over time you could transition to Postgres. Now, this was announced for Aurora PostgreSQL, which is PostgreSQL compatible, but it says it's going to be released as open source under the Apache 2.0 license on GitHub in 2021, and that it will be compatible with community Postgres. So this is a great benefit for users who want to start using Postgres but are currently on Microsoft SQL Server. This was also published on the AWS blog as Want more PostgreSQL? You just might like Babelfish, which goes into much the same detail about why they're doing this and choosing to open source it. So if you're interested, definitely check out these two blog posts.

[00:02:27] The next piece of content is TimescaleDB vs. Amazon Timestream: 6,000x higher inserts, 5-175x faster queries, 150-220x cheaper. So that is quite a disparity between TimescaleDB and Amazon Timestream. Now, this is from Timescale.com, which of course produces TimescaleDB, so you have to take their results with a grain of salt and examine them for yourself. But if it's anything close to what they're purporting, for example this 6,000 times higher insert rate, and you're considering a time series database, you would definitely want to check out these kinds of performance differences yourself before you choose a particular product. Then they show the different query speeds they were getting, which are not as dramatic as the inserts, except for one high-CPU query, where I believe they were breaking the work out across multiple devices doing the queries. They follow up with the cost differential between the two, which again is really significant. So if you are looking for a time series database, I would definitely examine this closely and run your own trials of each to see which performs better, what it costs, et cetera. It is a very long blog post covering the differences, the different offerings, and an attempt to explain why some of these differences exist. So if you're interested in time series databases, and particularly looking for a performant one, you may want to check out this post from Timescale.com.

[00:04:02] The next piece of content is Feature Casualties of Large Databases. This post is not necessarily PostgreSQL specific, but it talks about how databases get created for applications, get worked on for a while, then inevitably start getting larger and larger and begin having issues. What it covers are the areas that tend to get ignored early on and then bite you later. The first one is transactions. A lot of times people don't consider transactions when the database is small, because there's a lot less activity and you don't need much locking to avoid race conditions or inconsistent data, whereas once you get larger, those transactions become more important. Next is referential integrity: defining your foreign keys to make sure you don't end up with orphaned records in particular tables. Again, when you're smaller it's not as important, but it becomes very important once you get larger, particularly if you have a lot of individuals interacting with the database; you want that data integrity enforced in the database itself, not just in the application. They also mention "nullable as far as the eye can see". Basically, every column that gets created is usually nullable, because, as he says, that's the default in DDL: any column on a new table supports NULLs. A lot of times you don't want that, because it makes your application code a bit more complex; you have to handle NULL cases and then maybe blank cases as well where you have a text column. Then there's suboptimal indexing. There's an interesting trend where you don't need as many indexes when you're small, but once you start getting larger and larger, indexing becomes more important. Depending on the database you use, there may be a hesitancy to add more indexes because of the potential locking that can occur and the space those indexes take. I haven't found this to be as much of an issue with Postgres, because you can just create indexes concurrently and they generally get created without issue; if there is a problem, the index is simply marked invalid and you can drop it concurrently. With other database systems, which is really what this blog post is about, you may have more issues. They also cover dangerous queries and restricted APIs. The author notes that at some of the organizations they've worked for, the DBAs or the people in charge of the data purposefully restricted database interactions to very simple operations, like single-record inserts or single-record updates, and no deleting a lot of records at once. There's some benefit to that, but you're going to hamstring the developers and cause more work for them if you don't allow them to update or delete more than one record at a time. There are other controls you can put in place with Postgres, such as statement timeouts, to prevent one query from bogging down the system. He follows up the post with some ideas for scalability and how to make working with large databases easier. So check out this post if you want, from brandur.org.
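
As a rough sketch of the Postgres-side measures mentioned above (the table, column, and role names here are invented for illustration, not taken from the post):

    -- Enforce integrity in the database rather than only in the application;
    -- orders/customers are hypothetical tables.
    ALTER TABLE orders
        ALTER COLUMN customer_id SET NOT NULL,
        ADD CONSTRAINT orders_customer_id_fkey
            FOREIGN KEY (customer_id) REFERENCES customers (id);

    -- Add an index to a large table without holding a long write-blocking lock.
    CREATE INDEX CONCURRENTLY orders_customer_id_idx ON orders (customer_id);

    -- If a concurrent build fails, it leaves an INVALID index behind,
    -- which can be removed the same way.
    DROP INDEX CONCURRENTLY orders_customer_id_idx;

    -- Guard against runaway queries for an application role.
    ALTER ROLE app_user SET statement_timeout = '30s';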

[00:07:15] The next piece of content is BLOB cleanup in PostgreSQL. This is from Cybertec-postgresql.com, and they're talking about BLOBs, which are binary large objects. First, he says that as long as your binary data is not too large, you probably just want to store it in a regular field of a database table, because that's much easier to work with; that's the bytea (byte array) data type. So you can store an image as a bytea, which by default is shown as hex, and then retrieve it later. But keep in mind the maximum size is 1 GB per field. So if you have files larger than that which you actually want to store in the database, and not just a reference to the file (for example, a file name your application code uses to locate where the file exists), then you want to look into the large object interfaces. He shows an example of importing a large file: you call lo_import, meaning large object import, with the name of the file you want to import, and it returns the object ID. Now, if you notice, that OID is returned and then it's essentially gone: it isn't stored in any table or saved anywhere, it's just returned, and you need to remember it somehow. What he says most people do is create a table, say a file table, with an incrementing ID, a name for the file, and the object ID. So when you insert into that table, you put the lo_import call for that file name into the OID column, so that it actually stores the object ID. Now you have references to all the blobs that exist within the database. That's okay, but if you then simply delete that record, you've lost the object ID; you have no idea what it was anymore. The large object still exists, so you can query the pg_largeobject table to find the object ID, but you have no other reference to it within the database, and it's essentially just stuck there. He says there are a couple of ways to handle this. You can call lo_unlink on it, which removes it from the database. He also mentions a utility called vacuumlo that can clean up dead large objects, meaning large objects with no reference in any table, so you can use it to clean up dead objects in your database. In addition to the import and unlink functions, there are a variety of other functions for working with large objects. So if you're interested in working with binary large objects in Postgres, definitely check out this post from Cybertec-postgresql.com. (A brief sketch of these large object calls appears below, after the next item.)

The next piece of content is an infographic, Five Things to Include on Your PostgreSQL Security Checklist. This is from Enterprisedb.com, and it's a great list that breaks out what to check for security: securing access, including physical access, firewalls, and encrypted connections; authentication, and how to secure it; securing roles within the database; defining access controls to different data in the database; and securing encryption. So if you want a security checklist for Postgres, check out this blog post.
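
To make the large object flow described above a bit more concrete, here is a minimal sketch of those calls; the table definition and file path are made up for the example and are not taken from the CYBERTEC post:

    -- Hypothetical tracking table for imported large objects.
    CREATE TABLE file (
        id     bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
        name   text NOT NULL,
        obj_id oid  NOT NULL
    );

    -- lo_import reads a file on the database server and returns the new
    -- object ID, which we store so the reference is not lost.
    INSERT INTO file (name, obj_id)
    VALUES ('/tmp/picture.png', lo_import('/tmp/picture.png'));

    -- Remove the large object itself, not just the row that points to it.
    SELECT lo_unlink(obj_id) FROM file WHERE name = '/tmp/picture.png';
    DELETE FROM file WHERE name = '/tmp/picture.png';

    -- List large objects still present in the database.
    SELECT oid FROM pg_largeobject_metadata;

    -- Orphaned objects (referenced by no table) can be cleaned up from the
    -- shell with the vacuumlo utility, e.g.: vacuumlo mydb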

[00:10:49] The next piece of content is Running PgBouncer on AWS Elastic Container Service. This is from Revenuecat.com, and basically they're talking about setting up PgBouncer. There isn't a lot of documentation out there on PgBouncer, so I always like showcasing posts that talk about setting it up and using it. They don't get into a lot of the specifics, and they ran it in a container stack, using a Docker container that includes PgBouncer. For analysis they were using Prometheus, with the PgBouncer exporter exporting data to Prometheus for monitoring the system. The post goes through the process of setting it up, making changes to their file limits (which you frequently have to do), and some specifics about working with AWS, such as whether to set up a network load balancer or use service discovery in front of it. Then they talk about the monitoring with Prometheus, and finally they cover the different PgBouncer settings. The most important one is selecting a pooling mode, whether you want session or transaction pooling, as well as defining your connection limits: the maximum client connections setting is how many clients can connect to PgBouncer, while the pool size defines how many server connections you want to maintain; there is a default pool size you set, as well as a minimum size it shouldn't drop below, so it always has some connections available. So if you're interested in setting up PgBouncer, and some of the environment considerations specific to AWS and its Elastic Container Service, definitely check out this blog post.
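
For reference, these are the kinds of knobs being described; a minimal pgbouncer.ini sketch, with placeholder host names and sizes rather than the values from the RevenueCat article:

    [databases]
    appdb = host=db.internal port=5432 dbname=appdb

    [pgbouncer]
    listen_addr = 0.0.0.0
    listen_port = 6432
    auth_type = md5
    auth_file = /etc/pgbouncer/userlist.txt
    ; session or transaction pooling, as discussed above
    pool_mode = transaction
    ; how many clients may connect to PgBouncer in total
    max_client_conn = 2000
    ; server connections kept per database/user pair
    default_pool_size = 20
    ; never drop below this many server connections
    min_pool_size = 5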

[00:12:21] The next piece of content is How pgBackRest is Addressing Slow PostgreSQL WAL Archiving Using Its Asynchronous Feature. This is from Percona.com, and they're talking about pgBackRest. They reference a previous article of theirs on slow WAL archiving: you want to be very cautious with the archive_command you use in Postgres, because if it is slow for any reason, WAL archiving can get backed up and your data volume keeps growing because WAL can't be archived fast enough. Now, if it's a simple copy command, that will rarely happen unless you have problems with your disk systems and a slow disk is causing it. But if you make it more complicated than a simple copy, for example copying to an external server, or in one of the worst-case scenarios they mention here, copying to S3, it just takes a long time to transfer those files and you can fall behind. Using a query while running a pgbench workload, they were seeing a delay of 278 WAL segments waiting to be archived. One way to potentially speed that up is to use compression; with compression they got it down to 166 segments, but the backlog still existed. Another feature pgBackRest offers is asynchronous WAL archiving. That means it doesn't have to wait until a file has completed its transfer to S3; it sends an immediate acknowledgement to Postgres, and then it spawns multiple processes and hands the transfer work off to them. Now, there is a risk of something breaking down in that process, since it acknowledges Postgres while it may still be falling behind on the transfers, so you do have to make sure your configuration is correct and that you're not seeing these delays, but it does enable you to respond to Postgres faster and then archive those files to their ultimate destination in a more asynchronous manner. They cover the configuration you want to do, including the full pgBackRest configuration they used, and they show a little bit of how it works, where one process does the immediate notification to Postgres and multiple processes actually do the transfer to S3. So if you're interested in setting up pgBackRest with asynchronous WAL archiving, definitely check out this blog post.
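
As a rough idea of what that looks like, here is a partial configuration sketch; the stanza name, paths, and process count are placeholders, the S3 bucket and credential settings a real setup needs are omitted, and this is not the full configuration from the Percona article:

    # postgresql.conf: hand completed WAL segments to pgBackRest
    archive_mode = on
    archive_command = 'pgbackrest --stanza=main archive-push %p'

    # pgbackrest.conf
    [global]
    repo1-type = s3
    # acknowledge Postgres immediately and queue WAL locally for transfer
    archive-async = y
    spool-path = /var/spool/pgbackrest

    [global:archive-push]
    # parallel processes pushing queued WAL to the repository
    process-max = 4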

[00:14:48] The next piece of content is Replacing Lines of Code with Two Little Regexes in Postgres. This is from Crunchydata.com, and they're talking about a set of data that needs some cleanup. The format they want the data in is cam followed by three numeric digits; the values stand for cameras. But these were human entered, so there are errors: some people just put 8 when it was supposed to be something like cam008, or put a space in it, like cam 59, or used uppercase. Basically, you can use a regex to correct this data rather than writing code in, as they say here, Python or some other language to fix it. He shows the exact query he used to correct the data: a regexp_replace for each column he wants to update, which essentially picks out the trailing digits and rebuilds the value around them, and he explains exactly how the regular expression works. He also does the same for the instances with capital letters. So yes, you can do this with a regex, and there should also be a way to do it in the same UPDATE statement and correct the case as well, maybe using a nested call; I would probably just use a lower function, for example lowering the input before doing the regex replacement. But this is a very brief post about how to use regular expressions to correct data you have received.
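
The post's exact statements aren't reproduced in the transcript, but here is one hedged way to express that kind of cleanup in a single UPDATE, combining lower() with a regexp_replace as suggested above; the table and column names are invented for the example:

    -- Normalize human-entered camera labels to the form cam###:
    -- lower-case the value, strip everything that is not a digit,
    -- zero-pad to three digits, and prefix 'cam'.
    UPDATE observations
    SET    camera = 'cam' || lpad(regexp_replace(lower(camera), '[^0-9]', '', 'g'), 3, '0')
    WHERE  camera !~ '^cam[0-9]{3}$';

This folds the case fix and the digit fix into one pass, which is roughly the combination the episode suggests; the original post instead uses two separate regexes.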

[00:16:25] The next piece of content is Deploy Pgpool-II on Kubernetes to Achieve Query Load Balancing and Monitoring. This is from B-peng.blogspot.com. The focus of this article is Pgpool-II and setting it up to do query load balancing and monitoring. For the monitoring they are using Prometheus with the Pgpool-II exporter. They walk through the process of working with Kubernetes and getting Pgpool-II running on it. So if you're interested, you can check out this blog post.

[00:16:56] The next piece of content is Evolution of Tree Data Structures for Indexing: More Exciting Than It Sounds. This is a very in-depth examination of B-tree indexes: the different types, different considerations, how they work, and some of the more modern research into them. So if you're interested in that type of content, you can check out this blog post from erthalion.info.

[00:17:20] The next piece of content is PostGIS Trajectory: Space Plus Time. This is from Rustprooflabs.com. They're talking about acquiring data and then doing not only a geographical examination, but also one across the dimension of time, so you can see how things change over time, such as tracking the routes of different vehicles, and they even did a visualization of it. So if you're interested in this type of content, definitely check out this blog post.

[00:17:50] The next piece of content is 2020 PG Asia Conference Ended Successfully at an Unprecedented Scale. This is from Highgo.ca. So if you want to learn more about PostgresConf China and PGConf Asia, definitely check out this blog post.

[00:18:06] And the last piece of content is: the PostgreSQL Person of the Week is Stefan Schleichnich. So if you're interested in learning more about Stefan's contributions to Postgres, definitely check out this blog post.

[00:18:19] That does it for this episode of Scaling Postgres. You can get links to all the content mentioned in the show notes. Be sure to head over to Scalingpostgres.com, where you can sign up to receive weekly notifications of each episode, or you can subscribe via YouTube or iTunes. Thanks.
