
Fantastic Data Types, Domain Data Types, Insert Performance, Row Differences | Scaling Postgres 121


In this episode of Scaling Postgres, we discuss fantastic data types, custom domain data types, improving insert performance and how to query row differences.


Transcript



[00:00:00] In this episode of Scaling Postgres, we talk about fantastic data types, domain data types, insert performance, and row differences. I'm Creston Jamison, and this is Scaling Postgres, episode 121.

[00:00:22] All right, I hope you, your family, and coworkers continue to do well in these times. Our first piece of content is "Webinar: Fantastic Data Types and Where to Use Them [Follow Up]". This is from 2ndQuadrant.com, and it's a webinar that was put on by 2ndQuadrant; a recording is available here. You click on the link and you can register to attend it. What was really great about this presentation is that I've heard of using enums, but I've also heard that there were numerous disadvantages. This presentation really put enums in perspective: how advantageous they can be and ways to avoid some of the disadvantages; really, the biggest disadvantage is the difficulty of deleting values if you ever need to. Otherwise, there are a lot of advantages to using them compared to, say, a separate lookup table or to storing raw text in a text field that has to be managed elsewhere. So for this part alone, I definitely suggest you check out this webinar to see their discussion of enums. Their coverage of arrays and JSON isn't as in-depth, but for arrays, they say use them for a consistent data type. So if you need to store a set of integers or a set of text, say tags like user tags or something of that nature, an array is a good use case for that. But don't store an array of multiple different data types that have been converted to text; that really doesn't make too much sense. Reach for JSON, or ideally JSONB, when you have a variety of data you happen to need to store, but again, just be cautious of overusing that particular data type. They also cover timestamps and their usage as well. So overall, a really good presentation I suggest you check out.

The next piece of content is "PostgreSQL: Useful New Data Types", and they're actually talking about domains. A domain is a concept where you essentially create a new data type that has a specific check constraint. So in this example, he wants to create a color code data type. It's made as a text field, but it has a check constraint added to it to validate the color codes, and then this domain essentially becomes a data type. So he creates a table with a column c that has the type color_code: when you insert a valid color code, it accepts it, and when the value violates that constraint, it presents an error message. Now of course, you don't have to create domains. You can put check constraints on existing tables and leave the type as text, but this generally makes working with the database easier to do and to reason about when looking at different tables and how the data is laid out. He also looked at some other use cases: maybe you want a text string that is only alphanumeric; well, a domain is one way to do that. You can also create a domain for URLs, and he presents a check constraint to verify that a particular data type only accepts URLs. So if you're interested in setting up these types of domains in your database, you may want to check out this post from cybertec-postgresql.com.
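
To make the domain example concrete, here is a minimal sketch, assuming a hex-style color code; the exact pattern and names in the original post may differ:

-- A domain is a base type plus a check constraint, usable as a column type.
CREATE DOMAIN color_code AS text
    CHECK (VALUE ~ '^#[0-9a-fA-F]{6}$');

CREATE TABLE t_demo (c color_code);

INSERT INTO t_demo VALUES ('#00ff00');  -- accepted
INSERT INTO t_demo VALUES ('green');    -- rejected: violates the domain's check constraint

The same approach works for the alphanumeric and URL domains mentioned above; only the check expression changes.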

[00:03:30] The next piece of content is "13 Tips to Improve PostgreSQL Insert Performance". This is from the Timescale.com blog. Now, Timescale is an extension of Postgres that does time-series data very well, so some of the recommendations are related to Timescale, but the vast majority apply to vanilla PostgreSQL as well. So I'm just going to run through some of these. The first one is use indexes in moderation: the more indexes you have, the slower inserts happen, because you have to update all those indexes. Number two, reconsider foreign key constraints, because generally when you have a foreign key constraint, the other table must be checked to validate that the value exists before the insert can happen; so reconsider using those for faster insert performance. Three, avoid unnecessary unique keys, because again, that's a case where it has to check whether the data already exists in the table before it does the insert; that's another validation step that, if you don't need it, you could remove to increase speed. Four, use separate disks for WAL and data: depending on your workload, separating the WAL from the data can give you better performance. Five, use performant disks: don't use hard disks; use SATA or NVMe drives or a high-speed SAN. Now, starting from suggestion six on, they're talking about things related to TimescaleDB; however, I find that nearly all of these are also applicable to Postgres as well. Number six is use parallel writes. So if you're doing an insert or a copy, use parallel processes to insert that data, because these inserts or copies are done as a single transaction in a single-threaded fashion; if you parallelize it, you can potentially get better write performance. It looks like TimescaleDB actually has a parallel copy command, but you could do this yourself with Postgres using your own language or techniques. Seven, insert rows in batches: it's almost always more advantageous to do a multi-row insert versus many single-row inserts (a sketch follows below). Number eight, properly configure shared_buffers; always important. Nine, run their Docker images on Linux hosts. This is mostly TimescaleDB-related; it's basically about the overhead of virtual machines, and you need to take the same considerations into account when you set up Postgres. Ten, write data in loose time order. Basically, they're saying you want to try to keep as much in memory as possible. Now, TimescaleDB does things related to partitioning; they don't necessarily call it partitioning, but it's time-series data that's chunked up by a timestamp or something similar. And they're saying insert all your data in time order: if you're collecting metrics from servers, don't insert all the data from server A, then all from B, and then all from C; do it by time instead, so that indexes and things of that nature for the more recent data are kept in memory as much as possible. And that's applicable to Postgres as well.
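
To make tip seven concrete, here is a minimal sketch; the table and values are hypothetical:

-- Hypothetical metrics table.
CREATE TABLE metrics (ts timestamptz, server text, value numeric);

-- Single-row inserts: per-statement overhead for every row.
INSERT INTO metrics VALUES (now(), 'a', 1.0);
INSERT INTO metrics VALUES (now(), 'b', 2.0);

-- Multi-row insert: one statement carries the whole batch.
INSERT INTO metrics VALUES
    (now(), 'a', 1.0),
    (now(), 'b', 2.0),
    (now(), 'c', 3.0);

COPY is generally faster still for bulk loads, which is what the parallel copy suggestion in tip six builds on.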

[00:06:32] Continuing the list, eleven talks about avoiding too-large chunks, and twelve about avoiding too many or too-small chunks. This kind of goes to partition sizes: if you're using Postgres, you don't want too little data in a partition, and you don't want partitions that are too large; you want them to be just right, essentially (see the sketch after this paragraph). And then thirteen is watch row width. Anytime you want to maximize insert performance, fewer columns generally do better, and you want to try to avoid a whole lot of data being inserted at one time. So a lot of great recommendations in terms of maximizing your insert performance; go ahead and check out this post from TimescaleDB if you want more details.
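
For the partition sizing advice, here is a minimal sketch of native range partitioning in Postgres; the table names and the monthly granularity are assumptions, and you would choose a range that keeps each partition neither tiny nor enormous:

CREATE TABLE readings (
    ts    timestamptz NOT NULL,
    value numeric
) PARTITION BY RANGE (ts);

-- One partition per month; widen or narrow the range to tune partition size.
CREATE TABLE readings_2020_07 PARTITION OF readings
    FOR VALUES FROM ('2020-07-01') TO ('2020-08-01');
CREATE TABLE readings_2020_08 PARTITION OF readings
    FOR VALUES FROM ('2020-08-01') TO ('2020-09-01');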

[00:07:12] The next post is "Calculating Differences Between Rows in SQL". This is from cybertec-postgresql.com, and this is a post about window functions. So how do you compare one row to the next row when you're doing a query? The first example they review is lag. Say you want to consult the previous year of a set of data you're looking at: you can use the lag window function and look back one row, but of course, when you're doing this, you want to order it. So you do an OVER and ORDER BY the year, so you can look at the previous year's information and compare it to the current year. And then he shows what to do if you want the difference between this year and the last year: you take the number from this year and subtract that window function's value to give you the difference. Other than lag, he also talks about first_value. If you want to compare every year against the first year you have data for, you can use the first_value function; here he's showing how you can compare oil production to the first year that you have data available for. And then he also shows how you can partition the data in your queries, say by country in this example, with window functions that are broken out by country, so you can look at Canada's output relative to Mexico's output. So if you're looking to learn more about window functions and how to use them, definitely check out this blog post from cybertec-postgresql.com; a sketch follows at the end of this segment.

The next post is "Indexing JSONB columns in PostgreSQL". This is from Sevloid Net. Now, the title is interesting, because he talks about an issue where he was indexing a JSONB field while trying to track multiple different IDs for a given document. Apparently different sources have different IDs for it, so he wanted to track them all in one document, as opposed to using a separate table, since multiple IDs could reference a single document and this seemed to be a more efficient solution. But he ran into a number of issues, particularly when he needed to query based on multiple IDs. Now, I should say this is how the post starts, but the post is a story that weaves in a lot of different information, so it's not strictly about indexing JSONB columns. After he investigated more, his ultimate conclusion was that he had a domain modeling failure, and ideally he should have set up a separate table and not used JSONB columns. So it's very interesting: the title of the post is about JSONB indexes, whereas ultimately the solution was not to use JSONB columns. He also covers some additional pitfalls he ran into, and he references other people running into problems with the new JIT compiler that is enabled by default in Postgres 12. So, again, this post is more of a story about a path he went on and the things he discovered with Postgres. If you're interested in learning more about that, check out this blog post.
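
As a rough sketch of the window function techniques from the CyberTec post above, with a hypothetical production table:

CREATE TABLE production (country text, year int, output numeric);

-- lag: difference between each year and the previous year.
SELECT year, output,
       output - lag(output) OVER (ORDER BY year) AS diff_from_prev_year
FROM production
WHERE country = 'Canada';

-- first_value, partitioned by country: difference between each year
-- and the first year on record for that country.
SELECT country, year, output,
       output - first_value(output)
                  OVER (PARTITION BY country ORDER BY year) AS diff_from_first_year
FROM production;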

[00:10:24] The next post is "Should You Ever Use Synchronous Replication in PostgreSQL?" Now, this is actually curious, because I think a better title for this would be "How Best to Use Synchronous Replication in PostgreSQL". It starts off from a story perspective, about a customer that ran into some data loss due to a disk issue: because of how synchronous_commit was set up, their standby did not have all the data that it was supposed to have. And they go over the different settings that are possible for synchronous_commit, from off, local, and remote_write to on and remote_apply, and they describe how you can set this up to get better performance or better protection of your data. So essentially it's a continuum. If you want the best performance, you could turn synchronous_commit off, but you're potentially risking some data loss doing that, of course, because, as mentioned here, quote, commits are sent to the application as soon as the transaction is handled by the backend process, but before the transaction is flushed to the WAL. So essentially the data is somewhere in memory, but not committed to disk on the primary database server; this doesn't even have to do with the replicas. Whereas the most safety you can get, but the least performance because it has to wait, is remote_apply, because commits are sent by the primary to the application only after the standbys defined in synchronous_standby_names have confirmed that the transactions have been applied, beyond just the WAL, to the database itself. So this post talks about essentially the trade-off: if you want more speed and are less concerned about safety, you could go the synchronous_commit off route, but if preserving data is the most important thing and performance not so much, then you could go with synchronous_commit set to remote_apply. A minimal configuration sketch follows below. But generally, if you want to learn more about synchronous replication and how best to configure some of your settings, definitely check out this post from Enterprisedb.com.
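
As a minimal configuration sketch of the trade-off discussed; the standby name here is hypothetical:

# postgresql.conf on the primary
# synchronous_commit: off < local < remote_write < on < remote_apply
# (increasing safety, decreasing throughput)
synchronous_commit = remote_apply
synchronous_standby_names = 'standby1'

synchronous_commit can also be changed per session or per transaction (for example, SET LOCAL synchronous_commit = off;) for writes where durability matters less.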

[00:12:25] The next post is "Oracle to PostgreSQL: Binary Objects". Now, I read this title and I was expecting a comparison between them, but actually most of this post talks about PostgreSQL in general, and it's a really good weighing of the pros and cons of where you should store binary objects. If you store binary objects in the database, he mentions three advantages: they get backed up with appropriate transactional integrity, they get transported via physical streaming replication, and they don't require a separate connection stream to the caller. But then he lists eleven different issues that are disadvantages of storing them in the database, primarily related to their size and performance; if they're small, it's less of an issue. I won't run through all of these right now, but it's definitely a great list of potential disadvantages. And then, of course, they give you some recommendations on how best to store binary objects given these pros and cons. So if you are storing binary objects in your PostgreSQL database and you're not sure if you're doing it right, or if you want to use a different method, definitely check out this blog post, because it's not just about Oracle; it's about best practices in general for storing binary objects and working with your database. This is from 2ndQuadrant.com.

[00:13:45] The next piece of content is actually a YouTube video, "A Look at the Elephant's Trunk - PostgreSQL 13". This is from the Warsaw PostgreSQL Users Group YouTube channel, and it's a presentation, about an hour and 17 minutes in length, going over the new features in PostgreSQL 13. Magnus Hagander has done this a number of times for different versions of PostgreSQL, and this is his most recent one, covering version 13. So if you want to learn more about Postgres 13 and all the different features coming in it, definitely check out this video.

[00:14:18] The next post is "Webinar: PostgreSQL Partitioning Roadmap [Follow Up]". This is from 2ndQuadrant.com, and they're going over the roadmap for partitioning updates and improvements to PostgreSQL. They cover some of the historical changes in PostgreSQL 11 and 12, what's coming in 13, and potentially some thoughts about 14. So again, they continue to push as much as they can to fill out the performance and make management easier, and they look into the future for sharding; this still seems a fairly long way away, but it's something being considered. They also talk about their bi-directional replication product for a small portion of the webinar. So if you're interested, you can just click the link here to register to view it.

[00:15:07] The next piece of content is "How Secure is Your Postgres? Part 1: Access". This is from Enterprisedb.com, and they're talking about different levels of access to a PostgreSQL instance in terms of setting up security: physical access; how you can connect to it, whether you're using a Unix domain socket locally or a TCP/IP connection; and then, once you allow a TCP/IP connection, what kind of firewalls are in place to prevent access from other areas, as well as ensuring that the data is encrypted in transit by using transport encryption such as TLS, and different ways of configuring these SSL parameters. A small configuration sketch follows below. So if you're interested in learning more about secure access to PostgreSQL, definitely check out this blog post.
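
As a minimal sketch of the kinds of settings involved; the file names and subnet here are hypothetical, and real entries depend on your environment:

# postgresql.conf: enable transport encryption
ssl = on
ssl_cert_file = 'server.crt'
ssl_key_file = 'server.key'

# pg_hba.conf: local connections over the Unix domain socket,
# TLS-only TCP/IP connections from one trusted subnet
local    all  all                peer
hostssl  all  all  10.0.0.0/24   scram-sha-256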

[00:15:56] And the last piece of content is: the PostgreSQL Person of the Week is Andrew Dunstan. So if you're interested in learning more about Andrew and his contributions to PostgreSQL, definitely check out this blog post.

[00:16:09] That does it for this episode of Scaling Postgres. You can get links to all the content mentioned in the show notes. Be sure to head over to Scalingpostgres.com, where you can sign up to receive weekly notifications of each episode, or you can subscribe via YouTube or iTunes. Thanks.
