
Fast PostgreSQL Data Loading Using Ruby


Learn how to quickly load data into PostgreSQL from Ruby using several techniques, from single-row INSERT statements up to PostgreSQL's COPY command. We discuss the best ways to load terabytes of data.

Load One Row Per Insert Statement with Parameters

Load data one row at a time using the pg gem's parameterized queries.

#!/usr/bin/env ruby
require 'pg'
require 'faker'
require 'date'
require 'csv'

# Return a random Time between from and to, using the supplied Random instance.
def time_rand(rt, from = 0.0, to = Time.now)
  Time.at(from + rt.rand * (to.to_f - from.to_f))
end

conn = PG.connect(
  dbname: "test",
  port: 5432,
  user: "creston",
  password: ENV["POSTGRES_DEV_PASSWORD"]
)
dir = "/tmp/"
counts = {}

starting_at = Time.now
puts "#{starting_at}: Starting"

table_name = "posts"
counts[table_name] = 100000
i = 1
rt = Random.new
puts "#{Time.now}: Starting #{table_name}"
while i <= counts[table_name]
  conn.exec_params(
    "insert into posts (id, title, content, published_at, type)
    values ($1, $2, $3, $4, $5);",
    [
      i,
      Faker::Book.title,
      Faker::Lorem.paragraphs(3).join(","),
      time_rand(rt, Time.local(2010, 1, 1)),
      Faker::Book.genre
    ])
  i += 1
  puts "#{Time.now}: #{i} :: #{i / [Time.now.to_i - starting_at.to_i, 1].max} rows/sec" if i % 1000 == 0
end
puts "#{Time.now}: Finished #{table_name}"
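One refinement not shown above is to prepare the INSERT once and execute it per row, so PostgreSQL does not re-parse the statement on every call. This is a sketch, not the episode's script: `load_posts` is a hypothetical helper name, and it only assumes an object exposing the pg gem's `prepare`/`exec_prepared` interface.

```ruby
# Hypothetical helper: prepare the INSERT once, then execute it per row.
# conn is any object with the pg gem's prepare/exec_prepared interface.
def load_posts(conn, rows)
  conn.prepare("insert_post",
    "insert into posts (id, title, content, published_at, type)
     values ($1, $2, $3, $4, $5);")
  rows.each { |row| conn.exec_prepared("insert_post", row) }
end
```

With a live connection, the same while loop as above would build each row array and hand it to `load_posts(conn, rows)`.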

Load One Row Per Insert Statement without Parameterization

Load one row of data at a time without parameterized queries, interpolating values directly into the SQL. Each interpolated value must be escaped by hand (single quotes doubled), since nothing else protects the statement from stray quotes. Only the changed lines from the script above are shown.

  title = Faker::Book.title.gsub("'", "''")
  content = Faker::Lorem.paragraphs(3).join(",").gsub("'", "''")
  genre = Faker::Book.genre.gsub("'", "''")
  conn.exec "
    insert into posts (id, title, content, published_at, type)
    values (#{i}, '#{title}', '#{content}', '#{time_rand(rt, Time.local(2010, 1, 1))}', '#{genre}');"
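Hand-escaping each interpolated value gets repetitive, so a small quoting helper keeps it in one place. `sql_quote` is a hypothetical name; with a live connection, the pg gem's `conn.escape_string` is the safer choice.

```ruby
# Hypothetical helper: quote a value for direct SQL interpolation by
# doubling embedded single quotes. With a live connection, prefer the
# pg gem's conn.escape_string, which handles this for you.
def sql_quote(value)
  "'#{value.to_s.gsub("'", "''")}'"
end
```

The interpolated insert then becomes, e.g., `values (#{i}, #{sql_quote(Faker::Book.title)}, ...)`.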

Load Multiple Rows Per Insert Statement

Load ten rows of data at a time with a nested loop. Only the changed lines from the original script are shown. In the video, we tested up to 500 rows of data per insert statement.

i = 0
rt = Random.new
puts "#{Time.now}: Starting #{table_name}"
while i < counts[table_name]
  sql = "insert into posts (id, title, content, published_at, type) values "
  for j in (i + 1)..(i + 10)
    title = Faker::Book.title.gsub("'", "''")
    content = Faker::Lorem.paragraphs(3).join(",").gsub("'", "''")
    genre = Faker::Book.genre.gsub("'", "''")
    sql += "(#{j}, '#{title}', '#{content}', '#{time_rand(rt, Time.local(2010, 1, 1))}', '#{genre}'),"
  end
  conn.exec sql.chomp(",")
  i += 10
  puts "#{Time.now}: #{i} :: #{i / [Time.now.to_i - starting_at.to_i, 1].max} rows/sec" if i % 1000 == 0
end
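Batching this way gives up parameterization, but the two can be combined by numbering placeholders across rows. `multi_insert_sql` below is a hypothetical helper sketching that idea; it was not part of the episode.

```ruby
# Hypothetical helper: build a multi-row INSERT with numbered
# placeholders ($1, $2, ...) so the whole batch can still go through
# exec_params, with the values passed separately as one flat array.
def multi_insert_sql(table, columns, row_count)
  width = columns.size
  rows = (0...row_count).map do |r|
    "(" + (1..width).map { |c| "$#{r * width + c}" }.join(", ") + ")"
  end
  "insert into #{table} (#{columns.join(", ")}) values #{rows.join(", ")};"
end
```

For ten rows of five columns, this would be called as `conn.exec_params(multi_insert_sql("posts", %w[id title content published_at type], 10), values.flatten)`, where `values` holds the ten row arrays.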

Generate CSV and COPY into PostgreSQL

Instead of loading data directly, generate a CSV file with Ruby, then load it using PostgreSQL's copy command. Only the changed lines from the original script are shown.

CSV.open(dir + "#{table_name}.csv", "wb") do |csv|
  while i <= counts[table_name]
    csv << [i, Faker::Book.title, Faker::Lorem.paragraphs(3).join(","), time_rand(rt, Time.local(2010, 1, 1)), Faker::Book.genre]
    i += 1
    puts "#{Time.now}: #{i} :: #{i / [Time.now.to_i - starting_at.to_i, 1].max} rows/sec" if i % 1000 == 0
  end
end
Then, from psql (server-side copy reads the file from the database server's filesystem, so /tmp/posts.csv must be readable there; from a remote client, use psql's \copy instead):

copy posts (id, title, content, published_at, type)
from '/tmp/posts.csv'
csv delimiter ',';
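When the CSV lives on the client, the intermediate file can be skipped entirely by streaming rows through COPY ... FROM STDIN. This is a sketch: `stream_csv` is a hypothetical helper name, while `copy_data` and `put_copy_data` are real pg gem methods.

```ruby
require 'csv'

# Hypothetical helper: stream rows straight into PostgreSQL with
# COPY ... FROM STDIN, so no server-side file is needed.
# conn.copy_data and conn.put_copy_data are pg gem methods.
def stream_csv(conn, table, columns, rows)
  sql = "copy #{table} (#{columns.join(", ")}) from stdin csv"
  conn.copy_data(sql) do
    rows.each { |row| conn.put_copy_data(CSV.generate_line(row)) }
  end
end
```

With the generator loop above, this would be called as `stream_csv(conn, "posts", %w[id title content published_at type], rows)` instead of writing posts.csv.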