A Simple Database in Ruby

I’ve been reading Martin Kleppmann’s book “Designing Data-Intensive Applications” and decided to create a simple project to explore and solidify my understanding of some of the concepts it covers. The code for this post can be found here.

At the start of chapter 3, Kleppmann describes a simple key-value store that persists data by appending the latest write to a log file. Kleppmann implements the database as two bash functions, which I’ve decided to reimplement in Ruby. I’m also taking the opportunity to try out sorbet for Ruby type checking.

The set function is fairly simple: it takes an integer key and a hash value, converts the hash to JSON, and appends it to the file. Appending is an O(1) operation, which makes writes very efficient.

# typed: strict
# frozen_string_literal: true

require_relative "db/version"
require 'sorbet-runtime'
require 'json'

module Rb
  module Db
    class Database
      LOG_FILE = T.let("./aof.txt", String)

      extend T::Sig
      sig {void}
      def initialize
        @log = T.let(File.open(LOG_FILE, 'a+'), File)
        @log.sync = true
      end

      sig {params(key: Integer, value: T::Hash[T.untyped, T.untyped]).returns(NilClass)}
      def set(key, value)
        @log.write("#{key},#{value.to_json}\n")
        nil
      end

    end

    class Error < StandardError; end
  end
end

We can test out this implementation with IRB:

2.5.3 :001 > require_relative './lib/rb/db'
 => true
2.5.3 :002 > db = Rb::Db::Database.new
 => #<Rb::Db::Database:0x00007f8db6069550 @log=#<File:./aof.txt>>
2.5.3 :003 > db.set(1, {first_name: 'Peter', last_name: 'Keogh'})
 => nil
2.5.3 :004 > db.set(2, {first_name: 'Steve', last_name: 'Jobs'})
 => nil
2.5.3 :005 > db.set(3, {first_name: 'Steve', last_name: 'Wozniak'})
 => nil
2.5.3 :006 > db.set(1, {first_name: 'Pete', last_name: 'Keogh'})
 => nil

And we can see the logfile:

rb-db ➤ tail -F aof.txt
1,{"first_name":"Peter","last_name":"Keogh"}
2,{"first_name":"Steve","last_name":"Jobs"}
3,{"first_name":"Steve","last_name":"Wozniak"}
1,{"first_name":"Pete","last_name":"Keogh"}

How about retrieval? We can retrieve the value for a given key by iterating over every line in the log file, keeping the lines that start with that key, taking the last match, and parsing its JSON.

  sig {params(key: Integer).returns(T.nilable(T::Hash[T.untyped, T.untyped]))}
  def get(key)
    File.open(LOG_FILE, 'r')
        .each_line.grep(/^#{key},/).last
        &.gsub(/^\d+,/, '')&.yield_self {|h| JSON.parse(h)}
  end

Usage:

2.5.3 :007 > db.get(1)
 => {"first_name"=>"Pete", "last_name"=>"Keogh"}
2.5.3 :008 > db.set(1, {first_name: 'Steve', last_name: 'Wozniak'})
 => nil
2.5.3 :009 > db.get(1)
 => {"first_name"=>"Steve", "last_name"=>"Wozniak"}

This works, but is very inefficient - it’s O(n), as we need to iterate over every line in the log to retrieve the latest value for a key. Kleppmann introduces a hash index to solve this problem.

Hash Indexes

We can use a hash index to perform reads in O(1) time. The index is an in-memory hash that maps each user-defined key to the byte offset of its latest record in the log file.
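To make the mapping concrete, here is a self-contained sketch (plain Ruby, no sorbet, writing to a temp file rather than aof.txt) that replays the four writes from the IRB session above and builds the index as it goes. Note that the last write for key 1 simply overwrites its earlier offset:

```ruby
require 'json'
require 'tempfile'

log = Tempfile.new('aof')
index = {}

# The same four writes as the IRB session above.
writes = [
  [1, {first_name: 'Peter', last_name: 'Keogh'}],
  [2, {first_name: 'Steve', last_name: 'Jobs'}],
  [3, {first_name: 'Steve', last_name: 'Wozniak'}],
  [1, {first_name: 'Pete', last_name: 'Keogh'}],
]

writes.each do |key, value|
  line = "#{key},#{value.to_json}\n"
  log.write(line)
  # pos is a byte offset, so subtract the bytesize of what we just wrote
  # to get the offset where this record starts.
  index[key] = log.pos - line.bytesize
end

p index  # key 1 points at the *last* record written for it
```

Each key appears exactly once in the index, pointing at wherever its most recent record landed in the file.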

We update our constructor to initialize the index:

  sig {void}
  def initialize
    @log = T.let(File.open(LOG_FILE, 'a+'), File)
    @log.sync = true
    @index = T.let({}, T::Hash[Integer, Integer])
  end

When writing, we get the byte offset (with file.pos) and store it with the key:

  sig {params(key: Integer, value: T::Hash[T.untyped, T.untyped]).returns(NilClass)}
  def set(key, value)
    line = "#{key},#{value.to_json}\n"
    @log.write(line)
    byte_offset = @log.pos - line.length
    @index[key] = byte_offset
    nil
  end

When reading, we open the file, seek to that byte offset and read the line:

  sig {params(key: Integer).returns(T.nilable(T::Hash[T.untyped, T.untyped]))}
  def get(key)
    byte_offset = @index[key]
    return nil if byte_offset.nil?
    File.open(LOG_FILE, 'r') do |file|
      file.pos = byte_offset
      JSON.parse(file.readline.sub(/^\d+,/, ''))
    end
  end

The above works fine, but what about when we close our program and reopen it? It turns out that reads return nil for keys we’ve already added to the database, because the in-memory index is lost when the process exits! To fix this, we need to populate the index when the DB is first created by calling populate_index from the constructor.

  sig { returns(NilClass) }
  def populate_index
    File.open(LOG_FILE, 'r') do |file|
      until file.eof?
        current_pos = file.pos
        line = file.readline
        key = T.must(line.match(/^(\d+),/))[1].to_i
        @index[key] = current_pos
      end
    end
    nil
  end
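To convince ourselves the rebuild works, here is a self-contained sketch (plain Ruby, no sorbet, using a temp file rather than aof.txt). It writes two records for the same key as if from a previous run, starts with an empty index as a fresh process would, replays the log the way populate_index does, and reads back the latest value:

```ruby
require 'json'
require 'tempfile'

# A log left over from a "previous run": key 1 was written twice.
tmp = Tempfile.new('aof')
tmp.write("1,#{{name: 'Peter'}.to_json}\n")
tmp.write("1,#{{name: 'Pete'}.to_json}\n")
tmp.flush

# A fresh process starts with an empty index...
index = {}

# ...so we replay the log, recording each record's byte offset by key.
File.open(tmp.path, 'r') do |file|
  until file.eof?
    current_pos = file.pos
    line = file.readline
    key = line[/^(\d+),/, 1].to_i
    index[key] = current_pos  # later records for the same key win
  end
end

# Reads work again: seek straight to the recorded offset and parse.
value = File.open(tmp.path, 'r') do |file|
  file.pos = index.fetch(1)
  JSON.parse(file.readline.sub(/^\d+,/, ''))
end

p value
```

The replay costs O(n) once at startup, after which every read is back to a single seek.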
Written on March 19, 2021