LevelDB Project: Level 0

by: burt rosenberg
at: university of miami
date: oct 2020
NAME
    LevelDB FUSE filesystem for Linux
    
SYNOPSIS	
    leveldbd [-v] _keystore_dir_ _tablestore_dir_
    
DESCRIPTION
    Uses FUSE to attach a key-value filesystem implemented using the 
    ideas of the LevelDB datastore that runs Google's BigTable. 
    
OPTIONS
    -v verbose output
        
HISTORY
    New for csc421-201, november 2019

BUGS

US Patent US4905110 A

Goals

Immutable Key-Value Stores

The traditional file system is a data store that sometimes does not scale well. It provides the service of data persistence. However, scalable datastores need to be distributed so that many processes can access the data at higher access rates; they need to be tolerant to failure, and to this end, the data is replicated. Distributed access, replicated data, and fault tolerence are easier accommodated with key-value stores.

A key-value store is a collection pairs, a key with a value. A key-value pair can be inserted into the store, over-writing if there is an existing pair with a matching key; and the value associated to a key can be retrieved.

Databases that change in time are difficult to make consistent. A database that never changes is of limited application. A compromise is a database that many ways is immutable, and otherwise grows by appending new facts. No fact is over wrong, but it could be out-dated. The collection of all facts at any moment is a subset of all facts at any future time.

This is not exactly true, as outdated facts are effectively removed as they are no long reported. However, it is really a pragmatic matter, that a "gleaning" is made to recover space used by outdated facts, once it is well-known that the fact is outdated in any current version of the data store.

LevelDB

LevelDB is a family of files (in the well directory) that store the key-value pairs. The files are mostly immutable (once created they never change), except some are append-only (which shares with immutable a robustness to concurrent access). The files all share a common format. They are a sequence of fixed size (64 bytes) records, with a key (first 16 bytes) then a value (48 bytes).

The key and value are placed into their respective fields, with null ('\0') padding on the right. The null byte is not allowed in the key or value itself. It is like a C string, however, the key or value can fill their entire field, and then will not have a null termination byte.

There are three levels of files,

LevelDB tables
         +-----------------------------------------------------------+ 
level 2  | k,v | k,v | k,v |           ...                     | k,v |
         +-----------------------------------------------------------+
         
           ⇑ search level 2 last
           
                +----------------------------+    
level 1         | k,v | k,v |   ...    | k,v |
                +----------------------------+
                    
           ⇑ search level 1 newest to oldest
           
                           +----------------------------+    
level 1                    | k,v | k,v |   ...    | k,v |
                           +----------------------------+ 

            ⇐ search level 0 newest first, open table first ⇐   

                  (closed)                     (open) 
         +------------------------+     +------------------
level 0  | k,v | k,v | .... | k,v |     | k,v | k,v | ....    ⇐   | k,v |
         +------------------------+     +------------------`   new pairs append to
                                                               open table



        +----------------+------------------------------------------------+
        | KEY (16 bytes) |           VALUE (48 bytes)                     |
        +----------------+------------------------------------------------+

Searching in LevelDB files

The tables are arranged on a search path that assure that the most recent version of a key-value pair is encountered first on the path. Once a key match is found, the search terminates. The older key-value pairs of the same key are in effect discarded, without having to modify any files.

Level 0 files are search from most recently added key-value pair backwards, stopping at the first match of the key. In this way updated values hide older values. There is an open level 0 file which accepts appends. When the current file is large enough, it is closed and a new open level 0 file is created. Then the two level 0 files are searched newest to oldest, open file first, then closed file, for the first match.

Level 1 and 2 files are searched by binary search, aided by the fixed record length. The are searched from the most recently created level 1 file back towards the oldest level 1, then searching the level 2 for the first match.

Tombstone

To avoid modifications of settled data, deletes are by hiding the deleted key-value pair by a meta key-value pair that indicates the key is not present. Like the student that says "I am not here" during attendance. In this year's version, we will mark a deleted key by a key with the empty string as a value.

Key-value pairs in level 2 files will never be deleted from the file, as all files are immutable. A new level 2 file will be created without the deleted pairs, and replaces all-or-nothing the existing level 2 file. There are never any Tombstones in a level 2 table, as they are not needed. Tombstones in level 0 or level 1 files are needed to halt the search for a deleted key.

Gleaning

Updates work by placing the new value ahead of the old value in the search of pairs. Because level 0 are append-only, they might contain multiple pairs of the same key. However level 1 and level 2 files must not. The process of discarding hiden key-value pairs is gleaning.

Glean a level 0 when forming a level 1 by sweeping through discarding all but the most recent key-value pair of a key. Treat the tombstones no differently then an other pair.

Level 1's and combined with each other, or with level 2's, by gleaning all but the most recent key-value pair for a key. When merging to create a new level 2, the tombstones are dropped.

The LevelDB filesystem

The key–value store is given the patina of a file system by running your FUSE enabled program, leveldbd, on the mount point store.

Under the directory store, these files do not exist. The store directory is a way of inserting a key-value pair into actually true files, in the kv-table directory.

The kv-table directory can contain a number of level 0 and level 1 tables, and a single level 2 table, each as separate files. The level 0 tables are append only until a certain size, when they become immutable. The other tables are all immutable.

However, tables are deleted when it is certain they will never be accessed again.

This is a very durable method for storing data. It is resistant to corruption due to concurrency issues. The LevelDB system can be used in a distributed architecture, with multiple threads, because the LevelDB files are immutable. Immutable files cannot give conflicting data, because the data never changes. Updates replace old immutable file with new ones in a manner that is atomic, consistent, efficient, and concurrent.

Project: Level 0

This project implements just the first level 0 file. You have template code that does the FUSE part and also opens up the file kv-table/l0.dat, which will be the unique level 0 table for this part of the project.

It also memory-maps it, so that you will not have to use file operations on the table. It can be programmed as if it were an in-memory array.

Creative Commons License
This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.

author: burton rosenberg
created: 10 nov 2019
update: 21 oct 2020