LevelDB Project

by: burt rosenberg
at: university of miami
date: nov 2019
NAME
    levelDB-server levelDB-tool 
    
SYNOPSIS	
    levelDB-server [-v] [-s]  _mount_dir_ _data_dir_
    levelDB-tool [-v] [-m in|out|sort] _level_file_
    
DESCRIPTION
    Uses FUSE to attach a key-value filesystem implemented using the 
    ideas of the LevelDB datastore that runs Google's BigTable. 
    
OPTIONS
    -v verbose output
    -s to pass the remainder of the options to FUSE (not implmented)
          
        
HISTORY
    New for csc421-201, november 2019

BUGS

US Patent US4905110 A

Goals

The Wish and the Well

Because this stuff is a little bit unexpected, and your preconceptions might hinder your understanding of the project, I will tell it as a strange story of the wishes and the well.

The wish and the well are directories. The wish is a directory to which a wishes are made, and the leveldb program is listening for those wishes, and manipulates real files in the well directory to respond to the wish.

The section of the file system rooted at the wish directory accepts the common file system requests (read, write, unlink) and responds in the common way, but it implements the semantics of a key–value store, rather than that of a common file system.

A key–value store is a collection of pairings, (key, value), indexed on the key. A pair can be added to the collection, the collection can be queried for the value associated with a key, the pairing can be updated, or the pairing can be completely removed. Our key–value pairs will be mostly pairings of text.

The key–value store is given the patina of a file system so,

Under the directory wish, these files do not exist. The wish directory is a way of inserting a key-value backend into a filesystem situation. The data is stored in the well directory, but not as separate files, but as a sequence of LevelDB files, that organize the key-value pairs for durable storage and retrieval.

Durable means that the store is very resistent to corruption due to concurrency issues. The LevelDB system can be used in a distributed architecture, with multiple threads, because the LevelDB files are immutable. Immutable files cannot give conflicting data, because the data never changes.

Updates replace old immutable file with new ones in a manner that is atomic, consistent, efficient, and concurrent.

FUSE

Unix has mountable file systems. This means that the overall file tree can be comprised of different file systems. The passage from one file system to another is at an object that looks to the user like a directory, but it is a mount point. A mount point node is marked with the root of the filesystem which should begin at that directory.

FUSE allows user programs to be the destination of a mount point. By registering on a directory, FUSE intercepts all file requests on a path passing through that directory and packages up the request as a call into the user program. The user program provides a response by call-by-reference parameters, and the return value.

FUSE and Files

User-program <-----> Kernel <-----> FUSE <-----> minifat.c <-----> fatdisk.dmg

            filesystem                 FUSE callbacks     filesystem

Looking at the template C program for the project, not the FUSE Callback Table. There are the functions in your program that FUSE will call when it intercepts the corresponding filesystem operation. E.g. readding a diretory, opening, reading or writing a file.

The FUSE Callback table
static struct fuse_operations fuse_oper = {
	// call backs for fuse
	.getattr = fs_getattr,
	.readdir = fs_readdir,
	.mkdir   = fs_mkdir,
	.rmdir   = fs_rmdir,
	.create  = fs_create,
	.unlink  = fs_unlink,
	.utimens = fs_utimens,
	.truncate = fs_truncate,
	.open    = fs_open,
	.write   = fs_write,
	.read    = fs_read,
} ;

To get started with FUSE, explore class/examples/fuse-hw. The Makefile has an install target that you run to install the FUSE development package. The test target will compile and run the hello-world file system. This is a file system that contains only single file in the root directory with a fixed content.

The documentation on FUSE is scarce. Here are some resources:

LevelDB

As FUSE is the technique behind wishes, LevelDB is the technique behind wells.

LevelDB is a family of files (in the well directory) that store the key-value pairs. The files are mostly immutable (once created they never change), except some are append-only (which shares with immutable a robustness to concurrent access). The files all share a common format. They are a sequence of fixed size (64 bytes) records, with a key (first 16 bytes) then a value (48 bytes).

The key and value are placed into their respective fields, with null ('\0') padding on the right. The null byte is not allowed in the key or value itself. It is like a C string, however, the key or value can fill their entire field, and then will not have a null termination byte.

There are three levels of files,

LevelDB tables
         +-----------------------------------------------------------+ 
level 2  | k,v | k,v | k,v |           ...                     | k,v |
         +-----------------------------------------------------------+
         
           ⇑ search level 2 last
           
                +----------------------------+    
level 1         | k,v | k,v |   ...    | k,v |
                +----------------------------+
                    
           ⇑ search level 1 newest to oldest
           
                           +----------------------------+    
level 1                    | k,v | k,v |   ...    | k,v |
                           +----------------------------+ 

            ⇐ search level 0 newest first, open table first ⇐   

                  (closed)                     (open) 
         +------------------------+     +------------------
level 0  | k,v | k,v | .... | k,v |     | k,v | k,v | ....    ⇐   | k,v |
         +------------------------+     +------------------`   new pairs append to
                                                               open table
Searching in LevelDB files

The tables are arranged on a search path that assure that the most recent version of a key-value pair is encountered first on the path. Once a key match is found, the search terminates. The older key-value pairs of the same key are in effect discarded, without having to modify any files.

Level 0 files are search from most recently added key-value pair backwards, stopping at the first match of the key. In this way updated values hide older values. There is an open level 0 file which accepts appends. When the current file is large enough, it is closed and a new open level 0 file is created. Then the two level 0 files are searched newest to oldest, open file first, then closed file, for the first match.

Level 1 and 2 files are searched by binary search, aided by the fixed record length. The are searched from the most recently created level 1 file back towards the oldest level 1, then searching the level 2 for the first match.

Tombstone

To avoid modifications of settled data, deletes are by hiding the deleted key-value pair by a meta key-value pair that indicates the key is not present. Like the student that says "I am not here" during attendance. The key is the key to delete, and the value is the null byte followed by the 0xff byte.

Key-value pairs in level 2 files will not be deleted, as the files are immutable. A new level 2 file will be created without the deleted pairs. In the meanwhile, tombstones in level 0 or level 1 files will halt the search for a deleted key.

Gleaning

Updates work by placing the new value ahead of the old value in the search of pairs. Because level 0 are append-only, they might contain multiple pairs of the same key. However level 1 and level 2 files must not. The process of discarding hiden key-value pairs is gleaning.

Glean a level 0 when forming a level 1 by sweeping through discarding all but the first instance of a key. Treat the tombstones no differently then an other pair.

Glean level 1's and 2's by retaining only the first key-value pair for a given key on the search path. If the remaining key-value pair is a tombstone, discard it. Level 2 files should not have tombstones.

No value keys

A null key is represented by a pair of null bytes at indices 0 and 1 of the value field. This clarification is needed to reserve values of the second byte other than 0 or 255 for future assignment.
— This clarification was written 14 may 2020

Transacting the tables

The migration of table date from open L0, to closed L0, to L1, to L2, is carried out in concurrent context with minimal halting of database updates or access.

The use of open and closed L0 tables allows that the closed L0 is transformed to an L1 while updates and accesses continues. During the time to create the L1, only the open L0 is changed, and this does not affect the rest of the tables.

Once the L1 table is created, a brief lock drops the closed L0 table from the search path simultaneously inserting the new L1 table into the search path, in front of any other L1 table.

Likewise, a collection of L1 tables and the L2 table can proceed concurrently with database operation. When they have been gleaned and merged into a new L2, a brief lock drops the current L2 and the component L1's from the search path and inserts the new L2 at the end of the search path.

Physical files for the tables

The directory rooting the datafiles is given as an argument to the command. We have called it "well". Create subdirectories level0, level1 and level2 to contain the files at that level. The data files can be named 0.ldb, 1.ldb, etc., wrapping back around to 0.ldb.

For instance, a simple rule could be there are only 0.ldb and 1.ldb files on each level, and they get used alternately.

On level 0, start with 0.ldb open. To close it just stop writing to it and instead write to 1.ldb. This starts the modification of level0/0.ldb into a level1 table, in the level1 directory. Once the leve1 table is created. delete level0/0.ldb. Now when level0/1.ldb closes, the new open level 0 table is level0/0.ldb.

Level 2 files are created alternately as level2/1.ldb, level2/0.ldb, level2/1.ldb, and so on. While one is being built the other continues to be used.

Building a level 2 can merge all level1 tables and the current level 2 table in use at the time the merge started. New level 1 tables might be created, but this does not affect the correctness of the merge already in progress.

Once merged, the obsolete level2 table and level1 tables are discarded.

well
  |
  +----  level0
  |        |
  |        +----- 0.ldb  
  |        |
  |        +----- 1.ldb
  |
  +----  level1
  |        |
  |        +----- 0.ldb  
  |        |
  |        +----- 1.ldb
  |
  +----  level2
           |
           +----- 0.ldb  

Getting started with Memory Mapped files

While it is possible to do the project without memory mapped IO, you might what to try it. Promotion of tables requires certain algorithms,

These algorithms are likely to be much simpler as in-memory algorithms.

The leveldb-tool program gives an example.

How to do this project

Extra Credit (or food for thought):

The wish directory has phantom directories level0, level1 and level2. Consider queries with syntax e.g. wish/level1/key, as a query only on key-values pairs in level1 tables.

The value field has a lot of information not used, as anything with the first byte a null is a empty value and the remaining 47 bytes are available for codes. We have used one combination for the tombstone.

An possible key-value is an autoincrement, where atomically a value is returned and the value is incremented by one.

Creative Commons License
This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.

author: burton rosenberg
created: 10 nov 2019
update: 14 may 2020