lsmlite-rs: Rust bindings for SQLite's lsm1 storage engine

TL;DR: We have open-sourced lsmlite-rs , our Rust bindings for SQLite’s lsm1 storage engine.

At Helsing, we of course use open source software, but also actively contribute back to the community, for example with our internal software testing workshop, the Oxigraph database, an auto formatter for the RDF turtle format, our internal build system, and our protobuf package manager. In the same spirit, we would like to announce today that we are releasing lsmlite-rs – our Rust bindings for SQLite’s lsm1 stand-alone storage engine: https://github.com/helsing-ai/lsmlite-rs.

Rust logo courtesy of the Rust Foundation (CC-BY).

SQLite’s lsm1 extension is an excellent implementation of Log-structured Merge Trees and is similar in spirit to RocksDB and WiredTiger (the storage engine of MongoDB). Unlike RocksDB, lsm1 structures data on stable storage as a collection of read-only B-trees (called "segments" in lsm1's terminology) that increase in size as the database grows. Thus, lsm1 follows the fundamental design principles of bLSM rather than those of a traditional LSM-Tree – in which data is stored in immutable sorted arrays. This comes with the advantage of offering excellent I/O for reads out-of-the-box, while also being efficient at writing data.

Other appealing characteristics of lsm1 are:

Industrial-grade, robust implementation.
Small (single-file) implementation with single-writer/multiple-reader MVCC-based transactional concurrency model.
Data durability in the face of application or system failure (it allows ACID transactions).
Support for compression and/or encryption via hooks.
Permissive license.
Read-only support. Writes to a database opened in read-only mode are rejected.
Optimisations for write-once read-many (WORM) workloads. A database becomes essentially a single densely-packed B-tree. Improving the space required by the database on disk, as well as providing optimal I/O for reads.

Background

We started working on lsmlite-rs to support real-time sensor data storage and retrieval on embedded devices. Instead of going all-in with a canonical choice like RocksDB, we decided to extract a lightweight API (Rust traits Disk and Cursor) in order to abstract the underlying storage engine. This allowed us to experiment with both RocksDB and lsm1 in parallel. This is how lsmlite-rs was born.

Design decisions

For lsmlite-rs we have decided to try to keep resources under control. That is:

The main memory level of the LSM Tree is limited to 16 MiB and a total of two of them can exist at any given time. There is one structure (the active one) in which all writes are performed, and another one (the old) that is marked for flushing to disk.
No part of the database is memory-mapped. That is, all file operations are done through operating system’s read and write primitives.
Access to a database file from a different process is currently deactivated. This speeds up operations on the database from different threads as no shared-memory and POSIX locks are used.
Robustness in the presence of a system crash is (by default) configured such that the database does not get corrupted, but recently committed transactions may be lost following recovery. This is lsm1's default durability setting.
Background threads (in the relevant operational modes) are actively scheduled as writes to the database are performed. To keep memory usage as well as data safety under control, writes to the database may incur in higher latencies while the corresponding background thread works towards flushing data from main memory to disk (thus reducing main memory consumption) and/or check-pointing volatile data (the data that could be lost during a failure).

Getting started

The following is a short example on how to declare and open a database, then insert data and traverse the whole database extracting all keys and values currently contained therein. Additional examples on particular topics (eg, transactions, or compression) can be found under the examples directory.

use lsmlite_rs::*;

// Make sure that `/tmp/my_db.lsm` does not exist yet.
let db_conf = DbConf::new("/tmp/", "my_db".to_string());

// Declare an empty handle.
let mut db: LsmDb = Default::default();
// Initialize the handle with our configuration.
let rc = db.initialize(db_conf);
// Connect to the database. It is at this point that the file is produced.
let rc = db.connect();

// Insert data into the database, so that something gets traversed.
// Persist numbers 1 to 100 with 1 KB zeroed payload.
let value = vec![0; 1024];
let max_n: usize = 100;
for n in 1..=max_n {
    let key_serial = n.to_be_bytes();
    let rc = db.persist(&key_serial, &value).unwrap();
}

// Open the cursor once we have written data (snapshot isolation).
let mut cursor = db.cursor_open().unwrap();
// Move the cursor to the very first record on the database.
let rc = cursor.first();
assert!(rc.is_ok());

// Now traverse the database extracting the data we just added.
let mut num_records = 0;
while cursor.valid().is_ok() {
    num_records += 1;
    // Extract the key.
    let current_key = Cursor::get_key(&cursor).unwrap();
    // Parse it to an integer.
    assert!(current_key.len() == 8);
    let key = usize::from_be_bytes(current_key.try_into().unwrap());
    // Extract the value.
    let current_value = Cursor::get_value(&cursor).unwrap();
    // Everything should match.
    assert!(key == num_records);
    assert!(current_value.len() == 1024);
    // Move onto the next record.
    cursor.next().unwrap();
}
// We did find what we wanted.
assert_eq!(num_records, max_n);
assert!(cursor.valid().is_err());

Micro-benchmarks

We have experimentally compared the performance of lsmlite-rs against RocksDB – as provided by the rocksdb crate – by performing two sets of micro-benchmarks. These experiments are not meant to be exhaustive in any way, our main objective here is to simply compare the performance of the engines under very simple conditions that represent workloads we are interested in.

Experimental setup

All experiments were run on an embedded system powered by an NVIDIA Jetson Xavier NX with the following main specifications:

6xARM64 V8, 8 GB of main memory
Crucial P3 1TB M.2 PCIe Gen3 NVMe Internal SSD
Ubuntu 20.04.6 LTS
Rust 1.69.0 withcargo build --release

We implemented our traits Disk and Cursor using RocksDB and we have only changed the following settings (to reduce runtime overheads as well as to keep main memory usage limited and equivalent to the amount used by lsmlite-rs) from the defaults provided by RocksDB:

set_write_buffer_size(16 * 1024 * 1024): Keeps main-memory usage low.
set_compression_type(DBCompressionType::None): Avoids runtime overheads.
set_bottommost_compression_type(DBCompressionType::None): Avoids runtime overheads.
set_blob_compression_type(DBCompressionType::None): Avoids runtime overheads.

By default, RocksDB uses two background threads to take care of file operations (compactions). So, in total, RocksDB runs on three threads (including the main writing thread).

We compared two different lsmlite-rs configurations. The first is the default configuration, in which a single thread is responsible for writing to the database as well as performing file operations like compactions and flushing to disk. In our plots below, this configuration is represented by LSMLiteRS(1T). The second configuration uses one additional background thread to take care of the aforementioned file operations. This mode can be activated by setting LsmMode::LsmBackgroundMerger in the database’s configuration before initialisation. Thus, in this second configuration, there is a clear separation of concerns between writing to the database (write-ahead log and main-memory component of the LSM Tree) and (heavy) file operations. In our plots, this configuration is represented by LSMLiteRS(2T).

First set of experiments: sorted keys

In the very first set of experiments, we insert keys 1..n for n ∈ {1x10^6, 5x10^6, 10x10^6, 15x10^6, 20x10^6, 25x10^6, 30x10^6, 50x10^6, 100x10^6} in which all keys are sorted in increasing order. Each record consists of a pair <k, v> in which k is 8-byte long and v is a random blob of 1 kilobyte. Thus each record is as large as 1032 bytes. Therefore, the amount of data the corresponding database is ingesting varies from about 1 GiB to about 100 GiB. This experiment represents use-cases in which keys arrive in sorted order - for example when bulk-loading a database, or when a record is persisted using a timestamp (eg, arrival, creation, captured, etc.)

After having written the database, two sets of queries are issued against it:

Set of queries with 1% selectivity. That is, each query reads at most 1% of the database — the only exception is if the key lies at the end of the database and thus 1% of reads are no longer possible. This query selectivity is usually regarded as low — and thus the result set is large.
Set of queries with 0.1% selectivity. The result set of each query is now much smaller.

Before running each set of queries, keys are randomly permuted. Also, each query retrieves key and value of each output record. Finally, the amount of queries run is such that the whole database might be read. That is, for example, for a database containing 10^6 records, running queries with 1% selectivity, we run 100 queries, whereas on queries with 0.1% selectivity we run 1000 queries.

The results of this set of experiments can be seen in the following set of four plots:

Building and querying a database in which keys were inserted in sequential order.

As a baseline for write speed we have also included in this experiment PCAP files using this Rust implementation. In theory, writing PCAP files is very fast as it can be used as an append-only file (better utilising hardware bandwidth). However, observe that the ingestion performance of RocksDB is by far the best. Compared to lsmlite-rs (regardless of the configuration), RocksDB achieves considerable speed-ups against lsmlite-rs (2x-6x) while building the database. Compared to PCAP, the speed-ups are much lower (about 30%). However, observe as well that the memory footprint of RocksDB while building the database is considerably higher than that of lsmlite-rs and PCAP as the size of the database increases (about 50% more main memory). This is in our opinion an important detail to keep in mind when running under constrained resources.

When it comes to querying the database, the experiments indicate that lsmlite-rs has an edge over RocksDB as the size of the result set increases. On queries with 0.1% selectivity RocksDB is definitely faster than lsmlite-rs. However, on queries with low selectivity lsmlite-rs is definitely faster. It is important to point out that the difference in query performance between LSMLiteRS(1T) and LSMLiteRS(2T) for queries with 0.1% selectivity lies in the internal shape of the database - a background thread can compact segments more aggressively and thus optimise the database much better than a single thread doing the least amount of compaction to avoid having a very high number of small segments in the database while also taking care of the main writes to the database. Observe, however, that if selectivity is low these differences seem to have no strong effect any longer.

We did not perform queries on the PCAP files as they are not efficiently queryable (no efficient random access).

These experiments indicate that RocksDB is the definitely the better alternative when the data to be persisted is naturally sorted (as long as the underlying system has enough memory for it). Surprisingly enough, RocksDB turned out to be even better than simply writing PCAP files and it has the advantage that the database is efficiently queryable. Finally, it seems that lsmlite-rs has very limited opportunities in this scenario; only when main memory is strongly constrained and query performance is truly critical in the presence of low-selectivity queries in the workload.

Second set of experiments: random keys

The second set of experiments is similar to the previous one; the only difference is that now keys are randomly permuted before the databases are built. Without further ado, the results of this set of experiments can be seen in the following set of four plots:

Building and querying a database in which keys were inserted in random order.

It seems that RocksDB has taken a considerable performance hit this time. Whereas the write performance of lsmlite-rs is similar in both sets of experiments, the write performance of RocksDB degraded almost an order of magnitude. Also, as the order in which keys are persisted has an impact on the internal positions of the records in the database, query performance is also influenced, and as we can see, lsmlite-rs has considerably better query performance (in either configuration) even on higher selectivity queries – unlike the previous set of experiments.

These experiments indicate that whether to use RocksDB or some other engine like lsmlite-rs strongly depends on the workload. In this regard, we would like to point out that, although keys in random order seem like an unrealistic scenario, the nature of writes on LSM-based engines is such that all writes (insertions, updates, deletions) are treated similarly – as records are inserted into the database with a marker indicating what kind of operation is being performed – and it is through compactions that write operations for the same record are simplified as they eventually end up next to each other (due to the sorted nature of the database). That is, in workloads in which not only insertions happen, but also updates and deletions, any kind of write on any given record will essentially appear in arbitrary order to the engine w.r.t. writes to other records, and thus this will definitely affect the performance of RocksDB for example.

As mentioned before, our experiments are not meant to be exhaustive as the configuration space is very large. Your mileage will definitely vary, but in any case, having access to one more storage engine that could fulfil particular requirements is always good.

How to contribute

We would be happy to hear your thoughts, feedback and pull requests are welcome.

lsmlite-rs is available under the Apache License 2.0 at https://github.com/helsing-ai/lsmlite-rs