RocksDB v6.21.0 Release Notes

Release Date: 2021-05-21 // almost 3 years ago
  • 🐛 Bug Fixes

    • 🛠 Fixed a bug in handling file rename error in distributed/network file systems when the server succeeds but client returns error. The bug can cause CURRENT file to point to non-existing MANIFEST file, thus DB cannot be opened.
    • 🛠 Fixed a bug where ingested files were written with incorrect boundary key metadata. In rare cases this could have led to a level's files being wrongly ordered and queries for the boundary keys returning wrong results.
    • 🛠 Fixed a data race between insertion into memtables and the retrieval of the DB properties rocksdb.cur-size-active-mem-table, rocksdb.cur-size-all-mem-tables, and rocksdb.size-all-mem-tables.
    • 🛠 Fixed the false-positive alert when recovering from the WAL file. Avoid reporting "SST file is ahead of WAL" on a newly created empty column family, if the previous WAL file is corrupted.
    • Fixed a bug where GetLiveFiles() output included a non-existent file called "OPTIONS-000000". Backups and checkpoints, which use GetLiveFiles(), failed on DBs impacted by this bug. Read-write DBs were impacted when the latest OPTIONS file failed to write and fail_if_options_file_error == false. Read-only DBs were impacted when no OPTIONS files existed.
    • Handle return code by io_uring_submit_and_wait() and io_uring_wait_cqe().
    • 🔀 In the IngestExternalFile() API, only try to sync the ingested file if the file is linked and the FileSystem/Env supports reopening a writable file.
    • Fixed a bug that AdvancedColumnFamilyOptions.max_compaction_bytes is under-calculated for manual compaction (CompactRange()). Manual compaction is split to multiple compactions if the compaction size exceed the max_compaction_bytes. The bug creates much larger compaction which size exceed the user setting. On the other hand, larger manual compaction size can increase the subcompaction parallelism, you can tune that by setting max_compaction_bytes.

    Behavior Changes

    • 🛠 Due to the fix of false-postive alert of "SST file is ahead of WAL", all the CFs with no SST file (CF empty) will bypass the consistency check. We fixed a false-positive, but introduced a very rare true-negative which will be triggered in the following conditions: A CF with some delete operations in the last a few queries which will result in an empty CF (those are flushed to SST file and a compaction triggered which combines this file and all other SST files and generates an empty CF, or there is another reason to write a manifest entry for this CF after a flush that generates no SST file from an empty CF). The deletion entries are logged in a WAL and this WAL was corrupted, while the CF's log number points to the next WAL (due to the flush). Therefore, the DB can only recover to the point without these trailing deletions and cause the inconsistent DB status.

    🆕 New Features

    • Add new option allow_stall passed during instance creation of WriteBufferManager. When allow_stall is set, WriteBufferManager will stall all writers shared across multiple DBs and columns if memory usage goes beyond specified WriteBufferManager::buffer_size (soft limit). Stall will be cleared when memory is freed after flush and memory usage goes down below buffer_size.
    • 👍 Allow CompactionFilters to apply in more table file creation scenarios such as flush and recovery. For compatibility, CompactionFilters by default apply during compaction. Users can customize this behavior by overriding CompactionFilterFactory::ShouldFilterTableFileCreation().
    • ➕ Added more fields to FilterBuildingContext with LSM details, for custom filter policies that vary behavior based on where they are in the LSM-tree.
    • Added DB::Properties::kBlockCacheEntryStats for querying statistics on what percentage of block cache is used by various kinds of blocks, etc. using DB::GetProperty and DB::GetMapProperty. The same information is now dumped to info LOG periodically according to stats_dump_period_sec.
    • ➕ Add an experimental Remote Compaction feature, which allows the user to run Compaction on a different host or process. The feature is still under development, currently only works on some basic use cases. The interface will be changed without backward/forward compatibility support.
    • 👌 RocksDB would validate total entries read in flush, and compare with counter inserted into it. If flush_verify_memtable_count = true (default), flush will fail. Otherwise, only log to info logs.
    • Add TableProperties::num_filter_entries, which can be used with TableProperties::filter_size to calculate the effective bits per filter entry (unique user key or prefix) for a table file.
    • ➕ Added a cancel field to CompactRangeOptions, allowing individual in-process manual range compactions to be cancelled.

    🐎 Performance Improvements

    • BlockPrefetcher is used by iterators to prefetch data if they anticipate more data to be used in future. It is enabled implicitly by rocksdb. Added change to take in account read pattern if reads are sequential. This would disable prefetching for random reads in MultiGet and iterators as readahead_size is increased exponential doing large prefetches.

    Public API change

    • ✂ Removed a parameter from TableFactory::NewTableBuilder, which should not be called by user code because TableBuilder is not a public API.
    • ✂ Removed unused structure CompactionFilterContext.
    • 🗄 The skip_filters parameter to SstFileWriter is now considered deprecated. Use BlockBasedTableOptions::filter_policy to control generation of filters.
    • 🛠 ClockCache is known to have bugs that could lead to crash or corruption, so should not be used until fixed. Use NewLRUCache instead.
    • ➕ Added a new pure virtual function ApplyToAllEntries to Cache, to replace ApplyToAllCacheEntries. Custom Cache implementations must add an implementation. Because this function is for gathering statistics, an empty implementation could be acceptable for some applications.
    • ➕ Added the ObjectRegistry to the ConfigOptions class. This registry instance will be used to find any customizable loadable objects during initialization.
    • Expanded the ObjectRegistry functionality to allow nested ObjectRegistry instances. Added methods to register a set of functions with the registry/library as a group.
    • Deprecated backupable_db.h and BackupableDBOptions in favor of new versions with appropriate names: backup_engine.h and BackupEngineOptions. Old API compatibility is preserved.

    0️⃣ Default Option Change

    • When options.arena_block_size <= 0 (default value 0), still use writer_buffer_size / 8 but cap to 1MB. Too large alloation size might not be friendly to allocator and might cause performance issues in extreme cases.

    🏗 Build

    • 👉 By default, try to build with liburing. For make, if ROCKSDB_USE_IO_URING is not set, treat as enable, which means RocksDB will try to build with liburing. Users can disable it with ROCKSDB_USE_IO_URING=0. For cmake, add WITH_LIBURING to control it, with default on.