RocksDB v6.21.0 Release Notes

Release Date: 2021-05-21
  • 🐛 Bug Fixes

    • 🛠 Fixed a bug in handling file rename errors in distributed/network file systems when the server succeeds but the client returns an error. The bug could cause the CURRENT file to point to a non-existent MANIFEST file, so the DB could not be opened.
    • 🛠 Fixed a bug where ingested files were written with incorrect boundary key metadata. In rare cases this could have led to a level's files being wrongly ordered and queries for the boundary keys returning wrong results.
    • 🛠 Fixed a data race between insertion into memtables and the retrieval of the DB properties rocksdb.cur-size-active-mem-table, rocksdb.cur-size-all-mem-tables, and rocksdb.size-all-mem-tables.
    • 🛠 Fixed the false-positive alert when recovering from the WAL file. Avoid reporting "SST file is ahead of WAL" on a newly created empty column family, if the previous WAL file is corrupted.
    • Fixed a bug where GetLiveFiles() output included a non-existent file called "OPTIONS-000000". Backups and checkpoints, which use GetLiveFiles(), failed on DBs impacted by this bug. Read-write DBs were impacted when the latest OPTIONS file failed to write and fail_if_options_file_error == false. Read-only DBs were impacted when no OPTIONS files existed.
    • Handle the return codes of io_uring_submit_and_wait() and io_uring_wait_cqe().
    • 🔀 In the IngestExternalFile() API, only try to sync the ingested file if the file is linked and the FileSystem/Env supports reopening a writable file.
    • Fixed a bug where AdvancedColumnFamilyOptions.max_compaction_bytes was under-calculated for manual compaction (CompactRange()). A manual compaction is split into multiple compactions if its size exceeds max_compaction_bytes. The bug produced much larger compactions whose size exceeded the user setting. On the other hand, a larger manual compaction can increase subcompaction parallelism; you can tune that by setting max_compaction_bytes.
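
The splitting rule for manual compaction can be pictured with a small model. This is only a sketch under simplifying assumptions: SplitManualCompaction is a hypothetical helper, not a RocksDB API, and it groups a flat list of file sizes greedily, whereas the real picker also has to consider levels and overlapping key ranges.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper (not part of RocksDB): greedily groups consecutive
// files so that each group's total size stays within max_compaction_bytes.
// A single file larger than the limit still forms a group of its own.
std::vector<std::vector<uint64_t>> SplitManualCompaction(
    const std::vector<uint64_t>& file_sizes, uint64_t max_compaction_bytes) {
  assert(max_compaction_bytes > 0);
  std::vector<std::vector<uint64_t>> groups;
  uint64_t current_total = 0;
  for (uint64_t size : file_sizes) {
    // Start a new group for the first file, or when adding this file would
    // push the current group over the limit.
    if (groups.empty() || current_total + size > max_compaction_bytes) {
      groups.emplace_back();
      current_total = 0;
    }
    groups.back().push_back(size);
    current_total += size;
  }
  return groups;
}
```

In this model, a smaller max_compaction_bytes yields more, smaller groups, and a larger value yields fewer, larger ones, which is the trade-off the note above describes.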

    Behavior Changes

    • 🛠 Due to the fix for the false-positive alert "SST file is ahead of WAL", all CFs with no SST files (empty CFs) now bypass the consistency check. This fixes a false positive but introduces a very rare false negative, which is triggered under the following conditions: a CF has delete operations among its last few operations that result in an empty CF (the deletions are flushed to an SST file and a compaction is triggered that combines this file with all other SST files and produces an empty CF, or there is another reason to write a manifest entry for this CF after a flush that generates no SST file from an empty CF). The deletion entries are logged in a WAL and this WAL is corrupted, while the CF's log number points to the next WAL (due to the flush). The DB can then only recover to the point before these trailing deletions, which leaves it in an inconsistent state.

    🆕 New Features

    • Added a new option, allow_stall, passed when creating a WriteBufferManager instance. When allow_stall is set, WriteBufferManager stalls all writers that share it across multiple DBs and column families once memory usage goes beyond the specified WriteBufferManager::buffer_size (soft limit). The stall is cleared when memory is freed after a flush and memory usage drops below buffer_size.
    • 👍 Allow CompactionFilters to apply in more table file creation scenarios such as flush and recovery. For compatibility, CompactionFilters by default apply during compaction. Users can customize this behavior by overriding CompactionFilterFactory::ShouldFilterTableFileCreation().
    • ➕ Added more fields to FilterBuildingContext with LSM details, for custom filter policies that vary behavior based on where they are in the LSM-tree.
    • Added DB::Properties::kBlockCacheEntryStats for querying statistics on what percentage of block cache is used by various kinds of blocks, etc. using DB::GetProperty and DB::GetMapProperty. The same information is now dumped to info LOG periodically according to stats_dump_period_sec.
    • ➕ Added an experimental Remote Compaction feature, which allows the user to run compaction on a different host or process. The feature is still under development and currently works only for some basic use cases. The interface will change without backward/forward compatibility support.
    • 👌 RocksDB now validates the total number of entries read during flush and compares it with the count of entries inserted into the memtable. On a mismatch, the flush fails if flush_verify_memtable_count = true (default); otherwise the mismatch is only written to the info log.
    • Add TableProperties::num_filter_entries, which can be used with TableProperties::filter_size to calculate the effective bits per filter entry (unique user key or prefix) for a table file.
    • ➕ Added a cancel field to CompactRangeOptions, allowing individual in-process manual range compactions to be cancelled.
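
The "effective bits per filter entry" calculation mentioned for TableProperties::num_filter_entries is a simple ratio. A minimal sketch, assuming filter_size is expressed in bytes; EffectiveBitsPerFilterEntry is a hypothetical helper name, and its two parameters stand in for the two TableProperties fields:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not part of RocksDB). The arguments stand in for
// TableProperties::filter_size (assumed to be in bytes) and
// TableProperties::num_filter_entries.
double EffectiveBitsPerFilterEntry(uint64_t filter_size_bytes,
                                   uint64_t num_filter_entries) {
  // The caller is expected to skip tables that have no filter entries.
  assert(num_filter_entries > 0);
  return static_cast<double>(filter_size_bytes) * 8.0 /
         static_cast<double>(num_filter_entries);
}
```

For example, a 1250-byte filter covering 1000 entries works out to 10 bits per entry.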

    🐎 Performance Improvements

    • BlockPrefetcher is used by iterators to prefetch data if they anticipate more data being used in the future. It is enabled implicitly by RocksDB. It now takes the read pattern into account and prefetches only if reads are sequential. This disables prefetching for random reads in MultiGet and iterators, where the exponentially increasing readahead_size would otherwise cause large prefetches.
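
The idea of tying readahead to a sequential read pattern can be sketched as follows. This is a hypothetical model, not RocksDB's BlockPrefetcher: the class name, the doubling policy, and the rule that a read is "sequential" only when it starts exactly where the previous one ended are all assumptions made for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical model of pattern-aware readahead (not RocksDB code).
class SequentialReadahead {
 public:
  SequentialReadahead(size_t initial, size_t max)
      : initial_(initial), max_(max), readahead_size_(initial) {
    assert(initial > 0 && initial <= max);
  }

  // Records a read of [offset, offset + len) and returns how many bytes to
  // prefetch beyond it. Returns 0 when the read does not continue the
  // previous one, which also resets the readahead size.
  size_t OnRead(uint64_t offset, size_t len) {
    const bool sequential = has_prev_ && offset == prev_end_;
    has_prev_ = true;
    prev_end_ = offset + len;
    if (!sequential) {
      readahead_size_ = initial_;
      return 0;
    }
    const size_t to_prefetch = readahead_size_;
    // Grow exponentially, but never beyond the configured maximum.
    readahead_size_ = std::min(max_, readahead_size_ * 2);
    return to_prefetch;
  }

 private:
  size_t initial_;
  size_t max_;
  size_t readahead_size_;
  bool has_prev_ = false;
  uint64_t prev_end_ = 0;
};
```

With this model, a run of back-to-back reads sees the prefetch size double until it reaches the cap, while a jump to an unrelated offset prefetches nothing and starts over from the initial size.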

    Public API change

    • ✂ Removed a parameter from TableFactory::NewTableBuilder, which should not be called by user code because TableBuilder is not a public API.
    • ✂ Removed unused structure CompactionFilterContext.
    • 🗄 The skip_filters parameter to SstFileWriter is now considered deprecated. Use BlockBasedTableOptions::filter_policy to control generation of filters.
    • 🛠 ClockCache is known to have bugs that could lead to crash or corruption, so should not be used until fixed. Use NewLRUCache instead.
    • ➕ Added a new pure virtual function ApplyToAllEntries to Cache, to replace ApplyToAllCacheEntries. Custom Cache implementations must add an implementation. Because this function is for gathering statistics, an empty implementation could be acceptable for some applications.
    • ➕ Added the ObjectRegistry to the ConfigOptions class. This registry instance will be used to find any customizable loadable objects during initialization.
    • Expanded the ObjectRegistry functionality to allow nested ObjectRegistry instances. Added methods to register a set of functions with the registry/library as a group.
    • Deprecated backupable_db.h and BackupableDBOptions in favor of new versions with appropriate names: backup_engine.h and BackupEngineOptions. Old API compatibility is preserved.

    0️⃣ Default Option Change

    • When options.arena_block_size <= 0 (default value 0), still use write_buffer_size / 8 but cap it at 1MB. Too large an allocation size might not be friendly to the allocator and might cause performance issues in extreme cases.
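
The new default can be written out as a one-line rule. A minimal sketch, assuming the option is an unsigned size (so "<= 0" reduces to "== 0") and ignoring any alignment the real implementation may apply; EffectiveArenaBlockSize is a hypothetical helper, not a RocksDB function:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical helper (not part of RocksDB) mirroring the rule above.
size_t EffectiveArenaBlockSize(size_t arena_block_size,
                               size_t write_buffer_size) {
  constexpr size_t kOneMiB = 1024 * 1024;
  // An explicitly configured value is used as-is.
  if (arena_block_size > 0) {
    return arena_block_size;
  }
  // Default: one eighth of the write buffer, capped at 1 MiB.
  const size_t derived = std::min(write_buffer_size / 8, kOneMiB);
  assert(derived <= kOneMiB);
  return derived;
}
```

Under this rule, a 64 MiB write buffer would previously have implied an 8 MiB arena block and now yields 1 MiB, while a 4 MiB write buffer still yields 512 KiB.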

    🏗 Build

    • 👉 By default, try to build with liburing. For make, if ROCKSDB_USE_IO_URING is not set, it is treated as enabled, which means RocksDB will try to build with liburing. Users can disable it with ROCKSDB_USE_IO_URING=0. For cmake, the new WITH_LIBURING option controls it and defaults to on.

Previous changes from v6.20.0

  • Behavior Changes

    • ColumnFamilyOptions::sample_for_compression now takes effect for creation of all block-based tables. Previously it only took effect for block-based tables created by flush.
    • CompactFiles() can no longer compact files from a lower level to an upper level, which risks corrupting the DB (details: #8063). The validation is also added to all compactions.
    • 🛠 Fixed some cases in which DB::OpenForReadOnly() could write to the filesystem. If you want a Logger with a read-only DB, you must now set DBOptions::info_log yourself, such as using CreateLoggerFromOptions().
    • get_iostats_context() will never return nullptr. If thread-local support is not available and the user does not opt out of the iostats context, compilation will fail. The same applies to the perf context.
    • ➕ Added support for WriteBatchWithIndex::NewIteratorWithBase when overwrite_key=false. Previously, this combination was not supported and would assert or return nullptr.
    • 👌 Improve the behavior of WriteBatchWithIndex for Merge operations. Now more operations may be stored in order to return the correct merged result.

    🐛 Bug Fixes

    • 👉 Use thread-safe strerror_r() to get error messages.
    • 🛠 Fixed a potential hang in shutdown for a DB whose Env has the high-pri thread pool disabled (Env::GetBackgroundThreads(Env::Priority::HIGH) == 0).
    • 📚 Made BackupEngine thread-safe and added documentation comments to clarify what is safe for multiple BackupEngine objects accessing the same backup directory.
    • 🛠 Fixed crash (divide by zero) when compression dictionary is applied to a file containing only range tombstones.
    • 🛠 Fixed a backward iteration bug with partitioned filter enabled: not including the prefix of the last key of the previous filter partition in current filter partition can cause wrong iteration result.
    • Fixed a bug that allowed DBOptions::max_open_files to be set with a non-negative integer with ColumnFamilyOptions::compaction_style = kCompactionStyleFIFO.

    🐎 Performance Improvements

    • 🐎 On ARM platforms, use yield instead of wfe to relax the CPU for better performance.

    Public API change

    • Added TableProperties::slow_compression_estimated_data_size and TableProperties::fast_compression_estimated_data_size. When ColumnFamilyOptions::sample_for_compression > 0, they estimate what TableProperties::data_size would have been if the "fast" or "slow" (see ColumnFamilyOptions::sample_for_compression API doc for definitions) compression had been used instead.
    • ⚡️ Updated DB::StartIOTrace to remove the Env object from its arguments, as it is redundant: DB already has an Env object that is passed down to IOTracer::StartIOTrace.
    • ➕ Added FlushReason::kWalFull, which is reported when a memtable is flushed due to the WAL reaching its size limit; those flushes were previously reported as FlushReason::kWriteBufferManager. Also, changed the reason for flushes triggered by the write buffer manager to FlushReason::kWriteBufferManager; they were previously reported as FlushReason::kWriteBufferFull.
    • Extended the file_checksum_dump ldb command and the DB::GetLiveFilesChecksumInfo API for IntegratedBlobDB, so that they return checksums of blob files along with SST files.

    🆕 New Features

    • Added the ability to open BackupEngine backups as read-only DBs, using BackupInfo::name_for_open and env_for_open provided by BackupEngine::GetBackupInfo() with include_file_details=true.
    • ➕ Added BackupEngine support for integrated BlobDB, with blob files shared between backups when table files are shared. Because of current limitations, blob files always use the kLegacyCrc32cAndFileSize naming scheme, and incremental backups must read and checksum all blob files in a DB, even for files that are already backed up.
    • ➕ Added an optional output parameter to BackupEngine::CreateNewBackup(WithMetadata) to return the BackupID of the new backup.
    • ➕ Added BackupEngine::GetBackupInfo / GetLatestBackupInfo for querying individual backups.
    • 👍 Made the Ribbon filter a long-term supported feature in terms of the SST schema (compatible with version >= 6.15.0), though the API for enabling it is expected to change.