RocksDB v7.6.0 Release Notes

  • 🆕 New Features

    • Added prepopulate_blob_cache to ColumnFamilyOptions. If enabled, prepopulate warm/hot blobs which are already in memory into blob cache at the time of flush. On a flush, the blob that is in memory (in memtables) get flushed to the device. If using Direct IO, additional IO is incurred to read this blob back into memory again, which is avoided by enabling this option. This further helps if the workload exhibits high temporal locality, where most of the reads go to recently written data. This also helps in case of the remote file system since it involves network traffic and higher latencies.
    • 👌 Support using secondary cache with the blob cache. When creating a blob cache, the user can set a secondary blob cache by configuring secondary_cache in LRUCacheOptions.
    • Charge memory usage of blob cache when the backing cache of the blob cache and the block cache are different. If an operation reserving memory for blob cache exceeds the avaible space left in the block cache at some point (i.e, causing a cache full under LRUCacheOptions::strict_capacity_limit = true), creation will fail with Status::MemoryLimit(). To opt in this feature, enable charging CacheEntryRole::kBlobCache in BlockBasedTableOptions::cache_usage_options.
    • 👌 Improve subcompaction range partition so that it is likely to be more even. More evenly distribution of subcompaction will improve compaction throughput for some workloads. All input files' index blocks to sample some anchor key points from which we pick positions to partition the input range. This would introduce some CPU overhead in compaction preparation phase, if subcompaction is enabled, but it should be a small fraction of the CPU usage of the whole compaction process. This also brings a behavier change: subcompaction number is much more likely to maxed out than before.
    • ➕ Add CompactionPri::kRoundRobin, a compaction picking mode that cycles through all the files with a compact cursor in a round-robin manner. This feature is available since 7.5.
    • Provide support for subcompactions for user_defined_timestamp.
    • Added an option memtable_protection_bytes_per_key that turns on memtable per key-value checksum protection. Each memtable entry will be suffixed by a checksum that is computed during writes, and verified in reads/compaction. Detected corruption will be logged and with corruption status returned to user.
    • Added a blob-specific cache priority level - bottom level. Blobs are typically lower-value targets for caching than data blocks, since 1) with BlobDB, data blocks containing blob references conceptually form an index structure which has to be consulted before we can read the blob value, and 2) cached blobs represent only a single key-value, while cached data blocks generally contain multiple KVs. The user can specify the new option low_pri_pool_ratio in LRUCacheOptions to configure the ratio of capacity reserved for low priority cache entries (and therefore the remaining ratio is the space reserved for the bottom level), or configuring the new argument low_pri_pool_ratio in NewLRUCache() to achieve the same effect.

    Public API changes

    • ✂ Removed Customizable support for RateLimiter and removed its CreateFromString() and Type() functions.
    • CompactRangeOptions::exclusive_manual_compaction is now false by default. This ensures RocksDB does not introduce artificial parallelism limitations by default.
    • Tiered Storage: change bottommost_temperture to last_level_temperture. The old option name is kept only for migration, please use the new option. The behavior is changed to apply temperature for the last_level SST files only.
    • Added a new experimental ReadOption flag called optimize_multiget_for_io, which when set attempts to reduce MultiGet latency by spawning coroutines for keys in multiple levels.

    🐛 Bug Fixes

    • 🛠 Fix a bug starting in 7.4.0 in which some fsync operations might be skipped in a DB after any DropColumnFamily on that DB, until it is re-opened. This can lead to data loss on power loss. (For custom FileSystem implementations, this could lead to FSDirectory::Fsync or FSDirectory::Close after the first FSDirectory::Close; Also, valgrind could report call to close() with fd=-1.)
    • 🛠 Fix a bug where GenericRateLimiter could revert the bandwidth set dynamically using SetBytesPerSecond() when a user configures a structure enclosing it, e.g., using GetOptionsFromString() to configure an Options that references an existing RateLimiter object.
    • 🛠 Fix race conditions in GenericRateLimiter.
    • 🛠 Fix a bug in FIFOCompactionPicker::PickTTLCompaction where total_size calculating might cause underflow
    • 🛠 Fix data race bug in hash linked list memtable. With this bug, read request might temporarily miss an old record in the memtable in a race condition to the hash bucket.
    • Fix a bug that best_efforts_recovery may fail to open the db with mmap read.
    • 🛠 Fixed a bug where blobs read during compaction would pollute the cache.
    • 🛠 Fixed a data race in LRUCache when used with a secondary_cache.
    • 🛠 Fixed a bug where blobs read by iterators would be inserted into the cache even with the fill_cache read option set to false.
    • 🛠 Fixed the segfault caused by AllocateData() in CompressedSecondaryCache::SplitValueIntoChunks() and MergeChunksIntoValueTest.
    • 🛠 Fixed a bug in BlobDB where a mix of inlined and blob values could result in an incorrect value being passed to the compaction filter (see #10391).
    • 🛠 Fixed a memory leak bug in stress tests caused by FaultInjectionSecondaryCache.

    Behavior Change

    • ➕ Added checksum handshake during the copying of decompressed WAL fragment. This together with #9875, #10037, #10212, #10114 and #10319 provides end-to-end integrity protection for write batch during recovery.
    • 🔀 To minimize the internal fragmentation caused by the variable size of the compressed blocks in CompressedSecondaryCache, the original block is split according to the jemalloc bin size in Insert() and then merged back in Lookup().
    • 🚚 PosixLogger is removed and by default EnvLogger will be used for info logging. The behavior of the two loggers should be very similar when using the default Posix Env.
    • ✂ Remove [min|max]_timestamp from VersionEdit for now since they are not tracked in MANIFEST anyway but consume two empty std::string (up to 64 bytes) for each file. Should they be added back in the future, we should store them more compactly.
    • Improve universal tiered storage compaction picker to avoid extra major compaction triggered by size amplification. If preclude_last_level_data_seconds is enabled, the size amplification is calculated within non last_level data only which skip the last level and use the penultimate level as the size base.
    • 🔀 If an error is hit when writing to a file (append, sync, etc), RocksDB is more strict with not issuing more operations to it, except closing the file, with exceptions of some WAL file operations in error recovery path.
    • A WriteBufferManager constructed with allow_stall == false will no longer trigger write stall implicitly by thrashing until memtable count limit is reached. Instead, a column family can continue accumulating writes while that CF is flushing, which means memory may increase. Users who prefer stalling writes must now explicitly set allow_stall == true.
    • ➕ Add CompressedSecondaryCache into the stress tests.
    • Block cache keys have changed, which will cause any persistent caches to miss between versions.

    🐎 Performance Improvements

    • Instead of constructing FragmentedRangeTombstoneList during every read operation, it is now constructed once and stored in immutable memtables. This improves speed of querying range tombstones from immutable memtables.
    • 🚀 When using iterators with the integrated BlobDB implementation, blob cache handles are now released immediately when the iterator's position changes.
    • MultiGet can now do more IO in parallel by reading data blocks from SST files in multiple levels, if the optimize_multiget_for_io ReadOption flag is set.