RocksDB v7.8.0 Release Notes

    New Features

    • DeleteRange() now supports user-defined timestamp.
    • Provide support for async_io with tailing iterators when ReadOptions.tailing is enabled during scans.
    • Tiered Storage: allow data to move up from the last level to the penultimate level if the input level is the penultimate level or above.
    • Added DB::Properties::kFastBlockCacheEntryStats, which is similar to DB::Properties::kBlockCacheEntryStats, except returns cached (stale) values in more cases to reduce overhead.
    • FIFO compaction now supports migrating from a multi-level DB via DB::Open(). During the migration phase, the FIFO compaction picker will:
        • pick the SST file with the smallest starting key in the bottom-most non-empty level.
        • Note that during the migration phase, the file purge order will only be an approximation of "FIFO", as files in lower levels might sometimes contain newer keys than files in upper levels.
    • Added an option ignore_max_compaction_bytes_for_input to ignore the max_compaction_bytes limit when adding files to be compacted from the input level. This should help reduce write amplification. The option is enabled by default.
    • Tiered Storage: allow data to move up from the last level even if it's a last-level-only compaction, as long as the penultimate level is empty.
    • Add a new option IOOptions.do_not_recurse that can be used by underlying file systems to skip recursing through subdirectories and list only files in the GetChildren API.
    • Add option preserve_internal_time_seconds to preserve the time information for the latest data, which can be used to determine the age of data when preclude_last_level_data_seconds is enabled. The time information is attached to the SST file as a table property, which can be parsed by the ldb or sst_dump tools.
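
The selection rule described for the FIFO migration phase (smallest starting key in the bottom-most non-empty level) can be sketched as a toy model. This is only an illustration under stated assumptions, not RocksDB's picker code: ToyFile, ToyLevels and PickMigrationVictim are hypothetical names, and keys are compared as plain strings rather than with a user comparator.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy stand-in for SST file metadata; NOT a RocksDB type.
struct ToyFile {
  std::string name;
  std::string smallest_key;  // starting key of the file
};

// levels[0] is the top level; levels.back() is the bottom-most level.
using ToyLevels = std::vector<std::vector<ToyFile>>;

// Returns the file with the smallest starting key in the bottom-most
// non-empty level, or nullptr if every level is empty.
// Assumption: keys are ordered by plain string comparison.
const ToyFile* PickMigrationVictim(const ToyLevels& levels) {
  for (auto level = levels.rbegin(); level != levels.rend(); ++level) {
    if (level->empty()) continue;  // skip empty levels at the bottom
    const ToyFile* best = &level->front();
    for (const ToyFile& f : *level) {
      if (f.smallest_key < best->smallest_key) best = &f;
    }
    return best;
  }
  return nullptr;
}
```

With files in levels 0 and 1 and an empty level 2, the sketch picks from level 1, which mirrors why the purge order is only an approximation of FIFO: the chosen file is determined by level and key range, not by age.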

    Bug Fixes

    • Fixed a bug in the use of io_uring_prep_cancel in the AbortIO API for posix: it expects sqe->addr to match the submitted read request, but the wrong parameter was being passed.
    • Fixed a regression in iterator performance when the entire DB is a single memtable introduced in #10449. The fix is in #10705 and #10716.
    • Fixed an optimistic transaction validation bug caused by DBImpl::GetLatestSequenceForKey() returning non-latest seq for merge (#10724).
    • Fixed a bug in iterator refresh which could segfault for DeleteRange users (#10739).
    • Fixed a bug causing manual flush with flush_opts.wait=false to stall when database has stopped all writes (#10001).
    • Fixed a bug in iterator refresh that was not freeing up SuperVersion, which could cause excessive resource pinning (#10770).
    • Fixed a bug where RocksDB could be doing compaction endlessly when allow_ingest_behind is true and the bottommost level is not filled (#10767).
    • Fixed a memory safety bug in experimental HyperClockCache (#10768).
    • Fixed some cases where ldb update_manifest and ldb unsafe_remove_sst_file are not usable because they were requiring the DB files to match the existing manifest state (before updating the manifest to match a desired state).

    Performance Improvements

    • Try to align the compaction output file boundaries to those of the next level, which can reduce compaction load by more than 10% for the default level compaction. The feature is enabled by default; to disable it, set AdvancedColumnFamilyOptions.level_compaction_dynamic_file_size to false. As a side effect, it can create SSTs larger than target_file_size (capped at 2x target_file_size) or smaller files.
    • Improve RoundRobin TTL compaction, which now moves the compaction cursor in the same way as normal RoundRobin compaction.
    • Fix a small CPU regression caused by making UserComparatorWrapper Customizable, since Customizable itself has a small CPU overhead for initialization.

    Behavior Changes

    • Sanitize min_write_buffer_number_to_merge to 1 if atomic flush is enabled to prevent unexpected data loss when WAL is disabled in a multi-column-family setting (#10773).
    • When the periodic stats dumper wakes up every options.stats_dump_period_sec seconds, it won't dump stats for a CF that has not changed in the period, unless 7 periods have been skipped.
    • Only the periodic stats dumper triggered by options.stats_dump_period_sec will update the stats interval. Dumps triggered by DB::GetProperty() will not update the stats interval and will report based on the interval since the last periodic stats dump.
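
The skip rule for the periodic stats dumper can be modeled roughly as below. This is a hedged sketch, not RocksDB's implementation: ToyCfStatsState and ShouldDumpStats are hypothetical names, and it assumes one reading of the note, namely that a dump is forced on the wake-up that follows seven consecutive skipped periods.

```cpp
#include <cassert>

// Toy per-column-family state; NOT a RocksDB type.
struct ToyCfStatsState {
  int skipped_periods = 0;  // consecutive periods without a dump
};

// The "7 periods" from the release note.
constexpr int kMaxSkippedPeriods = 7;

// Called once per stats_dump_period_sec wake-up. Returns true if stats for
// this column family should be dumped in this period.
// Assumption: after 7 skipped periods, the next wake-up dumps regardless.
bool ShouldDumpStats(ToyCfStatsState& state, bool changed_in_period) {
  if (changed_in_period || state.skipped_periods >= kMaxSkippedPeriods) {
    state.skipped_periods = 0;  // a dump resets the skip counter
    return true;
  }
  ++state.skipped_periods;
  return false;
}
```

Under this model an idle column family still appears in the stats log at a reduced cadence instead of disappearing entirely.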

    Public API changes

    • Make kXXH3 checksum the new default, because it is faster on common hardware, especially with kCRC32c affected by a performance bug in some versions of clang. DBs written with this new setting can be read by RocksDB 6.27 and newer.
    • Refactor the classes, APIs and data structures for block cache tracing to allow a user-provided trace writer to be used. Introduced an abstract BlockCacheTraceWriter class that takes a structured BlockCacheTraceRecord. The BlockCacheTraceWriter implementation can then format and log the record in whatever way it sees fit. The default BlockCacheTraceWriterImpl does file tracing using a user-provided TraceWriter. More details in include/rocksdb/block_cache_trace_writer.h.
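
The design described for block cache tracing (an abstract writer that receives a structured record and chooses its own output format) follows a common pattern, sketched below. The types here are hypothetical stand-ins: ToyTraceRecord, ToyTraceWriter and CsvTraceWriter are not the RocksDB classes, and the real BlockCacheTraceRecord and BlockCacheTraceWriter have different fields and method signatures. Consult the header named above for the actual interface.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical structured record; the real BlockCacheTraceRecord differs.
struct ToyTraceRecord {
  uint64_t access_timestamp;
  std::string block_key;
  bool is_cache_hit;
};

// Hypothetical abstract writer: receives structured records and decides
// how to format and store them.
class ToyTraceWriter {
 public:
  virtual ~ToyTraceWriter() = default;
  virtual void Write(const ToyTraceRecord& record) = 0;
};

// One possible implementation: format each record as a CSV line in memory.
class CsvTraceWriter : public ToyTraceWriter {
 public:
  void Write(const ToyTraceRecord& record) override {
    lines_.push_back(std::to_string(record.access_timestamp) + "," +
                     record.block_key + "," +
                     (record.is_cache_hit ? "hit" : "miss"));
  }
  const std::vector<std::string>& lines() const { return lines_; }

 private:
  std::vector<std::string> lines_;
};
```

Because the caller only depends on the abstract writer, a different implementation (binary file, network sink, sampling filter) could be swapped in without touching the code that produces records, which appears to be the motivation for the refactor.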