RocksDB v7.2.0 Release Notes

  • Bug Fixes

    • Fixed a bug that caused RocksDB to fail when the database was accessed through a UNC path.
    • Fixed a race condition when 2PC is disabled and WAL tracking in the MANIFEST is enabled. The race is between two background flush threads trying to install flush results, which can leave a WAL deletion untracked in the MANIFEST. A future DB open may then fail.
    • Fixed a heap use-after-free race with DropColumnFamily.
    • Fixed a bug where rocksdb.read.block.compaction.micros could not track compaction stats (#9722).
    • Fixed the file_type, relative_filename and directory fields returned by GetLiveFilesMetaData(), which were added by inheriting from FileStorageInfo.
    • Fixed a bug affecting track_and_verify_wals_in_manifest. Without the fix, an application may see "open error: Corruption: Missing WAL with log number" while trying to open the DB. The corruption is a false alarm but prevents the DB from opening (#9766).
    • Fixed a segfault in FilePrefetchBuffer with async_io, caused by the buffer not waiting for pending jobs to complete on destruction.
    • Fixed the ERROR_HANDLER_AUTORESUME_RETRY_COUNT stat, whose value was set incorrectly in portal.h.
    • Fixed a bug affecting non-TransactionDB with avoid_flush_during_recovery = true, and TransactionDB. After a crash, min_log_number_to_keep may not change on recovery, and persisting a new MANIFEST with advanced log_numbers for some column families results in a "column family inconsistency" error on the second recovery. As a solution, the corrupted WALs whose numbers are larger than the corrupted WAL and smaller than the new WAL are moved to the archive folder.
    • Fixed a bug in DB::Open() which could create and write to two new MANIFEST files even before recovery succeeds. Now writes to the MANIFEST are persisted only after recovery is successful.
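
    The archiving rule in the avoid_flush_during_recovery fix above can be read as a simple range filter. The following is a minimal C++ sketch of that rule only. The function name WalsToArchive and its parameters are hypothetical and do not come from the RocksDB source, and the strict inequalities on both ends are an assumption drawn from the wording of the note.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Illustrative model only, not RocksDB's recovery code.
// Given the WAL numbers found on disk, return those that fall strictly
// between the corrupted WAL and the newly created WAL. Per the note above,
// these are the WALs that get moved to the archive folder.
std::vector<uint64_t> WalsToArchive(const std::vector<uint64_t>& wal_numbers,
                                    uint64_t corrupted_wal,
                                    uint64_t new_wal) {
  // Assumed precondition: the new WAL is created after the corrupted one.
  assert(corrupted_wal < new_wal);
  std::vector<uint64_t> to_archive;
  for (uint64_t n : wal_numbers) {
    if (n > corrupted_wal && n < new_wal) {
      to_archive.push_back(n);
    }
  }
  return to_archive;
}
```

    With WALs 3, 5, 6, 7 and 9 on disk, a corrupted WAL 5 and a new WAL 9, this sketch selects WALs 6 and 7.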

    New Features

    • For db_bench, when --seed=0 or --seed is not set, the current time is used as the seed value. Previously the value 1000 was used.
    • For db_bench, when --benchmarks lists multiple tests and each test uses a seed for an RNG, the seeds are no longer repeated across tests.
    • Added an option to dynamically charge the estimated memory usage of block-based table readers, updated as it changes, to the block cache if a block cache is available. To enable this feature, set BlockBasedTableOptions::reserve_table_reader_memory = true.
    • Added a new stat, ASYNC_READ_BYTES, that records the number of bytes read during async read calls. Users can use it to check whether the async code path is being taken by RocksDB's internal automatic prefetching for sequential reads.
    • Enable async prefetching if ReadOptions.readahead_size is set along with ReadOptions.async_io in FilePrefetchBuffer.
    • Added event listener support on the compactor side of remote compaction.
    • Added a dedicated integer DB property rocksdb.live-blob-file-garbage-size that exposes the total amount of garbage in the blob files in the current version.
    • RocksDB does internal auto prefetching if it notices sequential reads. It starts with a readahead size of initial_auto_readahead_size, which can now be configured through BlockBasedTableOptions.
    • Added a merge operator that allows users to register specific aggregation functions, so that they can aggregate using different aggregation types for different keys. See comments in include/rocksdb/utilities/agg_merge.h for actual usage. The feature is experimental, the format is subject to change, and we won't provide a migration tool.
    • Meta-internal / Experimental: Improve CPU performance by replacing many uses of std::unordered_map with folly::F14FastMap when RocksDB is compiled together with Folly.
    • Experimental: Add CompressedSecondaryCache, a concrete implementation of rocksdb::SecondaryCache, that integrates with compression libraries (e.g. LZ4) to hold compressed blocks.
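
    The two db_bench seed changes above can be summarized in a few lines. This is a sketch, not db_bench code: ResolveBaseSeed and SeedAllocator are hypothetical names, an unset --seed is assumed to be equivalent to 0, and handing out consecutive seeds is only one possible way to keep per-test seeds from repeating.

```cpp
#include <cstdint>

// Hypothetical helper, not db_bench source. Picks the base seed as the note
// above describes: a flag value of 0 (assumed to also cover an unset flag)
// falls back to the current time instead of the old fixed value 1000.
uint64_t ResolveBaseSeed(uint64_t seed_flag, uint64_t now_micros) {
  return seed_flag != 0 ? seed_flag : now_micros;
}

// Hypothetical helper. Hands out one seed per test so that no two tests in
// the same run share a seed. Counting up from the base seed is an assumption
// made for this sketch.
class SeedAllocator {
 public:
  explicit SeedAllocator(uint64_t base_seed) : next_(base_seed) {}

  // Returns the current seed and advances to the next one.
  uint64_t NextSeed() { return next_++; }

 private:
  uint64_t next_;
};
```

    Passing an explicit non-zero --seed keeps a run reproducible under this model, while the default now varies from run to run.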

    Behavior changes

    • Disallow usage of commit-time-write-batch for write-prepared/write-unprepared transactions if TransactionOptions::use_only_the_last_commit_time_batch_for_recovery is false to prevent two (or more) uncommitted versions of the same key in the database. Otherwise, bottommost compaction may violate the internal key uniqueness invariant of SSTs if the sequence numbers of both internal keys are zeroed out (#9794).
    • Make DB::GetUpdatesSince() return NotSupported early for write-prepared/write-unprepared transactions, as the API contract indicates.

    Public API changes

    • Exposed APIs to examine results of block cache stats collections in a structured way. In particular, users of GetMapProperty() with property kBlockCacheEntryStats can now use the functions in BlockCacheEntryStatsMapKeys to find stats in the map.
    • Add fail_if_not_bottommost_level to IngestExternalFileOptions so that ingestion will fail if the file(s) cannot be ingested to the bottommost level.
    • Add output parameter is_in_sec_cache to SecondaryCache::Lookup(). It indicates whether the handle may have been erased from the secondary cache after the Lookup.