RocksDB v6.28.0 Release Notes

Release Date: 2021-12-17
  • New Features

    • Introduced 'CommitWithTimestamp' as a new tag. Currently, there is no API for users to trigger a write with this tag to the WAL. This is part of the effort to support write-committed transactions with user-defined timestamps.

    Bug Fixes

    • Fixed a bug in RocksDB's automatic implicit prefetching, which was broken by the new adaptive_readahead feature: internal prefetching was disabled when an iterator moved from one file to the next.
    • Fixed a bug in TableOptions.prepopulate_block_cache that caused a segmentation fault when used with TableOptions.partition_filters = true and TableOptions.cache_index_and_filter_blocks = true.
    • Fixed a bug affecting custom memtable factories that are not registered with the ObjectRegistry. The bug could result in failure to save the OPTIONS file.
    • Fixed a bug causing two duplicate entries to be appended to a file opened in non-direct mode and tracked by FaultInjectionTestFS.
    • Fixed a bug in TableOptions.prepopulate_block_cache so that it also supports block-based filters.
    • Block cache keys no longer use FSRandomAccessFile::GetUniqueId() (previously used when available), so a filesystem recycling unique ids can no longer lead to incorrect results or crashes (#7405). For files generated by RocksDB >= 6.24, the cache keys are stable across DB::Open and DB directory move / copy / import / export / migration, etc. Although collisions are still theoretically possible, they are (a) impossible in many common cases, (b) not dependent on environmental factors, and (c) much less likely than a CPU miscalculation while executing RocksDB.

    Behavior Changes

    • MemTableList::TrimHistory now uses allocated bytes when max_write_buffer_size_to_maintain > 0 (the default in TransactionDB, introduced in PR #5022). Fixes #8371.

    Public API change

    • Extend WriteBatch::AssignTimestamp and AssignTimestamps API so that both functions can accept an optional checker argument that performs additional checking on timestamp sizes.
    • Introduce a new EventListener callback that will be called upon the end of automatic error recovery.

    Performance Improvements

    • Replaced the map property TableProperties::properties_offsets with the uint64_t property external_sst_file_global_seqno_offset to reduce the memory used by table properties.
    • Block cache accesses are faster because RocksDB now uses cache keys of fixed size (16 bytes).

    Java API Changes

    • Removed Java API TableProperties.getPropertiesOffsets() as it exposed internal details to external users.

Previous changes from v6.27.0

  • New Features

    • Added new ChecksumType kXXH3, which is faster than kCRC32c on almost all x86_64 hardware.
    • Added a new online consistency check for BlobDB which validates that the number/total size of garbage blobs does not exceed the number/total size of all blobs in any given blob file.
    • Provided support for tracking per-sst user-defined timestamp information in MANIFEST.
    • Added new option "adaptive_readahead" in ReadOptions. For iterators, RocksDB does auto-readahead on noticing sequential reads. With this option enabled, the readahead_size of the current file (if reads are sequential) is carried forward to the next file instead of starting from scratch at each level (except for L0 files). If reads are not sequential, it falls back to 8KB. This option applies only to RocksDB's internal prefetch buffer and isn't supported with underlying file system prefetching.
    • Added read count and read bytes stats to Statistics for tiered storage hot, warm, and cold file reads.
    • Added an option to dynamically charge the estimated memory usage of block-based table building to the block cache, if a block cache is available. It currently only covers the memory used to construct (new) Bloom Filters and Ribbon Filters. To enable this feature, set BlockBasedTableOptions::reserve_table_builder_memory = true.
    • Added a new API OnIOError in listener.h that notifies listeners when an IO error occurs during a FileSystem operation, along with the filename, status, etc.
    • Added compaction readahead support for blob files to the integrated BlobDB implementation, which can improve compaction performance when the database resides on higher-latency storage like HDDs or remote filesystems. Readahead can be configured using the column family option blob_compaction_readahead_size.
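The adaptive_readahead entry above can be pictured with a toy model. This is only a sketch, not RocksDB's implementation: the struct ReadaheadModel, its methods, and the doubling policy with a 256 KB cap are assumptions made for illustration. Only the 8 KB fallback and the carry-forward behavior come from the notes.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Toy model only. Names and the growth policy are illustrative assumptions.
constexpr std::size_t kInitialReadahead = 8 * 1024;  // 8 KB fallback (from the notes)
constexpr std::size_t kMaxReadahead = 256 * 1024;    // assumed upper bound

struct ReadaheadModel {
  bool adaptive = false;                 // stands in for ReadOptions::adaptive_readahead
  std::size_t size = kInitialReadahead;  // current readahead size in bytes

  // Called for each read inside a file: grow on sequential reads,
  // fall back to the initial size otherwise.
  void OnRead(bool sequential) {
    size = sequential ? std::min(size * 2, kMaxReadahead) : kInitialReadahead;
    assert(size >= kInitialReadahead && size <= kMaxReadahead);
  }

  // Called when the iterator moves to the next file. The size is kept only
  // if the option is on and the reads so far were sequential.
  void OnNextFile(bool reads_were_sequential) {
    if (!(adaptive && reads_were_sequential)) {
      size = kInitialReadahead;
    }
  }
};
```

In this model, two sequential reads grow the size from 8 KB to 32 KB. With adaptive set to true the next file starts at 32 KB, while with adaptive set to false it starts again at 8 KB.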

    Bug Fixes

    • Prevent a CompactRange() with CompactRangeOptions::change_level == true from possibly causing corruption to the LSM state (overlapping files within a level) when run in parallel with another manual compaction. Note that setting force_consistency_checks == true (the default) would cause the DB to enter read-only mode in this scenario and return Status::Corruption, rather than committing any corruption.
    • Fixed a bug in CompactionIterator when write-prepared transactions are used. A released earliest write conflict snapshot may cause an assertion failure in dbg mode and an unexpected key in opt mode.
    • Fixed the ticker WRITE_WITH_WAL ("rocksdb.write.wal"). The bug was caused by an extra RecordTick(stats_, WRITE_WITH_WAL) call in two places; the fix removes the extra RecordTick calls and corrects the corresponding test case.
    • EventListener::OnTableFileCreated was previously called with OK status and file_size==0 in cases of no SST file contents written (because there was no content to add) and the empty file deleted before calling the listener. Now the status is Aborted.
    • Fixed a bug in CompactionIterator when write-prepared transactions are used. Releasing earliest_snapshot during compaction may cause a SingleDelete to be output after a PUT of the same user key whose seq has been zeroed.
    • Added input sanitization on negative bytes passed into GenericRateLimiter::Request.
    • Fixed an assertion failure in CompactionIterator when write-prepared transactions are used. We prove that certain operations can lead to a Delete being followed by a SingleDelete (same user key). We can drop the SingleDelete.
    • Fixed a bug of timestamp-based GC which can cause all versions of a key under full_history_ts_low to be dropped. This bug will be triggered when some of the ikeys' timestamps are lower than full_history_ts_low, while others are newer.
    • In some cases outside of the DB read and compaction paths, SST block checksums are now checked where they were not before.
    • Explicitly check for and disallow the BlockBasedTableOptions if insertion into one of {block_cache, block_cache_compressed, persistent_cache} can show up in another of these. (RocksDB expects to be able to use the same key for different physical data among tiers.)
    • Users who configured a dedicated thread pool for bottommost compactions by explicitly adding threads to the Env::Priority::BOTTOM pool will no longer see RocksDB schedule automatic compactions exceeding the DB's compaction concurrency limit. For details on per-DB compaction concurrency limit, see API docs of max_background_compactions and max_background_jobs.
    • Fixed a bug where the background flush thread picked more memtables to flush and prematurely advanced the column family's log_number.
    • Fixed an assertion failure in ManifestTailer.
    • Fixed a bug that could, with WAL enabled, cause backups, checkpoints, and GetSortedWalFiles() to fail randomly with an error like IO error: 001234.log: No such file or directory

    Behavior Changes

    • NUM_FILES_IN_SINGLE_COMPACTION previously counted only the files in the first input level; it now counts all input files.
    • TransactionUtil::CheckKeyForConflicts can also perform conflict-checking based on user-defined timestamps in addition to sequence numbers.
    • Removed the minimum refill bytes per period previously enforced by GenericRateLimiter.
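The NUM_FILES_IN_SINGLE_COMPACTION change above comes down to which input files are counted. The sketch below contrasts the two counting rules on a made-up compaction. InputLevel, CountFirstLevelOnly, and CountAllInputs are hypothetical names for illustration, not RocksDB types or functions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the files a compaction reads from one level.
struct InputLevel {
  int level;
  std::vector<std::uint64_t> file_numbers;
};

// Old rule as described in the notes: only the first input level is counted.
std::size_t CountFirstLevelOnly(const std::vector<InputLevel>& inputs) {
  return inputs.empty() ? 0 : inputs.front().file_numbers.size();
}

// New rule as described in the notes: every input file is counted.
std::size_t CountAllInputs(const std::vector<InputLevel>& inputs) {
  std::size_t total = 0;
  for (const auto& in : inputs) {
    total += in.file_numbers.size();
  }
  return total;
}
```

For a compaction that takes two files from L1 and three from L2, the old rule reports 2 and the new rule reports 5, so dashboards built on this statistic may show higher values after upgrading.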

    Public API change

    • When options.ttl is used with leveled compaction with compaction priority kMinOverlappingRatio, files exceeding half of the TTL value will be prioritized more, so that by the time TTL is reached, fewer extra compactions will be scheduled to clear them up. At the same time, when compacting files with data older than half of TTL, output files may be cut off based on those files' boundaries, in order for the early TTL compaction to work properly.
    • Made FileSystem and RateLimiter extend the Customizable class and added a CreateFromString method. Implementations need to be registered with the ObjectRegistry and to implement a Name() method in order to be created via this method.
    • Clarified in API comments that RocksDB is not exception safe for callbacks and custom extensions. An exception propagating into RocksDB can lead to undefined behavior, including data loss, unreported corruption, deadlocks, and more.
    • Marked WriteBufferManager as final because it is not intended for extension.
    • Removed unimportant implementation details from table_properties.h.
    • Added API FSDirectory::FsyncWithDirOptions(), which provides extra information such as the directory fsync reason in DirFsyncOptions. A file system like btrfs uses this to skip the directory fsync when creating a new file or, when renaming a file, to fsync the target file instead of the directory, which improves DB::Open() speed by ~20%.
    • DB::Open() is no longer blocked by obsolete file purge if DBOptions::avoid_unnecessary_blocking_io is set to true.
    • In builds where glibc provides gettid(), info log ("LOG" file) lines now print a system-wide thread ID from gettid() instead of the process-local pthread_self(). For all users, the thread ID format is changed from hexadecimal to decimal integer.
    • In builds where glibc provides pthread_setname_np(), the background thread names no longer contain an ID suffix. For example, "rocksdb:bottom7" (and all other threads in the Env::Priority::BOTTOM pool) are now named "rocksdb:bottom". Previously large thread pools could breach the name size limit (e.g., naming "rocksdb:bottom10" would fail).
    • Deprecated ReadOptions::iter_start_seqnum and DBOptions::preserve_deletes; please try the user-defined timestamp feature instead. The options will be removed in a future release. Currently, a warning message is logged when they are used.
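The thread-name entry above follows from the Linux limit on thread names, which is 16 bytes including the terminating NUL. The small check below shows why "rocksdb:bottom7" was accepted while "rocksdb:bottom10" was not. FitsThreadNameLimit is a hypothetical helper written for this illustration and is not part of RocksDB.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Linux thread names are limited to 16 bytes including the terminating NUL.
constexpr std::size_t kMaxThreadNameBytes = 16;

// Hypothetical helper: true if `name` plus its NUL terminator fits the limit
// that pthread_setname_np() enforces on Linux.
bool FitsThreadNameLimit(const std::string& name) {
  return name.size() + 1 <= kMaxThreadNameBytes;
}
```

"rocksdb:bottom7" is 15 characters and just fits, "rocksdb:bottom10" is 16 characters and does not, and the suffix-free "rocksdb:bottom" (14 characters) fits regardless of pool size.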

    Performance Improvements

    • Released some memory related to filter construction earlier in BlockBasedTableBuilder for the FullFilter and PartitionedFilter cases (#9070).
