RocksDB v7.4.0 Release Notes

  • 🐛 Bug Fixes

    • Fixed a bug in calculating key-value integrity protection for users of in-place memtable updates. In particular, the affected users would be those who configure protection_bytes_per_key > 0 on WriteBatch or WriteOptions, and configure inplace_callback != nullptr.
    • 🛠 Fixed a bug where a snapshot taken during SST file ingestion would be unstable.
    • Fixed a bug affecting non-TransactionDB with avoid_flush_during_recovery = true, and TransactionDB: after a crash, min_log_number_to_keep may not change on recovery, and persisting a new MANIFEST with advanced log_numbers for some column families then results in a "column family inconsistency" error on the second recovery. As a solution, RocksDB now persists the new MANIFEST only after successfully syncing the new WAL. If a future recovery starts from the new MANIFEST, the new WAL was successfully synced. Due to the sentinel empty write batch at the beginning, kPointInTimeRecovery of WAL is guaranteed to go after this point. If a future recovery starts from the old MANIFEST, writing the new MANIFEST failed, and the "SST ahead of WAL" error will not occur.
    • 🛠 Fixed a bug where RocksDB DB::Open() may create and write to two new MANIFEST files even before recovery succeeds. Now writes to MANIFEST are persisted only after recovery is successful.
    • 🛠 Fix a race condition in WAL size tracking caused by an unsafe iterator access after the container is changed.
    • 🛠 Fix unprotected concurrent accesses to WritableFileWriter::filesize_ by DB::SyncWAL() and DB::Put() in two write queue mode.
    • 🛠 Fix a bug in WAL tracking. Before this PR (#10087), calling SyncWAL() on the only WAL file of the DB would not log the event in MANIFEST, thus allowing a subsequent DB::Open to succeed even if the WAL file is missing or corrupted.
    • Fix a bug that could return wrong results with index_type=kHashSearch when using SetOptions to change the prefix_extractor.
    • 🛠 Fixed a bug in WAL tracking with wal_compression. WAL compression writes a kSetCompressionType record which is not associated with any sequence number. As a result, WalManager::GetSortedWalsOfType() would skip these WALs and not return them to the caller (e.g. Checkpoint, Backup), causing the operations to fail.
    • Avoid a crash if the IDENTITY file is accidentally truncated to empty. A new DB ID will be generated and written on Open.
    • Fixed a possible corruption for users of manual_wal_flush and/or FlushWAL(true /* sync */), together with track_and_verify_wals_in_manifest == true. For those users, losing unsynced data (e.g., due to power loss) could make future DB opens fail with a Status::Corruption complaining about missing WAL data.
    • 🛠 Fixed a bug in WriteBatchInternal::Append() where the WAL termination point in the write batch was not considered, so the function appended an incorrect number of checksums.
    • 🛠 Fixed a crash bug introduced in 7.3.0 affecting users of MultiGet with kDataBlockBinaryAndHash.
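
The WAL size tracking race listed above belongs to a general class of bugs: an iterator into a container is kept and then used after the container has been modified. The following is a minimal standard-library sketch of that hazard and one way to avoid it. `LogFile`, `LogTracker`, and their members are hypothetical names chosen for illustration; this is not RocksDB's internal code and does not claim to reproduce the actual fix.

```cpp
#include <cstdint>
#include <deque>
#include <mutex>

// Hypothetical stand-in for a tracked WAL file (not a RocksDB type).
struct LogFile {
  uint64_t number;
  uint64_t size;
};

// Hypothetical tracker: every access takes the mutex and looks the
// element up again instead of caching an iterator across calls.
class LogTracker {
 public:
  // Registers a new log file and returns its number.
  uint64_t AddLog() {
    std::lock_guard<std::mutex> guard(mu_);
    // push_back on std::deque invalidates all iterators into it, so an
    // iterator obtained before this call must not be reused afterwards.
    logs_.push_back(LogFile{next_number_, 0});
    return next_number_++;
  }

  // Adds `bytes` to the newest log's size, if any log exists.
  void AddBytesToNewest(uint64_t bytes) {
    std::lock_guard<std::mutex> guard(mu_);
    if (!logs_.empty()) {
      logs_.back().size += bytes;
    }
  }

  // Sums the sizes of all tracked logs.
  uint64_t TotalSize() const {
    std::lock_guard<std::mutex> guard(mu_);
    uint64_t total = 0;
    for (const LogFile& f : logs_) {
      total += f.size;
    }
    return total;
  }

 private:
  mutable std::mutex mu_;
  std::deque<LogFile> logs_;
  uint64_t next_number_ = 1;
};
```

The design choice here is to never let an iterator outlive the critical section in which it was obtained, which trades a repeated lookup for safety.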

    Public API changes

    • ➕ Add new API GetUnixTime in Snapshot class, which returns the Unix time at which the snapshot was taken.
    • 📌 Add transaction get_pinned and multi_get to C API.
    • ➕ Add two-phase commit support to C API.
    • Add rocksdb_transaction_get_writebatch_wi and rocksdb_transaction_rebuild_from_writebatch to C API.
    • Add rocksdb_options_get_blob_file_starting_level and rocksdb_options_set_blob_file_starting_level to C API.
    • ➕ Add blobFileStartingLevel and setBlobFileStartingLevel to Java API.
    • ➕ Add SingleDelete for DB in C API
    • ➕ Add User Defined Timestamp in C API.
      • rocksdb_comparator_with_ts_create to create timestamp aware comparator
      • Put, Get, Delete, SingleDelete, and MultiGet APIs have corresponding timestamp-aware APIs with the suffix with_ts
      • Add C APIs for Transaction, SstFileWriter, and Compaction as well
    • The contract for implementations of Comparator::IsSameLengthImmediateSuccessor has been updated to work around a design bug in auto_prefix_mode.
    • The API documentation for auto_prefix_mode now notes some corner cases in which it returns different results than total_order_seek, due to design bugs that are not easily fixed. Users using built-in comparators and keys at least the size of a fixed prefix length are not affected.
    • Obsoleted the NUM_DATA_BLOCKS_READ_PER_LEVEL stat and introduced the NUM_LEVEL_READ_PER_MULTIGET and MULTIGET_COROUTINE_COUNT stats
    • Introduced WriteOptions::protection_bytes_per_key, which can be used to enable key-value integrity protection for live updates.
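
For context on the Comparator::IsSameLengthImmediateSuccessor item above: under a bytewise ordering, the basic question is whether two keys of equal length are adjacent, with no key of that same length strictly between them. The following is a minimal standalone sketch of that adjacency check over std::string. The free function name is hypothetical, and the sketch does not attempt to capture the updated contract or the auto_prefix_mode corner cases mentioned above.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: returns true if `t` is the immediate successor of
// `s` among byte strings of the same length, compared bytewise.
bool IsSameLengthImmediateSuccessorBytewise(const std::string& s,
                                            const std::string& t) {
  if (s.size() != t.size()) {
    return false;
  }
  // Locate the first position where the two strings differ.
  std::size_t i = 0;
  while (i < s.size() && s[i] == t[i]) {
    ++i;
  }
  if (i == s.size()) {
    return false;  // Identical strings: t is not a successor of s.
  }
  const uint8_t byte_s = static_cast<uint8_t>(s[i]);
  const uint8_t byte_t = static_cast<uint8_t>(t[i]);
  // The first differing byte must increase by exactly one. The addition
  // is done in int, so byte_s == 0xff can never match.
  if (byte_s + 1 != byte_t) {
    return false;
  }
  // Every later byte must be 0xff in s and 0x00 in t (a full "carry").
  for (std::size_t j = i + 1; j < s.size(); ++j) {
    if (static_cast<uint8_t>(s[j]) != 0xff ||
        static_cast<uint8_t>(t[j]) != 0x00) {
      return false;
    }
  }
  return true;
}
```

For example, "ab" and "ac" are adjacent, while "a" and "c" are not because "b" lies between them.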

    🆕 New Features

    • ➕ Add FileSystem::ReadAsync API in io_tracing
    • Add blob garbage collection parameters blob_garbage_collection_policy and blob_garbage_collection_age_cutoff to both force-enable and force-disable GC, as well as selectively override age cutoff when using CompactRange.
    • ➕ Add an extra sanity check in GetSortedWalFiles() (also used by GetLiveFilesStorageInfo(), BackupEngine, and Checkpoint) to reduce risk of successfully created backup or checkpoint failing to open because of missing WAL file.
    • Add a new column family option blob_file_starting_level to enable writing blob files during flushes and compactions starting from the specified LSM tree level.
    • ➕ Add support for timestamped snapshots (#9879)
    • 👍 Provide support for AbortIO in posix to cancel submitted asynchronous requests using io_uring.
    • ➕ Add support for rate-limiting batched MultiGet() APIs
    • ➕ Added several new tickers, perf context statistics, and DB properties to BlobDB
      • Added new DB properties "rocksdb.blob-cache-capacity", "rocksdb.blob-cache-usage", "rocksdb.blob-cache-pinned-usage" to show blob cache usage.
      • Added new perf context statistics blob_cache_hit_count, blob_read_count, blob_read_byte, blob_read_time, blob_checksum_time and blob_decompress_time.
      • Added new tickers BLOB_DB_CACHE_MISS, BLOB_DB_CACHE_HIT, BLOB_DB_CACHE_ADD, BLOB_DB_CACHE_ADD_FAILURES, BLOB_DB_CACHE_BYTES_READ and BLOB_DB_CACHE_BYTES_WRITE.
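
The CompactRange blob garbage collection parameters described above (force-enable, force-disable, and a selective age cutoff override) can be pictured as a small resolution step applied on top of the column family's settings. The sketch below models that idea with standard-library types only. All type and function names are hypothetical stand-ins, and treating a cutoff outside [0, 1] as "no override" is an assumption of this sketch, not something stated in these notes.

```cpp
// Hypothetical mirror of the three behaviors described in the notes.
enum class GcPolicy { kForce, kDisable, kUseDefault };

// Hypothetical column family level settings.
struct BlobGcConfig {
  bool enabled;
  double age_cutoff;
};

// Hypothetical per-CompactRange override. A negative cutoff is used
// here as a "not set" marker (assumption of this sketch).
struct BlobGcOverride {
  GcPolicy policy = GcPolicy::kUseDefault;
  double age_cutoff = -1.0;
};

// Computes the settings that would be in effect for one manual compaction.
BlobGcConfig ResolveBlobGc(const BlobGcConfig& cf, const BlobGcOverride& o) {
  BlobGcConfig effective = cf;
  switch (o.policy) {
    case GcPolicy::kForce:
      effective.enabled = true;
      break;
    case GcPolicy::kDisable:
      effective.enabled = false;
      break;
    case GcPolicy::kUseDefault:
      break;  // Keep the column family setting.
  }
  // Only a cutoff within [0, 1] replaces the column family value.
  if (o.age_cutoff >= 0.0 && o.age_cutoff <= 1.0) {
    effective.age_cutoff = o.age_cutoff;
  }
  return effective;
}
```

Keeping the override separate from the column family settings means a single manual compaction can behave differently without changing the persistent configuration.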

    Behavior changes

    • DB::Open() and DB::OpenAsSecondary() will fail if a Logger cannot be created (#9984).
    • DB::Write does not hold the global mutex_ if this DB instance does not need to switch WAL and memtable (#7516).
    • ✂ Removed support for reading Bloom filters using obsolete block-based filter format. (Support for writing such filters was dropped in 7.0.) For good read performance on old DBs using these filters, a full compaction is required.
    • The per-KV checksum in a write batch is verified before the write batch is written to the WAL, to detect any corruption of the write batch (#10114).
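
The verify-before-WAL behavior above can be illustrated with a toy model: each entry carries a checksum computed when it is added, and the batch is checked again just before it is appended to the log. This is a sketch under stated assumptions. `ToyWriteBatch`, `Entry`, `ComputeProtection`, and `WriteToWal` are hypothetical names, and the FNV-1a hash is swapped in for simplicity; it is not the checksum RocksDB uses.

```cpp
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// 64-bit FNV-1a over the key length, key bytes, and value bytes.
// Stand-in hash for illustration only.
uint64_t ComputeProtection(const std::string& key, const std::string& value) {
  uint64_t h = 14695981039346656037ULL;
  auto mix = [&h](unsigned char b) {
    h ^= b;
    h *= 1099511628211ULL;
  };
  mix(static_cast<unsigned char>(key.size() & 0xff));
  for (unsigned char c : key) mix(c);
  for (unsigned char c : value) mix(c);
  return h;
}

// Hypothetical batch entry carrying its own checksum.
struct Entry {
  std::string key;
  std::string value;
  uint64_t protection;
};

class ToyWriteBatch {
 public:
  // Stores the entry together with a checksum computed at insertion time.
  void Put(std::string key, std::string value) {
    const uint64_t p = ComputeProtection(key, value);
    entries_.push_back(Entry{std::move(key), std::move(value), p});
  }

  // Returns true if every entry still matches its stored checksum.
  bool Verify() const {
    for (const Entry& e : entries_) {
      if (ComputeProtection(e.key, e.value) != e.protection) return false;
    }
    return true;
  }

  // Exposed so a caller can simulate in-memory corruption.
  std::vector<Entry>& entries() { return entries_; }
  const std::vector<Entry>& entries() const { return entries_; }

 private:
  std::vector<Entry> entries_;
};

// Appends the batch to the in-memory "WAL" only if verification passes.
bool WriteToWal(const ToyWriteBatch& batch, std::vector<std::string>* wal) {
  if (!batch.Verify()) return false;  // Corruption detected; write nothing.
  for (const Entry& e : batch.entries()) {
    wal->push_back(e.key + "=" + e.value);
  }
  return true;
}
```

Checking before the append means a corrupted batch is rejected while the caller can still react, rather than being discovered later during recovery.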

    🐎 Performance Improvements

    • 🐎 When compiled with folly (Meta-internal integration; experimental in open source build), improve the locking performance (CPU efficiency) of LRUCache by using folly DistributedMutex in place of standard mutex.