RocksDB v7.7.0 Release Notes

  • Bug Fixes

    • Fixed a hang when an operation such as GetLiveFiles or CreateNewBackup is asked to trigger and wait for memtable flush on a read-only DB. Such indirect requests for memtable flush are now ignored on a read-only DB.
    • Fixed a bug where FlushWAL(true /* sync */) (used by GetLiveFilesStorageInfo(), which is used by checkpoint and backup) could cause parallel writes at the tail of a WAL file to never be synced.
    • Fixed periodic_task being unable to re-register the same task type, which could cause SetOptions() to fail to update periodic task intervals such as stats_dump_period_sec and stats_persist_period_sec.
    • Fixed a bug in the rocksdb.prefetched.bytes.discarded stat. It was counting the prefetch buffer size, rather than the actual number of bytes discarded from the buffer.
    • Fixed a bug where the directory containing CURRENT can be left unsynced after CURRENT is updated to point to the latest MANIFEST, which creates a risk of losing the unsynced CURRENT update.
    • Update rocksdb.multiget.io.batch.size stat in non-async MultiGet as well.
    • Fix a bug in key range overlap checking with concurrent compactions when user-defined timestamp is enabled. User-defined timestamps should be EXCLUDED when checking if two ranges overlap.
    • Fixed a bug where the blob cache prepopulating logic did not consider the secondary cache (see #10603).
    • Fixed the rocksdb.num.sst.read.per.level, rocksdb.num.index.and.filter.blocks.read.per.level and rocksdb.num.level.read.per.multiget stats in the MultiGet coroutines.
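
The timestamp-exclusion rule in the range overlap fix above can be illustrated with a minimal standalone sketch. It is not RocksDB's implementation: it assumes, for illustration only, that each key is a user key followed by a fixed-size timestamp suffix and that keys compare bytewise. The helper names StripTimestamp and RangesOverlapIgnoringTimestamp are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Assumption for illustration: a key is "<user key><timestamp>", where the
// timestamp occupies the last ts_size bytes of the key.
inline std::string_view StripTimestamp(std::string_view key,
                                       std::size_t ts_size) {
  assert(key.size() >= ts_size);
  return key.substr(0, key.size() - ts_size);
}

// Two inclusive ranges [a_smallest, a_largest] and [b_smallest, b_largest]
// overlap unless one ends strictly before the other begins. All four
// boundaries are compared with their timestamps removed.
inline bool RangesOverlapIgnoringTimestamp(std::string_view a_smallest,
                                           std::string_view a_largest,
                                           std::string_view b_smallest,
                                           std::string_view b_largest,
                                           std::size_t ts_size) {
  const std::string_view a_lo = StripTimestamp(a_smallest, ts_size);
  const std::string_view a_hi = StripTimestamp(a_largest, ts_size);
  const std::string_view b_lo = StripTimestamp(b_smallest, ts_size);
  const std::string_view b_hi = StripTimestamp(b_largest, ts_size);
  return !(a_hi < b_lo) && !(b_hi < a_lo);
}
```

With a one-byte timestamp, a range ending at user key "b" and a range starting at user key "b" are reported as overlapping even when the two timestamps differ, whereas a comparison that included the timestamp bytes could treat them as disjoint.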

    Public API changes

    • Add rocksdb_column_family_handle_get_id and rocksdb_column_family_handle_get_name to the C API to get the ID and name of a column family.
    • Add a new stat rocksdb.async.prefetch.abort.micros to measure time spent waiting for async prefetch reads to abort.

    Java API Changes

    • Add CompactionPriority.RoundRobin.
    • Revert to using the default metadata charge policy when creating an LRU cache via the Java API.

    Behavior Change

    • DBOptions::verify_sst_unique_id_in_manifest is now an on-by-default feature that verifies the identity of SST files whenever they are opened by a DB, rather than only at DB::Open time.
    • Previously, when the option migration tool (OptionChangeMigration()) migrated to FIFO compaction, it compacted all the data into a single SST file and moved it to L0. This could create a problem for some users: the giant file might soon be deleted to satisfy max_table_files_size, leaving the DB almost empty. The behavior is changed so that the files are cut to be smaller, but these files might not follow the data insertion order. With this change, migrated data might not be dropped in insertion order by FIFO compaction after the migration.
    • When a block is first found in CompressedSecondaryCache, we just insert a dummy block into the primary cache and don't erase the block from CompressedSecondaryCache. A standalone handle is returned to the caller. Only if the block is found again in CompressedSecondaryCache before the dummy block is evicted do we erase the block from CompressedSecondaryCache and insert it into the primary cache.
    • When a block is first evicted from the primary cache to CompressedSecondaryCache, we just insert a dummy block in CompressedSecondaryCache. Only if it is evicted again before the dummy block is evicted from the cache is it treated as a hot block and inserted into CompressedSecondaryCache.
    • Improved the estimation of memory used by cached blobs by taking into account the size of the object owning the blob value and also the allocator overhead if malloc_usable_size is available (see #10583).
    • Blob values now have their own category in the cache occupancy statistics, as opposed to being lumped into the "Misc" bucket (see #10601).
    • The experimental ReadOptions flag optimize_multiget_for_io is now on by default.
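
The dummy-entry admission policy described for evictions into CompressedSecondaryCache can be pictured with a small standalone model. This is only a sketch of the idea, not RocksDB code: the class TwoTouchSecondaryCache and its methods are hypothetical, values are plain strings, and compression and capacity limits are left out. The lookup direction (dummy entry in the primary cache plus a standalone handle) mirrors the same pattern and is not modeled here.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <unordered_map>

// Hypothetical model of a secondary cache that admits a block only on its
// second eviction from the primary cache.
class TwoTouchSecondaryCache {
 public:
  // Called when the primary cache evicts `key`. Returns true if the real
  // value was stored, false if only a dummy placeholder was recorded.
  bool InsertOnEviction(const std::string& key, const std::string& value) {
    assert(!key.empty());
    auto it = entries_.find(key);
    if (it == entries_.end()) {
      // First eviction: remember the key with a dummy entry, no value.
      entries_.emplace(key, std::nullopt);
      return false;
    }
    // The dummy (or an earlier value) is still present: treat the block as
    // hot and keep the real value.
    it->second = value;
    return true;
  }

  // Returns the stored value, or nullopt for a miss or a dummy entry.
  std::optional<std::string> Lookup(const std::string& key) const {
    auto it = entries_.find(key);
    if (it == entries_.end()) return std::nullopt;
    return it->second;
  }

  // Models eviction of an entry (dummy or real) from the secondary cache.
  void Evict(const std::string& key) { entries_.erase(key); }

 private:
  std::unordered_map<std::string, std::optional<std::string>> entries_;
};
```

In this model, a block evicted once costs only a placeholder. It occupies real space only after a second eviction arrives while the placeholder is still resident; if the placeholder is evicted first, the next eviction starts over with a new placeholder.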

    New Features

    • RocksDB performs internal auto prefetching if it notices 2 sequential reads and readahead_size is not specified. A new option, num_file_reads_for_auto_readahead, is added to BlockBasedTableOptions; it indicates after how many sequential reads internal auto prefetching should start (default is 2).
    • Added new perf context counters block_cache_standalone_handle_count, block_cache_real_handle_count, compressed_sec_cache_insert_real_count, compressed_sec_cache_insert_dummy_count, compressed_sec_cache_uncompressed_bytes, and compressed_sec_cache_compressed_bytes.
    • Memory for blobs which are to be inserted into the blob cache is now allocated using the cache's allocator (see #10628 and #10647).
    • HyperClockCache is an experimental, lock-free Cache alternative for block cache that offers much improved CPU efficiency under high parallel load or high contention, with some caveats. As much as 4.5x higher ops/sec vs. LRUCache has been seen in db_bench under high parallel load.
    • CompressedSecondaryCacheOptions::enable_custom_split_merge is added for enabling the custom split and merge feature, which splits the compressed value into chunks so that they may better fit jemalloc bins.
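
To make the role of num_file_reads_for_auto_readahead more concrete, the following standalone sketch counts consecutive sequential reads and reports when a threshold is reached. It is a simplified model under stated assumptions, not RocksDB's prefetching logic: the class SequentialReadTracker is hypothetical, a read counts as sequential only if it starts exactly where the previous read ended, and the first read of a run counts as 1.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical tracker: reports when the current run of sequential reads is
// long enough that auto prefetching would start in this simplified model.
class SequentialReadTracker {
 public:
  explicit SequentialReadTracker(std::uint64_t threshold)
      : threshold_(threshold) {
    assert(threshold >= 1);
  }

  // Records a read of `length` bytes at `offset`. Returns true once the
  // current run contains at least `threshold` reads.
  bool OnRead(std::uint64_t offset, std::uint64_t length) {
    const bool sequential = has_prev_ && offset == next_expected_offset_;
    run_length_ = sequential ? run_length_ + 1 : 1;  // a jump restarts the run
    has_prev_ = true;
    next_expected_offset_ = offset + length;
    return run_length_ >= threshold_;
  }

 private:
  std::uint64_t threshold_;
  std::uint64_t run_length_ = 0;
  std::uint64_t next_expected_offset_ = 0;
  bool has_prev_ = false;
};
```

With a threshold of 2 (the default value named above), the second back-to-back read is the first to report true in this model, and a non-contiguous read resets the count. A larger threshold delays prefetching until a longer sequential run has been observed.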

    🐎 Performance Improvements

    • 🐎 Iterator performance is improved for DeleteRange() users. Internally, the iterator skips to the end of a range tombstone when possible, instead of looping through each key and checking individually whether it is range deleted.
    • Eliminated some allocations and copies in the blob read path. Also, PinnableSlice now only points to the blob value and pins the backing resource (cache entry or buffer) in all cases, instead of containing a copy of the blob value. See #10625 and #10647.
    • In case of scans with async_io enabled, a few optimizations have been added to issue more asynchronous requests in parallel in order to avoid synchronous prefetching.
    • 🐎 DeleteRange() users should see improvement in get/iterator performance from mutable memtable (see #10547).
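
The iterator improvement for DeleteRange() users (skipping to the end of a range tombstone rather than testing each key) can be sketched over plain sorted vectors. This is an illustrative model only, not the RocksDB iterator: the VisibleKeys function and RangeTombstone alias are hypothetical, sequence numbers are ignored (every tombstone is assumed to be newer than every key), and tombstones are assumed to be sorted by start key and non-overlapping.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// A range tombstone deletes keys in [first, second).
using RangeTombstone = std::pair<std::string, std::string>;

// Returns the keys that are not covered by any tombstone. When a key falls
// inside a tombstone, the scan seeks directly to the first key at or after
// the tombstone's end instead of checking every covered key.
std::vector<std::string> VisibleKeys(
    const std::vector<std::string>& sorted_keys,
    const std::vector<RangeTombstone>& tombstones) {
  assert(std::is_sorted(sorted_keys.begin(), sorted_keys.end()));
  std::vector<std::string> visible;
  auto key_it = sorted_keys.begin();
  auto ts_it = tombstones.begin();
  while (key_it != sorted_keys.end()) {
    // Discard tombstones that end at or before the current key.
    while (ts_it != tombstones.end() && ts_it->second <= *key_it) {
      ++ts_it;
    }
    if (ts_it != tombstones.end() && ts_it->first <= *key_it) {
      // Current key is covered: jump past the whole tombstone in one seek.
      key_it = std::lower_bound(key_it, sorted_keys.end(), ts_it->second);
    } else {
      visible.push_back(*key_it);
      ++key_it;
    }
  }
  return visible;
}
```

For keys a, b, c, d, e and a single tombstone [b, d), the scan emits a, seeks from b straight to d, and then emits d and e. The binary search stands in for an iterator seek, so the cost of crossing a tombstone no longer grows linearly with the number of deleted keys it covers.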