Changelog History

  • v7.4.0 Changes

    Bug Fixes

    • Fixed a bug in calculating key-value integrity protection for users of in-place memtable updates. In particular, the affected users would be those who configure protection_bytes_per_key > 0 on WriteBatch or WriteOptions, and configure inplace_callback != nullptr.
    • Fixed a bug where a snapshot taken during SST file ingestion would be unstable.
    • Fixed a bug for non-TransactionDB with avoid_flush_during_recovery = true and TransactionDB where, in case of a crash, min_log_number_to_keep may not change on recovery, and persisting a new MANIFEST with advanced log_numbers for some column families results in a "column family inconsistency" error on the second recovery. As a solution, RocksDB will persist the new MANIFEST after successfully syncing the new WAL. If a future recovery starts from the new MANIFEST, then it means the new WAL was successfully synced. Due to the sentinel empty write batch at the beginning, kPointInTimeRecovery of WAL is guaranteed to go after this point. If a future recovery starts from the old MANIFEST, it means writing the new MANIFEST failed. We won't have the "SST ahead of WAL" error.
    • Fixed a bug where RocksDB DB::Open() may create and write to two new MANIFEST files even before recovery succeeds. Now writes to MANIFEST are persisted only after recovery is successful.
    • Fix a race condition in WAL size tracking which is caused by an unsafe iterator access after the container is changed.
    • Fix unprotected concurrent accesses to WritableFileWriter::filesize_ by DB::SyncWAL() and DB::Put() in two write queue mode.
    • Fix a bug in WAL tracking. Before this PR (#10087), calling SyncWAL() on the only WAL file of the db would not log the event in MANIFEST, thus allowing a subsequent DB::Open even if the WAL file is missing or corrupted.
    • Fix a bug that could return wrong results with index_type=kHashSearch and using SetOptions to change the prefix_extractor.
    • Fixed a bug in WAL tracking with wal_compression. WAL compression writes a kSetCompressionType record which is not associated with any sequence number. As a result, WalManager::GetSortedWalsOfType() will skip these WALs and not return them to the caller, e.g. Checkpoint, Backup, causing the operations to fail.
    • Avoid a crash if the IDENTITY file is accidentally truncated to empty. A new DB ID will be generated and written on Open.
    • Fixed a possible corruption for users of manual_wal_flush and/or FlushWAL(true /* sync */), together with track_and_verify_wals_in_manifest == true. For those users, losing unsynced data (e.g., due to power loss) could make future DB opens fail with a Status::Corruption complaining about missing WAL data.
    • Fixed a bug in WriteBatchInternal::Append() where the WAL termination point in the write batch was not considered and the function appended an incorrect number of checksums.
    • Fixed a crash bug introduced in 7.3.0 affecting users of MultiGet with kDataBlockBinaryAndHash.

    Public API changes

    • Add new API GetUnixTime in Snapshot class which returns the unix time at which Snapshot is taken.
    • Add transaction get_pinned and multi_get to C API.
    • Add two-phase commit support to C API.
    • Add rocksdb_transaction_get_writebatch_wi and rocksdb_transaction_rebuild_from_writebatch to C API.
    • Add rocksdb_options_get_blob_file_starting_level and rocksdb_options_set_blob_file_starting_level to C API.
    • Add blobFileStartingLevel and setBlobFileStartingLevel to Java API.
    • Add SingleDelete for DB in C API.
    • Add User Defined Timestamp in C API.
      • rocksdb_comparator_with_ts_create to create a timestamp-aware comparator
      • Put, Get, Delete, SingleDelete, MultiGet APIs have corresponding timestamp-aware APIs with the suffix with_ts
      • Also add C APIs for Transaction, SstFileWriter, and Compaction
    • The contract for implementations of Comparator::IsSameLengthImmediateSuccessor has been updated to work around a design bug in auto_prefix_mode.
    • The API documentation for auto_prefix_mode now notes some corner cases in which it returns different results than total_order_seek, due to design bugs that are not easily fixed. Users using built-in comparators and keys at least the size of a fixed prefix length are not affected.
    • Obsoleted the NUM_DATA_BLOCKS_READ_PER_LEVEL stat and introduced the NUM_LEVEL_READ_PER_MULTIGET and MULTIGET_COROUTINE_COUNT stats
    • Introduced WriteOptions::protection_bytes_per_key, which can be used to enable key-value integrity protection for live updates.

    New Features

    • Add FileSystem::ReadAsync API in io_tracing
    • Add blob garbage collection parameters blob_garbage_collection_policy and blob_garbage_collection_age_cutoff to both force-enable and force-disable GC, as well as selectively override age cutoff when using CompactRange.
    • Add an extra sanity check in GetSortedWalFiles() (also used by GetLiveFilesStorageInfo(), BackupEngine, and Checkpoint) to reduce risk of successfully created backup or checkpoint failing to open because of missing WAL file.
    • Add a new column family option blob_file_starting_level to enable writing blob files during flushes and compactions starting from the specified LSM tree level.
    • Add support for timestamped snapshots (#9879)
    • Provide support for AbortIO in posix to cancel submitted asynchronous requests using io_uring.
    • Add support for rate-limiting batched MultiGet() APIs
    • Added several new tickers, perf context statistics, and DB properties to BlobDB
      • Added new DB properties "rocksdb.blob-cache-capacity", "rocksdb.blob-cache-usage", "rocksdb.blob-cache-pinned-usage" to show blob cache usage.
      • Added new perf context statistics blob_cache_hit_count, blob_read_count, blob_read_byte, blob_read_time, blob_checksum_time and blob_decompress_time.
      • Added new tickers BLOB_DB_CACHE_MISS, BLOB_DB_CACHE_HIT, BLOB_DB_CACHE_ADD, BLOB_DB_CACHE_ADD_FAILURES, BLOB_DB_CACHE_BYTES_READ and BLOB_DB_CACHE_BYTES_WRITE.

    Behavior changes

    • DB::Open(), DB::OpenAsSecondary() will fail if a Logger cannot be created (#9984)
    • DB::Write does not hold global mutex_ if this db instance does not need to switch wal and mem-table (#7516).
    • Removed support for reading Bloom filters using obsolete block-based filter format. (Support for writing such filters was dropped in 7.0.) For good read performance on old DBs using these filters, a full compaction is required.
    • Per KV checksum in write batch is verified before a write batch is written to WAL to detect any corruption to the write batch (#10114).
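
The per key-value checksum check mentioned above (verify each entry of a write batch before it reaches the WAL) can be pictured with a small standalone sketch. This is not RocksDB code: `ProtectedEntry`, `MakeEntry`, `VerifyBatch`, and the FNV-1a hash are hypothetical stand-ins chosen for illustration, and RocksDB's real protection info uses its own hashing and layout.

```cpp
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Hypothetical entry type: a key-value pair plus a checksum taken at insert time.
struct ProtectedEntry {
  std::string key;
  std::string value;
  uint64_t checksum;
};

// FNV-1a 64-bit hash. Used here only as a stand-in; not what RocksDB uses.
inline uint64_t Fnv1a(const std::string& data) {
  uint64_t h = 14695981039346656037ULL;  // FNV offset basis
  for (unsigned char c : data) {
    h ^= c;
    h *= 1099511628211ULL;  // FNV prime
  }
  return h;
}

// Prefix with the key length so ("ab","c") and ("a","bc") hash differently.
inline uint64_t EntryChecksum(const std::string& key, const std::string& value) {
  return Fnv1a(std::to_string(key.size()) + ":" + key + value);
}

inline ProtectedEntry MakeEntry(std::string key, std::string value) {
  const uint64_t sum = EntryChecksum(key, value);
  return ProtectedEntry{std::move(key), std::move(value), sum};
}

// Returns true only if every entry still matches the checksum recorded for it.
// A writer would refuse to append the batch to its log when this returns false.
inline bool VerifyBatch(const std::vector<ProtectedEntry>& batch) {
  for (const auto& e : batch) {
    if (EntryChecksum(e.key, e.value) != e.checksum) return false;
  }
  return true;
}
```

The point of checking before the log write is that in-memory corruption is caught while the batch can still be rejected, rather than after it has been persisted.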

    Performance Improvements

    • When compiled with folly (Meta-internal integration; experimental in open source build), improve the locking performance (CPU efficiency) of LRUCache by using folly DistributedMutex in place of standard mutex.
  • v7.3.0 Changes

    Bug Fixes

    • Fixed a bug where manual flush would block forever even though flush options had wait=false.
    • Fixed a bug where RocksDB could corrupt DBs with avoid_flush_during_recovery == true by removing valid WALs, leading to Status::Corruption with message like "SST file is ahead of WALs" when attempting to reopen.
    • Fixed a bug in async_io path where incorrect length of data is read by FilePrefetchBuffer if data is consumed from two populated buffers and request for more data is sent.
    • Fixed a CompactionFilter bug. Compaction filter used to use Delete to remove keys, even if the keys should be removed with SingleDelete. Mixing Delete and SingleDelete may cause undefined behavior.
    • Fixed a bug in WritableFileWriter::WriteDirect and WritableFileWriter::WriteDirectWithChecksum. The rate_limiter_priority specified in ReadOptions was not passed to the RateLimiter when requesting a token.
    • Fixed a bug which might cause process crash when I/O error happens when reading an index block in MultiGet().

    New Features

    • DB::GetLiveFilesStorageInfo is ready for production use.
    • Add new stats PREFETCHED_BYTES_DISCARDED, which records the number of prefetched bytes discarded by RocksDB FilePrefetchBuffer on destruction, and POLL_WAIT_MICROS, which records the wait time for FS::Poll API completion.
    • RemoteCompaction supports table_properties_collector_factories override on compaction worker.
    • Start tracking SST unique id in MANIFEST, which will be used to verify against SST properties during DB open to make sure the SST file is not overwritten or misplaced. A DB option verify_sst_unique_id_in_manifest is introduced to enable/disable the verification (default is false). If enabled, all SST files will be opened during DB open to verify the unique id, so it's recommended to use it with max_open_files = -1 to pre-open the files.
    • Added the ability to concurrently read data blocks from multiple files in a level in batched MultiGet. This can be enabled by setting the async_io option in ReadOptions. Using this feature requires a FileSystem that supports ReadAsync (PosixFileSystem is not supported yet for this), and for RocksDB to be compiled with folly and C++20.
    • Charge memory usage of file metadata. RocksDB holds one file metadata structure in-memory per on-disk table file. If an operation reserving memory for file metadata exceeds the available space left in the block cache at some point (i.e., causing a cache full under LRUCacheOptions::strict_capacity_limit = true), creation will fail with Status::MemoryLimit(). To opt in to this feature, enable charging CacheEntryRole::kFileMetadata in BlockBasedTableOptions::cache_usage_options.

    Public API changes

    • Add rollback_deletion_type_callback to TransactionDBOptions so that write-prepared transactions know whether to issue a Delete or SingleDelete to cancel a previous key written during prior prepare phase. The PR aims to prevent mixing SingleDeletes and Deletes for the same key that can lead to undefined behaviors for write-prepared transactions.
    • EXPERIMENTAL: Add new API AbortIO in file_system to abort the read requests submitted asynchronously.
    • CompactionFilter::Decision has a new value: kRemoveWithSingleDelete. If CompactionFilter returns this decision, then CompactionIterator will use SingleDelete to mark a key as removed.
    • Renamed CompactionFilter::Decision::kRemoveWithSingleDelete to kPurge since the latter sounds more general and hides the implementation details of how compaction iterator handles keys.
    • Added ability to specify functions for Prepare and Validate to OptionTypeInfo. Added methods to OptionTypeInfo to set the functions via an API. These methods are intended for RocksDB plugin developers for configuration management.
    • Added a new immutable DB option, enforce_single_del_contracts. If set to false (default is true), compaction will NOT fail due to a single delete followed by a delete for the same key. The purpose of this temporary option is to help existing use cases migrate.
    • Introduce BlockBasedTableOptions::cache_usage_options and use that to replace BlockBasedTableOptions::reserve_table_builder_memory and BlockBasedTableOptions::reserve_table_reader_memory.
    • Changed GetUniqueIdFromTableProperties to return a 128-bit unique identifier, which will be the standard size now. The old functionality (192-bit) is available from GetExtendedUniqueIdFromTableProperties. Both functions are no longer "experimental" and are ready for production use.
    • In IOOptions, mark prio as deprecated for future removal.
    • In file_system.h, mark IOPriority as deprecated for future removal.
    • Add an option, CompressionOptions::use_zstd_dict_trainer, to indicate whether zstd dictionary trainer should be used for generating zstd compression dictionaries. The default value of this option is true for backward compatibility. When this option is set to false, zstd API ZDICT_finalizeDictionary is used to generate compression dictionaries.
    • In the Seek API, positioning each LevelIterator on the correct data block in the correct SST file can be parallelized if the ReadOptions.async_io option is enabled.
    • Add new stat number_async_seek in PerfContext that indicates number of async calls made by seek to prefetch data.
    • Add support for user-defined timestamps to read only DB.

    Bug Fixes

    • RocksDB calls FileSystem::Poll API during FilePrefetchBuffer destruction which impacts performance as it waits for read requests completion which is not needed anymore. Calling FileSystem::AbortIO to abort those requests instead fixes that performance issue.
    • Fixed unnecessary block cache contention when queries within a MultiGet batch and across parallel batches access the same data block, which previously could cause severely degraded performance in this unusual case. (In more typical MultiGet cases, this fix is expected to yield a small or negligible performance improvement.)

    Behavior changes

    • Enforce the existing contract of SingleDelete so that SingleDelete cannot be mixed with Delete because it leads to undefined behavior. Fix a number of unit tests that violate the contract but happen to pass.
    • ldb --try_load_options defaults to true if --db is specified and a new DB is not being created. The user can still explicitly disable that with --try_load_options=false (or explicitly enable it with --try_load_options).
    • During Flush write or Compaction write/read, the WriteController is used to determine whether DB writes are stalled or slowed down. The priority (Env::IOPriority) can then be determined accordingly and be passed in IOOptions to the file system.
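
Several entries in this release concern the rule that SingleDelete and Delete must not be mixed for the same key. A simplified application-side guard might look like the sketch below. `DeleteContractChecker` and `DeleteKind` are hypothetical names, not RocksDB APIs, and the real contract has further conditions (for example around intervening writes) that this sketch ignores.

```cpp
#include <string>
#include <unordered_map>

// Hypothetical tag for the two deletion styles discussed in the changelog.
enum class DeleteKind { kDelete, kSingleDelete };

// Remembers which deletion style was first used for each key and flags any
// later deletion of the same key that uses the other style.
class DeleteContractChecker {
 public:
  // Returns true if the deletion is consistent with earlier deletions of
  // `key`, false if it would mix Delete and SingleDelete for that key.
  bool Record(const std::string& key, DeleteKind kind) {
    auto it = seen_.find(key);
    if (it == seen_.end()) {
      seen_.emplace(key, kind);
      return true;
    }
    return it->second == kind;
  }

 private:
  std::unordered_map<std::string, DeleteKind> seen_;
};
```

A wrapper around a write path could call `Record` before issuing each deletion and reject or log the operation when it returns false.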

    Performance Improvements

    • Avoid calling malloc_usable_size() in LRU Cache's mutex.
    • Reduce DB mutex holding time when finding obsolete files to delete. When a file is trivially moved to another level, the internal files will be referenced twice internally and sometimes opened twice too. If a deletion candidate file is not the last reference, we need to destroy the reference and close the file but not delete the file. Previously this was determined by building a set of all live files. With the improvement, we check the file against all live LSM-tree versions instead.
  • v7.2.0 Changes

    Bug Fixes

    • Fixed a bug which caused RocksDB to fail when it was accessed using a UNC path.
    • Fixed a race condition when 2PC is disabled and WAL tracking in the MANIFEST is enabled. The race condition is between two background flush threads trying to install flush results, causing a WAL deletion not tracked in the MANIFEST. A future DB open may fail.
    • Fixed a heap use-after-free race with DropColumnFamily.
    • Fixed a bug that rocksdb.read.block.compaction.micros cannot track compaction stats (#9722).
    • Fixed file_type, relative_filename and directory fields returned by GetLiveFilesMetaData(), which were added in inheriting from FileStorageInfo.
    • Fixed a bug affecting track_and_verify_wals_in_manifest. Without the fix, application may see "open error: Corruption: Missing WAL with log number" while trying to open the db. The corruption is a false alarm but prevents DB open (#9766).
    • Fix segfault in FilePrefetchBuffer with async_io as it doesn't wait for pending jobs to complete on destruction.
    • Fix ERROR_HANDLER_AUTORESUME_RETRY_COUNT stat whose value was set wrong in portal.h
    • Fixed a bug for non-TransactionDB with avoid_flush_during_recovery = true and TransactionDB where, in case of a crash, min_log_number_to_keep may not change on recovery, and persisting a new MANIFEST with advanced log_numbers for some column families results in a "column family inconsistency" error on the second recovery. As a solution, the corrupted WALs whose numbers are larger than the corrupted WAL and smaller than the new WAL will be moved to the archive folder.
    • Fixed a bug in RocksDB DB::Open() which may create and write to two new MANIFEST files even before recovery succeeds. Now writes to MANIFEST are persisted only after recovery is successful.

    New Features

    • For db_bench when --seed=0 or --seed is not set then it uses the current time as the seed value. Previously it used the value 1000.
    • For db_bench when --benchmark lists multiple tests and each test uses a seed for a RNG then the seeds across tests will no longer be repeated.
    • Added an option to dynamically charge a continuously updated estimate of the memory usage of block-based table readers to the block cache, if a block cache is available. To enable this feature, set BlockBasedTableOptions::reserve_table_reader_memory = true.
    • Add new stat ASYNC_READ_BYTES that records the number of bytes read during async read calls. Users can use it to check whether the async code path is being used by RocksDB's internal automatic prefetching for sequential reads.
    • Enable async prefetching if ReadOptions.readahead_size is set along with ReadOptions.async_io in FilePrefetchBuffer.
    • Add event listener support on remote compaction compactor side.
    • Added a dedicated integer DB property rocksdb.live-blob-file-garbage-size that exposes the total amount of garbage in the blob files in the current version.
    • RocksDB does internal auto prefetching if it notices sequential reads. It starts with readahead size initial_auto_readahead_size which now can be configured through BlockBasedTableOptions.
    • Add a merge operator that allows users to register specific aggregation functions so that they can do aggregation using different aggregation types for different keys. See comments in include/rocksdb/utilities/agg_merge.h for actual usage. The feature is experimental and the format is subject to change and we won't provide a migration tool.
    • Meta-internal / Experimental: Improve CPU performance by replacing many uses of std::unordered_map with folly::F14FastMap when RocksDB is compiled together with Folly.
    • Experimental: Add CompressedSecondaryCache, a concrete implementation of rocksdb::SecondaryCache, that integrates with compression libraries (e.g. LZ4) to hold compressed blocks.

    Behavior changes

    • Disallow usage of commit-time-write-batch for write-prepared/write-unprepared transactions if TransactionOptions::use_only_the_last_commit_time_batch_for_recovery is false to prevent two (or more) uncommitted versions of the same key in the database. Otherwise, bottommost compaction may violate the internal key uniqueness invariant of SSTs if the sequence numbers of both internal keys are zeroed out (#9794).
    • Make DB::GetUpdatesSince() return NotSupported early for write-prepared/write-unprepared transactions, as the API contract indicates.

    Public API changes

    • Exposed APIs to examine results of block cache stats collections in a structured way. In particular, users of GetMapProperty() with property kBlockCacheEntryStats can now use the functions in BlockCacheEntryStatsMapKeys to find stats in the map.
    • Add fail_if_not_bottommost_level to IngestExternalFileOptions so that ingestion will fail if the file(s) cannot be ingested to the bottommost level.
    • Add output parameter is_in_sec_cache to SecondaryCache::Lookup(). It is to indicate whether the handle is possibly erased from the secondary cache after the Lookup.
  • v7.1.0 Changes

    New Features

    • Allow WriteBatchWithIndex to index a WriteBatch that includes keys with user-defined timestamps. The index itself does not have timestamp.
    • Add support for user-defined timestamps to write-committed transaction without API change. The TransactionDB layer APIs do not allow timestamps because we require that all user-defined-timestamps-aware operations go through the Transaction APIs.
    • Added BlobDB options to ldb
    • BlockBasedTableOptions::detect_filter_construct_corruption can now be dynamically configured using DB::SetOptions.
    • Automatically recover from retryable read IO errors during background flush/compaction.
    • Experimental support for preserving file Temperatures through backup and restore, and for updating DB metadata for outside changes to file Temperature (UpdateManifestForFilesState or ldb update_manifest --update_temperatures).
    • Experimental support for async_io in ReadOptions which is used by FilePrefetchBuffer to prefetch some of the data asynchronously, if reads are sequential and auto readahead is enabled by rocksdb internally.

    Bug Fixes

    • Fixed a major performance bug in which Bloom filters generated by pre-7.0 releases are not read by early 7.0.x releases (and vice-versa) due to changes to FilterPolicy::Name() in #9590. This can severely impact read performance and read I/O on upgrade or downgrade with existing DB, but not data correctness.
    • Fixed a data race on versions_ between DBImpl::ResumeImpl() and threads waiting for recovery to complete (#9496)
    • Fixed a bug caused by a race among flush, incoming writes and taking snapshots. Queries to snapshots created under this race condition can return incorrect results, e.g. resurfacing deleted data.
    • Fixed a bug that DB flush uses options.compression even when options.compression_per_level is set.
    • Fixed a bug that DisableManualCompaction may assert when disabling an unscheduled manual compaction.
    • Fix a race condition when canceling manual compaction with DisableManualCompaction. Also DB close can cancel the manual compaction thread.
    • Fixed a potential timer crash when opening and closing DBs concurrently.
    • Fixed a race condition for alive_log_files_ in non-two-write-queues mode. The race is between the write_thread_ in WriteToWAL() and another thread executing FindObsoleteFiles(). The race condition will be caught if __glibcxx_requires_nonempty is enabled.
    • Fixed a bug that Iterator::Refresh() reads stale keys after DeleteRange() is performed.
    • Fixed a race condition when disabling and re-enabling manual compaction.
    • Fixed automatic error recovery failure in atomic flush.
    • Fixed a race condition when mmaping a WritableFile on POSIX.

    Public API changes

    • Added pure virtual FilterPolicy::CompatibilityName(), which is needed for fixing major performance bug involving FilterPolicy naming in SST metadata without affecting Customizable aspect of FilterPolicy. This change only affects those with their own custom or wrapper FilterPolicy classes.
    • options.compression_per_level is dynamically changeable with SetOptions().
    • Added WriteOptions::rate_limiter_priority. When set to something other than Env::IO_TOTAL, the internal rate limiter (DBOptions::rate_limiter) will be charged at the specified priority for writes associated with the API to which the WriteOptions was provided. Currently the support covers automatic WAL flushes, which happen during live updates (Put(), Write(), Delete(), etc.) when WriteOptions::disableWAL == false and DBOptions::manual_wal_flush == false.
    • Add DB::OpenAndTrimHistory API. This API will open the DB and trim data to the timestamp specified by trim_ts (data with a timestamp larger than the specified trim bound will be removed). This API should only be used when recovering timestamp-enabled column families. If a column family doesn't have timestamps enabled, this API won't trim any data on that column family. This API is not compatible with the avoid_flush_during_recovery option.
    • Remove BlockBasedTableOptions.hash_index_allow_collision, which no longer has any effect.
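
The trimming rule described for DB::OpenAndTrimHistory (entries with a timestamp larger than trim_ts are removed) can be sketched on plain in-memory data. This is only a model of the rule, not of the API: `VersionedEntry` and `TrimHistory` are hypothetical names, and the timestamp is assumed to be a 64-bit integer, whereas RocksDB timestamps are byte strings ordered by the column family's comparator.

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical versioned record: one value of `key` written at time `ts`.
struct VersionedEntry {
  std::string key;
  uint64_t ts;
  std::string value;
};

// Removes every entry whose timestamp is strictly larger than `trim_ts`,
// keeping the relative order of the remaining entries.
inline void TrimHistory(std::vector<VersionedEntry>& entries, uint64_t trim_ts) {
  entries.erase(
      std::remove_if(entries.begin(), entries.end(),
                     [trim_ts](const VersionedEntry& e) { return e.ts > trim_ts; }),
      entries.end());
}
```

Entries written exactly at `trim_ts` are kept in this sketch, following the "larger than" wording above.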
  • v7.0.0 Changes

    Bug Fixes

    • Fixed a major bug in which batched MultiGet could return old values for keys deleted by DeleteRange when memtable Bloom filter is enabled (memtable_prefix_bloom_size_ratio > 0). (The fix includes a substantial MultiGet performance improvement in the unusual case of both memtable_whole_key_filtering and prefix_extractor.)
    • Fixed more cases of EventListener::OnTableFileCreated called with OK status, file_size==0, and no SST file kept. Now the status is Aborted.
    • Fixed a read-after-free bug in DB::GetMergeOperands().
    • Fix a data loss bug for 2PC write-committed transaction caused by concurrent transaction commit and memtable switch (#9571).
    • Fixed NUM_INDEX_AND_FILTER_BLOCKS_READ_PER_LEVEL, NUM_DATA_BLOCKS_READ_PER_LEVEL, and NUM_SST_READ_PER_LEVEL stats to be reported once per MultiGet batch per level.

    Performance Improvements

    • Mitigated the overhead of building the file location hash table used by the online LSM tree consistency checks, which can improve performance for certain workloads (see #9351).
    • Switched to using a sorted std::vector instead of std::map for storing the metadata objects for blob files, which can improve performance for certain workloads, especially when the number of blob files is high.
    • DisableManualCompaction() doesn't have to wait for a scheduled manual compaction to be executed in the thread pool before canceling the job.

    Public API changes

    • Require C++17 compatible compiler (GCC >= 7, Clang >= 5, Visual Studio >= 2017) for compiling RocksDB and any code using RocksDB headers. See #9388.
    • Added ReadOptions::rate_limiter_priority. When set to something other than Env::IO_TOTAL, the internal rate limiter (DBOptions::rate_limiter) will be charged at the specified priority for file reads associated with the API to which the ReadOptions was provided.
    • Remove HDFS support from main repo.
    • Remove librados support from main repo.
    • Remove obsolete backupable_db.h and type alias BackupableDBOptions. Use backup_engine.h and BackupEngineOptions. Similar renamings are in the C and Java APIs.
    • Removed obsolete utility_db.h and UtilityDB::OpenTtlDB. Use db_ttl.h and DBWithTTL::Open.
    • Remove deprecated API DB::AddFile from main repo.
    • Remove deprecated API ObjectLibrary::Register() and the (now obsolete) Regex public API. Use ObjectLibrary::AddFactory() with PatternEntry instead.
    • Remove deprecated option DBOptions::table_cache_remove_scan_count_limit.
    • Remove deprecated API AdvancedColumnFamilyOptions::soft_rate_limit.
    • Remove deprecated API AdvancedColumnFamilyOptions::hard_rate_limit.
    • Remove deprecated API DBOptions::base_background_compactions.
    • Remove deprecated API DBOptions::purge_redundant_kvs_while_flush.
    • Remove deprecated overloads of API DB::CompactRange.
    • Remove deprecated option DBOptions::skip_log_error_on_recovery.
    • Remove ReadOptions::iter_start_seqnum which has been deprecated.
    • Remove DBOptions::preserved_deletes and DB::SetPreserveDeletesSequenceNumber().
    • Remove deprecated API AdvancedColumnFamilyOptions::rate_limit_delay_max_milliseconds.
    • Removed timestamp from WriteOptions. Accordingly, added to DB APIs Put, Delete, SingleDelete, etc. accepting an additional argument 'timestamp'. Added Put, Delete, SingleDelete, etc to WriteBatch accepting an additional argument 'timestamp'. Removed WriteBatch::AssignTimestamps(vector) API. Renamed WriteBatch::AssignTimestamp() to WriteBatch::UpdateTimestamps() with clarified comments.
    • Changed type of cache buffer passed to Cache::CreateCallback from void* to const void*.
    • Significant updates to FilterPolicy-related APIs and configuration:
      • Remove public API support for deprecated, inefficient block-based filter (use_block_based_builder=true).
      • Old code and configuration strings that would enable it now quietly enable full filters instead, though any built-in FilterPolicy can still read block-based filters. This includes changing the longstanding default behavior of the Java API.
      • Remove deprecated FilterPolicy::CreateFilter() and FilterPolicy::KeyMayMatch()
      • Remove rocksdb_filterpolicy_create() from C API, as the only C API support for custom filter policies is now obsolete.
      • If temporary memory usage in full filter creation is a problem, consider using partitioned filters, smaller SST files, or setting reserve_table_builder_memory=true.
      • Remove support for "filter_policy=experimental_ribbon" configuration string. Use something like "filter_policy=ribbonfilter:10" instead.
      • Allow configuration string like "filter_policy=bloomfilter:10" without bool, to minimize acknowledgement of obsolete block-based filter.
      • Made FilterPolicy Customizable. Configuration of filter_policy is now accurately saved in OPTIONS file and can be loaded with LoadOptionsFromFile. (Loading an OPTIONS file generated by a previous version only enables reading and using existing filters, not generating new filters. Previously, no filter_policy would be configured from a saved OPTIONS file.)
      • Change meaning of nullptr return from GetBuilderWithContext() from "use block-based filter" to "generate no filter in this case."
      • Also, when user specifies bits_per_key < 0.5, we now round this down to "no filter" because we expect a filter with >= 80% FP rate is unlikely to be worth the CPU cost of accessing it (esp with cache_index_and_filter_blocks=1 or partition_filters=1).
      • bits_per_key >= 0.5 and < 1.0 is still rounded up to 1.0 (for 62% FP rate)
      • Remove class definitions for FilterBitsBuilder and FilterBitsReader from public API, so these can evolve more easily as implementation details. Custom FilterPolicy can still decide what kind of built-in filter to use under what conditions.
      • Also removed deprecated functions:
        • FilterPolicy::GetFilterBitsBuilder()
        • NewExperimentalRibbonFilterPolicy()
      • Remove default implementations of:
        • FilterPolicy::GetBuilderWithContext()
    • Remove default implementation of Name() from FileSystemWrapper.
    • Rename SizeApproximationOptions.include_memtabtles to SizeApproximationOptions.include_memtables.
    • Remove deprecated option DBOptions::max_mem_compaction_level.
    • Return Status::InvalidArgument from ObjectRegistry::NewObject if a factory exists but the object could not be created (returns NotFound if the factory is missing).
    • Remove deprecated overloads of API DB::GetApproximateSizes.
    • Remove deprecated option DBOptions::new_table_reader_for_compaction_inputs.
    • Add Transaction::SetReadTimestampForValidation() and Transaction::SetCommitTimestamp(). Default impl returns NotSupported().
    • Add support for decimal patterns to ObjectLibrary::PatternEntry.
    • Remove deprecated remote compaction APIs CompactionService::Start() and CompactionService::WaitForComplete(). Please use CompactionService::StartV2() and CompactionService::WaitForCompleteV2() instead, which provide the same information plus extra data like priority, db_id, etc.
    • ColumnFamilyOptions::OldDefaults and DBOptions::OldDefaults are marked deprecated, as they are no longer maintained.
    • Add subcompaction callback APIs: OnSubcompactionBegin() and OnSubcompactionCompleted().
    • Add file Temperature information to FileOperationInfo in event listener API.
    • Change the type of SizeApproximationFlags from enum to enum class. Also update the signature of DB::GetApproximateSizes API from uint8_t to SizeApproximationFlags.
    • Add Temperature hints information from RocksDB in API NewSequentialFile(). Backup and checkpoint operations need to open the source files with NewSequentialFile(), which will have the temperature hints. Other operations are not covered.
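
The bits_per_key rounding rules listed under the FilterPolicy updates above (below 0.5 means no filter; 0.5 up to but not including 1.0 is rounded up to 1.0) can be written down as a small function. The false-positive estimate is the textbook Bloom filter approximation with an optimal number of probes, included only to show where a figure of roughly 62% at one bit per key comes from. It is an assumption for illustration, not RocksDB's own formula, and both function names are hypothetical.

```cpp
#include <cmath>

// Applies the rounding described in the changelog:
//   bits_per_key < 0.5          -> 0.0 (no filter is generated)
//   0.5 <= bits_per_key < 1.0   -> 1.0
//   otherwise                   -> unchanged
inline double EffectiveBitsPerKey(double bits_per_key) {
  if (bits_per_key < 0.5) return 0.0;
  if (bits_per_key < 1.0) return 1.0;
  return bits_per_key;
}

// Textbook estimate for a Bloom filter using the optimal number of probes:
// FP rate ~= 0.6185 ^ bits_per_key. Not RocksDB's exact computation.
inline double ApproxFalsePositiveRate(double bits_per_key) {
  return std::pow(0.6185, bits_per_key);
}
```

Under this estimate, a filter at half a bit per key would let nearly four in five non-matching lookups through, which is in line with the changelog's reasoning that such a filter is unlikely to be worth the CPU cost.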

    Behavior Changes

    • Disallow the combination of DBOptions.use_direct_io_for_flush_and_compaction == true and DBOptions.writable_file_max_buffer_size == 0. This combination can cause WritableFileWriter::Append() to loop forever, and it does not make much sense in direct IO.
    • ReadOptions::total_order_seek no longer affects DB::Get(). The original motivation for this interaction has been obsolete since RocksDB has been able to detect whether the current prefix extractor is compatible with that used to generate table files, probably RocksDB 5.14.0.
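The disallowed option combination above can be pictured as a simple validation step. This is a minimal sketch under stated assumptions: IoOptionsSketch and ValidateIoOptions are hypothetical names that only mirror the two option fields mentioned in the note, and the default buffer size is a placeholder rather than RocksDB's default.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical mirror of the two options named in the note above.
// The default value here is a placeholder, not RocksDB's default.
struct IoOptionsSketch {
  bool use_direct_io_for_flush_and_compaction = false;
  std::size_t writable_file_max_buffer_size = 1024 * 1024;
};

// Returns an empty string when the combination is acceptable, otherwise a
// human-readable reason for rejecting it.
inline std::string ValidateIoOptions(const IoOptionsSketch& o) {
  if (o.use_direct_io_for_flush_and_compaction &&
      o.writable_file_max_buffer_size == 0) {
    return "writable_file_max_buffer_size must be > 0 when direct IO is "
           "used for flush and compaction";
  }
  return "";
}
```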
  • v6.29.0 Changes

    Note: The next release will be major release 7.0. See https://github.com/facebook/rocksdb/issues/9390 for more info.

    Public API change

    • Added values to TraceFilterType: kTraceFilterIteratorSeek, kTraceFilterIteratorSeekForPrev, and kTraceFilterMultiGet. They can be set in TraceOptions to filter out the operation types after which they are named.
    • Added TraceOptions::preserve_write_order. When enabled it guarantees write records are traced in the same order they are logged to WAL and applied to the DB. By default it is disabled (false) to match the legacy behavior and prevent regression.
    • Made the Env class extend the Customizable class. Implementations need to be registered with the ObjectRegistry and to implement a Name() method in order to be created via this method.
    • Options::OldDefaults is marked deprecated, as it is no longer maintained.
    • Add the ObjectLibrary::AddFactory method and the ObjectLibrary::PatternEntry class. This method and associated class are the preferred mechanism for registering factories with the ObjectLibrary going forward. The ObjectLibrary::Register method, which uses regular expressions and may be problematic, is deprecated and will be removed in a future release.
    • Changed BlockBasedTableOptions::block_size from size_t to uint64_t.
    • Added API warning against using Iterator::Refresh() together with DB::DeleteRange(), which are incompatible and have always risked causing the refreshed iterator to return incorrect results.
    • Made AdvancedColumnFamilyOptions.bottommost_temperature dynamically changeable with SetOptions().

    Behavior Changes

    • DB::DestroyColumnFamilyHandle() will return Status::InvalidArgument() if called with DB::DefaultColumnFamily().
    • On 32-bit platforms, mmap reads are no longer quietly disabled, just discouraged.

    New Features

    • Added Options::DisableExtraChecks() that can be used to improve peak write performance by disabling checks that should not be necessary in the absence of software logic errors or CPU+memory hardware errors. (Default options are slowly moving toward some performance overheads for extra correctness checking.)

    Performance Improvements

    • Improved read performance when a prefix extractor is used (Seek, Get, MultiGet), even compared to version 6.25 baseline (see bug fix below), by optimizing the common case in which the prefix extractor is compatible with the table file and unchanged.

    Bug Fixes

    • Fixed a bug where FlushMemTable may return OK even if the flush did not succeed.
    • Fixed a bug of Sync() and Fsync() not using fcntl(F_FULLFSYNC) on OS X and iOS.
    • Fixed a significant performance regression in version 6.26 when a prefix extractor is used on the read path (Seek, Get, MultiGet). (Excessive time was spent in SliceTransform::AsString().)
    • Fixed a race condition in SstFileManagerImpl error recovery code that can cause a crash during process shutdown.

    New Features

    • Added RocksJava support for MacOS universal binary (ARM+x86).
  • v6.28.0 Changes

    December 17, 2021

    New Features

    • Introduced 'CommitWithTimestamp' as a new tag. Currently, there is no API for users to trigger a write with this tag to the WAL. This is part of the effort to support write-committed transactions with user-defined timestamps.
    • Introduce SimulatedHybridFileSystem, which can help simulate HDD latency in db_bench. Tiered Storage latency simulation can be enabled using -simulate_hybrid_fs_file (note that it doesn't work if db_bench is interrupted in the middle). -simulate_hdd can also be used to simulate all files on HDD.

    Bug Fixes

    • Fixed a bug in RocksDB automatic implicit prefetching, which was broken by the new adaptive_readahead feature: internal prefetching got disabled when the iterator moved from one file to the next.
    • Fixed a bug in TableOptions.prepopulate_block_cache which causes segmentation fault when used with TableOptions.partition_filters = true and TableOptions.cache_index_and_filter_blocks = true.
    • Fixed a bug affecting custom memtable factories which are not registered with the ObjectRegistry. The bug could result in failure to save the OPTIONS file.
    • Fixed a bug causing two duplicate entries to be appended to a file opened in non-direct mode and tracked by FaultInjectionTestFS.
    • Fixed a bug in TableOptions.prepopulate_block_cache to support block-based filters also.
    • Block cache keys no longer use FSRandomAccessFile::GetUniqueId() (previously used when available), so a filesystem recycling unique ids can no longer lead to incorrect results or crashes (#7405). For files generated by RocksDB >= 6.24, the cache keys are stable across DB::Open and DB directory move / copy / import / export / migration, etc. Although collisions are still theoretically possible, they are (a) impossible in many common cases, (b) not dependent on environmental factors, and (c) much less likely than a CPU miscalculation while executing RocksDB.
    • Fixed a bug in C bindings causing iterators to return incorrect results (#9343).

    Behavior Changes

    • MemTableList::TrimHistory now uses allocated bytes when max_write_buffer_size_to_maintain > 0 (the default in TransactionDB, introduced in PR#5022). Fixes #8371.

    Public API change

    • Extend WriteBatch::AssignTimestamp and AssignTimestamps API so that both functions can accept an optional checker argument that performs additional checking on timestamp sizes.
    • Introduce a new EventListener callback that will be called upon the end of automatic error recovery.
    • Add IncreaseFullHistoryTsLow API so users can advance each column family's full_history_ts_low separately.
    • Add GetFullHistoryTsLow API so users can query the current full_history_ts_low value of a specified column family.

    Performance Improvements

    • Replaced map property TableProperties::properties_offsets with uint64_t property external_sst_file_global_seqno_offset to reduce the memory used by table properties.
    • Block cache accesses are faster because RocksDB now uses cache keys of fixed size (16 bytes).

    Java API Changes

    • Removed Java API TableProperties.getPropertiesOffsets() as it exposed internal details to external users.
  • v6.27.0 Changes

    November 19, 2021

    New Features

    • Added new ChecksumType kXXH3 which is faster than kCRC32c on almost all x86_64 hardware.
    • Added a new online consistency check for BlobDB which validates that the number/total size of garbage blobs does not exceed the number/total size of all blobs in any given blob file.
    • Provided support for tracking per-sst user-defined timestamp information in MANIFEST.
    • Added new option "adaptive_readahead" in ReadOptions. For iterators, RocksDB does auto-readahead on noticing sequential reads. With this option enabled, the readahead_size of the current file (if reads are sequential) is carried forward to the next file instead of starting from scratch at each level (except L0 level files). If reads are not sequential, it falls back to 8KB. This option applies only to RocksDB's internal prefetch buffer and isn't supported with underlying file system prefetching.
    • Added the read count and read bytes related stats to Statistics for tiered storage hot, warm, and cold file reads.
    • Added an option to dynamically charge an updating estimate of the memory usage of block-based table building to the block cache, if a block cache is available. It currently only includes charging memory usage of constructing (new) Bloom Filter and Ribbon Filter to block cache. To enable this feature, set BlockBasedTableOptions::reserve_table_builder_memory = true.
    • Add a new API OnIOError in listener.h that notifies listeners when an IO error occurs during FileSystem operation along with filename, status etc.
    • Added compaction readahead support for blob files to the integrated BlobDB implementation, which can improve compaction performance when the database resides on higher-latency storage like HDDs or remote filesystems. Readahead can be configured using the column family option blob_compaction_readahead_size.
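The adaptive_readahead note above describes carrying the readahead size across files when reads are sequential. The sketch below is a toy model of that idea, not RocksDB's implementation: the doubling rule and the 256KB cap are assumptions, the L0 exception is ignored, and ReadaheadState, OnRead, and OnNextFile are hypothetical names. Only the 8KB starting size comes from the note.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Illustrative model only. The 8KB starting size follows the note above;
// the doubling rule and the 256KB cap are assumptions.
constexpr std::size_t kInitialReadahead = 8 * 1024;
constexpr std::size_t kMaxReadahead = 256 * 1024;

struct ReadaheadState {
  std::size_t readahead_size = kInitialReadahead;
  bool last_read_sequential = false;
};

// Called after each read within a file: grow on sequential reads,
// fall back to the initial size otherwise.
inline void OnRead(ReadaheadState& s, bool sequential) {
  s.last_read_sequential = sequential;
  if (sequential) {
    s.readahead_size = std::min(s.readahead_size * 2, kMaxReadahead);
  } else {
    s.readahead_size = kInitialReadahead;
  }
}

// Called when the iterator moves on to the next file. With adaptive
// readahead and sequential reads, the current size is kept; otherwise
// the next file starts from scratch.
inline void OnNextFile(ReadaheadState& s, bool adaptive_readahead) {
  if (!(adaptive_readahead && s.last_read_sequential)) {
    s.readahead_size = kInitialReadahead;
  }
}
```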

    Bug Fixes

    • Prevent a CompactRange() with CompactRangeOptions::change_level == true from possibly causing corruption to the LSM state (overlapping files within a level) when run in parallel with another manual compaction. Note that setting force_consistency_checks == true (the default) would cause the DB to enter read-only mode in this scenario and return Status::Corruption, rather than committing any corruption.
    • Fixed a bug in CompactionIterator when write-prepared transaction is used. A released earliest write conflict snapshot may cause assertion failure in dbg mode and unexpected key in opt mode.
    • Fixed ticker WRITE_WITH_WAL ("rocksdb.write.wal"). The bug was caused by an extra RecordTick(stats_, WRITE_WITH_WAL) in two places; this fix removes the extra RecordTicks and fixes the corresponding test case.
    • EventListener::OnTableFileCreated was previously called with OK status and file_size==0 in cases of no SST file contents written (because there was no content to add) and the empty file deleted before calling the listener. Now the status is Aborted.
    • Fixed a bug in CompactionIterator when write-prepared transaction is used. Releasing earliest_snapshot during compaction may cause a SingleDelete to be output after a PUT of the same user key whose seq has been zeroed.
    • Added input sanitization on negative bytes passed into GenericRateLimiter::Request.
    • Fixed an assertion failure in CompactionIterator when write-prepared transaction is used. We prove that certain operations can lead to a Delete being followed by a SingleDelete (same user key). We can drop the SingleDelete.
    • Fixed a bug of timestamp-based GC which can cause all versions of a key under full_history_ts_low to be dropped. This bug will be triggered when some of the ikeys' timestamps are lower than full_history_ts_low, while others are newer.
    • In some cases outside of the DB read and compaction paths, SST block checksums are now checked where they were not before.
    • Explicitly check for and disallow the BlockBasedTableOptions if insertion into one of {block_cache, block_cache_compressed, persistent_cache} can show up in another of these. (RocksDB expects to be able to use the same key for different physical data among tiers.)
    • Users who configured a dedicated thread pool for bottommost compactions by explicitly adding threads to the Env::Priority::BOTTOM pool will no longer see RocksDB schedule automatic compactions exceeding the DB's compaction concurrency limit. For details on per-DB compaction concurrency limit, see API docs of max_background_compactions and max_background_jobs.
    • Fixed a bug of background flush thread picking more memtables to flush and prematurely advancing column family's log_number.
    • Fixed an assertion failure in ManifestTailer.
    • Fixed a bug that could, with WAL enabled, cause backups, checkpoints, and GetSortedWalFiles() to fail randomly with an error like IO error: 001234.log: No such file or directory

    Behavior Changes

    • NUM_FILES_IN_SINGLE_COMPACTION previously counted only the first input level's files; it now includes all input files.
    • TransactionUtil::CheckKeyForConflicts can also perform conflict-checking based on user-defined timestamps in addition to sequence numbers.
    • Removed GenericRateLimiter's minimum refill bytes per period previously enforced.

    Public API change

    • When options.ttl is used with leveled compaction with compaction priority kMinOverlappingRatio, files exceeding half of TTL value will be prioritized more, so that by the time TTL is reached, fewer extra compactions will be scheduled to clear them up. At the same time, when compacting files with data older than half of TTL, output files may be cut off based on those files' boundaries, in order for the early TTL compaction to work properly.
    • Made FileSystem and RateLimiter extend the Customizable class and added a CreateFromString method. Implementations need to be registered with the ObjectRegistry and to implement a Name() method in order to be created via this method.
    • Clarified in API comments that RocksDB is not exception safe for callbacks and custom extensions. An exception propagating into RocksDB can lead to undefined behavior, including data loss, unreported corruption, deadlocks, and more.
    • Marked WriteBufferManager as final because it is not intended for extension.
    • Removed unimportant implementation details from table_properties.h.
    • Add API FSDirectory::FsyncWithDirOptions(), which provides extra information like directory fsync reason in DirFsyncOptions. File systems like btrfs use that to skip the directory fsync when creating a new file or, when renaming a file, to fsync the target file instead of the directory, which improves the DB::Open() speed by ~20%.
    • DB::Open() is not going to be blocked by obsolete file purge if DBOptions::avoid_unnecessary_blocking_io is set to true.
    • In builds where glibc provides gettid(), info log ("LOG" file) lines now print a system-wide thread ID from gettid() instead of the process-local pthread_self(). For all users, the thread ID format is changed from hexadecimal to decimal integer.
    • In builds where glibc provides pthread_setname_np(), the background thread names no longer contain an ID suffix. For example, "rocksdb:bottom7" (and all other threads in the Env::Priority::BOTTOM pool) are now named "rocksdb:bottom". Previously large thread pools could breach the name size limit (e.g., naming "rocksdb:bottom10" would fail).
    • Deprecated ReadOptions::iter_start_seqnum and DBOptions::preserve_deletes; please try using the user-defined timestamp feature instead. The options will be removed in a future release; currently, a warning message is logged when they are used.
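The thread naming change above comes down to a length check. On Linux, pthread_setname_np() accepts names of at most 16 bytes including the terminating NUL, so a 16-character name such as "rocksdb:bottom10" does not fit. A minimal sketch of that arithmetic follows; FitsThreadName and SuffixedName are hypothetical helpers, not RocksDB functions.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Linux limits thread names to 16 bytes, counting the terminating NUL.
constexpr std::size_t kMaxThreadNameBytes = 16;

// True when `name` plus its NUL terminator fits within the limit.
inline bool FitsThreadName(const std::string& name) {
  return name.size() + 1 <= kMaxThreadNameBytes;
}

// Hypothetical helper reproducing the old naming scheme with an ID suffix.
inline std::string SuffixedName(const std::string& base, int id) {
  return base + std::to_string(id);
}
```

With these helpers, "rocksdb:bottom" (14 characters) and "rocksdb:bottom7" (15 characters) fit, while "rocksdb:bottom10" (16 characters plus NUL) does not.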

    Performance Improvements

    • Released some memory related to filter construction earlier in BlockBasedTableBuilder for FullFilter and PartitionedFilter case (#9070).

  • v6.26.0 Changes

    October 20, 2021

    Bug Fixes

    • Fixed a bug in direct IO mode when calling MultiGet() for blobs in the same blob file. The bug was caused by not sorting the blob read requests by file offsets.
    • Fix the incorrect disabling of SST rate limited deletion when the WAL and DB are in different directories. Only WAL rate limited deletion should be disabled if it is in a different directory.
    • Fix DisableManualCompaction() to cancel compactions even when they are waiting on automatic compactions to drain due to CompactRangeOptions::exclusive_manual_compactions == true.
    • Fix contract of Env::ReopenWritableFile() and FileSystem::ReopenWritableFile() to specify any existing file must not be deleted or truncated.
    • Fixed bug in calls to IngestExternalFiles() with files for multiple column families. The bug could have introduced a delay in ingested file keys becoming visible after IngestExternalFiles() returned. Furthermore, mutations to ingested file keys while they were invisible could have been dropped (not necessarily immediately).
    • Fixed a possible race condition impacting users of WriteBufferManager who constructed it with allow_stall == true. The race condition led to undefined behavior (in our experience, typically a process crash).
    • Fixed a bug where stalled writes would remain stalled forever after the user calls WriteBufferManager::SetBufferSize() with new_size == 0 to dynamically disable memory limiting.
    • Make DB::Close() thread-safe.
    • Fix a bug in atomic flush where one bg flush thread will wait forever for a preceding bg flush thread to commit its result to MANIFEST but encounters an error which is mapped to a soft error (DB not stopped).
    • Fix a bug in BackupEngine where some internal callers of GenericRateLimiter::Request() do not honor bytes <= GetSingleBurstBytes().
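The MultiGet() fix for blobs above mentions sorting read requests by file offset before issuing them. A minimal sketch of that step follows, assuming a hypothetical BlobReadRequest struct; RocksDB's internal request type and batching logic are not shown here.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical request record for one blob within a single blob file.
struct BlobReadRequest {
  uint64_t offset;  // byte offset of the blob in the file
  uint64_t size;    // number of bytes to read
};

// Orders the requests by ascending file offset so that a batched read
// sees them in on-disk order.
inline void SortByOffset(std::vector<BlobReadRequest>& reqs) {
  std::sort(reqs.begin(), reqs.end(),
            [](const BlobReadRequest& a, const BlobReadRequest& b) {
              return a.offset < b.offset;
            });
}
```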

    New Features

    • Print information about blob files when using "ldb list_live_files_metadata".
    • Provided support for SingleDelete with user defined timestamp.
    • Experimental new function DB::GetLiveFilesStorageInfo offers essentially a unified version of other functions like GetLiveFiles, GetLiveFilesChecksumInfo, and GetSortedWalFiles. Checkpoints and backups could show small behavioral changes and/or improved performance as they now use this new API.
    • Add remote compaction read/write bytes statistics: REMOTE_COMPACT_READ_BYTES, REMOTE_COMPACT_WRITE_BYTES.
    • Introduce an experimental feature to dump out the blocks from block cache and insert them to the secondary cache to reduce the cache warmup time (e.g., used while migrating DB instance). More information is in class CacheDumper and CacheDumpedLoader at rocksdb/utilities/cache_dump_load.h. Note that this feature is still experimental and subject to change in the future.
    • Introduced a new BlobDB configuration option blob_garbage_collection_force_threshold, which can be used to trigger compactions targeting the SST files which reference the oldest blob files when the ratio of garbage in those blob files meets or exceeds the specified threshold. This can reduce space amplification with skewed workloads where the affected SST files might not otherwise get picked up for compaction.
    • Added EXPERIMENTAL support for table file (SST) unique identifiers that are stable and universally unique, available with new function GetUniqueIdFromTableProperties. Only SST files from RocksDB >= 6.24 support unique IDs.
    • Added GetMapProperty() support for "rocksdb.dbstats" (DB::Properties::kDBStats). As a map property, it includes DB-level internal stats accumulated over the DB's lifetime, such as user write related stats and uptime.
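The blob_garbage_collection_force_threshold option above is described as a "meets or exceeds" comparison on a garbage ratio. The sketch below shows that comparison for a single set of counters; BlobFileStats and ShouldForceGC are hypothetical names, and how RocksDB selects and aggregates the oldest blob files is not modeled.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical accounting for the oldest blob files under consideration.
struct BlobFileStats {
  uint64_t total_bytes;    // bytes of all blobs
  uint64_t garbage_bytes;  // bytes of blobs that are no longer referenced
};

// True when the garbage ratio meets or exceeds the configured threshold.
// An empty set of files never triggers a forced compaction in this sketch.
inline bool ShouldForceGC(const BlobFileStats& oldest, double force_threshold) {
  if (oldest.total_bytes == 0) {
    return false;
  }
  const double ratio = static_cast<double>(oldest.garbage_bytes) /
                       static_cast<double>(oldest.total_bytes);
  return ratio >= force_threshold;
}
```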

    Public API change

    • Made SystemClock extend the Customizable class and added a CreateFromString method. Implementations need to be registered with the ObjectRegistry and to implement a Name() method in order to be created via this method.
    • Made SliceTransform extend the Customizable class and added a CreateFromString method. Implementations need to be registered with the ObjectRegistry and to implement a Name() method in order to be created via this method. The Capped and Prefixed transform classes return a short name (no length); use GetId for the fully qualified name.
    • Made FileChecksumGenFactory, SstPartitionerFactory, TablePropertiesCollectorFactory, and WalFilter extend the Customizable class and added a CreateFromString method.
    • Some fields of SstFileMetaData are deprecated for compatibility with new base class FileStorageInfo.
    • Add file_temperature to IngestExternalFileArg such that when ingesting SST files, we are able to indicate the temperature of this batch of files.
    • If DB::Close() failed with a non-aborted status, calling DB::Close() again will return the original status instead of Status::OK.
    • Add CacheTier to advanced_options.h to describe the cache tier we used. Add a lowest_used_cache_tier option to DBOptions (immutable) and pass it to BlockBasedTableReader. By default it is CacheTier::kNonVolatileBlockTier, which means we always use both block cache (kVolatileTier) and secondary cache (kNonVolatileBlockTier). By setting it to CacheTier::kVolatileTier, the DB will not use the secondary cache.
    • Even when options.max_compaction_bytes is hit, compaction output files are only cut when it aligns with grandparent files' boundaries. options.max_compaction_bytes could be slightly violated with the change, but the violation is no more than one target SST file size, which is usually much smaller.
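The DB::Close() note above says a repeated call returns the original failure rather than OK. The toy model below illustrates one way such "sticky" status behavior can work; CloseCode and ClosableDB are hypothetical, the scripted results stand in for real close attempts, and allowing a retry after an aborted close is an assumption of this sketch rather than something the note states.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical status codes for this sketch.
enum class CloseCode { kOk, kAborted, kIOError };

class ClosableDB {
 public:
  // `results` scripts what each real close attempt would return.
  explicit ClosableDB(std::vector<CloseCode> results)
      : results_(std::move(results)) {}

  CloseCode Close() {
    // A previous non-aborted outcome is final and is returned as-is.
    if (has_final_) {
      return final_;
    }
    const CloseCode c =
        next_ < results_.size() ? results_[next_++] : CloseCode::kOk;
    // Assumption: an aborted close may be retried, so it is not recorded.
    if (c != CloseCode::kAborted) {
      has_final_ = true;
      final_ = c;
    }
    return c;
  }

 private:
  std::vector<CloseCode> results_;
  std::size_t next_ = 0;
  bool has_final_ = false;
  CloseCode final_ = CloseCode::kOk;
};
```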

    Performance Improvements

    • Improved CPU efficiency of building block-based table (SST) files (#9039 and #9040).

    Java API Changes

    • Add Java API bindings for new integrated BlobDB options.
    • keyMayExist() supports ByteBuffer.
    • Fix multiget throwing Null Pointer Exception for num of keys > 70k (https://github.com/facebook/rocksdb/issues/8039).
  • v6.25.0 Changes

    September 20, 2021

    Bug Fixes

    • Allow secondary instance to refresh iterator. Assign read seq after referencing SuperVersion.
    • Fixed a bug of secondary instance's last_sequence going backward, and reads on the secondary fail to see recent updates from the primary.
    • Fixed a bug that could lead to duplicate DB ID or DB session ID in POSIX environments without /proc/sys/kernel/random/uuid.
    • Fix a race in DumpStats() with column family destruction due to not taking a Ref on each entry while iterating the ColumnFamilySet.
    • Fix a race in item ref counting in LRUCache when promoting an item from the SecondaryCache.
    • Fix a race in BackupEngine if RateLimiter is reconfigured during concurrent Restore operations.
    • Fix a bug on POSIX in which failure to create a lock file (e.g. out of space) can prevent future LockFile attempts in the same process on the same file from succeeding.
    • Fix a bug that backup_rate_limiter and restore_rate_limiter in BackupEngine could not limit read rates.
    • Fix the implementation of prepopulate_block_cache = kFlushOnly to only apply to flushes rather than to all generated files.
    • Fix WAL log data corruption when using DBOptions.manual_wal_flush(true) and WriteOptions.sync(true) together. The sync WAL should work with locked log_write_mutex_.
    • Add checks for validity of the IO uring completion queue entries, and fail the BlockBasedTableReader MultiGet sub-batch if there's an invalid completion.
    • Add an interface RocksDbIOUringEnable() that, if defined by the user, will allow them to enable/disable the use of IO uring by RocksDB.
    • Fix the bug that when direct I/O is used and MultiRead() returns a short result, RandomAccessFileReader::MultiRead() still returns a full-size buffer, with the returned short value together with some data from the original buffer. This bug is unlikely to cause incorrect results, because (1) since the FileSystem layer is expected to retry on short results, returning a short result is only possible when asking for more bytes at the end of the file, which RocksDB doesn't do when using MultiRead(); (2) the checksum is unlikely to match.

    New Features

    • RemoteCompaction's interface now includes db_name, db_id, session_id, which could help the user uniquely identify compaction job between db instances and sessions.
    • Added a ticker statistic, "rocksdb.verify_checksum.read.bytes", reporting how many bytes were read from file to serve VerifyChecksum() and VerifyFileChecksums() queries.
    • Added ticker statistics, "rocksdb.backup.read.bytes" and "rocksdb.backup.write.bytes", reporting how many bytes were read and written during backup.
    • Added properties for BlobDB: rocksdb.num-blob-files, rocksdb.blob-stats, rocksdb.total-blob-file-size, and rocksdb.live-blob-file-size. The existing property rocksdb.estimate-live-data-size was also extended to include live bytes residing in blob files.
    • Added two new RateLimiter IOPriorities: Env::IO_USER and Env::IO_MID. Env::IO_USER will have superior priority over all other RateLimiter IOPriorities without being subject to fair scheduling constraint.
    • SstFileWriter now supports Puts and Deletes with user-defined timestamps. Note that the ingestion logic itself is not timestamp-aware yet.
    • Allow a single write batch to include keys from multiple column families whose timestamps' formats can differ. For example, some column families may disable timestamp, while others enable timestamp.
    • Add compaction priority information in RemoteCompaction, which can be used to schedule high priority job first.
    • Added new callback APIs OnBlobFileCreationStarted, OnBlobFileCreated, and OnBlobFileDeleted in the EventListener class of listener.h. They notify listeners during creation/deletion of individual blob files in Integrated BlobDB. Blob file creation finished events and deletion events are also logged in the LOG file.
    • Batch blob read requests for DB::MultiGet using MultiRead.
    • Add support for fallback to local compaction: the user can return CompactionServiceJobStatus::kUseLocal to instruct RocksDB to run the compaction locally instead of waiting for the remote compaction result.
    • Add built-in rate limiter's implementation of RateLimiter::GetTotalPendingRequest(int64_t* total_pending_requests, const Env::IOPriority pri) for the total number of requests that are pending for bytes in the rate limiter.
    • Charge memory usage during data buffering, from which training samples are gathered for dictionary compression, to block cache. Unbuffering data can now be triggered if the block cache becomes full and strict_capacity_limit=true for the block cache, in addition to existing conditions that can trigger unbuffering.

    Public API change

    • Remove obsolete implementation details FullKey and ParseFullKey from public API.
    • Change SstFileMetaData::size from size_t to uint64_t.
    • Made Statistics extend the Customizable class and added a CreateFromString method. Implementations of Statistics need to be registered with the ObjectRegistry and to implement a Name() method in order to be created via this method.
    • Extended FlushJobInfo and CompactionJobInfo in listener.h to provide information about the blob files generated by a flush/compaction and garbage collected during compaction in Integrated BlobDB. Added struct members blob_file_addition_infos and blob_file_garbage_infos that contain this information.
    • Extended parameter output_file_names of CompactFiles API to also include paths of the blob files generated by the compaction in Integrated BlobDB.
    • Most BackupEngine functions now return IOStatus instead of Status. Most existing code should be compatible with this change but some calls might need to be updated.
    • Add a new field level_at_creation in TablePropertiesCollectorFactory::Context to capture the level at creating the SST file (i.e, table), of which the properties are being collected.

    Miscellaneous

    • Add a paranoid check so that, in case the FileSystem layer doesn't fill the buffer but returns success, the checksum is unlikely to match even if the buffer contains a previous block. The byte modified is not useful anyway, so it isn't expected to change any behavior when FileSystem is satisfying its contract.