RocksDB v6.25.0 Release Notes
Release Date: 2021-09-20
🐛 Bug Fixes
- 👍 Allow secondary instance to refresh iterator. Assign read seq after referencing SuperVersion.
- 🛠 Fixed a bug of secondary instance's last_sequence going backward, and reads on the secondary fail to see recent updates from the primary.
- 🛠 Fixed a bug that could lead to duplicate DB ID or DB session ID in POSIX environments without /proc/sys/kernel/random/uuid.
- 🛠 Fix a race in DumpStats() with column family destruction due to not taking a Ref on each entry while iterating the ColumnFamilySet.
- 🛠 Fix a race in item ref counting in LRUCache when promoting an item from the SecondaryCache.
- 🛠 Fix a race in BackupEngine if RateLimiter is reconfigured during concurrent Restore operations.
- 🛠 Fix a bug on POSIX in which failure to create a lock file (e.g. out of space) can prevent future LockFile attempts in the same process on the same file from succeeding.
- Fix a bug that backup_rate_limiter and restore_rate_limiter in BackupEngine could not limit read rates.
- Fix the implementation of prepopulate_block_cache = kFlushOnly to only apply to flushes rather than to all generated files.
- Fix WAL log data corruption when using DBOptions.manual_wal_flush(true) and WriteOptions.sync(true) together. The sync WAL should work with locked log_write_mutex_.
- ➕ Add checks for validity of the IO uring completion queue entries, and fail the BlockBasedTableReader MultiGet sub-batch if there's an invalid completion
- ➕ Add an interface RocksDbIOUringEnable() that, if defined by the user, will allow them to enable/disable the use of IO uring by RocksDB
- 🛠 Fix a bug in which, when direct I/O is used and MultiRead() returns a short result, RandomAccessFileReader::MultiRead() still returns a full-size buffer containing the short returned data together with some data from the original buffer. This bug is unlikely to cause incorrect results, because (1) since the FileSystem layer is expected to retry on a short result, returning a short result is only possible when asking for more bytes at the end of the file, which RocksDB doesn't do when using MultiRead(); (2) the checksum is unlikely to match.
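The RocksDbIOUringEnable() hook listed above is resolved by name, so opting in or out comes down to defining the function in the application. A minimal sketch, assuming the C-linkage, no-argument, bool-returning signature that RocksDB's POSIX environment code declares as a weak symbol; verify against the headers of the version in use before relying on it.

```cpp
// Sketch: application-side definition of the optional io_uring switch.
// Assumption: RocksDB declares this symbol with C linkage as a weak
// function and consults it only when built with io_uring support.
extern "C" bool RocksDbIOUringEnable() {
  // Return false instead to keep RocksDB on its regular read path.
  return true;
}
```

If the function is not defined anywhere in the program, the release note implies that the user has expressed no preference, and RocksDB decides on its own.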
🆕 New Features
- RemoteCompaction's interface now includes session_id, which could help the user uniquely identify a compaction job between db instances and sessions.
- ➕ Added a ticker statistic, "rocksdb.verify_checksum.read.bytes", reporting how many bytes were read from file to serve
- ➕ Added ticker statistics, "rocksdb.backup.read.bytes" and "rocksdb.backup.write.bytes", reporting how many bytes were read and written during backup.
- ➕ Added properties for BlobDB: rocksdb.live-blob-file-size. The existing property rocksdb.estimate_live-data-size was also extended to include live bytes residing in blob files.
- 👉 Added two new RateLimiter IOPriorities: Env::IO_USER will have superior priority over all other RateLimiter IOPriorities without being subject to fair scheduling constraint.
Deletes with user-defined timestamps. Note that the ingestion logic itself is not timestamp-aware yet.
- 👍 Allow a single write batch to include keys from multiple column families whose timestamps' formats can differ. For example, some column families may disable timestamp, while others enable timestamp.
- ➕ Add compaction priority information in RemoteCompaction, which can be used to schedule high priority job first.
- ➕ Added new callback APIs in the EventListener class of listener.h. They notify listeners during creation/deletion of individual blob files in Integrated BlobDB. RocksDB also logs the blob file creation finished event and the deletion event in the LOG file.
- Batch blob read requests for
- ➕ Add support for fallback to local compaction: the user can return CompactionServiceJobStatus::kUseLocal to instruct RocksDB to run the compaction locally instead of waiting for the remote compaction result.
- Add built-in rate limiter's implementation of RateLimiter::GetTotalPendingRequest(int64_t* total_pending_requests, const Env::IOPriority pri) for the total number of requests that are pending for bytes in the rate limiter.
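The local-compaction fallback described above can be pictured as a small decision made by the user's compaction service. The sketch below is self-contained and does not call RocksDB: the enum is a local stand-in that mirrors the status names (only kUseLocal is taken from the release note), and ChooseStatus and Describe are hypothetical helpers invented for illustration.

```cpp
#include <string>

// Local stand-in mirroring the status names; the real enum lives in
// RocksDB's headers and may differ between versions.
enum class CompactionServiceJobStatus : char { kSuccess, kFailure, kUseLocal };

// Hypothetical policy: hand the job to a remote worker only when one is
// reachable; otherwise ask RocksDB to compact locally.
CompactionServiceJobStatus ChooseStatus(bool worker_reachable, bool submit_ok) {
  if (!worker_reachable) {
    return CompactionServiceJobStatus::kUseLocal;
  }
  return submit_ok ? CompactionServiceJobStatus::kSuccess
                   : CompactionServiceJobStatus::kFailure;
}

// Hypothetical summary of what each status asks the engine to do.
std::string Describe(CompactionServiceJobStatus status) {
  switch (status) {
    case CompactionServiceJobStatus::kSuccess:
      return "wait for remote result";
    case CompactionServiceJobStatus::kUseLocal:
      return "run locally";
    case CompactionServiceJobStatus::kFailure:
      return "report failure";
  }
  return "unknown";
}
```

Returning the fallback status early, before any remote work is queued, keeps the engine from waiting on a worker that will never answer.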
Public API change
- ✂ Remove obsolete implementation details FullKey and ParseFullKey from public API
- Made Statistics extend the Customizable class and added a CreateFromString method. Implementations of Statistics need to be registered with the ObjectRegistry and to implement a Name() method in order to be created via this method.
CompactionJobInfo in listener.h to provide information about the blob files generated by a flush/compaction and garbage collected during compaction in Integrated BlobDB. Added struct members blob_file_garbage_infos that contain this information.
- Extended parameter CompactFiles API to also include paths of the blob files generated by the compaction in Integrated BlobDB.
- ⚡️ Most BackupEngine functions now return Status. Most existing code should be compatible with this change but some calls might need to be updated.
- Add a new field TablePropertiesCollectorFactory::Context to capture the level at creating the SST file (i.e., table), of which the properties are being collected.
- ➕ Add a paranoid check so that, in case the FileSystem layer doesn't fill the buffer but returns success, the checksum is unlikely to match even if the buffer contains a previous block. The byte modified is not useful anyway, so it isn't expected to change any behavior when the FileSystem is satisfying its contract.
Previous changes from v6.24.0
🐛 Bug Fixes
- If the primary's CURRENT file is missing or inaccessible, the secondary instance should not hang repeatedly trying to switch to a new MANIFEST. It should instead return the error code encountered while accessing the file.
- 🔀 Restoring backups with BackupEngine is now a logically atomic operation, so that if a restore operation is interrupted, DB::Open on it will fail. Using BackupEngineOptions::sync (default) ensures atomicity even in case of power loss or OS crash.
- 🛠 Fixed a race related to the destruction of ColumnFamilyData objects. The earlier logic unlocked the DB mutex before destroying the thread-local SuperVersion pointers, which could result in a process crash if another thread managed to get a reference to the
- ✂ Removed a call to RenameFile() on a non-existent info log file ("LOG") when opening a new DB. Such a call was guaranteed to fail, though it did not impact applications since we swallowed the error. We have now also stopped swallowing errors when renaming the "LOG" file.
- 🛠 Fixed an issue where OnFlushCompleted was not called for atomic flush.
- 🛠 Fixed a bug affecting the batched MultiGet API when used with keys spanning multiple column families and sorted_input == false.
- 🛠 Fixed a potential incorrect result in opt mode and assertion failures caused by releasing snapshot(s) during compaction.
- 🛠 Fixed passing of BlobFileCompletionCallback to the Compaction job and Atomic flush job, which previously received the default parameter (nullptr). BlobFileCompletionCallback is an internal callback that manages the addition of blob files to SSTFileManager.
- Fixed MultiGet not updating the block_read_count and block_read_byte PerfContext counters.
🆕 New Features
- Made the EventListener extend the Customizable class.
- EventListeners that have a non-empty Name() and that are registered with the ObjectRegistry can now be serialized to/from the OPTIONS file.
- Insert warm blocks (data blocks, uncompressed dict blocks, index and filter blocks) in Block cache during flush under option BlockBasedTableOptions.prepopulate_block_cache. Previously it was enabled for only data blocks.
- BlockBasedTableOptions.prepopulate_block_cache can be dynamically configured using DB::SetOptions.
- Add CompactionOptionsFIFO.age_for_warm, which allows RocksDB to move old files to warm tier in FIFO compactions. Note that file temperature is still an experimental feature.
- ➕ Add a comment to suggest btrfs user to disable file preallocation by setting
- Fast forward option in Trace replay changed to double type to allow replaying at a lower speed, by setting the value between 0 and 1. This option can be set via Replayer::Replay(), or via
- ➕ Add property LiveSstFilesSizeAtTemperature to retrieve sst file size at different temperature.
- ➕ Added a stat rocksdb.secondary.cache.hits.
- Added a PerfContext counter secondary_cache_hit_count.
- The integrated BlobDB implementation now supports the tickers BLOB_DB_GC_BYTES_RELOCATED, as well as the histograms
- Added hybrid configuration of Ribbon filter and Bloom filter where some LSM levels use Ribbon for memory space efficiency and some use Bloom for speed. See NewRibbonFilterPolicy. This also changes the default behavior of NewRibbonFilterPolicy to use Bloom for flushes under Leveled and Universal compaction and Ribbon otherwise. The C API function rocksdb_filterpolicy_create_ribbon is unchanged but adds new
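To make the fractional fast-forward setting from the list above concrete, the sketch below shows one way a replayer could scale the timing of trace records: divide each record's offset from the start of the trace by the fast-forward factor. ScaledOffset is a hypothetical helper, not a RocksDB function, and the arithmetic is an assumption about how such scaling is typically done.

```cpp
#include <chrono>
#include <cmath>
#include <cstdint>

// Hypothetical helper: when, relative to the start of replay, a record
// should be issued, given its offset in the original trace.
std::chrono::microseconds ScaledOffset(std::chrono::microseconds original_offset,
                                       double fast_forward) {
  // Treat non-positive factors as "no scaling" to avoid dividing by zero.
  if (fast_forward <= 0.0) {
    return original_offset;
  }
  // fast_forward > 1 compresses time; 0 < fast_forward < 1 stretches it.
  const double scaled =
      static_cast<double>(original_offset.count()) / fast_forward;
  return std::chrono::microseconds(
      static_cast<std::int64_t>(std::llround(scaled)));
}
```

With an integer-typed option only whole-number speedups could be expressed; a value such as 0.5 doubles every offset and so replays the trace at half speed.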
Public API change
- Added APIs to decode and replay trace file via Replayer class. Added DB::NewDefaultReplayer() to create a default Replayer instance. Added TraceReader::Reset() to restart reading a trace file. Created trace_record.h, trace_record_result.h and utilities/replayer.h files to access the decoded Trace records, replay them, and query the actual operation results.
- ➕ Added Configurable::GetOptionsMap to the public API for use in creating new Customizable classes.
- Generalized bits_per_key parameters in C API from int to double for greater configurability. Although this is a compatible change for existing C source code, anything depending on C API signatures, such as foreign function interfaces, will need to be updated.
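The source-compatibility remark about bits_per_key can be illustrated without RocksDB. In the sketch below both functions are hypothetical stand-ins for a C API entry point before and after the change; they simply return the value the callee would receive.

```cpp
// Hypothetical "before" signature: fractional settings cannot be expressed.
extern "C" int create_filter_old(int bits_per_key) { return bits_per_key; }

// Hypothetical "after" signature: existing integer call sites still compile
// because the argument converts implicitly from int to double.
extern "C" double create_filter_new(double bits_per_key) { return bits_per_key; }

// An existing call site written against the old API, unchanged.
inline double LegacyCallSite() { return create_filter_new(10); }

// A new call site using a fractional value.
inline double FractionalCallSite() { return create_filter_new(9.9); }
```

A foreign function interface binding gets no such implicit conversion: it encodes the parameter type in its own declaration, which is why bindings have to be regenerated or edited by hand.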
🐎 Performance Improvements
- ⚡️ Try to avoid updating DBOptions if SetDBOptions() does not change any option value.
- StringAppendOperator additionally accepts a string as the delimiter.
- BackupEngineOptions::sync (default true) now applies to restoring backups in addition to creating backups. This could slow down restores, but ensures they are fully persisted before returning OK. (Consider increasing max_background_operations to improve performance.)