RocksDB v6.25.0 Release Notes

Release Date: 2021-09-20
  • Bug Fixes

    • Allow secondary instance to refresh iterator. Assign read seq after referencing SuperVersion.
    • Fixed a bug in which a secondary instance's last_sequence could go backward, causing reads on the secondary to miss recent updates from the primary.
    • Fixed a bug that could lead to duplicate DB ID or DB session ID in POSIX environments without /proc/sys/kernel/random/uuid.
    • Fix a race in DumpStats() with column family destruction due to not taking a Ref on each entry while iterating the ColumnFamilySet.
    • Fix a race in item ref counting in LRUCache when promoting an item from the SecondaryCache.
    • Fix a race in BackupEngine if RateLimiter is reconfigured during concurrent Restore operations.
    • Fix a bug on POSIX in which failure to create a lock file (e.g. out of space) can prevent future LockFile attempts in the same process on the same file from succeeding.
    • Fix a bug in which backup_rate_limiter and restore_rate_limiter in BackupEngine did not limit read rates.
    • Fix the implementation of prepopulate_block_cache = kFlushOnly to only apply to flushes rather than to all generated files.
    • Fix WAL data corruption when DBOptions.manual_wal_flush(true) and WriteOptions.sync(true) are used together. Syncing the WAL should be done with log_write_mutex_ held.
    • Add checks for validity of the IO uring completion queue entries, and fail the BlockBasedTableReader MultiGet sub-batch if there's an invalid completion.
    • Add an interface RocksDbIOUringEnable() that, if defined by the user, will allow them to enable/disable the use of IO uring by RocksDB.
    • Fix a bug in which, when direct I/O is used and MultiRead() returns a short result, RandomAccessFileReader::MultiRead() still returns a full-size buffer containing the short result together with leftover data from the original buffer. This bug is unlikely to cause incorrect results, because (1) since the FileSystem layer is expected to retry on a short result, a short result is only possible when asking for more bytes than remain at the end of the file, which RocksDB doesn't do when using MultiRead(); and (2) the checksum is unlikely to match.
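
The short-read fix in the last item can be pictured with a small standalone model. This is only a sketch, not RocksDB code: ReadRequest, FinalizeRead, and their fields are hypothetical names invented for illustration. It shows the general idea that the result handed back to a caller should cover only the bytes actually read, not the full requested length.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical model of a single read request (not a RocksDB type).
struct ReadRequest {
  std::size_t requested_len = 0;  // bytes the caller asked for
  std::size_t bytes_read = 0;     // bytes the file system actually returned
  std::string scratch;            // buffer sized for the full request
};

// Returns only the bytes that were actually read, so stale data left in the
// tail of the buffer is never exposed as part of the result.
inline std::string FinalizeRead(const ReadRequest& req) {
  // The buffer must be able to hold everything reported as read.
  assert(req.scratch.size() >= req.bytes_read);
  const std::size_t valid = std::min(req.bytes_read, req.requested_len);
  return req.scratch.substr(0, valid);
}
```

In this model, a request for 8 bytes that comes back with 3 yields a 3-byte result, regardless of what the remaining 5 bytes of the buffer hold.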

    New Features

    • RemoteCompaction's interface now includes db_name, db_id, session_id, which could help the user uniquely identify a compaction job across DB instances and sessions.
    • Added a ticker statistic, "rocksdb.verify_checksum.read.bytes", reporting how many bytes were read from file to serve VerifyChecksum() and VerifyFileChecksums() queries.
    • Added ticker statistics, "rocksdb.backup.read.bytes" and "rocksdb.backup.write.bytes", reporting how many bytes were read and written during backup.
    • Added properties for BlobDB: rocksdb.num-blob-files, rocksdb.blob-stats, rocksdb.total-blob-file-size, and rocksdb.live-blob-file-size. The existing property rocksdb.estimate-live-data-size was also extended to include live bytes residing in blob files.
    • Added two new RateLimiter IOPriorities: Env::IO_USER and Env::IO_MID. Env::IO_USER has superior priority over all other RateLimiter IOPriorities and is not subject to the fair scheduling constraint.
    • SstFileWriter now supports Puts and Deletes with user-defined timestamps. Note that the ingestion logic itself is not timestamp-aware yet.
    • Allow a single write batch to include keys from multiple column families whose timestamps' formats can differ. For example, some column families may disable timestamp, while others enable timestamp.
    • Add compaction priority information in RemoteCompaction, which can be used to schedule high priority jobs first.
    • Added new callback APIs OnBlobFileCreationStarted, OnBlobFileCreated and OnBlobFileDeleted in the EventListener class of listener.h. They notify listeners during creation/deletion of individual blob files in Integrated BlobDB. The blob file creation finished event and the deletion event are also logged in the LOG file.
    • Batch blob read requests for DB::MultiGet using MultiRead.
    • Add support for fallback to local compaction: the user can return CompactionServiceJobStatus::kUseLocal to instruct RocksDB to run the compaction locally instead of waiting for the remote compaction result.
    • Add the built-in rate limiter's implementation of RateLimiter::GetTotalPendingRequest(int64_t* total_pending_requests, const Env::IOPriority pri), which reports the total number of requests that are pending for bytes in the rate limiter.
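
The local-compaction fallback described above can be sketched as a small dispatch routine. This is a standalone model under stated assumptions, not the RocksDB CompactionService interface: JobStatus is a hypothetical stand-in for CompactionServiceJobStatus (only kUseLocal is named in these notes; the other values are assumed), and RanOn and RunCompaction are invented for illustration.

```cpp
#include <cassert>
#include <functional>

// Hypothetical stand-in for CompactionServiceJobStatus. Only kUseLocal is
// named in the release notes; kSuccess and kFailure are assumptions.
enum class JobStatus { kSuccess, kFailure, kUseLocal };

// Where the compaction ended up running in this model.
enum class RanOn { kRemote, kLocal, kNowhere };

// Asks the remote service first and runs the job locally only when the
// service answers kUseLocal. A plain failure is not retried locally here.
inline RanOn RunCompaction(const std::function<JobStatus()>& remote,
                           const std::function<bool()>& local) {
  assert(remote && local);  // both callbacks must be set
  switch (remote()) {
    case JobStatus::kSuccess:
      return RanOn::kRemote;
    case JobStatus::kUseLocal:
      return local() ? RanOn::kLocal : RanOn::kNowhere;
    case JobStatus::kFailure:
      break;
  }
  return RanOn::kNowhere;
}
```

The point of the design is that the service itself decides when to hand the job back, for example when no remote worker is available, so the caller does not have to wait for a remote result that will not arrive.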

    Public API change

    • Remove obsolete implementation details FullKey and ParseFullKey from public API.
    • Change SstFileMetaData::size from size_t to uint64_t.
    • Made Statistics extend the Customizable class and added a CreateFromString method. Implementations of Statistics need to be registered with the ObjectRegistry and to implement a Name() method in order to be created via this method.
    • Extended FlushJobInfo and CompactionJobInfo in listener.h to provide information about the blob files generated by a flush/compaction and garbage collected during compaction in Integrated BlobDB. Added struct members blob_file_addition_infos and blob_file_garbage_infos that contain this information.
    • Extended parameter output_file_names of CompactFiles API to also include paths of the blob files generated by the compaction in Integrated BlobDB.
    • Most BackupEngine functions now return IOStatus instead of Status. Most existing code should be compatible with this change but some calls might need to be updated.
    • Add a new field level_at_creation in TablePropertiesCollectorFactory::Context to capture the level at which the SST file (i.e., table) whose properties are being collected is created.

    Miscellaneous

    • Add a paranoid check for the case where the FileSystem layer doesn't fill the buffer but reports success: a byte in the buffer is modified so that the checksum is unlikely to match even if the buffer contains a previous block. The modified byte is not useful anyway, so this isn't expected to change any behavior when the FileSystem is satisfying its contract.

Previous changes from v6.24.0

  • Bug Fixes

    • If the primary's CURRENT file is missing or inaccessible, the secondary instance should not hang repeatedly trying to switch to a new MANIFEST. It should instead return the error code encountered while accessing the file.
    • Restoring backups with BackupEngine is now a logically atomic operation, so that if a restore operation is interrupted, DB::Open on it will fail. Using BackupEngineOptions::sync (default) ensures atomicity even in case of power loss or OS crash.
    • Fixed a race related to the destruction of ColumnFamilyData objects. The earlier logic unlocked the DB mutex before destroying the thread-local SuperVersion pointers, which could result in a process crash if another thread managed to get a reference to the ColumnFamilyData object.
    • Removed a call to RenameFile() on a non-existent info log file ("LOG") when opening a new DB. Such a call was guaranteed to fail, though it did not impact applications since the error was swallowed. Errors in renaming the "LOG" file are no longer swallowed.
    • Fixed an issue where OnFlushCompleted was not called for atomic flush.
    • Fixed a bug affecting the batched MultiGet API when used with keys spanning multiple column families and sorted_input == false.
    • Fixed a potential incorrect result in opt mode and assertion failures caused by releasing snapshot(s) during compaction.
    • Fixed passing of BlobFileCompletionCallback to the Compaction job and Atomic flush job, which previously received the default parameter (nullptr). BlobFileCompletionCallback is an internal callback that manages addition of blob files to SSTFileManager.
    • Fixed MultiGet not updating the block_read_count and block_read_byte PerfContext counters.

    New Features

    • Made the EventListener extend the Customizable class.
    • EventListeners that have a non-empty Name() and that are registered with the ObjectRegistry can now be serialized to/from the OPTIONS file.
    • Insert warm blocks (data blocks, uncompressed dict blocks, index and filter blocks) in Block cache during flush under option BlockBasedTableOptions.prepopulate_block_cache. Previously it was enabled for only data blocks.
    • BlockBasedTableOptions.prepopulate_block_cache can be dynamically configured using DB::SetOptions.
    • Add CompactionOptionsFIFO.age_for_warm, which allows RocksDB to move old files to warm tier in FIFO compactions. Note that file temperature is still an experimental feature.
    • Add a comment to suggest that btrfs users disable file preallocation by setting options.allow_fallocate=false.
    • The fast forward option in Trace replay changed to double type to allow replaying at a lower speed, by setting the value between 0 and 1. This option can be set via ReplayOptions in Replayer::Replay(), or via --trace_replay_fast_forward in db_bench.
    • Add property LiveSstFilesSizeAtTemperature to retrieve SST file size at different temperatures.
    • Added a stat rocksdb.secondary.cache.hits.
    • Added a PerfContext counter secondary_cache_hit_count.
    • The integrated BlobDB implementation now supports the tickers BLOB_DB_BLOB_FILE_BYTES_READ, BLOB_DB_GC_NUM_KEYS_RELOCATED, and BLOB_DB_GC_BYTES_RELOCATED, as well as the histograms BLOB_DB_COMPRESSION_MICROS and BLOB_DB_DECOMPRESSION_MICROS.
    • Added hybrid configuration of Ribbon filter and Bloom filter where some LSM levels use Ribbon for memory space efficiency and some use Bloom for speed. See NewRibbonFilterPolicy. This also changes the default behavior of NewRibbonFilterPolicy to use Bloom for flushes under Leveled and Universal compaction and Ribbon otherwise. The C API function rocksdb_filterpolicy_create_ribbon is unchanged, and a new function rocksdb_filterpolicy_create_ribbon_hybrid is added.
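
The hybrid Ribbon/Bloom behavior in the last item can be illustrated with a toy selection rule. This is a minimal sketch under assumptions, not the NewRibbonFilterPolicy implementation: FilterKind, kFlushLevel, ChooseFilter, and the bloom_before_level threshold are hypothetical names, flush output is modeled as level -1, and the compaction style is ignored.

```cpp
#include <cassert>

enum class FilterKind { kBloom, kRibbon };

// Level value this sketch assigns to files written by a flush (assumption).
constexpr int kFlushLevel = -1;

// Hypothetical rule: use Bloom (faster to build) for levels below the
// threshold and Ribbon (more memory-efficient) for the rest.
inline FilterKind ChooseFilter(int level, int bloom_before_level) {
  assert(level >= kFlushLevel);  // no level below the flush pseudo-level
  return level < bloom_before_level ? FilterKind::kBloom
                                    : FilterKind::kRibbon;
}
```

With a threshold of 0, this model reproduces the default described above: flush output gets Bloom and everything written by compaction gets Ribbon. Raising the threshold moves more of the upper levels to Bloom.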

    Public API change

    • Added APIs to decode and replay trace file via Replayer class. Added DB::NewDefaultReplayer() to create a default Replayer instance. Added TraceReader::Reset() to restart reading a trace file. Created trace_record.h, trace_record_result.h and utilities/replayer.h files to access the decoded Trace records, replay them, and query the actual operation results.
    • Added Configurable::GetOptionsMap to the public API for use in creating new Customizable classes.
    • Generalized bits_per_key parameters in C API from int to double for greater configurability. Although this is a compatible change for existing C source code, anything depending on C API signatures, such as foreign function interfaces, will need to be updated.

    Performance Improvements

    • Try to avoid updating DBOptions if SetDBOptions() does not change any option value.

    Behavior Changes

    • StringAppendOperator additionally accepts a string as the delimiter.
    • BackupEngineOptions::sync (default true) now applies to restoring backups in addition to creating backups. This could slow down restores, but ensures they are fully persisted before returning OK. (Consider increasing max_background_operations to improve performance.)