RocksDB v7.8.0 Release Notes
🆕 New Features
- 👍 DeleteRange() now supports user-defined timestamp.
- 👍 Provide support for async_io with tailing iterators when ReadOptions.tailing is enabled during scans.
- Tiered Storage: allow data moving up from the last level to the penultimate level if the input level is penultimate level or above.
- ➕ Added DB::Properties::kFastBlockCacheEntryStats, which is similar to DB::Properties::kBlockCacheEntryStats, except that it returns cached (stale) values in more cases to reduce overhead.
- 👍 FIFO compaction now supports migrating from a multi-level DB via DB::Open(). During the migration phase, the FIFO compaction picker will:
  - pick the SST file with the smallest starting key in the bottom-most non-empty level.
  - Note that during the migration phase, the file purge order will only be an approximation of "FIFO", as files in lower levels might sometimes contain newer keys than files in upper levels.
- Added an option ignore_max_compaction_bytes_for_input to ignore max_compaction_bytes limit when adding files to be compacted from input level. This should help reduce write amplification. The option is enabled by default.
- Tiered Storage: allow data moving up from the last level even if it's a last level only compaction, as long as the penultimate level is empty.
- Add a new option IOOptions.do_not_recurse that can be used by underlying file systems to skip recursing through subdirectories and list only files in the GetChildren API.
- Add option preserve_internal_time_seconds to preserve the time information for the latest data, which can be used to determine the age of data when preclude_last_level_data_seconds is enabled. The time information is attached to the SST in the table property rocksdb.seqno.time.map, which can be parsed by the ldb or sst_dump tools.
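The FIFO migration rule in the list above (pick the file with the smallest starting key in the bottom-most non-empty level) can be illustrated with a small self-contained model. This is only a sketch of the rule as worded in these notes, not RocksDB's picker: `ToyFile`, `ToyLevels`, and `PickMigrationFile` are hypothetical names, and the sketch assumes the migration phase ends once only L0 holds files.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <vector>

// Toy stand-in for an SST file's metadata; not a RocksDB type.
struct ToyFile {
  std::string name;
  std::string smallest_key;  // starting key of the file
};

// levels[0] models L0; levels.back() models the bottom-most level.
using ToyLevels = std::vector<std::vector<ToyFile>>;

// Returns the name of the file to purge next during migration, or
// std::nullopt when every level below L0 is empty (assumption: at that
// point the migration phase is over and regular FIFO logic applies).
std::optional<std::string> PickMigrationFile(const ToyLevels& levels) {
  // Walk from the bottom-most level up to L1, skipping empty levels.
  for (std::size_t i = levels.size(); i-- > 1;) {
    const std::vector<ToyFile>& level = levels[i];
    if (level.empty()) continue;
    // Within the first non-empty level, take the smallest starting key.
    const ToyFile* best = &level.front();
    for (const ToyFile& f : level) {
      if (f.smallest_key < best->smallest_key) best = &f;
    }
    return best->name;
  }
  return std::nullopt;
}
```

Because the choice is driven by key order within a level rather than by file age, the purge order in this model is only approximately "FIFO", which matches the caveat in the note.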
🐛 Bug Fixes
- Fix a bug in the use of io_uring_prep_cancel in the AbortIO API for posix: the call expects sqe->addr to match the submitted read request, but the wrong parameter was being passed.
- 🛠 Fixed a regression in iterator performance when the entire DB is a single memtable introduced in #10449. The fix is in #10705 and #10716.
- 🛠 Fixed an optimistic transaction validation bug caused by DBImpl::GetLatestSequenceForKey() returning non-latest seq for merge (#10724).
- 🛠 Fixed a bug in iterator refresh which could segfault for DeleteRange users (#10739).
- 🛠 Fixed a bug causing manual flush with flush_opts.wait=false to stall when database has stopped all writes (#10001).
- 🛠 Fixed a bug in iterator refresh that was not freeing up SuperVersion, which could cause excessive resource pinning (#10770).
- Fixed a bug where RocksDB could be doing compaction endlessly when allow_ingest_behind is true and the bottommost level is not filled (#10767).
- 🛠 Fixed a memory safety bug in experimental HyperClockCache (#10768)
- Fixed some cases where ldb update_manifest and ldb unsafe_remove_sst_file are not usable because they were requiring the DB files to match the existing manifest state (before updating the manifest to match a desired state).
🐎 Performance Improvements
- Try to align the compaction output file boundaries to the next level ones, which can reduce more than 10% compaction load for the default level compaction. The feature is enabled by default; to disable it, set AdvancedColumnFamilyOptions.level_compaction_dynamic_file_size to false. As a side effect, it can create SSTs larger than the target_file_size (capped at 2x target_file_size) or smaller files.
- 👌 Improve RoundRobin TTL compaction, which is going to be the same as normal RoundRobin compaction to move the compaction cursor.
- 🛠 Fix a small CPU regression caused by making UserComparatorWrapper Customizable, since Customizable itself has a small CPU overhead for initialization.
Behavior Changes
- Sanitize min_write_buffer_number_to_merge to 1 if atomic flush is enabled to prevent unexpected data loss when WAL is disabled in a multi-column-family setting (#10773).
- The periodic stats dumper wakes up every options.stats_dump_period_sec seconds. It won't dump stats for a CF if the CF has had no change in the period, unless 7 periods have been skipped.
- Only the periodic stats dumper triggered by options.stats_dump_period_sec will update the stats interval. Dumps triggered by DB::GetProperty() will not update the stats interval and will report based on the interval since the last periodic stats dump.
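The skip rule for the periodic stats dumper described above can be sketched as a small per-column-family gate. This is a toy model based only on the wording of the note, not RocksDB's implementation: `ToyStatsDumpGate` and its members are hypothetical, and reading "unless 7 periods have been skipped" as "dump on the wake-up after 7 consecutive skips" is an assumption.

```cpp
#include <cassert>

// Toy model of the per-column-family skip rule; not RocksDB code.
class ToyStatsDumpGate {
 public:
  // max_skipped_periods defaults to 7, the figure quoted in the note.
  explicit ToyStatsDumpGate(int max_skipped_periods = 7)
      : max_skipped_(max_skipped_periods) {
    assert(max_skipped_periods >= 0);
  }

  // Called once per periodic wake-up. Returns true if stats should be
  // dumped for this column family in this period.
  bool OnPeriodicWakeUp(bool cf_changed_in_period) {
    if (cf_changed_in_period || skipped_ >= max_skipped_) {
      skipped_ = 0;  // a dump resets the run of skipped periods
      return true;
    }
    ++skipped_;  // nothing changed: stay quiet for this period
    return false;
  }

  int skipped_periods() const { return skipped_; }

 private:
  int max_skipped_;
  int skipped_ = 0;
};
```

Under these assumptions an idle column family stays silent for 7 wake-ups and is dumped on the next one, so it still appears in the log at a bounded interval.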
Public API changes
- 🐎 Make kXXH3 checksum the new default, because it is faster on common hardware, especially with kCRC32c affected by a performance bug in some versions of clang (https://github.com/facebook/rocksdb/issues/9891). DBs written with this new setting can be read by RocksDB 6.27 and newer.
- Refactor the classes, APIs and data structures for block cache tracing to allow a user-provided trace writer to be used. Introduced an abstract BlockCacheTraceWriter class that takes a structured BlockCacheTraceRecord. The BlockCacheTraceWriter implementation can then format and log the record in whatever way it sees fit. The default BlockCacheTraceWriterImpl does file tracing using a user-provided TraceWriter. More details in include/rocksdb/block_cache_trace_writer.h.
- 👍