Changelog History
-
v6.15.0
Bug Fixes
- Fixed a bug in the following combination of features: indexes with user keys (format_version >= 3), indexes are partitioned (index_type == kTwoLevelIndexSearch), and some index partitions are pinned in memory (BlockBasedTableOptions::pin_l0_filter_and_index_blocks_in_cache). The bug could cause keys to be truncated when read from the index, leading to wrong read results or other unexpected behavior.
- Fixed a bug when indexes are partitioned (index_type == kTwoLevelIndexSearch), some index partitions are pinned in memory (BlockBasedTableOptions::pin_l0_filter_and_index_blocks_in_cache), and partition reads could be mixed between block cache and directly from the file (e.g., with enable_index_compression == 1 and mmap_read == 1, partitions that were stored uncompressed due to poor compression ratio would be read directly from the file via mmap, while partitions that were stored compressed would be read from block cache). The bug could cause index partitions to be mistakenly considered empty during reads, leading to wrong read results.
- Since 6.12, memtable lookup should report unrecognized value_type as corruption (#7121).
- Since 6.14, fix false positive flush/compaction Status::Corruption failure when paranoid_file_checks == true and range tombstones were written to the compaction output files.
- Since 6.14, fix a bug that could cause a stalled write to crash with a mix of slowdown and no_slowdown writes (WriteOptions.no_slowdown=true).
- Fixed a bug which causes a hang when closing the DB if refit level is set in an opt build. The cause was that ContinueBackgroundWork() was called inside an assert statement, which is a no-op in such builds. It was introduced in 6.14.
- Fixed a bug which causes Get() to return an incorrect result when a key's merge operand is applied twice. This can occur if the thread performing Get() runs concurrently with a background flush thread and another thread writing to the MANIFEST file (PR6069).
- Reverted a behavior change silently introduced in 6.14.2, in which the effects of the ignore_unknown_options flag (used in option parsing/loading functions) changed.
- Reverted a behavior change silently introduced in 6.14, in which options parsing/loading functions began returning NotFound instead of InvalidArgument for option names not available in the present version.
- Fixed MultiGet bugs in which it did not return valid data with user-defined timestamps.
- Fixed a potential bug caused by evaluating TableBuilder::NeedCompact() before TableBuilder::Finish() in compaction job. For example, the NeedCompact() method of CompactOnDeletionCollector returned by built-in CompactOnDeletionCollectorFactory requires BlockBasedTable::Finish() to return the correct result. The bug can cause a compaction-generated file not to be marked for future compaction based on deletion ratio.
- Fixed a seek issue with prefix extractor and timestamp.
- Fixed a bug of encoding and parsing BlockBasedTableOptions::read_amp_bytes_per_bit as a 64-bit integer.
- Fixed a bug of a recovery corner case, details in PR7621.
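The hang fix above (a required call placed inside an assert) can be illustrated with a minimal, self-contained sketch. It does not use RocksDB: FakeDB, DEBUG_ASSERT, OPT_BUILD, RefitBuggy, and RefitFixed are hypothetical names chosen for illustration, and DEBUG_ASSERT merely imitates how the standard assert expands to nothing when NDEBUG is defined.

```cpp
#include <cassert>

// Hypothetical stand-in for an optimized build, in which assert-style
// macros expand to nothing (as the standard assert does under NDEBUG).
#define OPT_BUILD 1

#if OPT_BUILD
#define DEBUG_ASSERT(expr) ((void)0)
#else
#define DEBUG_ASSERT(expr) assert(expr)
#endif

// Hypothetical model of a DB whose background work can be paused and resumed.
struct FakeDB {
  bool paused = false;
  void PauseBackgroundWork() { paused = true; }
  bool ContinueBackgroundWork() {
    paused = false;
    return true;  // reports success
  }
};

// Buggy pattern: the resume call lives inside the assertion, so it is
// never executed once the macro compiles away.
inline bool RefitBuggy(FakeDB& db) {
  db.PauseBackgroundWork();
  DEBUG_ASSERT(db.ContinueBackgroundWork());
  return db.paused;  // true means background work is still paused
}

// Fixed pattern: perform the call unconditionally, then assert on its result.
inline bool RefitFixed(FakeDB& db) {
  db.PauseBackgroundWork();
  bool ok = db.ContinueBackgroundWork();
  DEBUG_ASSERT(ok);
  (void)ok;  // avoids an unused-variable warning when the macro is empty
  return db.paused;
}
```

With OPT_BUILD set to 1, RefitBuggy leaves the model paused while RefitFixed resumes it. A DB that stays paused is the kind of state that could make a later close wait indefinitely.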
Public API Change
- Deprecate BlockBasedTableOptions::pin_l0_filter_and_index_blocks_in_cache and BlockBasedTableOptions::pin_top_level_index_and_filter. These options still take effect until users migrate to the replacement APIs in BlockBasedTableOptions::metadata_cache_options. Migration guidance can be found in the API comments on the deprecated options.
- Add new API DB::VerifyFileChecksums to verify SST file checksum with corresponding entries in the MANIFEST if present. Current implementation requires scanning and recomputing file checksums.
- Added a new option track_and_verify_wals_in_manifest. If true, the log numbers and sizes of the synced WALs are tracked in MANIFEST, then during DB recovery, if a synced WAL is missing from disk, or the WAL's size does not match the recorded size in MANIFEST, an error will be reported and the recovery will be aborted. Note that this option does not work with secondary instance.
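The recovery check described for track_and_verify_wals_in_manifest can be sketched as a simplified model. This is not RocksDB code: WalSizes and VerifyWals are hypothetical names, and the sketch only mirrors the rule stated above (a recorded WAL must exist on disk with the recorded size).

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

// Hypothetical record of synced WALs: log number -> size in bytes.
using WalSizes = std::map<uint64_t, uint64_t>;

// Returns an empty string when every recorded WAL is present on disk with
// the recorded size; otherwise returns a message describing the first problem.
inline std::string VerifyWals(const WalSizes& recorded, const WalSizes& on_disk) {
  for (const auto& entry : recorded) {
    const auto found = on_disk.find(entry.first);
    if (found == on_disk.end()) {
      return "missing WAL " + std::to_string(entry.first);
    }
    if (found->second != entry.second) {
      return "size mismatch for WAL " + std::to_string(entry.first);
    }
  }
  return "";
}
```

In this model a non-empty result corresponds to the case where recovery would be aborted with an error.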
Behavior Changes
- The dictionary compression settings specified in ColumnFamilyOptions::compression_opts now additionally affect files generated by flush and compaction to non-bottommost level. Previously those settings at most affected files generated by compaction to bottommost level, depending on whether ColumnFamilyOptions::bottommost_compression_opts overrode them. Users who relied on dictionary compression settings in ColumnFamilyOptions::compression_opts affecting only the bottommost level can keep the behavior by moving their dictionary settings to ColumnFamilyOptions::bottommost_compression_opts and setting its enabled flag.
- When the enabled flag is set in ColumnFamilyOptions::bottommost_compression_opts, those compression options now take effect regardless of the value in ColumnFamilyOptions::bottommost_compression. Previously, those compression options only took effect when ColumnFamilyOptions::bottommost_compression != kDisableCompressionOption. Now, they additionally take effect when ColumnFamilyOptions::bottommost_compression == kDisableCompressionOption (such a setting causes bottommost compression type to fall back to ColumnFamilyOptions::compression_per_level if configured, and otherwise fall back to ColumnFamilyOptions::compression).
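The fallback rules in the two items above can be sketched with a small model. The types below are simplified, hypothetical stand-ins that borrow the option names from the text. The per-level lookup (clamping to the last configured entry) is an assumption made for illustration and may differ from what RocksDB actually does.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Simplified, hypothetical stand-ins for the options named above.
enum class Compression { kNoCompression, kSnappy, kZSTD, kDisableCompressionOption };

struct CompressionOpts {
  bool enabled = false;
  unsigned max_dict_bytes = 0;
};

struct CFOptions {
  Compression compression = Compression::kSnappy;
  std::vector<Compression> compression_per_level;  // empty means "not configured"
  Compression bottommost_compression = Compression::kDisableCompressionOption;
  CompressionOpts compression_opts;
  CompressionOpts bottommost_compression_opts;
};

// Compression type for the bottommost level: an explicit setting wins,
// then compression_per_level if configured, then compression.
inline Compression BottommostCompressionType(const CFOptions& o, std::size_t level) {
  if (o.bottommost_compression != Compression::kDisableCompressionOption) {
    return o.bottommost_compression;
  }
  if (!o.compression_per_level.empty()) {
    // Assumption: clamp to the last configured level.
    const std::size_t last = o.compression_per_level.size() - 1;
    return o.compression_per_level[level < last ? level : last];
  }
  return o.compression;
}

// Compression options for the bottommost level: the enabled flag alone
// decides, independent of bottommost_compression.
inline const CompressionOpts& BottommostCompressionOpts(const CFOptions& o) {
  return o.bottommost_compression_opts.enabled ? o.bottommost_compression_opts
                                               : o.compression_opts;
}
```

The point of the second function is that it never consults bottommost_compression, which matches the new behavior described above.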
New Features
- An EXPERIMENTAL new Bloom alternative that saves about 30% space compared to Bloom filters, with about 3-4x construction time and similar query times, is available using NewExperimentalRibbonFilterPolicy.
-
v6.14.6
December 01, 2020
6.14.6 (12/01/2020)
Bug Fixes
- Truncated WALs ending in incomplete records can no longer produce gaps in the recovered data when WALRecoveryMode::kPointInTimeRecovery is used. Gaps are still possible when WALs are truncated exactly on record boundaries.
-
v6.14.5
November 17, 2020
6.14.5 (11/15/2020)
Bug Fixes
- Fix a bug of encoding and parsing BlockBasedTableOptions::read_amp_bytes_per_bit as a 64-bit integer.
6.14.4 (11/05/2020)
Bug Fixes
- Fixed a potential bug caused by evaluating TableBuilder::NeedCompact() before TableBuilder::Finish() in compaction job. For example, the NeedCompact() method of CompactOnDeletionCollector returned by built-in CompactOnDeletionCollectorFactory requires BlockBasedTable::Finish() to return the correct result. The bug can cause a compaction-generated file not to be marked for future compaction based on deletion ratio.
6.14.3 (10/30/2020)
Bug Fixes
- Reverted a behavior change silently introduced in 6.14.2, in which the effects of the ignore_unknown_options flag (used in option parsing/loading functions) changed.
- Reverted a behavior change silently introduced in 6.14, in which options parsing/loading functions began returning NotFound instead of InvalidArgument for option names not available in the present version.
6.14.2 (10/21/2020)
Bug Fixes
- Fixed a bug which causes a hang when closing the DB if refit level is set in an opt build. The cause was that ContinueBackgroundWork() was called inside an assert statement, which is a no-op in such builds. It was introduced in 6.14.
6.14.1 (10/13/2020)
Bug Fixes
- Since 6.12, memtable lookup should report unrecognized value_type as corruption (#7121).
- Since 6.14, fix false positive flush/compaction Status::Corruption failure when paranoid_file_checks == true and range tombstones were written to the compaction output files.
- Fixed a bug in the following combination of features: indexes with user keys (format_version >= 3), indexes are partitioned (index_type == kTwoLevelIndexSearch), and some index partitions are pinned in memory (BlockBasedTableOptions::pin_l0_filter_and_index_blocks_in_cache). The bug could cause keys to be truncated when read from the index, leading to wrong read results or other unexpected behavior.
- Fixed a bug when indexes are partitioned (index_type == kTwoLevelIndexSearch), some index partitions are pinned in memory (BlockBasedTableOptions::pin_l0_filter_and_index_blocks_in_cache), and partition reads could be mixed between block cache and directly from the file (e.g., with enable_index_compression == 1 and mmap_read == 1, partitions that were stored uncompressed due to poor compression ratio would be read directly from the file via mmap, while partitions that were stored compressed would be read from block cache). The bug could cause index partitions to be mistakenly considered empty during reads, leading to wrong read results.
6.14 (10/09/2020)
Bug fixes
- Fixed a bug after a CompactRange() with CompactRangeOptions::change_level set fails due to a conflict in the level change step, which caused all subsequent calls to CompactRange() with CompactRangeOptions::change_level set to incorrectly fail with a Status::NotSupported("another thread is refitting") error.
- Fixed a bug in which the bottommost level compaction could still be a trivial move even if BottommostLevelCompaction.kForce or kForceOptimized is set.
Public API Change
- The methods to create and manage EncryptedEnv have been changed. The EncryptionProvider is now passed to NewEncryptedEnv as a shared pointer, rather than a raw pointer. Similarly, the CTREncryptionProvider now takes a shared pointer, rather than a reference, to a BlockCipher. CreateFromString methods have been added to BlockCipher and EncryptionProvider to provide a single API by which different ciphers and providers can be created, respectively.
- The internal classes (CTREncryptionProvider, ROT13BlockCipher, CTRCipherStream) associated with the EncryptedEnv have been moved out of the public API. To create a CTREncryptionProvider, one can either use EncryptionProvider::NewCTRProvider, or EncryptionProvider::CreateFromString("CTR"). To create a new ROT13BlockCipher, one can either use BlockCipher::NewROT13Cipher or BlockCipher::CreateFromString("ROT13").
- The EncryptionProvider::AddCipher method has been added to allow keys to be added to an EncryptionProvider. This API will allow future providers to support multiple cipher keys.
- Add a new option "allow_data_in_errors". When set, it lets users opt in to error messages containing corrupted keys/values. Corrupt keys and values will then be included in messages, logs, statuses, etc., giving users useful information about the affected data. The option defaults to false so that user data is not exposed; data is therefore redacted from logs, messages, and statuses by default.
- AdvancedColumnFamilyOptions::force_consistency_checks is now true by default, for more proactive DB corruption detection at virtually no cost (estimated two extra CPU cycles per million on a major production workload). Corruptions reported by these checks now mention "force_consistency_checks" in case a false positive corruption report is suspected and the option needs to be disabled (unlikely). Since existing column families have a saved setting for force_consistency_checks, only new column families will pick up the new default.
General Improvements
- The settings of the DBOptions and ColumnFamilyOptions are now managed by Configurable objects (see New Features). The same convenience methods to configure these options still exist but the backend implementation has been unified under a common implementation.
New Features
- Methods to configure, serialize, and compare objects -- such as TableFactory -- are exposed directly through the Configurable base class (from which these objects inherit). This change will allow for better and more thorough configuration management and retrieval in the future. The options for a Configurable object can be set via the ConfigureFromMap, ConfigureFromString, or ConfigureOption method. The serialized version of the options of an object can be retrieved via the GetOptionString, ToString, or GetOption methods. The list of options supported by an object can be obtained via the GetOptionNames method. The "raw" object (such as the BlockBasedTableOptions) for an option may be retrieved via the GetOptions method. Configurable options can be compared via the AreEquivalent method. The settings within a Configurable object may be validated via the ValidateOptions method. The object may be initialized (at which point only mutable options may be updated) via the PrepareOptions method.
- Introduce options.check_flush_compaction_key_order with default value to be true. With this option, during flush and compaction, key order will be checked when writing to each SST file. If the order is violated, the flush or compaction will fail.
- Added is_full_compaction to CompactionJobStats, so that the information is available through the EventListener interface.
- Add more stats for MultiGet in Histogram to get number of data blocks, index blocks, filter blocks and sst files read from file system per level.
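The idea behind check_flush_compaction_key_order can be sketched as a writer that refuses out-of-order keys. This is an illustrative model only: OrderCheckingWriter is a hypothetical class, and it assumes plain bytewise comparison of whole keys with strictly increasing order, which is simpler than RocksDB's internal key ordering.

```cpp
#include <cassert>
#include <string>

// Hypothetical writer that verifies key order while a file is being built.
class OrderCheckingWriter {
 public:
  // Accepts the key only if it sorts strictly after the previous one.
  // Returns false and marks the job as failed on an order violation.
  bool Add(const std::string& key) {
    if (has_last_ && key <= last_key_) {
      failed_ = true;
      return false;
    }
    last_key_ = key;
    has_last_ = true;
    return true;
  }

  bool failed() const { return failed_; }

 private:
  std::string last_key_;
  bool has_last_ = false;
  bool failed_ = false;
};
```

A flush or compaction modeled this way would stop at the first rejected key instead of writing a file whose keys are out of order.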
-
v6.14
September 10, 2020
Bug fixes
- Fixed a bug after a CompactRange() with CompactRangeOptions::change_level set fails due to a conflict in the level change step, which caused all subsequent calls to CompactRange() with CompactRangeOptions::change_level set to incorrectly fail with a Status::NotSupported("another thread is refitting") error.
- Fixed a bug in which the bottommost level compaction could still be a trivial move even if BottommostLevelCompaction.kForce or kForceOptimized is set.
Public API Change
- The methods to create and manage EncryptedEnv have been changed. The EncryptionProvider is now passed to NewEncryptedEnv as a shared pointer, rather than a raw pointer. Similarly, the CTREncryptionProvider now takes a shared pointer, rather than a reference, to a BlockCipher. CreateFromString methods have been added to BlockCipher and EncryptionProvider to provide a single API by which different ciphers and providers can be created, respectively.
- The internal classes (CTREncryptionProvider, ROT13BlockCipher, CTRCipherStream) associated with the EncryptedEnv have been moved out of the public API. To create a CTREncryptionProvider, one can either use EncryptionProvider::NewCTRProvider, or EncryptionProvider::CreateFromString("CTR"). To create a new ROT13BlockCipher, one can either use BlockCipher::NewROT13Cipher or BlockCipher::CreateFromString("ROT13").
- The EncryptionProvider::AddCipher method has been added to allow keys to be added to an EncryptionProvider. This API will allow future providers to support multiple cipher keys.
- Add a new option "allow_data_in_errors". When set, it lets users opt in to error messages containing corrupted keys/values. Corrupt keys and values will then be included in messages, logs, statuses, etc., giving users useful information about the affected data. The option defaults to false so that user data is not exposed; data is therefore redacted from logs, messages, and statuses by default.
- AdvancedColumnFamilyOptions::force_consistency_checks is now true by default, for more proactive DB corruption detection at virtually no cost (estimated two extra CPU cycles per million on a major production workload). Corruptions reported by these checks now mention "force_consistency_checks" in case a false positive corruption report is suspected and the option needs to be disabled (unlikely). Since existing column families have a saved setting for force_consistency_checks, only new column families will pick up the new default.
General Improvements
- The settings of the DBOptions and ColumnFamilyOptions are now managed by Configurable objects (see New Features). The same convenience methods to configure these options still exist but the backend implementation has been unified under a common implementation.
New Features
- Methods to configure, serialize, and compare objects -- such as TableFactory -- are exposed directly through the Configurable base class (from which these objects inherit). This change will allow for better and more thorough configuration management and retrieval in the future. The options for a Configurable object can be set via the ConfigureFromMap, ConfigureFromString, or ConfigureOption method. The serialized version of the options of an object can be retrieved via the GetOptionString, ToString, or GetOption methods. The list of options supported by an object can be obtained via the GetOptionNames method. The "raw" object (such as the BlockBasedTableOptions) for an option may be retrieved via the GetOptions method. Configurable options can be compared via the AreEquivalent method. The settings within a Configurable object may be validated via the ValidateOptions method. The object may be initialized (at which point only mutable options may be updated) via the PrepareOptions method.
- Introduce options.check_flush_compaction_key_order with default value to be true. With this option, during flush and compaction, key order will be checked when writing to each SST file. If the order is violated, the flush or compaction will fail.
- Added is_full_compaction to CompactionJobStats, so that the information is available through the EventListener interface.
- Add more stats for MultiGet in Histogram to get number of data blocks, index blocks, filter blocks and sst files read from file system per level.
- SST files have a new table property called db_host_id, which is set to the hostname by default. A new option in DBOptions, db_host_id, allows the property value to be overridden with a user-specified string, or disabled completely by making the option string empty.
- Methods to create customizable extensions -- such as TableFactory -- are exposed directly through the Customizable base class (from which these objects inherit). This change will allow these Customizable classes to be loaded and configured in a standard way (via CreateFromString). More information on how to write and use Customizable classes is in the customizable.h header file.
-
v6.13.3
October 13, 2020
6.13.3 (10/14/2020)
Bug Fixes
- Fix a bug that could cause a stalled write to crash with a mix of slowdown and no_slowdown writes (WriteOptions.no_slowdown=true).
6.13.2 (10/13/2020)
Bug Fixes
- Fix false positive flush/compaction Status::Corruption failure when paranoid_file_checks == true and range tombstones were written to the compaction output files.
6.13.1 (10/12/2020)
Bug Fixes
- Since 6.12, memtable lookup should report unrecognized value_type as corruption (#7121).
- Fixed a bug in the following combination of features: indexes with user keys (format_version >= 3), indexes are partitioned (index_type == kTwoLevelIndexSearch), and some index partitions are pinned in memory (BlockBasedTableOptions::pin_l0_filter_and_index_blocks_in_cache). The bug could cause keys to be truncated when read from the index, leading to wrong read results or other unexpected behavior.
- Fixed a bug when indexes are partitioned (index_type == kTwoLevelIndexSearch), some index partitions are pinned in memory (BlockBasedTableOptions::pin_l0_filter_and_index_blocks_in_cache), and partition reads could be mixed between block cache and directly from the file (e.g., with enable_index_compression == 1 and mmap_read == 1, partitions that were stored uncompressed due to poor compression ratio would be read directly from the file via mmap, while partitions that were stored compressed would be read from block cache). The bug could cause index partitions to be mistakenly considered empty during reads, leading to wrong read results.
6.13 (09/24/2020)
Bug fixes
- Fix a performance regression introduced in 6.4 that makes an upper bound check for every Next() even if keys are within a data block that is within the upper bound.
- Fix a possible corruption to the LSM state (overlapping files within a level) when a CompactRange() for refitting levels (CompactRangeOptions::change_level == true) and another manual compaction are executed in parallel.
- Sanitize recycle_log_file_num to zero when the user attempts to enable it in combination with WALRecoveryMode::kTolerateCorruptedTailRecords. Previously the two features were allowed together, which compromised the user's configured crash-recovery guarantees.
- Fix a bug where a level refitting in CompactRange() might race with an automatic compaction that puts the data to the target level of the refitting. The bug has been there for years.
- Fixed a bug in version 6.12 in which BackupEngine::CreateNewBackup could fail intermittently with non-OK status when backing up a read-write DB configured with a DBOptions::file_checksum_gen_factory.
- Fix useless no-op compactions scheduled upon snapshot release when options.disable_auto_compactions = true.
- Fix a bug where, when max_write_buffer_size_to_maintain is set, immutable flushed memtable destruction is delayed until the next super version is installed. A memtable is not added to the delete list because of the reference held by the super version, and the super version doesn't switch because of the empty delete list. So memory usage keeps increasing beyond write_buffer_size + max_write_buffer_size_to_maintain.
- Avoid converting MERGES to PUTS when allow_ingest_behind is true.
- Fix compression dictionary sampling together with SstFileWriter. Previously, the dictionary would be trained/finalized immediately with zero samples. Now, the whole SstFileWriter file is buffered in memory and then sampled.
- Fix a bug with avoid_unnecessary_blocking_io=1 and creating backups (BackupEngine::CreateNewBackup) or checkpoints (Checkpoint::Create). With this setting and WAL enabled, these operations could randomly fail with non-OK status.
- Fix a bug in which bottommost compaction continues to advance the underlying InternalIterator to skip tombstones even after shutdown.
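The recycle_log_file_num sanitization mentioned in the list above amounts to a small rule applied when options are validated. The sketch below is a hypothetical model, not RocksDB's implementation: WalOptions and SanitizeWalOptions are illustrative names, and only the single combination described above is handled.

```cpp
#include <cassert>
#include <cstddef>

// Recovery modes, named after the WALRecoveryMode values.
enum class WALRecoveryMode {
  kTolerateCorruptedTailRecords,
  kAbsoluteConsistency,
  kPointInTimeRecovery,
  kSkipAnyCorruptedRecords
};

// Hypothetical subset of the DB options involved in the rule.
struct WalOptions {
  std::size_t recycle_log_file_num = 0;
  WALRecoveryMode wal_recovery_mode = WALRecoveryMode::kPointInTimeRecovery;
};

// Returns a copy of the options with WAL recycling disabled when it is
// combined with kTolerateCorruptedTailRecords, as the item above describes.
inline WalOptions SanitizeWalOptions(WalOptions opts) {
  if (opts.wal_recovery_mode == WALRecoveryMode::kTolerateCorruptedTailRecords) {
    opts.recycle_log_file_num = 0;
  }
  return opts;
}
```

Sanitizing silently, rather than rejecting the configuration, keeps existing option files loadable while still giving the recovery mode priority.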
New Features
- A new field std::string requested_checksum_func_name is added to FileChecksumGenContext, which enables the checksum factory to create generators for a suite of different functions.
- Added a new subcommand, ldb unsafe_remove_sst_file, which removes a lost or corrupt SST file from a DB's metadata. This command involves data loss and must not be used on a live DB.
Performance Improvements
- Reduce thread number for multiple DB instances by re-using one global thread for statistics dumping and persisting.
- Reduce write-amp in heavy write bursts in kCompactionStyleLevel compaction style with level_compaction_dynamic_level_bytes set.
- BackupEngine incremental backups no longer read DB table files that are already saved to a shared part of the backup directory, unless share_files_with_checksum is used with kLegacyCrc32cAndFileSize naming (discouraged).
  - For share_files_with_checksum, we are confident there is no regression (vs. pre-6.12) in detecting DB or backup corruption at backup creation time, mostly because the old design did not leverage this extra checksum computation for detecting inconsistencies at backup creation time.
  - For share_table_files without "checksum" (not recommended), there is a regression in detecting fundamentally unsafe use of the option, greatly mitigated by file size checking (under "Behavior Changes"). Almost no reason to use share_files_with_checksum=false should remain.
  - DB::VerifyChecksum and BackupEngine::VerifyBackup with checksum checking are still able to catch corruptions that CreateNewBackup does not.
Public API Change
- Expose kTypeDeleteWithTimestamp in EntryType and update GetEntryType() accordingly.
- Added file_checksum and file_checksum_func_name to TableFileCreationInfo, which can pass the table file checksum information through the OnTableFileCreated callback during flush and compaction.
- A warning is added to DB::DeleteFile() API describing its known problems and deprecation plan.
- Add a new stats level, i.e. StatsLevel::kExceptTickers (PR7329), to exclude tickers even if the application passes a non-null Statistics object.
- Added a new status code IOStatus::IOFenced() for the Env/FileSystem to indicate that writes from this instance are fenced off. Like any other background error, this error is returned to the user in Put/Merge/Delete/Flush calls and can be checked using Status::IsIOFenced().
Behavior Changes
- File abstraction FSRandomAccessFile.Prefetch() default return status is changed from OK to NotSupported. If the user-inherited file doesn't implement prefetch, RocksDB will create an internal prefetch buffer to improve read performance.
- When a retryable IO error happens during Flush (manifest write error is excluded) and WAL is disabled, it was originally mapped to kHardError. Now, it is mapped to soft error, so the DB will not stall writes unless the memtable is full. At the same time, when auto resume is triggered to recover from the retryable IO error during Flush, SwitchMemtable is not called, to avoid generating too many small immutable memtables. If WAL is enabled, there is no behavior change.
- When considering whether a table file is already backed up in a shared part of backup directory, BackupEngine would already query the sizes of source (DB) and pre-existing destination (backup) files. BackupEngine now uses these file sizes to detect corruption, as at least one of (a) old backup, (b) backup in progress, or (c) current DB is corrupt if there's a size mismatch.
Others
- Errors in prefetching partitioned index blocks will no longer be swallowed. They will fail the query and return the IOError to users.
-
v6.13.2
October 13, 2020
-
v6.13
December 09, 2020
-
v6.12.7
October 14, 2020
6.12.7 (2020-10-14)
Other
- Fix build issue to enable RocksJava release for ppc64le
-
v6.12.6
October 13, 20206.12.6 (2020-10-13)
π Bug Fixes
- Fix false positive flush/compaction
Status::Corruption
failure whenparanoid_file_checks == true
and range tombstones were written to the compaction output files.
6.12.5 (2020-10-12)
Bug Fixes
- Since 6.12, memtable lookup should report unrecognized value_type as corruption (#7121).
- Fixed a bug in the following combination of features: indexes with user keys (format_version >= 3), indexes are partitioned (index_type == kTwoLevelIndexSearch), and some index partitions are pinned in memory (BlockBasedTableOptions::pin_l0_filter_and_index_blocks_in_cache). The bug could cause keys to be truncated when read from the index leading to wrong read results or other unexpected behavior.
- Fixed a bug when indexes are partitioned (index_type == kTwoLevelIndexSearch), some index partitions are pinned in memory (BlockBasedTableOptions::pin_l0_filter_and_index_blocks_in_cache), and partitions reads could be mixed between block cache and directly from the file (e.g., with enable_index_compression == 1 and mmap_read == 1, partitions that were stored uncompressed due to poor compression ratio would be read directly from the file via mmap, while partitions that were stored compressed would be read from block cache). The bug could cause index partitions to be mistakenly considered empty during reads leading to wrong read results.
6.12.4 (2020-09-18)
Public API Change
- Reworked BackupableDBOptions::share_files_with_checksum_naming (new in 6.12) with some minor improvements and to better support those who were extracting file sizes from backup file names.
6.12.3 (2020-09-16)
Bug fixes
- Fixed a bug in size-amp-triggered and periodic-triggered universal compaction, where the compression settings for the first input level were used rather than the compression settings for the output (bottom) level.
6.12.2 (2020-09-14)
Public API Change
- BlobDB now exposes the start of the expiration range of TTL blob files via the GetLiveFilesMetaData API.
6.12.1 (2020-08-20)
Bug fixes
- BackupEngine::CreateNewBackup could fail intermittently with non-OK status when backing up a read-write DB configured with a DBOptions::file_checksum_gen_factory. This issue has been worked-around such that CreateNewBackup should succeed, but (until fully fixed) BackupEngine might not see all checksums available in the DB.
6.12 (2020-07-28)
Public API Change
- Encryption file classes now exposed for inheritance in env_encryption.h
- File I/O listener is extended to cover more I/O operations. Now class EventListener in listener.h contains new callback functions: OnFileFlushFinish(), OnFileSyncFinish(), OnFileRangeSyncFinish(), OnFileTruncateFinish(), and OnFileCloseFinish().
- FileOperationInfo now reports duration measured by std::chrono::steady_clock and start_ts measured by std::chrono::system_clock instead of start and finish timestamps measured by system_clock. Note that system_clock is called before steady_clock in program order at operation starts.
- DB::GetDbSessionId(std::string& session_id) is added. session_id stores a unique identifier that gets reset every time the DB is opened. This DB session ID should be unique among all open DB instances on all hosts, and should be unique among re-openings of the same or other DBs. This identifier is recorded in the LOG file on the line starting with "DB Session ID:".
- DB::OpenForReadOnly() now returns Status::NotFound when the specified DB directory does not exist. Previously the error returned depended on the underlying Env. This change is available in all 6.11 releases as well.
- A parameter verify_with_checksum is added to BackupEngine::VerifyBackup, which is false by default. If it is true, BackupEngine::VerifyBackup verifies checksums and file sizes of backup files. Pass false for verify_with_checksum to maintain the previous behavior and performance of BackupEngine::VerifyBackup, by only verifying sizes of backup files.
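The FileOperationInfo timing change above (wall-clock start, monotonic duration) can be illustrated with plain standard-library code. This is a minimal sketch, not RocksDB's implementation: `OpTiming` and `TimeOperation` are hypothetical names, and only the field names `start_ts` and `duration` are taken from the note.

```cpp
#include <chrono>
#include <utility>

// Hypothetical record, loosely modeled on the two fields named above.
struct OpTiming {
  std::chrono::system_clock::time_point start_ts;  // wall-clock start
  std::chrono::nanoseconds duration;               // monotonic elapsed time
};

// Runs `op` and records when it started and how long it took.
template <typename Op>
OpTiming TimeOperation(Op&& op) {
  // system_clock is sampled before steady_clock, matching the order
  // the note describes for operation start.
  const auto start_ts = std::chrono::system_clock::now();
  const auto begin = std::chrono::steady_clock::now();
  std::forward<Op>(op)();
  const auto end = std::chrono::steady_clock::now();
  return OpTiming{
      start_ts,
      std::chrono::duration_cast<std::chrono::nanoseconds>(end - begin)};
}
```

Measuring the duration with steady_clock means it cannot go negative if the wall clock is adjusted while the operation runs, which is a plausible motivation for the change, though the note does not state one.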
Behavior Changes
- Best-efforts recovery ignores CURRENT file completely. If CURRENT file is missing during recovery, best-efforts recovery still proceeds with MANIFEST file(s).
- In best-efforts recovery, an error that is not Corruption or IOError::kNotFound or IOError::kPathNotFound will be overwritten silently. Fix this by checking all non-ok cases and return early.
- When file_checksum_gen_factory is set to GetFileChecksumGenCrc32cFactory(), BackupEngine will compare the crc32c checksums of table files computed when creating a backup to the expected checksums stored in the DB manifest, and will fail CreateNewBackup() on mismatch (corruption). If the file_checksum_gen_factory is not set or set to any other customized factory, there is no checksum verification to detect if SST files in a DB are corrupt when read, copied, and independently checksummed by BackupEngine.
- When a DB sets stats_dump_period_sec > 0, either as the initial value for DB open or as a dynamic option change, the first stats dump is staggered in the following X seconds, where X is an integer in [0, stats_dump_period_sec). Subsequent stats dumps are still spaced stats_dump_period_sec seconds apart.
- When the paranoid_file_checks option is true, a hash of all keys and values is generated when the SST file is written, and then the values are read back in to validate the file. A corruption is signaled if the two hashes do not match.
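The staggered stats dump schedule from the Behavior Changes above can be sketched as follows. The note only says X is an integer in [0, stats_dump_period_sec); drawing it uniformly at random is an assumption of this sketch, and both function names are hypothetical.

```cpp
#include <cstdint>
#include <random>

// Picks the delay before the first stats dump: an integer X in
// [0, period_sec). Returns 0 when dumping is disabled (period_sec == 0).
// Uniform random choice is an assumption; the note does not say how X is picked.
uint64_t FirstStatsDumpDelaySec(uint64_t period_sec, std::mt19937_64& rng) {
  if (period_sec == 0) return 0;
  std::uniform_int_distribution<uint64_t> dist(0, period_sec - 1);
  return dist(rng);
}

// Time of the n-th dump (n = 0 is the first), in seconds after the option
// takes effect. Later dumps stay exactly period_sec apart.
uint64_t NthStatsDumpTimeSec(uint64_t first_delay_sec, uint64_t period_sec,
                             uint64_t n) {
  return first_delay_sec + n * period_sec;
}
```

For example, with a 600-second period and a first delay of 17 seconds, the dumps land at 17, 617, 1217, and so on.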
Bug fixes
- Compressed block cache was automatically disabled with read-only DBs by mistake. Now it is fixed: compressed block cache will be effective with read-only DBs too.
- Fix a bug of wrong iterator result if another thread finishes an update and a DB flush between two statements.
- Disable file deletion after MANIFEST write/sync failure until db re-open or Resume() so that subsequent re-open will not see MANIFEST referencing deleted SSTs.
- Fix a bug when index_type == kTwoLevelIndexSearch in PartitionedIndexBuilder to update FlushPolicy to point to internal key partitioner when it changes from user-key mode to internal-key mode in index partition.
- Make compaction report InternalKey corruption while iterating over the input.
- Fix a bug which may cause MultiGet to be slow because it may read more data than requested, but this won't affect correctness. The bug was introduced in 6.10 release.
- Fail recovery and report once hitting a physical log record checksum mismatch, while reading MANIFEST. RocksDB should not continue processing the MANIFEST any further.
New Features
- DB identity (db_id) and DB session identity (db_session_id) are added to table properties and stored in SST files. SST files generated from SstFileWriter and Repairer have DB identity "SST Writer" and "DB Repairer", respectively. Their DB session IDs are generated in the same way as DB::GetDbSessionId. The session ID for SstFileWriter (resp., Repairer) resets every time SstFileWriter::Open (resp., Repairer::Run) is called.
- Added experimental option BlockBasedTableOptions::optimize_filters_for_memory for reducing allocated memory size of Bloom filters (~10% savings with Jemalloc) while preserving the same general accuracy. To have an effect, the option requires format_version=5 and malloc_usable_size. Enabling this option is forward and backward compatible with existing format_version=5.
- BackupableDBOptions::share_files_with_checksum_naming is added with new default behavior for naming backup files with share_files_with_checksum, to address performance and backup integrity issues. See API comments for details.
- Added an auto resume function to automatically recover the DB from background retryable IO errors. When a retryable IO error happens during flush and WAL write, the error is mapped to Hard Error and the DB will be in read mode. When a retryable IO error happens during compaction, the error is mapped to Soft Error and the DB stays in write/read mode. The auto resume function creates a thread for a DB that calls DB->ResumeImpl() to try to recover from a retryable IO error during flush and WAL write. Compaction will be rescheduled by itself if a retryable IO error happens. Auto resume may itself hit another retryable IO error during the recovery, so the recovery can fail. Retrying the auto resume may solve the issue, so max_bgerror_resume_count decides how many resume cycles will be tried in total. If it is <=0, auto resume for retryable IO errors is disabled. The default is INT_MAX, which leads to infinite auto resume. bgerror_resume_retry_interval decides the time interval between two auto resumes.
- Option max_subcompactions can be set dynamically using DB::SetDBOptions().
- Added experimental ColumnFamilyOptions::sst_partitioner_factory to determine the partitioning of sst files. This helps compaction to split the files on interesting boundaries (key prefixes) to make propagation of sst files less write amplifying (covering the whole key space).
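The auto resume retry policy in the New Features above (a bounded number of resume cycles separated by a fixed interval) can be sketched as a simple loop. This is not RocksDB's implementation: `AutoResume` and the `try_resume` callback are hypothetical, and expressing the interval in microseconds is an assumption, since the note does not give the unit of bgerror_resume_retry_interval.

```cpp
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical helper: calls `try_resume` until it reports success or the
// budget of resume cycles is used up. A budget <= 0 disables auto resume,
// mirroring the description of max_bgerror_resume_count above.
bool AutoResume(const std::function<bool()>& try_resume, int max_resume_count,
                std::chrono::microseconds retry_interval) {
  for (int attempt = 0; attempt < max_resume_count; ++attempt) {
    if (try_resume()) return true;  // recovery succeeded
    // Wait before the next cycle, but not after the final failed attempt.
    if (attempt + 1 < max_resume_count) {
      std::this_thread::sleep_for(retry_interval);
    }
  }
  return false;  // disabled, or every cycle failed
}
```

A budget of INT_MAX makes the loop effectively unbounded, which matches the note's remark that the default leads to infinite auto resume.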
Performance Improvements
- Eliminate key copies for internal comparisons while accessing ingested block-based tables.
- Reduce key comparisons during random access in all block-based tables.
- BackupEngine avoids unnecessary repeated checksum computation for backing up a table file to the shared_checksum directory when using share_files_with_checksum_naming = kUseDbSessionId (new default), except on SST files generated before this version of RocksDB, which fall back on using kLegacyCrc32cAndFileSize.
-
v6.12
July 28, 2020
Public API Change
- Encryption file classes now exposed for inheritance in env_encryption.h
- File I/O listener is extended to cover more I/O operations. Now class EventListener in listener.h contains new callback functions: OnFileFlushFinish(), OnFileSyncFinish(), OnFileRangeSyncFinish(), OnFileTruncateFinish(), and OnFileCloseFinish().
- FileOperationInfo now reports duration measured by std::chrono::steady_clock and start_ts measured by std::chrono::system_clock instead of start and finish timestamps measured by system_clock. Note that system_clock is called before steady_clock in program order at operation starts.
- DB::GetDbSessionId(std::string& session_id) is added. session_id stores a unique identifier that gets reset every time the DB is opened. This DB session ID should be unique among all open DB instances on all hosts, and should be unique among re-openings of the same or other DBs. This identifier is recorded in the LOG file on the line starting with "DB Session ID:".
- DB::OpenForReadOnly() now returns Status::NotFound when the specified DB directory does not exist. Previously the error returned depended on the underlying Env. This change is available in all 6.11 releases as well.
- A parameter verify_with_checksum is added to BackupEngine::VerifyBackup, which is false by default. If it is true, BackupEngine::VerifyBackup verifies checksums and file sizes of backup files. Pass false for verify_with_checksum to maintain the previous behavior and performance of BackupEngine::VerifyBackup, by only verifying sizes of backup files.
- Methods to configure, serialize, and compare -- such as TableFactory -- are exposed directly through the Configurable base class (from which these objects inherit). This change will allow for better and more thorough configuration management and retrieval in the future.
Behavior Changes
- Best-efforts recovery ignores CURRENT file completely. If CURRENT file is missing during recovery, best-efforts recovery still proceeds with MANIFEST file(s).
- In best-efforts recovery, an error that is not Corruption or IOError::kNotFound or IOError::kPathNotFound will be overwritten silently. Fix this by checking all non-ok cases and return early.
- When file_checksum_gen_factory is set to GetFileChecksumGenCrc32cFactory(), BackupEngine will compare the crc32c checksums of table files computed when creating a backup to the expected checksums stored in the DB manifest, and will fail CreateNewBackup() on mismatch (corruption). If the file_checksum_gen_factory is not set or set to any other customized factory, there is no checksum verification to detect if SST files in a DB are corrupt when read, copied, and independently checksummed by BackupEngine.
- When a DB sets stats_dump_period_sec > 0, either as the initial value for DB open or as a dynamic option change, the first stats dump is staggered in the following X seconds, where X is an integer in [0, stats_dump_period_sec). Subsequent stats dumps are still spaced stats_dump_period_sec seconds apart.
- When the paranoid_file_checks option is true, a hash of all keys and values is generated when the SST file is written, and then the values are read back in to validate the file. A corruption is signaled if the two hashes do not match.
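The paranoid_file_checks idea above (hash what is written, hash what is read back, compare) can be sketched with an in-memory stand-in for the SST file. Everything here is illustrative: the note does not say which hash RocksDB uses, so FNV-1a is a swapped-in choice, and `Entry`, `HashEntries`, and `VerifyReadBack` are hypothetical names.

```cpp
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

using Entry = std::pair<std::string, std::string>;  // (key, value)

// FNV-1a over a byte string, continuing from the running hash `h`.
// Used only as a stand-in hash for this sketch.
uint64_t Fnv1a(const std::string& data, uint64_t h) {
  for (unsigned char c : data) {
    h ^= c;
    h *= 1099511628211ULL;  // 64-bit FNV prime
  }
  return h;
}

// Folds every key and value, in order, into a single running hash.
uint64_t HashEntries(const std::vector<Entry>& entries) {
  uint64_t h = 14695981039346656037ULL;  // 64-bit FNV offset basis
  for (const auto& e : entries) {
    h = Fnv1a(e.first, h);
    h = Fnv1a(e.second, h);
  }
  return h;
}

// Returns true when the hash of the entries read back equals the hash
// computed at write time; false would be reported as corruption.
bool VerifyReadBack(const std::vector<Entry>& written,
                    const std::vector<Entry>& read_back) {
  return HashEntries(written) == HashEntries(read_back);
}
```

Keeping only a running hash at write time means the check does not need to hold the written entries in memory, which is one reason a hash comparison suits this kind of validation.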
Bug fixes
- Compressed block cache was automatically disabled with read-only DBs by mistake. Now it is fixed: compressed block cache will be effective with read-only DBs too.
- Fix a bug of wrong iterator result if another thread finishes an update and a DB flush between two statements.
- Disable file deletion after MANIFEST write/sync failure until db re-open or Resume() so that subsequent re-open will not see MANIFEST referencing deleted SSTs.
- Fix a bug when index_type == kTwoLevelIndexSearch in PartitionedIndexBuilder to update FlushPolicy to point to internal key partitioner when it changes from user-key mode to internal-key mode in index partition.
- Make compaction report InternalKey corruption while iterating over the input.
- Fix a bug which may cause MultiGet to be slow because it may read more data than requested, but this won't affect correctness. The bug was introduced in 6.10 release.
- Fail recovery and report once hitting a physical log record checksum mismatch, while reading MANIFEST. RocksDB should not continue processing the MANIFEST any further.
New Features
- DB identity (db_id) and DB session identity (db_session_id) are added to table properties and stored in SST files. SST files generated from SstFileWriter and Repairer have DB identity "SST Writer" and "DB Repairer", respectively. Their DB session IDs are generated in the same way as DB::GetDbSessionId. The session ID for SstFileWriter (resp., Repairer) resets every time SstFileWriter::Open (resp., Repairer::Run) is called.
- Added experimental option BlockBasedTableOptions::optimize_filters_for_memory for reducing allocated memory size of Bloom filters (~10% savings with Jemalloc) while preserving the same general accuracy. To have an effect, the option requires format_version=5 and malloc_usable_size. Enabling this option is forward and backward compatible with existing format_version=5.
- BackupTableNameOption BackupableDBOptions::share_files_with_checksum_naming is added, where BackupTableNameOption is an enum type with two enumerators kChecksumAndFileSize and kOptionalChecksumAndDbSessionId. By default, BackupableDBOptions::share_files_with_checksum_naming is set to kOptionalChecksumAndDbSessionId. In the default case, backup table filenames generated by this version of RocksDB are of the form either <file_number>_<crc32c>_<db_session_id>.sst or <file_number>_<db_session_id>.sst as opposed to <file_number>_<crc32c>_<file_size>.sst. Specifically, table filenames are of the form <file_number>_<crc32c>_<db_session_id>.sst if DBOptions::file_checksum_gen_factory is set to GetFileChecksumGenCrc32cFactory(). Furthermore, the checksum value <crc32c> appearing in the filenames is hexadecimal-encoded, instead of being a decimal-encoded uint32_t value. If DBOptions::file_checksum_gen_factory is nullptr, the table filenames are of the form <file_number>_<db_session_id>.sst. The new default behavior fixes the backup file name collision problem, which might be possible at large scale, but the option kChecksumAndFileSize is added to allow use of old naming in case it is needed. Moreover, for table files generated prior to this version of RocksDB, using kOptionalChecksumAndDbSessionId will fall back on kChecksumAndFileSize. In these cases, the checksum value <crc32c> in the filenames <file_number>_<crc32c>_<file_size>.sst is a decimal-encoded uint32_t value as before. This default behavior change is not an upgrade issue, because previous versions of RocksDB can read, restore, and delete backups using new names, and it's OK for a backup directory to use a mixture of table file naming schemes. Note that share_files_with_checksum_naming comes into effect only when both share_files_with_checksum and share_table_files are true.
- Added an auto resume function to automatically recover the DB from background retryable IO errors. When a retryable IO error happens during flush and WAL write, the error is mapped to Hard Error and the DB will be in read mode. When a retryable IO error happens during compaction, the error is mapped to Soft Error and the DB stays in write/read mode. The auto resume function creates a thread for a DB that calls DB->ResumeImpl() to try to recover from a retryable IO error during flush and WAL write. Compaction will be rescheduled by itself if a retryable IO error happens. Auto resume may itself hit another retryable IO error during the recovery, so the recovery can fail. Retrying the auto resume may solve the issue, so max_bgerror_resume_count decides how many resume cycles will be tried in total. If it is <=0, auto resume for retryable IO errors is disabled. The default is INT_MAX, which leads to infinite auto resume. bgerror_resume_retry_interval decides the time interval between two auto resumes.
- Option max_subcompactions can be set dynamically using DB::SetDBOptions().
- Added experimental ColumnFamilyOptions::sst_partitioner_factory to determine the partitioning of sst files. This helps compaction to split the files on interesting boundaries (key prefixes) to make propagation of sst files less write amplifying (covering the whole key space).
- Methods to configure, serialize, and compare -- such as TableFactory -- are exposed directly through the Configurable base class (from which these objects inherit). This change will allow for better and more thorough configuration management and retrieval in the future. The options for a Configurable object can be set via the ConfigureFromMap, ConfigureFromString, or ConfigureOption method. The serialized version of the options of an object can be retrieved via the GetOptionString, ToString, or GetOption methods. The list of options supported by an object can be obtained via the GetOptionNames method. The "raw" object (such as the BlockBasedTableOption) for an option may be retrieved via the GetOptions method. Configurable options can be compared via the AreEquivalent method. The settings within a Configurable object may be validated via the ValidateOptions method. The object may be initialized (at which point only mutable options may be updated) via the PrepareOptions method.
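The backup table file naming schemes described in the New Features above can be sketched as two small string builders. This is an illustration, not BackupEngine's code: both function names are hypothetical, and rendering the hexadecimal checksum as eight lowercase, zero-padded digits is an assumption, since the note only says the value is hexadecimal-encoded.

```cpp
#include <cstdint>
#include <cstdio>
#include <string>

// New-style name:
//   <file_number>_<crc32c>_<db_session_id>.sst  when a crc32c is available
//   <file_number>_<db_session_id>.sst           otherwise
// The crc32c is written in hexadecimal (8 lowercase digits is an assumption).
std::string BackupTableFileName(uint64_t file_number, bool has_crc32c,
                                uint32_t crc32c,
                                const std::string& db_session_id) {
  std::string name = std::to_string(file_number);
  if (has_crc32c) {
    char hex[9];  // 8 hex digits plus the terminating NUL
    std::snprintf(hex, sizeof(hex), "%08x", static_cast<unsigned>(crc32c));
    name += "_";
    name += hex;
  }
  name += "_" + db_session_id + ".sst";
  return name;
}

// Old-style name: <file_number>_<crc32c>_<file_size>.sst, with the crc32c
// written as a decimal uint32_t value.
std::string LegacyBackupTableFileName(uint64_t file_number, uint32_t crc32c,
                                      uint64_t file_size) {
  return std::to_string(file_number) + "_" + std::to_string(crc32c) + "_" +
         std::to_string(file_size) + ".sst";
}
```

Under these assumptions, file number 12 with checksum 0xDEADBEEF and a made-up session ID "ABC123" would be named 12_deadbeef_ABC123.sst in the new scheme and 12_3735928559_4096.sst in the old one for a 4096-byte file.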
Performance Improvements
- Eliminate key copies for internal comparisons while accessing ingested block-based tables.
- Reduce key comparisons during random access in all block-based tables.
- BackupEngine avoids unnecessary repeated checksum computation for backing up a table file to the shared_checksum directory when using kOptionalChecksumAndDbSessionId, except on SST files generated before this version of RocksDB, which fall back on using kChecksumAndFileSize.
General Improvements
- The settings of the DBOptions and ColumnFamilyOptions are now managed by Configurable objects (see New Features). The same convenience methods to configure these options still exist but the backend implementation has been unified under a common implementation.