All Versions
95
Latest Version
Avg Release Cycle
29 days
Latest Release
-

Changelog History
Page 1

  • v6.15.0

    πŸ› Bug Fixes

    • πŸ”– Fixed a bug in the following combination of features: indexes with user keys (format_version >= 3), indexes are partitioned (index_type == kTwoLevelIndexSearch), and some index partitions are pinned in memory (BlockBasedTableOptions::pin_l0_filter_and_index_blocks_in_cache). The bug could cause keys to be truncated when read from the index leading to wrong read results or other unexpected behavior.
    • πŸ“Œ Fixed a bug when indexes are partitioned (index_type == kTwoLevelIndexSearch), some index partitions are pinned in memory (BlockBasedTableOptions::pin_l0_filter_and_index_blocks_in_cache), and partitions reads could be mixed between block cache and directly from the file (e.g., with enable_index_compression == 1 and mmap_read == 1, partitions that were stored uncompressed due to poor compression ratio would be read directly from the file via mmap, while partitions that were stored compressed would be read from block cache). The bug could cause index partitions to be mistakenly considered empty during reads leading to wrong read results.
    • Since 6.12, memtable lookup should report unrecognized value_type as corruption (#7121).
    • Since 6.14, fix false positive flush/compaction Status::Corruption failure when paranoid_file_checks == true and range tombstones were written to the compaction output files.
    • Since 6.14, fix a bug that could cause a stalled write to crash with mixed of slowdown and no_slowdown writes (WriteOptions.no_slowdown=true).
    • πŸ›  Fixed a bug which causes hang in closing DB when refit level is set in opt build. It was because ContinueBackgroundWork() was called in assert statement which is a no op. It was introduced in 6.14.
    • πŸ›  Fixed a bug which causes Get() to return incorrect result when a key's merge operand is applied twice. This can occur if the thread performing Get() runs concurrently with a background flush thread and another thread writing to the MANIFEST file (PR6069).
    • Reverted a behavior change silently introduced in 6.14.2, in which the effects of the ignore_unknown_options flag (used in option parsing/loading functions) changed.
    • βͺ Reverted a behavior change silently introduced in 6.14, in which options parsing/loading functions began returning NotFound instead of InvalidArgument for option names not available in the present version.
    • πŸ›  Fixed MultiGet bugs it doesn't return valid data with user defined timestamp.
    • πŸ›  Fixed a potential bug caused by evaluating TableBuilder::NeedCompact() before TableBuilder::Finish() in compaction job. For example, the NeedCompact() method of CompactOnDeletionCollector returned by built-in CompactOnDeletionCollectorFactory requires BlockBasedTable::Finish() to return the correct result. The bug can cause a compaction-generated file not to be marked for future compaction based on deletion ratio.
    • πŸ›  Fixed a seek issue with prefix extractor and timestamp.
    • Fixed a bug of encoding and parsing BlockBasedTableOptions::read_amp_bytes_per_bit as a 64-bit integer.
    • πŸ›  Fixed a bug of a recovery corner case, details in PR7621.

    Public API Change

    • Deprecate BlockBasedTableOptions::pin_l0_filter_and_index_blocks_in_cache and BlockBasedTableOptions::pin_top_level_index_and_filter. These options still take effect until users migrate to the replacement APIs in BlockBasedTableOptions::metadata_cache_options. Migration guidance can be found in the API comments on the deprecated options.
    • βž• Add new API DB::VerifyFileChecksums to verify SST file checksum with corresponding entries in the MANIFEST if present. Current implementation requires scanning and recomputing file checksums.
    • Added a new option track_and_verify_wals_in_manifest. If true, the log numbers and sizes of the synced WALs are tracked in MANIFEST, then during DB recovery, if a synced WAL is missing from disk, or the WAL's size does not match the recorded size in MANIFEST, an error will be reported and the recovery will be aborted. Note that this option does not work with secondary instance.

    Behavior Changes

    • The dictionary compression settings specified in ColumnFamilyOptions::compression_opts now additionally affect files generated by flush and compaction to non-bottommost level. Previously those settings at most affected files generated by compaction to bottommost level, depending on whether ColumnFamilyOptions::bottommost_compression_opts overrode them. Users who relied on dictionary compression settings in ColumnFamilyOptions::compression_opts affecting only the bottommost level can keep the behavior by moving their dictionary settings to ColumnFamilyOptions::bottommost_compression_opts and setting its enabled flag.
    • When the enabled flag is set in ColumnFamilyOptions::bottommost_compression_opts, those compression options now take effect regardless of the value in ColumnFamilyOptions::bottommost_compression. Previously, those compression options only took effect when ColumnFamilyOptions::bottommost_compression != kDisableCompressionOption. Now, they additionally take effect when ColumnFamilyOptions::bottommost_compression == kDisableCompressionOption (such a setting causes bottommost compression type to fall back to ColumnFamilyOptions::compression_per_level if configured, and otherwise fall back to ColumnFamilyOptions::compression).

    πŸ†• New Features

    • An EXPERIMENTAL new Bloom alternative that saves about 30% space compared to Bloom filters, with about 3-4x construction time and similar query times is available using NewExperimentalRibbonFilterPolicy.
  • v6.14.6

    December 01, 2020

    6.14.6 (12/01/2020)

    πŸ› Bug Fixes

    • Truncated WALs ending in incomplete records can no longer produce gaps in the recovered data when WALRecoveryMode::kPointInTimeRecovery is used. Gaps are still possible when WALs are truncated exactly on record boundaries.
  • v6.14.5

    November 17, 2020

    6.14.5 (11/15/2020)

    πŸ› Bug Fixes

    • Fix a bug of encoding and parsing BlockBasedTableOptions::read_amp_bytes_per_bit as a 64-bit integer.

    6.14.4 (11/05/2020)

    πŸ› Bug Fixes

    πŸ›  Fixed a potential bug caused by evaluating TableBuilder::NeedCompact() before TableBuilder::Finish() in compaction job. For example, the NeedCompact() method of CompactOnDeletionCollector returned by built-in CompactOnDeletionCollectorFactory requires BlockBasedTable::Finish() to return the correct result. The bug can cause a compaction-generated file not to be marked for future compaction based on deletion ratio.

    6.14.3 (10/30/2020)

    πŸ› Bug Fixes

    • Reverted a behavior change silently introduced in 6.14.2, in which the effects of the ignore_unknown_options flag (used in option parsing/loading functions) changed.
    • βͺ Reverted a behavior change silently introduced in 6.14, in which options parsing/loading functions began returning NotFound instead of InvalidArgument for option names not available in the present version.

    6.14.2 (10/21/2020)

    πŸ› Bug Fixes

    • πŸ›  Fixed a bug which causes hang in closing DB when refit level is set in opt build. It was because ContinueBackgroundWork() was called in assert statement which is a no op. It was introduced in 6.14.

    6.14.1 (10/13/2020)

    πŸ› Bug Fixes

    • Since 6.12, memtable lookup should report unrecognized value_type as corruption (#7121).
    • Since 6.14, fix false positive flush/compaction Status::Corruption failure when paranoid_file_checks == true and range tombstones were written to the compaction output files.
    • πŸ”– Fixed a bug in the following combination of features: indexes with user keys (format_version >= 3), indexes are partitioned (index_type == kTwoLevelIndexSearch), and some index partitions are pinned in memory (BlockBasedTableOptions::pin_l0_filter_and_index_blocks_in_cache). The bug could cause keys to be truncated when read from the index leading to wrong read results or other unexpected behavior.
    • πŸ“Œ Fixed a bug when indexes are partitioned (index_type == kTwoLevelIndexSearch), some index partitions are pinned in memory (BlockBasedTableOptions::pin_l0_filter_and_index_blocks_in_cache), and partitions reads could be mixed between block cache and directly from the file (e.g., with enable_index_compression == 1 and mmap_read == 1, partitions that were stored uncompressed due to poor compression ratio would be read directly from the file via mmap, while partitions that were stored compressed would be read from block cache). The bug could cause index partitions to be mistakenly considered empty during reads leading to wrong read results.

    6.14 (10/09/2020)

    πŸ› Bug fixes

    • Fixed a bug after a CompactRange() with CompactRangeOptions::change_level set fails due to a conflict in the level change step, which caused all subsequent calls to CompactRange() with CompactRangeOptions::change_level set to incorrectly fail with a Status::NotSupported("another thread is refitting") error.
    • πŸ›  Fixed a bug that the bottom most level compaction could still be a trivial move even if BottommostLevelCompaction.kForce or kForceOptimized is set.

    Public API Change

    • The methods to create and manage EncrypedEnv have been changed. The EncryptionProvider is now passed to NewEncryptedEnv as a shared pointer, rather than a raw pointer. Comparably, the CTREncryptedProvider now takes a shared pointer, rather than a reference, to a BlockCipher. CreateFromString methods have been added to BlockCipher and EncryptionProvider to provide a single API by which different ciphers and providers can be created, respectively.
    • 🚚 The internal classes (CTREncryptionProvider, ROT13BlockCipher, CTRCipherStream) associated with the EncryptedEnv have been moved out of the public API. To create a CTREncryptionProvider, one can either use EncryptionProvider::NewCTRProvider, or EncryptionProvider::CreateFromString("CTR"). To create a new ROT13BlockCipher, one can either use BlockCipher::NewROT13Cipher or BlockCipher::CreateFromString("ROT13").
    • πŸ‘ The EncryptionProvider::AddCipher method has been added to allow keys to be added to an EncryptionProvider. This API will allow future providers to support multiple cipher keys.
    • Add a new option "allow_data_in_errors". When this new option is set by users, it allows users to opt-in to get error messages containing corrupted keys/values. Corrupt keys, values will be logged in the messages, logs, status etc. that will help users with the useful information regarding affected data. By default value of this option is set false to prevent users data to be exposed in the messages so currently, data will be redacted from logs, messages, status by default.
    • AdvancedColumnFamilyOptions::force_consistency_checks is now true by default, for more proactive DB corruption detection at virtually no cost (estimated two extra CPU cycles per million on a major production workload). Corruptions reported by these checks now mention "force_consistency_checks" in case a false positive corruption report is suspected and the option needs to be disabled (unlikely). Since existing column families have a saved setting for force_consistency_checks, only new column families will pick up the new default.

    General Improvements

    • πŸ‘€ The settings of the DBOptions and ColumnFamilyOptions are now managed by Configurable objects (see New Features). The same convenience methods to configure these options still exist but the backend implementation has been unified under a common implementation.

    πŸ†• New Features

    • ⚑️ Methods to configure serialize, and compare -- such as TableFactory -- are exposed directly through the Configurable base class (from which these objects inherit). This change will allow for better and more thorough configuration management and retrieval in the future. The options for a Configurable object can be set via the ConfigureFromMap, ConfigureFromString, or ConfigureOption method. The serialized version of the options of an object can be retrieved via the GetOptionString, ToString, or GetOption methods. The list of options supported by an object can be obtained via the GetOptionNames method. The "raw" object (such as the BlockBasedTableOption) for an option may be retrieved via the GetOptions method. Configurable options can be compared via the AreEquivalent method. The settings within a Configurable object may be validated via the ValidateOptions method. The object may be intialized (at which point only mutable options may be updated) via the PrepareOptions method.
    • Introduce options.check_flush_compaction_key_order with default value to be true. With this option, during flush and compaction, key order will be checked when writing to each SST file. If the order is violated, the flush or compaction will fail.
    • Added is_full_compaction to CompactionJobStats, so that the information is available through the EventListener interface.
    • βž• Add more stats for MultiGet in Histogram to get number of data blocks, index blocks, filter blocks and sst files read from file system per level.
  • v6.14

    September 10, 2020

    πŸ› Bug fixes

    • Fixed a bug after a CompactRange() with CompactRangeOptions::change_level set fails due to a conflict in the level change step, which caused all subsequent calls to CompactRange() with CompactRangeOptions::change_level set to incorrectly fail with a Status::NotSupported("another thread is refitting") error.
    • πŸ›  Fixed a bug that the bottom most level compaction could still be a trivial move even if BottommostLevelCompaction.kForce or kForceOptimized is set.

    Public API Change

    • The methods to create and manage EncrypedEnv have been changed. The EncryptionProvider is now passed to NewEncryptedEnv as a shared pointer, rather than a raw pointer. Comparably, the CTREncryptedProvider now takes a shared pointer, rather than a reference, to a BlockCipher. CreateFromString methods have been added to BlockCipher and EncryptionProvider to provide a single API by which different ciphers and providers can be created, respectively.
    • 🚚 The internal classes (CTREncryptionProvider, ROT13BlockCipher, CTRCipherStream) associated with the EncryptedEnv have been moved out of the public API. To create a CTREncryptionProvider, one can either use EncryptionProvider::NewCTRProvider, or EncryptionProvider::CreateFromString("CTR"). To create a new ROT13BlockCipher, one can either use BlockCipher::NewROT13Cipher or BlockCipher::CreateFromString("ROT13").
    • πŸ‘ The EncryptionProvider::AddCipher method has been added to allow keys to be added to an EncryptionProvider. This API will allow future providers to support multiple cipher keys.
    • Add a new option "allow_data_in_errors". When this new option is set by users, it allows users to opt-in to get error messages containing corrupted keys/values. Corrupt keys, values will be logged in the messages, logs, status etc. that will help users with the useful information regarding affected data. By default value of this option is set false to prevent users data to be exposed in the messages so currently, data will be redacted from logs, messages, status by default.
    • AdvancedColumnFamilyOptions::force_consistency_checks is now true by default, for more proactive DB corruption detection at virtually no cost (estimated two extra CPU cycles per million on a major production workload). Corruptions reported by these checks now mention "force_consistency_checks" in case a false positive corruption report is suspected and the option needs to be disabled (unlikely). Since existing column families have a saved setting for force_consistency_checks, only new column families will pick up the new default.

    General Improvements

    • πŸ‘€ The settings of the DBOptions and ColumnFamilyOptions are now managed by Configurable objects (see New Features). The same convenience methods to configure these options still exist but the backend implementation has been unified under a common implementation.

    πŸ†• New Features

    • ⚑️ Methods to configure serialize, and compare -- such as TableFactory -- are exposed directly through the Configurable base class (from which these objects inherit). This change will allow for better and more thorough configuration management and retrieval in the future. The options for a Configurable object can be set via the ConfigureFromMap, ConfigureFromString, or ConfigureOption method. The serialized version of the options of an object can be retrieved via the GetOptionString, ToString, or GetOption methods. The list of options supported by an object can be obtained via the GetOptionNames method. The "raw" object (such as the BlockBasedTableOption) for an option may be retrieved via the GetOptions method. Configurable options can be compared via the AreEquivalent method. The settings within a Configurable object may be validated via the ValidateOptions method. The object may be intialized (at which point only mutable options may be updated) via the PrepareOptions method.
    • Introduce options.check_flush_compaction_key_order with default value to be true. With this option, during flush and compaction, key order will be checked when writing to each SST file. If the order is violated, the flush or compaction will fail.
    • Added is_full_compaction to CompactionJobStats, so that the information is available through the EventListener interface.
    • βž• Add more stats for MultiGet in Histogram to get number of data blocks, index blocks, filter blocks and sst files read from file system per level.
    • SST files have a new table property called db_host_id, which is set to the hostname by default. A new option in DBOptions, db_host_id, allows the property value to be overridden with a user specified string, or disable it completely by making the option string empty.
    • πŸ”§ Methods to create customizable extensions -- such as TableFactory -- are exposed directly through the Customizable base class (from which these objects inherit). This change will allow these Customizable classes to be loaded and configured in a standard way (via CreateFromString). More information on how to write and use Customizable classes is in the customizable.h header file.
  • v6.13.3

    October 13, 2020

    6.13.3 (10/14/2020)

    πŸ› Bug Fixes

    • Fix a bug that could cause a stalled write to crash with mixed of slowdown and no_slowdown writes (WriteOptions.no_slowdown=true).

    6.13.2 (10/13/2020)

    πŸ› Bug Fixes

    • Fix false positive flush/compaction Status::Corruption failure when paranoid_file_checks == true and range tombstones were written to the compaction output files.

    6.13.1 (10/12/2020)

    πŸ› Bug Fixes

    • Since 6.12, memtable lookup should report unrecognized value_type as corruption (#7121).
    • πŸ”– Fixed a bug in the following combination of features: indexes with user keys (format_version >= 3), indexes are partitioned (index_type == kTwoLevelIndexSearch), and some index partitions are pinned in memory (BlockBasedTableOptions::pin_l0_filter_and_index_blocks_in_cache). The bug could cause keys to be truncated when read from the index leading to wrong read results or other unexpected behavior.
    • πŸ“Œ Fixed a bug when indexes are partitioned (index_type == kTwoLevelIndexSearch), some index partitions are pinned in memory (BlockBasedTableOptions::pin_l0_filter_and_index_blocks_in_cache), and partitions reads could be mixed between block cache and directly from the file (e.g., with enable_index_compression == 1 and mmap_read == 1, partitions that were stored uncompressed due to poor compression ratio would be read directly from the file via mmap, while partitions that were stored compressed would be read from block cache). The bug could cause index partitions to be mistakenly considered empty during reads leading to wrong read results.

    6.13 (09/24/2020)

    πŸ› Bug fixes

    • πŸ›  Fix a performance regression introduced in 6.4 that makes a upper bound check for every Next() even if keys are within a data block that is within the upper bound.
    • πŸ›  Fix a possible corruption to the LSM state (overlapping files within a level) when a CompactRange() for refitting levels (CompactRangeOptions::change_level == true) and another manual compaction are executed in parallel.
    • 🌲 Sanitize recycle_log_file_num to zero when the user attempts to enable it in combination with WALRecoveryMode::kTolerateCorruptedTailRecords. Previously the two features were allowed together, which compromised the user's configured crash-recovery guarantees.
    • πŸ›  Fix a bug where a level refitting in CompactRange() might race with an automatic compaction that puts the data to the target level of the refitting. The bug has been there for years.
    • Fixed a bug in version 6.12 in which BackupEngine::CreateNewBackup could fail intermittently with non-OK status when backing up a read-write DB configured with a DBOptions::file_checksum_gen_factory.
    • πŸ›  Fix useless no-op compactions scheduled upon snapshot release when options.disable-auto-compactions = true.
    • Fix a bug when max_write_buffer_size_to_maintain is set, immutable flushed memtable destruction is delayed until the next super version is installed. A memtable is not added to delete list because of its reference hold by super version and super version doesn't switch because of empt delete list. So memory usage keeps on increasing beyond write_buffer_size + max_write_buffer_size_to_maintain.
    • Avoid converting MERGES to PUTS when allow_ingest_behind is true.
    • πŸ›  Fix compression dictionary sampling together with SstFileWriter. Previously, the dictionary would be trained/finalized immediately with zero samples. Now, the whole SstFileWriter file is buffered in memory and then sampled.
    • Fix a bug with avoid_unnecessary_blocking_io=1 and creating backups (BackupEngine::CreateNewBackup) or checkpoints (Checkpoint::Create). With this setting and WAL enabled, these operations could randomly fail with non-OK status.
    • πŸ›  Fix a bug in which bottommost compaction continues to advance the underlying InternalIterator to skip tombstones even after shutdown.

    πŸ†• New Features

    • A new field std::string requested_checksum_func_name is added to FileChecksumGenContext, which enables the checksum factory to create generators for a suite of different functions.
    • βœ‚ Added a new subcommand, ldb unsafe_remove_sst_file, which removes a lost or corrupt SST file from a DB's metadata. This command involves data loss and must not be used on a live DB.

    🐎 Performance Improvements

    • ⬇️ Reduce thread number for multiple DB instances by re-using one global thread for statistics dumping and persisting.
    • Reduce write-amp in heavy write bursts in kCompactionStyleLevel compaction style with level_compaction_dynamic_level_bytes set.
    • BackupEngine incremental backups no longer read DB table files that are already saved to a shared part of the backup directory, unless share_files_with_checksum is used with kLegacyCrc32cAndFileSize naming (discouraged).
      • For share_files_with_checksum, we are confident there is no regression (vs. pre-6.12) in detecting DB or backup corruption at backup creation time, mostly because the old design did not leverage this extra checksum computation for detecting inconsistencies at backup creation time.
      • For share_table_files without "checksum" (not recommended), there is a regression in detecting fundamentally unsafe use of the option, greatly mitigated by file size checking (under "Behavior Changes"). Almost no reason to use share_files_with_checksum=false should remain.
      • DB::VerifyChecksum and BackupEngine::VerifyBackup with checksum checking are still able to catch corruptions that CreateNewBackup does not.

    Public API Change

    • ⚑️ Expose kTypeDeleteWithTimestamp in EntryType and update GetEntryType() accordingly.
    • Added file_checksum and file_checksum_func_name to TableFileCreationInfo, which can pass the table file checksum information through the OnTableFileCreated callback during flush and compaction.
    • πŸ—„ A warning is added to DB::DeleteFile() API describing its known problems and deprecation plan.
    • βž• Add a new stats level, i.e. StatsLevel::kExceptTickers (PR7329) to exclude tickers even if application passes a non-null Statistics object.
    • βž• Added a new status code IOStatus::IOFenced() for the Env/FileSystem to indicate that writes from this instance are fenced off. Like any other background error, this error is returned to the user in Put/Merge/Delete/Flush calls and can be checked using Status::IsIOFenced().

    Behavior Changes

    • 🐎 File abstraction FSRandomAccessFile.Prefetch() default return status is changed from OK to NotSupported. If the user inherited file doesn't implement prefetch, RocksDB will create internal prefetch buffer to improve read performance.
    • When retryable IO error happens during Flush (manifest write error is excluded) and WAL is disabled, originally it is mapped to kHardError. Now,it is mapped to soft error. So DB will not stall the writes unless the memtable is full. At the same time, when auto resume is triggered to recover the retryable IO error during Flush, SwitchMemtable is not called to avoid generating to many small immutable memtables. If WAL is enabled, no behavior changes.
    • When considering whether a table file is already backed up in a shared part of backup directory, BackupEngine would already query the sizes of source (DB) and pre-existing destination (backup) files. BackupEngine now uses these file sizes to detect corruption, as at least one of (a) old backup, (b) backup in progress, or (c) current DB is corrupt if there's a size mismatch.

    Others

    • Error in prefetching partitioned index blocks will not be swallowed. It will fail the query and return the IOError users.
  • v6.13.2

    October 13, 2020
  • v6.13

    December 09, 2020

    πŸ› Bug fixes

    • πŸ›  Fix a performance regression introduced in 6.4 that makes a upper bound check for every Next() even if keys are within a data block that is within the upper bound.
    • πŸ›  Fix a possible corruption to the LSM state (overlapping files within a level) when a CompactRange() for refitting levels (CompactRangeOptions::change_level == true) and another manual compaction are executed in parallel.
    • 🌲 Sanitize recycle_log_file_num to zero when the user attempts to enable it in combination with WALRecoveryMode::kTolerateCorruptedTailRecords. Previously the two features were allowed together, which compromised the user's configured crash-recovery guarantees.
    • πŸ›  Fix a bug where a level refitting in CompactRange() might race with an automatic compaction that puts the data to the target level of the refitting. The bug has been there for years.
    • Fixed a bug in version 6.12 in which BackupEngine::CreateNewBackup could fail intermittently with non-OK status when backing up a read-write DB configured with a DBOptions::file_checksum_gen_factory.
    • πŸ›  Fix useless no-op compactions scheduled upon snapshot release when options.disable-auto-compactions = true.
    • Fix a bug when max_write_buffer_size_to_maintain is set, immutable flushed memtable destruction is delayed until the next super version is installed. A memtable is not added to delete list because of its reference hold by super version and super version doesn't switch because of empt delete list. So memory usage keeps on increasing beyond write_buffer_size + max_write_buffer_size_to_maintain.
    • Avoid converting MERGES to PUTS when allow_ingest_behind is true.
    • πŸ›  Fix compression dictionary sampling together with SstFileWriter. Previously, the dictionary would be trained/finalized immediately with zero samples. Now, the whole SstFileWriter file is buffered in memory and then sampled.
    • Fix a bug with avoid_unnecessary_blocking_io=1 and creating backups (BackupEngine::CreateNewBackup) or checkpoints (Checkpoint::Create). With this setting and WAL enabled, these operations could randomly fail with non-OK status.
    • πŸ›  Fix a bug in which bottommost compaction continues to advance the underlying InternalIterator to skip tombstones even after shutdown.

    πŸ†• New Features

    • A new field std::string requested_checksum_func_name is added to FileChecksumGenContext, which enables the checksum factory to create generators for a suite of different functions.
    • βœ‚ Added a new subcommand, ldb unsafe_remove_sst_file, which removes a lost or corrupt SST file from a DB's metadata. This command involves data loss and must not be used on a live DB.

    🐎 Performance Improvements

    • ⬇️ Reduce thread number for multiple DB instances by re-using one global thread for statistics dumping and persisting.
    • Reduce write-amp in heavy write bursts in kCompactionStyleLevel compaction style with level_compaction_dynamic_level_bytes set.
    • BackupEngine incremental backups no longer read DB table files that are already saved to a shared part of the backup directory, unless share_files_with_checksum is used with kLegacyCrc32cAndFileSize naming (discouraged).
      • For share_files_with_checksum, we are confident there is no regression (vs. pre-6.12) in detecting DB or backup corruption at backup creation time, mostly because the old design did not leverage this extra checksum computation for detecting inconsistencies at backup creation time.
      • For share_table_files without "checksum" (not recommended), there is a regression in detecting fundamentally unsafe use of the option, greatly mitigated by file size checking (under "Behavior Changes"). Almost no reason to use share_files_with_checksum=false should remain.
      • DB::VerifyChecksum and BackupEngine::VerifyBackup with checksum checking are still able to catch corruptions that CreateNewBackup does not.

    Public API Change

    • ⚑️ Expose kTypeDeleteWithTimestamp in EntryType and update GetEntryType() accordingly.
    • Added file_checksum and file_checksum_func_name to TableFileCreationInfo, which can pass the table file checksum information through the OnTableFileCreated callback during flush and compaction.
    • πŸ—„ A warning is added to DB::DeleteFile() API describing its known problems and deprecation plan.
    • βž• Add a new stats level, i.e. StatsLevel::kExceptTickers (PR7329) to exclude tickers even if application passes a non-null Statistics object.
    • βž• Added a new status code IOStatus::IOFenced() for the Env/FileSystem to indicate that writes from this instance are fenced off. Like any other background error, this error is returned to the user in Put/Merge/Delete/Flush calls and can be checked using Status::IsIOFenced().

    Behavior Changes

    • 🐎 File abstraction FSRandomAccessFile.Prefetch() default return status is changed from OK to NotSupported. If the user inherited file doesn't implement prefetch, RocksDB will create internal prefetch buffer to improve read performance.
    • When retryabel IO error happens during Flush (manifest write error is excluded) and WAL is disabled, originally it is mapped to kHardError. Now,it is mapped to soft error. So DB will not stall the writes unless the memtable is full. At the same time, when auto resume is triggered to recover the retryable IO error during Flush, SwitchMemtable is not called to avoid generating to many small immutable memtables. If WAL is enabled, no behavior changes.
    • When considering whether a table file is already backed up in a shared part of backup directory, BackupEngine would already query the sizes of source (DB) and pre-existing destination (backup) files. BackupEngine now uses these file sizes to detect corruption, as at least one of (a) old backup, (b) backup in progress, or (c) current DB is corrupt if there's a size mismatch.

    Others

    • Error in prefetching partitioned index blocks will not be swallowed. It will fail the query and return the IOError users.
  • v6.12.7

    October 14, 2020

    6.12.7 (2020-10-14)

    Other

    πŸ›  Fix build issue to enable RocksJava release for ppc64le

  • v6.12.6

    October 13, 2020

    6.12.6 (2020-10-13)

    πŸ› Bug Fixes

    • Fix false positive flush/compaction Status::Corruption failure when paranoid_file_checks == true and range tombstones were written to the compaction output files.

    6.12.5 (2020-10-12)

    πŸ› Bug Fixes

    • Since 6.12, memtable lookup should report unrecognized value_type as corruption (#7121).
    • πŸ”– Fixed a bug in the following combination of features: indexes with user keys (format_version >= 3), indexes are partitioned (index_type == kTwoLevelIndexSearch), and some index partitions are pinned in memory (BlockBasedTableOptions::pin_l0_filter_and_index_blocks_in_cache). The bug could cause keys to be truncated when read from the index leading to wrong read results or other unexpected behavior.
    • πŸ“Œ Fixed a bug when indexes are partitioned (index_type == kTwoLevelIndexSearch), some index partitions are pinned in memory (BlockBasedTableOptions::pin_l0_filter_and_index_blocks_in_cache), and partitions reads could be mixed between block cache and directly from the file (e.g., with enable_index_compression == 1 and mmap_read == 1, partitions that were stored uncompressed due to poor compression ratio would be read directly from the file via mmap, while partitions that were stored compressed would be read from block cache). The bug could cause index partitions to be mistakenly considered empty during reads leading to wrong read results.

    6.12.4 (2020-09-18)

    Public API Change

    • Reworked BackupableDBOptions::share_files_with_checksum_naming (new in 6.12) with some minor improvements and to better support those who were extracting files sizes from backup file names.

    6.12.3 (2020-09-16)

    πŸ› Bug fixes

    • πŸ›  Fixed a bug in size-amp-triggered and periodic-triggered universal compaction, where the compression settings for the first input level were used rather than the compression settings for the output (bottom) level.

    6.12.2 (2020-09-14)

    Public API Change

    • πŸ“‡ BlobDB now exposes the start of the expiration range of TTL blob files via the GetLiveFilesMetaData API.

    6.12.1 (2020-08-20)

    πŸ› Bug fixes

    • BackupEngine::CreateNewBackup could fail intermittently with non-OK status when backing up a read-write DB configured with a DBOptions::file_checksum_gen_factory. This issue has been worked-around such that CreateNewBackup should succeed, but (until fully fixed) BackupEngine might not see all checksums available in the DB.

    6.12 (2020-07-28)

    Public API Change

    • Encryption file classes now exposed for inheritance in env_encryption.h
    • File I/O listener is extended to cover more I/O operations. Now class EventListener in listener.h contains new callback functions: OnFileFlushFinish(), OnFileSyncFinish(), OnFileRangeSyncFinish(), OnFileTruncateFinish(), and OnFileCloseFinish().
    • FileOperationInfo now reports duration measured by std::chrono::steady_clock and start_ts measured by std::chrono::system_clock instead of start and finish timestamps measured by system_clock. Note that system_clock is called before steady_clock in program order at operation starts.
    • DB::GetDbSessionId(std::string& session_id) is added. session_id stores a unique identifier that gets reset every time the DB is opened. This DB session ID should be unique among all open DB instances on all hosts, and should be unique among re-openings of the same or other DBs. This identifier is recorded in the LOG file on the line starting with "DB Session ID:".
    • πŸš€ DB::OpenForReadOnly() now returns Status::NotFound when the specified DB directory does not exist. Previously the error returned depended on the underlying Env. This change is available in all 6.11 releases as well.
    • A parameter verify_with_checksum is added to BackupEngine::VerifyBackup, which is false by default. If it is ture, BackupEngine::VerifyBackup verifies checksums and file sizes of backup files. Pass false for verify_with_checksum to maintain the previous behavior and performance of BackupEngine::VerifyBackup, by only verifying sizes of backup files.

    Behavior Changes

    • Best-efforts recovery ignores CURRENT file completely. If CURRENT file is missing during recovery, best-efforts recovery still proceeds with MANIFEST file(s).
    • In best-efforts recovery, an error that is not Corruption or IOError::kNotFound or IOError::kPathNotFound will be overwritten silently. Fix this by checking all non-ok cases and return early.
    • When file_checksum_gen_factory is set to GetFileChecksumGenCrc32cFactory(), BackupEngine will compare the crc32c checksums of table files computed when creating a backup to the expected checksums stored in the DB manifest, and will fail CreateNewBackup() on mismatch (corruption). If the file_checksum_gen_factory is not set or set to any other customized factory, there is no checksum verification to detect if SST files in a DB are corrupt when read, copied, and independently checksummed by BackupEngine.
    • When a DB sets stats_dump_period_sec > 0, either as the initial value for DB open or as a dynamic option change, the first stats dump is staggered in the following X seconds, where X is an integer in [0, stats_dump_period_sec). Subsequent stats dumps are still spaced stats_dump_period_sec seconds apart.
    • When the paranoid_file_checks option is true, a hash is generated of all keys and values are generated when the SST file is written, and then the values are read back in to validate the file. A corruption is signaled if the two hashes do not match.

    πŸ› Bug fixes

    • πŸ›  Compressed block cache was automatically disabled with read-only DBs by mistake. Now it is fixed: compressed block cache will be in effective with read-only DB too.
    • πŸ›  Fix a bug of wrong iterator result if another thread finishes an update and a DB flush between two statement.
    • πŸ‘€ Disable file deletion after MANIFEST write/sync failure until db re-open or Resume() so that subsequent re-open will not see MANIFEST referencing deleted SSTs.
    • πŸ›  Fix a bug when index_type == kTwoLevelIndexSearch in PartitionedIndexBuilder to update FlushPolicy to point to internal key partitioner when it changes from user-key mode to internal-key mode in index partition.
    • πŸ‘‰ Make compaction report InternalKey corruption while iterating over the input.
    • πŸ›  Fix a bug which may cause MultiGet to be slow because it may read more data than requested, but this won't affect correctness. The bug was introduced in 6.10 release.
    • 🌲 Fail recovery and report once hitting a physical log record checksum mismatch, while reading MANIFEST. RocksDB should not continue processing the MANIFEST any further.

    πŸ†• New Features

    • DB identity (db_id) and DB session identity (db_session_id) are added to table properties and stored in SST files. SST files generated from SstFileWriter and Repairer have DB identity β€œSST Writer” and β€œDB Repairer”, respectively. Their DB session IDs are generated in the same way as DB::GetDbSessionId. The session ID for SstFileWriter (resp., Repairer) resets every time SstFileWriter::Open (resp., Repairer::Run) is called.
    • Added experimental option BlockBasedTableOptions::optimize_filters_for_memory for reducing allocated memory size of Bloom filters (~10% savings with Jemalloc) while preserving the same general accuracy. To have an effect, the option requires format_version=5 and malloc_usable_size. Enabling this option is forward and backward compatible with existing format_version=5.
    • BackupableDBOptions::share_files_with_checksum_naming is added with new default behavior for naming backup files with share_files_with_checksum, to address performance and backup integrity issues. See API comments for details.
    • Added auto resume function to automatically recover the DB from background Retryable IO Error. When retryable IOError happens during flush and WAL write, the error is mapped to Hard Error and DB will be in read mode. When retryable IO Error happens during compaction, the error will be mapped to Soft Error. DB is still in write/read mode. Autoresume function will create a thread for a DB to call DB->ResumeImpl() to try the recover for Retryable IO Error during flush and WAL write. Compaction will be rescheduled by itself if retryable IO Error happens. Auto resume may also cause other Retryable IO Error during the recovery, so the recovery will fail. Retry the auto resume may solve the issue, so we use max_bgerror_resume_count to decide how many resume cycles will be tried in total. If it is <=0, auto resume retryable IO Error is disabled. Default is INT_MAX, which will lead to a infinit auto resume. bgerror_resume_retry_interval decides the time interval between two auto resumes.
    • Option max_subcompactions can be set dynamically using DB::SetDBOptions().
    • Added experimental ColumnFamilyOptions::sst_partitioner_factory to define determine the partitioning of sst files. This helps compaction to split the files on interesting boundaries (key prefixes) to make propagation of sst files less write amplifying (covering the whole key space).

    🐎 Performance Improvements

    • Eliminate key copies for internal comparisons while accessing ingested block-based tables.
    • ⬇️ Reduce key comparisons during random access in all block-based tables.
    • BackupEngine avoids unnecessary repeated checksum computation for backing up a table file to the shared_checksum directory when using share_files_with_checksum_naming = kUseDbSessionId (new default), except on SST files generated before this version of RocksDB, which fall back on using kLegacyCrc32cAndFileSize.
  • v6.12

    July 28, 2020

    Public API Change

    • Encryption file classes now exposed for inheritance in env_encryption.h
    • File I/O listener is extended to cover more I/O operations. Now class EventListener in listener.h contains new callback functions: OnFileFlushFinish(), OnFileSyncFinish(), OnFileRangeSyncFinish(), OnFileTruncateFinish(), and OnFileCloseFinish().
    • FileOperationInfo now reports duration measured by std::chrono::steady_clock and start_ts measured by std::chrono::system_clock instead of start and finish timestamps measured by system_clock. Note that system_clock is called before steady_clock in program order at operation starts.
    • DB::GetDbSessionId(std::string& session_id) is added. session_id stores a unique identifier that gets reset every time the DB is opened. This DB session ID should be unique among all open DB instances on all hosts, and should be unique among re-openings of the same or other DBs. This identifier is recorded in the LOG file on the line starting with "DB Session ID:".
    • πŸš€ DB::OpenForReadOnly() now returns Status::NotFound when the specified DB directory does not exist. Previously the error returned depended on the underlying Env. This change is available in all 6.11 releases as well.
    • A parameter verify_with_checksum is added to BackupEngine::VerifyBackup, which is false by default. If it is ture, BackupEngine::VerifyBackup verifies checksums and file sizes of backup files. Pass false for verify_with_checksum to maintain the previous behavior and performance of BackupEngine::VerifyBackup, by only verifying sizes of backup files.
    • πŸ”§ Methods to configure serialize, and compare -- such as TableFactory -- are exposed directly through the Configurable base class (from which these objects inherity). This change will allow for better and more thorough configuration management and retrieval in the future ### Behavior Changes
    • Best-efforts recovery ignores CURRENT file completely. If CURRENT file is missing during recovery, best-efforts recovery still proceeds with MANIFEST file(s).
    • In best-efforts recovery, an error that is not Corruption or IOError::kNotFound or IOError::kPathNotFound will be overwritten silently. Fix this by checking all non-ok cases and return early.
    • When file_checksum_gen_factory is set to GetFileChecksumGenCrc32cFactory(), BackupEngine will compare the crc32c checksums of table files computed when creating a backup to the expected checksums stored in the DB manifest, and will fail CreateNewBackup() on mismatch (corruption). If the file_checksum_gen_factory is not set or set to any other customized factory, there is no checksum verification to detect if SST files in a DB are corrupt when read, copied, and independently checksummed by BackupEngine.
    • When a DB sets stats_dump_period_sec > 0, either as the initial value for DB open or as a dynamic option change, the first stats dump is staggered in the following X seconds, where X is an integer in [0, stats_dump_period_sec). Subsequent stats dumps are still spaced stats_dump_period_sec seconds apart.
    • When the paranoid_file_checks option is true, a hash is generated of all keys and values are generated when the SST file is written, and then the values are read back in to validate the file. A corruption is signaled if the two hashes do not match.

    πŸ› Bug fixes

    • πŸ›  Compressed block cache was automatically disabled with read-only DBs by mistake. Now it is fixed: compressed block cache will be in effective with read-only DB too.
    • πŸ›  Fix a bug of wrong iterator result if another thread finishes an update and a DB flush between two statement.
    • πŸ‘€ Disable file deletion after MANIFEST write/sync failure until db re-open or Resume() so that subsequent re-open will not see MANIFEST referencing deleted SSTs.
    • πŸ›  Fix a bug when index_type == kTwoLevelIndexSearch in PartitionedIndexBuilder to update FlushPolicy to point to internal key partitioner when it changes from user-key mode to internal-key mode in index partition.
    • πŸ‘‰ Make compaction report InternalKey corruption while iterating over the input.
    • πŸ›  Fix a bug which may cause MultiGet to be slow because it may read more data than requested, but this won't affect correctness. The bug was introduced in 6.10 release.
    • 🌲 Fail recovery and report once hitting a physical log record checksum mismatch, while reading MANIFEST. RocksDB should not continue processing the MANIFEST any further.

    πŸ†• New Features

    • DB identity (db_id) and DB session identity (db_session_id) are added to table properties and stored in SST files. SST files generated from SstFileWriter and Repairer have DB identity β€œSST Writer” and β€œDB Repairer”, respectively. Their DB session IDs are generated in the same way as DB::GetDbSessionId. The session ID for SstFileWriter (resp., Repairer) resets every time SstFileWriter::Open (resp., Repairer::Run) is called.
    • Added experimental option BlockBasedTableOptions::optimize_filters_for_memory for reducing allocated memory size of Bloom filters (~10% savings with Jemalloc) while preserving the same general accuracy. To have an effect, the option requires format_version=5 and malloc_usable_size. Enabling this option is forward and backward compatible with existing format_version=5.
    • BackupTableNameOption BackupableDBOptions::share_files_with_checksum_naming is added, where BackupTableNameOption is an enum type with two enumerators kChecksumAndFileSize and kOptionalChecksumAndDbSessionId. By default, BackupableDBOptions::share_files_with_checksum_naming is set to kOptionalChecksumAndDbSessionId. In the default case, backup table filenames generated by this version of RocksDB are of the form either <file_number>_<crc32c>_<db_session_id>.sst or <file_number>_<db_session_id>.sst as opposed to <file_number>_<crc32c>_<file_size>.sst. Specifically, table filenames are of the form <file_number>_<crc32c>_<db_session_id>.sst if DBOptions::file_checksum_gen_factory is set to GetFileChecksumGenCrc32cFactory(). Futhermore, the checksum value <crc32c> appeared in the filenames is hexadecimal-encoded, instead of being decimal-encoded uint32_t value. If DBOptions::file_checksum_gen_factory is nullptr, the table filenames are of the form <file_number>_<db_session_id>.sst. The new default behavior fixes the backup file name collision problem, which might be possible at large scale, but the option kChecksumAndFileSize is added to allow use of old naming in case it is needed. Moreover, for table files generated prior to this version of RocksDB, using kOptionalChecksumAndDbSessionId will fall back on kChecksumAndFileSize. In these cases, the checksum value <crc32c> in the filenames <file_number>_<crc32c>_<file_size>.sst is decimal-encoded uint32_t value as before. This default behavior change is not an upgrade issue, because previous versions of RocksDB can read, restore, and delete backups using new names, and it's OK for a backup directory to use a mixture of table file naming schemes. Note that share_files_with_checksum_naming comes into effect only when both share_files_with_checksum and share_table_files are true.
    • Added auto resume function to automatically recover the DB from background Retryable IO Error. When retryable IOError happens during flush and WAL write, the error is mapped to Hard Error and DB will be in read mode. When retryable IO Error happens during compaction, the error will be mapped to Soft Error. DB is still in write/read mode. Autoresume function will create a thread for a DB to call DB->ResumeImpl() to try the recover for Retryable IO Error during flush and WAL write. Compaction will be rescheduled by itself if retryable IO Error happens. Auto resume may also cause other Retryable IO Error during the recovery, so the recovery will fail. Retry the auto resume may solve the issue, so we use max_bgerror_resume_count to decide how many resume cycles will be tried in total. If it is <=0, auto resume retryable IO Error is disabled. Default is INT_MAX, which will lead to a infinit auto resume. bgerror_resume_retry_interval decides the time interval between two auto resumes.
    • Option max_subcompactions can be set dynamically using DB::SetDBOptions().
    • Added experimental ColumnFamilyOptions::sst_partitioner_factory to define determine the partitioning of sst files. This helps compaction to split the files on interesting boundaries (key prefixes) to make propagation of sst files less write amplifying (covering the whole key space).
    • ⚑️ Methods to configure serialize, and compare -- such as TableFactory -- are exposed directly through the Configurable base class (from which these objects inherity). This change will allow for better and more thorough configuration management and retrieval in the future. The options for a Configurable object can be set via the ConfigureFromMap, ConfigureFromString, or ConfigureOption method. The serialized version of the options of an object can be retrieved via the GetOptionString, ToString, or GetOption methods. The list of options supported by an object can be obtained via the GetOptionNames method. The "raw" object (such as the BlockBasedTableOption) for an option may be retrieved via the GetOptions method. Configurable options can be compared via the AreEquivalent method. The settings within a Configurable object may be validated via the ValidateOptions method. The object may be intialized (at which point only mutable options may be updated) via the PrepareOptions method.

    🐎 Performance Improvements

    • Eliminate key copies for internal comparisons while accessing ingested block-based tables.
    • ⬇️ Reduce key comparisons during random access in all block-based tables.
    • BackupEngine avoids unnecessary repeated checksum computation for backing up a table file to the shared_checksum directory when using kOptionalChecksumAndDbSessionId, except on SST files generated before this version of RocksDB, which fall back on using kChecksumAndFileSize.

    General Improvements

    • πŸ‘€ The settings of the DBOptions and ColumnFamilyOptions are now managed by Configurable objects (see New Features). The same convenience methods to configure these options still exist but the backend implementation has been unified under a common implementation.