
Changelog History

  • v6.12 Changes

    July 28, 2020

    Public API Change

    • Encryption file classes now exposed for inheritance in env_encryption.h
    • File I/O listener is extended to cover more I/O operations. Now class EventListener in listener.h contains new callback functions: OnFileFlushFinish(), OnFileSyncFinish(), OnFileRangeSyncFinish(), OnFileTruncateFinish(), and OnFileCloseFinish().
    • FileOperationInfo now reports duration measured by std::chrono::steady_clock and start_ts measured by std::chrono::system_clock instead of start and finish timestamps measured by system_clock. Note that system_clock is called before steady_clock in program order when an operation starts.
    • DB::GetDbSessionId(std::string& session_id) is added. session_id stores a unique identifier that gets reset every time the DB is opened. This DB session ID should be unique among all open DB instances on all hosts, and should be unique among re-openings of the same or other DBs. This identifier is recorded in the LOG file on the line starting with "DB Session ID:".
    • 🚀 DB::OpenForReadOnly() now returns Status::NotFound when the specified DB directory does not exist. Previously the error returned depended on the underlying Env. This change is available in all 6.11 releases as well.
    • A parameter verify_with_checksum is added to BackupEngine::VerifyBackup, which is false by default. If it is true, BackupEngine::VerifyBackup verifies checksums and file sizes of backup files. Pass false for verify_with_checksum to maintain the previous behavior and performance of BackupEngine::VerifyBackup, by only verifying sizes of backup files.
    • 🔧 Methods to configure, serialize, and compare -- such as TableFactory -- are exposed directly through the Configurable base class (from which these objects inherit). This change will allow for better and more thorough configuration management and retrieval in the future.

    Behavior Changes

    • Best-efforts recovery ignores CURRENT file completely. If CURRENT file is missing during recovery, best-efforts recovery still proceeds with MANIFEST file(s).
    • In best-efforts recovery, an error that is not Corruption or IOError::kNotFound or IOError::kPathNotFound was overwritten silently. This is fixed by checking all non-OK cases and returning early.
    • When file_checksum_gen_factory is set to GetFileChecksumGenCrc32cFactory(), BackupEngine will compare the crc32c checksums of table files computed when creating a backup to the expected checksums stored in the DB manifest, and will fail CreateNewBackup() on mismatch (corruption). If the file_checksum_gen_factory is not set or set to any other customized factory, there is no checksum verification to detect if SST files in a DB are corrupt when read, copied, and independently checksummed by BackupEngine.
    • When a DB sets stats_dump_period_sec > 0, either as the initial value for DB open or as a dynamic option change, the first stats dump is staggered in the following X seconds, where X is an integer in [0, stats_dump_period_sec). Subsequent stats dumps are still spaced stats_dump_period_sec seconds apart.
    • When the paranoid_file_checks option is true, a hash of all keys and values is generated when the SST file is written, and then the values are read back in to validate the file. A corruption is signaled if the two hashes do not match.
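
The staggered stats dump schedule described above can be sketched as follows. This is a minimal illustration using only the C++ standard library, not RocksDB code: `PickStaggerOffset` and `StatsDumpTimes` are hypothetical helper names, and drawing X uniformly at random is an assumption (the note only says X is an integer in [0, stats_dump_period_sec)).

```cpp
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Hypothetical helper, not a RocksDB API. Picks the stagger offset X, an
// integer in [0, period_sec). Drawing X uniformly is an assumption.
// Precondition: period_sec > 0.
inline uint64_t PickStaggerOffset(uint64_t period_sec, std::mt19937_64& rng) {
  std::uniform_int_distribution<uint64_t> dist(0, period_sec - 1);
  return dist(rng);
}

// Hypothetical helper: returns the first `count` dump times in seconds,
// measured from the moment stats_dump_period_sec takes effect.
inline std::vector<uint64_t> StatsDumpTimes(uint64_t period_sec,
                                            uint64_t offset_sec,
                                            std::size_t count) {
  std::vector<uint64_t> times;
  if (period_sec == 0) return times;     // the schedule only applies when > 0
  uint64_t t = offset_sec % period_sec;  // keep X inside [0, period_sec)
  for (std::size_t i = 0; i < count; ++i) {
    times.push_back(t);  // first dump at X, later dumps one period apart
    t += period_sec;
  }
  return times;
}
```

With a period of 600 seconds and X = 42, the first three dumps in this model fall at 42, 642, and 1242 seconds.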

    🐛 Bug fixes

    • 🛠 Compressed block cache was automatically disabled with read-only DBs by mistake. Now it is fixed: compressed block cache is effective with read-only DBs too.
    • 🛠 Fix a bug of wrong iterator result if another thread finishes an update and a DB flush between two statements.
    • πŸ‘€ Disable file deletion after MANIFEST write/sync failure until db re-open or Resume() so that subsequent re-open will not see MANIFEST referencing deleted SSTs.
    • 🛠 Fix a bug when index_type == kTwoLevelIndexSearch in PartitionedIndexBuilder to update FlushPolicy to point to internal key partitioner when it changes from user-key mode to internal-key mode in index partition.
    • πŸ‘‰ Make compaction report InternalKey corruption while iterating over the input.
    • 🛠 Fix a bug which may cause MultiGet to be slow because it may read more data than requested, but this won't affect correctness. The bug was introduced in 6.10 release.
    • 🌲 Fail recovery and report once hitting a physical log record checksum mismatch, while reading MANIFEST. RocksDB should not continue processing the MANIFEST any further.

    🆕 New Features

    • DB identity (db_id) and DB session identity (db_session_id) are added to table properties and stored in SST files. SST files generated from SstFileWriter and Repairer have DB identity “SST Writer” and “DB Repairer”, respectively. Their DB session IDs are generated in the same way as DB::GetDbSessionId. The session ID for SstFileWriter (resp., Repairer) resets every time SstFileWriter::Open (resp., Repairer::Run) is called.
    • Added experimental option BlockBasedTableOptions::optimize_filters_for_memory for reducing allocated memory size of Bloom filters (~10% savings with Jemalloc) while preserving the same general accuracy. To have an effect, the option requires format_version=5 and malloc_usable_size. Enabling this option is forward and backward compatible with existing format_version=5.
    • BackupTableNameOption BackupableDBOptions::share_files_with_checksum_naming is added, where BackupTableNameOption is an enum type with two enumerators kChecksumAndFileSize and kOptionalChecksumAndDbSessionId. By default, BackupableDBOptions::share_files_with_checksum_naming is set to kOptionalChecksumAndDbSessionId. In the default case, backup table filenames generated by this version of RocksDB are of the form either <file_number>_<crc32c>_<db_session_id>.sst or <file_number>_<db_session_id>.sst as opposed to <file_number>_<crc32c>_<file_size>.sst. Specifically, table filenames are of the form <file_number>_<crc32c>_<db_session_id>.sst if DBOptions::file_checksum_gen_factory is set to GetFileChecksumGenCrc32cFactory(). Furthermore, the checksum value <crc32c> appearing in the filenames is hexadecimal-encoded, instead of being a decimal-encoded uint32_t value. If DBOptions::file_checksum_gen_factory is nullptr, the table filenames are of the form <file_number>_<db_session_id>.sst. The new default behavior fixes the backup file name collision problem, which might be possible at large scale, but the option kChecksumAndFileSize is added to allow use of the old naming in case it is needed. Moreover, for table files generated prior to this version of RocksDB, using kOptionalChecksumAndDbSessionId will fall back on kChecksumAndFileSize. In these cases, the checksum value <crc32c> in the filenames <file_number>_<crc32c>_<file_size>.sst is a decimal-encoded uint32_t value as before. This default behavior change is not an upgrade issue, because previous versions of RocksDB can read, restore, and delete backups using new names, and it's OK for a backup directory to use a mixture of table file naming schemes. Note that share_files_with_checksum_naming comes into effect only when both share_files_with_checksum and share_table_files are true.
    • Added an auto resume function to automatically recover the DB from background retryable IO errors. When a retryable IO error happens during flush or WAL write, the error is mapped to a hard error and the DB will be in read mode. When a retryable IO error happens during compaction, the error is mapped to a soft error and the DB is still in write/read mode. The auto resume function creates a thread for a DB that calls DB->ResumeImpl() to try to recover from a retryable IO error during flush or WAL write. Compaction is rescheduled by itself if a retryable IO error happens. Auto resume may itself hit another retryable IO error during the recovery, in which case the recovery fails. Retrying the auto resume may solve the issue, so max_bgerror_resume_count decides how many resume cycles are tried in total. If it is <= 0, auto resume on retryable IO errors is disabled. The default is INT_MAX, which leads to infinite auto resume attempts. bgerror_resume_retry_interval decides the time interval between two auto resumes.
    • Option max_subcompactions can be set dynamically using DB::SetDBOptions().
    • Added experimental ColumnFamilyOptions::sst_partitioner_factory to determine the partitioning of sst files. This helps compaction to split the files on interesting boundaries (key prefixes) to make propagation of sst files less write amplifying (covering the whole key space).
    • ⚡️ Methods to configure, serialize, and compare -- such as TableFactory -- are exposed directly through the Configurable base class (from which these objects inherit). This change will allow for better and more thorough configuration management and retrieval in the future. The options for a Configurable object can be set via the ConfigureFromMap, ConfigureFromString, or ConfigureOption method. The serialized version of the options of an object can be retrieved via the GetOptionString, ToString, or GetOption methods. The list of options supported by an object can be obtained via the GetOptionNames method. The "raw" object (such as the BlockBasedTableOption) for an option may be retrieved via the GetOptions method. Configurable options can be compared via the AreEquivalent method. The settings within a Configurable object may be validated via the ValidateOptions method. The object may be initialized (at which point only mutable options may be updated) via the PrepareOptions method.
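
The backup table file naming schemes described in the share_files_with_checksum_naming entry above can be illustrated with a small standalone sketch. The helper names are hypothetical and this is not BackupEngine code. The exact hexadecimal formatting (eight zero-padded lowercase digits) is an assumption, since the note only says the value is hexadecimal-encoded, and the session ID string in the example is made up.

```cpp
#include <cinttypes>
#include <cstdint>
#include <cstdio>
#include <string>

// Hypothetical helper, not a RocksDB API. Models kOptionalChecksumAndDbSessionId:
// <file_number>_<crc32c>_<db_session_id>.sst when a crc32c is available,
// otherwise <file_number>_<db_session_id>.sst.
inline std::string NewBackupTableName(uint64_t file_number, bool has_crc32c,
                                      uint32_t crc32c,
                                      const std::string& db_session_id) {
  std::string name = std::to_string(file_number);
  if (has_crc32c) {
    char hex[9];  // 8 hex digits plus the terminating NUL
    // Assumption: zero-padded lowercase hex; the note only says "hexadecimal".
    std::snprintf(hex, sizeof(hex), "%08" PRIx32, crc32c);
    name += "_";
    name += hex;
  }
  return name + "_" + db_session_id + ".sst";
}

// Hypothetical helper modelling kChecksumAndFileSize:
// <file_number>_<crc32c>_<file_size>.sst with a decimal-encoded checksum.
inline std::string LegacyBackupTableName(uint64_t file_number, uint32_t crc32c,
                                         uint64_t file_size) {
  return std::to_string(file_number) + "_" + std::to_string(crc32c) + "_" +
         std::to_string(file_size) + ".sst";
}
```

The new names replace the file size with the session ID, so two different files that happen to share a file number, checksum, and size no longer map to the same backup name.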

    🐎 Performance Improvements

    • Eliminate key copies for internal comparisons while accessing ingested block-based tables.
    • ⬇️ Reduce key comparisons during random access in all block-based tables.
    • BackupEngine avoids unnecessary repeated checksum computation for backing up a table file to the shared_checksum directory when using kOptionalChecksumAndDbSessionId, except on SST files generated before this version of RocksDB, which fall back on using kChecksumAndFileSize.

    General Improvements

    • πŸ‘€ The settings of the DBOptions and ColumnFamilyOptions are now managed by Configurable objects (see New Features). The same convenience methods to configure these options still exist but the backend implementation has been unified under a common implementation.
  • v6.11.6 Changes

    October 13, 2020

    6.11.6 (10/12/2020)

    🐛 Bug Fixes

    • πŸ”– Fixed a bug in the following combination of features: indexes with user keys (format_version >= 3), indexes are partitioned (index_type == kTwoLevelIndexSearch), and some index partitions are pinned in memory (BlockBasedTableOptions::pin_l0_filter_and_index_blocks_in_cache). The bug could cause keys to be truncated when read from the index leading to wrong read results or other unexpected behavior.
    • πŸ“Œ Fixed a bug when indexes are partitioned (index_type == kTwoLevelIndexSearch), some index partitions are pinned in memory (BlockBasedTableOptions::pin_l0_filter_and_index_blocks_in_cache), and partitions reads could be mixed between block cache and directly from the file (e.g., with enable_index_compression == 1 and mmap_read == 1, partitions that were stored uncompressed due to poor compression ratio would be read directly from the file via mmap, while partitions that were stored compressed would be read from block cache). The bug could cause index partitions to be mistakenly considered empty during reads leading to wrong read results.

    6.11.5 (7/23/2020)

    🐛 Bug Fixes

    • Memtable lookup should report unrecognized value_type as corruption (#7121).
  • v6.11.4 Changes

    July 20, 2020

    6.11.4 (2020-07-15)

    🐛 Bug Fixes

    • πŸ‘‰ Make compaction report InternalKey corruption while iterating over the input.

    6.11.3 (2020-07-09)

    🐛 Bug Fixes

    • 🛠 Fix a bug when index_type == kTwoLevelIndexSearch in PartitionedIndexBuilder to update FlushPolicy to point to internal key partitioner when it changes from user-key mode to internal-key mode in index partition.
    • πŸ‘€ Disable file deletion after MANIFEST write/sync failure until db re-open or Resume() so that subsequent re-open will not see MANIFEST referencing deleted SSTs.

    6.11.1 (2020-06-23)

    🐛 Bug Fixes

    • Best-efforts recovery ignores CURRENT file completely. If CURRENT file is missing during recovery, best-efforts recovery still proceeds with MANIFEST file(s).
    • In best-efforts recovery, an error that is not Corruption or IOError::kNotFound or IOError::kPathNotFound was overwritten silently. This is fixed by checking all non-OK cases and returning early.
    • 🛠 Compressed block cache was automatically disabled with read-only DBs by mistake. Now it is fixed: compressed block cache is effective with read-only DBs too.
    • 🌲 Fail recovery and report once hitting a physical log record checksum mismatch, while reading MANIFEST. RocksDB should not continue processing the MANIFEST any further.
    • 🛠 Fix a bug of wrong iterator result if another thread finishes an update and a DB flush between two statements.

    Public API Change

    • DB::OpenForReadOnly() now returns Status::NotFound when the specified DB directory does not exist. Previously the error returned depended on the underlying Env.

    6.11 (2020-06-12)

    🐛 Bug Fixes

    • Fix consistency checking error swallowing in some cases when options.force_consistency_checks = true.
    • 🛠 Fix possible false NotFound status from batched MultiGet using index type kHashSearch.
    • 🛠 Fix corruption caused by enabling delete triggered compaction (NewCompactOnDeletionCollectorFactory) in universal compaction mode, along with parallel compactions. The bug can result in two parallel compactions picking the same input files, resulting in the DB resurrecting older and deleted versions of some keys.
    • Fix a use-after-free bug in best-efforts recovery. column_family_memtables_ needs to point to valid ColumnFamilySet.
    • Let best-efforts recovery ignore corrupted files during table loading.
    • 🛠 Fix corrupt key read from ingested file when iterator direction switches from reverse to forward at a key that is a prefix of another key in the same file. It is only possible in files with a non-zero global seqno.
    • 🛠 Fix abnormally large estimate from GetApproximateSizes when a range starts near the end of one SST file and near the beginning of another. Now GetApproximateSizes consistently and fairly includes the size of SST metadata in addition to data blocks, attributing metadata proportionally among the data blocks based on their size.
    • 🛠 Fix potential file descriptor leakage in PosixEnv's IsDirectory() and NewRandomAccessFile().
    • 🛠 Fix false negative from the VerifyChecksum() API when there is a checksum mismatch in an index partition block in a BlockBasedTable format table file (index_type is kTwoLevelIndexSearch).
    • 🛠 Fix sst_dump to return non-zero exit code if the specified file is not a recognized SST file or fails requested checks.
    • 🛠 Fix incorrect results from batched MultiGet for duplicate keys, when the duplicate key matches the largest key of an SST file and the value type for the key in the file is a merge value.

    Public API Change

    • Flush(..., column_family) may return Status::ColumnFamilyDropped() instead of Status::InvalidArgument() if column_family is dropped while processing the flush request.
    • 0️⃣ BlobDB now explicitly disallows using the default column family's storage directories as blob directory.
    • ✂ DeleteRange now returns Status::InvalidArgument if the range's end key comes before its start key according to the user comparator. Previously the behavior was undefined.
    • ldb now uses options.force_consistency_checks = true by default and "--disable_consistency_checks" is added to disable it.
    • DB::OpenForReadOnly no longer creates files or directories if the named DB does not exist, unless create_if_missing is set to true.
    • The consistency checks that validate LSM state changes (table file additions/deletions during flushes and compactions) are now stricter, more efficient, and no longer optional, i.e. they are performed even if force_consistency_checks is false.
    • Disable delete triggered compaction (NewCompactOnDeletionCollectorFactory) in universal compaction mode and num_levels = 1 in order to avoid a corruption bug.
    • pin_l0_filter_and_index_blocks_in_cache no longer applies to L0 files larger than 1.5 * write_buffer_size to give more predictable memory usage. Such L0 files may exist due to intra-L0 compaction, external file ingestion, or user dynamically changing write_buffer_size (note, however, that files that are already pinned will continue being pinned, even after such a dynamic change).
    • In point-in-time wal recovery mode, fail database recovery in case of IOError while reading the WAL to avoid data loss.
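
The DeleteRange argument check listed above depends on the user comparator, not on raw byte order. A minimal standalone sketch of that rule follows. All names are hypothetical stand-ins rather than RocksDB types, and treating an empty range (end equal to start) as valid is an assumption based on the wording "comes before".

```cpp
#include <functional>
#include <string>

// Hypothetical stand-in for a status code; not the RocksDB Status class.
enum class RangeStatus { kOk, kInvalidArgument };

// Three-way comparator: negative if a < b, zero if equal, positive if a > b.
using UserComparator =
    std::function<int(const std::string&, const std::string&)>;

inline int BytewiseCompare(const std::string& a, const std::string& b) {
  int c = a.compare(b);
  return c < 0 ? -1 : (c > 0 ? 1 : 0);
}

inline int ReverseBytewiseCompare(const std::string& a, const std::string& b) {
  return BytewiseCompare(b, a);
}

// Rejects a range whose end key orders before its start key under `cmp`.
// Assumption: end == begin (an empty range) is accepted.
inline RangeStatus ValidateDeleteRange(const std::string& begin,
                                       const std::string& end,
                                       const UserComparator& cmp) {
  return cmp(end, begin) < 0 ? RangeStatus::kInvalidArgument
                             : RangeStatus::kOk;
}
```

Under a reverse comparator the range ["z", "a") is well-formed, while under a bytewise comparator the same call is rejected.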

    🆕 New Features

    • sst_dump adds a new --readahead_size argument. Users can specify the read size when scanning the data. sst_dump also tries to prefetch the tail part of the SST files, so some I/Os are usually saved there too.
    • Generate file checksum in SstFileWriter if Options.file_checksum_gen_factory is set. The checksum and checksum function name are stored in ExternalSstFileInfo after the sst file write is finished.
    • Add a value_size_soft_limit in read options which limits the cumulative value size of keys read in batches in MultiGet. Once the cumulative value size of found keys exceeds read_options.value_size_soft_limit, all the remaining keys are returned with status Abort without further finding their values. By default the value_size_soft_limit is std::numeric_limits<uint64_t>::max().
    • Enable SST file ingestion with file checksum information when calling IngestExternalFiles(const std::vector<IngestExternalFileArg>& args). Added files_checksums and files_checksum_func_names to IngestExternalFileArg such that user can ingest the sst files with their file checksum information. Added verify_file_checksum to IngestExternalFileOptions (default is True). To be backward compatible, if DB does not enable file checksum or user does not provide checksum information (vectors of files_checksums and files_checksum_func_names are both empty), verification of file checksum is always successful. If DB enables file checksum, DB will always generate the checksum for each ingested SST file during Prepare stage of ingestion and store the checksum in Manifest, unless verify_file_checksum is False and checksum information is provided by the application. In this case, we only verify the checksum function name and directly store the ingested checksum in Manifest. If verify_file_checksum is set to True, DB will verify the ingested checksum and function name with the generated ones. Any mismatch will fail the ingestion. Note that, if IngestExternalFileOptions::write_global_seqno is True, the seqno will be changed in the ingested file. Therefore, the checksum of the file will be changed. In this case, a new checksum will be generated after the seqno is updated and be stored in the Manifest.
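
The value_size_soft_limit behavior described above can be modelled with a standalone sketch. It uses an in-memory std::map in place of a DB, checks the limit per key rather than per batch, and all names (`LookupStatus`, `LookupResult`, `MultiLookup`) are hypothetical. It only illustrates the rule that keys remaining after the cumulative size exceeds the limit are aborted without being looked up.

```cpp
#include <cstdint>
#include <limits>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-ins; not RocksDB types.
enum class LookupStatus { kOk, kNotFound, kAborted };

struct LookupResult {
  LookupStatus status;
  std::string value;
};

// Looks keys up in order. Once the cumulative size of found values exceeds
// the soft limit, every remaining key is reported as aborted without a lookup.
inline std::vector<LookupResult> MultiLookup(
    const std::map<std::string, std::string>& store,
    const std::vector<std::string>& keys,
    uint64_t value_size_soft_limit = std::numeric_limits<uint64_t>::max()) {
  std::vector<LookupResult> results;
  uint64_t total = 0;  // cumulative size of values found so far
  for (const auto& key : keys) {
    if (total > value_size_soft_limit) {
      results.push_back({LookupStatus::kAborted, ""});
      continue;
    }
    auto it = store.find(key);
    if (it == store.end()) {
      results.push_back({LookupStatus::kNotFound, ""});
      continue;
    }
    total += it->second.size();
    results.push_back({LookupStatus::kOk, it->second});
  }
  return results;
}
```

Because the limit is soft, the value that pushes the total over the limit is still returned; only the keys after it are aborted.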

    🐎 Performance Improvements

    • Eliminate redundant key comparisons during random access in block-based tables.
  • v6.11 Changes

    June 12, 2020

    🐛 Bug Fixes

    • Fix consistency checking error swallowing in some cases when options.force_consistency_checks = true.
    • 🛠 Fix possible false NotFound status from batched MultiGet using index type kHashSearch.
    • 🛠 Fix corruption caused by enabling delete triggered compaction (NewCompactOnDeletionCollectorFactory) in universal compaction mode, along with parallel compactions. The bug can result in two parallel compactions picking the same input files, resulting in the DB resurrecting older and deleted versions of some keys.
    • Fix a use-after-free bug in best-efforts recovery. column_family_memtables_ needs to point to valid ColumnFamilySet.
    • Let best-efforts recovery ignore corrupted files during table loading.
    • 🛠 Fix corrupt key read from ingested file when iterator direction switches from reverse to forward at a key that is a prefix of another key in the same file. It is only possible in files with a non-zero global seqno.
    • 🛠 Fix abnormally large estimate from GetApproximateSizes when a range starts near the end of one SST file and near the beginning of another. Now GetApproximateSizes consistently and fairly includes the size of SST metadata in addition to data blocks, attributing metadata proportionally among the data blocks based on their size.
    • 🛠 Fix potential file descriptor leakage in PosixEnv's IsDirectory() and NewRandomAccessFile().
    • 🛠 Fix false negative from the VerifyChecksum() API when there is a checksum mismatch in an index partition block in a BlockBasedTable format table file (index_type is kTwoLevelIndexSearch).
    • 🛠 Fix sst_dump to return non-zero exit code if the specified file is not a recognized SST file or fails requested checks.
    • 🛠 Fix incorrect results from batched MultiGet for duplicate keys, when the duplicate key matches the largest key of an SST file and the value type for the key in the file is a merge value.
    • 🛠 Fix "bad block type" error from persistent cache on Windows.

    Public API Change

    • Flush(..., column_family) may return Status::ColumnFamilyDropped() instead of Status::InvalidArgument() if column_family is dropped while processing the flush request.
    • 0️⃣ BlobDB now explicitly disallows using the default column family's storage directories as blob directory.
    • ✂ DeleteRange now returns Status::InvalidArgument if the range's end key comes before its start key according to the user comparator. Previously the behavior was undefined.
    • ldb now uses options.force_consistency_checks = true by default and "--disable_consistency_checks" is added to disable it.
    • DB::OpenForReadOnly no longer creates files or directories if the named DB does not exist, unless create_if_missing is set to true.
    • The consistency checks that validate LSM state changes (table file additions/deletions during flushes and compactions) are now stricter, more efficient, and no longer optional, i.e. they are performed even if force_consistency_checks is false.
    • Disable delete triggered compaction (NewCompactOnDeletionCollectorFactory) in universal compaction mode and num_levels = 1 in order to avoid a corruption bug.
    • pin_l0_filter_and_index_blocks_in_cache no longer applies to L0 files larger than 1.5 * write_buffer_size to give more predictable memory usage. Such L0 files may exist due to intra-L0 compaction, external file ingestion, or user dynamically changing write_buffer_size (note, however, that files that are already pinned will continue being pinned, even after such a dynamic change).
    • In point-in-time wal recovery mode, fail database recovery in case of IOError while reading the WAL to avoid data loss.
    • A new method Env::LowerThreadPoolCPUPriority(Priority, CpuPriority) is added to Env to be able to lower to a specific priority such as CpuPriority::kIdle.
    • DB::GetDbSessionId(std::string& session_id) is added. session_id stores a unique identifier that gets reset every time the DB is opened. This DB session ID should be unique among all open DB instances on all hosts, and should be unique among re-openings of the same or other DBs. This identifier is recorded in the LOG file on the line starting with DB Session ID:.

    🆕 New Features

    • sst_dump adds a new --readahead_size argument. Users can specify the read size when scanning the data. sst_dump also tries to prefetch the tail part of the SST files, so some I/Os are usually saved there too.
    • Generate file checksum in SstFileWriter if Options.file_checksum_gen_factory is set. The checksum and checksum function name are stored in ExternalSstFileInfo after the sst file write is finished.
    • Add a value_size_soft_limit in read options which limits the cumulative value size of keys read in batches in MultiGet. Once the cumulative value size of found keys exceeds read_options.value_size_soft_limit, all the remaining keys are returned with status Abort without further finding their values. By default the value_size_soft_limit is std::numeric_limits<uint64_t>::max().
    • Enable SST file ingestion with file checksum information when calling IngestExternalFiles(const std::vector<IngestExternalFileArg>& args). Added files_checksums and files_checksum_func_names to IngestExternalFileArg such that user can ingest the sst files with their file checksum information. Added verify_file_checksum to IngestExternalFileOptions (default is True). To be backward compatible, if DB does not enable file checksum or user does not provide checksum information (vectors of files_checksums and files_checksum_func_names are both empty), verification of file checksum is always successful. If DB enables file checksum, DB will always generate the checksum for each ingested SST file during Prepare stage of ingestion and store the checksum in Manifest, unless verify_file_checksum is False and checksum information is provided by the application. In this case, we only verify the checksum function name and directly store the ingested checksum in Manifest. If verify_file_checksum is set to True, DB will verify the ingested checksum and function name with the generated ones. Any mismatch will fail the ingestion. Note that, if IngestExternalFileOptions::write_global_seqno is True, the seqno will be changed in the ingested file. Therefore, the checksum of the file will be changed. In this case, a new checksum will be generated after the seqno is updated and be stored in the Manifest.
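
The checksum rules for ingestion in the last entry above amount to a small decision table. The sketch below models that table only; it is not the ingestion code, every name in it is hypothetical, and it leaves out the write_global_seqno case, where a fresh checksum is generated after the seqno update.

```cpp
#include <string>

// Hypothetical stand-ins; not RocksDB types.
struct ChecksumInfo {
  std::string checksum;
  std::string func_name;
};

struct IngestOutcome {
  bool ok;                      // whether ingestion passes the checksum step
  bool store_in_manifest;       // whether a checksum is recorded
  std::string stored_checksum;  // the checksum that gets recorded
};

// `generated` is what the DB would compute with its own checksum function.
// `provided` is the application's checksum information, or nullptr if absent.
inline IngestOutcome ResolveIngestChecksum(bool db_checksum_enabled,
                                           const ChecksumInfo& generated,
                                           const ChecksumInfo* provided,
                                           bool verify_file_checksum) {
  // DB has file checksums disabled: verification always succeeds.
  if (!db_checksum_enabled) return {true, false, ""};
  // No checksum information supplied: the DB generates and stores its own.
  if (provided == nullptr) return {true, true, generated.checksum};
  // The function name is checked whether or not verification is on.
  if (provided->func_name != generated.func_name) return {false, false, ""};
  // Verification off: trust and store the ingested checksum directly.
  if (!verify_file_checksum) return {true, true, provided->checksum};
  // Verification on: the checksums themselves must match too.
  if (provided->checksum != generated.checksum) return {false, false, ""};
  return {true, true, generated.checksum};
}
```

The only path that stores a checksum the DB did not compute itself is the one with verify_file_checksum set to false and checksum information supplied by the application.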

    🐎 Performance Improvements

    • Eliminate redundant key comparisons during random access in block-based tables.
  • v6.10.2 Changes

    June 11, 2020

    6.10.2 (6/5/2020)

    🐛 Bug fix

    • 🛠 Fix false negative from the VerifyChecksum() API when there is a checksum mismatch in an index partition block in a BlockBasedTable format table file (index_type is kTwoLevelIndexSearch).
  • v6.10.1 Changes

    May 28, 2020

    6.10.1 (5/27/2020)

    🐛 Bug fix

    • ✂ Remove "u''" in TARGETS file.
    • Fix db_stress_lib target in buck.

    6.10 (5/2/2020)

    Behavior Changes

    • Disable delete triggered compaction (NewCompactOnDeletionCollectorFactory) in universal compaction mode and num_levels = 1 in order to avoid a corruption bug.

    🐛 Bug Fixes

    • 🛠 Fix wrong result being read from ingested file. May happen when a key in the file happens to be a prefix of another key also in the file. The issue can further cause more data corruption. The issue exists with rocksdb >= 5.0.0 since DB::IngestExternalFile() was introduced.
    • 👀 Finish implementation of BlockBasedTableOptions::IndexType::kBinarySearchWithFirstKey. It's now ready for use. Significantly reduces read amplification in some setups, especially for iterator seeks.
    • 🛠 Fix a bug by updating CURRENT file so that it points to the correct MANIFEST file after best-efforts recovery.
    • 🛠 Fixed a bug where ColumnFamilyHandle objects were not cleaned up in case an error happened during BlobDB's open after the base DB had been opened.
    • 🛠 Fix a potential undefined behavior caused by trying to dereference nullable pointer (timestamp argument) in DB::MultiGet.
    • 🛠 Fix a bug caused by not including user timestamp in MultiGet LookupKey construction. This can lead to wrong query result since the trailing bytes of a user key, if not shorter than timestamp, will be mistaken for user timestamp.
    • 🛠 Fix a bug caused by using wrong compare function when sorting the input keys of MultiGet with timestamps.
    • ⬆️ Upgraded version of bzip library (1.0.6 -> 1.0.8) used with RocksJava to address potential vulnerabilities if an attacker can manipulate compressed data saved and loaded by RocksDB (not normal). See issue #6703.
    • Fix consistency checking error swallowing in some cases when options.force_consistency_checks = true.
    • 🛠 Fix possible false NotFound status from batched MultiGet using index type kHashSearch.
    • 🛠 Fix corruption caused by enabling delete triggered compaction (NewCompactOnDeletionCollectorFactory) in universal compaction mode, along with parallel compactions. The bug can result in two parallel compactions picking the same input files, resulting in the DB resurrecting older and deleted versions of some keys.
    • Fix a use-after-free bug in best-efforts recovery. column_family_memtables_ needs to point to valid ColumnFamilySet.
    • Let best-efforts recovery ignore corrupted files during table loading.
    • Fix a bug when making options.bottommost_compression, options.compression_opts and options.bottommost_compression_opts dynamically changeable: the modified values are not written to option files or returned back to users when being queried.
    • Fix a bug where index key comparisons were unaccounted in PerfContext::user_key_comparison_count for lookups in files written with format_version >= 3.
    • 🛠 Fix many bloom.filter statistics not being updated in batch MultiGet.
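
The MultiGet timestamp bug in the list above (trailing bytes of a user key mistaken for the user timestamp) can be reproduced in miniature. This sketch assumes the timestamp is stored as a fixed-size suffix of the key, which is what the wording of the fix implies. The helper names are hypothetical and the 8-byte timestamp size is only an example.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: a lookup key is the user key followed by a
// fixed-size timestamp suffix (an assumption of this sketch).
inline std::string AppendTimestamp(const std::string& user_key,
                                   const std::string& ts) {
  return user_key + ts;
}

// Hypothetical helper: drops the last ts_size bytes, treating them as the
// timestamp. Keys shorter than ts_size yield an empty string here.
inline std::string StripTimestamp(const std::string& key, std::size_t ts_size) {
  return key.size() < ts_size ? std::string()
                              : key.substr(0, key.size() - ts_size);
}
```

If a lookup key is built without the timestamp, stripping 8 bytes from "customer_0001" leaves "custo", so the lookup targets the wrong user key. With the timestamp appended first, stripping recovers the original key.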

    Public API Change

    • Add a ConfigOptions argument to the APIs dealing with converting options to and from strings and files. The ConfigOptions is meant to replace some of the options (such as input_strings_escaped and ignore_unknown_options) and allow for more parameters to be passed in the future without changing the function signature.
    • ➕ Add NewFileChecksumGenCrc32cFactory to the file checksum public API, such that the builtin Crc32c based file checksum generator factory can be used by applications.
    • ➕ Add IsDirectory to Env and FS to indicate if a path is a directory.
    • ldb now uses options.force_consistency_checks = true by default and "--disable_consistency_checks" is added to disable it.
    • ➕ Add ReadOptions::deadline to allow users to specify a deadline for MultiGet requests.

    🆕 New Features

    • ➕ Added support for pipelined & parallel compression optimization for BlockBasedTableBuilder. This optimization makes block building, block compression and block appending a pipeline, and uses multiple threads to accelerate block compression. Users can set CompressionOptions::parallel_threads greater than 1 to enable compression parallelism. This feature is experimental for now.
    • Provide an allocator for memkind to be used with block cache. This is to work with memory technologies (Intel DCPMM is one such technology currently available) that require different libraries for allocation and management (such as PMDK and memkind). The high capacities available make it possible to provision large caches (up to several TBs in size) beyond what is achievable with DRAM.
    • Option max_background_flushes can be set dynamically using DB::SetDBOptions().
    • πŸ–¨ Added functionality in sst_dump tool to check the compressed file size for different compression levels and print the time spent on compressing files with each compression type. Added arguments --compression_level_from and --compression_level_to to report size of all compression levels and one compression_type must be specified with it so that it will report compressed sizes of one compression type with different levels.
    • ➕ Added statistics for redundant insertions into block cache: rocksdb.block.cache.*add.redundant. (There is currently no coordination to ensure that only one thread loads a table block when many threads are trying to access that same table block.)

    🐎 Performance Improvements

    • πŸ‘Œ Improve performance of batch MultiGet with partitioned filters, by sharing block cache lookups to applicable filter blocks.
    • ⬇️ Reduced memory copies when fetching and uncompressing compressed blocks from sst files.
  • v6.10 Changes

    May 02, 2020

    Bug Fixes

    • Fix wrong result being read from ingested file. May happen when a key in the file happens to be a prefix of another key also in the file. The issue can further cause more data corruption. The issue exists with rocksdb >= 5.0.0 since DB::IngestExternalFile() was introduced.
    • Finish implementation of BlockBasedTableOptions::IndexType::kBinarySearchWithFirstKey. It's now ready for use. Significantly reduces read amplification in some setups, especially for iterator seeks.
    • Fix a bug by updating the CURRENT file so that it points to the correct MANIFEST file after best-efforts recovery.
    • Fixed a bug where ColumnFamilyHandle objects were not cleaned up in case an error happened during BlobDB's open after the base DB had been opened.
    • Fix a potential undefined behavior caused by trying to dereference a nullable pointer (timestamp argument) in DB::MultiGet.
    • Fix a bug caused by not including user timestamp in MultiGet LookupKey construction. This can lead to wrong query results since the trailing bytes of a user key, if not shorter than timestamp, will be mistaken for user timestamp.
    • Fix a bug caused by using the wrong compare function when sorting the input keys of MultiGet with timestamps.
    • Upgraded version of bzip library (1.0.6 -> 1.0.8) used with RocksJava to address potential vulnerabilities if an attacker can manipulate compressed data saved and loaded by RocksDB (not normal). See issue #6703.

    Public API Change

    • Add a ConfigOptions argument to the APIs dealing with converting options to and from strings and files. The ConfigOptions is meant to replace some of the options (such as input_strings_escaped and ignore_unknown_options) and allow for more parameters to be passed in the future without changing the function signature.
    • Add NewFileChecksumGenCrc32cFactory to the file checksum public API, such that the builtin Crc32c based file checksum generator factory can be used by applications.
    • Add IsDirectory to Env and FS to indicate if a path is a directory.

    New Features

    • Added support for pipelined & parallel compression optimization for BlockBasedTableBuilder. This optimization makes block building, block compression and block appending a pipeline, and uses multiple threads to accelerate block compression. Users can set CompressionOptions::parallel_threads greater than 1 to enable compression parallelism. This feature is experimental for now.
    • Provide an allocator for memkind to be used with block cache. This is to work with memory technologies (Intel DCPMM is one such technology currently available) that require different libraries for allocation and management (such as PMDK and memkind). The high capacities available make it possible to provision large caches (up to several TBs in size) beyond what is achievable with DRAM.
    • Option max_background_flushes can be set dynamically using DB::SetDBOptions().
    • Added functionality in the sst_dump tool to check the compressed file size for different compression levels and print the time spent compressing files with each compression type. The new arguments --compression_level_from and --compression_level_to report the compressed size for all compression levels; one compression_type must be specified with them so that the tool reports compressed sizes of that one compression type at different levels.
    • Added statistics for redundant insertions into block cache: rocksdb.block.cache.*add.redundant. (There is currently no coordination to ensure that only one thread loads a table block when many threads are trying to access that same table block.)
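
As a rough illustration of what a "redundant insertion" means for the new block cache statistics, the sketch below counts an insert as redundant when the key is already cached, which is what happens when two threads load the same block independently. ToyBlockCache and its counters are hypothetical names for this example and do not reflect RocksDB's Cache interface or its statistics plumbing.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>

// Toy block cache that counts insertions and redundant insertions.
// Hypothetical type for illustration only.
class ToyBlockCache {
 public:
  // Insert a block; if the key is already cached, count it as redundant.
  void Insert(const std::string& key, const std::string& block) {
    ++add_count_;
    auto result = blocks_.emplace(key, block);
    if (!result.second) {
      ++add_redundant_count_;
    }
  }

  std::uint64_t add_count() const { return add_count_; }
  std::uint64_t add_redundant_count() const { return add_redundant_count_; }

 private:
  std::unordered_map<std::string, std::string> blocks_;
  std::uint64_t add_count_ = 0;
  std::uint64_t add_redundant_count_ = 0;
};
```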

    Bug Fixes

    • Fix a bug when making options.bottommost_compression, options.compression_opts and options.bottommost_compression_opts dynamically changeable: the modified values are not written to option files or returned back to users when being queried.
    • Fix a bug where index key comparisons were unaccounted in PerfContext::user_key_comparison_count for lookups in files written with format_version >= 3.
    • Fix many bloom.filter statistics not being updated in batch MultiGet.

    🐎 Performance Improvements

    • πŸ‘Œ Improve performance of batch MultiGet with partitioned filters, by sharing block cache lookups to applicable filter blocks.
    • ⬇️ Reduced memory copies when fetching and uncompressing compressed blocks from sst files.
  • v6.9.0 Changes

    March 29, 2020

    Behavior changes

    • Since RocksDB 6.8, ttl-based FIFO compaction can drop a file whose oldest key becomes older than options.ttl while others have not. This fix reverts this and makes ttl-based FIFO compaction use the file's flush time as the criterion. This fix also requires that max_open_files = -1 and compaction_options_fifo.allow_compaction = false to function properly.
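
The restored criterion can be pictured as a simple filter over per-file flush times. This is a minimal sketch under stated assumptions, not the RocksDB compaction picker: SstFileInfo, its field names, and PickExpiredFiles are hypothetical, and times are plain seconds.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical per-file metadata; the field names are assumptions.
struct SstFileInfo {
  std::string name;
  std::uint64_t flush_time_sec;  // when the file was flushed
};

// Return the names of files whose flush time is more than `ttl_sec`
// seconds before `now_sec`. Key ages inside the file are not consulted.
std::vector<std::string> PickExpiredFiles(const std::vector<SstFileInfo>& files,
                                          std::uint64_t now_sec,
                                          std::uint64_t ttl_sec) {
  std::vector<std::string> expired;
  for (const SstFileInfo& f : files) {
    if (now_sec > f.flush_time_sec && now_sec - f.flush_time_sec > ttl_sec) {
      expired.push_back(f.name);
    }
  }
  return expired;
}
```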

    Public API Change

    • Fix spelling so that API now has correctly spelled transaction state name COMMITTED, while the old misspelled COMMITED is still available as an alias.
    • Updated default format_version in BlockBasedTableOptions from 2 to 4. SST files generated with the new default can be read by RocksDB versions 5.16 and newer, and use more efficient encoding of keys in index blocks.
    • A new parameter CreateBackupOptions is added to both BackupEngine::CreateNewBackup and BackupEngine::CreateNewBackupWithMetadata. You can decrease the CPU priority of BackupEngine's background threads by setting decrease_background_thread_cpu_priority and background_thread_cpu_priority in CreateBackupOptions.
    • Updated the public API of SST file checksum. Introduced the FileChecksumGenFactory to create a FileChecksumGenerator for each SST file, so that the FileChecksumGenerator is not shared and can be more general for checksum implementations. Changed the FileChecksumGenerator interface from Value, Extend, and GetChecksum to Update, Finalize, and GetChecksum. Finalize should be called only once, after all data is processed, to generate the final checksum. Temporary data should be maintained by the FileChecksumGenerator object itself, which can finally return the checksum string.

    Bug Fixes

    • Fix a bug where range tombstone blocks in ingested files were cached incorrectly during ingestion. If range tombstones were read from those incorrectly cached blocks, the keys they covered would be exposed.
    • Fix a data race that, with a small chance, might cause a crash when calling DB::GetCreationTimeOfOldestFile(). The bug was introduced in the 6.6 release.
    • Fix a bug where the boolean value optimize_filters_for_hits was used for max threads when loading table handles after a flush or compaction. The value is corrected to 1. The bug should not cause user-visible problems.
    • Fix a bug which might crash the service when write buffer manager fails to insert the dummy handle to the block cache.

    🐎 Performance Improvements

    • In CompactRange, for levels starting from 0, if the level does not have any file with any key falling in the specified range, the level is skipped. So instead of always compacting from level 0, the compaction starts from the first level with keys in the specified range until the last such level.
    • ⬇️ Reduced memory copy when reading sst footer and blobdb in direct IO mode.
    • ⚑️ When restarting a database with large numbers of sst files, large amount of CPU time is spent on getting logical block size of the sst files, which slows down the starting progress, this inefficiency is optimized away with an internal cache for the logical block sizes.

    New Features

    • Basic support for user timestamp in iterator. Seek/SeekToFirst/Next and lower/upper bounds are supported. Reverse iteration is not supported. Merge is not considered.
    • When a file lock fails because the lock is held by the current process, return the acquiring time and thread ID in the error message.
    • Added a new option, best_efforts_recovery (default: false), to allow database to open in a db dir with missing table files. During best efforts recovery, missing table files are ignored, and database recovers to the most recent state without missing table file. Cross-column-family consistency is not guaranteed even if WAL is enabled.
    • options.bottommost_compression, options.compression_opts and options.bottommost_compression_opts are now dynamically changeable.
  • v6.8.1 Changes

    April 25, 2020

    6.8.1 (03/30/2020)

    Behavior changes

    • Since RocksDB 6.8.0, ttl-based FIFO compaction can drop a file whose oldest key becomes older than options.ttl while others have not. This fix reverts this and makes ttl-based FIFO compaction use the file's flush time as the criterion. This fix also requires that max_open_files = -1 and compaction_options_fifo.allow_compaction = false to function properly.

    6.8.0 (02/24/2020)

    Java API Changes

    • Major breaking changes to Java comparators, toward standardizing on ByteBuffer for performant, locale-neutral operations on keys (#6252).
    • Added overloads of common API methods using direct ByteBuffers for keys and values (#2283).

    Bug Fixes

    • Fix incorrect results while block-based table uses kHashSearch, together with Prev()/SeekForPrev().
    • Fix a bug that prevents opening a DB after two consecutive crashes with TransactionDB, where the first crash recovers from a corrupted WAL with kPointInTimeRecovery but the second cannot.
    • Fixed issue #6316 that can cause a corruption of the MANIFEST file in the middle when writing to it fails due to no disk space.
    • Add DBOptions::skip_checking_sst_file_sizes_on_db_open. It disables potentially expensive checking of all sst file sizes in DB::Open().
    • ⚑️ BlobDB now ignores trivially moved files when updating the mapping between blob files and SSTs. This should mitigate issue #6338 where out of order flush/compaction notifications could trigger an assertion with the earlier code.
    • Batched MultiGet() ignores IO errors while reading data blocks, causing it to potentially continue looking for a key and returning stale results.
    • WriteBatchWithIndex::DeleteRange returns Status::NotSupported. Previously it returned success even though reads on the batch did not account for range tombstones. The corresponding language bindings now cannot be used. In C, that includes rocksdb_writebatch_wi_delete_range, rocksdb_writebatch_wi_delete_range_cf, rocksdb_writebatch_wi_delete_rangev, and rocksdb_writebatch_wi_delete_rangev_cf. In Java, that includes WriteBatchWithIndex::deleteRange.
    • Assign new MANIFEST file number when caller tries to create a new MANIFEST by calling LogAndApply(..., new_descriptor_log=true). This bug can cause MANIFEST being overwritten during recovery if options.write_dbid_to_manifest = true and there are WAL file(s).

    🐎 Performance Improvements

    • Perfom readahead when reading from option files. Inside DB, options.log_readahead_size will be used as the readahead size. In other cases, a default 512KB is used.

    Public API Change

    • The BlobDB garbage collector now emits the statistics BLOB_DB_GC_NUM_FILES (number of blob files obsoleted during GC), BLOB_DB_GC_NUM_NEW_FILES (number of new blob files generated during GC), BLOB_DB_GC_FAILURES (number of failed GC passes), BLOB_DB_GC_NUM_KEYS_RELOCATED (number of blobs relocated during GC), and BLOB_DB_GC_BYTES_RELOCATED (total size of blobs relocated during GC). On the other hand, the following statistics, which are not relevant for the new GC implementation, are now deprecated: BLOB_DB_GC_NUM_KEYS_OVERWRITTEN, BLOB_DB_GC_NUM_KEYS_EXPIRED, BLOB_DB_GC_BYTES_OVERWRITTEN, BLOB_DB_GC_BYTES_EXPIRED, and BLOB_DB_GC_MICROS.
    • 🌲 Disable recycle_log_file_num when an inconsistent recovery modes are requested: kPointInTimeRecovery and kAbsoluteConsistency

    πŸ†• New Features

    • Added the checksum for each SST file generated by Flush or Compaction. Added sst_file_checksum_func to Options such that user can plugin their own SST file checksum function via override the FileChecksumFunc class. If user does not set the sst_file_checksum_func, SST file checksum calculation will not be enabled. The checksum information inlcuding uint32_t checksum value and a checksum function name (string). The checksum information is stored in FileMetadata in version store and also logged to MANIFEST. A new tool is added to LDB such that user can dump out a list of file checksum information from MANIFEST (stored in an unordered_map).
    • πŸ‘ db_bench now supports value_size_distribution_type, value_size_min, value_size_max options for generating random variable sized value. Added blob_db_compression_type option for BlobDB to enable blob compression.
    • Replace RocksDB namespace "rocksdb" with flag "ROCKSDB_NAMESPACE" which if is not defined, defined as "rocksdb" in header file rocksdb_namespace.h.
  • v6.8.0 Changes

    February 24, 2020

    Java API Changes

    • Major breaking changes to Java comparators, toward standardizing on ByteBuffer for performant, locale-neutral operations on keys (#6252).
    • Added overloads of common API methods using direct ByteBuffers for keys and values (#2283).

    Bug Fixes

    • Fix incorrect results while block-based table uses kHashSearch, together with Prev()/SeekForPrev().
    • Fix a bug that prevents opening a DB after two consecutive crashes with TransactionDB, where the first crash recovers from a corrupted WAL with kPointInTimeRecovery but the second cannot.
    • Fixed issue #6316 that can cause a corruption of the MANIFEST file in the middle when writing to it fails due to no disk space.
    • Add DBOptions::skip_checking_sst_file_sizes_on_db_open. It disables potentially expensive checking of all sst file sizes in DB::Open().
    • ⚑️ BlobDB now ignores trivially moved files when updating the mapping between blob files and SSTs. This should mitigate issue #6338 where out of order flush/compaction notifications could trigger an assertion with the earlier code.
    • Batched MultiGet() ignores IO errors while reading data blocks, causing it to potentially continue looking for a key and returning stale results.
    • WriteBatchWithIndex::DeleteRange returns Status::NotSupported. Previously it returned success even though reads on the batch did not account for range tombstones. The corresponding language bindings now cannot be used. In C, that includes rocksdb_writebatch_wi_delete_range, rocksdb_writebatch_wi_delete_range_cf, rocksdb_writebatch_wi_delete_rangev, and rocksdb_writebatch_wi_delete_rangev_cf. In Java, that includes WriteBatchWithIndex::deleteRange.
    • Assign new MANIFEST file number when caller tries to create a new MANIFEST by calling LogAndApply(..., new_descriptor_log=true). This bug can cause MANIFEST being overwritten during recovery if options.write_dbid_to_manifest = true and there are WAL file(s).

    🐎 Performance Improvements

    • Perfom readahead when reading from option files. Inside DB, options.log_readahead_size will be used as the readahead size. In other cases, a default 512KB is used.

    Public API Change

    • The BlobDB garbage collector now emits the statistics BLOB_DB_GC_NUM_FILES (number of blob files obsoleted during GC), BLOB_DB_GC_NUM_NEW_FILES (number of new blob files generated during GC), BLOB_DB_GC_FAILURES (number of failed GC passes), BLOB_DB_GC_NUM_KEYS_RELOCATED (number of blobs relocated during GC), and BLOB_DB_GC_BYTES_RELOCATED (total size of blobs relocated during GC). On the other hand, the following statistics, which are not relevant for the new GC implementation, are now deprecated: BLOB_DB_GC_NUM_KEYS_OVERWRITTEN, BLOB_DB_GC_NUM_KEYS_EXPIRED, BLOB_DB_GC_BYTES_OVERWRITTEN, BLOB_DB_GC_BYTES_EXPIRED, and BLOB_DB_GC_MICROS.
    • 🌲 Disable recycle_log_file_num when an inconsistent recovery modes are requested: kPointInTimeRecovery and kAbsoluteConsistency

    πŸ†• New Features

    • Added the checksum for each SST file generated by Flush or Compaction. Added sst_file_checksum_func to Options such that user can plugin their own SST file checksum function via override the FileChecksumFunc class. If user does not set the sst_file_checksum_func, SST file checksum calculation will not be enabled. The checksum information inlcuding uint32_t checksum value and a checksum function name (string). The checksum information is stored in FileMetadata in version store and also logged to MANIFEST. A new tool is added to LDB such that user can dump out a list of file checksum information from MANIFEST (stored in an unordered_map).
    • πŸ‘ db_bench now supports value_size_distribution_type, value_size_min, value_size_max options for generating random variable sized value. Added blob_db_compression_type option for BlobDB to enable blob compression.
    • Replace RocksDB namespace "rocksdb" with flag "ROCKSDB_NAMESPACE" which if is not defined, defined as "rocksdb" in header file rocksdb_namespace.h.
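
The namespace change in the last item follows a common preprocessor pattern: define the macro only if the build has not already done so, then open the namespace through the macro. The sketch below shows the pattern in isolation; LibraryGreeting is a hypothetical function standing in for library code, and the exact contents of rocksdb_namespace.h may differ.

```cpp
#include <cassert>
#include <string>

// Fall back to "rocksdb" unless the build already defined the macro,
// for example with a compiler flag such as -DROCKSDB_NAMESPACE=my_rocksdb.
#ifndef ROCKSDB_NAMESPACE
#define ROCKSDB_NAMESPACE rocksdb
#endif

namespace ROCKSDB_NAMESPACE {

// Hypothetical function standing in for library code.
inline std::string LibraryGreeting() { return "hello from the library"; }

}  // namespace ROCKSDB_NAMESPACE
```

Callers that refer to ROCKSDB_NAMESPACE::LibraryGreeting() keep compiling whichever name the build selects, which is what lets two copies of the library coexist in one binary under different namespaces.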