RocksDB v6.12 Release Notes

Release Date: 2020-07-28
  • Public API Change

    • Encryption file classes now exposed for inheritance in env_encryption.h
    • File I/O listener is extended to cover more I/O operations. Now class EventListener in listener.h contains new callback functions: OnFileFlushFinish(), OnFileSyncFinish(), OnFileRangeSyncFinish(), OnFileTruncateFinish(), and OnFileCloseFinish().
    • FileOperationInfo now reports duration measured by std::chrono::steady_clock and start_ts measured by std::chrono::system_clock, instead of start and finish timestamps measured by system_clock. Note that system_clock is called before steady_clock in program order when an operation starts.
    • DB::GetDbSessionId(std::string& session_id) is added. session_id stores a unique identifier that gets reset every time the DB is opened. This DB session ID should be unique among all open DB instances on all hosts, and should be unique among re-openings of the same or other DBs. This identifier is recorded in the LOG file on the line starting with "DB Session ID:".
    • DB::OpenForReadOnly() now returns Status::NotFound when the specified DB directory does not exist. Previously the error returned depended on the underlying Env. This change is available in all 6.11 releases as well.
    • A parameter verify_with_checksum is added to BackupEngine::VerifyBackup, which is false by default. If it is true, BackupEngine::VerifyBackup verifies checksums and file sizes of backup files. Pass false for verify_with_checksum to maintain the previous behavior and performance of BackupEngine::VerifyBackup, by only verifying sizes of backup files.
    • Methods to configure, serialize, and compare objects such as TableFactory are exposed directly through the Configurable base class (from which these objects inherit). This change will allow for better and more thorough configuration management and retrieval in the future.

    Behavior Changes

    • Best-efforts recovery ignores CURRENT file completely. If CURRENT file is missing during recovery, best-efforts recovery still proceeds with MANIFEST file(s).
    • In best-efforts recovery, an error other than Corruption, IOError::kNotFound, or IOError::kPathNotFound was silently overwritten. This is fixed by checking all non-OK cases and returning early.
    • When file_checksum_gen_factory is set to GetFileChecksumGenCrc32cFactory(), BackupEngine will compare the crc32c checksums of table files computed when creating a backup to the expected checksums stored in the DB manifest, and will fail CreateNewBackup() on mismatch (corruption). If the file_checksum_gen_factory is not set or set to any other customized factory, there is no checksum verification to detect if SST files in a DB are corrupt when read, copied, and independently checksummed by BackupEngine.
    • When a DB sets stats_dump_period_sec > 0, either as the initial value for DB open or as a dynamic option change, the first stats dump is staggered in the following X seconds, where X is an integer in [0, stats_dump_period_sec). Subsequent stats dumps are still spaced stats_dump_period_sec seconds apart.
    • When the paranoid_file_checks option is true, a hash of all keys and values is generated when the SST file is written, and then the values are read back in to validate the file. A corruption is signaled if the two hashes do not match.
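
The write-then-read-back validation described for paranoid_file_checks can be illustrated with a small standard-library sketch. Everything here is hypothetical: the Entries type stands in for an SST file, and the FNV-1a style mixing is an assumption chosen for illustration, not the hash RocksDB uses internally.

```cpp
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the contents of an SST file: ordered key/value pairs.
using Entries = std::vector<std::pair<std::string, std::string>>;

// Folds one string into the running hash (FNV-1a style; illustrative only).
// The length is mixed in first so that key/value boundaries influence the result.
std::uint64_t MixString(std::uint64_t h, const std::string& s) {
  const std::uint64_t kPrime = 1099511628211ULL;
  h = (h ^ static_cast<std::uint64_t>(s.size())) * kPrime;
  for (unsigned char c : s) {
    h = (h ^ c) * kPrime;
  }
  return h;
}

// Order-sensitive hash over every key and value, as computed at write time.
std::uint64_t HashEntries(const Entries& entries) {
  std::uint64_t h = 14695981039346656037ULL;  // FNV-1a 64-bit offset basis
  for (const auto& kv : entries) {
    h = MixString(h, kv.first);
    h = MixString(h, kv.second);
  }
  return h;
}

// Re-hashes the entries read back from the file and compares against the hash
// recorded while writing. A mismatch corresponds to signaling corruption.
bool ValidateAfterWrite(std::uint64_t hash_at_write, const Entries& read_back) {
  return HashEntries(read_back) == hash_at_write;
}
```

In this sketch, a caller would record HashEntries() while writing, then pass the entries read back from storage to ValidateAfterWrite(); a false result plays the role of the corruption signal.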

    Bug fixes

    • Compressed block cache was automatically disabled for read-only DBs by mistake. This is now fixed: the compressed block cache takes effect with read-only DBs too.
    • Fix a bug that could produce a wrong iterator result if another thread finishes an update and a DB flush between two statements.
    • Disable file deletion after a MANIFEST write/sync failure until DB re-open or Resume(), so that a subsequent re-open will not see a MANIFEST referencing deleted SSTs.
    • Fix a bug when index_type == kTwoLevelIndexSearch in PartitionedIndexBuilder: FlushPolicy is now updated to point to the internal key partitioner when it changes from user-key mode to internal-key mode in index partition.
    • Make compaction report InternalKey corruption while iterating over the input.
    • Fix a bug that could make MultiGet slow because it could read more data than requested. Correctness was not affected. The bug was introduced in the 6.10 release.
    • Fail recovery and report once hitting a physical log record checksum mismatch while reading MANIFEST. RocksDB should not continue processing the MANIFEST any further.

    New Features

    • DB identity (db_id) and DB session identity (db_session_id) are added to table properties and stored in SST files. SST files generated from SstFileWriter and Repairer have DB identity "SST Writer" and "DB Repairer", respectively. Their DB session IDs are generated in the same way as DB::GetDbSessionId. The session ID for SstFileWriter (resp., Repairer) resets every time SstFileWriter::Open (resp., Repairer::Run) is called.
    • Added experimental option BlockBasedTableOptions::optimize_filters_for_memory for reducing allocated memory size of Bloom filters (~10% savings with Jemalloc) while preserving the same general accuracy. To have an effect, the option requires format_version=5 and malloc_usable_size. Enabling this option is forward and backward compatible with existing format_version=5.
    • BackupTableNameOption BackupableDBOptions::share_files_with_checksum_naming is added, where BackupTableNameOption is an enum type with two enumerators kChecksumAndFileSize and kOptionalChecksumAndDbSessionId. By default, BackupableDBOptions::share_files_with_checksum_naming is set to kOptionalChecksumAndDbSessionId. In the default case, backup table filenames generated by this version of RocksDB are either of the form <file_number>_<crc32c>_<db_session_id>.sst or <file_number>_<db_session_id>.sst, as opposed to <file_number>_<crc32c>_<file_size>.sst. Specifically, table filenames are of the form <file_number>_<crc32c>_<db_session_id>.sst if DBOptions::file_checksum_gen_factory is set to GetFileChecksumGenCrc32cFactory(). Furthermore, the checksum value <crc32c> appearing in the filenames is hexadecimal-encoded, instead of being a decimal-encoded uint32_t value. If DBOptions::file_checksum_gen_factory is nullptr, the table filenames are of the form <file_number>_<db_session_id>.sst. The new default behavior fixes the backup file name collision problem, which might be possible at large scale, but the option kChecksumAndFileSize is added to allow use of the old naming in case it is needed. Moreover, for table files generated prior to this version of RocksDB, using kOptionalChecksumAndDbSessionId will fall back on kChecksumAndFileSize. In these cases, the checksum value <crc32c> in the filenames <file_number>_<crc32c>_<file_size>.sst is a decimal-encoded uint32_t value as before. This default behavior change is not an upgrade issue, because previous versions of RocksDB can read, restore, and delete backups using new names, and it's OK for a backup directory to use a mixture of table file naming schemes. Note that share_files_with_checksum_naming comes into effect only when both share_files_with_checksum and share_table_files are true.
    • Added an auto resume function to automatically recover the DB from a background Retryable IO Error. When a Retryable IO Error happens during flush and WAL write, the error is mapped to Hard Error and the DB will be in read mode. When a Retryable IO Error happens during compaction, the error is mapped to Soft Error and the DB stays in write/read mode. The auto resume function creates a thread for a DB that calls DB->ResumeImpl() to try to recover from a Retryable IO Error during flush and WAL write. Compaction is rescheduled by itself if a Retryable IO Error happens. Auto resume may itself hit another Retryable IO Error during the recovery, causing the recovery to fail. Retrying the auto resume may solve the issue, so max_bgerror_resume_count decides how many resume cycles will be tried in total. If it is <= 0, auto resume for Retryable IO Error is disabled. The default is INT_MAX, which leads to an infinite auto resume. bgerror_resume_retry_interval decides the time interval between two auto resumes.
    • Option max_subcompactions can be set dynamically using DB::SetDBOptions().
    • Added experimental ColumnFamilyOptions::sst_partitioner_factory to determine the partitioning of sst files. This helps compaction to split the files on interesting boundaries (key prefixes) to make propagation of sst files less write amplifying (covering the whole key space).
    • Methods to configure, serialize, and compare objects such as TableFactory are exposed directly through the Configurable base class (from which these objects inherit). This change will allow for better and more thorough configuration management and retrieval in the future. The options for a Configurable object can be set via the ConfigureFromMap, ConfigureFromString, or ConfigureOption method. The serialized version of the options of an object can be retrieved via the GetOptionString, ToString, or GetOption methods. The list of options supported by an object can be obtained via the GetOptionNames method. The "raw" object (such as the BlockBasedTableOption) for an option may be retrieved via the GetOptions method. Configurable options can be compared via the AreEquivalent method. The settings within a Configurable object may be validated via the ValidateOptions method. The object may be initialized (at which point only mutable options may be updated) via the PrepareOptions method.
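
The backup table file naming rules described above for share_files_with_checksum_naming can be sketched with the standard library alone. This is an illustration under stated assumptions, not BackupEngine's code: the function BackupTableFileName and its parameters are hypothetical, an empty db_session_id is used here to model a table file generated before this version, and the zero-padded upper-case hex format is assumed (the notes only say the checksum is hexadecimal-encoded).

```cpp
#include <cstdint>
#include <cstdio>
#include <string>

// Mirrors the two enumerators named in the release notes.
enum class BackupTableNameOption {
  kChecksumAndFileSize,
  kOptionalChecksumAndDbSessionId,
};

// Hex-encodes a crc32c value. Width and letter case are assumptions.
std::string HexCrc(std::uint32_t crc) {
  char buf[9];
  std::snprintf(buf, sizeof(buf), "%08X", static_cast<unsigned>(crc));
  return std::string(buf);
}

// Hypothetical helper that builds a shared backup table file name.
// crc32c_factory_set models DBOptions::file_checksum_gen_factory being set to
// GetFileChecksumGenCrc32cFactory(); an empty db_session_id models an older
// table file, which falls back on the kChecksumAndFileSize naming.
std::string BackupTableFileName(BackupTableNameOption naming,
                                std::uint64_t file_number,
                                std::uint32_t crc32c,
                                std::uint64_t file_size,
                                const std::string& db_session_id,
                                bool crc32c_factory_set) {
  const std::string number = std::to_string(file_number);
  const bool legacy =
      naming == BackupTableNameOption::kChecksumAndFileSize ||
      db_session_id.empty();
  if (legacy) {
    // <file_number>_<crc32c>_<file_size>.sst with a decimal-encoded checksum.
    return number + "_" + std::to_string(crc32c) + "_" +
           std::to_string(file_size) + ".sst";
  }
  if (crc32c_factory_set) {
    // <file_number>_<crc32c>_<db_session_id>.sst with a hex-encoded checksum.
    return number + "_" + HexCrc(crc32c) + "_" + db_session_id + ".sst";
  }
  // <file_number>_<db_session_id>.sst when no checksum factory is configured.
  return number + "_" + db_session_id + ".sst";
}
```

With a made-up session ID "ABC123", file number 12, checksum 0xDEADBEEF, and file size 4096, this sketch yields 12_DEADBEEF_ABC123.sst under the new default with the crc32c factory set, 12_ABC123.sst without it, and 12_3735928559_4096.sst under kChecksumAndFileSize.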

    🐎 Performance Improvements

    • Eliminate key copies for internal comparisons while accessing ingested block-based tables.
    • ⬇️ Reduce key comparisons during random access in all block-based tables.
    • BackupEngine avoids unnecessary repeated checksum computation for backing up a table file to the shared_checksum directory when using kOptionalChecksumAndDbSessionId, except on SST files generated before this version of RocksDB, which fall back on using kChecksumAndFileSize.

    General Improvements

    • The settings of the DBOptions and ColumnFamilyOptions are now managed by Configurable objects (see New Features). The same convenience methods to configure these options still exist but the backend implementation has been unified under a common implementation.