RocksDB v6.12 Release Notes
Release Date: 2020-07-28
Public API Change
- Encryption file classes now exposed for inheritance in env_encryption.h
- File I/O listener is extended to cover more I/O operations. Now class `EventListener` in listener.h contains new callback functions: `OnFileFlushFinish()`, `OnFileSyncFinish()`, `OnFileRangeSyncFinish()`, `OnFileTruncateFinish()`, and `OnFileCloseFinish()`. `FileOperationInfo` now reports `duration` measured by `std::chrono::steady_clock` and `start_ts` measured by `std::chrono::system_clock` instead of start and finish timestamps measured by `system_clock`. Note that `system_clock` is called before `steady_clock` in program order when an operation starts.
- `DB::GetDbSessionId(std::string& session_id)` is added. `session_id` stores a unique identifier that gets reset every time the DB is opened. This DB session ID should be unique among all open DB instances on all hosts, and should be unique among re-openings of the same or other DBs. This identifier is recorded in the LOG file on the line starting with "DB Session ID:".
- `DB::OpenForReadOnly()` now returns `Status::NotFound` when the specified DB directory does not exist. Previously the error returned depended on the underlying `Env`. This change is available in all 6.11 releases as well.
- A parameter `verify_with_checksum` is added to `BackupEngine::VerifyBackup`, which is false by default. If it is true, `BackupEngine::VerifyBackup` verifies checksums and file sizes of backup files. Pass `false` for `verify_with_checksum` to maintain the previous behavior and performance of `BackupEngine::VerifyBackup`, by only verifying sizes of backup files.
- Methods to configure, serialize, and compare objects such as TableFactory are exposed directly through the Configurable base class (from which these objects inherit). This change will allow for better and more thorough configuration management and retrieval in the future.
Behavior Changes
- Best-efforts recovery ignores CURRENT file completely. If CURRENT file is missing during recovery, best-efforts recovery still proceeds with MANIFEST file(s).
- Previously, in best-efforts recovery, an error that was not Corruption, IOError::kNotFound, or IOError::kPathNotFound was silently overwritten. This is fixed by checking all non-ok cases and returning early.
- When `file_checksum_gen_factory` is set to `GetFileChecksumGenCrc32cFactory()`, BackupEngine will compare the crc32c checksums of table files computed when creating a backup to the expected checksums stored in the DB manifest, and will fail `CreateNewBackup()` on mismatch (corruption). If `file_checksum_gen_factory` is not set or is set to any other customized factory, there is no checksum verification to detect if SST files in a DB are corrupt when read, copied, and independently checksummed by BackupEngine.
- When a DB sets `stats_dump_period_sec > 0`, either as the initial value for DB open or as a dynamic option change, the first stats dump is staggered in the following X seconds, where X is an integer in `[0, stats_dump_period_sec)`. Subsequent stats dumps are still spaced `stats_dump_period_sec` seconds apart.
- When the paranoid_file_checks option is true, a hash of all keys and values is generated when the SST file is written, and then the values are read back in to validate the file. A corruption is signaled if the two hashes do not match.
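The write-then-read-back check described for paranoid_file_checks can be pictured with a small standalone sketch. Everything here is an assumption for illustration: the names `KeyValue`, `FoldString`, `HashEntries`, and `ReadBackMatches` are hypothetical, and FNV-1a is only a stand-in, since these notes do not say which hash RocksDB uses or how it is accumulated.

```cpp
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Hypothetical representation of one entry written to an SST file.
using KeyValue = std::pair<std::string, std::string>;

// Folds one string into a running 64-bit FNV-1a hash (stand-in hash only).
uint64_t FoldString(uint64_t hash, const std::string& s) {
  const uint64_t kPrime = 1099511628211ULL;
  // Mix in the length first so key/value boundaries influence the hash.
  hash = (hash ^ static_cast<uint64_t>(s.size())) * kPrime;
  for (unsigned char c : s) {
    hash = (hash ^ c) * kPrime;
  }
  return hash;
}

// Hash over every key and value, in iteration order.
uint64_t HashEntries(const std::vector<KeyValue>& entries) {
  uint64_t hash = 14695981039346656037ULL;  // FNV-1a 64-bit offset basis
  for (const KeyValue& kv : entries) {
    hash = FoldString(hash, kv.first);
    hash = FoldString(hash, kv.second);
  }
  return hash;
}

// True when the entries read back hash to the same value as the entries
// written; a false result stands for "signal corruption".
bool ReadBackMatches(const std::vector<KeyValue>& written,
                     const std::vector<KeyValue>& read_back) {
  return HashEntries(written) == HashEntries(read_back);
}
```

In this sketch a caller would compute the hash while writing, re-read the file, and treat a `false` result from `ReadBackMatches` as corruption.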
Bug fixes
- Compressed block cache was automatically disabled with read-only DBs by mistake. This is now fixed: compressed block cache is effective with read-only DBs too.
- Fix a bug that produced a wrong iterator result if another thread finished an update and a DB flush between two statements.
- Disable file deletion after MANIFEST write/sync failure until db re-open or Resume() so that subsequent re-open will not see MANIFEST referencing deleted SSTs.
- Fix a bug when index_type == kTwoLevelIndexSearch in PartitionedIndexBuilder to update FlushPolicy to point to internal key partitioner when it changes from user-key mode to internal-key mode in index partition.
- Make compaction report InternalKey corruption while iterating over the input.
- Fix a bug which may cause MultiGet to be slow because it may read more data than requested, but this won't affect correctness. The bug was introduced in 6.10 release.
- Fail recovery and report once hitting a physical log record checksum mismatch, while reading MANIFEST. RocksDB should not continue processing the MANIFEST any further.
New Features
- DB identity (`db_id`) and DB session identity (`db_session_id`) are added to table properties and stored in SST files. SST files generated from SstFileWriter and Repairer have DB identity "SST Writer" and "DB Repairer", respectively. Their DB session IDs are generated in the same way as `DB::GetDbSessionId`. The session ID for SstFileWriter (resp., Repairer) resets every time `SstFileWriter::Open` (resp., `Repairer::Run`) is called.
- Added experimental option BlockBasedTableOptions::optimize_filters_for_memory for reducing allocated memory size of Bloom filters (~10% savings with Jemalloc) while preserving the same general accuracy. To have an effect, the option requires format_version=5 and malloc_usable_size. Enabling this option is forward and backward compatible with existing format_version=5.
- `BackupTableNameOption BackupableDBOptions::share_files_with_checksum_naming` is added, where `BackupTableNameOption` is an `enum` type with two enumerators `kChecksumAndFileSize` and `kOptionalChecksumAndDbSessionId`. By default, `BackupableDBOptions::share_files_with_checksum_naming` is set to `kOptionalChecksumAndDbSessionId`. In the default case, backup table filenames generated by this version of RocksDB are of the form either `<file_number>_<crc32c>_<db_session_id>.sst` or `<file_number>_<db_session_id>.sst` as opposed to `<file_number>_<crc32c>_<file_size>.sst`. Specifically, table filenames are of the form `<file_number>_<crc32c>_<db_session_id>.sst` if `DBOptions::file_checksum_gen_factory` is set to `GetFileChecksumGenCrc32cFactory()`. Furthermore, the checksum value `<crc32c>` that appears in the filenames is hexadecimal-encoded, instead of being a decimal-encoded `uint32_t` value. If `DBOptions::file_checksum_gen_factory` is `nullptr`, the table filenames are of the form `<file_number>_<db_session_id>.sst`. The new default behavior fixes the backup file name collision problem, which might be possible at large scale, but the option `kChecksumAndFileSize` is added to allow use of old naming in case it is needed. Moreover, for table files generated prior to this version of RocksDB, using `kOptionalChecksumAndDbSessionId` will fall back on `kChecksumAndFileSize`. In these cases, the checksum value `<crc32c>` in the filenames `<file_number>_<crc32c>_<file_size>.sst` is a decimal-encoded `uint32_t` value as before. This default behavior change is not an upgrade issue, because previous versions of RocksDB can read, restore, and delete backups using new names, and it's OK for a backup directory to use a mixture of table file naming schemes. Note that `share_files_with_checksum_naming` comes into effect only when both `share_files_with_checksum` and `share_table_files` are true.
- Added an auto resume function to automatically recover the DB from background retryable IO errors. When a retryable IO error happens during flush or WAL write, the error is mapped to a hard error and the DB is put in read mode. When a retryable IO error happens during compaction, the error is mapped to a soft error and the DB remains in write/read mode. The auto resume function creates a thread for a DB to call DB->ResumeImpl() to try to recover from a retryable IO error during flush or WAL write. Compaction will be rescheduled by itself if a retryable IO error happens. Auto resume may itself hit another retryable IO error during recovery, in which case the recovery fails. Retrying the auto resume may solve the issue, so max_bgerror_resume_count decides how many resume cycles will be tried in total. If it is <=0, auto resume on retryable IO errors is disabled. The default is INT_MAX, which leads to infinite auto resume. bgerror_resume_retry_interval decides the time interval between two auto resumes.
- Option `max_subcompactions` can be set dynamically using DB::SetDBOptions().
- Added experimental ColumnFamilyOptions::sst_partitioner_factory to determine the partitioning of sst files. This helps compaction to split the files on interesting boundaries (key prefixes) to make propagation of sst files less write amplifying (covering the whole key space).
- Methods to configure, serialize, and compare objects such as TableFactory are exposed directly through the Configurable base class (from which these objects inherit). This change will allow for better and more thorough configuration management and retrieval in the future. The options for a Configurable object can be set via the ConfigureFromMap, ConfigureFromString, or ConfigureOption method. The serialized version of the options of an object can be retrieved via the GetOptionString, ToString, or GetOption methods. The list of options supported by an object can be obtained via the GetOptionNames method. The "raw" object (such as the BlockBasedTableOption) for an option may be retrieved via the GetOptions method. Configurable options can be compared via the AreEquivalent method. The settings within a Configurable object may be validated via the ValidateOptions method. The object may be initialized (at which point only mutable options may be updated) via the PrepareOptions method.
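The retry policy described for auto resume in this list (a bounded number of resume cycles, spaced by a fixed interval, disabled when the count is <=0) can be sketched in isolation. This is not RocksDB's implementation: `AutoResume` and `try_resume` are hypothetical names, `try_resume` merely stands in for the DB's resume call, and the microsecond unit for the interval is an assumption, since these notes do not state the unit.

```cpp
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical stand-in for the auto resume loop. `try_resume` returns true
// when one resume cycle succeeds. Returns true if any cycle succeeded.
bool AutoResume(const std::function<bool()>& try_resume,
                int max_bgerror_resume_count,
                std::chrono::microseconds bgerror_resume_retry_interval) {
  // A count <= 0 disables auto resume: no cycle is attempted at all.
  if (max_bgerror_resume_count <= 0) {
    return false;
  }
  for (int cycle = 0; cycle < max_bgerror_resume_count; ++cycle) {
    if (try_resume()) {
      return true;  // recovered, stop retrying
    }
    // Wait before the next cycle, but not after the final failed one.
    if (cycle + 1 < max_bgerror_resume_count) {
      std::this_thread::sleep_for(bgerror_resume_retry_interval);
    }
  }
  return false;  // every allowed cycle failed
}
```

With the default count of INT_MAX, a loop of this shape keeps retrying for practical purposes indefinitely, which matches the "infinite auto resume" wording above.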
Performance Improvements
- Eliminate key copies for internal comparisons while accessing ingested block-based tables.
- Reduce key comparisons during random access in all block-based tables.
- BackupEngine avoids unnecessary repeated checksum computation for backing up a table file to the `shared_checksum` directory when using `kOptionalChecksumAndDbSessionId`, except on SST files generated before this version of RocksDB, which fall back on using `kChecksumAndFileSize`.
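The backup table file naming rules for `kOptionalChecksumAndDbSessionId` and `kChecksumAndFileSize`, including the fallback for older SST files, can be summarized in a short standalone sketch. The enum `NamingScheme`, the struct `TableFileInfo`, and the function `BackupTableFileName` are hypothetical and only mirror the rules stated in these notes. Checksums are passed in as already-encoded strings because the exact hexadecimal formatting is not given here.

```cpp
#include <cstdint>
#include <string>

// Hypothetical mirror of the two naming schemes named in these notes.
enum class NamingScheme { kChecksumAndFileSize, kOptionalChecksumAndDbSessionId };

// Hypothetical description of one table file to back up.
struct TableFileInfo {
  uint64_t file_number;
  uint64_t file_size;
  std::string db_session_id;   // empty for files written before this version
  std::string crc32c_hex;      // empty when no crc32c checksum factory is set
  std::string crc32c_decimal;  // decimal form used by the old naming
};

std::string BackupTableFileName(NamingScheme scheme, const TableFileInfo& f) {
  const std::string number = std::to_string(f.file_number);
  const bool use_session_id =
      scheme == NamingScheme::kOptionalChecksumAndDbSessionId &&
      !f.db_session_id.empty();
  if (use_session_id) {
    if (!f.crc32c_hex.empty()) {
      // <file_number>_<crc32c>_<db_session_id>.sst
      return number + "_" + f.crc32c_hex + "_" + f.db_session_id + ".sst";
    }
    // <file_number>_<db_session_id>.sst
    return number + "_" + f.db_session_id + ".sst";
  }
  // Old naming, also the fallback for files without a session ID:
  // <file_number>_<crc32c>_<file_size>.sst
  return number + "_" + f.crc32c_decimal + "_" + std::to_string(f.file_size) +
         ".sst";
}
```

Under these assumptions, a file numbered 12 with session ID `SESSIONID` and no checksum factory would be named `12_SESSIONID.sst`, while the same file without a recorded session ID would fall back to the size-based form.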
General Improvements
- The settings of the DBOptions and ColumnFamilyOptions are now managed by Configurable objects (see New Features). The same convenience methods to configure these options still exist but the backend implementation has been unified under a common implementation.