Changelog History
v0.24.3 Changes
November 18, 2020
🚀 Release 0.24.3
🆕 New functionality
- 👌 Support fstr for text features and embeddings. Issue #1293
🛠 Bugfixes:
- 🛠 Fix model apply speed regression introduced in 0.24.1
- 🛠 Different fixes in embeddings support: fixed apply and model serialization, fixed apply on texts and embeddings
- 🛠 Fixed virtual ensembles prediction - use proper scaling, fix apply (issue #1462)
- 🛠 Fix `score()` method for `RMSEWithUncertainty` (issue #1482): automatically use the correct `prediction_type` in `score()`
v0.24.2 Changes
October 07, 2020
Uncertainty prediction
- 👌 Supported uncertainty prediction for classification models.
- 🛠 Fixed RMSEWithUncertainty data uncertainty prediction - now it predicts variance, not standard deviation.
🆕 New functionality
- 👍 Allow categorical feature counters for the `MultiRMSE` loss function.
- `group_weight` parameter added to the `catboost.utils.eval_metric` method to allow passing weights for object groups. This makes it possible to correctly match weighted ranking metric computation when group weights are present (a short sketch follows this list).
- 🚚 Faster non-owning deserialization from memory with less memory overhead - moved some dynamically computed data to the model file; other data is computed lazily, only when needed.
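A minimal sketch of the `group_weight` parameter of `catboost.utils.eval_metric` mentioned above. The data is made up, and passing the group weight as one value per object is an assumption for illustration:

```python
# Illustrative only: compute a ranking metric with per-group weights.
from catboost.utils import eval_metric

labels   = [1, 0, 0, 1, 0, 1]                 # relevance labels
approxes = [0.9, 0.1, 0.3, 0.8, 0.2, 0.7]     # model predictions
groups   = [0, 0, 0, 1, 1, 1]                 # two query groups
weights  = [1.0, 1.0, 1.0, 2.0, 2.0, 2.0]     # group weight repeated per object

print(eval_metric(labels, approxes, 'NDCG', group_id=groups, group_weight=weights))
```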
Experimental functionality
- 👌 Supported embedding features as input and linear discriminant analysis for embeddings preprocessing. Try adding your embeddings as new columns containing arrays of embedding values in a pandas.DataFrame and passing the corresponding column names to the `Pool` constructor or the `fit` function with the `embedding_features=['EmbeddingFeaturesColumnName1', ...]` parameter (a short sketch follows). Another way to add your embedding vectors is the new column type `NumVector` in the column description file, plus a semicolon-separated embeddings column in your XSV file: `ClassLabel\t0.1;0.2;0.3\t...`.
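A minimal sketch of the DataFrame route described above; the column names and data are illustrative:

```python
# Illustrative only: pass an embedding column through a pandas DataFrame.
import numpy as np
import pandas as pd
from catboost import CatBoostClassifier, Pool

df = pd.DataFrame({
    'num_feature': [0.1, 0.4, 0.9, 0.3],
    'embedding': [np.random.rand(8) for _ in range(4)],  # one vector per object
})
labels = [0, 1, 1, 0]

pool = Pool(df, label=labels, embedding_features=['embedding'])
CatBoostClassifier(iterations=10, verbose=False).fit(pool)
```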
Educational materials
- Published new tutorial on uncertainty prediction.
🛠 Bugfixes:
- ⬇️ Reduced GPU memory usage in multi gpu training when there is no need to compute categorical feature counters.
- Now CatBoost allows specifying `use_weights` for metrics when the `auto_class_weights` parameter is set.
- Correctly handle NaN values in the `plot_predictions` function.
- 🛠 Fixed floating point precision related bugs during multiclass training with lots of objects; in our case the bug was triggered while training on 25 million objects on a single GPU card.
- Now the `average` parameter is passed to the TotalF1 metric while training on GPU.
- ➕ Added class labels checks
- Disallow feature remapping in model predict when there are empty feature names in the model.
v0.24.1 Changes
August 27, 2020
Uncertainty prediction
🚀 The main feature of this release is total uncertainty prediction support via virtual ensembles.
🖨 You can read the theoretical background in the preprint "Uncertainty in Gradient Boosting via Ensembles" from our research team.
We introduced a new training parameter `posterior_sampling` that allows estimating total uncertainty.
Setting `posterior_sampling=True` implies enabling Langevin boosting, setting `model_shrink_rate` to `1/(2*N)` and setting `diffusion_temperature` to `N`, where `N` is the dataset size.
The CatBoost object method `virtual_ensembles_predict` splits the model into `virtual_ensembles_count` submodels.
Calling `model.virtual_ensembles_predict(.., prediction_type='TotalUncertainty')` returns the mean prediction, variance (and knowledge uncertainty for models trained with the `RMSEWithUncertainty` loss function).
Calling `model.virtual_ensembles_predict(.., prediction_type='VirtEnsembles')` returns `virtual_ensembles_count` predictions of the virtual submodels for each object.
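A short sketch of the workflow described above; the data and parameter values are illustrative:

```python
# Illustrative only: total uncertainty prediction via virtual ensembles.
import numpy as np
from catboost import CatBoostRegressor

X = np.random.rand(1000, 10)
y = X[:, 0] + 0.1 * np.random.randn(1000)

model = CatBoostRegressor(
    iterations=200,
    loss_function='RMSEWithUncertainty',
    posterior_sampling=True,   # Langevin boosting with the settings described above
    verbose=False,
)
model.fit(X, y)

# Returns the mean prediction and uncertainty estimates per object (see text above).
preds = model.virtual_ensembles_predict(
    X, prediction_type='TotalUncertainty', virtual_ensembles_count=10
)
print(preds.shape)
```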
🆕 New functionality
- 👌 Supported non-owning model deserialization for models with categorical feature counters
Speedups
- 📜 We've made lots of speedups for sparse data loading. For example, on the Bosch sparse dataset, preprocessing got a 4.5x speedup when running with 28 threads.
🛠 Bugfixes:
v0.24 Changes
August 05, 2020
🆕 New functionality
- 0️⃣ We've finally implemented MVS sampling for GPU training. Switched default bootstrap algorithm to MVS for RMSE loss function while training on GPU
- Implemented near-zero cost model deserialization from a memory blob. Currently, if your model doesn't use categorical feature CTR counters or text features, you can deserialize the model from, for example, a memory-mapped file.
- Added the ability to load trained models from a binary string or a file-like stream. To load a model from a bytes string use `load_model(blob=b'....')`; to deserialize from a file-like stream use `load_model(stream=gzip.open('model.cbm.gz', 'rb'))` (a short sketch follows this list)
- 🛠 Fixed auto-learning rate estimation params for GPU
- 👌 Supported beta parameter for QuerySoftMax function on CPU and GPU
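A short sketch of the two loading paths described above; the file names are placeholders:

```python
# Illustrative only: deserialize a model from bytes and from a file-like stream.
import gzip
from catboost import CatBoost

from_bytes = CatBoost()
with open('model.cbm', 'rb') as f:
    from_bytes.load_model(blob=f.read())

from_stream = CatBoost()
from_stream.load_model(stream=gzip.open('model.cbm.gz', 'rb'))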
🆕 New losses and metrics
- 🆕 New loss function `RMSEWithUncertainty` - it allows estimating data uncertainty for trained regression models. The trained model gives you a two-element vector for each object, with the first element being the regression prediction and the second element an estimate of data uncertainty for that prediction.
Speedups
- Our team and our contributors (Thanks @dmsivkov!) have made major speedups for CPU training: kdd98 -9%, higgs -18%, msrank -28%
🛠 Bugfixes:
- 🛠 Fixed CatBoost model export as Python code
- 🛠 Fixed AUC metric creation
- Add text features to `model.feature_names_`. Issue #1314
- Allow models trained on datasets with NaN values (Min treatment) and without NaNs in `model_sum()` or as the base model in `init_model=`. Issue #1271
Educational materials
v0.23.2 Changes
May 26, 2020
🆕 New functionality
- Added `plot_partial_dependence` method in the python-package (for now it works for models with symmetric trees trained on datasets with numerical features only). Implemented by @felixandrer.
- Allowed using the `boost_from_average` option together with the `model_shrink_rate` option. In this case shrinkage is applied to the starting value.
- Added new `auto_class_weights` option in the python-package, R-package and CLI with possible values `Balanced` and `SqrtBalanced` (a short sketch follows this list). For `Balanced`, every class is weighted `maxSumWeightInClass / sumWeightInClass`, where `sumWeightInClass` is the sum of weights of all samples in this class (if no weights are present, each sample weight is 1) and `maxSumWeightInClass` is the maximum such sum among all classes. For `SqrtBalanced` the formula is `sqrt(maxSumWeightInClass / sumWeightInClass)`. This option is supported in binclass and multiclass tasks. Implemented by @egiby.
- Supported the `model_size_reg` option on GPU. Set to 0.5 by default (same as on CPU). This regularization works slightly differently on GPU: feature combinations are regularized more aggressively than on CPU. On CPU the cost of a combination equals the number of different feature values of this combination that are present in the training dataset. On GPU the cost of a combination equals the number of all possible different values of this combination. For example, if the combination contains two categorical features c1 and c2, the cost will be #categories in c1 * #categories in c2, even though many of the values of this combination might not be present in the dataset.
- Added calculation of exact Shapley values (see formula (2) from https://arxiv.org/pdf/1802.03888.pdf). By default, the estimation from this paper (Algorithm 2) is calculated, which is much faster. To use the exact mode, specify the `shap_calc_type` parameter of the `CatBoost.get_feature_importance` function as `"Exact"`. Implemented by @LordProtoss.
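A minimal sketch of the `auto_class_weights` option mentioned above; the data is illustrative:

```python
# Illustrative only: automatic class weighting for an imbalanced dataset.
from catboost import CatBoostClassifier

X = [[0], [1], [2], [3], [4], [5]]
y = [0, 0, 0, 0, 0, 1]  # imbalanced classes

model = CatBoostClassifier(
    iterations=50,
    auto_class_weights='Balanced',  # or 'SqrtBalanced'
    verbose=False,
)
model.fit(X, y)
```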
🛠 Bugfixes:
- 🛠 Fixed onnx converter for old onnx versions.
- Added
v0.23.1 Changes
May 15, 2020
🆕 New functionality
- A CatBoost model can now be easily converted into an ONNX object in Python with the `catboost.utils.convert_to_onnx_object` method (a short sketch follows this list). Implemented by @monkey0head
- 🔊 We now print metric options together with metric names as the metric description in error logs by default. This allows you to distinguish between metrics of the same type with different parameters. For example, if the user sets the weighted-average `TotalF1` metric, CatBoost will print `TotalF1:average=Weighted` as the corresponding metric column header in error logs. Implemented by @ivanychev
- Implemented PRAUC metric (issue #737). Thanks @azikmsu
- It's now possible to write a custom multiregression objective in Python. Thanks @azikmsu
- 👌 Supported nonsymmetric models export to PMML
- `class_weights` parameter accepts a dictionary with class name to class weight mapping
- Added `_get_tags()` method for compatibility with sklearn (issue #1282). Implemented by @crazyleg
- ✅ Lots of improvements in the .NET CatBoost library: implemented the IDisposable interface, split ML.NET-compatible and basic prediction classes into separate libraries, added basic UNIX compatibility, supported GPU model evaluation, fixed tests. Thanks @khanova
- 🔋 In addition to `first_feature_use_penalties` presented in the previous release, we added a new option `per_object_feature_penalties`, which considers feature usage on each object individually. For more details refer to the tutorial.
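A minimal sketch of the in-memory ONNX conversion mentioned above; it assumes the `onnx` package is installed, and the data is illustrative:

```python
# Illustrative only: convert a trained model to an ONNX object in memory.
from catboost import CatBoostRegressor
from catboost.utils import convert_to_onnx_object

model = CatBoostRegressor(iterations=10, verbose=False)
model.fit([[0.0, 1.0], [1.0, 0.0], [2.0, 3.0], [3.0, 1.0]], [0.1, 0.9, 0.4, 0.5])

onnx_model = convert_to_onnx_object(model)
with open('model.onnx', 'wb') as f:
    f.write(onnx_model.SerializeToString())
```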
💥 Breaking changes
- From now on we require an explicit `loss_function` param in the python `cv` method.
🛠 Bugfixes:
- 🛠 Fixed deprecation warning on import (issue #1269)
- 🛠 Fixed saved models logging_level/verbose parameters conflict (issue #696)
- 🛠 Fixed kappa metric - in some cases there was integer overflow; switched accumulation types to double
- 🛠 Fixed per float feature quantization settings defaults
Educational materials
- Extended shap values tutorial with summary plot examples. Thanks @azanovivan02
v0.23 Changes
April 25, 2020
🆕 New functionality
- It is now possible to train models on huge datasets that do not fit into CPU RAM.
👀 This can be accomplished by storing only quantized data in memory (it is many times smaller). Use the `catboost.utils.quantize` function to create a quantized `Pool` this way. See the usage example in issue #1116.
Implemented by @noxwell.
- The Python `Pool` class now has a `save_quantization_borders` method that allows saving the resulting borders into a file and using them for quantization of other datasets. Quantization can be a bottleneck of training, especially on GPU. Doing quantization once for several trainings can significantly reduce running time. For a large dataset it is recommended to perform quantization first, save the quantization borders, use them to quantize the validation dataset, and then use the quantized training and validation datasets for further training.
👉 Use the saved borders when quantizing other Pools by specifying the `input_borders` parameter of the `quantize` method (a short sketch follows this list).
Implemented by @noxwell.
- 👍 Training with text features is now supported on CPU
- 👀 It is now possible to set `border_count` > 255 for GPU training. This might be useful if you have a "golden feature", see docs.
- 🔋 Feature weights are implemented.
Specify weights for specific features by index or name like `feature_weights="FeatureName1:1.5,FeatureName2:0.5"`.
Scores for splits with these features will be multiplied by the corresponding weights.
Implemented by @Taube03.
- 🔋 Feature penalties can be used for cost-efficient gradient boosting.
👉 Penalties are specified in a similar fashion to feature weights, using the `first_feature_use_penalties` parameter.
This parameter penalizes the first usage of a feature. It should be used when the calculation of the feature is costly.
The penalty value (or the cost of using a feature) is subtracted from the scores of this feature's splits if the feature has not been used in the model yet.
🆓 After the feature has been used once, it is considered free to keep using it, so no subtraction is done.
👉 There is also a common multiplier for all `first_feature_use_penalties`; it can be specified with the `penalties_coefficient` parameter.
Implemented by @Taube03 (issue #1155)
- A `recordCount` attribute is added to PMML models (issue #1026).
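A brief sketch of quantizing once and reusing the borders, following the `save_quantization_borders` / `input_borders` workflow described above; the file names are placeholders:

```python
# Illustrative only: quantize the training Pool, save its borders,
# and reuse them to quantize a validation Pool.
from catboost import Pool

train_pool = Pool('train.tsv', column_description='cd.txt')
train_pool.quantize()
train_pool.save_quantization_borders('borders.tsv')

valid_pool = Pool('validation.tsv', column_description='cd.txt')
valid_pool.quantize(input_borders='borders.tsv')
```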
🆕 New losses and metrics
- 🆕 New ranking objective 'StochasticRank', details in the paper.
- 👀 `Tweedie` loss is supported now. It can be a good solution for a right-skewed target with many zero values, see the tutorial.
👕 When using the `CatBoostRegressor.predict` function, the default `prediction_type` for this loss will be `Exponent`. Implemented by @ilya-pchelintsev (issue #577)
- 👍 Classification metrics now support a new parameter `proba_border`. With this parameter you can set the decision boundary for treating a prediction as negative or positive. Implemented by @ivanychev.
- 👕 Metric `TotalF1` supports a new parameter `average` with possible values `weighted`, `micro`, `macro` (a short sketch of these metric parameters follows this list). Implemented by @ilya-pchelintsev.
- It is now possible to specify a custom multi-label metric in python. Note that it is only possible to calculate this metric and use it as `eval_metric`; it is not possible to use it as an optimization objective.
To write a multi-label metric, you need to define a python class which inherits from the `MultiLabelCustomMetric` class. Implemented by @azikmsu.
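A small sketch of the metric parameters mentioned above; the values are illustrative, and applying `proba_border` to `Accuracy` is an assumption of which classification metric to demonstrate it on:

```python
# Illustrative only: metric parameters passed in the metric description string.
from catboost import CatBoostClassifier

model = CatBoostClassifier(
    iterations=100,
    eval_metric='TotalF1:average=Weighted',       # averaged TotalF1
    custom_metric=['Accuracy:proba_border=0.6'],  # custom decision boundary
    verbose=False,
)
```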
👌 Improvements of grid and randomized search
- 👍 `class_weights` parameter is now supported in grid/randomized search. Implemented by @vazgenk.
- 🔧 Invalid option configurations are automatically skipped during grid/randomized search. Implemented by @borzunov.
- `get_best_score` returns the train/validation best score after grid/randomized search (in case of refit=False). Implemented by @rednevaler.
👌 Improvements of model analysis tools
- 🔋 Computation of SHAP interaction values for CatBoost models. You can pass `type=EFstrType.ShapInteractionValues` to `CatBoost.get_feature_importance` to get a matrix of SHAP values for every prediction (a short sketch follows this list).
0️⃣ By default, SHAP interaction values are calculated for all features. You may specify features of interest using the `interaction_indices` argument.
Implemented by @IvanKozlov98.
- SHAP values can now be calculated approximately, which is much faster than the default mode. To use this mode, specify the `shap_calc_type` parameter of the `CatBoost.get_feature_importance` function as `"Approximate"`. Implemented by @LordProtoss (issue #1146).
- The `PredictionDiff` model analysis method can now be used with models that contain non-symmetric trees. Implemented by @felixandrer.
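A brief sketch of requesting SHAP interaction values as described above; the data is illustrative:

```python
# Illustrative only: per-object SHAP interaction value matrices.
from catboost import CatBoostRegressor, EFstrType, Pool

X = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0], [2.0, 1.0, 0.0]]
y = [10.0, 20.0, 30.0, 5.0]
pool = Pool(X, y)

model = CatBoostRegressor(iterations=20, verbose=False)
model.fit(pool)

interactions = model.get_feature_importance(pool, type=EFstrType.ShapInteractionValues)
print(interactions.shape)  # one interaction matrix per object
```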
🆕 New educational materials
- A tutorial on tweedie regression
- A tutorial on poisson regression
- A detailed tutorial on different types of AUC metric, which explains how different types of AUC can be used for binary classification, multiclassification and ranking tasks.
💥 Breaking changes
- 0️⃣ When using the `CatBoostRegressor.predict` function for models trained with `Poisson` loss, the default `prediction_type` will be `Exponent` (issue #1184). Implemented by @garkavem.
🚀 This release also contains bug fixes and performance improvements, including a major speedup for sparse data on GPU.
v0.22 Changes
March 02, 2020
🆕 New features:
- 🚀 The main feature of the release is support for non-symmetric trees in CPU training.
Using non-symmetric trees might be useful if one-hot encoding is present, or the data has little noise.
📄 To try non-symmetric trees, change the `grow_policy` parameter (a short sketch follows this list).
🚀 Starting from this release, non-symmetric trees are supported for both CPU and GPU training.
- 👍 The next big feature improves catboost text features support.
Now tokenization is done during training; you don't have to do lowercasing, digit extraction and other tokenization on your own, catboost does it for you.
- 👍 Auto learning-rate is now supported in CPU MultiClass mode.
- The CatBoost class supports `to_regressor` and `to_classifier` methods.
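A tiny sketch of switching to non-symmetric trees via `grow_policy`; the data and parameter values are illustrative:

```python
# Illustrative only: grow non-symmetric trees instead of the default symmetric ones.
from catboost import CatBoostClassifier

X = [[0, 1], [1, 0], [1, 1], [0, 0]]
y = [1, 0, 1, 0]

model = CatBoostClassifier(
    iterations=50,
    grow_policy='Lossguide',  # or 'Depthwise'; the default is 'SymmetricTree'
    verbose=False,
)
model.fit(X, y)
```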
🚀 The release also contains a list of bug fixes.
v0.21 Changes
January 31, 2020
🆕 New features:
- The main feature of this release is the Stochastic Gradient Langevin Boosting (SGLB) mode, which can improve the quality of your models with non-convex loss functions. To use it, specify the `langevin` option and tune `diffusion_temperature` and `model_shrink_rate` (a short sketch follows). See the corresponding paper for details.
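A small sketch of enabling SGLB as described above; the parameter values are illustrative, not recommended defaults:

```python
# Illustrative only: SGLB mode via the langevin option.
from catboost import CatBoostRegressor

model = CatBoostRegressor(
    iterations=500,
    langevin=True,
    diffusion_temperature=10000,  # tune for your dataset
    model_shrink_rate=0.001,      # tune for your dataset
    verbose=False,
)
```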
👌 Improvements:
- 0️⃣ Automatic learning rate is applied by default not only for the `Logloss` objective, but also for `RMSE` (on CPU and GPU) and `MultiClass` (on GPU).
- Class labels type information is stored in the model. Now estimators in the python package return values of the proper type in the `classes_` attribute and for prediction functions with `prediction_type=Class`. #305, #999, #1017.
📄 Note: Class labels loaded from datasets in CatBoost dsv format always have string type now.
🐛 Bug fixes:
- 🛠 Fixed huge memory consumption for text features. #1107
- 🛠 Fixed crash on GPU on big datasets with groups (hundred million+ groups).
- 🛠 Fixed class labels consistency check and merging in model sums (now class names in binary classification are properly checked and added to the result as well)
- 🛠 Fix for confusion matrix (PR #1152), thanks to @dmsivkov.
- Fixed shap values calculation when `boost_from_average=True`. #1125
- 🛠 Fixed use-after-free in fstr PredictionValuesChange with a specified dataset
- Target border and class weights are now taken from the model when necessary for feature strength, metrics evaluation, roc_curve, object importances and calc_feature_statistics calculations.
- 🛠 Fixed that L2 regularization was not applied for non-symmetric trees for binary classification on GPU.
- 🔋 [R-package] Fixed the bug that `catboost.get_feature_importance` did not work after the model is loaded #1064
- 📦 [R-package] Fixed the bug that `catboost.train` did not work when called with a single dataset parameter. #1162
- 🛠 Fixed L2 score calculation on CPU
Other:
- 🚀 Starting from this release Java applier is released simultaneously with other components and has the same version.
Compatibility:
- 🚀 Models trained with this release require applier from this release or later to work correctly.
v0.20.2 Changes
December 25, 2019
🆕 New features:
- 👍 String class labels are now supported for binary classification
- [CLI only] Timestamp column for the datasets can be provided in separate files.
- [CLI only] Timesplit feature evaluation.
- 🖨 Process groups of any size in block processing.
🐛 Bug fixes:
- `classes_count` and `class_weight` params can now be used with user-defined loss functions. #1119
- 0️⃣ Form correct metric descriptions on GPU if `use_weights` gets its value by default. #1106
- Correct `model.classes_` attribute for binary classification (proper labels instead of always `0` and `1`). #984
- Fix `model.classes_` attribute when the `classes_count` parameter was specified.
- Proper error message when categorical features are specified for MultiRMSE training. #1112
- Block processing: it is valid for all groups in a single block to have weights equal to 0
- 🛠 Fix empty asymmetric tree index calculation. #1104