# $$CatBoost\ Tutorial$$

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/catboost/tutorials/blob/master/python_tutorial.ipynb)

In this tutorial we will explore some basic use cases of CatBoost, such as model training, cross-validation and prediction, as well as some useful features like early stopping, snapshot support, feature importances and parameter tuning.
  
You can run this tutorial in the Google Colaboratory environment with a free CPU or GPU. Just click on this <a href="https://colab.research.google.com/github/catboost/tutorials/blob/master/python_tutorial.ipynb" target="_blank" title="Colab">link</a>.

## $$Contents$$
* [1. Data Preparation](#$$1.\-Data\-Preparation$$)
    * [1.1 CatBoost installation](#1.1-CatBoost-installation)
    * [1.2 Data Loading](#1.2-Data-Loading)
    * [1.3 Feature Preparation](#1.3-Feature-Preparation)
    * [1.4 Data Splitting](#1.4-Data-Splitting)
* [2. CatBoost Basics](#$$2.\-CatBoost\-Basics$$)
    * [2.1 Model Training](#2.1-Model-Training)
    * [2.2 Model Cross-Validation](#2.2-Model-Cross-Validation)
    * [2.3 Model Applying](#2.3-Model-Applying)
* [3. CatBoost Features](#$$3.\-CatBoost\-Features$$)
    * [3.1 Using the best model](#3.1-Using-the-best-model)
    * [3.2 Early Stopping](#3.2-Early-Stopping)
    * [3.3 Using Baseline](#3.3-Using-Baseline)
    * [3.4 Snapshot Support](#3.4-Snapshot-Support)
    * [3.5 User Defined Objective Function](#3.5-User-Defined-Objective-Function)
    * [3.6 User Defined Metric Function](#3.6-User-Defined-Metric-Function)
    * [3.7 Staged Predict](#3.7-Staged-Predict)
    * [3.8 Feature Importances](#3.8-Feature-Importances)
    * [3.9 Eval Metrics](#3.9-Eval-Metrics)
    * [3.10 Learning Processes Comparison](#3.10-Learning-Processes-Comparison)
    * [3.11 Model Saving](#3.11-Model-Saving)
* [4. Parameters Tuning](#$$4.\-Parameters\-Tuning$$)

## $$1.\ Data\ Preparation$$
### 1.1 CatBoost installation
If you have not already installed CatBoost, you can do so by running the `!pip install catboost` command.  
  
You should also install the ipywidgets package and enable the widgets notebook extension before launching Jupyter Notebook, so that the plots can be drawn.

In [3]:
!pip install catboost
!pip install ipywidgets
!jupyter nbextension enable --py widgetsnbextension

Collecting catboost
  Downloading https://files.pythonhosted.org/packages/49/d9/898a290d24bfd20a3e0758f4639b4da15fc338aea1e160c91e288c574195/catboost-0.11.2-cp37-none-macosx_10_6_intel.macosx_10_9_intel.macosx_10_9_x86_64.macosx_10_10_intel.macosx_10_10_x86_64.whl (7.4MB)
Collecting enum34 (from catboost)
  Using cached https://files.pythonhosted.org/packages/af/42/cb9355df32c69b553e72a2e28daee25d1611d2c0d9c272aa1d34204205b2/enum34-1.1.6-py3-none-any.whl
Installing collected packages: enum34, catboost
Successfully installed catboost-0.11.2 enum34-1.1.6
Enabling notebook extension jupyter-js-widgets/extension...
      - Validating: OK


### 1.2 Data Loading
The data for this tutorial can be obtained from [this page](https://www.kaggle.com/c/titanic/data) (you will need to register a Kaggle account or log in with Facebook or Google+), or you can use `catboost.datasets` as in the code below.

In [1]:
from catboost.datasets import titanic
import numpy as np

train_df, test_df = titanic()

train_df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


### 1.3 Feature Preparation
First of all, let's check how many missing values we have:

In [2]:
null_value_stats = train_df.isnull().sum(axis=0)
null_value_stats[null_value_stats != 0]

Age         177
Cabin       687
Embarked      2
dtype: int64

As we can see, **`Age`**, **`Cabin`** and **`Embarked`** do have some missing values, so let's fill them with a number far outside their distributions. That way the model can easily tell the filled-in values apart from real ones and take them into account:

In [3]:
train_df.fillna(-999, inplace=True)
test_df.fillna(-999, inplace=True)

Now let's separate features and label variable:

In [4]:
X = train_df.drop('Survived', axis=1)
y = train_df.Survived

Note that our features are of different types: some are numeric, some are categorical, and some are just strings, which normally have to be handled in a specific way (for example, encoded with a bag-of-words representation). In our case we can simply treat these string features as categorical ones, because all the heavy lifting is done inside CatBoost. How cool is that? :)

In [5]:
print(X.dtypes)

# `np.float` was only an alias of the builtin `float` and is gone from recent NumPy releases
categorical_features_indices = np.where(X.dtypes != float)[0]

PassengerId      int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object
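The selection rule above can be reproduced without pandas to see exactly which columns end up flagged as categorical. The sketch below hard-codes the dtypes printed above (the list name `column_dtypes` is just an illustration). Note that the integer columns (`PassengerId`, `Pclass`, `SibSp`, `Parch`) are flagged too, because the rule is simply "everything that is not a float".

```python
# Column dtypes copied from the output above.
column_dtypes = [
    ("PassengerId", "int64"), ("Pclass", "int64"), ("Name", "object"),
    ("Sex", "object"), ("Age", "float64"), ("SibSp", "int64"),
    ("Parch", "int64"), ("Ticket", "object"), ("Fare", "float64"),
    ("Cabin", "object"), ("Embarked", "object"),
]

# Same rule as the cell above: every non-float column counts as categorical.
categorical_indices = [
    i for i, (_, dtype) in enumerate(column_dtypes)
    if not dtype.startswith("float")
]
categorical_names = [column_dtypes[i][0] for i in categorical_indices]

print(categorical_indices)
print(categorical_names)
```

Only `Age` and `Fare` stay numeric under this rule.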


### 1.4 Data Splitting
Let's split the train data into training and validation sets.

In [6]:
from sklearn.model_selection import train_test_split

X_train, X_validation, y_train, y_validation = train_test_split(X, y, train_size=0.75, random_state=42)

X_test = test_df



## $$2.\ CatBoost\ Basics$$

Let's make necessary imports.

In [7]:
from catboost import CatBoostClassifier, Pool, cv
from sklearn.metrics import accuracy_score

### 2.1 Model Training
Now let's create the model itself. We will go with the default parameters here (they provide a _really_ good baseline almost all the time). The only thing we specify is the `custom_loss` parameter: it lets us track accuracy, the metric of this competition, alongside logloss, which is smoother on a dataset of this size.

In [8]:
model = CatBoostClassifier(
    custom_loss=['Accuracy'],
    random_seed=42,
    logging_level='Silent'
)

In [9]:
model.fit(
    X_train, y_train,
    cat_features=categorical_features_indices,
    eval_set=(X_validation, y_validation),
#     logging_level='Verbose',  # you can uncomment this for text output
    plot=True
);

MetricVisualizer(layout=Layout(align_self='stretch', height='500px'))

As you can see, it is possible to watch the model learn through verbose output or with nice plots (personally I would definitely go with the second option: just check out those plots, where you can, for example, zoom in on areas of interest!)

From this we can see that the best accuracy value of **0.8340** (on the validation set) was achieved at boosting step **157**.

### 2.2 Model Cross-Validation

It is good to validate your model, but cross-validating it is even better. And also with plots! So, without further ado:

In [10]:
cv_params = model.get_params()
cv_params.update({
    'loss_function': 'Logloss'
})
cv_data = cv(
    Pool(X, y, cat_features=categorical_features_indices),
    cv_params,
    plot=True
)

MetricVisualizer(layout=Layout(align_self='stretch', height='500px'))

Now we have the values of our loss functions at each boosting step, averaged over 3 folds, which should give us a more accurate estimate of our model's performance:

In [11]:
print('Best validation accuracy score: {:.2f}±{:.2f} on step {}'.format(
    np.max(cv_data['test-Accuracy-mean']),
    cv_data['test-Accuracy-std'][np.argmax(cv_data['test-Accuracy-mean'])],
    np.argmax(cv_data['test-Accuracy-mean'])
))

Best validation accuracy score: 0.82±0.02 on step 586




In [12]:
print('Precise validation accuracy score: {}'.format(np.max(cv_data['test-Accuracy-mean'])))

Precise validation accuracy score: 0.819304152637486


As we can see, our initial estimate of performance on a single validation fold was too optimistic. That is why cross-validation is so important!
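To illustrate what "averaged over folds" means, here is a minimal sketch in plain Python. The per-fold accuracy curves are made up for illustration, and using the population standard deviation (`pstdev`) is an arbitrary choice of this sketch, not necessarily what CatBoost reports.

```python
from statistics import mean, pstdev

# Hypothetical per-fold accuracy curves (3 folds, 5 boosting steps each).
fold_curves = [
    [0.78, 0.80, 0.83, 0.82, 0.81],
    [0.76, 0.79, 0.80, 0.81, 0.80],
    [0.79, 0.81, 0.82, 0.80, 0.80],
]

# Regroup the values so that each tuple holds one boosting step across all folds.
steps = list(zip(*fold_curves))
mean_curve = [mean(step) for step in steps]
std_curve = [pstdev(step) for step in steps]

# The best step is chosen on the averaged curve, not on any single fold.
best_step = max(range(len(mean_curve)), key=mean_curve.__getitem__)
print("Best mean accuracy: {:.3f}±{:.3f} on step {}".format(
    mean_curve[best_step], std_curve[best_step], best_step))
```

In this toy data the first fold alone peaks at 0.83, while the averaged curve peaks lower, which is the same effect seen above on the real data.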

### 2.3 Model Applying
All you have to do to get predictions is:

In [13]:
predictions = model.predict(X_test)
predictions_probs = model.predict_proba(X_test)
print(predictions[:10])
print(predictions_probs[:10])

[0. 0. 0. 0. 1. 0. 1. 0. 1. 0.]
[[0.89483338 0.10516662]
 [0.83027766 0.16972234]
 [0.89202782 0.10797218]
 [0.91357855 0.08642145]
 [0.23361409 0.76638591]
 [0.92390626 0.07609374]
 [0.33580898 0.66419102]
 [0.74204312 0.25795688]
 [0.37852235 0.62147765]
 [0.95868962 0.04131038]]
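The two outputs above are related in a simple way: for a binary problem, the predicted class is the one with the larger probability. The sketch below checks this on the ten probability rows printed above, in plain Python.

```python
# Probabilities copied from the output above: [P(class 0), P(class 1)] per row.
predictions_probs = [
    [0.89483338, 0.10516662],
    [0.83027766, 0.16972234],
    [0.89202782, 0.10797218],
    [0.91357855, 0.08642145],
    [0.23361409, 0.76638591],
    [0.92390626, 0.07609374],
    [0.33580898, 0.66419102],
    [0.74204312, 0.25795688],
    [0.37852235, 0.62147765],
    [0.95868962, 0.04131038],
]

# Pick the index of the larger probability in each row
# (for two classes this is the same as thresholding P(class 1) at 0.5).
labels = [max(range(len(row)), key=row.__getitem__) for row in predictions_probs]
print(labels)
```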


But let's try to get better predictions; CatBoost features will help us with that.

## $$3.\ CatBoost\ Features$$
You may have noticed that at the model creation step I specified not only `custom_loss` but also the `random_seed` parameter. That was done to make this notebook reproducible. If you do not set a seed, CatBoost assigns one itself, and you can read the value it used from `random_seed_`:

In [14]:
model_without_seed = CatBoostClassifier(iterations=10, logging_level='Silent')
model_without_seed.fit(X, y, cat_features=categorical_features_indices)

print('Random seed assigned for this model: {}'.format(model_without_seed.random_seed_))

Random seed assigned for this model: 0


Let's define some params and create a `Pool` for convenience. It stores all the information about the dataset (features, labels, categorical feature indices, weights and much more).

In [15]:
params = {
    'iterations': 500,
    'learning_rate': 0.1,
    'eval_metric': 'Accuracy',
    'random_seed': 42,
    'logging_level': 'Silent',
    'use_best_model': False
}
train_pool = Pool(X_train, y_train, cat_features=categorical_features_indices)
validate_pool = Pool(X_validation, y_validation, cat_features=categorical_features_indices)

### 3.1 Using the best model
If you have a validation set, it is usually better to use the `use_best_model` parameter during training. By default, this parameter is enabled when a validation set is provided. When it is enabled, the resulting tree ensemble is shrunk to the best iteration.

In [16]:
model = CatBoostClassifier(**params)
model.fit(train_pool, eval_set=validate_pool)

best_model_params = params.copy()
best_model_params.update({
    'use_best_model': True
})
best_model = CatBoostClassifier(**best_model_params)
best_model.fit(train_pool, eval_set=validate_pool);

print('Simple model validation accuracy: {:.4}'.format(
    accuracy_score(y_validation, model.predict(X_validation))
))
print('')

print('Best model validation accuracy: {:.4}'.format(
    accuracy_score(y_validation, best_model.predict(X_validation))
))

Simple model validation accuracy: 0.8251

Best model validation accuracy: 0.8386
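To make the idea of shrinking to the best iteration concrete, here is a minimal sketch in plain Python. The accuracy values are made up for illustration, and `best_iteration` is a hypothetical helper, not a CatBoost function. It only mirrors the idea of keeping the trees up to the iteration with the best validation score.

```python
def best_iteration(scores, maximize=True):
    """Return the index of the best validation score (the first one on ties)."""
    pick = max if maximize else min
    return scores.index(pick(scores))

# Hypothetical validation accuracy after each boosting iteration.
validation_accuracy = [0.78, 0.82, 0.82, 0.84, 0.83, 0.83, 0.82]

best = best_iteration(validation_accuracy)
trees_kept = best + 1  # iterations are counted from zero
print("best iteration:", best, "-> trees kept:", trees_kept)
```

For a metric that should be minimised, such as logloss, the same helper can be called with `maximize=False`.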


### 3.2 Early Stopping
If you have a validation set, it is usually easier and better to use early stopping. This feature is similar to the previous one, but in addition to guarding against overfitting it also saves training time.

In [17]:
%%time
model = CatBoostClassifier(**params)
model.fit(train_pool, eval_set=validate_pool)

CPU times: user 28.3 s, sys: 9.41 s, total: 37.7 s
Wall time: 15.4 s


In [18]:
%%time
earlystop_params = params.copy()
earlystop_params.update({
    'od_type': 'Iter',
    'od_wait': 40
})
earlystop_model = CatBoostClassifier(**earlystop_params)
earlystop_model.fit(train_pool, eval_set=validate_pool);

CPU times: user 2.23 s, sys: 630 ms, total: 2.86 s
Wall time: 1.03 s


In [19]:
print('Simple model tree count: {}'.format(model.tree_count_))
print('Simple model validation accuracy: {:.4}'.format(
    accuracy_score(y_validation, model.predict(X_validation))
))
print('')

print('Early-stopped model tree count: {}'.format(earlystop_model.tree_count_))
print('Early-stopped model validation accuracy: {:.4}'.format(
    accuracy_score(y_validation, earlystop_model.predict(X_validation))
))

Simple model tree count: 500
Simple model validation accuracy: 0.8251

Early-stopped model tree count: 53
Early-stopped model validation accuracy: 0.8072


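So we get a much smaller model in a fraction of the time (about 1 s instead of 15 s of wall time). In this particular run the early-stopped model scores slightly lower on the validation set (0.8072 versus 0.8251), so the gain here is in speed rather than quality.

As shown earlier, a single validation split does not precisely describe the model's out-of-sample score (it may be biased by the particular split), but it is still useful for tracking how the model improves. As this example shows, stopping the boosting process early, before overfitting kicks in, saves a lot of training time.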
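The `'od_type': 'Iter'` and `'od_wait': 40` settings used above describe a patience rule: stop once the validation metric has not improved for a given number of iterations. The sketch below is a simplified plain-Python illustration of such a rule on made-up scores. `early_stop_iteration` is a hypothetical helper, not CatBoost's actual overfitting detector.

```python
def early_stop_iteration(scores, wait, maximize=True):
    """Return (last_iteration_run, best_iteration) under a patience rule."""
    best_index = 0
    for i in range(1, len(scores)):
        if maximize:
            improved = scores[i] > scores[best_index]
        else:
            improved = scores[i] < scores[best_index]
        if improved:
            best_index = i
        elif i - best_index >= wait:
            # No improvement for `wait` iterations in a row: stop here.
            return i, best_index
    return len(scores) - 1, best_index

# Hypothetical validation accuracy per iteration.
validation_accuracy = [0.70, 0.75, 0.80, 0.79, 0.80, 0.78, 0.79, 0.77, 0.81]

last_run, best = early_stop_iteration(validation_accuracy, wait=3)
print("stopped after iteration", last_run, "with best iteration", best)
```

With a patience of 3, training stops at iteration 5 and never reaches the later, slightly better score at iteration 8. This is the trade-off behind choosing the waiting period.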

### 3.3 Using Baseline
It is possible to use pre-training results (a baseline) for training.

In [20]:
current_params = params.copy()
current_params.update({
    'iterations': 10
})
model = CatBoostClassifier(**current_params).fit(X_train, y_train, categorical_features_indices)
# Get baseline (only with prediction_type='RawFormulaVal')
baseline = model.predict(X_train, prediction_type='RawFormulaVal')
# Fit new model
model.fit(X_train, y_train, categorical_features_indices, baseline=baseline);
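For intuition: `RawFormulaVal` predictions are raw scores before the sigmoid is applied, and a baseline gives boosting a per-object starting score, so the new trees fit a correction to it. The sketch below shows this conceptually in plain Python. All numbers are made up, and adding the two parts together is an illustration of the idea rather than a description of what CatBoost's `predict` returns.

```python
import math

def sigmoid(raw):
    """Convert a raw score (log-odds) into a probability."""
    return 1.0 / (1.0 + math.exp(-raw))

# Hypothetical raw scores from a first model, used as the baseline.
baseline = [-1.2, 0.3, 2.0]
# Hypothetical corrections learned by a second model trained on top of it.
correction = [0.4, -0.1, 0.5]

combined_raw = [b + c for b, c in zip(baseline, correction)]
probabilities = [sigmoid(r) for r in combined_raw]

for raw, prob in zip(combined_raw, probabilities):
    print("raw score {:+.2f} -> probability {:.3f}".format(raw, prob))
```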

### 3.4 Snapshot Support
CatBoost supports snapshots. You can use them to resume training after an interruption or to start training from previous results.

In [21]:
params_with_snapshot = params.copy()
params_with_snapshot.update({
    'iterations': 5,
    'learning_rate': 0.5,
    'logging_level': 'Verbose'
})
model = CatBoostClassifier(**params_with_snapshot).fit(train_pool, eval_set=validate_pool, save_snapshot=True)
params_with_snapshot.update({
    'iterations': 10,
    'learning_rate': 0.1,
})
model = CatBoostClassifier(**params_with_snapshot).fit(train_pool, eval_set=validate_pool, save_snapshot=True)

0:	learn: 0.7919162	test: 0.7847534	best: 0.7847534 (0)	total: 50.3ms	remaining: 201ms
1:	learn: 0.8278443	test: 0.8206278	best: 0.8206278 (1)	total: 109ms	remaining: 164ms
2:	learn: 0.8293413	test: 0.8206278	best: 0.8206278 (1)	total: 155ms	remaining: 103ms
3:	learn: 0.8338323	test: 0.8206278	best: 0.8206278 (1)	total: 186ms	remaining: 46.6ms
4:	learn: 0.8368263	test: 0.8161435	best: 0.8206278 (1)	total: 200ms	remaining: 0us

bestTest = 0.8206278027
bestIteration = 1

5:	learn: 0.8383234	test: 0.8161435	best: 0.8206278 (1)	total: 210ms	remaining: 41.6ms
6:	learn: 0.8398204	test: 0.8161435	best: 0.8206278 (1)	total: 228ms	remaining: 42.9ms
7:	learn: 0.8398204	test: 0.8161435	best: 0.8206278 (1)	total: 256ms	remaining: 37.7ms
8:	learn: 0.8458084	test: 0.8161435	best: 0.8206278 (1)	total: 322ms	remaining: 30.7ms
9:	learn: 0.8443114	test: 0.8161435	best: 0.8206278 (1)	total: 349ms	remaining: 0us

bestTest = 0.8206278027
bestIteration = 1



### 3.5 User Defined Objective Function
It is possible to create your own objective function. Let's create logloss objective function.

In [22]:
class LoglossObjective(object):
    def calc_ders_range(self, approxes, targets, weights):
        # approxes, targets, weights are indexed containers of floats
        # (containers which have only __len__ and __getitem__ defined).
        # weights parameter can be None.
        #
        # To understand what these parameters mean, assume that there is
        # a subset of your dataset that is currently being processed.
        # approxes contains current predictions for this subset,
        # targets contains target values you provided with the dataset.
        #
        # This function should return a list of pairs (der1, der2), where
        # der1 is the first derivative of the loss function with respect
        # to the predicted value, and der2 is the second derivative.
        #
        # In our case, logloss is defined by the following formula:
        # target * log(sigmoid(approx)) + (1 - target) * log(1 - sigmoid(approx))
        # where sigmoid(x) = 1 / (1 + e^(-x)).
        
        assert len(approxes) == len(targets)
        if weights is not None:
            assert len(weights) == len(approxes)
        
        result = []
        for index in range(len(targets)):
            e = np.exp(approxes[index])
            p = e / (1 + e)
            der1 = (1 - p) if targets[index] > 0.0 else -p
            der2 = -p * (1 - p)

            if weights is not None:
                der1 *= weights[index]
                der2 *= weights[index]

            result.append((der1, der2))
        return result

In [23]:
model = CatBoostClassifier(
    iterations=10,
    random_seed=42, 
    loss_function=LoglossObjective(), 
    eval_metric="Logloss"
)
# Fit model
model.fit(train_pool)
# Only prediction_type='RawFormulaVal' is allowed with custom `loss_function`
preds_raw = model.predict(X_test, prediction_type='RawFormulaVal')

0:	learn: 0.6824625	total: 62.6ms	remaining: 563ms
1:	learn: 0.6715080	total: 113ms	remaining: 453ms
2:	learn: 0.6618445	total: 170ms	remaining: 397ms
3:	learn: 0.6519014	total: 220ms	remaining: 330ms
4:	learn: 0.6430996	total: 272ms	remaining: 272ms
5:	learn: 0.6357833	total: 326ms	remaining: 217ms
6:	learn: 0.6276748	total: 373ms	remaining: 160ms
7:	learn: 0.6197009	total: 421ms	remaining: 105ms
8:	learn: 0.6119674	total: 464ms	remaining: 51.6ms
9:	learn: 0.6045267	total: 514ms	remaining: 0us
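A quick way to gain confidence in hand-written derivatives like those in `calc_ders_range` is to compare them with finite differences. The sketch below does this in plain Python for the quantity `target * log(p) + (1 - target) * log(1 - p)` with `p = sigmoid(approx)`. The helper names are hypothetical, and the step size `h` is an arbitrary choice.

```python
import math

def log_likelihood(approx, target):
    """target * log(p) + (1 - target) * log(1 - p), with p = sigmoid(approx)."""
    p = 1.0 / (1.0 + math.exp(-approx))
    return target * math.log(p) + (1.0 - target) * math.log(1.0 - p)

def analytic_ders(approx, target):
    """The same expressions as in calc_ders_range above (without weights)."""
    e = math.exp(approx)
    p = e / (1 + e)
    der1 = (1 - p) if target > 0.0 else -p
    der2 = -p * (1 - p)
    return der1, der2

def numeric_ders(approx, target, h=1e-4):
    """Central finite differences for the first and second derivative."""
    f_plus = log_likelihood(approx + h, target)
    f_mid = log_likelihood(approx, target)
    f_minus = log_likelihood(approx - h, target)
    return (f_plus - f_minus) / (2 * h), (f_plus - 2 * f_mid + f_minus) / (h * h)

for approx, target in [(-1.5, 0.0), (0.0, 1.0), (2.0, 1.0), (0.7, 0.0)]:
    a1, a2 = analytic_ders(approx, target)
    n1, n2 = numeric_ders(approx, target)
    print("approx={:+.1f} target={:.0f}  der1 {:+.5f} vs {:+.5f}  der2 {:+.5f} vs {:+.5f}".format(
        approx, target, a1, n1, a2, n2))
```

The analytic and numeric values agree, which also shows that these derivatives belong to the log-likelihood (a quantity to maximise): the second derivative is negative everywhere.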


### 3.6 User Defined Metric Function
Also it is possible to create your own metric function. Let's create logloss metric function.

In [24]:
class LoglossMetric(object):
    def get_final_error(self, error, weight):
        return error / (weight + 1e-38)

    def is_max_optimal(self):
        return False

    def evaluate(self, approxes, target, weight):
        # approxes is a list of indexed containers
        # (containers with only __len__ and __getitem__ defined),
        # one container per approx dimension.
        # Each container contains floats.
        # weight is a one dimensional indexed container.
        # target is a one dimensional indexed container of floats.
        
        # weight parameter can be None.
        # Returns pair (error, weights sum)
        
        assert len(approxes) == 1
        assert len(target) == len(approxes[0])

        approx = approxes[0]

        error_sum = 0.0
        weight_sum = 0.0

        for i in range(len(approx)):
            w = 1.0 if weight is None else weight[i]
            weight_sum += w
            error_sum += -w * (target[i] * approx[i] - np.log(1 + np.exp(approx[i])))

        return error_sum, weight_sum

In [25]:
model = CatBoostClassifier(
    iterations=10,
    random_seed=42, 
    loss_function="Logloss",
    eval_metric=LoglossMetric()
)
# Fit model
model.fit(train_pool)
# Get raw scores (prediction_type='RawFormulaVal') for the test set
preds_raw = model.predict(X_test, prediction_type='RawFormulaVal')

0:	learn: 0.5142670	total: 57.7ms	remaining: 519ms
1:	learn: 0.4622572	total: 103ms	remaining: 410ms
2:	learn: 0.4498954	total: 137ms	remaining: 319ms
3:	learn: 0.4440634	total: 179ms	remaining: 269ms
4:	learn: 0.4437537	total: 194ms	remaining: 194ms
5:	learn: 0.4413266	total: 210ms	remaining: 140ms
6:	learn: 0.4303620	total: 233ms	remaining: 99.7ms
7:	learn: 0.4251345	total: 246ms	remaining: 61.4ms
8:	learn: 0.4117195	total: 301ms	remaining: 33.4ms
9:	learn: 0.4117159	total: 342ms	remaining: 0us
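The arithmetic inside `evaluate` and `get_final_error` can be tried out on its own. The sketch below repeats it as a single plain-Python function (`logloss` is a hypothetical name) and checks one value that is easy to verify by hand: a raw score of 0 means a probability of 0.5, so the loss is log(2).

```python
import math

def logloss(approx, target, weight=None):
    """Weighted mean of -(t * a - log(1 + exp(a))) over all objects."""
    error_sum, weight_sum = 0.0, 0.0
    for i in range(len(approx)):
        w = 1.0 if weight is None else weight[i]
        weight_sum += w
        error_sum += -w * (target[i] * approx[i] - math.log(1 + math.exp(approx[i])))
    # Same final step as get_final_error above.
    return error_sum / (weight_sum + 1e-38)

uncertain = logloss([0.0, 0.0], [1.0, 0.0])
confident = logloss([3.0, -3.0], [1.0, 0.0])
print("uncertain predictions:", uncertain)
print("confident and correct predictions:", confident)
```

The expression `t * a - log(1 + exp(a))` equals `t * log(p) + (1 - t) * log(1 - p)` with `p = sigmoid(a)`, so this is the usual logloss written in terms of raw scores.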


### 3.7 Staged Predict
A CatBoost model has a `staged_predict` method. It lets you iteratively get predictions for a given range of trees.

In [None]:
model = CatBoostClassifier(iterations=10, random_seed=42, logging_level='Silent').fit(train_pool)
ntree_start, ntree_end, eval_period = 3, 9, 2
predictions_iterator = model.staged_predict(validate_pool, 'Probability', ntree_start, ntree_end, eval_period)
for preds, tree_count in zip(predictions_iterator, range(ntree_start, ntree_end, eval_period)):
    print('First class probabilities using the first {} trees: {}'.format(tree_count, preds[:5, 1]))

### 3.8 Feature Importances
Sometimes it is very important to understand which feature made the greatest contribution to the final result. To do this, the CatBoost model has a `get_feature_importance` method.

In [29]:
model = CatBoostClassifier(iterations=50, random_seed=42, logging_level='Silent').fit(train_pool)
feature_importances = model.get_feature_importance(train_pool)
feature_names = X_train.columns
for score, name in sorted(zip(feature_importances, feature_names), reverse=True):
    print('{}: {}'.format(name, score))

Sex: 31.702614119956866
Pclass: 20.207001310838848
Ticket: 11.973540755448495
Fare: 8.985295393916426
Cabin: 7.145337338432442
SibSp: 6.46811647401467
Parch: 4.987796626323452
Age: 4.718576113910789
Embarked: 3.8117218671580106
PassengerId: 0.0
Name: 0.0


This shows that features **`Sex`** and **`Pclass`** had the biggest influence on the result.
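The scores printed above add up to 100, so each one can be read as a percentage share. The short plain-Python sketch below, using the values copied from that output, makes the cumulative picture explicit.

```python
# Importance scores copied from the output above.
feature_importances = {
    "Sex": 31.702614119956866,
    "Pclass": 20.207001310838848,
    "Ticket": 11.973540755448495,
    "Fare": 8.985295393916426,
    "Cabin": 7.145337338432442,
    "SibSp": 6.46811647401467,
    "Parch": 4.987796626323452,
    "Age": 4.718576113910789,
    "Embarked": 3.8117218671580106,
    "PassengerId": 0.0,
    "Name": 0.0,
}

ranked = sorted(feature_importances.items(), key=lambda item: item[1], reverse=True)
total = sum(feature_importances.values())

running = 0.0
for name, score in ranked:
    running += score
    print("{:<12}{:>7.2f}  cumulative {:>6.2f}".format(name, score, running))
print("total: {:.2f}".format(total))
```

`Sex` and `Pclass` together account for a little over half of the total importance in this run.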

### 3.9 Eval Metrics
CatBoost has an `eval_metrics` method that calculates the given metrics on a given dataset. And draws them, of course :)

In [30]:
model = CatBoostClassifier(iterations=50, random_seed=42, logging_level='Silent').fit(train_pool)
eval_metrics = model.eval_metrics(validate_pool, ['AUC'], plot=True)

MetricVisualizer(layout=Layout(align_self='stretch', height='500px'))

In [31]:
print(eval_metrics['AUC'][:6])

[0.816032198557773, 0.816032198557773, 0.816032198557773, 0.8498239141371793, 0.8584605064564816, 0.8708703672647995]
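As a reminder of what the AUC numbers above measure: AUC is the fraction of (positive, negative) pairs in which the positive object receives the higher score, with ties counted as half. The sketch below computes it by brute force in plain Python on made-up data. It illustrates the definition and is not how CatBoost computes the metric internally.

```python
def auc(labels, scores):
    """Pairwise definition of AUC: the share of correctly ordered pairs."""
    positives = [s for l, s in zip(labels, scores) if l == 1]
    negatives = [s for l, s in zip(labels, scores) if l == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Hypothetical labels and predicted scores.
labels = [0, 0, 1, 1, 0, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9]

print("AUC: {:.4f}".format(auc(labels, scores)))
```

Here 8 of the 9 pairs are ordered correctly; the only miss is the positive with score 0.35 ranked below the negative with score 0.4.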


### 3.10 Learning Processes Comparison
You can also compare the learning processes of different models on a single plot.

In [32]:
model1 = CatBoostClassifier(iterations=10, depth=1, train_dir='model_depth_1/', logging_level='Silent')
model1.fit(train_pool, eval_set=validate_pool)
model2 = CatBoostClassifier(iterations=10, depth=5, train_dir='model_depth_5/', logging_level='Silent')
model2.fit(train_pool, eval_set=validate_pool);

In [33]:
from catboost import MetricVisualizer
widget = MetricVisualizer(['model_depth_1', 'model_depth_5'])
widget.start()

MetricVisualizer(layout=Layout(align_self='stretch', height='500px'))

### 3.11 Model Saving
It is always really handy to be able to dump your model to disk (especially if training took some time).

In [34]:
model = CatBoostClassifier(iterations=10, random_seed=42, logging_level='Silent').fit(train_pool)
model.save_model('catboost_model.dump')
model = CatBoostClassifier()
model.load_model('catboost_model.dump');

## $$4.\ Parameters\ Tuning$$
While you can always select the optimal number of iterations (boosting steps) with cross-validation and learning curve plots, it is also important to tune some of the model parameters. We would like to pay special attention to `l2_leaf_reg` and `learning_rate`.

In this section, we'll select these parameters using the **`hyperopt`** package.

In [36]:
!pip install hyperopt

Collecting hyperopt
  Downloading https://files.pythonhosted.org/packages/ce/9f/f6324af3fc43f352e568b5850695c30ed7dd14af06a94f97953ff9187569/hyperopt-0.1.1-py3-none-any.whl (117kB)
Collecting pymongo (from hyperopt)
  Downloading https://files.pythonhosted.org/packages/d7/ac/d2e324c1f9bcf653fa106785371a16b4709506a35b04948655de8b961a85/pymongo-3.7.2-cp37-cp37m-macosx_10_9_x86_64.whl (307kB)
Installing collected packages: pymongo, hyperopt
Successfully installed hyperopt-0.1.1 pymongo-3.7.2


In [37]:
import hyperopt

def hyperopt_objective(params):
    model = CatBoostClassifier(
        l2_leaf_reg=int(params['l2_leaf_reg']),
        learning_rate=params['learning_rate'],
        iterations=500,
        eval_metric='Accuracy',
        random_seed=42,
        verbose=False,
        loss_function='Logloss',
    )
    
    cv_data = cv(
        Pool(X, y, cat_features=categorical_features_indices),
        model.get_params()
    )
    best_accuracy = np.max(cv_data['test-Accuracy-mean'])
    
    return 1 - best_accuracy # as hyperopt minimises

In [38]:
from numpy.random import RandomState

params_space = {
    'l2_leaf_reg': hyperopt.hp.qloguniform('l2_leaf_reg', 0, 2, 1),
    'learning_rate': hyperopt.hp.uniform('learning_rate', 1e-3, 5e-1),
}

trials = hyperopt.Trials()

best = hyperopt.fmin(
    hyperopt_objective,
    space=params_space,
    algo=hyperopt.tpe.suggest,
    max_evals=50,
    trials=trials,
    rstate=RandomState(123)
)

print(best)

{'l2_leaf_reg': 5.0, 'learning_rate': 0.1147638000846512}
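To see which values the `l2_leaf_reg` search space can produce, recall that hyperopt documents `hp.qloguniform(label, low, high, q)` as returning a value like `round(exp(uniform(low, high)) / q) * q`. The sketch below re-implements that formula in plain Python for illustration only. It is not hyperopt's sampler, and the seed and sample count are arbitrary.

```python
import math
import random

def qloguniform(rng, low, high, q):
    """round(exp(uniform(low, high)) / q) * q, following hyperopt's documented formula."""
    return round(math.exp(rng.uniform(low, high)) / q) * q

rng = random.Random(123)
samples = [qloguniform(rng, 0, 2, 1) for _ in range(1000)]

# exp(0) = 1 and exp(2) is about 7.39, so only the integers 1..7 can appear.
print(sorted(set(samples)))
```

Smaller values are sampled more often than larger ones, because the draw is uniform on the log scale. hyperopt returns the result as a float (`5.0` in the output above), which is why the objective wraps it in `int(...)`.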


Now let's get all the CV data with the best parameters:

In [39]:
model = CatBoostClassifier(
    l2_leaf_reg=int(best['l2_leaf_reg']),
    learning_rate=best['learning_rate'],
    iterations=500,
    eval_metric='Accuracy',
    random_seed=42,
    verbose=False,
    loss_function='Logloss',
)
cv_data = cv(Pool(X, y, cat_features=categorical_features_indices), model.get_params())

In [40]:
print('Precise validation accuracy score: {}'.format(np.max(cv_data['test-Accuracy-mean'])))

Precise validation accuracy score: 0.8338945005611672


Recall that with default parameters our CV score was 0.8193 (section 2.2), so we have gained some improvement, although it is probably not statistically significant.
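A rough way to judge the size of this improvement is to compare it with the spread between folds. The sketch below uses the scores printed in this notebook: the default-parameter CV accuracy and its rounded standard deviation from section 2.2, and the tuned CV accuracy from the output above. Treating that single standard deviation as the noise level is an assumption of this sketch, not a formal significance test.

```python
default_cv_accuracy = 0.819304152637486   # section 2.2 output
tuned_cv_accuracy = 0.8338945005611672    # output above
fold_std = 0.02                           # rounded std printed in section 2.2

improvement = tuned_cv_accuracy - default_cv_accuracy
print("improvement: {:.4f}".format(improvement))
print("improvement / fold std: {:.2f}".format(improvement / fold_std))
```

The gain is smaller than one standard deviation across folds. The tuned score is also the best of 50 hyperopt evaluations, which makes it somewhat optimistic.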

### Make submission
Now we will re-train our tuned model on all the training data that we have:

In [41]:
model.fit(X, y, cat_features=categorical_features_indices)

<catboost.core.CatBoostClassifier at 0x1a2ba995c0>

And finally let's prepare the submission file:

In [42]:
import pandas as pd
submission = pd.DataFrame()
submission['PassengerId'] = X_test['PassengerId']
submission['Survived'] = model.predict(X_test)

In [43]:
submission.to_csv('submission.csv', index=False)

Finally, you can submit your results to the [Titanic Kaggle competition](https://www.kaggle.com/c/titanic).

That's it! Now you can play around with CatBoost and win some competitions! :)