# CatBoost PredictionDiff Feature Importance Tutorial

It is often important to understand which feature contributed most to the final result. For this purpose, the CatBoost model has a get_feature_importance method.

In [1]:
import numpy as np
from catboost import CatBoost, Pool, datasets
from sklearn.model_selection import train_test_split

First, let's prepare the dataset:

In [2]:
train_df, test_df = datasets.msrank_10k()

In [3]:
X_train, y_train, group_id_train = np.array(train_df.drop([0, 1], axis=1)), np.array(train_df[0]), np.array(train_df[1])
X_test, y_test, group_id_test = np.array(test_df.drop([0, 1], axis=1)), np.array(test_df[0]), np.array(test_df[1])
train_pool = Pool(X_train, y_train, group_id=group_id_train)
test_pool = Pool(X_test, y_test, group_id=group_id_test)

Let's train CatBoost:

In [4]:
model = CatBoost({'iterations': 50, 'loss_function': 'YetiRank', 'verbose': False, 'random_seed': 42})
model.fit(train_pool);

CatBoost provides several types of feature importance. One of them is PredictionDiff: a vector with the contribution of each feature to the RawFormulaVal difference for each pair of objects.
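To build intuition for what a per-feature contribution to a score difference means, here is a minimal sketch on a hypothetical linear scorer. It is not how CatBoost computes PredictionDiff for tree ensembles; the weights, the objects, and the helper names `linear_score` and `linear_diff_contributions` are made up for illustration. For a linear model, the score difference between two objects splits exactly into one term per feature, `w_j * (a_j - b_j)`.

```python
def linear_score(weights, obj):
    """Score of one object under a toy linear model (assumption, not CatBoost)."""
    return sum(w * x for w, x in zip(weights, obj))


def linear_diff_contributions(weights, obj_a, obj_b):
    """Per-feature contribution to score(obj_a) - score(obj_b) for the toy linear model."""
    return [w * (a - b) for w, a, b in zip(weights, obj_a, obj_b)]


# Hypothetical weights and feature vectors, chosen only for the illustration.
weights = [0.5, -1.0, 2.0]
obj_a = [1.0, 2.0, 0.0]
obj_b = [3.0, 2.0, 1.0]

contributions = linear_diff_contributions(weights, obj_a, obj_b)
score_diff = linear_score(weights, obj_a) - linear_score(weights, obj_b)

# The per-feature terms add up to the total score difference.
print(contributions, score_diff)
```

In this toy case the third feature accounts for most of the gap between the two objects, which is the kind of question PredictionDiff is meant to answer for a real model.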

Let's find a pair of objects in the first group of the test pool that our model ranks in the wrong order.
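A pair is ranked in the wrong order when the object with the higher relevance label receives the lower score. The following sketch makes that definition explicit on toy data; the helper `misordered_pairs` and the sample scores and labels are hypothetical and are not part of CatBoost.

```python
def misordered_pairs(scores, labels):
    """Return (i, j) pairs where object i is more relevant than j but scored lower."""
    pairs = []
    n = len(scores)
    for i in range(n):
        for j in range(n):
            if labels[i] > labels[j] and scores[i] < scores[j]:
                pairs.append((i, j))
    return pairs


# Toy scores and relevance labels, made up for the illustration.
scores = [0.17, -0.01, 0.40]
labels = [0, 3, 2]

print(misordered_pairs(scores, labels))  # [(1, 0), (1, 2)]
```

The object at index 1 has the highest label but the lowest score, so it forms a wrongly ranked pair with each of the other two objects.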

In [5]:
# find 1st group
group_size = 1
while group_id_test[group_size] == group_id_test[0]:
    group_size += 1

# get predictions
target = y_test[:group_size]
prediction = model.predict(X_test[: group_size], prediction_type='RawFormulaVal')
# materialize (prediction, target, index) triples; in Python 3 zip() is a one-shot iterator
prediction = list(zip(prediction, target, range(group_size)))
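The group-size loop above assumes the data contains at least two groups; if every row shares one group id, the index runs past the end of the array. A slightly more defensive variant might look like this (the helper name `first_group_size` is hypothetical):

```python
def first_group_size(group_ids):
    """Count the leading rows that share the first row's group id."""
    size = 0
    for gid in group_ids:
        if gid != group_ids[0]:
            break
        size += 1
    return size


# Works for plain lists as well as NumPy arrays of group ids.
print(first_group_size([1, 1, 1, 2, 2]))  # 3
print(first_group_size([7, 7]))           # 2
```

With this helper, `group_size = first_group_size(group_id_test)` would replace the `while` loop.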

In [6]:
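# find a wrongly ranked pair of objects:
# the highest-scored object with target 0 and the lowest-scored object with target 3
scored = list(prediction)  # (prediction, target, index) triples
wrong_prediction_idxs = [
    max((x[0], x[2]) for x in scored if x[1] == 0)[1],
    min((x[0], x[2]) for x in scored if x[1] == 3)[1]
]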
test_pool_slice = X_test[wrong_prediction_idxs]

list(zip(model.predict(test_pool_slice, prediction_type='RawFormulaVal'), target[wrong_prediction_idxs]))

[(0.16605430089485623, 0.0), (-0.011438740254469434, 3.0)]

Let's calculate PredictionDiff for these two objects and look at the most important features.

As the output below shows, feature 133 contributes most to the difference, so changing its value could change how the model ranks this pair.

In [7]:
prediction_diff = model.get_feature_importance(type='PredictionDiff', data=test_pool_slice, prettified=True)
prediction_diff.head(10)

| | Feature Id | Importances |
|---|---|---|
| 0 | 133 | 0.297181 |
| 1 | 107 | 0.141921 |
| 2 | 64 | 0.08289 |
| 3 | 10 | 0.065537 |
| 4 | 14 | 0.061547 |
| 5 | 7 | 0.061138 |
| 6 | 59 | 0.058217 |
| 7 | 122 | 0.043991 |
| 8 | 54 | 0.042972 |
| 9 | 49 | 0.042607 |


In [8]:
model.plot_predictions(
    data=test_pool_slice,
    features_to_change=prediction_diff["Feature Id"][:3],
    plot=True);

