# CatBoost R Tutorial
R kernel for Jupyter Notebook: [link](https://irkernel.github.io/installation/)

In [1]:
library(catboost)
library(caret)
library(titanic)

Loading required package: lattice
Loading required package: ggplot2


## Make CatBoost Pool

### From file

Two files are needed to create a CatBoost Pool in R:

- File with features
  
```sh
> cat adult_train.1000 | head -1
1	28.0	Private	120135.0	Assoc-voc	11.0	Never-married	Sales	Not-in-family	White	Female	0.0	0.0	40.0	United-States
```

- Column description file

```sh
> cat adult.cd | head -3
0	Target
2	Categ
4	Categ
```

Column indices are 0-based. Each column type must be one of:

- Target (one column);
- Categ;
- Num (default type).

Numeric columns can be omitted from the description file, because `Num` is the default type.
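
Because the column description file is plain tab-separated text, it can also be written from R. The following is a minimal sketch that reproduces the three lines shown above in a temporary file; the variable names and the temporary path are illustrative and not part of the catboost package.

```r
# Column 0 is the target, columns 2 and 4 are categorical.
# Numeric columns are left out because Num is the default type.
cd <- data.frame(index = c(0, 2, 4),
                 type  = c("Target", "Categ", "Categ"))

# Write "index<TAB>type" lines without quotes, row names or a header.
cd_path <- tempfile(fileext = ".cd")
write.table(cd, cd_path, sep = "\t", quote = FALSE,
            row.names = FALSE, col.names = FALSE)

cd_lines <- readLines(cd_path)
print(cd_lines)
```

The resulting path could then be passed as the `column_description` argument in the same way as `adult.cd` in the next cell.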

In [2]:
pool_path = system.file("extdata", "adult_train.1000", package = "catboost")
column_description_path = system.file("extdata", "adult.cd", package = "catboost")
pool <- catboost.load_pool(pool_path, column_description = column_description_path)
head(pool, 1)

0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
1,1,28,3.8919580000000003e+36,120135,-1.040168e-34,11,1.2614570000000001e+32,-371032621056,8.078708999999999e-34,-9.782154999999999e+30,-9.047986999999999e-38,0,0,40,1.219625e+24


###  From matrix

Categorical features must be converted to numeric columns using a method of your choice (for example, a string hash). Indices in the **`cat_features`** vector are 0-based and can differ from the indices in the **`.cd`** file.

In [3]:
pool_path = system.file("extdata", "adult_train.1000", package="catboost")

column_description_vector = rep('numeric', 15)
cat_features <- c(3, 5, 7, 8, 9, 10, 11, 15)
for (i in cat_features)
    column_description_vector[i] <- 'factor'

data <- read.table(pool_path, head = F, sep = "\t", colClasses = column_description_vector, na.strings='NAN')

# Transform categorical features to numeric.
for (i in cat_features)
    data[,i] <- as.numeric(factor(data[,i]))

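target <- c(1)
# cat_features holds 1-based positions in the full table (target included);
# the pool expects 0-based positions among the feature columns only, hence "- 2".
pool <- catboost.load_pool(as.matrix(data[,-target]),
                             label = as.matrix(data[,target]),
                             cat_features = cat_features - 2)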
head(pool, 1)

0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
1,1,28,4,120135,9,11,5,12,2,5,1,0,0,40,32
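
The difference between R's 1-based column positions and the 0-based **`cat_features`** indices is easy to get wrong. Here is a small base-R sketch of the index arithmetic. It assumes that `cat_features` refers to columns of the feature matrix, that is, after the target column has been removed; the variable names are illustrative.

```r
# Positions of categorical columns in the full table (1-based, target included),
# as used for read.table() above.
cat_columns <- c(3, 5, 7, 8, 9, 10, 11, 15)
target <- 1

# Feature columns that remain once the target is dropped
# (still expressed as 1-based positions in the full table).
feature_columns <- setdiff(seq_len(15), target)

# 0-based position of each categorical column within the feature matrix.
cat_features_0based <- match(cat_columns, feature_columns) - 1
print(cat_features_0based)
```

With the target in the first column, this reduces to subtracting 2 from each 1-based position.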


### From data.frame

Categorical features must be converted to factors (for example with `as.factor()` or the `colClasses` argument of `read.table()`). Numeric features and the target must be of type numeric.

In [4]:
train_path = system.file("extdata", "adult_train.1000", package="catboost")
test_path = system.file("extdata", "adult_test.1000", package="catboost")

column_description_vector = rep('numeric', 15)
cat_features <- c(3, 5, 7, 8, 9, 10, 11, 15)
for (i in cat_features)
    column_description_vector[i] <- 'factor'
    
train <- read.table(train_path, head = F, sep = "\t", colClasses = column_description_vector, na.strings='NAN')
test <- read.table(test_path, head = F, sep = "\t", colClasses = column_description_vector, na.strings='NAN')
target <- c(1)
train_pool <- catboost.load_pool(data=train[,-target], label = train[,target])
test_pool <- catboost.load_pool(data=test[,-target], label = test[,target])
head(train_pool, 1)
head(test_pool, 1)

0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
1,1,28,3.8919580000000003e+36,120135,-1.040168e-34,11,1.2614570000000001e+32,-371032621056,8.078708999999999e-34,-9.782154999999999e+30,-9.047986999999999e-38,0,0,40,1.219625e+24


0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
1,1,73,-1220011000000000.0,30958,-40904704,10,-2.326434e-34,-371032621056,9.094553e-37,-9.782154999999999e+30,-3.163861e-08,0,0,25,1.219625e+24
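
Before building a pool from a data.frame, it can help to check the column types. Here is a minimal base-R sketch on a toy data frame; the column names and values are made up for illustration and do not come from the adult dataset.

```r
# Toy data frame with a character target, an integer feature
# and a character categorical feature.
df <- data.frame(label = c("1", "-1", "1"),
                 age = c(28L, 73L, 41L),
                 workclass = c("Private", "State-gov", "Private"),
                 stringsAsFactors = FALSE)

df$label <- as.numeric(df$label)         # target: numeric
df$age <- as.numeric(df$age)             # numeric feature: integer -> numeric
df$workclass <- as.factor(df$workclass)  # categorical feature: factor

column_classes <- sapply(df, class)
print(column_classes)
```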


## Explore pool

In [5]:
# number of rows and columns
cat("Nrows: ", nrow(train_pool), ", Ncols: ", ncol(train_pool), "\n")
# first rows of pool
cat("\nFirst row: ")
head(train_pool, n = 1)
cat("\nLast row: ")
tail(train_pool, n = 1)
cat("\nColumn names: ")
colnames(train_pool)

Nrows:  1000 , Ncols:  14 

First row: 

0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
1,1,28,3.8919580000000003e+36,120135,-1.040168e-34,11,1.2614570000000001e+32,-371032621056,8.078708999999999e-34,-9.782154999999999e+30,-9.047986999999999e-38,0,0,40,1.219625e+24



Last row: 

0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
-1,1,71,-1.816107e-18,177906,5.92781e-19,13,-2.326434e-34,-1.816107e-18,9.094553e-37,-9.782154999999999e+30,-3.163861e-08,0,0,10,1.219625e+24



Column names: 

## Train model

See **`help(catboost.train)`** for the full list of arguments and their descriptions. Available loss functions: RMSE, MAE, Logloss, CrossEntropy, Quantile, LogLinQuantile, Poisson, MAPE, SMAPE, MultiClass, AUC.

In [6]:
fit_params <- list(iterations = 100,
                   thread_count = 10,
                   loss_function = 'Logloss',
                   ignored_features = c(4,9),
                   border_count = 32,
                   depth = 5,
                   learning_rate = 0.03,
                   l2_leaf_reg = 3.5,
                   train_dir = 'train_dir',
                   logging_level = 'Silent')
model <- catboost.train(train_pool, test_pool, fit_params)
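
For reference, `Logloss` is the mean negative log-likelihood of the true class. Here is a minimal base-R sketch, assuming labels coded as -1/1 as in this dataset and probabilities given for the positive class. The helper name `logloss` is hypothetical and is not part of the catboost package.

```r
# Hypothetical helper: binary log loss for -1/1 labels.
logloss <- function(prob, expected) {
  y <- as.numeric(expected == 1)  # map -1/1 labels to 0/1
  -mean(y * log(prob) + (1 - y) * log(1 - prob))
}

# Two confident, correct predictions (made-up numbers).
toy_prob <- c(0.9, 0.1)
toy_expected <- c(1, -1)
toy_loss <- logloss(toy_prob, toy_expected)
print(toy_loss)
```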

## Predict and evaluate

In [7]:
calc_accuracy <- function(prediction, expected) {
  labels <- ifelse(prediction > 0.5, 1, -1)
  accuracy <- sum(labels == expected) / length(labels)
  return(accuracy)
}

prediction <- catboost.predict(model, test_pool, prediction_type = 'Probability')
cat("Sample predictions: ", sample(prediction, 5), "\n")

labels <- catboost.predict(model, test_pool, prediction_type = 'Class')
table(labels, test[,target])

# works properly only for Logloss
accuracy <- calc_accuracy(prediction, test[,target])
cat("\nAccuracy: ", accuracy, "\n")

# feature splits importances (not finished)

cat("\nFeature importances", "\n")
catboost.get_feature_importance(model, train_pool)

cat("\nTree count: ", model$tree_count, "\n")

Sample predictions:  0.4286844 0.2480608 0.5529215 0.1709756 0.032663 


      
labels  -1   1
     0 414 102
     1  86 398


Accuracy:  0.812 

Feature importances 



Tree count:  100 
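
The reported accuracy can be cross-checked against the confusion table printed above. Here is a short base-R sketch that uses the counts copied from that table; the variable names are illustrative.

```r
# Rows are predicted classes (0/1), columns are true labels (-1/1).
# matrix() fills column by column.
confusion <- matrix(c(414, 86, 102, 398), nrow = 2,
                    dimnames = list(predicted = c("0", "1"),
                                    actual = c("-1", "1")))

# Correct predictions sit on the diagonal: (0, -1) and (1, 1).
accuracy_from_table <- sum(diag(confusion)) / sum(confusion)
print(accuracy_from_table)
```

The result agrees with the value computed by `calc_accuracy`.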


You can also use the **`catboost.staged_predict`** function, which returns an iterator over predictions made with a growing number of trees.

In [8]:
library(iterators)
staged_predictions <- catboost.staged_predict(model, test_pool, ntree_start = 2, ntree_end = 5,
                                              eval_period = 2, prediction_type = 'Probability')
staged_prediction_2_4 = nextElem(staged_predictions) # 2nd and 3rd trees
staged_prediction_2_5 = nextElem(staged_predictions) # 2nd, 3rd and 4th trees

prediction_2_4 = catboost.predict(model, test_pool, ntree_start = 2, ntree_end = 4, prediction_type = 'Probability')
prediction_2_5 = catboost.predict(model, test_pool, ntree_start = 2, ntree_end = 5, prediction_type = 'Probability')
cat(all(prediction_2_4 == staged_prediction_2_4), '\n')
cat(all(prediction_2_5 == staged_prediction_2_5))

TRUE 
TRUE
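
The tree ranges visited by the iterator can be sketched in base R. This helper is hypothetical and only mirrors the behaviour seen in the cell above (ranges `[2, 4)` and `[2, 5)` for `ntree_start = 2`, `ntree_end = 5`, `eval_period = 2`); it is not taken from the catboost source.

```r
# Hypothetical helper: upper bounds of the tree ranges [ntree_start, end)
# that a staged prediction steps through.
staged_ends <- function(ntree_start, ntree_end, eval_period) {
  first_end <- ntree_start + eval_period
  if (first_end >= ntree_end) return(ntree_end)
  # Step by eval_period and always finish at ntree_end.
  unique(c(seq(first_end, ntree_end, by = eval_period), ntree_end))
}

ends <- staged_ends(ntree_start = 2, ntree_end = 5, eval_period = 2)
print(ends)
```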

## Useful features

If you have a validation set, the overfitting detector can stop training early and so make it faster.

In [9]:
params_simple <- list(iterations = 500,
                      loss_function = 'Logloss',
                      train_dir = 'train_dir',
                      logging_level = 'Silent')
model_simple <- catboost.train(train_pool, test_pool, params_simple)

params_with_od <- list(iterations = 500,
                       loss_function = 'Logloss',
                       train_dir = 'train_dir',
                       od_type = 'Iter',
                       od_wait = 30,
                       logging_level = 'Silent')
model_with_od <- catboost.train(train_pool, test_pool, params_with_od)

cat('Simple model tree count: ', model_simple$tree_count, '\n')
cat('Model with od tree count: ', model_with_od$tree_count, '\n')

Simple model tree count:  500 
Model with od tree count:  268 
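
The idea behind `od_type = 'Iter'` with `od_wait` is a patience rule: training stops once the validation metric has not improved for a given number of iterations. Here is a toy base-R sketch of such a rule. The loss values are made up and the function `stop_iteration` is hypothetical; it illustrates the principle and is not CatBoost's actual implementation.

```r
# Hypothetical validation losses per iteration (made-up numbers).
valid_loss <- c(0.60, 0.50, 0.45, 0.46, 0.47, 0.44, 0.48, 0.49, 0.50, 0.51)

# Return the iteration at which training would stop under a patience rule.
stop_iteration <- function(losses, wait) {
  best <- Inf
  best_iter <- 0
  for (i in seq_along(losses)) {
    if (losses[i] < best) {
      best <- losses[i]
      best_iter <- i
    } else if (i - best_iter >= wait) {
      return(i)  # no improvement during the last `wait` iterations
    }
  }
  length(losses)  # the detector never fired
}

stopped_at <- stop_iteration(valid_loss, wait = 3)
print(stopped_at)
```

In this toy run the best loss occurs at iteration 6, so with a patience of 3 the loop stops at iteration 9.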


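You can also make predictions with the best model, that is, the one from the iteration that performed best on the validation pool (`use_best_model = TRUE`).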

In [10]:
params_simple <- list(iterations = 1000,
                      loss_function = 'Logloss',
                      train_dir = 'train_dir',
                      logging_level = 'Silent')
model_simple <- catboost.train(train_pool, test_pool, params_simple)

params_best <- list(iterations = 1000,
                    loss_function = 'Logloss',
                    train_dir = 'train_dir',
                    use_best_model = TRUE,
                    logging_level = 'Silent')
model_best <- catboost.train(train_pool, test_pool, params_best)

prediction_simple <- catboost.predict(model_simple, test_pool, prediction_type = 'Probability')
prediction_best <- catboost.predict(model_best, test_pool, prediction_type = 'Probability')

cat('Simple model accuracy: ', calc_accuracy(prediction_simple, test[,target]), '\n')
cat('The best model accuracy: ', calc_accuracy(prediction_best, test[,target]), '\n')

Simple model accuracy:  0.808 
The best model accuracy:  0.822 
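
Conceptually, keeping the best model means truncating the ensemble at the iteration with the best validation score. Here is a toy base-R sketch with made-up loss values; the variable names are illustrative and nothing here calls the catboost package.

```r
# Hypothetical validation losses per iteration (made-up numbers).
valid_loss <- c(0.60, 0.50, 0.45, 0.46, 0.44, 0.48, 0.49)

best_iteration <- which.min(valid_loss)  # iteration with the lowest validation loss
trees_kept <- seq_len(best_iteration)    # trees retained in the truncated model
print(best_iteration)
```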


## Catboosting with caret

Load and preprocess the Titanic dataset.

In [11]:
set.seed(12345)

data <- as.data.frame(as.matrix(titanic_train), stringsAsFactors=TRUE)

age_levels <- levels(data$Age)
most_frequent_age <- which.max(table(data$Age))
data$Age[is.na(data$Age)] <- age_levels[most_frequent_age]

drop_columns = c("PassengerId", "Survived", "Name", "Ticket", "Cabin")
x <- data[,!(names(data) %in% drop_columns)]
y <- data[,c("Survived")]
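
The missing-value step above replaces `NA` ages with the most frequent level of the factor. Here is the same technique on a small made-up factor, as a base-R sketch:

```r
# Toy factor with missing values (made-up ages stored as strings).
age <- factor(c("22", "38", NA, "22", "35", NA))

age_levels <- levels(age)
most_frequent <- which.max(table(age))  # table() ignores NA by default
age[is.na(age)] <- age_levels[most_frequent]

age_filled <- as.character(age)
print(age_filled)
```

Note that this treats age as a categorical value, as the cell above does.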

Training uses 5-fold cross-validation, and a grid search looks for the optimal tree depth.

In [12]:
fit_control <- trainControl(method = "cv",
                            number = 5,
                            classProbs = TRUE)

grid <- expand.grid(depth = c(4, 6, 8),
                    learning_rate = 0.1,
                    iterations = 100,
                    l2_leaf_reg = 0.1,
                    rsm = 0.95,
                    border_count = 64)

model <- train(x, as.factor(make.names(y)),
                method = catboost.caret,
                logging_level = 'Silent', preProc = NULL,
                tuneGrid = grid, trControl = fit_control)
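
To see how much work the search involves, the tuning grid can be inspected on its own. The sketch below repeats the grid from the cell above using only base R. The fit count assumes one model per candidate and fold and does not include the final refit on the full data.

```r
grid <- expand.grid(depth = c(4, 6, 8),
                    learning_rate = 0.1,
                    iterations = 100,
                    l2_leaf_reg = 0.1,
                    rsm = 0.95,
                    border_count = 64)

n_candidates <- nrow(grid)   # only depth varies, so one row per depth value
n_fits <- n_candidates * 5   # 5 folds per candidate
print(grid)
```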

Print information about the model.

In [13]:
print(model)

importance <- varImp(model, scale = FALSE)
print(importance)

Catboost 

891 samples
  7 predictor
  2 classes: 'X0', 'X1' 

No pre-processing
Resampling: Cross-Validated (5 fold) 
Summary of sample sizes: 714, 712, 713, 713, 712 
Resampling results across tuning parameters:

  depth  Accuracy   Kappa    
  4      0.8147803  0.5921606
  6      0.8035946  0.5740471
  8      0.8136628  0.5961711

Tuning parameter 'learning_rate' was held constant at a value of 0.1

Tuning parameter 'rsm' was held constant at a value of 0.95
Tuning parameter 'border_count' was held constant at a value of 64
Accuracy was used to select the optimal model using  the largest value.
The final values used for the model were depth = 4, learning_rate =
 0.1, iterations = 100, l2_leaf_reg = 0.1, rsm = 0.95 and border_count = 64.
custom variable importance

         Overall
Sex       26.832
Fare      22.384
Pclass    16.507
Parch     14.540
Age        7.522
Embarked   7.473
SibSp      4.741


Finally, predict class probabilities.

In [14]:
head(predict(model, type = 'prob'))

X0,X1
0.93941009,0.06058991
0.01378653,0.98621347
0.29743151,0.70256849
0.02529839,0.97470161
0.95701379,0.04298621
0.91183457,0.08816543
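
Class labels can be derived from these probabilities by picking the more likely column in each row. Here is a base-R sketch that uses the six rows copied from the output above; the variable names are illustrative.

```r
# Probabilities copied from the output above.
probs <- data.frame(X0 = c(0.93941009, 0.01378653, 0.29743151,
                           0.02529839, 0.95701379, 0.91183457),
                    X1 = c(0.06058991, 0.98621347, 0.70256849,
                           0.97470161, 0.04298621, 0.08816543))

# Pick the column with the highest probability in each row.
predicted_class <- colnames(probs)[max.col(as.matrix(probs), ties.method = "first")]
print(predicted_class)
```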
