{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# $$CatBoost\\ Tutorial$$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/catboost/tutorials/blob/master/python_tutorial.ipynb)\n", "\n", "In this tutorial we explore the basic use cases of CatBoost, such as model training, cross-validation and prediction, as well as some useful features like early stopping, snapshot support, feature importances and parameter tuning.\n", " \n", "You can run this tutorial in the Google Colaboratory environment with a free CPU or GPU by clicking the Open In Colab badge above." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## $$Contents$$\n", "* [1. Data Preparation](#$$1.\\-Data\\-Preparation$$)\n", "    * [1.1 CatBoost installation](#1.1-CatBoost-installation)\n", "    * [1.2 Data Loading](#1.2-Data-Loading)\n", "    * [1.3 Feature Preparation](#1.3-Feature-Preparation)\n", "    * [1.4 Data Splitting](#1.4-Data-Splitting)\n", "* [2. CatBoost Basics](#$$2.\\-CatBoost\\-Basics$$)\n", "    * [2.1 Model Training](#2.1-Model-Training)\n", "    * [2.2 Model Cross-Validation](#2.2-Model-Cross-Validation)\n", "    * [2.3 Model Applying](#2.3-Model-Applying)\n", "* [3. CatBoost Features](#$$3.\\-CatBoost\\-Features$$)\n", "    * [3.1 Using the best model](#3.1-Using-the-best-model)\n", "    * [3.2 Early Stopping](#3.2-Early-Stopping)\n", "    * [3.3 Using Baseline](#3.3-Using-Baseline)\n", "    * [3.4 Snapshot Support](#3.4-Snapshot-Support)\n", "    * [3.5 User Defined Objective Function](#3.5-User-Defined-Objective-Function)\n", "    * [3.6 User Defined Metric Function](#3.6-User-Defined-Metric-Function)\n", "    * [3.7 Staged Predict](#3.7-Staged-Predict)\n", "    * [3.8 Feature Importances](#3.8-Feature-Importances)\n", "    * [3.9 Eval Metrics](#3.9-Eval-Metrics)\n", "    * [3.10 Learning Processes Comparison](#3.10-Learning-Processes-Comparison)\n", "    * [3.11 Model Saving](#3.11-Model-Saving)\n", "* [4. 
Parameters Tuning](#$$4.\\-Parameters\\-Tuning$$)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## $$1.\\ Data\\ Preparation$$\n", "### 1.1 CatBoost installation\n", "If you have not already installed CatBoost, you can do so by running '!pip install catboost' command. \n", " \n", "Also you should install ipywidgets package and run special command before launching jupyter notebook to draw plots." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Collecting catboost\n", "\u001b[?25l Downloading https://files.pythonhosted.org/packages/49/d9/898a290d24bfd20a3e0758f4639b4da15fc338aea1e160c91e288c574195/catboost-0.11.2-cp37-none-macosx_10_6_intel.macosx_10_9_intel.macosx_10_9_x86_64.macosx_10_10_intel.macosx_10_10_x86_64.whl (7.4MB)\n", "\u001b[K 100% |████████████████████████████████| 7.4MB 2.7MB/s ta 0:00:011\n", "\u001b[?25hRequirement already satisfied: pandas>=0.19.1 in /Users/sbrazhnik/anaconda3/lib/python3.7/site-packages (from catboost) (0.23.4)\n", "Collecting enum34 (from catboost)\n", " Using cached https://files.pythonhosted.org/packages/af/42/cb9355df32c69b553e72a2e28daee25d1611d2c0d9c272aa1d34204205b2/enum34-1.1.6-py3-none-any.whl\n", "Requirement already satisfied: numpy>=1.11.1 in /Users/sbrazhnik/anaconda3/lib/python3.7/site-packages (from catboost) (1.15.4)\n", "Requirement already satisfied: six in /Users/sbrazhnik/anaconda3/lib/python3.7/site-packages (from catboost) (1.12.0)\n", "Requirement already satisfied: python-dateutil>=2.5.0 in /Users/sbrazhnik/anaconda3/lib/python3.7/site-packages (from pandas>=0.19.1->catboost) (2.7.5)\n", "Requirement already satisfied: pytz>=2011k in /Users/sbrazhnik/anaconda3/lib/python3.7/site-packages (from pandas>=0.19.1->catboost) (2018.7)\n", "Installing collected packages: enum34, catboost\n", "Successfully installed catboost-0.11.2 enum34-1.1.6\n", "Requirement already satisfied: ipywidgets in 
/Users/sbrazhnik/anaconda3/lib/python3.7/site-packages (7.4.2)\n", "Requirement already satisfied: traitlets>=4.3.1 in /Users/sbrazhnik/anaconda3/lib/python3.7/site-packages (from ipywidgets) (4.3.2)\n", "Requirement already satisfied: ipykernel>=4.5.1 in /Users/sbrazhnik/anaconda3/lib/python3.7/site-packages (from ipywidgets) (5.1.0)\n", "Requirement already satisfied: nbformat>=4.2.0 in /Users/sbrazhnik/anaconda3/lib/python3.7/site-packages (from ipywidgets) (4.4.0)\n", "Requirement already satisfied: ipython>=4.0.0; python_version >= \"3.3\" in /Users/sbrazhnik/anaconda3/lib/python3.7/site-packages (from ipywidgets) (7.2.0)\n", "Requirement already satisfied: widgetsnbextension~=3.4.0 in /Users/sbrazhnik/anaconda3/lib/python3.7/site-packages (from ipywidgets) (3.4.2)\n", "Requirement already satisfied: six in /Users/sbrazhnik/anaconda3/lib/python3.7/site-packages (from traitlets>=4.3.1->ipywidgets) (1.12.0)\n", "Requirement already satisfied: ipython-genutils in /Users/sbrazhnik/anaconda3/lib/python3.7/site-packages (from traitlets>=4.3.1->ipywidgets) (0.2.0)\n", "Requirement already satisfied: decorator in /Users/sbrazhnik/anaconda3/lib/python3.7/site-packages (from traitlets>=4.3.1->ipywidgets) (4.3.0)\n", "Requirement already satisfied: jupyter-client in /Users/sbrazhnik/anaconda3/lib/python3.7/site-packages (from ipykernel>=4.5.1->ipywidgets) (5.2.4)\n", "Requirement already satisfied: tornado>=4.2 in /Users/sbrazhnik/anaconda3/lib/python3.7/site-packages (from ipykernel>=4.5.1->ipywidgets) (5.1.1)\n", "Requirement already satisfied: jsonschema!=2.5.0,>=2.4 in /Users/sbrazhnik/anaconda3/lib/python3.7/site-packages (from nbformat>=4.2.0->ipywidgets) (2.6.0)\n", "Requirement already satisfied: jupyter-core in /Users/sbrazhnik/anaconda3/lib/python3.7/site-packages (from nbformat>=4.2.0->ipywidgets) (4.4.0)\n", "Requirement already satisfied: pickleshare in /Users/sbrazhnik/anaconda3/lib/python3.7/site-packages (from ipython>=4.0.0; python_version >= 
\"3.3\"->ipywidgets) (0.7.5)\n", "Requirement already satisfied: appnope; sys_platform == \"darwin\" in /Users/sbrazhnik/anaconda3/lib/python3.7/site-packages (from ipython>=4.0.0; python_version >= \"3.3\"->ipywidgets) (0.1.0)\n", "Requirement already satisfied: backcall in /Users/sbrazhnik/anaconda3/lib/python3.7/site-packages (from ipython>=4.0.0; python_version >= \"3.3\"->ipywidgets) (0.1.0)\n", "Requirement already satisfied: pexpect; sys_platform != \"win32\" in /Users/sbrazhnik/anaconda3/lib/python3.7/site-packages (from ipython>=4.0.0; python_version >= \"3.3\"->ipywidgets) (4.6.0)\n", "Requirement already satisfied: pygments in /Users/sbrazhnik/anaconda3/lib/python3.7/site-packages (from ipython>=4.0.0; python_version >= \"3.3\"->ipywidgets) (2.3.1)\n", "Requirement already satisfied: prompt-toolkit<2.1.0,>=2.0.0 in /Users/sbrazhnik/anaconda3/lib/python3.7/site-packages (from ipython>=4.0.0; python_version >= \"3.3\"->ipywidgets) (2.0.7)\n", "Requirement already satisfied: jedi>=0.10 in /Users/sbrazhnik/anaconda3/lib/python3.7/site-packages (from ipython>=4.0.0; python_version >= \"3.3\"->ipywidgets) (0.13.2)\n", "Requirement already satisfied: setuptools>=18.5 in /Users/sbrazhnik/anaconda3/lib/python3.7/site-packages (from ipython>=4.0.0; python_version >= \"3.3\"->ipywidgets) (40.6.3)\n", "Requirement already satisfied: notebook>=4.4.1 in /Users/sbrazhnik/anaconda3/lib/python3.7/site-packages (from widgetsnbextension~=3.4.0->ipywidgets) (5.7.4)\n", "Requirement already satisfied: pyzmq>=13 in /Users/sbrazhnik/anaconda3/lib/python3.7/site-packages (from jupyter-client->ipykernel>=4.5.1->ipywidgets) (17.1.2)\n", "Requirement already satisfied: python-dateutil>=2.1 in /Users/sbrazhnik/anaconda3/lib/python3.7/site-packages (from jupyter-client->ipykernel>=4.5.1->ipywidgets) (2.7.5)\n", "Requirement already satisfied: ptyprocess>=0.5 in /Users/sbrazhnik/anaconda3/lib/python3.7/site-packages (from pexpect; sys_platform != \"win32\"->ipython>=4.0.0; 
python_version >= \"3.3\"->ipywidgets) (0.6.0)\n", "Requirement already satisfied: wcwidth in /Users/sbrazhnik/anaconda3/lib/python3.7/site-packages (from prompt-toolkit<2.1.0,>=2.0.0->ipython>=4.0.0; python_version >= \"3.3\"->ipywidgets) (0.1.7)\n", "Requirement already satisfied: parso>=0.3.0 in /Users/sbrazhnik/anaconda3/lib/python3.7/site-packages (from jedi>=0.10->ipython>=4.0.0; python_version >= \"3.3\"->ipywidgets) (0.3.1)\n", "Requirement already satisfied: prometheus-client in /Users/sbrazhnik/anaconda3/lib/python3.7/site-packages (from notebook>=4.4.1->widgetsnbextension~=3.4.0->ipywidgets) (0.5.0)\n", "Requirement already satisfied: terminado>=0.8.1 in /Users/sbrazhnik/anaconda3/lib/python3.7/site-packages (from notebook>=4.4.1->widgetsnbextension~=3.4.0->ipywidgets) (0.8.1)\n", "Requirement already satisfied: Send2Trash in /Users/sbrazhnik/anaconda3/lib/python3.7/site-packages (from notebook>=4.4.1->widgetsnbextension~=3.4.0->ipywidgets) (1.5.0)\n", "Requirement already satisfied: jinja2 in /Users/sbrazhnik/anaconda3/lib/python3.7/site-packages (from notebook>=4.4.1->widgetsnbextension~=3.4.0->ipywidgets) (2.10)\n", "Requirement already satisfied: nbconvert in /Users/sbrazhnik/anaconda3/lib/python3.7/site-packages (from notebook>=4.4.1->widgetsnbextension~=3.4.0->ipywidgets) (5.4.0)\n", "Requirement already satisfied: MarkupSafe>=0.23 in /Users/sbrazhnik/anaconda3/lib/python3.7/site-packages (from jinja2->notebook>=4.4.1->widgetsnbextension~=3.4.0->ipywidgets) (1.1.0)\n", "Requirement already satisfied: mistune>=0.8.1 in /Users/sbrazhnik/anaconda3/lib/python3.7/site-packages (from nbconvert->notebook>=4.4.1->widgetsnbextension~=3.4.0->ipywidgets) (0.8.4)\n", "Requirement already satisfied: entrypoints>=0.2.2 in /Users/sbrazhnik/anaconda3/lib/python3.7/site-packages (from nbconvert->notebook>=4.4.1->widgetsnbextension~=3.4.0->ipywidgets) (0.2.3)\n", "Requirement already satisfied: bleach in /Users/sbrazhnik/anaconda3/lib/python3.7/site-packages (from 
nbconvert->notebook>=4.4.1->widgetsnbextension~=3.4.0->ipywidgets) (3.0.2)\n", "Requirement already satisfied: pandocfilters>=1.4.1 in /Users/sbrazhnik/anaconda3/lib/python3.7/site-packages (from nbconvert->notebook>=4.4.1->widgetsnbextension~=3.4.0->ipywidgets) (1.4.2)\n", "Requirement already satisfied: testpath in /Users/sbrazhnik/anaconda3/lib/python3.7/site-packages (from nbconvert->notebook>=4.4.1->widgetsnbextension~=3.4.0->ipywidgets) (0.4.2)\n", "Requirement already satisfied: defusedxml in /Users/sbrazhnik/anaconda3/lib/python3.7/site-packages (from nbconvert->notebook>=4.4.1->widgetsnbextension~=3.4.0->ipywidgets) (0.5.0)\n", "Requirement already satisfied: webencodings in /Users/sbrazhnik/anaconda3/lib/python3.7/site-packages (from bleach->nbconvert->notebook>=4.4.1->widgetsnbextension~=3.4.0->ipywidgets) (0.5.1)\n", "Enabling notebook extension jupyter-js-widgets/extension...\n", " - Validating: \u001b[32mOK\u001b[0m\n" ] } ], "source": [ "!pip install catboost\n", "!pip install ipywidgets\n", "!jupyter nbextension enable --py widgetsnbextension" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1.2 Data Loading\n", "The data for this tutorial can be obtained from [this page](https://www.kaggle.com/c/titanic/data) (you would have to register a kaggle account or just login with facebook or google+) or you could use catboost.datasets as in code below." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table>\n",
"  <thead>\n",
"    <tr><th></th><th>PassengerId</th><th>Survived</th><th>Pclass</th><th>Name</th><th>Sex</th><th>Age</th><th>SibSp</th><th>Parch</th><th>Ticket</th><th>Fare</th><th>Cabin</th><th>Embarked</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>1</td><td>0</td><td>3</td><td>Braund, Mr. Owen Harris</td><td>male</td><td>22.0</td><td>1</td><td>0</td><td>A/5 21171</td><td>7.2500</td><td>NaN</td><td>S</td></tr>\n",
"    <tr><th>1</th><td>2</td><td>1</td><td>1</td><td>Cumings, Mrs. John Bradley (Florence Briggs Th...</td><td>female</td><td>38.0</td><td>1</td><td>0</td><td>PC 17599</td><td>71.2833</td><td>C85</td><td>C</td></tr>\n",
"    <tr><th>2</th><td>3</td><td>1</td><td>3</td><td>Heikkinen, Miss. Laina</td><td>female</td><td>26.0</td><td>0</td><td>0</td><td>STON/O2. 3101282</td><td>7.9250</td><td>NaN</td><td>S</td></tr>\n",
"    <tr><th>3</th><td>4</td><td>1</td><td>1</td><td>Futrelle, Mrs. Jacques Heath (Lily May Peel)</td><td>female</td><td>35.0</td><td>1</td><td>0</td><td>113803</td><td>53.1000</td><td>C123</td><td>S</td></tr>\n",
"    <tr><th>4</th><td>5</td><td>0</td><td>3</td><td>Allen, Mr. William Henry</td><td>male</td><td>35.0</td><td>0</td><td>0</td><td>373450</td><td>8.0500</td><td>NaN</td><td>S</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>" ], "text/plain": [ " PassengerId Survived Pclass \\\n", "0 1 0 3 \n", "1 2 1 1 \n", "2 3 1 3 \n", "3 4 1 1 \n", "4 5 0 3 \n", "\n", " Name Sex Age SibSp \\\n", "0 Braund, Mr. Owen Harris male 22.0 1 \n", "1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 \n", "2 Heikkinen, Miss. Laina female 26.0 0 \n", "3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 \n", "4 Allen, Mr. William Henry male 35.0 0 \n", "\n", " Parch Ticket Fare Cabin Embarked \n", "0 0 A/5 21171 7.2500 NaN S \n", "1 0 PC 17599 71.2833 C85 C \n", "2 0 STON/O2. 3101282 7.9250 NaN S \n", "3 0 113803 53.1000 C123 S \n", "4 0 373450 8.0500 NaN S " ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from catboost.datasets import titanic\n", "import numpy as np\n", "\n", "train_df, test_df = titanic()\n", "\n", "train_df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1.3 Feature Preparation\n", "First, let's check how many missing values we have:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Age 177\n", "Cabin 687\n", "Embarked 2\n", "dtype: int64" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "null_value_stats = train_df.isnull().sum(axis=0)\n", "null_value_stats[null_value_stats != 0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As we can see, **`Age`**, **`Cabin`** and **`Embarked`** do have missing values, so let's fill them with a number far outside their distributions, so that the model can easily distinguish them and take them into account:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "train_df.fillna(-999, inplace=True)\n", "test_df.fillna(-999, inplace=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's separate the features from the label variable:" ] }, { "cell_type": "code",
"execution_count": 4, "metadata": {}, "outputs": [], "source": [ "X = train_df.drop('Survived', axis=1)\n", "y = train_df.Survived" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that our features are of different types: some of them are numeric, some are categorical, and some are just strings, which would normally need special handling (for example, a bag-of-words encoding). But in our case we can treat these string features as categorical ones: all the heavy lifting is done inside CatBoost. How cool is that? :)" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "PassengerId int64\n", "Pclass int64\n", "Name object\n", "Sex object\n", "Age float64\n", "SibSp int64\n", "Parch int64\n", "Ticket object\n", "Fare float64\n", "Cabin object\n", "Embarked object\n", "dtype: object\n" ] } ], "source": [ "print(X.dtypes)\n", "\n", "# Every non-float column is treated as categorical.\n", "# The `np.float` alias was removed from NumPy; the builtin `float` compares the same way.\n", "categorical_features_indices = np.where(X.dtypes != float)[0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1.4 Data Splitting\n", "Let's split the train data into training and validation sets." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/Users/sbrazhnik/anaconda3/lib/python3.7/site-packages/sklearn/model_selection/_split.py:2179: FutureWarning: From version 0.21, test_size will always complement train_size unless both are specified.\n", " FutureWarning)\n" ] } ], "source": [ "from sklearn.model_selection import train_test_split\n", "\n", "X_train, X_validation, y_train, y_validation = train_test_split(X, y, train_size=0.75, random_state=42)\n", "\n", "X_test = test_df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## $$2.\\ CatBoost\\ Basics$$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's make the necessary imports."
] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "from catboost import CatBoostClassifier, Pool, cv\n", "from sklearn.metrics import accuracy_score" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.1 Model Training\n", "Now let's create the model itself. We will use the default parameters here (as they provide a _really_ good baseline almost all the time). The only thing we specify is the `custom_loss` parameter, which lets us track the competition metric, accuracy, alongside logloss, which is smoother on a dataset of this size." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "model = CatBoostClassifier(\n", "    custom_loss=['Accuracy'],\n", "    random_seed=42,\n", "    logging_level='Silent'\n", ")" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "1aa3567ef5cd4ef6b368a394d922c989", "version_major": 2, "version_minor": 0 }, "text/plain": [ "MetricVisualizer(layout=Layout(align_self='stretch', height='500px'))" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "model.fit(\n", "    X_train, y_train,\n", "    cat_features=categorical_features_indices,\n", "    eval_set=(X_validation, y_validation),\n", "#     logging_level='Verbose',  # you can uncomment this for text output\n", "    plot=True\n", ");" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As you can see, it is possible to watch our model learn through verbose output or with nice plots (personally I would definitely go with the second option - just check out those plots: you can, for example, zoom in on areas of interest!)\n", "\n", "With this we can see that the best accuracy 
value of **0.8340** (on the validation set) was achieved at boosting step **157**." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.2 Model Cross-Validation\n", "\n", "It is good to validate your model, but cross-validating it is even better. And also with plots! So without further ado:" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "a848bdbc33784cd092f7d035b7f62f60", "version_major": 2, "version_minor": 0 }, "text/plain": [ "MetricVisualizer(layout=Layout(align_self='stretch', height='500px'))" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "cv_params = model.get_params()\n", "cv_params.update({\n", "    'loss_function': 'Logloss'\n", "})\n", "cv_data = cv(\n", "    Pool(X, y, cat_features=categorical_features_indices),\n", "    cv_params,\n", "    plot=True\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we have the values of our loss functions at each boosting step, averaged over 3 folds, which should give us a more accurate estimate of our model's performance:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Best validation accuracy score: 0.82±0.02 on step 586\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "/Users/sbrazhnik/anaconda3/lib/python3.7/site-packages/numpy/core/fromnumeric.py:51: FutureWarning: 'argmax' is deprecated, use 'idxmax' instead. 
The behavior of 'argmax'\n", "will be corrected to return the positional maximum in the future.\n", "Use 'series.values.argmax' to get the position of the maximum now.\n", " return getattr(obj, method)(*args, **kwds)\n" ] } ], "source": [ "print('Best validation accuracy score: {:.2f}±{:.2f} on step {}'.format(\n", "    np.max(cv_data['test-Accuracy-mean']),\n", "    cv_data['test-Accuracy-std'][np.argmax(cv_data['test-Accuracy-mean'])],\n", "    np.argmax(cv_data['test-Accuracy-mean'])\n", "))" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Precise validation accuracy score: 0.819304152637486\n" ] } ], "source": [ "print('Precise validation accuracy score: {}'.format(np.max(cv_data['test-Accuracy-mean'])))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As we can see, our initial performance estimate on a single validation fold was too optimistic (0.8340 versus 0.8193 under cross-validation), which is why cross-validation is so important!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.3 Model Applying\n", "All you have to do to get predictions is:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[0. 0. 0. 0. 1. 0. 1. 0. 1. 0.]\n", "[[0.89483338 0.10516662]\n", " [0.83027766 0.16972234]\n", " [0.89202782 0.10797218]\n", " [0.91357855 0.08642145]\n", " [0.23361409 0.76638591]\n", " [0.92390626 0.07609374]\n", " [0.33580898 0.66419102]\n", " [0.74204312 0.25795688]\n", " [0.37852235 0.62147765]\n", " [0.95868962 0.04131038]]\n" ] } ], "source": [ "predictions = model.predict(X_test)\n", "predictions_probs = model.predict_proba(X_test)\n", "print(predictions[:10])\n", "print(predictions_probs[:10])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "But let's try to get better predictions; CatBoost's features will help us with that."
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## $$3.\\ CatBoost\\ Features$$\n", "You may have noticed that at the model creation step I specified not only `custom_loss` but also the `random_seed` parameter. That was done to make this notebook reproducible. When no seed is given, CatBoost assigns one itself (in the run below, 0):" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Random seed assigned for this model: 0\n" ] } ], "source": [ "model_without_seed = CatBoostClassifier(iterations=10, logging_level='Silent')\n", "model_without_seed.fit(X, y, cat_features=categorical_features_indices)\n", "\n", "print('Random seed assigned for this model: {}'.format(model_without_seed.random_seed_))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's define some params and create `Pool` objects for convenience. A `Pool` stores all the information about a dataset (features, labels, categorical feature indices, weights and much more)." ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "params = {\n", "    'iterations': 500,\n", "    'learning_rate': 0.1,\n", "    'eval_metric': 'Accuracy',\n", "    'random_seed': 42,\n", "    'logging_level': 'Silent',\n", "    'use_best_model': False\n", "}\n", "train_pool = Pool(X_train, y_train, cat_features=categorical_features_indices)\n", "validate_pool = Pool(X_validation, y_validation, cat_features=categorical_features_indices)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3.1 Using the best model\n", "If you have a validation set, it is usually better to use the `use_best_model` parameter during training. This parameter is enabled by default when a validation set is provided. When it is enabled, the resulting tree ensemble is shrunk to the best iteration."
] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Simple model validation accuracy: 0.8251\n", "\n", "Best model validation accuracy: 0.8386\n" ] } ], "source": [ "model = CatBoostClassifier(**params)\n", "model.fit(train_pool, eval_set=validate_pool)\n", "\n", "best_model_params = params.copy()\n", "best_model_params.update({\n", "    'use_best_model': True\n", "})\n", "best_model = CatBoostClassifier(**best_model_params)\n", "best_model.fit(train_pool, eval_set=validate_pool);\n", "\n", "print('Simple model validation accuracy: {:.4}'.format(\n", "    accuracy_score(y_validation, model.predict(X_validation))\n", "))\n", "print('')\n", "\n", "print('Best model validation accuracy: {:.4}'.format(\n", "    accuracy_score(y_validation, best_model.predict(X_validation))\n", "))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3.2 Early Stopping\n", "If you have a validation set, it is usually easier and better to use early stopping. This feature is similar to the previous one, but besides potentially improving quality it also saves training time."
] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 28.3 s, sys: 9.41 s, total: 37.7 s\n", "Wall time: 15.4 s\n" ] } ], "source": [ "%%time\n", "model = CatBoostClassifier(**params)\n", "model.fit(train_pool, eval_set=validate_pool)" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 2.23 s, sys: 630 ms, total: 2.86 s\n", "Wall time: 1.03 s\n" ] } ], "source": [ "%%time\n", "earlystop_params = params.copy()\n", "earlystop_params.update({\n", "    'od_type': 'Iter',\n", "    'od_wait': 40\n", "})\n", "earlystop_model = CatBoostClassifier(**earlystop_params)\n", "earlystop_model.fit(train_pool, eval_set=validate_pool);" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Simple model tree count: 500\n", "Simple model validation accuracy: 0.8251\n", "\n", "Early-stopped model tree count: 53\n", "Early-stopped model validation accuracy: 0.8072\n" ] } ], "source": [ "print('Simple model tree count: {}'.format(model.tree_count_))\n", "print('Simple model validation accuracy: {:.4}'.format(\n", "    accuracy_score(y_validation, model.predict(X_validation))\n", "))\n", "print('')\n", "\n", "print('Early-stopped model tree count: {}'.format(earlystop_model.tree_count_))\n", "print('Early-stopped model validation accuracy: {:.4}'.format(\n", "    accuracy_score(y_validation, earlystop_model.predict(X_validation))\n", "))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So training stops after 53 trees instead of 500 and takes about 1 s of wall time instead of about 15 s. In this run, however, the early-stopped model's validation accuracy is slightly lower (0.8072 versus 0.8251).\n", "\n", "Although, as shown earlier, a simple validation scheme does not precisely describe the model's out-of-sample score (it may be biased by the dataset split), it is still useful to track how the model improves, and this example shows that it can pay to stop the 
boosting process earlier (before the overfitting kicks in)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3.3 Using Baseline\n", "It is possible to use pre-training results (a baseline) for training." ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "current_params = params.copy()\n", "current_params.update({\n", "    'iterations': 10\n", "})\n", "model = CatBoostClassifier(**current_params).fit(X_train, y_train, categorical_features_indices)\n", "# Get baseline (only with prediction_type='RawFormulaVal')\n", "baseline = model.predict(X_train, prediction_type='RawFormulaVal')\n", "# Fit new model\n", "model.fit(X_train, y_train, categorical_features_indices, baseline=baseline);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3.4 Snapshot Support\n", "CatBoost supports snapshots. You can use them to resume training after an interruption or to start training from previous results." ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0:\tlearn: 0.7919162\ttest: 0.7847534\tbest: 0.7847534 (0)\ttotal: 50.3ms\tremaining: 201ms\n", "1:\tlearn: 0.8278443\ttest: 0.8206278\tbest: 0.8206278 (1)\ttotal: 109ms\tremaining: 164ms\n", "2:\tlearn: 0.8293413\ttest: 0.8206278\tbest: 0.8206278 (1)\ttotal: 155ms\tremaining: 103ms\n", "3:\tlearn: 0.8338323\ttest: 0.8206278\tbest: 0.8206278 (1)\ttotal: 186ms\tremaining: 46.6ms\n", "4:\tlearn: 0.8368263\ttest: 0.8161435\tbest: 0.8206278 (1)\ttotal: 200ms\tremaining: 0us\n", "\n", "bestTest = 0.8206278027\n", "bestIteration = 1\n", "\n", "5:\tlearn: 0.8383234\ttest: 0.8161435\tbest: 0.8206278 (1)\ttotal: 210ms\tremaining: 41.6ms\n", "6:\tlearn: 0.8398204\ttest: 0.8161435\tbest: 0.8206278 (1)\ttotal: 228ms\tremaining: 42.9ms\n", "7:\tlearn: 0.8398204\ttest: 0.8161435\tbest: 0.8206278 (1)\ttotal: 256ms\tremaining: 37.7ms\n", "8:\tlearn: 0.8458084\ttest: 0.8161435\tbest: 
0.8206278 (1)\ttotal: 322ms\tremaining: 30.7ms\n", "9:\tlearn: 0.8443114\ttest: 0.8161435\tbest: 0.8206278 (1)\ttotal: 349ms\tremaining: 0us\n", "\n", "bestTest = 0.8206278027\n", "bestIteration = 1\n", "\n" ] } ], "source": [ "params_with_snapshot = params.copy()\n", "params_with_snapshot.update({\n", "    'iterations': 5,\n", "    'learning_rate': 0.5,\n", "    'logging_level': 'Verbose'\n", "})\n", "model = CatBoostClassifier(**params_with_snapshot).fit(train_pool, eval_set=validate_pool, save_snapshot=True)\n", "params_with_snapshot.update({\n", "    'iterations': 10,\n", "    'learning_rate': 0.1,\n", "})\n", "model = CatBoostClassifier(**params_with_snapshot).fit(train_pool, eval_set=validate_pool, save_snapshot=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3.5 User Defined Objective Function\n", "It is possible to create your own objective function. Let's create a logloss objective function." ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [], "source": [ "class LoglossObjective(object):\n", "    def calc_ders_range(self, approxes, targets, weights):\n", "        # approxes, targets, weights are indexed containers of floats\n", "        # (containers which have only __len__ and __getitem__ defined).\n", "        # weights parameter can be None.\n", "        #\n", "        # To understand what these parameters mean, assume that there is\n", "        # a subset of your dataset that is currently being processed.\n", "        # approxes contains current predictions for this subset,\n", "        # targets contains target values you provided with the dataset.\n", "        #\n", "        # This function should return a list of pairs (der1, der2), where\n", "        # der1 is the first derivative of the loss function with respect\n", "        # to the predicted value, and der2 is the second derivative.\n", "        #\n", "        # In our case, logloss is defined by the following formula:\n", "        # target * log(sigmoid(approx)) + (1 - target) * log(1 - sigmoid(approx))\n", "        # where sigmoid(x) = 1 / (1 + e^(-x)).\n", "\n",
"        assert len(approxes) == len(targets)\n", "        if weights is not None:\n", "            assert len(weights) == len(approxes)\n", "\n", "        result = []\n", "        for index in range(len(targets)):\n", "            e = np.exp(approxes[index])\n", "            p = e / (1 + e)  # sigmoid(approx)\n", "            der1 = (1 - p) if targets[index] > 0.0 else -p\n", "            der2 = -p * (1 - p)\n", "\n", "            if weights is not None:\n", "                der1 *= weights[index]\n", "                der2 *= weights[index]\n", "\n", "            result.append((der1, der2))\n", "        return result" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0:\tlearn: 0.6824625\ttotal: 62.6ms\tremaining: 563ms\n", "1:\tlearn: 0.6715080\ttotal: 113ms\tremaining: 453ms\n", "2:\tlearn: 0.6618445\ttotal: 170ms\tremaining: 397ms\n", "3:\tlearn: 0.6519014\ttotal: 220ms\tremaining: 330ms\n", "4:\tlearn: 0.6430996\ttotal: 272ms\tremaining: 272ms\n", "5:\tlearn: 0.6357833\ttotal: 326ms\tremaining: 217ms\n", "6:\tlearn: 0.6276748\ttotal: 373ms\tremaining: 160ms\n", "7:\tlearn: 0.6197009\ttotal: 421ms\tremaining: 105ms\n", "8:\tlearn: 0.6119674\ttotal: 464ms\tremaining: 51.6ms\n", "9:\tlearn: 0.6045267\ttotal: 514ms\tremaining: 0us\n" ] } ], "source": [ "model = CatBoostClassifier(\n", "    iterations=10,\n", "    random_seed=42,\n", "    loss_function=LoglossObjective(),\n", "    eval_metric=\"Logloss\"\n", ")\n", "# Fit model\n", "model.fit(train_pool)\n", "# Only prediction_type='RawFormulaVal' is allowed with custom `loss_function`\n", "preds_raw = model.predict(X_test, prediction_type='RawFormulaVal')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3.6 User Defined Metric Function\n", "It is also possible to create your own metric function. Let's create a logloss metric function."
] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [], "source": [ "class LoglossMetric(object):\n", " def get_final_error(self, error, weight):\n", " return error / (weight + 1e-38)\n", "\n", " def is_max_optimal(self):\n", " return False\n", "\n", " def evaluate(self, approxes, target, weight):\n", " # approxes is a list of indexed containers\n", " # (containers with only __len__ and __getitem__ defined),\n", " # one container per approx dimension.\n", " # Each container contains floats.\n", " # weight is a one dimensional indexed container.\n", " # target is a one dimensional indexed container of floats.\n", " \n", " # weight parameter can be None.\n", " # Returns pair (error, weights sum)\n", " \n", " assert len(approxes) == 1\n", " assert len(target) == len(approxes[0])\n", "\n", " approx = approxes[0]\n", "\n", " error_sum = 0.0\n", " weight_sum = 0.0\n", "\n", " for i in range(len(approx)):\n", " w = 1.0 if weight is None else weight[i]\n", " weight_sum += w\n", " error_sum += -w * (target[i] * approx[i] - np.log(1 + np.exp(approx[i])))\n", "\n", " return error_sum, weight_sum" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0:\tlearn: 0.5142670\ttotal: 57.7ms\tremaining: 519ms\n", "1:\tlearn: 0.4622572\ttotal: 103ms\tremaining: 410ms\n", "2:\tlearn: 0.4498954\ttotal: 137ms\tremaining: 319ms\n", "3:\tlearn: 0.4440634\ttotal: 179ms\tremaining: 269ms\n", "4:\tlearn: 0.4437537\ttotal: 194ms\tremaining: 194ms\n", "5:\tlearn: 0.4413266\ttotal: 210ms\tremaining: 140ms\n", "6:\tlearn: 0.4303620\ttotal: 233ms\tremaining: 99.7ms\n", "7:\tlearn: 0.4251345\ttotal: 246ms\tremaining: 61.4ms\n", "8:\tlearn: 0.4117195\ttotal: 301ms\tremaining: 33.4ms\n", "9:\tlearn: 0.4117159\ttotal: 342ms\tremaining: 0us\n" ] } ], "source": [ "model = CatBoostClassifier(\n", " iterations=10,\n", " random_seed=42, \n", " loss_function=\"Logloss\",\n", " eval_metric=LoglossMetric()\n", ")\n", "# Fit model\n", 
"model.fit(train_pool)\n", "# A built-in `loss_function` is used here, so prediction_type is not restricted to 'RawFormulaVal'\n", "preds_raw = model.predict(X_test, prediction_type='RawFormulaVal')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3.7 Staged Predict\n", "A CatBoost model has a `staged_predict` method. It allows you to iteratively get predictions for a given range of trees." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "model = CatBoostClassifier(iterations=10, random_seed=42, logging_level='Silent').fit(train_pool)\n", "ntree_start, ntree_end, eval_period = 3, 9, 2\n", "predictions_iterator = model.staged_predict(validate_pool, 'Probability', ntree_start, ntree_end, eval_period)\n", "for preds, tree_count in zip(predictions_iterator, range(ntree_start, ntree_end, eval_period)):\n", " print('First class probabilities using the first {} trees: {}'.format(tree_count, preds[:5, 1]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3.8 Feature Importances\n", "Sometimes it is very important to understand which feature made the greatest contribution to the final result. To do this, the CatBoost model has a `get_feature_importance` method."
] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Sex: 31.702614119956866\n", "Pclass: 20.207001310838848\n", "Ticket: 11.973540755448495\n", "Fare: 8.985295393916426\n", "Cabin: 7.145337338432442\n", "SibSp: 6.46811647401467\n", "Parch: 4.987796626323452\n", "Age: 4.718576113910789\n", "Embarked: 3.8117218671580106\n", "PassengerId: 0.0\n", "Name: 0.0\n" ] } ], "source": [ "model = CatBoostClassifier(iterations=50, random_seed=42, logging_level='Silent').fit(train_pool)\n", "feature_importances = model.get_feature_importance(train_pool)\n", "feature_names = X_train.columns\n", "for score, name in sorted(zip(feature_importances, feature_names), reverse=True):\n", " print('{}: {}'.format(name, score))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This shows that features **`Sex`** and **`Pclass`** had the biggest influence on the result." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3.9 Eval Metrics\n", "A CatBoost model has an `eval_metrics` method that calculates the given metrics on a given dataset 
and can also plot them." ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "9187b46258f04f49b9a580e75b5125e9", "version_major": 2, "version_minor": 0 }, "text/plain": [ "MetricVisualizer(layout=Layout(align_self='stretch', height='500px'))" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "model = CatBoostClassifier(iterations=50, random_seed=42, logging_level='Silent').fit(train_pool)\n", "eval_metrics = model.eval_metrics(validate_pool, ['AUC'], plot=True)" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[0.816032198557773, 0.816032198557773, 0.816032198557773, 0.8498239141371793, 0.8584605064564816, 0.8708703672647995]\n" ] } ], "source": [ "print(eval_metrics['AUC'][:6])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3.10 Learning Processes Comparison\n", "You can also compare the learning processes of different models on a single plot."
] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [], "source": [ "model1 = CatBoostClassifier(iterations=10, depth=1, train_dir='model_depth_1/', logging_level='Silent')\n", "model1.fit(train_pool, eval_set=validate_pool)\n", "model2 = CatBoostClassifier(iterations=10, depth=5, train_dir='model_depth_5/', logging_level='Silent')\n", "model2.fit(train_pool, eval_set=validate_pool);" ] }, { "cell_type": "code", "execution_count": 33, "metadata": { "scrolled": false }, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "231151e385da4254b04f981b35907557", "version_major": 2, "version_minor": 0 }, "text/plain": [ "MetricVisualizer(layout=Layout(align_self='stretch', height='500px'))" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "from catboost import MetricVisualizer\n", "widget = MetricVisualizer(['model_depth_1', 'model_depth_5'])\n", "widget.start()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3.11 Model Saving\n", "It is always really handy to be able to dump your model to disk (especially if training took some time)." 
] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [], "source": [ "model = CatBoostClassifier(iterations=10, random_seed=42, logging_level='Silent').fit(train_pool)\n", "model.save_model('catboost_model.dump')\n", "model = CatBoostClassifier()\n", "model.load_model('catboost_model.dump');" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## $$4.\\ Parameters\\ Tuning$$\n", "While you can always select the optimal number of iterations (boosting steps) using cross-validation and learning curve plots, it is also important to tune some of the model parameters. Here we pay special attention to `l2_leaf_reg` and `learning_rate`.\n", "\n", "In this section, we'll select these parameters using the **`hyperopt`** package." ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Collecting hyperopt\n", "\u001b[?25l Downloading https://files.pythonhosted.org/packages/ce/9f/f6324af3fc43f352e568b5850695c30ed7dd14af06a94f97953ff9187569/hyperopt-0.1.1-py3-none-any.whl (117kB)\n", "\u001b[K 100% |████████████████████████████████| 122kB 593kB/s ta 0:00:01\n", "\u001b[?25hRequirement already satisfied: future in /Users/sbrazhnik/anaconda3/lib/python3.7/site-packages (from hyperopt) (0.17.1)\n", "Requirement already satisfied: networkx in /Users/sbrazhnik/anaconda3/lib/python3.7/site-packages (from hyperopt) (2.2)\n", "Requirement already satisfied: numpy in /Users/sbrazhnik/anaconda3/lib/python3.7/site-packages (from hyperopt) (1.15.4)\n", "Requirement already satisfied: scipy in /Users/sbrazhnik/anaconda3/lib/python3.7/site-packages (from hyperopt) (1.1.0)\n", "Collecting pymongo (from hyperopt)\n", "\u001b[?25l Downloading https://files.pythonhosted.org/packages/d7/ac/d2e324c1f9bcf653fa106785371a16b4709506a35b04948655de8b961a85/pymongo-3.7.2-cp37-cp37m-macosx_10_9_x86_64.whl (307kB)\n", "\u001b[K 100% |████████████████████████████████| 317kB 1.6MB/s 
ta 0:00:01\n", "\u001b[?25hRequirement already satisfied: six in /Users/sbrazhnik/anaconda3/lib/python3.7/site-packages (from hyperopt) (1.12.0)\n", "Requirement already satisfied: decorator>=4.3.0 in /Users/sbrazhnik/anaconda3/lib/python3.7/site-packages (from networkx->hyperopt) (4.3.0)\n", "Installing collected packages: pymongo, hyperopt\n", "Successfully installed hyperopt-0.1.1 pymongo-3.7.2\n" ] } ], "source": [ "!pip install hyperopt" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [], "source": [ "import hyperopt\n", "\n", "def hyperopt_objective(params):\n", " model = CatBoostClassifier(\n", " l2_leaf_reg=int(params['l2_leaf_reg']),\n", " learning_rate=params['learning_rate'],\n", " iterations=500,\n", " eval_metric='Accuracy',\n", " random_seed=42,\n", " verbose=False,\n", " loss_function='Logloss',\n", " )\n", " \n", " cv_data = cv(\n", " Pool(X, y, cat_features=categorical_features_indices),\n", " model.get_params()\n", " )\n", " best_accuracy = np.max(cv_data['test-Accuracy-mean'])\n", " \n", " return 1 - best_accuracy # as hyperopt minimises" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{'l2_leaf_reg': 5.0, 'learning_rate': 0.1147638000846512}\n" ] } ], "source": [ "from numpy.random import RandomState\n", "\n", "params_space = {\n", " 'l2_leaf_reg': hyperopt.hp.qloguniform('l2_leaf_reg', 0, 2, 1),\n", " 'learning_rate': hyperopt.hp.uniform('learning_rate', 1e-3, 5e-1),\n", "}\n", "\n", "trials = hyperopt.Trials()\n", "\n", "best = hyperopt.fmin(\n", " hyperopt_objective,\n", " space=params_space,\n", " algo=hyperopt.tpe.suggest,\n", " max_evals=50,\n", " trials=trials,\n", " rstate=RandomState(123)\n", ")\n", "\n", "print(best)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's get the full cv data with the best parameters:" ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [], "source": [ 
"model = CatBoostClassifier(\n", " l2_leaf_reg=int(best['l2_leaf_reg']),\n", " learning_rate=best['learning_rate'],\n", " iterations=500,\n", " eval_metric='Accuracy',\n", " random_seed=42,\n", " verbose=False,\n", " loss_function='Logloss',\n", ")\n", "cv_data = cv(Pool(X, y, cat_features=categorical_features_indices), model.get_params())" ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Precise validation accuracy score: 0.8338945005611672\n" ] } ], "source": [ "print('Precise validation accuracy score: {}'.format(np.max(cv_data['test-Accuracy-mean'])))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Recall that with default parameters our cv score was 0.8283, so we have gained some (probably not statistically significant) improvement." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Make submission\n", "Now we will re-train our tuned model on all the training data we have." ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model.fit(X, y, cat_features=categorical_features_indices)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And finally let's prepare the submission file:" ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "submission = pd.DataFrame()\n", "submission['PassengerId'] = X_test['PassengerId']\n", "submission['Survived'] = model.predict(X_test)" ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [], "source": [ "submission.to_csv('submission.csv', index=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, you can make a submission to the [Titanic Kaggle competition](https://www.kaggle.com/c/titanic).\n", "\n", "That's it! 
Now you can play around with CatBoost and win some competitions! :)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.4" }, "widgets": { "state": { "c26d03b66add4e078d26695cab837033": { "views": [ { "cell_index": 21 } ] } }, "version": "1.2.0" } }, "nbformat": 4, "nbformat_minor": 2 }