{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Tweedie Regression" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In insurance premium prediction problems, the total claim amount for a covered risk usually has a continuous distribution on positive values, except that it is exactly zero when no claim occurs. A standard approach in actuarial science for modeling such data is to use compound Poisson models.\n", "\n", "##### Compound Poisson distribution\n", "\n", "Let $ N $ be a random variable with a Poisson distribution and let $ Z_1, Z_2, ... $ be independent, identically distributed random variables with a Gamma distribution, independent of $ N $. Define a random variable $ Z $ by\n", "\n", "$$ Z = \\begin{cases}0, & \\mbox{if}\\ N = 0\\\\Z_1 + Z_2 + ... + Z_N, & \\mbox{if}\\ N > 0\\end{cases} $$\n", "\n", "The resulting distribution of $ Z $ is called a compound Poisson distribution. In the case of insurance premium prediction, $ N $ refers to the number of claims and $ Z_i $ refers to the amount of the $i$-th claim.
The compound Poisson distribution is a special case of the Tweedie model.\n", "\n", "The log-likelihood of the compound Poisson distribution can be written as\n", "$$ \\log p(z) = \\frac{1}{\\phi}\\left(z \\frac{\\mu^{1-\\rho}}{1-\\rho} - \\frac{\\mu^{2-\\rho}}{2-\\rho}\\right) + a$$\n", "\n", "where $ \\mu $ is the mean, $ \\phi $ is the dispersion, $ 1 < \\rho < 2 $ is the variance power, and $ a $ is a term that does not depend on $ \\mu $.\n", "\n", "We will apply the Tweedie model to an auto insurance claim dataset analyzed in Yip and Yau (2005) and Zhou, Yang and Qian (2019).\n", "\n", "##### Loading dataset" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!wget https://cran.r-project.org/src/contrib/cplm_0.7-8.tar.gz" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!tar -xf cplm_0.7-8.tar.gz" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!pip install rdata" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import warnings\n", "warnings.filterwarnings('ignore')\n", "\n", "import rdata\n", "\n", "# Parse the AutoClaim dataset shipped with the cplm R package and convert it to a pandas DataFrame\n", "data = rdata.parser.parse_file('cplm/data/AutoClaim.RData')\n", "df = rdata.conversion.convert(data)['AutoClaim']" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "\n", "from sklearn.model_selection import train_test_split\n", "from catboost.utils import eval_metric\n", "from catboost import CatBoostRegressor, Pool" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr><th></th><th>AGE</th><th>BLUEBOOK</th><th>HOMEKIDS</th><th>KIDSDRIV</th><th>MVR_PTS</th><th>NPOLICY</th><th>RETAINED</th><th>TRAVTIME</th><th>AREA</th><th>CAR_USE</th><th>CAR_TYPE</th><th>GENDER</th><th>JOBCLASS</th><th>MAX_EDUC</th><th>MARRIED</th><th>REVOLKED</th><th>CLM_AMT5</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>1019</th><td>45</td><td>14830</td><td>2</td><td>0</td><td>0</td><td>3</td><td>6</td><td>31</td><td>Urban</td><td>Private</td><td>Sedan</td><td>M</td><td>Professional</td><td>Masters</td><td>Yes</td><td>No</td><td>0</td></tr>\n",
"    <tr><th>5461</th><td>42</td><td>13770</td><td>3</td><td>1</td><td>0</td><td>1</td><td>14</td><td>24</td><td>Urban</td><td>Private</td><td>Sports Car</td><td>F</td><td>Professional</td><td>Bachelors</td><td>Yes</td><td>No</td><td>0</td></tr>\n",
"    <tr><th>7226</th><td>55</td><td>21520</td><td>0</td><td>0</td><td>4</td><td>1</td><td>1</td><td>25</td><td>Urban</td><td>Private</td><td>Van</td><td>M</td><td>Blue Collar</td><td>&lt;High School</td><td>No</td><td>No</td><td>6656</td></tr>\n",
"    <tr><th>6233</th><td>33</td><td>25380</td><td>0</td><td>0</td><td>0</td><td>2</td><td>6</td><td>27</td><td>Urban</td><td>Commercial</td><td>Panel Truck</td><td>M</td><td>Blue Collar</td><td>High School</td><td>No</td><td>No</td><td>0</td></tr>\n",
"    <tr><th>8215</th><td>45</td><td>22680</td><td>0</td><td>0</td><td>5</td><td>1</td><td>6</td><td>24</td><td>Urban</td><td>Private</td><td>Sedan</td><td>M</td><td>Professional</td><td>Masters</td><td>No</td><td>No</td><td>6314</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " AGE BLUEBOOK HOMEKIDS KIDSDRIV MVR_PTS NPOLICY RETAINED TRAVTIME \\\n", "1019 45 14830 2 0 0 3 6 31 \n", "5461 42 13770 3 1 0 1 14 24 \n", "7226 55 21520 0 0 4 1 1 25 \n", "6233 33 25380 0 0 0 2 6 27 \n", "8215 45 22680 0 0 5 1 6 24 \n", "\n", " AREA CAR_USE CAR_TYPE GENDER JOBCLASS MAX_EDUC \\\n", "1019 Urban Private Sedan M Professional Masters \n", "5461 Urban Private Sports Car F Professional Bachelors \n", "7226 Urban Private Van M Blue Collar " ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "_ = df[target].hist(bins=20)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We see that the distribution has a point mass at 0 and is right skewed. So the use of Tweedie model is well justified.\n", "\n", "#### Tweedie loss\n", "\n", "For computational stability instead of optimizing $ \\mu $ parameter of Tweedie distribution directly, we will optimize $ \\log{\\mu} $. So the Tweedie loss is given by the following formula:\n", "$$L = \\sum_{i=1}^n w_i \\left(-\\frac{y_i \\exp{(F(x_i)(1-\\rho))}}{1 - \\rho} + \\frac{\\exp{(F(x_i)(2-\\rho))}}{2 - \\rho}\\right) $$\n", "where $ w_i $ are object weights, $y_i$ is target, $ F(x_i) $ is current object prediction, $\\rho $ is the obligatory hyperparameter variance power. Variance power must belong to the interval $ (1, 2) $. \n", "\n", "#### Fitting the model\n", "\n", "We will train two CatBoostRegressor models: one trained with Tweedie loss, the other one with RMSE loss. The features are remained unchanged, the categorical ones are specified in Pool's cat_features parameter." 
] }, { "cell_type": "code", "execution_count": 5, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "train_pool = Pool(df_train[features], label=df_train[target],\n", "                  cat_features=cat_features)\n", "test_pool = Pool(df_test[features], label=df_test[target],\n", "                 cat_features=cat_features)\n", "\n", "# Same number of trees for both models; only the loss function differs\n", "cb_tweedie = CatBoostRegressor(loss_function='Tweedie:variance_power=1.9', n_estimators=500, silent=True)\n", "cb_tweedie.fit(train_pool, eval_set=test_pool)\n", "\n", "cb_rmse = CatBoostRegressor(loss_function='RMSE', n_estimators=500, silent=True)\n", "cb_rmse.fit(train_pool, eval_set=test_pool)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Evaluating the models\n", "\n", "We will use MSLE as the evaluation metric, since it suits quantities that can span several orders of magnitude." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "# RMSE predictions can be negative, so clip them at 0 before computing MSLE\n", "cb_rmse_pred = np.clip(cb_rmse.predict(test_pool), 0, None)\n", "cb_tweedie_pred = cb_tweedie.predict(test_pool)" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "MSLE score:\n", "\ttweedie loss\t [31.676911817518143]\n", "\trmse loss\t [35.72356239317701]\n" ] } ], "source": [ "print('MSLE score:')\n", "print('\\ttweedie loss\\t', eval_metric(df_test[target].to_numpy(), cb_tweedie_pred, 'MSLE'))\n", "print('\\trmse loss\\t', eval_metric(df_test[target].to_numpy(), cb_rmse_pred, 'MSLE'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The model trained with Tweedie loss achieves a lower MSLE on the test set (about 31.68 versus 35.72), so it outperforms the model trained with RMSE loss."
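, "\n", "\n", "For reference, assuming unit object weights, the MSLE reported above is\n", "$$ \\mathrm{MSLE} = \\frac{1}{n}\\sum_{i=1}^n \\left(\\log{(1 + y_i)} - \\log{(1 + \\hat{y}_i)}\\right)^2 $$\n", "\n", "where $ \\hat{y}_i $ (notation introduced here for illustration) is the prediction passed to `eval_metric`. The logarithm requires $ \\hat{y}_i > -1 $, which is presumably why the RMSE predictions are clipped at zero before scoring. Lower values are better."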
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### References\n", "- Zhou, H., Yang, Y. and Qian, W. (2019), \"Tweedie Gradient Boosting for Extremely Unbalanced Zero-inflated Data\", *arXiv preprint*, [arXiv:1811.10192](https://arxiv.org/abs/1811.10192).\n", "- Yip, K. C. and Yau, K. K. (2005), \"On modeling claim frequency data in general insurance with extra zeros\", *Insurance: Mathematics and Economics*, 36, 153–163." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.3" } }, "nbformat": 4, "nbformat_minor": 4 }