{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Introducing text information signals into CatBoost" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/catboost/tutorials/blob/master/text_features/text_features_in_catboost.ipynb)\n", "\n", "## Task description\n", "\n", "There is a large number of tasks associated with the analysis of textual information. An example is such well-known problems as: spam or not spam email classification, analyzing the tonality of the text, dialog systems, etc. \n", "\n", "Sometimes the original task can be reduced to the simple ML **classification task** on a given set of observations $D:$
$D = \\{(x_i, c_i)\\}$, where $x_i = (x_{i,0}, x_{i,1}, ..., x_{i,m})$ is the feature vector and $c_i$ is the class of the $i$-th object.
\n", "In such task the features may contain not only $x_{i,j}$ as numeric or categorical values, but also the source text (e.g. tweet or question).\n", "\n", "Here and further by **Text** we imply a sequence of symbols $(a_0, a_1, ..., a_k), a_i \\in A$,
where $A$ is called the alphabet. For example, $A$ can be a set of English letters or Unicode symbols, or $A$ can have a more complicated structure whose elements are themselves sequences of symbols, also called tokens; e.g. $A$ can be a set of English words or emoji." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Example: Rotten tomatoes dataset\n", "\n", "Let's download a real dataset for our CatBoost examples.\n", "It is based on the Rotten Tomatoes movie reviews data and was taken from Kaggle: https://www.kaggle.com/rpnuser8182/rotten-tomatoes
\n", "\n", "Dataset contains features about movie: description, MPAA rating, producer, company, etc, and features about user review: user name, review, rating,... \n", "Here we set the task to predict the rating of the film according to the user's review" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "from catboost import Pool, CatBoostClassifier\n", "from catboost.datasets import rotten_tomatoes" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Feature names:\n", "id, synopsis, rating_MPAA, genre, director, writer, theater_date, dvd_date, box_office, runtime, studio, dvd_date_int, theater_date_int, review, rating, fresh, critic, top_critic, publisher, date, date_int, rating_10\n" ] } ], "source": [ "learn, _ = rotten_tomatoes()\n", "print('Feature names:\\n' + ', '.join(list(learn)))" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
idsynopsisrating_MPAAgenredirectorwritertheater_datedvd_datebox_officeruntime...theater_date_intreviewratingfreshcritictop_criticpublisherdatedate_intrating_10
0830.0A gay New Yorker stages a marriage of convenie...RArt House and International | Comedy | Drama |...Ang LeeAng Lee | James Schamus | Neil Peng1993-08-042004-06-15NaN111.0...19930804NaN0.800000freshCarol Cling0Las Vegas Review-Journal2004-04-1620040416.08.0
11161.0Screenwriter Nimrod Antal makes an impressive ...RAction and Adventure | Art House and Internati...NaNNaN2005-04-012005-08-30116783.0105.0...20050401One very long, dark ride.0.647059rottenNaN0E! Online2005-04-2220050422.06.0
2596.0\"Arctic Tale\" is an epic adventure that explor...GDocumentary | Special InterestAdam Ravetch | Sarah RobertsonLinda Woolverton | Mose Richards | Kristin Gore2007-08-172017-08-01598103.086.0...20070817I'm no holdout about the reality of global war...0.625000rottenJack Mathews1New York Daily News2007-07-2720070727.06.0
31585.0A dating doctor claims that with his services ...PG-13Comedy | RomanceAndrew Tennant | Andy TennantKevin Bisch2005-02-112005-06-14177575142.0120.0...20050211... Adds up to far more than the formula typic...0.875000freshGreg Maki0Star-Democrat (Easton, MD)2005-02-1120050211.09.0
4603.0R&B; star Janet Jackson made an impressive...RDramaJohn SingletonJohn Singleton1993-07-232001-05-08NaN104.0...19930723NaN0.600000freshClint Morris0Film Threat2005-05-0620050506.06.0
\n", "

5 rows × 22 columns

\n", "
" ], "text/plain": [ " id synopsis rating_MPAA \\\n", "0 830.0 A gay New Yorker stages a marriage of convenie... R \n", "1 1161.0 Screenwriter Nimrod Antal makes an impressive ... R \n", "2 596.0 \"Arctic Tale\" is an epic adventure that explor... G \n", "3 1585.0 A dating doctor claims that with his services ... PG-13 \n", "4 603.0 R&B; star Janet Jackson made an impressive... R \n", "\n", " genre \\\n", "0 Art House and International | Comedy | Drama |... \n", "1 Action and Adventure | Art House and Internati... \n", "2 Documentary | Special Interest \n", "3 Comedy | Romance \n", "4 Drama \n", "\n", " director \\\n", "0 Ang Lee \n", "1 NaN \n", "2 Adam Ravetch | Sarah Robertson \n", "3 Andrew Tennant | Andy Tennant \n", "4 John Singleton \n", "\n", " writer theater_date dvd_date \\\n", "0 Ang Lee | James Schamus | Neil Peng 1993-08-04 2004-06-15 \n", "1 NaN 2005-04-01 2005-08-30 \n", "2 Linda Woolverton | Mose Richards | Kristin Gore 2007-08-17 2017-08-01 \n", "3 Kevin Bisch 2005-02-11 2005-06-14 \n", "4 John Singleton 1993-07-23 2001-05-08 \n", "\n", " box_office runtime ... theater_date_int \\\n", "0 NaN 111.0 ... 19930804 \n", "1 116783.0 105.0 ... 20050401 \n", "2 598103.0 86.0 ... 20070817 \n", "3 177575142.0 120.0 ... 20050211 \n", "4 NaN 104.0 ... 19930723 \n", "\n", " review rating fresh \\\n", "0 NaN 0.800000 fresh \n", "1 One very long, dark ride. 0.647059 rotten \n", "2 I'm no holdout about the reality of global war... 0.625000 rotten \n", "3 ... Adds up to far more than the formula typic... 0.875000 fresh \n", "4 NaN 0.600000 fresh \n", "\n", " critic top_critic publisher date \\\n", "0 Carol Cling 0 Las Vegas Review-Journal 2004-04-16 \n", "1 NaN 0 E! Online 2005-04-22 \n", "2 Jack Mathews 1 New York Daily News 2007-07-27 \n", "3 Greg Maki 0 Star-Democrat (Easton, MD) 2005-02-11 \n", "4 Clint Morris 0 Film Threat 2005-05-06 \n", "\n", " date_int rating_10 \n", "0 20040416.0 8.0 \n", "1 20050422.0 6.0 \n", "2 20070727.0 6.0 \n", "3 20050211.0 9.0 \n", "4 20050506.0 6.0 \n", "\n", "[5 rows x 22 columns]" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "learn.head(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Features description \n", "\n", "|Id | Feature name | Description |\n", "|---|-------------------|----------------------------------------------------------------------------------------------|\n", "| 1 | ``id`` | unique movie id |\n", "| 2 | ``synopsis`` | brief summary of the major points of a movie |\n", "| 3 | ``rating_MPAA`` | film rating by MPAA rating system |\n", "| 4 | ``genre`` | list of genres that are suitable for this film (e.g. Action, Adventure, Comedy,... 
|\n", "| 5 | ``director`` | list of persons who direct the making of a film |\n", "| 6 | ``writer`` | list of persons who write a screenplay |\n", "| 7 | ``theater_date`` | the date when film was first shown to the public in cinema (string) |\n", "| 8 | ``dvd_date`` | the date when film was released on DVD (string) |\n", "| 9 | ``box_office`` | the amount of money raised by ticket sales (revenue) |\n", "| 10 | ``runtime`` | film duration in minutes |\n", "| 11 | ``studio`` | is a major entertainment company or motion picture company (20th Century Fox, Sony Pictures)|\n", "| 12 | ``dvd_date_int`` | the date when film was released on DVD (converted to integer) |\n", "| 13 | ``theater_date_int`` | the date when film was first shown to the public in cinema (converted to integer) |\n", "| 14 | ``review`` | review of a movie, that was written by a critic |\n", "| 15 | ``rating`` | float rating from 0 to 1 of the film according to the Rotten tomatoes web site |\n", "| 16 | ``fresh`` | freshness of review - fresh or rotten |\n", "| 17 | ``critic`` | name of reviewer |\n", "| 18 | ``top_critic`` | binary feature, is reviewer a top critic or not |\n", "| 19 | ``publisher`` | journal or website where the review was published |\n", "| 20 | ``date`` | the date when critic publish review (string) |\n", "| 21 | ``date_int`` | the date when critic publish review (converted to integer) |\n", "| 22 | ``rating_10`` | integer rating from 0 to 10 of the film according to the critic |\n", "\n", "We mark as **auxiliary** columnns 'id' and 'rating', because they can be the reason of overfitting, 'theater_date','dvd_date','date' because we convert them into integers.\n", "\n", "We mark as **text** features 'synopsis' because it is short *text* description of a film, 'genre' because it is combination of categories (we know that strings have structure where words define categories), for example 'Action | Comedy | Adventure', 'director' and 'writer' features are included to the text features by the same reason, 'review' becuase it is a *text* summary of critic opinion.\n", "\n", "We mark as **categorical** features 'rating_MPAA', 'studio', 'fresh', 'critic', 'top_critic' and 'publisher' because they can not be splitted into the group of categorical features and feature values can not be compared.\n", "\n", "The other columns considered as **numeric**." 
] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "auxiliary_columns = ['id', 'theater_date', 'dvd_date', 'rating', 'date']\n", "cat_features = ['rating_MPAA', 'studio', 'fresh', 'critic', 'top_critic', 'publisher']\n", "text_features = ['synopsis', 'genre', 'director', 'writer', 'review']" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "def get_processed_rotten_tomatoes():\n", " learn, test = rotten_tomatoes()\n", " \n", " def fill_na(df, features):\n", " for feature in features:\n", " df[feature].fillna('', inplace=True)\n", "\n", " def preprocess_data_part(data_part):\n", " data_part = data_part.drop(auxiliary_columns, axis=1)\n", " \n", " fill_na(data_part, cat_features)\n", " fill_na(data_part, text_features)\n", "\n", " X = data_part.drop(['rating_10'], axis=1)\n", " y = data_part['rating_10']\n", " return X, y\n", " \n", " X_learn, y_learn = preprocess_data_part(learn)\n", " X_test, y_test = preprocess_data_part(test)\n", "\n", " return X_learn, X_test, y_learn, y_test\n", "\n", "X_train, X_test, y_train, y_test = get_processed_rotten_tomatoes()" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
synopsisrating_MPAAgenredirectorwriterbox_officeruntimestudiodvd_date_inttheater_date_intreviewfreshcritictop_criticpublisherdate_int
0A gay New Yorker stages a marriage of convenie...RArt House and International | Comedy | Drama |...Ang LeeAng Lee | James Schamus | Neil PengNaN111.02004061519930804freshCarol Cling0Las Vegas Review-Journal20040416.0
1Screenwriter Nimrod Antal makes an impressive ...RAction and Adventure | Art House and Internati...116783.0105.0ThinkFilm Inc.2005083020050401One very long, dark ride.rotten0E! Online20050422.0
2\"Arctic Tale\" is an epic adventure that explor...GDocumentary | Special InterestAdam Ravetch | Sarah RobertsonLinda Woolverton | Mose Richards | Kristin Gore598103.086.0Paramount Vantage2017080120070817I'm no holdout about the reality of global war...rottenJack Mathews1New York Daily News20070727.0
3A dating doctor claims that with his services ...PG-13Comedy | RomanceAndrew Tennant | Andy TennantKevin Bisch177575142.0120.0Sony Pictures2005061420050211... Adds up to far more than the formula typic...freshGreg Maki0Star-Democrat (Easton, MD)20050211.0
4R&B; star Janet Jackson made an impressive...RDramaJohn SingletonJohn SingletonNaN104.02001050819930723freshClint Morris0Film Threat20050506.0
\n", "
" ], "text/plain": [ " synopsis rating_MPAA \\\n", "0 A gay New Yorker stages a marriage of convenie... R \n", "1 Screenwriter Nimrod Antal makes an impressive ... R \n", "2 \"Arctic Tale\" is an epic adventure that explor... G \n", "3 A dating doctor claims that with his services ... PG-13 \n", "4 R&B; star Janet Jackson made an impressive... R \n", "\n", " genre \\\n", "0 Art House and International | Comedy | Drama |... \n", "1 Action and Adventure | Art House and Internati... \n", "2 Documentary | Special Interest \n", "3 Comedy | Romance \n", "4 Drama \n", "\n", " director \\\n", "0 Ang Lee \n", "1 \n", "2 Adam Ravetch | Sarah Robertson \n", "3 Andrew Tennant | Andy Tennant \n", "4 John Singleton \n", "\n", " writer box_office runtime \\\n", "0 Ang Lee | James Schamus | Neil Peng NaN 111.0 \n", "1 116783.0 105.0 \n", "2 Linda Woolverton | Mose Richards | Kristin Gore 598103.0 86.0 \n", "3 Kevin Bisch 177575142.0 120.0 \n", "4 John Singleton NaN 104.0 \n", "\n", " studio dvd_date_int theater_date_int \\\n", "0 20040615 19930804 \n", "1 ThinkFilm Inc. 20050830 20050401 \n", "2 Paramount Vantage 20170801 20070817 \n", "3 Sony Pictures 20050614 20050211 \n", "4 20010508 19930723 \n", "\n", " review fresh critic \\\n", "0 fresh Carol Cling \n", "1 One very long, dark ride. rotten \n", "2 I'm no holdout about the reality of global war... rotten Jack Mathews \n", "3 ... Adds up to far more than the formula typic... fresh Greg Maki \n", "4 fresh Clint Morris \n", "\n", " top_critic publisher date_int \n", "0 0 Las Vegas Review-Journal 20040416.0 \n", "1 0 E! Online 20050422.0 \n", "2 1 New York Daily News 20070727.0 \n", "3 0 Star-Democrat (Easton, MD) 20050211.0 \n", "4 0 Film Threat 20050506.0 " ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X_train.head(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Previously you can specify only categorical features and perform training like this:" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0:\tlearn: 0.3937699\ttest: 0.4022497\tbest: 0.4022497 (0)\ttotal: 53.4ms\tremaining: 53.3s\n", "50:\tlearn: 0.4163304\ttest: 0.4213229\tbest: 0.4225455 (33)\ttotal: 1.27s\tremaining: 23.6s\n", "100:\tlearn: 0.4280692\ttest: 0.4290256\tbest: 0.4298814 (97)\ttotal: 2.47s\tremaining: 22s\n", "150:\tlearn: 0.4360174\ttest: 0.4325712\tbest: 0.4329380 (140)\ttotal: 3.62s\tremaining: 20.3s\n", "200:\tlearn: 0.4423147\ttest: 0.4368505\tbest: 0.4372173 (182)\ttotal: 4.78s\tremaining: 19s\n", "250:\tlearn: 0.4474505\ttest: 0.4374618\tbest: 0.4381954 (240)\ttotal: 5.96s\tremaining: 17.8s\n", "300:\tlearn: 0.4526168\ttest: 0.4410075\tbest: 0.4411297 (299)\ttotal: 7.13s\tremaining: 16.6s\n", "350:\tlearn: 0.4575385\ttest: 0.4413743\tbest: 0.4418633 (312)\ttotal: 8.25s\tremaining: 15.3s\n", "400:\tlearn: 0.4627965\ttest: 0.4446754\tbest: 0.4446754 (399)\ttotal: 9.38s\tremaining: 14s\n", "450:\tlearn: 0.4666178\ttest: 0.4456535\tbest: 0.4462648 (448)\ttotal: 10.5s\tremaining: 12.8s\n", "500:\tlearn: 0.4697359\ttest: 0.4480988\tbest: 0.4482211 (498)\ttotal: 11.6s\tremaining: 11.6s\n", "550:\tlearn: 0.4736794\ttest: 0.4476097\tbest: 0.4482211 (498)\ttotal: 12.7s\tremaining: 10.4s\n", "600:\tlearn: 0.4782649\ttest: 0.4480988\tbest: 0.4483433 (578)\ttotal: 13.9s\tremaining: 9.21s\n", "650:\tlearn: 0.4812913\ttest: 0.4490769\tbest: 0.4493214 (647)\ttotal: 15.1s\tremaining: 8.07s\n", "700:\tlearn: 0.4852042\ttest: 
0.4489546\tbest: 0.4498105 (695)\ttotal: 16.2s\tremaining: 6.93s\n", "750:\tlearn: 0.4882612\ttest: 0.4499328\tbest: 0.4507886 (708)\ttotal: 17.4s\tremaining: 5.77s\n", "800:\tlearn: 0.4915322\ttest: 0.4496882\tbest: 0.4507886 (708)\ttotal: 18.5s\tremaining: 4.61s\n", "850:\tlearn: 0.4947726\ttest: 0.4498105\tbest: 0.4507886 (708)\ttotal: 19.7s\tremaining: 3.44s\n", "900:\tlearn: 0.4980741\ttest: 0.4509109\tbest: 0.4512777 (898)\ttotal: 20.8s\tremaining: 2.28s\n", "950:\tlearn: 0.5011005\ttest: 0.4498105\tbest: 0.4513999 (902)\ttotal: 22s\tremaining: 1.13s\n", "999:\tlearn: 0.5046772\ttest: 0.4510331\tbest: 0.4516445 (985)\ttotal: 23.1s\tremaining: 0us\n", "bestTest = 0.4516444553\n", "bestIteration = 985\n", "Shrink model to first 986 iterations.\n" ] }, { "data": { "text/plain": [ "" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X_train_no_text = X_train.drop(text_features, axis=1)\n", "X_test_no_text = X_test.drop(text_features, axis=1)\n", "\n", "learn_pool = Pool(\n", " X_train_no_text, \n", " y_train, \n", " cat_features=cat_features, \n", " feature_names=list(X_train_no_text)\n", ")\n", "test_pool = Pool(\n", " X_test_no_text, \n", " y_test, \n", " cat_features=cat_features, \n", " feature_names=list(X_train_no_text)\n", ")\n", "\n", "model = CatBoostClassifier(iterations=1000, learning_rate=0.03, eval_metric='Accuracy', task_type='GPU')\n", "model.fit(learn_pool, eval_set=test_pool, verbose=50)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Now you can also specify text features:" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [], "source": [ "def fit_catboost_on_rotten_tomatoes(X_train, X_test, y_train, y_test, catboost_params={}, verbose=100):\n", " learn_pool = Pool(\n", " X_train, \n", " y_train, \n", " cat_features=cat_features,\n", " text_features=text_features,\n", " feature_names=list(X_train)\n", " )\n", " test_pool = Pool(\n", " X_test, \n", " y_test, \n", " cat_features=cat_features,\n", " text_features=text_features,\n", " feature_names=list(X_train)\n", " )\n", " \n", " catboost_default_params = {\n", " 'iterations': 1000,\n", " 'learning_rate': 0.03,\n", " 'eval_metric': 'Accuracy',\n", " 'task_type': 'GPU'\n", " }\n", " \n", " catboost_default_params.update(catboost_params)\n", " \n", " model = CatBoostClassifier(**catboost_default_params)\n", " model.fit(learn_pool, eval_set=test_pool, verbose=verbose)\n", "\n", " return model" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0:\tlearn: 0.4001895\ttest: 0.4054285\tbest: 0.4054285 (0)\ttotal: 184ms\tremaining: 3m 3s\n", "100:\tlearn: 0.4465640\ttest: 0.4553124\tbest: 0.4553124 (100)\ttotal: 15.6s\tremaining: 2m 18s\n", "200:\tlearn: 0.4576914\ttest: 0.4627705\tbest: 0.4628928 (197)\ttotal: 28.1s\tremaining: 1m 51s\n", "300:\tlearn: 0.4645696\ttest: 0.4670498\tbest: 0.4670498 (285)\ttotal: 39.3s\tremaining: 1m 31s\n", "400:\tlearn: 0.4725177\ttest: 0.4690060\tbest: 0.4691283 (399)\ttotal: 50.4s\tremaining: 1m 15s\n", "500:\tlearn: 0.4791514\ttest: 0.4688837\tbest: 0.4703509 (448)\ttotal: 1m 1s\tremaining: 1m 1s\n", "600:\tlearn: 0.4836146\ttest: 0.4685169\tbest: 0.4703509 (448)\ttotal: 1m 13s\tremaining: 48.5s\n", "700:\tlearn: 0.4891477\ttest: 0.4703509\tbest: 0.4705954 (623)\ttotal: 1m 24s\tremaining: 35.9s\n", "800:\tlearn: 0.4943140\ttest: 0.4699841\tbest: 0.4714513 (785)\ttotal: 1m 35s\tremaining: 23.7s\n", "900:\tlearn: 
0.4993886\ttest: 0.4707177\tbest: 0.4714513 (785)\ttotal: 1m 46s\tremaining: 11.7s\n", "999:\tlearn: 0.5043103\ttest: 0.4709622\tbest: 0.4725517 (965)\ttotal: 1m 56s\tremaining: 0us\n", "bestTest = 0.4725516567\n", "bestIteration = 965\n", "Shrink model to first 966 iterations.\n" ] }, { "data": { "text/plain": [ "" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fit_catboost_on_rotten_tomatoes(X_train, X_test, y_train, y_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Accuracy was 0.4563 → currently it is 0.4707
\n", "\n", "The accuracy increased by about 3% thanks to the added signals from text features." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can also create a Pool object with text features using file paths. In that case you need to mark all text features with the type `Text` in the column description file." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Attention!**\n", "\n", "1. Text features also cannot contain NaN values, so we converted them into strings manually.\n", "2. Training with text features is supported only on GPU.\n", "3. The training may be performed only with classification losses and targets." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# How does it work?\n", "## Text preprocessing\n", "\n", "Usually we get our text as a sequence of Unicode symbols. Unless the task is something like DNA classification, we don't need such granularity; instead, we want to extract more complicated entities, e.g. words. The process of extracting tokens -- words, numbers, punctuation, or special symbols such as emoji -- from a sequence is called **tokenization**.
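\n", "\n", "As a minimal illustration, tokenizing by a space pattern is just a split (the ``Tokenizer`` from ``catboost.text_processing`` used below is the real tool):\n", "\n", "```python\n", "'The cat defeated the mouse'.split(' ')\n", "# ['The', 'cat', 'defeated', 'the', 'mouse']\n", "```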
\n", "\n", "Tokenization is the first part of text preprocessing in CatBoost and performed as a simple splitting a sequence on a string pattern (e.g. space).\n", "### Example" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [], "source": [ "text_small = [\n", " \"Cats are so cute :)\",\n", " \"Mouse skare...\",\n", " \"The cat defeated the mouse\",\n", " \"Cute: Mice gather an army!\",\n", " \"Army of mice defeated the cat :(\",\n", " \"Cat offers peace\",\n", " \"Cat is skared :(\",\n", " \"Cat and mouse live in peace :)\"\n", "]\n", "\n", "target_small = [1, 0, 1, 1, 0, 1, 0, 1]" ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "[['Cats', 'are', 'so', 'cute', ':)'],\n", " ['Mouse', 'skare...'],\n", " ['The', 'cat', 'defeated', 'the', 'mouse'],\n", " ['Cute:', 'Mice', 'gather', 'an', 'army!'],\n", " ['Army', 'of', 'mice', 'defeated', 'the', 'cat', ':('],\n", " ['Cat', 'offers', 'peace'],\n", " ['Cat', 'is', 'skared', ':('],\n", " ['Cat', 'and', 'mouse', 'live', 'in', 'peace', ':)']]" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from catboost.text_processing import Tokenizer\n", "\n", "simple_tokenizer = Tokenizer()\n", "\n", "def tokenize_texts(texts):\n", " return [simple_tokenizer.tokenize(text) for text in texts]\n", "\n", "tokenized_text = tokenize_texts(text_small)\n", "tokenized_text" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Punctuation handling, lowercasing, lemmatization\n", "\n", "Lets take a closer look on the tokenization result of small text example -- the tokens contains a lot of mistakes:\n", "\n", "1. They are glued with punctuation 'Cute:', 'army!', 'skare...'.\n", "2. The words 'Cat' and 'cat', 'Mice' and 'mice' seems to have same meaning, perhaps they should be the same tokens.\n", "3. The same problem with tokens 'are'/'is' -- they are inflected forms of same token 'be'.\n", "\n", "**Punctuation handling** and **lemmatization** processes help to overcome these problems.\n", "\n", "### Punctuation handling\n", "\n", "Depending on the task, punctuation handling process may:\n", "1. Completely remove all punctuation.\n", "2. Escape it with spaces.\n", "3. Just left as is (e.g. for more complicated tokens collection).\n", "\n", "### Examples" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['cats are so cute :)',\n", " 'mouse skare ...',\n", " 'the cat defeated the mouse',\n", " 'cute : mice gather an army !',\n", " 'army of mice defeated the cat :(',\n", " 'cat offers peace',\n", " 'cat is skared :(',\n", " 'cat and mouse live in peace :)']" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tokenizer = Tokenizer(\n", " lowercasing=True,\n", " separator_type='BySense',\n", " token_types=['Word', 'Number', 'Punctuation']\n", ")\n", "\n", "text_small_spaced = [' '.join(tokenizer.tokenize(text)) for text in text_small]\n", "text_small_spaced" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Removing stop words\n", "\n", "**Stop words** -- the words that are considered to be uninformative in this task, e.g. function words such as *the, is, at, which, on*.
\n", "Usually stop words are removed during text preprocessing to reduce the amount of information that is considered for further algorithms.
\n", "Stop words are collected manually (in dictionary form) or automatically, for example taking the most frequent words." ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['cats so cute :)',\n", " 'mouse skare ...',\n", " 'cat defeated mouse',\n", " 'cute : mice gather army !',\n", " 'army mice defeated cat :(',\n", " 'cat offers peace',\n", " 'cat skared :(',\n", " 'cat mouse live peace :)']" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "stop_words = ['be', 'is', 'are', 'the', 'an', 'of', 'and', 'in']\n", "\n", "def remove_words(texts, words):\n", " texts_copy = []\n", " words_set = set(words)\n", "\n", " for text in tokenize_texts(texts):\n", " text_copy = []\n", " for token in text:\n", " if token not in words_set:\n", " text_copy.append(token)\n", " texts_copy.append(' '.join(text_copy))\n", " \n", " return texts_copy\n", " \n", "text_small_no_stop = remove_words(text_small_spaced, stop_words)\n", "text_small_no_stop" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Lemmatization\n", "\n", "Lemma (Wikipedia) -- is the canonical form, dictionary form, or citation form of a set of words.
\n", "For example, the lemma \"go\" represents the inflected forms \"go\", \"goes\", \"going\", \"went\", and \"gone\".
\n", "The process of convertation word to its lemma called **lemmatization**." ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[['cat', 'so', 'cute', ':)'],\n", " ['mouse', 'skare', '...'],\n", " ['cat', 'defeat', 'mouse'],\n", " ['cute', ':', 'mice', 'gather', 'army', '!'],\n", " ['army', 'mice', 'defeat', 'cat', ':('],\n", " ['cat', 'offer', 'peace'],\n", " ['cat', 'skare', ':('],\n", " ['cat', 'mouse', 'live', 'peace', ':)']]" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from pattern.en import lemma\n", "\n", "def lemmatize_text(text):\n", " return \" \".join([lemma(word) for word in text.decode('utf-8').split()])\n", "\n", "def lemmatize_texts(texts):\n", " return [lemmatize_text(text) for text in texts]\n", "\n", "text_small_lemmatized = lemmatize_texts(text_small_no_stop)\n", "text_small_lemmatized = tokenize_texts(text_small_lemmatized)\n", "text_small_lemmatized" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now words with same meaning represented by the same token, tokens are not glued with punctuation.\n", "\n", "Be carefull. You should verify for your own task:
\n", "Is it realy necessary to remove punctuation, lowercasing sentences or performing a lemmatization and/or by word tokenization?
\n", "\n", "There is still problems with tokens 'mice'/'mouse', gensim lemmatizer can handle it, we provide code in the cell below, but for the sake of simplicity and speed in this tutorial we will use the first one." ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [], "source": [ "from gensim.utils import lemmatize\n", "\n", "def lemmatize_text_gensim(text):\n", " result = []\n", " for token in simple_tokenizer.tokenize(text):\n", " lemmas = lemmatize(token)\n", " if len(lemmas) == 0:\n", " lemma = token.lower()\n", " else:\n", " lemma = lemmas[0].decode('utf-8').split('/')[0]\n", " \n", " result.append(lemma)\n", " return ' '.join(result)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Let's check up accuracy with new text preprocessing\n", "\n", "Since CatBoost doesn't perform spacing punctuation, lowercasing letters and lemmatization, we need to preprocess text manually and then pass it to learning algorithm.\n", "\n", "Since the natural text features is only synopsis and review, we will preprocess only them." ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [], "source": [ "def preprocess_data(X):\n", " X_preprocessed = X.copy()\n", " for feature in ['synopsis', 'review']:\n", " X_preprocessed[feature] = X[feature].apply(lambda x: lemmatize_text(' '.join(tokenizer.tokenize(x))))\n", " return X_preprocessed\n", "\n", "X_preprocessed_train = preprocess_data(X_train)\n", "X_preprocessed_test = preprocess_data(X_test)" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 a gay new yorker stage a marriage of convenien...\n", "1 screenwriter nimrod antal make an impressive d...\n", "2 \" arctic tale \" be an epic adventure that expl...\n", "3 a date doctor claim that with hi service he ca...\n", "4 r & amp ; b ; star janet jackson make an impre...\n", "5 for four year , the courageou crew of the nsea...\n", "6 thi holiday season , acclaim filmmaker cameron...\n", "7 three billboard outside ebb , missouri be a da...\n", "8 alfie elkin be a philosophical womanizer who b...\n", "9 george cukor' remake of the 1940 film gaslight...\n", "Name: synopsis, dtype: object" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X_preprocessed_train['synopsis'].head(10)" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0:\tlearn: 0.4000367\ttest: 0.4045727\tbest: 0.4045727 (0)\ttotal: 181ms\tremaining: 3m\n", "100:\tlearn: 0.4474810\ttest: 0.4549456\tbest: 0.4549456 (100)\ttotal: 15.4s\tremaining: 2m 16s\n", "200:\tlearn: 0.4596784\ttest: 0.4611811\tbest: 0.4615479 (195)\ttotal: 28s\tremaining: 1m 51s\n", "300:\tlearn: 0.4677183\ttest: 0.4661939\tbest: 0.4669275 (287)\ttotal: 39.5s\tremaining: 1m 31s\n", "400:\tlearn: 0.4745659\ttest: 0.4668052\tbest: 0.4675388 (394)\ttotal: 51s\tremaining: 1m 16s\n", "500:\tlearn: 0.4830949\ttest: 0.4708400\tbest: 0.4710845 (445)\ttotal: 1m 2s\tremaining: 1m 2s\n", "600:\tlearn: 0.4893006\ttest: 0.4730407\tbest: 0.4742634 (586)\ttotal: 1m 13s\tremaining: 49.1s\n", "700:\tlearn: 0.4949865\ttest: 0.4735298\tbest: 0.4746302 (696)\ttotal: 1m 25s\tremaining: 36.4s\n", "800:\tlearn: 0.5013145\ttest: 0.4741411\tbest: 0.4752415 (779)\ttotal: 1m 36s\tremaining: 23.9s\n", "900:\tlearn: 0.5069393\ttest: 0.4743856\tbest: 0.4752415 (779)\ttotal: 1m 47s\tremaining: 11.8s\n", "999:\tlearn: 0.5115248\ttest: 0.4741411\tbest: 
0.4754860 (934)\ttotal: 1m 58s\tremaining: 0us\n", "bestTest = 0.4754860007\n", "bestIteration = 934\n", "Shrink model to first 935 iterations.\n" ] }, { "data": { "text/plain": [ "" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fit_catboost_on_rotten_tomatoes(X_preprocessed_train, X_preprocessed_test, y_train, y_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Dictionary\n", "\n", "After the first stage, preprocessing of the text and tokenization, the second stage starts. The second stage uses the prepared text to select a set of units, which will be used for building new numerical features.\n", "\n", "The set of selected units is called a dictionary. It might contain words, word bigrams, or character n-grams.\n", "\n", "### Examples\n", "Let's build a dictionary for our small text example:" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "word=\"cat\" has id=\"0\"\n", "word=\"so\" has id=\"1\"\n", "word=\"cute\" has id=\"2\"\n", "word=\":)\" has id=\"3\"\n", "word=\"mouse\" has id=\"4\"\n", "...\n" ] } ], "source": [ "def build_dictionary(tokenized_texts):\n", " dictionary = {}\n", "\n", " for text in tokenized_texts:\n", " for token in text:\n", " if token not in dictionary:\n", " size = len(dictionary)\n", " dictionary[token] = size\n", "\n", " return dictionary\n", "\n", "def print_dictionary(dictionary, n_items=5):\n", " dict_items = sorted(dictionary.items(), key=lambda x: x[1])\n", "\n", " for i in range(n_items):\n", " word, word_id = dict_items[i]\n", " print('word=\"{}\" has id=\"{}\"'.format(word, word_id))\n", " \n", " print('...')\n", "\n", "dictionary = build_dictionary(text_small_lemmatized)\n", "print_dictionary(dictionary)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Conversion into fixed-size vectors\n", "\n", "The majority of classic ML algorithms compute and predict on a fixed number of features $F$.
\n", "That means that learning set $X = \\{x_i\\}$ contains vectors $x_i = (a_0, a_1, ..., a_F)$ where $F$ is constant.\n", "\n", "Since text object $x$ is not a fixed length vector, we need to perform preprocessing of the origin set $D$.
\n", "One of the simplest text to vector encoding technique is **Bag of words (BoW)**.\n", "\n", "### Bag of words algorithm\n", "\n", "The algorithm takes in a dictionary and a text.
\n", "During the algorithm text $x = (a_0, a_1, ..., a_k)$ converted into vector $\\tilde x = (b_0, b_1, ..., b_F)$,
where $b_i$ is the number of occurrences in the text $x$ of the word with id=$i$ from the dictionary (the implementation below uses the binary variant and only marks presence)." ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "scrolled": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " cat so cute :) mouse skare ... defeat : mice gather army ! :( \\\n", "0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 \n", "1 0 0 0 0 1 1 1 0 0 0 0 0 0 0 \n", "2 1 0 0 0 1 0 0 1 0 0 0 0 0 0 \n", "3 0 0 1 0 0 0 0 0 1 1 1 1 1 0 \n", "4 1 0 0 0 0 0 0 1 0 1 0 1 0 1 \n", "5 1 0 0 0 0 0 0 0 0 0 0 0 0 0 \n", "6 1 0 0 0 0 1 0 0 0 0 0 0 0 1 \n", "7 1 0 0 1 1 0 0 0 0 0 0 0 0 0 \n", "\n", " offer peace live \n", "0 0 0 0 \n", "1 0 0 0 \n", "2 0 0 0 \n", "3 0 0 0 \n", "4 0 0 0 \n", "5 1 1 0 \n", "6 0 0 0 \n", "7 0 1 1 \n" ] } ], "source": [ "def bag_of_words(texts, dictionary):\n", " encoded_vectors = []\n", " dictionary_size = len(dictionary)\n", "\n", " for text in texts:\n", " vector = [0] * dictionary_size\n", "\n", " for token in text:\n", " if token in dictionary:\n", " token_id = dictionary[token]\n", " vector[token_id] = 1  # binary BoW: mark presence (use += 1 to count occurrences)\n", " \n", " encoded_vectors.append(vector)\n", " \n", " return encoded_vectors\n", "\n", "def print_bow_features(bag_of_words, dictionary):\n", " sorted_dict = sorted(dictionary.items(), key=lambda x: x[1])\n", " keys = [x[0] for x in sorted_dict]\n", " bow_df = pd.DataFrame(data=bag_of_words, columns=keys)\n", " print(bow_df)\n", " \n", "\n", "bow_features = bag_of_words(text_small_lemmatized, dictionary)\n", "\n", "print_bow_features(bow_features, dictionary)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For example, given such vectors we can fit a linear or naive Bayes model" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "scrolled": true }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/home/d-kruchinin/.local/lib/python2.7/site-packages/sklearn/linear_model/logistic.py:433: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. 
Specify a solver to silence this warning.\n", " FutureWarning)\n" ] } ], "source": [ "from sklearn.linear_model import LogisticRegression\n", "from sklearn.naive_bayes import MultinomialNB\n", "from scipy.sparse import csr_matrix\n", "\n", "def fit_linear_model(X, c):\n", " model = LogisticRegression()\n", " model.fit(X, c)\n", " return model\n", "\n", "def fit_naive_bayes(X, c):\n", " clf = MultinomialNB()\n", " if isinstance(X, csr_matrix):\n", " X.eliminate_zeros()\n", " clf.fit(X, c)\n", " return clf\n", "\n", "linear_model = fit_linear_model(bow_features, target_small)\n", "naive_bayes = fit_naive_bayes(bow_features, target_small)" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Linear model\n", "Logloss: 0.3314294362422291\n", "Naive bayes\n", "Logloss: 0.1667380176962438\n", "Comparing to constant prediction\n", "Logloss: 0.6931471805599453\n" ] } ], "source": [ "from sklearn.metrics import log_loss\n", "\n", "def evaluate_model_logloss(model, X, c):\n", " c_pred = model.predict_proba(X)[:,1]\n", " metric = log_loss(c, c_pred)\n", " print('Logloss: ' + str(metric))\n", "\n", "print('Linear model')\n", "evaluate_model_logloss(linear_model, bow_features, target_small)\n", "print('Naive bayes')\n", "evaluate_model_logloss(naive_bayes, bow_features, target_small)\n", "print('Comparing to constant prediction')\n", "logloss_constant_prediction = log_loss(target_small, np.ones(shape=(len(text_small), 2)) * 0.5)\n", "print('Logloss: ' + str(logloss_constant_prediction))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Looking at sequences of letters / words\n", "\n", "Let's look at an example with the texts 'The cat defeated the mouse' and 'Army of mice defeated the cat :('
\n", "Simplifying it we have three tokens in each sentence 'cat defeat mouse' and 'mouse defeat cat'.
\n", "After applying BoW we get two equal vectors with the opposite meaning:\n", "\n", "| cat | mouse | defeat |\n", "|-----|-------|--------|\n", "| 1 | 1 | 1 |\n", "| 1 | 1 | 1 |\n", "\n", "How to distinguish them?\n", "Lets add sequences of words as a single tokens into our dictionary:\n", "\n", "| cat | mouse | defeat | cat_defeat | mouse_defeat | defeat_cat | defeat_mouse |\n", "|-----|-------|--------|------------|--------------|------------|--------------|\n", "| 1 | 1 | 1 | 1 | 0 | 0 | 1 |\n", "| 1 | 1 | 1 | 0 | 1 | 1 | 0 |\n", "\n", "**N-gram** is a continguous sequence of $n$ items from a given sample of text or speech (Wikipedia).
\n", "In example above Bi-gram (Bigram) = 2-gram of words.\n", "\n", "Ngrams help to add into vectors more information about text structure, moreover there are n-grams has no meanings in separation, for example, 'Mickey Mouse company'.\n", "\n", "## Examples" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "word=\"cat so\" has id=\"0\"\n", "word=\"so cute\" has id=\"1\"\n", "word=\"cute :)\" has id=\"2\"\n", "word=\"mouse skare\" has id=\"3\"\n", "word=\"skare ...\" has id=\"4\"\n", "...\n" ] } ], "source": [ "def build_bigram_dictionary(tokenized_texts):\n", " dictionary = {}\n", " for text in tokenized_texts:\n", " for i in range(len(text) - 1):\n", " token1, token2 = text[i], text[i + 1]\n", " bigram = token1 + ' ' + token2\n", " \n", " if bigram not in dictionary:\n", " dictionary_size = len(dictionary)\n", " dictionary[bigram] = dictionary_size\n", "\n", " return dictionary\n", "\n", "bigram_word_dictionary = build_bigram_dictionary(text_small_lemmatized)\n", "print_dictionary(bigram_word_dictionary)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Dictionaries in CatBoost\n", "\n", "To specify which type of dictionary to create in CatBoost you need to pass `dictionaries` parameter. This parameter specifies all dictionaries that are computed during text preprocessing procedure.\n", "\n", "Dictionaries parameters specified as list of strings, each string is a description of dictionary in the following format:
\n", "``'DictionaryName:[Param1=Value1,[Param2=Value2]]'``\n", "\n", "Here is a list of all parameters:
\n", "``min_token_occurrence`` -- number; minimal token occurence to enter to dictionary
\n", "``max_dict_size`` -- number; maximum dictionary size
\n", "``token_level_type`` -- string: ``Word`` or ``Letter``; use letter or word tokens
\n", "``gram_order`` -- number; build n-gram dictionary.\n", "\n", "### Let's see how these parameters affect the quality\n", "\n", "``min_token_occurrence`` -- parameter can be very useful for filtering too rare tokens, this helps to avoid overfitting
\n", "``max_dict_size`` -- parameter can help to control the size of model" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0:\tlearn: 0.3855466\ttest: 0.3940580\tbest: 0.3940580 (0)\ttotal: 107ms\tremaining: 1m 46s\n", "100:\tlearn: 0.4497432\ttest: 0.4521335\tbest: 0.4529894 (97)\ttotal: 4.49s\tremaining: 39.9s\n", "200:\tlearn: 0.4622463\ttest: 0.4624037\tbest: 0.4637486 (189)\ttotal: 8.5s\tremaining: 33.8s\n", "300:\tlearn: 0.4705307\ttest: 0.4636264\tbest: 0.4639932 (299)\ttotal: 12.5s\tremaining: 29.1s\n", "400:\tlearn: 0.4780509\ttest: 0.4653381\tbest: 0.4671720 (339)\ttotal: 16.6s\tremaining: 24.8s\n", "500:\tlearn: 0.4839203\ttest: 0.4666830\tbest: 0.4680279 (466)\ttotal: 20.7s\tremaining: 20.6s\n", "600:\tlearn: 0.4906151\ttest: 0.4702286\tbest: 0.4707177 (592)\ttotal: 24.9s\tremaining: 16.5s\n", "700:\tlearn: 0.4963928\ttest: 0.4703509\tbest: 0.4714513 (651)\ttotal: 29.2s\tremaining: 12.5s\n", "800:\tlearn: 0.5022316\ttest: 0.4724294\tbest: 0.4729184 (795)\ttotal: 33.6s\tremaining: 8.34s\n", "900:\tlearn: 0.5072450\ttest: 0.4740188\tbest: 0.4746302 (882)\ttotal: 38s\tremaining: 4.17s\n", "999:\tlearn: 0.5120751\ttest: 0.4749969\tbest: 0.4758528 (946)\ttotal: 42.2s\tremaining: 0us\n", "bestTest = 0.4758527937\n", "bestIteration = 946\n", "Shrink model to first 947 iterations.\n" ] }, { "data": { "text/plain": [ "" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fit_catboost_on_rotten_tomatoes(\n", " X_preprocessed_train,\n", " X_preprocessed_test, \n", " y_train, \n", " y_test,\n", " catboost_params={\n", " 'dictionaries': [\n", " 'Word:min_token_occurrence=5',\n", " 'BiGram:gram_order=2'\n", " ],\n", " 'text_processing': [\n", " 'NaiveBayes+Word|BoW+Word,BiGram'\n", " ]\n", " }\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Feature calculation in CatBoost\n", "\n", "Since the text is converted into a sequence of token indices, this information allows CatBoost to compute different numeric features:\n", "\n", "1. Bag of words: 0/1 features (text sample has or not token_id), number of produced numeric features = dictionary size.\n", "2. NaiveBayes: [Multinomial naive bayes](https://en.wikipedia.org/wiki/Naive_Bayes_classifier#Multinomial_naive_Bayes) model, produce number of features equal to number of classes.
To avoid target leakage, this model is computed online on several dataset permutations, as is done for the [estimation of CTRs](https://catboost.ai/docs/concepts/algorithm-main-stages_cat-to-numberic.html).\n", "3. [BM25](https://en.wikipedia.org/wiki/Okapi_BM25) is also computed online; it is a function used by search engines for ranking, to estimate the relevance of documents.\n", "\n", "You can specify which features to compute via the ``text_processing`` parameter." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Parameter ``text_processing``\n", "\n", "The **text_processing** parameter specifies how text features are preprocessed.\n", "\n", "Text processing parameters are specified as a list of strings; each string describes the preprocessing of one feature in the following format:
\n", "``'FeatureId~[FeatureEstimator1+DictionaryName1[|FeatureEstimator2+DictionaryName2]]'``\n", "\n", "Example: ``'0~BoW+Word|NaiveBayes+Word,Bigram'``,\n", "this means that on 0-th text feature ``BoW`` and ``NaiveBayes`` features will be computed using ``Word`` and ``Word,Bigram`` dictionaries correspondingly.
\n", "Also it may be specified ``default~...`` text feature (or empty FeatureId), it means that all text features will be preprocessed with the same procedure, specified in parameters.\n", "\n", "Dictionaries names are taken from ``dictionaries`` parameter.\n", "\n", "Also you can specify parameters for estimators, e.g. for Bag of words you may specify parameter ``top_tokens_count``, this parameter sets maximum number of tokens that will be used for vectorization in bag of words, the most frequent $n$ tokens are taken. Parameter ``top_tokens_count`` **highly affect both on CPU ang GPU RAM usage** in BoW estimator." ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0:\tlearn: 0.3985388\ttest: 0.4054285\tbest: 0.4054285 (0)\ttotal: 94.5ms\tremaining: 1m 34s\n", "100:\tlearn: 0.4494987\ttest: 0.4534784\tbest: 0.4534784 (100)\ttotal: 4.22s\tremaining: 37.5s\n", "200:\tlearn: 0.4620934\ttest: 0.4587358\tbest: 0.4593471 (196)\ttotal: 8.16s\tremaining: 32.5s\n", "300:\tlearn: 0.4711727\ttest: 0.4635041\tbest: 0.4639932 (265)\ttotal: 12.1s\tremaining: 28.1s\n", "400:\tlearn: 0.4797628\ttest: 0.4675388\tbest: 0.4681501 (399)\ttotal: 16.1s\tremaining: 24s\n", "500:\tlearn: 0.4874052\ttest: 0.4691283\tbest: 0.4696173 (496)\ttotal: 20.1s\tremaining: 20s\n", "600:\tlearn: 0.4942834\ttest: 0.4710845\tbest: 0.4719403 (596)\ttotal: 24s\tremaining: 15.9s\n", "700:\tlearn: 0.5011617\ttest: 0.4709622\tbest: 0.4719403 (596)\ttotal: 27.8s\tremaining: 11.9s\n", "800:\tlearn: 0.5069393\ttest: 0.4708400\tbest: 0.4719403 (596)\ttotal: 31.8s\tremaining: 7.89s\n", "900:\tlearn: 0.5133284\ttest: 0.4716958\tbest: 0.4723071 (836)\ttotal: 35.7s\tremaining: 3.92s\n", "999:\tlearn: 0.5190144\ttest: 0.4721849\tbest: 0.4732852 (950)\ttotal: 39.6s\tremaining: 0us\n", "bestTest = 0.4732852427\n", "bestIteration = 950\n", "Shrink model to first 951 iterations.\n" ] }, { "data": { "text/plain": [ "" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fit_catboost_on_rotten_tomatoes(\n", " X_preprocessed_train,\n", " X_preprocessed_test, \n", " y_train, \n", " y_test,\n", " catboost_params={\n", " 'dictionaries': [\n", " 'Word:min_token_occurrence=5',\n", " 'BiGram:gram_order=2'\n", " ],\n", " 'text_processing': [\n", " 'NaiveBayes+Word|BoW:top_tokens_count=1000+Word,BiGram|BM25+Word'\n", " ]\n", " }\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Summary: Text features in CatBoost\n", "\n", "### The algorithm:\n", "1. Input text is loaded as a usual column. ``text_column: [string]``.\n", "2. Each text sample is tokenized via splitting by space. ``tokenized_column: [[string]]``.\n", "3. Dictionary estimation.\n", "4. Each string in tokenized column is converted into token_id from dictionary. ``text: [[token_id]]``.\n", "5. Depending on the parameters CatBoost produce features basing on the resulting text column: Bag of words, Multinomial naive bayes or Bm25.\n", "6. Computed float features are passed into the usual CatBoost learning algorithm." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Summarizing table:\n", "\n", "| method description | Accuracy |\n", "|-------------------------------------------------------------------|----------|\n", "| Without text features | 0.4562 |\n", "| With unpreprocessed text features | 0.4707 |\n", "| After punctuation handling and lemmatization (only review column) | 0.4719 |\n", "| After adding bigrams | 0.4759 |" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Simplified comparison with classic methods: Naive Bayes and Logistic regression\n", "\n", "Let's take only one text column to compare quality of text classification.
" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [], "source": [ "X_train_one_column = pd.DataFrame(X_preprocessed_train['review'])\n", "X_test_one_column = pd.DataFrame(X_preprocessed_test['review'])" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [], "source": [ "def fit_catboost_one_column(X_train, X_test, y_train, y_test, catboost_params={}, verbose=0):\n", " learn_pool = Pool(X_train, y_train, text_features=[0])\n", " test_pool = Pool(X_test, y_test, text_features=[0])\n", " \n", " catboost_default_params = {\n", " 'iterations': 1000,\n", " 'learning_rate': 0.03,\n", " 'eval_metric': 'Accuracy',\n", " 'task_type': 'GPU'\n", " }\n", " \n", " catboost_default_params.update(catboost_params)\n", " \n", " model = CatBoostClassifier(**catboost_default_params)\n", " model.fit(learn_pool, eval_set=test_pool, verbose=verbose)\n", "\n", " return model" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [], "source": [ "from sklearn.metrics import accuracy_score\n", "from sklearn.feature_extraction.text import CountVectorizer\n", "\n", "def vectorize(X, params):\n", " vectorizer = CountVectorizer(**params)\n", " vectorizer.fit(X)\n", " return vectorizer.transform(X), vectorizer\n", "\n", "def eval_accuracy(model, X, c):\n", " c_pred = model.predict(X)\n", " return accuracy_score(c_pred, c)\n", "\n", "def fit_and_compute_accuracy(X_train, X_test, y_train, y_test, vectorizer_params={}, catboost_params={}):\n", " X_train_bow, vectorizer = vectorize(X_train.iloc[:,0], vectorizer_params)\n", " X_test_bow = vectorizer.transform(X_test.iloc[:,0])\n", " \n", " print('fitting linear model')\n", " linear_model = fit_linear_model(X_train_bow, y_train)\n", " \n", " print('fitting naive bayes model')\n", " naive_bayes = fit_naive_bayes(X_train_bow, y_train)\n", " \n", " print('fitting catboost model')\n", " cb_model = fit_catboost_one_column(X_train, X_test, y_train, y_test, catboost_params)\n", "\n", " linear_accuracy = eval_accuracy(linear_model, X_test_bow, y_test)\n", " naive_bayes_accuracy = eval_accuracy(naive_bayes, X_test_bow, y_test)\n", " cb_accuracy = eval_accuracy(cb_model, X_test, y_test)\n", "\n", " results = pd.DataFrame(\n", " data=[linear_accuracy, naive_bayes_accuracy, cb_accuracy], \n", " index=['Linear model', 'Naive bayes', 'CatBoost'],\n", " columns=['Accuracy']\n", " )\n", " print(results)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Experiment without bigrams" ] }, { "cell_type": "code", "execution_count": 29, "metadata": { "scrolled": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "fitting linear model\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "/home/d-kruchinin/.local/lib/python2.7/site-packages/sklearn/linear_model/logistic.py:460: FutureWarning: Default multi_class will be changed to 'auto' in 0.22. 
Specify the multi_class option to silence this warning.\n", " \"this warning.\", FutureWarning)\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "fitting naive bayes model\n", "fitting catboost model\n", " Accuracy\n", "Linear model 0.292945\n", "Naive bayes 0.301871\n", "CatBoost 0.325223\n" ] } ], "source": [ "fit_and_compute_accuracy(\n", " X_train_one_column, \n", " X_test_one_column, \n", " y_train, \n", " y_test,\n", " catboost_params = {\n", " 'dictionaries': ['Word:token_level_type=Word,min_token_occurrence=5'],\n", " 'text_processing': ['NaiveBayes+Word|BoW+Word']\n", " }\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Experiment with bigrams" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "fitting linear model\n", "fitting naive bayes model\n", "fitting catboost model\n", " Accuracy\n", "Linear model 0.302604\n", "Naive bayes 0.295146\n", "CatBoost 0.329747\n" ] } ], "source": [ "fit_and_compute_accuracy(\n", " X_train_one_column,\n", " X_test_one_column, \n", " y_train, \n", " y_test, \n", " vectorizer_params = {'ngram_range': (1, 2)},\n", " catboost_params = {\n", " 'dictionaries': [\n", " 'Word:token_level_type=Word,min_token_occurrence=5', \n", " 'BiGram:gram_order=2,min_token_occurrence=4'\n", " ],\n", " 'text_processing': ['NaiveBayes+Word,BiGram|BoW+Word,BiGram']\n", " }\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "| method description | Linear model | Naive bayes | CatBoost |\n", "|-------------------------|--------------|-------------|----------|\n", "| Without bigrams | 0.2929 | 0.3019 | 0.3252 |\n", "| With bigrams | 0.3026 | 0.2951 | 0.3294 |\n", "\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.4" } }, "nbformat": 4, "nbformat_minor": 2 }