{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Introducing text information signals into CatBoost"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"[](https://colab.research.google.com/github/catboost/tutorials/blob/master/text_features/text_features_in_catboost.ipynb)\n",
"\n",
"## Task description\n",
"\n",
"There is a large number of tasks associated with the analysis of textual information. An example is such well-known problems as: spam or not spam email classification, analyzing the tonality of the text, dialog systems, etc. \n",
"\n",
"Sometimes the original task can be reduced to the simple ML **classification task** on a given set of observations $D:$
$D = \\{(x_i, c_i)\\}$, $x_i = (x_{i,0}, x_{i,1}, ..., x_{i,m})$ is features vector and $c_i$ is a class of $i$-th object.
\n",
"In such task the features may contain not only $x_{i,j}$ as numeric or categorical values, but also the source text (e.g. tweet or question).\n",
"\n",
"Here and further by **Text** we imply a sequence of symbols $(a_0, a_1, ..., a_k), a_i \\in A$,
$A$ called as alphabet, e.g. $A$ can be a set of English letters, Unicode symbols or $A$ can have a more complicated structure including a sequences of symbols, also called tokens, e.g. $A$ can be a set of English words or Emoji."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Example: Rotten tomatoes dataset\n",
"\n",
"Let's download real dataset for examples with CatBoost.\n",
"This one is based on rotten tomatoes movie reviews data source and was taken from kaggle: https://www.kaggle.com/rpnuser8182/rotten-tomatoes
\n",
"\n",
"Dataset contains features about movie: description, MPAA rating, producer, company, etc, and features about user review: user name, review, rating,... \n",
"Here we set the task to predict the rating of the film according to the user's review"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import numpy as np\n",
"from catboost import Pool, CatBoostClassifier\n",
"from catboost.datasets import rotten_tomatoes"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Feature names:\n",
"id, synopsis, rating_MPAA, genre, director, writer, theater_date, dvd_date, box_office, runtime, studio, dvd_date_int, theater_date_int, review, rating, fresh, critic, top_critic, publisher, date, date_int, rating_10\n"
]
}
],
"source": [
"learn, _ = rotten_tomatoes()\n",
"print('Feature names:\\n' + ', '.join(list(learn)))"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n", " | id | \n", "synopsis | \n", "rating_MPAA | \n", "genre | \n", "director | \n", "writer | \n", "theater_date | \n", "dvd_date | \n", "box_office | \n", "runtime | \n", "... | \n", "theater_date_int | \n", "review | \n", "rating | \n", "fresh | \n", "critic | \n", "top_critic | \n", "publisher | \n", "date | \n", "date_int | \n", "rating_10 | \n", "
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | \n", "830.0 | \n", "A gay New Yorker stages a marriage of convenie... | \n", "R | \n", "Art House and International | Comedy | Drama |... | \n", "Ang Lee | \n", "Ang Lee | James Schamus | Neil Peng | \n", "1993-08-04 | \n", "2004-06-15 | \n", "NaN | \n", "111.0 | \n", "... | \n", "19930804 | \n", "NaN | \n", "0.800000 | \n", "fresh | \n", "Carol Cling | \n", "0 | \n", "Las Vegas Review-Journal | \n", "2004-04-16 | \n", "20040416.0 | \n", "8.0 | \n", "
1 | \n", "1161.0 | \n", "Screenwriter Nimrod Antal makes an impressive ... | \n", "R | \n", "Action and Adventure | Art House and Internati... | \n", "NaN | \n", "NaN | \n", "2005-04-01 | \n", "2005-08-30 | \n", "116783.0 | \n", "105.0 | \n", "... | \n", "20050401 | \n", "One very long, dark ride. | \n", "0.647059 | \n", "rotten | \n", "NaN | \n", "0 | \n", "E! Online | \n", "2005-04-22 | \n", "20050422.0 | \n", "6.0 | \n", "
2 | \n", "596.0 | \n", "\"Arctic Tale\" is an epic adventure that explor... | \n", "G | \n", "Documentary | Special Interest | \n", "Adam Ravetch | Sarah Robertson | \n", "Linda Woolverton | Mose Richards | Kristin Gore | \n", "2007-08-17 | \n", "2017-08-01 | \n", "598103.0 | \n", "86.0 | \n", "... | \n", "20070817 | \n", "I'm no holdout about the reality of global war... | \n", "0.625000 | \n", "rotten | \n", "Jack Mathews | \n", "1 | \n", "New York Daily News | \n", "2007-07-27 | \n", "20070727.0 | \n", "6.0 | \n", "
3 | \n", "1585.0 | \n", "A dating doctor claims that with his services ... | \n", "PG-13 | \n", "Comedy | Romance | \n", "Andrew Tennant | Andy Tennant | \n", "Kevin Bisch | \n", "2005-02-11 | \n", "2005-06-14 | \n", "177575142.0 | \n", "120.0 | \n", "... | \n", "20050211 | \n", "... Adds up to far more than the formula typic... | \n", "0.875000 | \n", "fresh | \n", "Greg Maki | \n", "0 | \n", "Star-Democrat (Easton, MD) | \n", "2005-02-11 | \n", "20050211.0 | \n", "9.0 | \n", "
4 | \n", "603.0 | \n", "R&B; star Janet Jackson made an impressive... | \n", "R | \n", "Drama | \n", "John Singleton | \n", "John Singleton | \n", "1993-07-23 | \n", "2001-05-08 | \n", "NaN | \n", "104.0 | \n", "... | \n", "19930723 | \n", "NaN | \n", "0.600000 | \n", "fresh | \n", "Clint Morris | \n", "0 | \n", "Film Threat | \n", "2005-05-06 | \n", "20050506.0 | \n", "6.0 | \n", "
5 rows × 22 columns
\n", "\n", " | synopsis | \n", "rating_MPAA | \n", "genre | \n", "director | \n", "writer | \n", "box_office | \n", "runtime | \n", "studio | \n", "dvd_date_int | \n", "theater_date_int | \n", "review | \n", "fresh | \n", "critic | \n", "top_critic | \n", "publisher | \n", "date_int | \n", "
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | \n", "A gay New Yorker stages a marriage of convenie... | \n", "R | \n", "Art House and International | Comedy | Drama |... | \n", "Ang Lee | \n", "Ang Lee | James Schamus | Neil Peng | \n", "NaN | \n", "111.0 | \n", "\n", " | 20040615 | \n", "19930804 | \n", "\n", " | fresh | \n", "Carol Cling | \n", "0 | \n", "Las Vegas Review-Journal | \n", "20040416.0 | \n", "
1 | \n", "Screenwriter Nimrod Antal makes an impressive ... | \n", "R | \n", "Action and Adventure | Art House and Internati... | \n", "\n", " | \n", " | 116783.0 | \n", "105.0 | \n", "ThinkFilm Inc. | \n", "20050830 | \n", "20050401 | \n", "One very long, dark ride. | \n", "rotten | \n", "\n", " | 0 | \n", "E! Online | \n", "20050422.0 | \n", "
2 | \n", "\"Arctic Tale\" is an epic adventure that explor... | \n", "G | \n", "Documentary | Special Interest | \n", "Adam Ravetch | Sarah Robertson | \n", "Linda Woolverton | Mose Richards | Kristin Gore | \n", "598103.0 | \n", "86.0 | \n", "Paramount Vantage | \n", "20170801 | \n", "20070817 | \n", "I'm no holdout about the reality of global war... | \n", "rotten | \n", "Jack Mathews | \n", "1 | \n", "New York Daily News | \n", "20070727.0 | \n", "
3 | \n", "A dating doctor claims that with his services ... | \n", "PG-13 | \n", "Comedy | Romance | \n", "Andrew Tennant | Andy Tennant | \n", "Kevin Bisch | \n", "177575142.0 | \n", "120.0 | \n", "Sony Pictures | \n", "20050614 | \n", "20050211 | \n", "... Adds up to far more than the formula typic... | \n", "fresh | \n", "Greg Maki | \n", "0 | \n", "Star-Democrat (Easton, MD) | \n", "20050211.0 | \n", "
4 | \n", "R&B; star Janet Jackson made an impressive... | \n", "R | \n", "Drama | \n", "John Singleton | \n", "John Singleton | \n", "NaN | \n", "104.0 | \n", "\n", " | 20010508 | \n", "19930723 | \n", "\n", " | fresh | \n", "Clint Morris | \n", "0 | \n", "Film Threat | \n", "20050506.0 | \n", "