{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "### Instructions" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To obtain the KDD Appetency, Churn and Upselling datasets used for algorithm comparison:\n", "\n", "1) Download the `orange_small_train.data.zip` file from http://www.kdd.org/kdd-cup/view/kdd-cup-2009/Data and extract the file `orange_small_train.data`. This file contains the features shared by all three datasets.\n", "\n", "2) Download the label files:\n", "* `orange_small_train_appetency.labels` from http://www.kdd.org/cupfiles/KDDCupData/2009/orange_small_train_appetency.labels\n", "* `orange_small_train_churn.labels` from http://www.kdd.org/cupfiles/KDDCupData/2009/orange_small_train_churn.labels\n", "* `orange_small_train_upselling.labels` from http://www.kdd.org/cupfiles/KDDCupData/2009/orange_small_train_upselling.labels\n", "\n", "3) Put all the files in the same directory as this notebook.\n", "\n", "4) Run the cells of this notebook from top to bottom to produce the training and test files. They will appear in the corresponding folders." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": true }, "outputs": [], "source": [ "resulting_train_filename = \"train\"\n", "resulting_test_filename = \"test\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Preparing the data" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": true }, "outputs": [], "source": [ "data = pd.read_csv(\"./orange_small_train.data\", sep=\"\\t\")" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
|   | Var1 | Var2 | Var3 | Var4 | Var5 | Var6 | Var7 | Var8 | Var9 | Var10 | ... | Var221 | Var222 | Var223 | Var224 | Var225 | Var226 | Var227 | Var228 | Var229 | Var230 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | NaN | NaN | NaN | NaN | NaN | 1526 | 7 | NaN | NaN | NaN | ... | oslk | fXVEsaq | jySVZNlOJy | NaN | NaN | xb3V | RAYp | F2FyR07IdsN7I | NaN | NaN |
| 1 | NaN | NaN | NaN | NaN | NaN | 525 | 0 | NaN | NaN | NaN | ... | oslk | 2Kb5FSF | LM8l689qOp | NaN | NaN | fKCe | RAYp | F2FyR07IdsN7I | NaN | NaN |
| 2 | NaN | NaN | NaN | NaN | NaN | 5236 | 7 | NaN | NaN | NaN | ... | Al6ZaUT | NKv4yOc | jySVZNlOJy | NaN | kG3k | Qu4f | 02N6s8f | ib5G6X1eUxUn6 | am7c | NaN |
| 3 | NaN | NaN | NaN | NaN | NaN | NaN | 0 | NaN | NaN | NaN | ... | oslk | CE7uk3u | LM8l689qOp | NaN | NaN | FSa2 | RAYp | F2FyR07IdsN7I | NaN | NaN |
| 4 | NaN | NaN | NaN | NaN | NaN | 1029 | 7 | NaN | NaN | NaN | ... | oslk | 1J2cvxe | LM8l689qOp | NaN | kG3k | FSa2 | RAYp | F2FyR07IdsN7I | mj86 | NaN |

5 rows × 230 columns
\n", "