{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "### Instructions" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To obtain the Internet dataset used for the algorithm comparison:\n", "\n", "1) Download the `kdd_internet_usage.arff` file from http://www.cs.odu.edu/~mukka/cs795sum10dm/datasets/uci-20070111/nominal/kdd_internet_usage.arff.\n", "\n", "2) Put it in the same directory as this notebook.\n", "\n", "3) Run all the cells of this notebook in order to produce the training and test files." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": true }, "outputs": [], "source": [ "resulting_train_filename = \"train\"\n", "resulting_test_filename = \"test\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Preparing the data" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import pandas as pd\n", "import re\n", "import scipy.io.arff" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "with open(\"kdd_internet_usage.arff\", \"rb\") as fin:\n", "    data, meta = scipy.io.arff.loadarff(fin)\n", "    data = pd.DataFrame(data)" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
|  | Actual_Time | Age | Community_Building | Community_Membership_Family | Community_Membership_Hobbies | Community_Membership_None | Community_Membership_Other | Community_Membership_Political | Community_Membership_Professional | Community_Membership_Religious | ... | Web_Page_Creation | Who_Pays_for_Access_Dont_Know | Who_Pays_for_Access_Other | Who_Pays_for_Access_Parents | Who_Pays_for_Access_School | Who_Pays_for_Access_Self | Who_Pays_for_Access_Work | Willingness_to_Pay_Fees | Years_on_Internet | who |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Consultant | 41 | Equally | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | Yes | 0 | 0 | 0 | 0 | 1 | 0 | Other_sources | 1-3_yr | 93819 |
| 1 | College_Student | 28 | Equally | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | No | 0 | 0 | 0 | 0 | 1 | 0 | Already_paying | Under_6_mo | 95708 |
| 2 | Other | 25 | More | 1 | 1 | 0 | 0 | 0 | 1 | 0 | ... | Yes | 0 | 0 | 0 | 0 | 1 | 1 | Other_sources | 1-3_yr | 97218 |
| 3 | Salesperson | 28 | More | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | Yes | 0 | 0 | 0 | 0 | 1 | 0 | Already_paying | 1-3_yr | 91627 |
| 4 | K-12_Student | 17 | More | 0 | 0 | 0 | 0 | 1 | 1 | 0 | ... | Yes | 0 | 0 | 0 | 0 | 1 | 0 | Already_paying | 1-3_yr | 49906 |

5 rows × 72 columns