Data Sets

Classification

Bank Marketing Data related to marketing campaign by a Portugese banking institution.
The Dataset is taken from UCI repository [https://archive.ics.uci.edu/ml/datasets/Bank+Marketing].

The dataset has 17 attibutes and 45211 instances from the marketing campaign. The attributes are as follows:
General

  • age (numeric)
  • job : type of job (categorical: ‘admin.’,‘blue-collar’,‘entrepreneur’,‘housemaid’,‘management’,‘retired’,‘self-employed’,‘services’,‘student’,‘technician’,‘unemployed’,‘unknown’)
  • marital : marital status (categorical: ‘divorced’,‘married’,‘single’,‘unknown’; note: ‘divorced’ means divorced or widowed)
  • education (categorical: ‘basic.4y’,‘basic.6y’,‘basic.9y’,‘high.school’,‘illiterate’,‘professional.course’,‘university.degree’,‘unknown’)
  • default: has credit in default? (categorical: ‘no’,‘yes’,‘unknown’)
  • housing: has housing loan? (categorical: ‘no’,‘yes’,‘unknown’)
  • loan: has personal loan? (categorical: ‘no’,‘yes’,‘unknown’)

Related with the last contact of the current campaign:

  • contact: contact communication type (categorical: ‘cellular’,‘telephone’)
  • month: last contact month of year (categorical: ‘jan’, ‘feb’, ‘mar’, …, ‘nov’, ‘dec’)
  • day_of_week: last contact day of the week (categorical: ‘mon’,‘tue’,‘wed’,‘thu’,‘fri’)
  • duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y=‘no’). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.

other attributes:

  • campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
  • pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)
  • previous: number of contacts performed before this campaign and for this client (numeric)
  • poutcome: outcome of the previous marketing campaign (categorical: ‘failure’,‘nonexistent’,‘success’)
  • y: customer is subscribed to the term deposit or not (“yes”,“no”)

To Classify: Whether the customer is subscribed for term deposit or not.

Regression

Data related to the volume of traffic in Minneapolis-St Paul, MN. It is hourly timeseries data which includes weather and holiday features from 2012-2018.
The Dataset is taken from UCI repository [https://archive.ics.uci.edu/ml/datasets/Metro+Interstate+Traffic+Volume].

The dataset has 9 attibutes and 48204 instances from the marketing campaign. The attributes are as follows:

  • holiday Categorical US National holidays plus regional holiday, Minnesota State Fair
  • temp Numeric Average temp in kelvin
  • rain_1h Numeric Amount in mm of rain that occurred in the hour
  • snow_1h Numeric Amount in mm of snow that occurred in the hour
  • clouds_all Numeric Percentage of cloud cover
  • weather_main Categorical Short textual description of the current weather
  • weather_description Categorical Longer textual description of the current weather
  • date_time DateTime Hour of the data collected in local CST time
  • traffic_volume Numeric Hourly I-94 ATR 301 reported westbound traffic volume

Classification

Data Cleaning

Loading the data

age education job marital balance y
58 tertiary management married 2143 no
44 secondary technician single 29 no
33 secondary entrepreneur married 2 no
47 unknown blue-collar married 1506 no
33 unknown unknown single 1 no
35 tertiary management married 231 no

Distribution of output of dataset

Creating a Function plotting the correlation for a column with the given y-value (yes or no)

Analysing the continuous variables in the dataset and converting it to a categorical variable based on the relation between the variable and the output.

  1. Age

Both the term deposit subscribed and non subscribed customers is normally distributed with peak from age 25-50. Hence categorizing age into 0 [<25], 1[25-60],2[>60] and making it to a factor type.

  1. Balance

Customers of both classes are concentrated more in between the balance -2000 and 25000. Categorizing into five categories <-2000, (-2000,0), 0, (0,25000), >25000

  1. Day

As the values are distributed throughout, standardizing the day column to unit standard deviation

  1. Duration As there is no common pattern in duration column, no changes has been made.

  2. Campaign As the graph is so flat, this column does not have effect in the output, hence removing the campaign column.

  3. Pdays

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   -1.0    -1.0    -1.0    40.2    -1.0   871.0 

Same as campaign column, removing pdays column also.

  1. Previous As the graph is zoomed to get the pattern, the column is divided into 4 categories of 0[<5],1[5,10],2[10,20],3[>20].

As the graph is so flat, this column does not have effect in the output, hence removing the campaign column.

Leaving the categorical variable as is and removing the old columns of newly mutated columns.

The clean_bank_data variable is now ready for model creation.

Decision Tree

Decision tree is a flowchart based supervised machine learning algorithm that shows the various outcomes from a series of decisions. It can be used as for both regression and classification problems. A primary advantage for using a decision tree is that it is easy to follow and understand. Decision tree is constructed with leaf node and decision node. Decision nodes are used to make any decision and have multiple branches, whereas Leaf nodes are the output of those decisions and do not contain any further branches.

From the clean_bank_data, training and testing dataset is prepared for the Decision tree model.

Decision tree model is created with rpart function with method "Class" which is used for classification.

[1] 0.9023338

Classification tree:
rpart(formula = y ~ ., data = train.data, method = "class")

Variables actually used in tree construction:
[1] duration poutcome

Root node error: 4232/36170 = 0.117

n= 36170 

        CP nsplit rel error  xerror     xstd
1 0.036547      0   1.00000 1.00000 0.014445
2 0.025284      3   0.89036 0.91848 0.013918
3 0.014887      4   0.86508 0.87405 0.013617
4 0.010000      5   0.85019 0.86626 0.013563

Pruning the Decision tree model with the least cp value from the model.

[1] 0.9023338

KNN Algorithm

K-Nearest Neighbors is a supervised machine learning algorithm used in regression and classification problems. KNN algorithm classifies the data based on the similarity measures and identifies the class for new data. Similarity measures is identified by distance function [e.g. Euclidean Distance]. The data is assigned to the class which has the k nearest neighbors where k depends on the dataset.

From clean_bank_data,
age education job marital balance y
58 tertiary management married 2143 no
44 secondary technician single 29 no
33 secondary entrepreneur married 2 no
47 unknown blue-collar married 1506 no
33 unknown unknown single 1 no
35 tertiary management married 231 no

Converting the categorical variables to one hot encoding and updating the column names.

One hotted dataset for the KNN algorithm.
age_cat_0 education_primary job_admin. marital_divorced balance y
0 0 0 0 2143 0
0 0 0 0 29 0
0 0 0 0 2 0
0 0 0 0 1506 0
0 0 0 0 1 0
0 0 0 0 231 0

Defining Normalize function and Applying to the Dataset.

Splitting Training and Testing Dataset. Giving 80% of data to training and 20% data to testing.

Best and simple way to start with defining k is setting the value to the square root of total number of rows of the dataset. Defining K with square root of total number rows.

[1] "Total Rows - 36169"
[1] "Square root of Total Rows - 190.181492264626"

Defining the Function to calculate accuracy of the model and Applying knn algorithm for the dataset with k = 190. Two models are being created with testing dataset as training data and testing data to know the accuracy of model in trained dataset and testing dataset.

It takes around 2 mins 30 secs for the model to get created with training data as testing dataset. And It takes around 35 secs for the model to get created with test data from partition as testing dataset. Added only testing data creation as the cross validation following much longer time in building and knitting.

[1] "Accuracy of testing dataset is - 88.8077858880779"
Simple confusion matrix of the resultant model:
0 1
0 7974 990
1 22 56
0 1 Total n
0 89.0 11.0 100 8964
1 28.2 71.8 100 78
All 88.4 11.6 100 9042
0 1 All
0 99.7 94.6 99.1
1 0.3 5.4 0.9
Total 100.0 100.0 100.0
n 7996.0 1046.0 9042.0
0 1
0 0.53 -1.46
1 -5.66 15.64

X-squared = 273.0641, df = 1, p = < 2.2e-16

To test the accuracy, Cross validation is been done on the dataset to know that if the accuracy can be improved by changing K in the KNN. Creating a Function to define folds and cross validation for the given dataset.

fold val.acc neighbors
1 89.21758 220
2 88.63699 220
3 88.11169 220
4 88.94111 220
5 88.66464 220
6 87.88717 220
7 88.96876 220
8 88.16699 220
9 88.13934 220
10 89.13464 220

For all the k's in from 190 to 230, accuracy value does not change much. All the accuracy value are around 88.7%.

Regression

Data Cleaning

Loading the data

holiday temp rain_1h snow_1h clouds_all weather_main weather_description date_time traffic_volume
None 288.28 0 0 40 Clouds scattered clouds 2012-10-02 09:00:00 5545
None 289.36 0 0 75 Clouds broken clouds 2012-10-02 10:00:00 4516
None 289.58 0 0 90 Clouds overcast clouds 2012-10-02 11:00:00 4767
None 290.13 0 0 90 Clouds overcast clouds 2012-10-02 12:00:00 5026
None 291.14 0 0 75 Clouds broken clouds 2012-10-02 13:00:00 4918
None 291.72 0 0 1 Clear sky is clear 2012-10-02 14:00:00 5181

Analysing continuous variables with the output

1.Temperature

Temperature is in kelvin scale. 0 K value makes no sense. Hence converting 0K to minimum temperature recorded above 0K

 [1]   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00
[11] 243.39 243.62 244.22 244.82 244.82

  1. Rain
holiday temp rain_1h snow_1h clouds_all weather_main weather_description date_time traffic_volume
24873 None 302.11 9831.3 0 75 Rain very heavy rain 2016-07-11 17:00:00 5535
[1] 9831.30   55.63   44.45   31.75   28.70

Rainfall is in mm scale. 9800+ mm of rainfall is not possible for a given time frame of one hour. This can be a mistake. Hence removing that one column.

  1. Snow

Data points are distributed throughout. Hence no changes needed.

  1. Clouds

Data points are distributed throughout. Hence no changes needed.

Analysing categorical variables

  1. Holiday
mean(traffic_volume)
865.4426

values in holidays falls between 500 and 1300, Hence converting the holiday - None as 0 and holiday days as 1.

4.Weather_main and Weather Description

Converting weather main and weather description to one_hot encoding as the column with more than 2 numerical values will be taken as continuous variable.

As the graph is so flat, this column does not have effect in the output, hence removing the campaign column.

  1. Date and Time column
    Converting time column to numeric of 0-23 for each hour

Removing the old columns

The traffic variable is now ready for model creation.

Linear Regression

Creating training and testing datasets from the traffic variable.

Creating Linear model for the training set.

Summary of the linear model created.


Call:
lm(formula = traffic_volume ~ ., data = training_set)

Residuals:
    Min      1Q  Median      3Q     Max 
-5331.7  -302.6    75.4   510.4  2592.1 

Coefficients: (12 not defined because of singularities)
                                                            Estimate Std. Error
(Intercept)                                               -1834.0967   300.2753
temp                                                          7.8732     0.4328
rain_1h                                                     -44.1774     5.8587
snow_1h                                                     356.8622   555.6198
clouds_all                                                   -1.3751     0.2431
hol1                                                         51.5585   137.8993
weather_main.Clouds                                         636.1517   272.9984
weather_main.Clear                                          622.0619   273.2993
weather_main.Rain                                           688.2536   406.4399
weather_main.Drizzle                                        545.8281   501.2342
weather_main.Mist                                           617.6563   272.1679
weather_main.Haze                                           590.3310   273.3686
weather_main.Fog                                            454.4332   274.0912
weather_main.Thunderstorm                                 -1503.4340   719.5210
weather_main.Snow                                           106.6959   607.6785
                                                          t value Pr(>|t|)    
(Intercept)                                                -6.108 1.02e-09 ***
temp                                                       18.192  < 2e-16 ***
rain_1h                                                    -7.540 4.79e-14 ***
snow_1h                                                     0.642 0.520697    
clouds_all                                                 -5.656 1.56e-08 ***
hol1                                                        0.374 0.708492    
weather_main.Clouds                                         2.330 0.019799 *  
weather_main.Clear                                          2.276 0.022844 *  
weather_main.Rain                                           1.693 0.090393 .  
weather_main.Drizzle                                        1.089 0.276175    
weather_main.Mist                                           2.269 0.023250 *  
weather_main.Haze                                           2.159 0.030820 *  
weather_main.Fog                                            1.658 0.097333 .  
weather_main.Thunderstorm                                  -2.089 0.036670 *  
weather_main.Snow                                           0.176 0.860625    
 [ reached getOption("max.print") -- omitted 65 rows ]
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 941.1 on 38497 degrees of freedom
Multiple R-squared:  0.7761,    Adjusted R-squared:  0.7757 
F-statistic:  1991 on 67 and 38497 DF,  p-value: < 2.2e-16

Predicting testing dataset with the model created from the training dataset.Accuracy of the predicted model is found by the correlation factor between actual and predicted

Discussion

Learnings and Challenges

  • Before using caret library for data partitioning and model fit functions, i tried to use RWeka library. But the installation process was arduous. Faced many issues in Java package and finally downgraded the java to solve the issue. Later switched to caret library functions for all dataset related functions.
  • Inf keyword is r refers to Infinity. It took significant time to realize that in the linear regression, to calculate Mean absolute percentage error (MAPE).The output responded with Inf. Later realized that some of the values in actual and predicted values has 0 [zero] in it. Then calculated MAPE for non-zero values.
  • Accuracy calculation methods for classification and regression differs a lot. At first, i used same accuracy_calc method created for classification for linear model. The accuracy percentage resulted was around 0.07%. I had an idea that the process of finding classification and regression. But after getting to see the results helped me understand that the process of accuracy finding in classification makes no sense in regression.
  • I was facing major challenge when performing Cross validation for knn algorithm with 30 different k values in loop. It took more than 30 minutes to complete, consuming almost 70% of the machines RAM. Then changed to 10 different k values.