Name : Siva Subramanian Lakshmi Narayanan
[http://bit.ly/dtanlyst-siva]
Bank marketing data related to direct marketing campaigns of a Portuguese banking institution.
The Dataset is taken from UCI repository [https://archive.ics.uci.edu/ml/datasets/Bank+Marketing].
The dataset has 17 attributes and 45211 instances from the marketing campaign. The attributes are as follows:
General
- age (numeric)
- job: type of job (categorical: ‘admin.’, ‘blue-collar’, ‘entrepreneur’, ‘housemaid’, ‘management’, ‘retired’, ‘self-employed’, ‘services’, ‘student’, ‘technician’, ‘unemployed’, ‘unknown’)
- marital: marital status (categorical: ‘divorced’, ‘married’, ‘single’, ‘unknown’; note: ‘divorced’ means divorced or widowed)
- education (categorical: ‘basic.4y’, ‘basic.6y’, ‘basic.9y’, ‘high.school’, ‘illiterate’, ‘professional.course’, ‘university.degree’, ‘unknown’)
- default: has credit in default? (categorical: ‘no’, ‘yes’, ‘unknown’)
- housing: has housing loan? (categorical: ‘no’, ‘yes’, ‘unknown’)
- loan: has personal loan? (categorical: ‘no’, ‘yes’, ‘unknown’)

Related with the last contact of the current campaign:
- contact: contact communication type (categorical: ‘cellular’, ‘telephone’)
- month: last contact month of year (categorical: ‘jan’, ‘feb’, ‘mar’, …, ‘nov’, ‘dec’)
- day_of_week: last contact day of the week (categorical: ‘mon’, ‘tue’, ‘wed’, ‘thu’, ‘fri’)
- duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y=‘no’), yet the duration is not known before a call is performed, and after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.

Other attributes:
- campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
- pdays: number of days that passed after the client was last contacted from a previous campaign (numeric; 999 means the client was not previously contacted)
- previous: number of contacts performed before this campaign and for this client (numeric)
- poutcome: outcome of the previous marketing campaign (categorical: ‘failure’, ‘nonexistent’, ‘success’)
- y: whether the customer has subscribed to the term deposit or not (‘yes’, ‘no’)

To classify: whether the customer has subscribed to the term deposit or not.
Data related to the volume of traffic in Minneapolis-St Paul, MN. It is hourly timeseries data which includes weather and holiday features from 2012-2018.
The Dataset is taken from UCI repository [https://archive.ics.uci.edu/ml/datasets/Metro+Interstate+Traffic+Volume].
The dataset has 9 attributes and 48204 instances. The attributes are as follows:
- holiday: Categorical; US national holidays plus a regional holiday, the Minnesota State Fair
- temp: Numeric; average temperature in Kelvin
- rain_1h: Numeric; amount of rain in mm that occurred in the hour
- snow_1h: Numeric; amount of snow in mm that occurred in the hour
- clouds_all: Numeric; percentage of cloud cover
- weather_main: Categorical; short textual description of the current weather
- weather_description: Categorical; longer textual description of the current weather
- date_time: DateTime; hour of the data collected in local CST time
- traffic_volume: Numeric; hourly I-94 ATR 301 reported westbound traffic volume

Loading the data
setwd("~/r_projects/Advanced_Analytics/Classification_new")
bankData <- read.csv(file = 'data/bank-full.csv',header = TRUE, sep = ";")
bankData <- na.omit(bankData)

| age | education | job | marital | balance | y |
|---|---|---|---|---|---|
| 58 | tertiary | management | married | 2143 | no |
| 44 | secondary | technician | single | 29 | no |
| 33 | secondary | entrepreneur | married | 2 | no |
| 47 | unknown | blue-collar | married | 1506 | no |
| 33 | unknown | unknown | single | 1 | no |
| 35 | tertiary | management | married | 231 | no |
Distribution of output of dataset
Creating a function that, for a given column, tabulates how many customers fall into each output value (yes or no); these counts are plotted later to inspect the relation between the column and the output.
relation_bw_col_and_output <- function (column_name) {
column_name_enq <- enquo(column_name)
y_val_data <- bankData %>%
group_by(!!column_name_enq) %>%
summarise(yes_value = sum(y=="yes"), no_value = sum(y=="no"))
return(y_val_data)
}

Analysing the continuous variables in the dataset and converting them to categorical variables based on the relation between each variable and the output.
Age for both the subscribed and non-subscribed customers is roughly normally distributed, peaking between 25 and 50. Hence categorizing age as 0 [<25], 1 [25-60], 2 [>60] and converting it to a factor.
clean_bank_data <- bankData %>% mutate(age_cat = case_when(age < 25 ~ 0,
age >= 25 & age <= 60 ~ 1,
age > 60 ~ 2))
clean_bank_data$age_cat = factor(clean_bank_data$age_cat)

Customers of both classes are concentrated mostly between balances of -2000 and 25000. Categorizing balance into five categories: < -2000, [-2000, 0), 0, (0, 25000), > 25000.
clean_bank_data <- clean_bank_data %>% mutate(bal_cat = case_when(balance < -2000 ~ 0,
balance >= -2000 & balance < 0 ~ 1,
balance == 0 ~ 2,
balance > 0 & balance < 25000 ~ 3,
balance > 25000 ~ 4))
clean_bank_data$bal_cat = factor(clean_bank_data$bal_cat)

As the values are distributed throughout, standardizing the day column to unit standard deviation.
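The standardization step itself is not shown above; a minimal sketch, assuming base R's scale() is applied to the day column:

# Sketch (assumption): scale the day column to zero mean and unit standard deviation
clean_bank_data$day <- as.numeric(scale(clean_bank_data$day))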
Duration: As there is no common pattern in the duration column, no changes have been made.
Campaign: As the graph is very flat, this column has no clear effect on the output; hence removing the campaign column.
Pdays
Min. 1st Qu. Median Mean 3rd Qu. Max.
-1.0 -1.0 -1.0 40.2 -1.0 871.0
y_val_data_pdays <- relation_bw_col_and_output(pdays)
ggplot(y_val_data_pdays, aes_string(x = "pdays")) +
geom_point(aes_string(y="yes_value"), color = "steelblue")+
geom_point(aes_string(y = "no_value"), color="darkred") + xlim(0,65)Same as campaign column, removing pdays column also.
clean_bank_data <- clean_bank_data %>% mutate(pre_cat = case_when(previous < 5 ~ 0,
previous >= 5 & previous < 10 ~ 1,
previous >= 10 & previous < 20 ~ 2,
previous >= 20 ~ 3))

Bucketing the previous column into the pre_cat categories above; the original previous column is dropped in the next step.
Leaving the categorical variables as is and removing the original columns that were replaced by the newly mutated ones.
clean_bank_data <- clean_bank_data[, !(colnames(clean_bank_data) %in% c("age","bal", "previous"))]
clean_bank_data <- clean_bank_data%>%select(-y,y)
clean_bank_data <- na.omit(clean_bank_data)

The clean_bank_data variable is now ready for model creation.
A decision tree is a flowchart-based supervised machine learning algorithm that shows the possible outcomes of a series of decisions. It can be used for both regression and classification problems. A primary advantage of a decision tree is that it is easy to follow and understand. A decision tree is constructed from decision nodes and leaf nodes: decision nodes test an attribute and have multiple branches, whereas leaf nodes are the outcomes of those decisions and do not contain any further branches.
From clean_bank_data, training and testing datasets are prepared for the decision tree model.
set.seed(12)
training.samples <- clean_bank_data$y %>%
createDataPartition(p = 0.8, list = FALSE)
train.data <- clean_bank_data[training.samples, ]
test.data <- clean_bank_data[-training.samples, ]

The decision tree model is created with the rpart function, using method = "class", which is used for classification.
set.seed(12)
model1 <- rpart(y ~., data = train.data, method = "class")
par(xpd = NA) # Avoid clipping the text in some device
rpart.plot(model1,digits = 4, fallen.leaves = TRUE,
type = 3, extra = 101)

predicted.classes <- model1 %>%
predict(test.data, type = "class")
mean(predicted.classes == test.data$y)

[1] 0.9023338
Classification tree:
rpart(formula = y ~ ., data = train.data, method = "class")
Variables actually used in tree construction:
[1] duration poutcome
Root node error: 4232/36170 = 0.117
n= 36170
CP nsplit rel error xerror xstd
1 0.036547 0 1.00000 1.00000 0.014445
2 0.025284 3 0.89036 0.91848 0.013918
3 0.014887 4 0.86508 0.87405 0.013617
4 0.010000 5 0.85019 0.86626 0.013563
Pruning the decision tree model with the cp value corresponding to the lowest cross-validated error (xerror).
cp <- which.min(model1$cptable[,'xerror'])
cpt <- model1$cptable[cp,'CP']
model1.pruned <- prune(model1,cpt)
plot(model1.pruned)
text(model1.pruned)

predicted.pruned <- predict(model1.pruned, test.data, type = 'class')
mean(predicted.pruned == test.data$y)

[1] 0.9023338
K-Nearest Neighbors (KNN) is a supervised machine learning algorithm used in regression and classification problems. The KNN algorithm classifies data based on similarity measures and assigns a class to new data accordingly. Similarity is measured with a distance function (e.g., Euclidean distance). A new point is assigned to the class that is most common among its k nearest neighbours, where the choice of k depends on the dataset.
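As a plain illustration of this idea (not part of the report's pipeline; all names below are hypothetical), a single point can be classified by computing Euclidean distances and taking a majority vote among the k closest training rows:

# Hypothetical illustration of KNN: Euclidean distance plus majority vote
knn_one <- function(train_x, train_y, new_x, k) {
diffs <- sweep(as.matrix(train_x), 2, as.numeric(new_x))   # subtract the new point from every row
dists <- sqrt(rowSums(diffs^2))                            # Euclidean distance to each training row
nearest <- order(dists)[1:k]                               # indices of the k closest rows
names(which.max(table(train_y[nearest])))                  # majority class among those neighbours
}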
A sample of clean_bank_data:
| age | education | job | marital | balance | y |
|---|---|---|---|---|---|
| 58 | tertiary | management | married | 2143 | no |
| 44 | secondary | technician | single | 29 | no |
| 33 | secondary | entrepreneur | married | 2 | no |
| 47 | unknown | blue-collar | married | 1506 | no |
| 33 | unknown | unknown | single | 1 | no |
| 35 | tertiary | management | married | 231 | no |
Converting the categorical variables to one hot encoding and updating the column names.
bankData.onehot = one_hot(as.data.table(clean_bank_data))
bankData.onehot <- subset(bankData.onehot,select = -c(loan_no,housing_no,default_no,y_no))
colnames(bankData.onehot)[colnames(bankData.onehot) == 'loan_yes'] <- 'loan'
colnames(bankData.onehot)[colnames(bankData.onehot) == 'housing_yes'] <- 'housing'
colnames(bankData.onehot)[colnames(bankData.onehot) == 'default_yes'] <- 'default'
colnames(bankData.onehot)[colnames(bankData.onehot) == 'y_yes'] <- 'y'

| age_cat_0 | education_primary | job_admin. | marital_divorced | balance | y |
|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 2143 | 0 |
| 0 | 0 | 0 | 0 | 29 | 0 |
| 0 | 0 | 0 | 0 | 2 | 0 |
| 0 | 0 | 0 | 0 | 1506 | 0 |
| 0 | 0 | 0 | 0 | 1 | 0 |
| 0 | 0 | 0 | 0 | 231 | 0 |
Defining a normalize function and applying it to the dataset.
normalize <- function(x) {
return ((x - min(x)) / (max(x) - min(x)))
}
bankData.onehot <- as.data.frame(lapply(bankData.onehot,normalize))

Splitting into training and testing datasets: 80% of the data goes to training and 20% to testing.
set.seed(123)
training.samples <- bankData.onehot$y %>%
createDataPartition(p = 0.8, list = FALSE)
training_set <- bankData.onehot[training.samples, ]
testing_set <- bankData.onehot[-training.samples, ]
train.y <- training_set$y
train.data <- subset(training_set,select = -y)
test.y <- testing_set$y
test.data <- subset(testing_set, select = -y)

A simple way to start when defining k is to set it to the square root of the total number of rows in the dataset. Defining k as the square root of the total number of rows.
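The chunk producing the printout below is not shown; a minimal sketch, assuming k is taken as the rounded square root of the training-set size:

# Sketch (assumption): pick a starting k from the square root of the row count
total_rows <- nrow(train.data)
k_start <- round(sqrt(total_rows))   # about 190 for 36169 rows
print(paste("Total Rows -", total_rows))
print(paste("Square root of Total Rows -", sqrt(total_rows)))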
[1] "Total Rows - 36169"
[1] "Square root of Total Rows - 190.181492264626"
Defining a function to calculate the accuracy of the model and applying the KNN algorithm to the dataset with k = 190. Two models are created, one predicting on the training data and one on the testing data, to compare the accuracy on the trained dataset and on the testing dataset.
#Function to calculate accuracy
accuracy_calc <- function(prediction, true_val){
acc <- 100 * sum(true_val == prediction)/NROW(true_val)
return (acc)
}
knn.k190_test <- knn(train = train.data, test = test.data, cl = train.y, k = 190) # ~35 s

It takes around 2 minutes 30 seconds to build the model when the training data itself is used as the test set, and around 35 seconds when the held-out test partition is used. Only the test-partition model is included here, as the cross validation that follows already takes much longer to build and knit.
[1] "Accuracy of testing dataset is - 88.8077858880779"
Simple confusion matrix of the resultant model:
| | 0 | 1 |
|---|---|---|
| 0 | 7974 | 990 |
| 1 | 22 | 56 |
| | 0 | 1 | Total | n |
|---|---|---|---|---|
| 0 | 89.0 | 11.0 | 100 | 8964 |
| 1 | 28.2 | 71.8 | 100 | 78 |
| All | 88.4 | 11.6 | 100 | 9042 |
| | 0 | 1 | All |
|---|---|---|---|
| 0 | 99.7 | 94.6 | 99.1 |
| 1 | 0.3 | 5.4 | 0.9 |
| Total | 100.0 | 100.0 | 100.0 |
| n | 7996.0 | 1046.0 | 9042.0 |
| | 0 | 1 |
|---|---|---|
| 0 | 0.53 | -1.46 |
| 1 | -5.66 | 15.64 |
X-squared = 273.0641, df = 1, p = < 2.2e-16
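The code producing these tables is not shown; the layout (counts, row and column percentages, and a chi-squared statistic) resembles gmodels::CrossTable, so a hedged sketch might look like this (the orientation of rows and columns is an assumption):

# Sketch (assumption): cross-tabulate KNN predictions (rows) against true labels (columns)
library(gmodels)
CrossTable(x = knn.k190_test, y = test.y, chisq = TRUE)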
To test the accuracy, cross validation has been done on the dataset to see whether the accuracy can be improved by changing k in the KNN. Creating a function that defines the folds and performs cross validation for the given dataset.
set.seed(123)
folds = seq.int(nrow(training_set)) %>%
cut(breaks = 10, labels=FALSE) %>% # cut() slices the ranges into equal intervals
sample
cross_validate_folds <- function(chunkid, folddef, Xdat, Ydat, k){
train = (folddef!=chunkid)# training index
Xtr = Xdat[train,] # training set by the index
Ytr = Ydat[train] # true label in training set
Xvl = Xdat[!train,] # test set
Yvl = Ydat[!train] # true label in test set
predYvl = knn(train = Xtr, test = Xvl, cl = Ytr, k = k) # predict test labels
data.frame(fold = chunkid, # k folds
val.acc = accuracy_calc(predYvl, Yvl), neighbors = k) # test error per fold
}

| fold | val.acc | neighbors |
|---|---|---|
| 1 | 89.21758 | 220 |
| 2 | 88.63699 | 220 |
| 3 | 88.11169 | 220 |
| 4 | 88.94111 | 220 |
| 5 | 88.66464 | 220 |
| 6 | 87.88717 | 220 |
| 7 | 88.96876 | 220 |
| 8 | 88.16699 | 220 |
| 9 | 88.13934 | 220 |
| 10 | 89.13464 | 220 |
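The loop that generated the fold results above is not shown; a minimal sketch, assuming purrr's map_dfr is applied over the ten folds and a grid of candidate k values (the exact grid used in the report is an assumption):

# Sketch (assumption): run the cross-validation function for each fold and each candidate k
library(purrr)
k_grid <- seq(190, 230, by = 10)   # candidate k values; the grid actually used is not shown
cv_results <- map_dfr(k_grid, function(k)
map_dfr(1:10, cross_validate_folds,
folddef = folds, Xdat = train.data, Ydat = train.y, k = k))
cv_results %>% group_by(neighbors) %>% summarise(mean_acc = mean(val.acc))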
For all values of k from 190 to 230, the accuracy does not change much; all accuracy values are around 88.7%.
Loading the data
setwd("~/r_projects/Advanced_Analytics/regression")
traffic <- read.csv(file = 'data.csv',header = TRUE, sep = ",")

| holiday | temp | rain_1h | snow_1h | clouds_all | weather_main | weather_description | date_time | traffic_volume |
|---|---|---|---|---|---|---|---|---|
| None | 288.28 | 0 | 0 | 40 | Clouds | scattered clouds | 2012-10-02 09:00:00 | 5545 |
| None | 289.36 | 0 | 0 | 75 | Clouds | broken clouds | 2012-10-02 10:00:00 | 4516 |
| None | 289.58 | 0 | 0 | 90 | Clouds | overcast clouds | 2012-10-02 11:00:00 | 4767 |
| None | 290.13 | 0 | 0 | 90 | Clouds | overcast clouds | 2012-10-02 12:00:00 | 5026 |
| None | 291.14 | 0 | 0 | 75 | Clouds | broken clouds | 2012-10-02 13:00:00 | 4918 |
| None | 291.72 | 0 | 0 | 1 | Clear | sky is clear | 2012-10-02 14:00:00 | 5181 |
Analysing continuous variables with the output
1.Temperature
Temperature is on the Kelvin scale, so a value of 0 K makes no sense. Hence converting 0 K readings to the minimum temperature recorded above 0 K.
[1] 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
[11] 243.39 243.62 244.22 244.82 244.82
#Converting 0K to 243.39
traffic$temp[traffic$temp == 0] <- 243.39
ggplot(traffic, aes(x = temp, y = traffic_volume)) +
geom_point() +
stat_smooth()

| | holiday | temp | rain_1h | snow_1h | clouds_all | weather_main | weather_description | date_time | traffic_volume |
|---|---|---|---|---|---|---|---|---|---|
| 24873 | None | 302.11 | 9831.3 | 0 | 75 | Rain | very heavy rain | 2016-07-11 17:00:00 | 5535 |
[1] 9831.30 55.63 44.45 31.75 28.70
Rainfall is in mm. More than 9800 mm of rainfall is not possible within a one-hour window, so this is likely a data entry mistake. Hence removing that one row.
traffic<-traffic[!(traffic$rain_1h > 100),]
ggplot(traffic, aes(x = rain_1h, y = traffic_volume)) +
geom_point() +
stat_smooth()

Data points are distributed throughout. Hence no changes needed.
Data points are distributed throughout. Hence no changes needed.
Analysing categorical variables
| mean(traffic_volume) |
|---|
| 865.4426 |
Traffic volume values on holidays fall between 500 and 1300; hence converting the holiday value 'None' to 0 and actual holiday days to 1.
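The conversion itself is not shown; given the hol1 coefficient in the model summary later on, a hedged sketch could be:

# Sketch (assumption): encode holiday as a binary factor (0 = None, 1 = any holiday)
traffic$hol <- factor(ifelse(traffic$holiday == "None", 0, 1))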
4.Weather_main and Weather Description
Converting weather_main and weather_description to one-hot encoding, since a single column with more than two numeric codes would otherwise be treated as a continuous variable.
for(unique_value in unique(traffic$weather_main)){
traffic[paste("weather_main", unique_value, sep = ".")] <- ifelse(traffic$weather_main == unique_value, 1, 0)
}
for(unique_value in unique(traffic$weather_description)){
traffic[paste("weather_description", unique_value, sep = ".")] <- ifelse(traffic$weather_description == unique_value, 1, 0)
}

5.Date_time
Extracting the month, day, and hour from the date_time column.
traffic$date_time <- parse_date_time(traffic$date_time,'ymd HMS!')
traffic$Date <- as.Date(traffic$date_time)
traffic$Time <- format(traffic$date_time,"%H:%M:%S")
traffic$month = month(as.POSIXlt(traffic$Date, format="%d/%m/%Y"))
traffic$day = day(as.POSIXlt(traffic$Date, format="%d/%m/%Y"))
traffic$hour = format(strptime(traffic$Time,"%H:%M:%S"),'%H')

Removing the old columns.
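The cleanup code is not shown; a minimal sketch, assuming the original text columns and the intermediate date fields are the ones dropped:

# Sketch (assumption): drop the columns replaced by the encodings and date features above
traffic <- subset(traffic, select = -c(holiday, weather_main, weather_description, date_time, Date, Time))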
The traffic variable is now ready for model creation.
Creating training and testing datasets from the traffic variable.
training.samples <- traffic$traffic_volume %>%
createDataPartition(p = 0.8, list = FALSE)
training_set <- traffic[training.samples, ]
testing_set <- traffic[-training.samples, ]
test.y <- testing_set$traffic_volume
test.data <- subset(testing_set, select = -traffic_volume)

Creating the linear model for the training set.
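The model fit itself is not shown; based on the Call line in the summary below, it presumably looks like this (the object name model_lm is an assumption):

# Fit the linear model on the training partition (object name assumed)
model_lm <- lm(traffic_volume ~ ., data = training_set)
summary(model_lm)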
Summary of the linear model created.
Call:
lm(formula = traffic_volume ~ ., data = training_set)
Residuals:
Min 1Q Median 3Q Max
-5331.7 -302.6 75.4 510.4 2592.1
Coefficients: (12 not defined because of singularities)
Estimate Std. Error
(Intercept) -1834.0967 300.2753
temp 7.8732 0.4328
rain_1h -44.1774 5.8587
snow_1h 356.8622 555.6198
clouds_all -1.3751 0.2431
hol1 51.5585 137.8993
weather_main.Clouds 636.1517 272.9984
weather_main.Clear 622.0619 273.2993
weather_main.Rain 688.2536 406.4399
weather_main.Drizzle 545.8281 501.2342
weather_main.Mist 617.6563 272.1679
weather_main.Haze 590.3310 273.3686
weather_main.Fog 454.4332 274.0912
weather_main.Thunderstorm -1503.4340 719.5210
weather_main.Snow 106.6959 607.6785
t value Pr(>|t|)
(Intercept) -6.108 1.02e-09 ***
temp 18.192 < 2e-16 ***
rain_1h -7.540 4.79e-14 ***
snow_1h 0.642 0.520697
clouds_all -5.656 1.56e-08 ***
hol1 0.374 0.708492
weather_main.Clouds 2.330 0.019799 *
weather_main.Clear 2.276 0.022844 *
weather_main.Rain 1.693 0.090393 .
weather_main.Drizzle 1.089 0.276175
weather_main.Mist 2.269 0.023250 *
weather_main.Haze 2.159 0.030820 *
weather_main.Fog 1.658 0.097333 .
weather_main.Thunderstorm -2.089 0.036670 *
weather_main.Snow 0.176 0.860625
[ reached getOption("max.print") -- omitted 65 rows ]
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 941.1 on 38497 degrees of freedom
Multiple R-squared: 0.7761, Adjusted R-squared: 0.7757
F-statistic: 1991 on 67 and 38497 DF, p-value: < 2.2e-16
Predicting the testing dataset with the model created from the training dataset. The accuracy of the prediction is assessed by the correlation between the actual and predicted values.
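Neither step is shown in the report; a minimal sketch, assuming the fitted object is called model_lm as above:

# Sketch (assumption): predict on the held-out partition and correlate with the actual values
pred.y <- predict(model_lm, newdata = test.data)
cor(test.y, pred.y)   # correlation between actual and predicted traffic volumes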
Learnings and Challenges
- Before settling on the caret library for data partitioning and model-fitting functions, I tried to use the RWeka library, but the installation process was arduous: I faced many issues with the Java package and finally downgraded Java to solve them. I later switched to caret functions for all dataset-related work.
- The Inf keyword in R refers to infinity. While calculating the mean absolute percentage error (MAPE) for the linear regression, the output came back as Inf, and it took significant time to realize that some of the actual and predicted values were 0 [zero]. I then calculated MAPE only over the non-zero values.
- Accuracy calculation methods for classification and regression differ a lot. At first I used the same accuracy_calc method created for classification on the linear model, and the resulting accuracy was around 0.07%. Seeing that result helped me understand that the way accuracy is computed for classification makes no sense in regression.
- Cross validation for the KNN algorithm initially looped over 30 different k values; it took more than 30 minutes to complete and consumed almost 70% of the machine's RAM. I then changed it to 10 different k values.