Name : Siva Subramanian Lakshmi Narayanan
[http://bit.ly/dtanlyst-siva]
Bank marketing data related to direct marketing campaigns of a Portuguese banking institution.
The Dataset is taken from UCI repository [https://archive.ics.uci.edu/ml/datasets/Bank+Marketing].
The dataset has 17 attributes and 45211 instances from the marketing campaign. The attributes are as follows:
General
- age (numeric)
- job: type of job (categorical: ‘admin.’, ‘blue-collar’, ‘entrepreneur’, ‘housemaid’, ‘management’, ‘retired’, ‘self-employed’, ‘services’, ‘student’, ‘technician’, ‘unemployed’, ‘unknown’)
- marital: marital status (categorical: ‘divorced’, ‘married’, ‘single’, ‘unknown’; note: ‘divorced’ means divorced or widowed)
- education (categorical: ‘basic.4y’, ‘basic.6y’, ‘basic.9y’, ‘high.school’, ‘illiterate’, ‘professional.course’, ‘university.degree’, ‘unknown’)
- default: has credit in default? (categorical: ‘no’, ‘yes’, ‘unknown’)
- housing: has housing loan? (categorical: ‘no’, ‘yes’, ‘unknown’)
- loan: has personal loan? (categorical: ‘no’, ‘yes’, ‘unknown’)

Related with the last contact of the current campaign:
- contact: contact communication type (categorical: ‘cellular’, ‘telephone’)
- month: last contact month of year (categorical: ‘jan’, ‘feb’, ‘mar’, …, ‘nov’, ‘dec’)
- day_of_week: last contact day of the week (categorical: ‘mon’, ‘tue’, ‘wed’, ‘thu’, ‘fri’)
- duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y=‘no’), yet the duration is not known before a call is performed, and after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.

Other attributes:
- campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
- pdays: number of days that passed after the client was last contacted from a previous campaign (numeric; 999 means the client was not previously contacted)
- previous: number of contacts performed before this campaign and for this client (numeric)
- poutcome: outcome of the previous marketing campaign (categorical: ‘failure’, ‘nonexistent’, ‘success’)
- y: whether the customer has subscribed to the term deposit or not (‘yes’, ‘no’)

To classify: whether the customer has subscribed to the term deposit or not.
Data related to the volume of traffic in Minneapolis-St Paul, MN. It is hourly timeseries data which includes weather and holiday features from 2012-2018.
The Dataset is taken from UCI repository [https://archive.ics.uci.edu/ml/datasets/Metro+Interstate+Traffic+Volume].
The dataset has 9 attributes and 48204 instances. The attributes are as follows:
- holiday: Categorical; US national holidays plus a regional holiday, the Minnesota State Fair
- temp: Numeric; average temperature in Kelvin
- rain_1h: Numeric; amount of rain in mm that occurred in the hour
- snow_1h: Numeric; amount of snow in mm that occurred in the hour
- clouds_all: Numeric; percentage of cloud cover
- weather_main: Categorical; short textual description of the current weather
- weather_description: Categorical; longer textual description of the current weather
- date_time: DateTime; hour of the data collected in local CST time
- traffic_volume: Numeric; hourly I-94 ATR 301 reported westbound traffic volume

Loading the data
setwd("~/r_projects/Advanced_Analytics/Classification_new")
bankData <- read.csv(file = 'data/bank-full.csv',header = TRUE, sep = ";")
bankData <- na.omit(bankData)

| age | education | job | marital | balance | y |
|---|---|---|---|---|---|
| 58 | tertiary | management | married | 2143 | no |
| 44 | secondary | technician | single | 29 | no |
| 33 | secondary | entrepreneur | married | 2 | no |
| 47 | unknown | blue-collar | married | 1506 | no |
| 33 | unknown | unknown | single | 1 | no |
| 35 | tertiary | management | married | 231 | no |
Distribution of output of dataset
Creating a function that, for a given column, tabulates how many customers fall into each output value (yes or no); these counts are plotted later to inspect the relation between the column and the output.
relation_bw_col_and_output <- function (column_name) {
column_name_enq <- enquo(column_name)
y_val_data <- bankData %>%
group_by(!!column_name_enq) %>%
summarise(yes_value = sum(y=="yes"), no_value = sum(y=="no"))
return(y_val_data)
}

Analysing the continuous variables in the dataset and converting them to categorical variables based on the relation between each variable and the output.
Age for both the subscribed and non-subscribed customers is roughly normally distributed, peaking between 25 and 50. Hence categorizing age as 0 [<25], 1 [25-60], 2 [>60] and converting it to a factor.
clean_bank_data <- bankData %>% mutate(age_cat = case_when(age < 25 ~ 0,
age >= 25 & age <= 60 ~ 1,
age > 60 ~ 2))
clean_bank_data$age_cat = factor(clean_bank_data$age_cat)

Customers of both classes are concentrated mostly between balances of -2000 and 25000. Categorizing balance into five categories: < -2000, [-2000, 0), 0, (0, 25000), > 25000.
clean_bank_data <- clean_bank_data %>% mutate(bal_cat = case_when(balance < -2000 ~ 0,
balance >= -2000 & balance < 0 ~ 1,
balance == 0 ~ 2,
balance > 0 & balance < 25000 ~ 3,
balance > 25000 ~ 4))
clean_bank_data$bal_cat = factor(clean_bank_data$bal_cat)

As the values are distributed throughout, standardizing the day column to unit standard deviation.
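The standardization step itself is not shown above; a minimal sketch, assuming base R's scale() is applied to the day column:

# Sketch (assumption): scale the day column to zero mean and unit standard deviation
clean_bank_data$day <- as.numeric(scale(clean_bank_data$day))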
Duration: As there is no common pattern in the duration column, no changes have been made.
Campaign: As the graph is very flat, this column has no clear effect on the output; hence removing the campaign column.
Pdays
Min. 1st Qu. Median Mean 3rd Qu. Max.
-1.0 -1.0 -1.0 40.2 -1.0 871.0
y_val_data_pdays <- relation_bw_col_and_output(pdays)
ggplot(y_val_data_pdays, aes_string(x = "pdays")) +
geom_point(aes_string(y="yes_value"), color = "steelblue")+
geom_point(aes_string(y = "no_value"), color="darkred") + xlim(0,65)Same as campaign column, removing pdays column also.
clean_bank_data <- clean_bank_data %>% mutate(pre_cat = case_when(previous < 5 ~ 0,
previous >= 5 & previous < 10 ~ 1,
previous >= 10 & previous < 20 ~ 2,
previous >= 20 ~ 3))

Bucketing the previous column into the pre_cat categories above; the original previous column is dropped in the next step.
Leaving the categorical variables as is and removing the original columns that were replaced by the newly mutated ones.
clean_bank_data <- clean_bank_data[, !(colnames(clean_bank_data) %in% c("age","bal", "previous"))]
clean_bank_data <- clean_bank_data%>%select(-y,y)
clean_bank_data <- na.omit(clean_bank_data)

The clean_bank_data variable is now ready for model creation.
A decision tree is a flowchart-based supervised machine learning algorithm that shows the possible outcomes of a series of decisions. It can be used for both regression and classification problems. A primary advantage of a decision tree is that it is easy to follow and understand. A decision tree is constructed from decision nodes and leaf nodes: decision nodes test an attribute and have multiple branches, whereas leaf nodes are the outcomes of those decisions and do not contain any further branches.
From clean_bank_data, training and testing datasets are prepared for the decision tree model.
set.seed(12)
training.samples <- clean_bank_data$y %>%
createDataPartition(p = 0.8, list = FALSE)
train.data <- clean_bank_data[training.samples, ]
test.data <- clean_bank_data[-training.samples, ]

The decision tree model is created with the rpart function, using method = "class", which is used for classification.
set.seed(12)
model1 <- rpart(y ~., data = train.data, method = "class")
par(xpd = NA) # Avoid clipping the text in some device
rpart.plot(model1,digits = 4, fallen.leaves = TRUE,
type = 3, extra = 101)

predicted.classes <- model1 %>%
predict(test.data, type = "class")
mean(predicted.classes == test.data$y)

[1] 0.9023338
Classification tree:
rpart(formula = y ~ ., data = train.data, method = "class")
Variables actually used in tree construction:
[1] duration poutcome
Root node error: 4232/36170 = 0.117
n= 36170
CP nsplit rel error xerror xstd
1 0.036547 0 1.00000 1.00000 0.014445
2 0.025284 3 0.89036 0.91848 0.013918
3 0.014887 4 0.86508 0.87405 0.013617
4 0.010000 5 0.85019 0.86626 0.013563
Pruning the decision tree model with the cp value corresponding to the lowest cross-validated error (xerror).
cp <- which.min(model1$cptable[,'xerror'])
cpt <- model1$cptable[cp,'CP']
model1.pruned <- prune(model1,cpt)
plot(model1.pruned)
text(model1.pruned)

predicted.pruned <- predict(model1.pruned, test.data, type = 'class')
mean(predicted.pruned == test.data$y)

[1] 0.9023338
K-Nearest Neighbors (KNN) is a supervised machine learning algorithm used in regression and classification problems. The KNN algorithm classifies data based on similarity measures and assigns a class to new data accordingly. Similarity is measured with a distance function (e.g., Euclidean distance). A new point is assigned to the class that is most common among its k nearest neighbours, where the choice of k depends on the dataset.
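As a plain illustration of this idea (not part of the report's pipeline; all names below are hypothetical), a single point can be classified by computing Euclidean distances and taking a majority vote among the k closest training rows:

# Hypothetical illustration of KNN: Euclidean distance plus majority vote
knn_one <- function(train_x, train_y, new_x, k) {
diffs <- sweep(as.matrix(train_x), 2, as.numeric(new_x))   # subtract the new point from every row
dists <- sqrt(rowSums(diffs^2))                            # Euclidean distance to each training row
nearest <- order(dists)[1:k]                               # indices of the k closest rows
names(which.max(table(train_y[nearest])))                  # majority class among those neighbours
}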
A sample of clean_bank_data:
| age | education | job | marital | balance | y |
|---|---|---|---|---|---|
| 58 | tertiary | management | married | 2143 | no |
| 44 | secondary | technician | single | 29 | no |
| 33 | secondary | entrepreneur | married | 2 | no |
| 47 | unknown | blue-collar | married | 1506 | no |
| 33 | unknown | unknown | single | 1 | no |
| 35 | tertiary | management | married | 231 | no |
Converting the categorical variables to one hot encoding and updating the column names.
bankData.onehot = one_hot(as.data.table(clean_bank_data))
bankData.onehot <- subset(bankData.onehot,select = -c(loan_no,housing_no,default_no,y_no))
colnames(bankData.onehot)[colnames(bankData.onehot) == 'loan_yes'] <- 'loan'
colnames(bankData.onehot)[colnames(bankData.onehot) == 'housing_yes'] <- 'housing'
colnames(bankData.onehot)[colnames(bankData.onehot) == 'default_yes'] <- 'default'
colnames(bankData.onehot)[colnames(bankData.onehot) == 'y_yes'] <- 'y'

| age_cat_0 | education_primary | job_admin. | marital_divorced | balance | y |
|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 2143 | 0 |
| 0 | 0 | 0 | 0 | 29 | 0 |
| 0 | 0 | 0 | 0 | 2 | 0 |
| 0 | 0 | 0 | 0 | 1506 | 0 |
| 0 | 0 | 0 | 0 | 1 | 0 |
| 0 | 0 | 0 | 0 | 231 | 0 |
Defining a normalize function and applying it to the dataset.
normalize <- function(x) {
return ((x - min(x)) / (max(x) - min(x)))
}
bankData.onehot <- as.data.frame(lapply(bankData.onehot,normalize))

Splitting into training and testing datasets: 80% of the data goes to training and 20% to testing.
set.seed(123)
training.samples <- bankData.onehot$y %>%
createDataPartition(p = 0.8, list = FALSE)
training_set <- bankData.onehot[training.samples, ]
testing_set <- bankData.onehot[-training.samples, ]
train.y <- training_set$y
train.data <- subset(training_set,select = -y)
test.y <- testing_set$y
test.data <- subset(testing_set, select = -y)

A simple way to start when defining k is to set it to the square root of the total number of rows in the dataset. Defining k as the square root of the total number of rows.
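The chunk producing the printout below is not shown; a minimal sketch, assuming k is taken as the rounded square root of the training-set size:

# Sketch (assumption): pick a starting k from the square root of the row count
total_rows <- nrow(train.data)
k_start <- round(sqrt(total_rows))   # about 190 for 36169 rows
print(paste("Total Rows -", total_rows))
print(paste("Square root of Total Rows -", sqrt(total_rows)))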
[1] "Total Rows - 36169"
[1] "Square root of Total Rows - 190.181492264626"
Defining a function to calculate the accuracy of the model and applying the KNN algorithm to the dataset with k = 190. Two models are created, one predicting on the training data and one on the testing data, to compare the accuracy on the trained dataset and on the testing dataset.
#Function to calculate accuracy
accuracy_calc <- function(prediction, true_val){
acc <- 100 * sum(true_val == prediction)/NROW(true_val)
return (acc)
}
knn.k190_test <- knn(train = train.data, test = test.data, cl = train.y, k = 190) # ~35 s

It takes around 2 minutes 30 seconds to build the model when the training data itself is used as the test set, and around 35 seconds when the held-out test partition is used. Only the test-partition model is included here, as the cross validation that follows already takes much longer to build and knit.
[1] "Accuracy of testing dataset is - 88.8077858880779"
Simple confusion matrix of the resultant model:
| | 0 | 1 |
|---|---|---|
| 0 | 7974 | 990 |
| 1 | 22 | 56 |
| | 0 | 1 | Total | n |
|---|---|---|---|---|
| 0 | 89.0 | 11.0 | 100 | 8964 |
| 1 | 28.2 | 71.8 | 100 | 78 |
| All | 88.4 | 11.6 | 100 | 9042 |
| | 0 | 1 | All |
|---|---|---|---|
| 0 | 99.7 | 94.6 | 99.1 |
| 1 | 0.3 | 5.4 | 0.9 |
| Total | 100.0 | 100.0 | 100.0 |
| n | 7996.0 | 1046.0 | 9042.0 |
| | 0 | 1 |
|---|---|---|
| 0 | 0.53 | -1.46 |
| 1 | -5.66 | 15.64 |
X-squared = 273.0641, df = 1, p = < 2.2e-16
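The code producing these tables is not shown; the layout (counts, row and column percentages, and a chi-squared statistic) resembles gmodels::CrossTable, so a hedged sketch might look like this (the orientation of rows and columns is an assumption):

# Sketch (assumption): cross-tabulate KNN predictions (rows) against true labels (columns)
library(gmodels)
CrossTable(x = knn.k190_test, y = test.y, chisq = TRUE)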
To test the accuracy, cross validation has been done on the dataset to see whether the accuracy can be improved by changing k in the KNN. Creating a function that defines the folds and performs cross validation for the given dataset.
set.seed(123)
folds = seq.int(nrow(training_set)) %>%
cut(breaks = 10, labels=FALSE) %>% # cut() slices the ranges into equal intervals
sample
cross_validate_folds <- function(chunkid, folddef, Xdat, Ydat, k){
train = (folddef!=chunkid)# training index
Xtr = Xdat[train,] # training set by the index
Ytr = Ydat[train] # true label in training set
Xvl = Xdat[!train,] # test set
Yvl = Ydat[!train] # true label in test set
predYvl = knn(train = Xtr, test = Xvl, cl = Ytr, k = k) # predict test labels
data.frame(fold = chunkid, # k folds
val.acc = accuracy_calc(predYvl, Yvl), neighbors = k) # test error per fold
}

| fold | val.acc | neighbors |
|---|---|---|
| 1 | 89.21758 | 220 |
| 2 | 88.63699 | 220 |
| 3 | 88.11169 | 220 |
| 4 | 88.94111 | 220 |
| 5 | 88.66464 | 220 |
| 6 | 87.88717 | 220 |
| 7 | 88.96876 | 220 |
| 8 | 88.16699 | 220 |
| 9 | 88.13934 | 220 |
| 10 | 89.13464 | 220 |
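The loop that generated the fold results above is not shown; a minimal sketch, assuming purrr's map_dfr is applied over the ten folds and a grid of candidate k values (the exact grid used in the report is an assumption):

# Sketch (assumption): run the cross-validation function for each fold and each candidate k
library(purrr)
k_grid <- seq(190, 230, by = 10)   # candidate k values; the grid actually used is not shown
cv_results <- map_dfr(k_grid, function(k)
map_dfr(1:10, cross_validate_folds,
folddef = folds, Xdat = train.data, Ydat = train.y, k = k))
cv_results %>% group_by(neighbors) %>% summarise(mean_acc = mean(val.acc))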
For all values of k from 190 to 230, the accuracy does not change much; all accuracy values are around 88.7%.
Loading the data
setwd("~/r_projects/Advanced_Analytics/regression")
traffic <- read.csv(file = 'data.csv',header = TRUE, sep = ",")

| holiday | temp | rain_1h | snow_1h | clouds_all | weather_main | weather_description | date_time | traffic_volume |
|---|---|---|---|---|---|---|---|---|
| None | 288.28 | 0 | 0 | 40 | Clouds | scattered clouds | 2012-10-02 09:00:00 | 5545 |
| None | 289.36 | 0 | 0 | 75 | Clouds | broken clouds | 2012-10-02 10:00:00 | 4516 |
| None | 289.58 | 0 | 0 | 90 | Clouds | overcast clouds | 2012-10-02 11:00:00 | 4767 |
| None | 290.13 | 0 | 0 | 90 | Clouds | overcast clouds | 2012-10-02 12:00:00 | 5026 |
| None | 291.14 | 0 | 0 | 75 | Clouds | broken clouds | 2012-10-02 13:00:00 | 4918 |
| None | 291.72 | 0 | 0 | 1 | Clear | sky is clear | 2012-10-02 14:00:00 | 5181 |
Analysing continuous variables with the output
1.Temperature
Temperature is on the Kelvin scale, so a value of 0 K makes no sense. Hence converting 0 K readings to the minimum temperature recorded above 0 K.
[1] 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
[11] 243.39 243.62 244.22 244.82 244.82
#Converting 0K to 243.39
traffic$temp[traffic$temp == 0] <- 243.39
ggplot(traffic, aes(x = temp, y = traffic_volume)) +
geom_point() +
stat_smooth()

| | holiday | temp | rain_1h | snow_1h | clouds_all | weather_main | weather_description | date_time | traffic_volume |
|---|---|---|---|---|---|---|---|---|---|
| 24873 | None | 302.11 | 9831.3 | 0 | 75 | Rain | very heavy rain | 2016-07-11 17:00:00 | 5535 |
[1] 9831.30 55.63 44.45 31.75 28.70
Rainfall is in mm. More than 9800 mm of rainfall is not possible within a one-hour window, so this is likely a data entry mistake. Hence removing that one row.
traffic<-traffic[!(traffic$rain_1h > 100),]
ggplot(traffic, aes(x = rain_1h, y = traffic_volume)) +
geom_point() +
stat_smooth()

Data points are distributed throughout. Hence no changes needed.
Data points are distributed throughout. Hence no changes needed.
Analysing categorical variables
| mean(traffic_volume) |
|---|
| 865.4426 |
Traffic volume values on holidays fall between 500 and 1300; hence converting the holiday value 'None' to 0 and actual holiday days to 1.
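The conversion itself is not shown; given the hol1 coefficient in the model summary later on, a hedged sketch could be:

# Sketch (assumption): encode holiday as a binary factor (0 = None, 1 = any holiday)
traffic$hol <- factor(ifelse(traffic$holiday == "None", 0, 1))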
4.Weather_main and Weather Description
Converting weather_main and weather_description to one-hot encoding, since a single column with more than two numeric codes would otherwise be treated as a continuous variable.
for(unique_value in unique(traffic$weather_main)){
traffic[paste("weather_main", unique_value, sep = ".")] <- ifelse(traffic$weather_main == unique_value, 1, 0)
}
for(unique_value in unique(traffic$weather_description)){
traffic[paste("weather_description", unique_value, sep = ".")] <- ifelse(traffic$weather_description == unique_value, 1, 0)
}

5.Date_time
Extracting the month, day, and hour from the date_time column.
traffic$date_time <- parse_date_time(traffic$date_time,'ymd HMS!')
traffic$Date <- as.Date(traffic$date_time)
traffic$Time <- format(traffic$date_time,"%H:%M:%S")
traffic$month = month(as.POSIXlt(traffic$Date, format="%d/%m/%Y"))
traffic$day = day(as.POSIXlt(traffic$Date, format="%d/%m/%Y"))
traffic$hour = format(strptime(traffic$Time,"%H:%M:%S"),'%H')

Removing the old columns.
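The cleanup code is not shown; a minimal sketch, assuming the original text columns and the intermediate date fields are the ones dropped:

# Sketch (assumption): drop the columns replaced by the encodings and date features above
traffic <- subset(traffic, select = -c(holiday, weather_main, weather_description, date_time, Date, Time))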
The traffic variable is now ready for model creation.
Creating training and testing datasets from the traffic variable.
training.samples <- traffic$traffic_volume %>%
createDataPartition(p = 0.8, list = FALSE)
training_set <- traffic[training.samples, ]
testing_set <- traffic[-training.samples, ]
test.y <- testing_set$traffic_volume
test.data <- subset(testing_set, select = -traffic_volume)

Creating the linear model for the training set.
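The model fit itself is not shown; based on the Call line in the summary below, it presumably looks like this (the object name model_lm is an assumption):

# Fit the linear model on the training partition (object name assumed)
model_lm <- lm(traffic_volume ~ ., data = training_set)
summary(model_lm)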
Summary of the linear model created.
Call:
lm(formula = traffic_volume ~ ., data = training_set)
Residuals:
Min 1Q Median 3Q Max
-5331.7 -302.6 75.4 510.4 2592.1
Coefficients: (12 not defined because of singularities)
Estimate Std. Error
(Intercept) -1834.0967 300.2753
temp 7.8732 0.4328
rain_1h -44.1774 5.8587
snow_1h 356.8622 555.6198
clouds_all -1.3751 0.2431
hol1 51.5585 137.8993
weather_main.Clouds 636.1517 272.9984
weather_main.Clear 622.0619 273.2993
weather_main.Rain 688.2536 406.4399
weather_main.Drizzle 545.8281 501.2342
weather_main.Mist 617.6563 272.1679
weather_main.Haze 590.3310 273.3686
weather_main.Fog 454.4332 274.0912
weather_main.Thunderstorm -1503.4340 719.5210
weather_main.Snow 106.6959 607.6785
t value Pr(>|t|)
(Intercept) -6.108 1.02e-09 ***
temp 18.192 < 2e-16 ***
rain_1h -7.540 4.79e-14 ***
snow_1h 0.642 0.520697
clouds_all -5.656 1.56e-08 ***
hol1 0.374 0.708492
weather_main.Clouds 2.330 0.019799 *
weather_main.Clear 2.276 0.022844 *
weather_main.Rain 1.693 0.090393 .
weather_main.Drizzle 1.089 0.276175
weather_main.Mist 2.269 0.023250 *
weather_main.Haze 2.159 0.030820 *
weather_main.Fog 1.658 0.097333 .
weather_main.Thunderstorm -2.089 0.036670 *
weather_main.Snow 0.176 0.860625
[ reached getOption("max.print") -- omitted 65 rows ]
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 941.1 on 38497 degrees of freedom
Multiple R-squared: 0.7761, Adjusted R-squared: 0.7757
F-statistic: 1991 on 67 and 38497 DF, p-value: < 2.2e-16
Predicting the testing dataset with the model created from the training dataset. The accuracy of the prediction is assessed by the correlation between the actual and predicted values.
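Neither step is shown in the report; a minimal sketch, assuming the fitted object is called model_lm as above:

# Sketch (assumption): predict on the held-out partition and correlate with the actual values
pred.y <- predict(model_lm, newdata = test.data)
cor(test.y, pred.y)   # correlation between actual and predicted traffic volumes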
Learnings and Challenges
- Before settling on the caret library for data partitioning and model-fitting functions, I tried to use the RWeka library, but the installation process was arduous: I faced many issues with the Java package and finally downgraded Java to solve them. I later switched to caret functions for all dataset-related work.
- The Inf keyword in R refers to infinity. While calculating the mean absolute percentage error (MAPE) for the linear regression, the output came back as Inf, and it took significant time to realize that some of the actual and predicted values were 0 [zero]. I then calculated MAPE only over the non-zero values.
- Accuracy calculation methods for classification and regression differ a lot. At first I used the same accuracy_calc method created for classification on the linear model, and the resulting accuracy was around 0.07%. Seeing that result helped me understand that the way accuracy is computed for classification makes no sense in regression.
- Cross validation for the KNN algorithm initially looped over 30 different k values; it took more than 30 minutes to complete and consumed almost 70% of the machine's RAM. I then changed it to 10 different k values.