| title: “Project 4-Cars” |
| author: “Tobi Ilesanmi” |
| date: “9/22/2020” |
| output: pdf_document |
This is aR Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.
When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:
This project requires us to understand what mode of transport employees prefers to commute to their office. The dataset “Cars-dataset” includes employee information about their mode of transport as well as their personal and professional details like age, salary, work exp. We need to predict whether or not an employee will use Car as a mode of transport. Also, which variables are a significant predictor behind this decision.
install.packages("knitr")
install.packages("purl")
library(car) # use for multicollinearity test (i.e. Variance Inflation Factor(VIF))
library(MASS) # for stepAIC
library(ggplot2) # use for visualization
library(gridExtra) # To plot multiple ggplot graphs in a grid
library(corrplot) # for correlation plot
library(caTools) # Split Data into Test and Train Set
library(e1071) # to build a naive bayes model
library(ROCR) # To plot ROC-AUC curve
library(InformationValue) # for Concordance-Discordance
library(class) # to build a KNN model
library(knitr) # Necessary to generate sourcecodes from a .Rmd File
setwd("/cloud/project")
library(readr)
Cars_data <- read_csv("Cars-dataset.csv")
## Parsed with column specification:
## cols(
## Age = col_double(),
## Gender = col_character(),
## Engineer = col_double(),
## MBA = col_double(),
## `Work Exp` = col_double(),
## Salary = col_double(),
## Distance = col_double(),
## license = col_double(),
## Transport = col_character()
## )
# Look at the first and last few rows to ensure that the data is read in properly
head(Cars_data)
## # A tibble: 6 x 9
## Age Gender Engineer MBA `Work Exp` Salary Distance license Transport
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 28 Male 1 0 5 14.4 5.1 0 2Wheeler
## 2 24 Male 1 0 6 10.6 6.1 0 2Wheeler
## 3 27 Female 1 0 9 15.5 6.1 0 2Wheeler
## 4 25 Male 0 0 1 7.6 6.3 0 2Wheeler
## 5 25 Female 0 0 3 9.6 6.7 0 2Wheeler
## 6 21 Male 0 0 3 9.5 7.1 0 2Wheeler
tail(Cars_data)
## # A tibble: 6 x 9
## Age Gender Engineer MBA `Work Exp` Salary Distance license Transport
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 29 Female 1 0 6 14.9 17 0 Public Transpo~
## 2 29 Male 1 1 8 13.9 17.1 0 Public Transpo~
## 3 25 Male 1 0 3 9.9 17.2 0 Public Transpo~
## 4 27 Female 0 0 4 13.9 17.3 0 Public Transpo~
## 5 26 Male 1 1 2 9.9 17.7 0 Public Transpo~
## 6 23 Male 0 0 3 9.9 17.9 0 Public Transpo~
dim(Cars_data)
## [1] 418 9
*The dataset has 418 rows and 9 columns of data
# lets look at the structure of the data, to properly understand our dataset
str(Cars_data)
## tibble [418 x 9] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Age : num [1:418] 28 24 27 25 25 21 23 23 24 28 ...
## $ Gender : chr [1:418] "Male" "Male" "Female" "Male" ...
## $ Engineer : num [1:418] 1 1 1 0 0 0 1 0 1 1 ...
## $ MBA : num [1:418] 0 0 0 0 0 0 1 0 0 0 ...
## $ Work Exp : num [1:418] 5 6 9 1 3 3 3 0 4 6 ...
## $ Salary : num [1:418] 14.4 10.6 15.5 7.6 9.6 9.5 11.7 6.5 8.5 13.7 ...
## $ Distance : num [1:418] 5.1 6.1 6.1 6.3 6.7 7.1 7.2 7.3 7.5 7.5 ...
## $ license : num [1:418] 0 0 0 0 0 0 0 0 0 1 ...
## $ Transport: chr [1:418] "2Wheeler" "2Wheeler" "2Wheeler" "2Wheeler" ...
## - attr(*, "spec")=
## .. cols(
## .. Age = col_double(),
## .. Gender = col_character(),
## .. Engineer = col_double(),
## .. MBA = col_double(),
## .. `Work Exp` = col_double(),
## .. Salary = col_double(),
## .. Distance = col_double(),
## .. license = col_double(),
## .. Transport = col_character()
## .. )
Cars_data$Gender<-as.factor(Cars_data$Gender)
Cars_data$Engineer<-as.factor(Cars_data$Engineer)
Cars_data$MBA<-as.factor(Cars_data$MBA)
Cars_data$license<-as.factor(Cars_data$license)
Cars_data$Transport<-as.factor(Cars_data$Transport)
str(Cars_data)
## tibble [418 x 9] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Age : num [1:418] 28 24 27 25 25 21 23 23 24 28 ...
## $ Gender : Factor w/ 2 levels "Female","Male": 2 2 1 2 1 2 2 2 2 2 ...
## $ Engineer : Factor w/ 2 levels "0","1": 2 2 2 1 1 1 2 1 2 2 ...
## $ MBA : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 2 1 1 1 ...
## $ Work Exp : num [1:418] 5 6 9 1 3 3 3 0 4 6 ...
## $ Salary : num [1:418] 14.4 10.6 15.5 7.6 9.6 9.5 11.7 6.5 8.5 13.7 ...
## $ Distance : num [1:418] 5.1 6.1 6.1 6.3 6.7 7.1 7.2 7.3 7.5 7.5 ...
## $ license : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 2 ...
## $ Transport: Factor w/ 3 levels "2Wheeler","Car",..: 1 1 1 1 1 1 1 1 1 1 ...
## - attr(*, "spec")=
## .. cols(
## .. Age = col_double(),
## .. Gender = col_character(),
## .. Engineer = col_double(),
## .. MBA = col_double(),
## .. `Work Exp` = col_double(),
## .. Salary = col_double(),
## .. Distance = col_double(),
## .. license = col_double(),
## .. Transport = col_character()
## .. )
sum(is.na(Cars_data))
## [1] 1
colSums(is.na(Cars_data))
## Age Gender Engineer MBA Work Exp Salary Distance license
## 0 0 0 1 0 0 0 0
## Transport
## 0
Cars_data <-na.omit(Cars_data)
Cars_data <-as.data.frame(Cars_data)
dim(Cars_data)
## [1] 417 9
*Having removed the missing data, we are now left with 417 rows and still 9 variables
colnames(Cars_data) = make.names(colnames(Cars_data))
head(Cars_data)
## Age Gender Engineer MBA Work.Exp Salary Distance license Transport
## 1 28 Male 1 0 5 14.4 5.1 0 2Wheeler
## 2 24 Male 1 0 6 10.6 6.1 0 2Wheeler
## 3 27 Female 1 0 9 15.5 6.1 0 2Wheeler
## 4 25 Male 0 0 1 7.6 6.3 0 2Wheeler
## 5 25 Female 0 0 3 9.6 6.7 0 2Wheeler
## 6 21 Male 0 0 3 9.5 7.1 0 2Wheeler
summary(Cars_data)
## Age Gender Engineer MBA Work.Exp Salary
## Min. :18.00 Female:120 0:104 0:308 Min. : 0.000 Min. : 6.50
## 1st Qu.:25.00 Male :297 1:313 1:109 1st Qu.: 3.000 1st Qu.: 9.60
## Median :27.00 Median : 5.000 Median :13.00
## Mean :27.33 Mean : 5.873 Mean :15.42
## 3rd Qu.:29.00 3rd Qu.: 8.000 3rd Qu.:14.90
## Max. :43.00 Max. :24.000 Max. :57.00
## Distance license Transport
## Min. : 3.2 0:332 2Wheeler : 83
## 1st Qu.: 8.6 1: 85 Car : 35
## Median :10.9 Public Transport:299
## Mean :11.3
## 3rd Qu.:13.6
## Max. :23.4
table(Cars_data$Gender)
##
## Female Male
## 120 297
plot_histogram_n_boxplot (Cars_data$Age, 'Age', 1)
* The age is normally distributed * The age distribution has rigjht outliers * The Median age is 27 years
plot_histogram_n_boxplot (Cars_data$Work.Exp, 'Work Experience', 1)
* As expected work experience is skewed to the left with right outliers whihc are exteme values to the righ * The median age is 6 years, the lowest is zero and the highest is 25 years
plot_histogram_n_boxplot (Cars_data$Salary, 'Salary', 1)
* The Salary distribution is skewed to the left, with right outliers * As expected most employees are within the low income range
plot_histogram_n_boxplot (Cars_data$Distance, 'Distance', 1)
* The distance traveled to work is normally distributed * Few extreme values on the right * The median distance traveled is 10.9
plot_stacked_barchart = function(variable, variableNameString){
ggplot(Cars_data, aes(fill = Transport, x = variable)) +
geom_bar(position="fill")+
labs(title = variableNameString, y = '', x = '')+
scale_fill_manual(values=c("RED", "YELLOW", "BLUE"))
}
# Plot Bar chart
plot_stacked_barchart(Cars_data$Gender, 'Gender')
* Most people went to work via public tranport for both sexes * The proportion of female that went to work via 2Wheeler was higher than those of males. The chances of a female going to work via a 2Wheeler is about double those of males * The proportion of male that went to work with a car os higher than those of females. The chances of male going to work via a car is almost double those of females * Slightly more males commute to work via Public Transport as compared to females
# Plot Bar chart
plot_stacked_barchart(Cars_data$Engineer, 'Engineer')
# Plot Bar chart
plot_stacked_barchart(Cars_data$MBA, 'MBA')
* Slightly higher propotion of employees with MBA degree commute to work via Public Tranport than those without one. * Possessing MBA degree didn’t impact on increased possibility of going to work via a Car as similar proportion of employees went to work whether they have a car or not * Higher proportion of employees without MBA went to work via a 2Wheeler than those with MBA
# Plot Bar chart
plot_stacked_barchart(Cars_data$license, 'License')
* Possessing a license is a significant determinant of going to work via a Car, as significantly higher proportion of employees going to work via a Car have a license * Most employees without a license went to work via a Public Transport, a few went to work via a 2Wheeler, while only a negligible proportion went to work via a Car as expected. * Almost equal proportion of employees with a license went to work via a Car and via Public Transport with slightly higher proportion going to work via a Car, while a lower proportion went to work via a 2Wheeler
*Lets check the relationship between Mode of Transport and numeric variables (Age, Work Experience, Salary and Distance)
ggplot(Cars_data,aes_string(x=Cars_data$Age,fill="Transport")) + geom_histogram(bins=50,alpha=0.5,colour='black')
* Only employees greater or equal to age of 30 go to work via Cars,, while from age of 37, theyu go to work exclusively via Cars. * Employees younger than 30 years go to work via Public Transport and 2Wheelers (Bicycles) * No employee more than 35 years uses a 2 wheeler to work
ggplot(Cars_data,aes_string(x=Cars_data$Work.Exp,fill="Transport")) + geom_histogram(bins=50,alpha=0.5,colour='black')
* Only employees with 10 years and above work experience commute to work via Cars * Employees with less than years work experience commute to work via Public Transport and 2Wheelers * Employees with greater than 18 years work experience commute to work via Cars exclusively
ggplot(Cars_data,aes_string(x=Cars_data$Salary,fill="Transport")) + geom_histogram(bins=50,alpha=0.5,colour='black')
* Employees with income above 39 go to work exclusively via Cars, hence the higher the income of employees the more likely they will go to work via Cars * No employee with income less than 14 goes to work via Cars, employees with income from 15 to 39 gos to work using the mix of Pubic Transport, 2 Wheelers and Cars
ggplot(Cars_data,aes_string(x=Cars_data$Distance ,fill="Transport")) + geom_histogram(bins=50,alpha=0.5,colour='black')
* The more the distance of the employee the higher the likelihood of going to work via Cars, as non of them with distance less than 14 goes to work with Cars * The predominant mode of transportation from distance above 18 is Cars
DataExplorer::plot_correlation(Cars_data)
* There is a positive correlation of between Salary and Cars, the more the salary, the more the likelihood of commuting to work via a Car. This has the strongest correlation among all the variables. * Work experience also had a positive correlation with commuting to work via Cars. It is the second most strongly correlated with commuting to work via Cars. The higher the work experience the more likely the chance of commuting to work via cars * Age is also positively correlated with Cars, the higher the age the more likely the employee will go to work via a Cars. This is the third most positively correlated variable to Cars * Distance is also positively correlated with going to work via Cars, the more the distance the more the likelihood of using a Car to work. This is the fourth most positively correlated variable to Cars There is a significant positive correlation between ownership of a license and going to work via Cars. This is the fifth most positively correlated variable to Cars among all the variables. Age is highly correlated to Work_Exp and Distance. * Work_Exp is highly correlated to Distance. * Work_Exp is also correlated to Salary and Age.
Cars_data$Gender<-as.numeric(Cars_data$Gender)
Cars_data$Engineer<-as.numeric(Cars_data$Engineer)
Cars_data$MBA<-as.numeric(Cars_data$MBA)
Cars_data$license<-as.numeric(Cars_data$license)
str(Cars_data)
## 'data.frame': 417 obs. of 9 variables:
## $ Age : num 28 24 27 25 25 21 23 23 24 28 ...
## $ Gender : num 2 2 1 2 1 2 2 2 2 2 ...
## $ Engineer : num 2 2 2 1 1 1 2 1 2 2 ...
## $ MBA : num 1 1 1 1 1 1 2 1 1 1 ...
## $ Work.Exp : num 5 6 9 1 3 3 3 0 4 6 ...
## $ Salary : num 14.4 10.6 15.5 7.6 9.6 9.5 11.7 6.5 8.5 13.7 ...
## $ Distance : num 5.1 6.1 6.1 6.3 6.7 7.1 7.2 7.3 7.5 7.5 ...
## $ license : num 1 1 1 1 1 1 1 1 1 2 ...
## $ Transport: Factor w/ 3 levels "2Wheeler","Car",..: 1 1 1 1 1 1 1 1 1 1 ...
## - attr(*, "na.action")= 'omit' Named int 243
## ..- attr(*, "names")= chr "243"
Cars_data$Transport<-ifelse(Cars_data$Transport=='Car',1,0)
table(Cars_data$Transport)
##
## 0 1
## 382 35
Cars_data$Transport<-as.factor(Cars_data$Transport)
summary(Cars_data)
## Age Gender Engineer MBA
## Min. :18.00 Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:25.00 1st Qu.:1.000 1st Qu.:2.000 1st Qu.:1.000
## Median :27.00 Median :2.000 Median :2.000 Median :1.000
## Mean :27.33 Mean :1.712 Mean :1.751 Mean :1.261
## 3rd Qu.:29.00 3rd Qu.:2.000 3rd Qu.:2.000 3rd Qu.:2.000
## Max. :43.00 Max. :2.000 Max. :2.000 Max. :2.000
## Work.Exp Salary Distance license Transport
## Min. : 0.000 Min. : 6.50 Min. : 3.2 Min. :1.000 0:382
## 1st Qu.: 3.000 1st Qu.: 9.60 1st Qu.: 8.6 1st Qu.:1.000 1: 35
## Median : 5.000 Median :13.00 Median :10.9 Median :1.000
## Mean : 5.873 Mean :15.42 Mean :11.3 Mean :1.204
## 3rd Qu.: 8.000 3rd Qu.:14.90 3rd Qu.:13.6 3rd Qu.:1.000
## Max. :24.000 Max. :57.00 Max. :23.4 Max. :2.000
library(usdm)
## Loading required package: sp
## Loading required package: raster
##
## Attaching package: 'raster'
## The following object is masked from 'package:e1071':
##
## interpolate
## The following objects are masked from 'package:MASS':
##
## area, select
##
## Attaching package: 'usdm'
## The following object is masked from 'package:car':
##
## vif
library(VIF)
##
## Attaching package: 'VIF'
## The following object is masked from 'package:usdm':
##
## vif
## The following object is masked from 'package:car':
##
## vif
vifcor(Cars_data[-9])
## 1 variables from the 8 input variables have collinearity problem:
##
## Work.Exp
##
## After excluding the collinear variables, the linear correlation coefficients ranges between:
## min correlation ( MBA ~ Age ): -0.001752158
## max correlation ( Salary ~ Age ): 0.8579114
##
## ---------- VIFs of the remained variables --------
## Variables VIF
## 1 Age 3.827422
## 2 Gender 1.067936
## 3 Engineer 1.012862
## 4 MBA 1.019179
## 5 Salary 4.482439
## 6 Distance 1.320710
## 7 license 1.339501
Cars_data<-Cars_data[-5]
names(Cars_data)
## [1] "Age" "Gender" "Engineer" "MBA" "Salary" "Distance"
## [7] "license" "Transport"
boxplot(Cars_data$Age)
* There are outliers in the Age variable *Lets check Distance for presence of outliers
boxplot(Cars_data$Distance)
* There are outliers in the Distance variables.
boxplot(Cars_data$Salary)
* There are extreme values in the Salary variables
quantile(Cars_data$Age, c(0.95))
## 95%
## 37
*All ages above 37 are outliers and will be removed
quantile(Cars_data$Distance, c(0.95))
## 95%
## 17.92
17.92 is the cutoff point here,anything above this will be treated as an outlier and removed
Removing Outliers in the Salary variable
quantile(Cars_data$Salary, c(0.95))
## 95%
## 41.92
table(Cars_data$Transport)
##
## 0 1
## 382 35
prop.table(table(Cars_data$Transport))
##
## 0 1
## 0.91606715 0.08393285
install.packages("DMwR")
library(DMwR)
install.packages("caret")
library(caret)
install.packages("smotefamily")
SCars<- SMOTE(Transport ~ ., Cars_data)
prop.table(table(SCars$Transport))
##
## 0 1
## 0.5714286 0.4285714
SCars_data <- rbind(Cars_data,SCars)
prop.table(table(SCars_data$Transport))
##
## 0 1
## 0.7885196 0.2114804
set.seed(42)
splitdata <-createDataPartition(y=SCars_data$Transport,p=0.7,list = FALSE)
traindf <-SCars_data[splitdata,]
testdf <-SCars_data[-splitdata,]
prop.table(table(traindf$Transport))
##
## 0 1
## 0.7887931 0.2112069
prop.table(table(testdf$Transport))
##
## 0 1
## 0.7878788 0.2121212
table(traindf$Transport)
##
## 0 1
## 366 98
table(testdf$Transport)
##
## 0 1
## 156 42
############MODELLING################### # 9. Logistic Regression
lgmodel <- glm(formula= Transport ~.,traindf , family=binomial)
summary(lgmodel)
##
## Call:
## glm(formula = Transport ~ ., family = binomial, data = traindf)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.84786 -0.02027 -0.00378 -0.00033 1.90079
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -40.01591 11.81867 -3.386 0.00071 ***
## Age 0.73976 0.31160 2.374 0.01759 *
## Gender -0.66409 1.12789 -0.589 0.55600
## Engineer -0.21421 1.43686 -0.149 0.88149
## MBA -1.83607 1.26915 -1.447 0.14798
## Salary 0.09900 0.07121 1.390 0.16445
## Distance 0.94419 0.23512 4.016 5.93e-05 ***
## license 2.07172 1.51681 1.366 0.17199
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 478.432 on 463 degrees of freedom
## Residual deviance: 35.827 on 456 degrees of freedom
## AIC: 51.827
##
## Number of Fisher Scoring iterations: 10
lg_predictions <- predict(lgmodel,testdf, type="response")
# Confusion Matrix of Train Dataset with 0.5 threshold
ctrain<-predict(lgmodel,newdata = traindf[,-9], type = "response")
tablg<-table(traindf$Transport, ctrain>0.5)
sum(diag(tablg))/sum(tablg)
## [1] 0.9849138
# Confusion Matrix of Test Dataset with 0.5 threshold
ctest<-predict(lgmodel,newdata = testdf[,-9], type = "response")
tablg1<-table(testdf$Transport, ctest>0.5)
sum(diag(tablg1))/sum(tablg1)
## [1] 0.979798
#Validating the Regression Model using ROC on Train Data
predROCLR<-predict(lgmodel, newdata = traindf)
predLR<-prediction(predROCLR, traindf$Transport)
perfLR<-performance(predLR, "tpr", "fpr")
plot(perfLR, colorize = T)
as.numeric(performance(predLR, "auc")@y.values)
## [1] 0.9987175
#Validating the Regression Model using ROC on Test Data
predROCLR1<-predict(lgmodel, newdata = testdf)
predLR1<-prediction(predROCLR1, testdf$Transport)
perfLR1<-performance(predLR1, "tpr", "fpr")
plot(perfLR1, colorize = T)
as.numeric(performance(predLR1, "auc")@y.values)
## [1] 0.9978632
# K-S Chart of the Regression Model on the Train Data
ks.TrainLR = max(attr(perfLR,'y.values')[[1]] - attr(perfLR,'x.values')[[1]])
ks.TrainLR
## [1] 0.9781421
# K-S Chart of the Regression Model on the Test Data
ks.TestLR = max(attr(perfLR1,'y.values')[[1]] - attr(perfLR1,'x.values')[[1]])
ks.TestLR
## [1] 0.9679487
install.packages("ineq")
library(ineq)
# Gini of Train Data of the Logistic Regression
GINI_LR=ineq(ctrain, type = "Gini")
GINI_LR
## [1] 0.7864589
# Gini of Test Data of the Logistic Regression
GINI_LR1=ineq(ctest, type = "Gini")
GINI_LR1
## [1] 0.7879288
library(e1071)
model<-naiveBayes(Transport~.,data=traindf)
model
##
## Naive Bayes Classifier for Discrete Predictors
##
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
##
## A-priori probabilities:
## Y
## 0 1
## 0.7887931 0.2112069
##
## Conditional probabilities:
## Age
## Y [,1] [,2]
## 0 26.70765 3.018536
## 1 37.07432 3.052954
##
## Gender
## Y [,1] [,2]
## 0 1.699454 0.4591233
## 1 1.830044 0.3663931
##
## Engineer
## Y [,1] [,2]
## 0 1.767760 0.4228396
## 1 1.853016 0.3524049
##
## MBA
## Y [,1] [,2]
## 0 1.240437 0.4279340
## 1 1.234694 0.4259863
##
## Salary
## Y [,1] [,2]
## 0 13.27404 5.047811
## 1 42.06508 8.796744
##
## Distance
## Y [,1] [,2]
## 0 10.55601 3.161239
## 1 18.06979 2.497271
##
## license
## Y [,1] [,2]
## 0 1.150273 0.3578282
## 1 1.826056 0.3726245
# generating the probabilities in prediction
ypred<-predict(model, newdata = testdf, type="raw")
plot(testdf$Transport,ypred[,2])
# generating the class in prediction
pred<-predict(model,newdata=testdf)
plot(pred)
caret::confusionMatrix(pred,reference=testdf$Transport)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 153 4
## 1 3 38
##
## Accuracy : 0.9646
## 95% CI : (0.9285, 0.9857)
## No Information Rate : 0.7879
## P-Value [Acc > NIR] : 7.929e-13
##
## Kappa : 0.8933
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.9808
## Specificity : 0.9048
## Pos Pred Value : 0.9745
## Neg Pred Value : 0.9268
## Prevalence : 0.7879
## Detection Rate : 0.7727
## Detection Prevalence : 0.7929
## Balanced Accuracy : 0.9428
##
## 'Positive' Class : 0
##
We were able to classify 153 out of 156 “No” (Non usage of Cars) (A probability of 98.08% which is our sensitivity)
We were able to correctly classify 39 out of “42” Yes (Usage of Cars) correctly. THis is a probability of 92.86%, which is our specificity
This means the ability of Naives Bayes algorithm to predict “Non usage of Car” is about 98.08%, and about 92.86% for “Usage of Cars”
The overall accuracy is 96.97%, that is the ability of the Naive Bayes model to predict the use of Cars is 97%
Hence the model has very good performance
Accuracy : 0.9697
Sensitivity : 0.9808
Specificity : 0.9286
p_train<-predict(model, newdata = traindf, type = "class")
table.model<-table(traindf$Transport, p_train)
sum(diag(table.model))/sum(table.model)
## [1] 0.9719828
p_test<-predict(model, newdata = testdf, type = "class")
table.model1<-table(testdf$Transport, p_test)
sum(diag(table.model1))/sum(table.model1)
## [1] 0.9646465
library(ROCR)
*ROC of the Naive Bayes Model using the Train Data
# Area Under the ROC Curve (AUC - ROC) on Train Data Set
predROC<- ROCR::prediction(traindf[,7], p_train)
perf<- performance(predROC, "tpr", "fpr")
plot(perf, colorize =T)
as.numeric(performance(predROC, "auc")@y.values)
## [1] 0.8667082
*ROC of the Naive Bayes model using the Test Data
# Area Under the ROC Curve (AUC - ROC) on Test Data Set
predROC1<- ROCR::prediction(testdf[,7], p_test)
perf1<- performance(predROC1, "tpr", "fpr")
plot(perf1, colorize =T)
as.numeric(performance(predROC1, "auc")@y.values)
## [1] 0.8723008
ks.Train = max(attr(perf,'y.values')[[1]] - attr(perf,'x.values')[[1]])
ks.Train
## [1] 0.7390619
ks.Test = max(attr(perf1,'y.values')[[1]] - attr(perf1,'x.values')[[1]])
ks.Test
## [1] 0.7495728
GINI_NB=2*(performance(predROC, "auc")@y.values)[[1]]-1
GINI_NB
## [1] 0.7334164
We can use the caret library to train our KNN model which sets up a grid of tuning parameters.
method : A string specifying which classification or regression model to use.
metric : A string that specifies what summary metric will be used to select the optimal model.
preProcess : A string vector that defines a pre-processing of the predictor data,“scale” would scale our data before building the knn model which is an important step in building a KNN model.
tuneGrid : is a parameter used to provide all possible tuning values.
library(class)
train<-traindf
test<-testdf
set.seed(42)
trControl <- trainControl(method = "cv", number = 10)
knnmod <- caret::train(Transport ~ .,
method = "knn",
tuneGrid = expand.grid(k = 2:20),
trControl = trControl,
metric = "Accuracy",
preProcess = c("center","scale"),
data = train)
knnmod
## k-Nearest Neighbors
##
## 464 samples
## 7 predictor
## 2 classes: '0', '1'
##
## Pre-processing: centered (7), scaled (7)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 418, 418, 418, 417, 417, 418, ...
## Resampling results across tuning parameters:
##
## k Accuracy Kappa
## 2 0.9849214 0.9554630
## 3 0.9805735 0.9416906
## 4 0.9827012 0.9483343
## 5 0.9848289 0.9544621
## 6 0.9804810 0.9424577
## 7 0.9804810 0.9412523
## 8 0.9740056 0.9221807
## 9 0.9783534 0.9340320
## 10 0.9740518 0.9202959
## 11 0.9718779 0.9137143
## 12 0.9718779 0.9137143
## 13 0.9697040 0.9059914
## 14 0.9675301 0.8982686
## 15 0.9697040 0.9059914
## 16 0.9675301 0.8982686
## 17 0.9653562 0.8898938
## 18 0.9653562 0.8898938
## 19 0.9631822 0.8832900
## 20 0.9631822 0.8832900
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 2.
knn_predictions1 <- predict(knnmod,train)
table(knn_predictions1)
## knn_predictions1
## 0 1
## 367 97
knn_predictions <- predict(knnmod,test)
table(knn_predictions)
## knn_predictions
## 0 1
## 155 43
# Confusion Matrix on Train Data
caret::confusionMatrix(knn_predictions1, train$Transport)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 366 1
## 1 0 97
##
## Accuracy : 0.9978
## 95% CI : (0.9881, 0.9999)
## No Information Rate : 0.7888
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9935
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 1.0000
## Specificity : 0.9898
## Pos Pred Value : 0.9973
## Neg Pred Value : 1.0000
## Prevalence : 0.7888
## Detection Rate : 0.7888
## Detection Prevalence : 0.7909
## Balanced Accuracy : 0.9949
##
## 'Positive' Class : 0
##
# Confusion Matrix on Test Data
caret::confusionMatrix(knn_predictions, test$Transport)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 153 2
## 1 3 40
##
## Accuracy : 0.9747
## 95% CI : (0.9421, 0.9918)
## No Information Rate : 0.7879
## P-Value [Acc > NIR] : 1.187e-14
##
## Kappa : 0.9251
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.9808
## Specificity : 0.9524
## Pos Pred Value : 0.9871
## Neg Pred Value : 0.9302
## Prevalence : 0.7879
## Detection Rate : 0.7727
## Detection Prevalence : 0.7828
## Balanced Accuracy : 0.9666
##
## 'Positive' Class : 0
##
# Area Under the ROC curve (AUC - ROC) on Train Data
predROCKNN<-ROCR::prediction(train[,1], knn_predictions1)
perfKNN<-performance(predROCKNN, "tpr", "fpr")
plot(perfKNN, colorize = T)
as.numeric(performance(predROCKNN, "auc")@y.values)
## [1] 0.9885811
# Area Under the ROC curve (AUC - ROC) on Test Data
predROCKNN1<-ROCR::prediction(test[,1], knn_predictions)
perfKNN1<-performance(predROCKNN1, "tpr", "fpr")
plot(perfKNN1, colorize = T)
as.numeric(performance(predROCKNN1, "auc")@y.values)
## [1] 0.9551388
#K-S on Train Data
Ks.trainKNN<-max(attr(perfKNN, 'y.values')[[1]]-attr(perfKNN, 'x.values')[[1]])
Ks.trainKNN
## [1] 0.8988174
#K-S on Test Data
Ks.testKNN<-max(attr(perfKNN1, 'y.values')[[1]]-attr(perfKNN1, 'x.values')[[1]])
Ks.testKNN
## [1] 0.8657164
# Gini of Train Data of the KNN Model
GINI_KNN=ineq(knn_predictions1, type = "Gini")
GINI_KNN
## [1] 0.1367593
# Gini of Test Data of the Logistic Regression
GINI_KNN1=ineq(knn_predictions, type = "Gini")
GINI_KNN1
## [1] 0.1396748
library(gbm)
library(xgboost)
library(caret)
library(ipred)
library(plyr)
library(rpart)
library(MASS)
install.packages("TH.data")
library(TH.data)
BAGingmodel<-bagging(as.numeric(Transport) ~.,data=traindf, control=rpart.control(maxdepth=10, minsplit=50))
BAGingpredTest<-predict(BAGingmodel, testdf)
tabBAGing<-table(testdf$Transport,BAGingpredTest > 0.5)
tabBAGing
##
## TRUE
## 0 156
## 1 42
[* Here with bagging, we call those with Cars 2 and those without Cars 1
str(traindf)
## 'data.frame': 464 obs. of 8 variables:
## $ Age : num 28 24 27 25 25 21 24 28 26 21 ...
## $ Gender : num 2 2 1 2 1 2 2 2 2 2 ...
## $ Engineer : num 2 2 2 1 1 1 2 2 1 1 ...
## $ MBA : num 1 1 1 1 1 1 1 1 1 2 ...
## $ Salary : num 14.4 10.6 15.5 7.6 9.6 9.5 8.5 13.7 12.6 10.6 ...
## $ Distance : num 5.1 6.1 6.1 6.3 6.7 7.1 7.5 7.5 7.5 7.7 ...
## $ license : num 1 1 1 1 1 1 1 2 1 1 ...
## $ Transport: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
str(testdf)
## 'data.frame': 198 obs. of 8 variables:
## $ Age : num 23 23 22 27 29 29 27 25 34 23 ...
## $ Gender : num 2 2 1 2 1 2 1 1 2 1 ...
## $ Engineer : num 2 1 2 1 1 2 2 2 2 1 ...
## $ MBA : num 2 1 1 2 1 1 1 1 2 1 ...
## $ Salary : num 11.7 6.5 8.5 15.6 14.6 23.8 12.8 11.6 36.9 11.6 ...
## $ Distance : num 7.2 7.3 8.1 9 9.2 9.4 9.7 10.1 10.4 10.7 ...
## $ license : num 1 1 1 1 1 1 1 1 2 1 ...
## $ Transport: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
mod.bagging <- bagging(Transport ~.,
data=traindf, control=rpart.control(maxdepth=5, minsplit=4))
mod.bagging
##
## Bagging classification trees with 25 bootstrap replications
##
## Call: bagging.data.frame(formula = Transport ~ ., data = traindf, control = rpart.control(maxdepth = 5,
## minsplit = 4))
bag.pred <- predict(mod.bagging, testdf)
bag.pred
## [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 1 1 1 1 1
## [38] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [75] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [112] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [149] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1
## [186] 1 1 1 1 1 1 1 1 1 1 1 1 1
## Levels: 0 1
caret::confusionMatrix(bag.pred,testdf$Transport, positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 156 2
## 1 0 40
##
## Accuracy : 0.9899
## 95% CI : (0.964, 0.9988)
## No Information Rate : 0.7879
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9692
##
## Mcnemar's Test P-Value : 0.4795
##
## Sensitivity : 0.9524
## Specificity : 1.0000
## Pos Pred Value : 1.0000
## Neg Pred Value : 0.9873
## Prevalence : 0.2121
## Detection Rate : 0.2020
## Detection Prevalence : 0.2020
## Balanced Accuracy : 0.9762
##
## 'Positive' Class : 1
##
train_boost <- traindf
test_boost <- testdf
mod.boost <- gbm(Transport ~ .,
data = train_boost,
distribution = "bernoulli", n.trees = 5000,
interaction.depth = 4, shrinkage = 0.01)
train_boost$Transport<-ifelse(train_boost$Transport=="2", 1,0)
test_boost$Transport<-ifelse(test_boost$Transport=="2", 1,0)
boost.pred <- predict(mod.boost, test_boost,n.trees =5000, type="response")
y_pred_num <- ifelse(boost.pred > 0.5, 1, 0)
y_pred <- factor(y_pred_num, levels=c(0, 1))
table(y_pred, test_boost$Transport)
##
## y_pred 0
## 0 0
## 1 0
caret::confusionMatrix(y_pred,test$Transport, positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 0 0
## 1 0 0
##
## Accuracy : NaN
## 95% CI : (NA, NA)
## No Information Rate : NA
## P-Value [Acc > NIR] : NA
##
## Kappa : NaN
##
## Mcnemar's Test P-Value : NA
##
## Sensitivity : NA
## Specificity : NA
## Pos Pred Value : NA
## Neg Pred Value : NA
## Prevalence : NaN
## Detection Rate : NaN
## Detection Prevalence : NaN
## Balanced Accuracy : NA
##
## 'Positive' Class : 1
##
## Name Accuracy Sensitivity Specificity
## 1 Logistic Regression 0.975 0.997 0.900
## 2 Naive Bayes 0.970 0.974 0.954
## 3 KNN 0.998 0.997 0.976
## 4 Bagging 0.985 0.929 1.000
## 5 Boosting 1.000 1.000 1.000
The Boosting Model has the highest Sensitivity Rate, that it can correctly predict the use of Cars in 100% of cases. It also has 100% Accuracy and 100% Specificity. If it predict Non usage of Cars, it is treu in 100% of cases
Top 2 most significant variable according to Logistic regression are : Distance, and Age.
Top 3 variables of Influence according to the Boosting Model are Salary, Age and Distance, while the least Variable of Influence are Engineer Profession, Gender and overall least is license
Logistic Regression Model, KNN and Naive Bayes were all able to predict the Transport with very High Accuracy.
However, using Bagging and Boosting, we can predict the Choice of Transportation Mode with 100% Accuracy
Any of the models Logistic Regression, KNN, Naive Bayes or Bagging and especially Boosting can be used for high Accuracy Prediction
Ensure SMOTE is properly done to balance the minority and majority class, this was what enhanced the accuracy of our model; if not, the model would not be that accurate.
Improve logistic regression model to see if it can outperform KNN or NB.
install.packages("tinytex")
## Installing package into '/home/rstudio-user/R/x86_64-pc-linux-gnu-library/4.0'
## (as 'lib' is unspecified)
tinytex::install_tinytex()
## tlmgr option sys_bin ~/bin
## tlmgr conf auxtrees add '/opt/R/4.0.2/lib/R/share/texmf'
#================================================== # # T H E - E N D # #================================================== # Generate the .R file from this .Rmd to hold the source code
purl(“CARS.Rmd”, documentation = 0) ```