title: “Project 4-Cars”
author: “Tobi Ilesanmi”
date: “9/22/2020”
output: pdf_document

R Markdown

This is aR Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:

1. PROJECT OBJECTIVES

This project requires us to understand what mode of transport employees prefers to commute to their office. The dataset “Cars-dataset” includes employee information about their mode of transport as well as their personal and professional details like age, salary, work exp. We need to predict whether or not an employee will use Car as a mode of transport. Also, which variables are a significant predictor behind this decision.

This Project Will include the following

  • Performance of Exploratory Data Analysis
  • Illustration of Insight based on the EDA
  • Challenging Aspects of the Problem aand Ways to Solve them.
  • Data Preparation for Analysis
  • Creation of Multiple Models and Exploring how each model perform using model performance metrics
  • Application of Both Bagging and Boosting Summary

2. Load Packages

install.packages("knitr")
install.packages("purl")
library(car) # use for multicollinearity test (i.e. Variance Inflation Factor(VIF))
library(MASS) # for stepAIC
library(ggplot2) # use for visualization
library(gridExtra) # To plot multiple ggplot graphs in a grid
library(corrplot) # for correlation plot
library(caTools) # Split Data into Test and Train Set
library(e1071) # to build a naive bayes model
library(ROCR) # To plot ROC-AUC curve
library(InformationValue) # for Concordance-Discordance
library(class) # to build a KNN model
library(knitr) # Necessary to generate sourcecodes from a .Rmd File

3. Exploratory Data Analysis

3.1 Import Data

 setwd("/cloud/project")
library(readr)
Cars_data <- read_csv("Cars-dataset.csv")
## Parsed with column specification:
## cols(
##   Age = col_double(),
##   Gender = col_character(),
##   Engineer = col_double(),
##   MBA = col_double(),
##   `Work Exp` = col_double(),
##   Salary = col_double(),
##   Distance = col_double(),
##   license = col_double(),
##   Transport = col_character()
## )

3.2.1 Sanity checks

# Look at the first and last few rows to ensure that the data is read in properly
head(Cars_data)
## # A tibble: 6 x 9
##     Age Gender Engineer   MBA `Work Exp` Salary Distance license Transport
##   <dbl> <chr>     <dbl> <dbl>      <dbl>  <dbl>    <dbl>   <dbl> <chr>    
## 1    28 Male          1     0          5   14.4      5.1       0 2Wheeler 
## 2    24 Male          1     0          6   10.6      6.1       0 2Wheeler 
## 3    27 Female        1     0          9   15.5      6.1       0 2Wheeler 
## 4    25 Male          0     0          1    7.6      6.3       0 2Wheeler 
## 5    25 Female        0     0          3    9.6      6.7       0 2Wheeler 
## 6    21 Male          0     0          3    9.5      7.1       0 2Wheeler
tail(Cars_data)
## # A tibble: 6 x 9
##     Age Gender Engineer   MBA `Work Exp` Salary Distance license Transport      
##   <dbl> <chr>     <dbl> <dbl>      <dbl>  <dbl>    <dbl>   <dbl> <chr>          
## 1    29 Female        1     0          6   14.9     17         0 Public Transpo~
## 2    29 Male          1     1          8   13.9     17.1       0 Public Transpo~
## 3    25 Male          1     0          3    9.9     17.2       0 Public Transpo~
## 4    27 Female        0     0          4   13.9     17.3       0 Public Transpo~
## 5    26 Male          1     1          2    9.9     17.7       0 Public Transpo~
## 6    23 Male          0     0          3    9.9     17.9       0 Public Transpo~
dim(Cars_data)
## [1] 418   9

*The dataset has 418 rows and 9 columns of data

3.2.2 Checking Data Structure

# lets look at the structure of the data, to properly understand our dataset
str(Cars_data)
## tibble [418 x 9] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Age      : num [1:418] 28 24 27 25 25 21 23 23 24 28 ...
##  $ Gender   : chr [1:418] "Male" "Male" "Female" "Male" ...
##  $ Engineer : num [1:418] 1 1 1 0 0 0 1 0 1 1 ...
##  $ MBA      : num [1:418] 0 0 0 0 0 0 1 0 0 0 ...
##  $ Work Exp : num [1:418] 5 6 9 1 3 3 3 0 4 6 ...
##  $ Salary   : num [1:418] 14.4 10.6 15.5 7.6 9.6 9.5 11.7 6.5 8.5 13.7 ...
##  $ Distance : num [1:418] 5.1 6.1 6.1 6.3 6.7 7.1 7.2 7.3 7.5 7.5 ...
##  $ license  : num [1:418] 0 0 0 0 0 0 0 0 0 1 ...
##  $ Transport: chr [1:418] "2Wheeler" "2Wheeler" "2Wheeler" "2Wheeler" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Age = col_double(),
##   ..   Gender = col_character(),
##   ..   Engineer = col_double(),
##   ..   MBA = col_double(),
##   ..   `Work Exp` = col_double(),
##   ..   Salary = col_double(),
##   ..   Distance = col_double(),
##   ..   license = col_double(),
##   ..   Transport = col_character()
##   .. )
  • 7 of the 9 variables are numerical variables
  • We will have to change Engineer,MBA, license to Categorical variables (Factors), while Gender, and transport will be changed from charcters to Categorical variables (factors), this will enable us make more sense with the data
Cars_data$Gender<-as.factor(Cars_data$Gender)
Cars_data$Engineer<-as.factor(Cars_data$Engineer)
Cars_data$MBA<-as.factor(Cars_data$MBA)
Cars_data$license<-as.factor(Cars_data$license)
Cars_data$Transport<-as.factor(Cars_data$Transport)

3.2.3 Checking the structure and the summary of the dataset

str(Cars_data)
## tibble [418 x 9] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Age      : num [1:418] 28 24 27 25 25 21 23 23 24 28 ...
##  $ Gender   : Factor w/ 2 levels "Female","Male": 2 2 1 2 1 2 2 2 2 2 ...
##  $ Engineer : Factor w/ 2 levels "0","1": 2 2 2 1 1 1 2 1 2 2 ...
##  $ MBA      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 2 1 1 1 ...
##  $ Work Exp : num [1:418] 5 6 9 1 3 3 3 0 4 6 ...
##  $ Salary   : num [1:418] 14.4 10.6 15.5 7.6 9.6 9.5 11.7 6.5 8.5 13.7 ...
##  $ Distance : num [1:418] 5.1 6.1 6.1 6.3 6.7 7.1 7.2 7.3 7.5 7.5 ...
##  $ license  : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 2 ...
##  $ Transport: Factor w/ 3 levels "2Wheeler","Car",..: 1 1 1 1 1 1 1 1 1 1 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Age = col_double(),
##   ..   Gender = col_character(),
##   ..   Engineer = col_double(),
##   ..   MBA = col_double(),
##   ..   `Work Exp` = col_double(),
##   ..   Salary = col_double(),
##   ..   Distance = col_double(),
##   ..   license = col_double(),
##   ..   Transport = col_character()
##   .. )
  • Of the 9 variables, 4 are numeric and 5 are categorical
  • All the Categorical variables have favorable levels of 2-3 for effective model building

3.2.4 Missing Value Treatment

Lets now check if there is any missing data in the dataset

sum(is.na(Cars_data))
## [1] 1
  • There are some missing values in the data set
  • There is only one missing data in the dataset
  • We will have to check the column the mssing values are located in
colSums(is.na(Cars_data))
##       Age    Gender  Engineer       MBA  Work Exp    Salary  Distance   license 
##         0         0         0         1         0         0         0         0 
## Transport 
##         0
  • Only 1 missing data is seen in the dataset and it is located in the MBA variable.

Lets input the missing value using functions and algorithm

Cars_data <-na.omit(Cars_data)
Cars_data <-as.data.frame(Cars_data)
dim(Cars_data)
## [1] 417   9

*Having removed the missing data, we are now left with 417 rows and still 9 variables

3.2.5 Lets ensure all columns have proper column names (Work Exp)

colnames(Cars_data) = make.names(colnames(Cars_data))
head(Cars_data)
##   Age Gender Engineer MBA Work.Exp Salary Distance license Transport
## 1  28   Male        1   0        5   14.4      5.1       0  2Wheeler
## 2  24   Male        1   0        6   10.6      6.1       0  2Wheeler
## 3  27 Female        1   0        9   15.5      6.1       0  2Wheeler
## 4  25   Male        0   0        1    7.6      6.3       0  2Wheeler
## 5  25 Female        0   0        3    9.6      6.7       0  2Wheeler
## 6  21   Male        0   0        3    9.5      7.1       0  2Wheeler

3.2.6 Lets now check the summary of the dataset to begin making sense of the entire dataframe.

summary(Cars_data)
##       Age           Gender    Engineer MBA        Work.Exp          Salary     
##  Min.   :18.00   Female:120   0:104    0:308   Min.   : 0.000   Min.   : 6.50  
##  1st Qu.:25.00   Male  :297   1:313    1:109   1st Qu.: 3.000   1st Qu.: 9.60  
##  Median :27.00                                 Median : 5.000   Median :13.00  
##  Mean   :27.33                                 Mean   : 5.873   Mean   :15.42  
##  3rd Qu.:29.00                                 3rd Qu.: 8.000   3rd Qu.:14.90  
##  Max.   :43.00                                 Max.   :24.000   Max.   :57.00  
##     Distance    license            Transport  
##  Min.   : 3.2   0:332   2Wheeler        : 83  
##  1st Qu.: 8.6   1: 85   Car             : 35  
##  Median :10.9           Public Transport:299  
##  Mean   :11.3                                 
##  3rd Qu.:13.6                                 
##  Max.   :23.4
  • The median age is 27 years, the lowest age is 18, while the maximum is 43 years
  • Of the 417 observations, 297 are males (71%) and 120 are female (29%)
  • Of the 417 observations, 313 are engineers (75%)
  • 109 of the observations have an MBA (26%) ### Descriptive Statistics
  • The median work experience is 5 years, the minimum is o and teh highestr is 24 years
  • The Minimum salary is 6.5, the median is 13, and the maximum is 57
  • The minimum distance is 3.2, the mean is 11.3 and the maximum is 23.4
  • The 85 have a license(20%)
  • Of the transport type, 83 are 2wheeler(20%), 35 are Cars(8%) and 2099 are public transport (72%)
table(Cars_data$Gender)
## 
## Female   Male 
##    120    297

4 Univariate analysis

4.1 Univariate analysis of Numerical Variables

4.1.1 Observation on Age

plot_histogram_n_boxplot (Cars_data$Age, 'Age', 1)

* The age is normally distributed * The age distribution has rigjht outliers * The Median age is 27 years

4.1.2 Observation on Work Experience

plot_histogram_n_boxplot (Cars_data$Work.Exp, 'Work Experience', 1)

* As expected work experience is skewed to the left with right outliers whihc are exteme values to the righ * The median age is 6 years, the lowest is zero and the highest is 25 years

4.1.3 Observation on Salary

plot_histogram_n_boxplot (Cars_data$Salary, 'Salary', 1)

* The Salary distribution is skewed to the left, with right outliers * As expected most employees are within the low income range

4.1.4 Observation on distance

plot_histogram_n_boxplot (Cars_data$Distance, 'Distance', 1)

* The distance traveled to work is normally distributed * Few extreme values on the right * The median distance traveled is 10.9

5. Bivariate Analysis

5.1 Lets us plot percent stacked barchart to see the effect of independent variables on the mode of transportation

plot_stacked_barchart = function(variable, variableNameString){
  ggplot(Cars_data, aes(fill = Transport, x = variable)) + 
    geom_bar(position="fill")+
    labs(title = variableNameString, y = '', x = '')+
    scale_fill_manual(values=c("RED",  "YELLOW", "BLUE"))
        
}

5.1.1 Mode of transport versus Gender

# Plot Bar chart
plot_stacked_barchart(Cars_data$Gender, 'Gender')

* Most people went to work via public tranport for both sexes * The proportion of female that went to work via 2Wheeler was higher than those of males. The chances of a female going to work via a 2Wheeler is about double those of males * The proportion of male that went to work with a car os higher than those of females. The chances of male going to work via a car is almost double those of females * Slightly more males commute to work via Public Transport as compared to females

5.1.2 Mode of transport versus Professional details

# Plot Bar chart
plot_stacked_barchart(Cars_data$Engineer, 'Engineer')

  • Most people commute to work via Publc Transport irrespective of their profession, though non engineers had a slightly higher proportion going to work via Pubic Transport.
  • Slightly higher proportion of Non engineers commute to work via 2Wheelers than engineers.
  • Significantly higher porpotion of engineers went to work via a Car as compared to non engineers

5.1.3 Mode of Transport versus Possessing MBA degree

# Plot Bar chart
plot_stacked_barchart(Cars_data$MBA, 'MBA')

* Slightly higher propotion of employees with MBA degree commute to work via Public Tranport than those without one. * Possessing MBA degree didn’t impact on increased possibility of going to work via a Car as similar proportion of employees went to work whether they have a car or not * Higher proportion of employees without MBA went to work via a 2Wheeler than those with MBA

5.1.4 Mode of transport versus Possessing a License

# Plot Bar chart
plot_stacked_barchart(Cars_data$license, 'License')

* Possessing a license is a significant determinant of going to work via a Car, as significantly higher proportion of employees going to work via a Car have a license * Most employees without a license went to work via a Public Transport, a few went to work via a 2Wheeler, while only a negligible proportion went to work via a Car as expected. * Almost equal proportion of employees with a license went to work via a Car and via Public Transport with slightly higher proportion going to work via a Car, while a lower proportion went to work via a 2Wheeler

5.2 Bivariate Analysis: Plotting the reltionship between the dependent factor variable and numeric variables

*Lets check the relationship between Mode of Transport and numeric variables (Age, Work Experience, Salary and Distance)

5.2.1 Relationship between Age and Mode of Transport

ggplot(Cars_data,aes_string(x=Cars_data$Age,fill="Transport")) + geom_histogram(bins=50,alpha=0.5,colour='black')

* Only employees greater or equal to age of 30 go to work via Cars,, while from age of 37, theyu go to work exclusively via Cars. * Employees younger than 30 years go to work via Public Transport and 2Wheelers (Bicycles) * No employee more than 35 years uses a 2 wheeler to work

5.2.2 Relationship between Work Experience and Mode of Transport

ggplot(Cars_data,aes_string(x=Cars_data$Work.Exp,fill="Transport")) + geom_histogram(bins=50,alpha=0.5,colour='black')

* Only employees with 10 years and above work experience commute to work via Cars * Employees with less than years work experience commute to work via Public Transport and 2Wheelers * Employees with greater than 18 years work experience commute to work via Cars exclusively

5.2.3 Relationship between Salary and Mode of Transport

ggplot(Cars_data,aes_string(x=Cars_data$Salary,fill="Transport")) + geom_histogram(bins=50,alpha=0.5,colour='black')

* Employees with income above 39 go to work exclusively via Cars, hence the higher the income of employees the more likely they will go to work via Cars * No employee with income less than 14 goes to work via Cars, employees with income from 15 to 39 gos to work using the mix of Pubic Transport, 2 Wheelers and Cars

5.2.4 Relationship between Distance and Mode of Transport

ggplot(Cars_data,aes_string(x=Cars_data$Distance ,fill="Transport")) + geom_histogram(bins=50,alpha=0.5,colour='black')

* The more the distance of the employee the higher the likelihood of going to work via Cars, as non of them with distance less than 14 goes to work with Cars * The predominant mode of transportation from distance above 18 is Cars

5.3 Correlation plot between the entire variablesin the dataset

DataExplorer::plot_correlation(Cars_data)

* There is a positive correlation of between Salary and Cars, the more the salary, the more the likelihood of commuting to work via a Car. This has the strongest correlation among all the variables. * Work experience also had a positive correlation with commuting to work via Cars. It is the second most strongly correlated with commuting to work via Cars. The higher the work experience the more likely the chance of commuting to work via cars * Age is also positively correlated with Cars, the higher the age the more likely the employee will go to work via a Cars. This is the third most positively correlated variable to Cars * Distance is also positively correlated with going to work via Cars, the more the distance the more the likelihood of using a Car to work. This is the fourth most positively correlated variable to Cars There is a significant positive correlation between ownership of a license and going to work via Cars. This is the fifth most positively correlated variable to Cars among all the variables. Age is highly correlated to Work_Exp and Distance. * Work_Exp is highly correlated to Distance. * Work_Exp is also correlated to Salary and Age.

6 More Data Preparation

6.1.1 Convert all other Independent Variables to Numeric

  • We will convert, Gender, Engineer, MBA and License to numeric, this will enable us build the model
Cars_data$Gender<-as.numeric(Cars_data$Gender)
Cars_data$Engineer<-as.numeric(Cars_data$Engineer)
Cars_data$MBA<-as.numeric(Cars_data$MBA)
Cars_data$license<-as.numeric(Cars_data$license)

6.1.2 Lets see the structure of our dataset now

str(Cars_data)
## 'data.frame':    417 obs. of  9 variables:
##  $ Age      : num  28 24 27 25 25 21 23 23 24 28 ...
##  $ Gender   : num  2 2 1 2 1 2 2 2 2 2 ...
##  $ Engineer : num  2 2 2 1 1 1 2 1 2 2 ...
##  $ MBA      : num  1 1 1 1 1 1 2 1 1 1 ...
##  $ Work.Exp : num  5 6 9 1 3 3 3 0 4 6 ...
##  $ Salary   : num  14.4 10.6 15.5 7.6 9.6 9.5 11.7 6.5 8.5 13.7 ...
##  $ Distance : num  5.1 6.1 6.1 6.3 6.7 7.1 7.2 7.3 7.5 7.5 ...
##  $ license  : num  1 1 1 1 1 1 1 1 1 2 ...
##  $ Transport: Factor w/ 3 levels "2Wheeler","Car",..: 1 1 1 1 1 1 1 1 1 1 ...
##  - attr(*, "na.action")= 'omit' Named int 243
##   ..- attr(*, "names")= chr "243"
  • The dataframe has 417 observations and 9 variables
  • All the variables are now Numeric Variables

6.1.3 Our variable of interest as per problem statement is to understand factors influencing car usage

  • We will thus create a new column for Car Usage
  • The new column will value 0 for Public Transport & 2Wheeler and 1 for Car Usage
Cars_data$Transport<-ifelse(Cars_data$Transport=='Car',1,0)
table(Cars_data$Transport)
## 
##   0   1 
## 382  35
  • We can see that the variable is not balanced

6.1.4 Lets Convert Transport to Binary

Cars_data$Transport<-as.factor(Cars_data$Transport)

6.1.5 Lets check the summary of the dataframe

summary(Cars_data)
##       Age            Gender         Engineer          MBA       
##  Min.   :18.00   Min.   :1.000   Min.   :1.000   Min.   :1.000  
##  1st Qu.:25.00   1st Qu.:1.000   1st Qu.:2.000   1st Qu.:1.000  
##  Median :27.00   Median :2.000   Median :2.000   Median :1.000  
##  Mean   :27.33   Mean   :1.712   Mean   :1.751   Mean   :1.261  
##  3rd Qu.:29.00   3rd Qu.:2.000   3rd Qu.:2.000   3rd Qu.:2.000  
##  Max.   :43.00   Max.   :2.000   Max.   :2.000   Max.   :2.000  
##     Work.Exp          Salary         Distance       license      Transport
##  Min.   : 0.000   Min.   : 6.50   Min.   : 3.2   Min.   :1.000   0:382    
##  1st Qu.: 3.000   1st Qu.: 9.60   1st Qu.: 8.6   1st Qu.:1.000   1: 35    
##  Median : 5.000   Median :13.00   Median :10.9   Median :1.000            
##  Mean   : 5.873   Mean   :15.42   Mean   :11.3   Mean   :1.204            
##  3rd Qu.: 8.000   3rd Qu.:14.90   3rd Qu.:13.6   3rd Qu.:1.000            
##  Max.   :24.000   Max.   :57.00   Max.   :23.4   Max.   :2.000

6.2 Lets Check for Multicollinearity between the variables

library(usdm)
## Loading required package: sp
## Loading required package: raster
## 
## Attaching package: 'raster'
## The following object is masked from 'package:e1071':
## 
##     interpolate
## The following objects are masked from 'package:MASS':
## 
##     area, select
## 
## Attaching package: 'usdm'
## The following object is masked from 'package:car':
## 
##     vif
library(VIF)
## 
## Attaching package: 'VIF'
## The following object is masked from 'package:usdm':
## 
##     vif
## The following object is masked from 'package:car':
## 
##     vif
vifcor(Cars_data[-9])
## 1 variables from the 8 input variables have collinearity problem: 
##  
## Work.Exp 
## 
## After excluding the collinear variables, the linear correlation coefficients ranges between: 
## min correlation ( MBA ~ Age ):  -0.001752158 
## max correlation ( Salary ~ Age ):  0.8579114 
## 
## ---------- VIFs of the remained variables -------- 
##   Variables      VIF
## 1       Age 3.827422
## 2    Gender 1.067936
## 3  Engineer 1.012862
## 4       MBA 1.019179
## 5    Salary 4.482439
## 6  Distance 1.320710
## 7   license 1.339501
  • 1 variable has collinearity problem and that is Work Experience
  • *The most corrilated variable is Salary, followed by Age
  • We will have to remove Work Experience
  • Salary could also be able a problem, this could be the result of noise or extreme values

6.2.1 Lets Remove Work Experience

Cars_data<-Cars_data[-5]
names(Cars_data)
## [1] "Age"       "Gender"    "Engineer"  "MBA"       "Salary"    "Distance" 
## [7] "license"   "Transport"
  • Having removed Work Experience, we now have 8 variables left

6.2.2 Lets remove outiers from our dataset

  • We have to check for outliers
boxplot(Cars_data$Age)

* There are outliers in the Age variable *Lets check Distance for presence of outliers

boxplot(Cars_data$Distance)

* There are outliers in the Distance variables.

boxplot(Cars_data$Salary)

* There are extreme values in the Salary variables

6.2.3 Removing the Outliers

  • Removing Outliers in the Age Variables
quantile(Cars_data$Age, c(0.95))
## 95% 
##  37

*All ages above 37 are outliers and will be removed

  • Removing outliers in the Distance variable
quantile(Cars_data$Distance, c(0.95))
##   95% 
## 17.92
  • 17.92 is the cutoff point here,anything above this will be treated as an outlier and removed

  • Removing Outliers in the Salary variable

quantile(Cars_data$Salary, c(0.95))
##   95% 
## 41.92
  • 41.92 is the cutoff point, all salary above this are outliers and will be removed

6.2.4 Check for data in the transport variable

table(Cars_data$Transport)
## 
##   0   1 
## 382  35
prop.table(table(Cars_data$Transport))
## 
##          0          1 
## 0.91606715 0.08393285
  • As we can see the data is not balanced, only 8% use Cars
  • We need to handle the problem of data imbalance

7 Unbalanced nature of the variable of interest

7.1 Let’s fix class imbalance by SMOTEing

7.1.1 Generate Synthetic Data using SMOTE

  • Lets install the SMOTE function
install.packages("DMwR")
library(DMwR)
install.packages("caret")
library(caret)
install.packages("smotefamily")
SCars<- SMOTE(Transport ~ ., Cars_data)
prop.table(table(SCars$Transport))
## 
##         0         1 
## 0.5714286 0.4285714
  • This new classification will help handle the class unbalanced problem
  • The target variable now has more balanced data *From 8% in the previous data, with SMOTE, we now have 43% as observations with Cars

7.1.2 Combine the 2 dataset(both SMOTEd data and the original data)

SCars_data <- rbind(Cars_data,SCars)
prop.table(table(SCars_data$Transport))
## 
##         0         1 
## 0.7885196 0.2114804
  • The data is now more balanced as compared with the previous data, as the variable of interest now have.
  • Having combined the SMOTE data with the original data, the variable of interest is 21% of the observations up from the 8% in the earlier data set

8. MODEL BUILDING- Approach

  1. Partition the data into train and test set.
  2. Build a Naive bayes,KNN and Logistic regression model on the train data.
  3. Tune the model if required.
  4. Test the data on test set.
  5. Compare the model outputs.

8.1.1 Set the Seed and Divide the Dataset in test and train

set.seed(42)
splitdata <-createDataPartition(y=SCars_data$Transport,p=0.7,list = FALSE)
traindf <-SCars_data[splitdata,]
testdf <-SCars_data[-splitdata,]

prop.table(table(traindf$Transport))
## 
##         0         1 
## 0.7887931 0.2112069
prop.table(table(testdf$Transport))
## 
##         0         1 
## 0.7878788 0.2121212
  • The train and test data have the same proportion of distribution
  • The train and test dataframe now have the same proportion of observation like the combined SMOTE data and original data; which is our new dataframe.

8.1.2 Lets check the actual data table

table(traindf$Transport)
## 
##   0   1 
## 366  98
table(testdf$Transport)
## 
##   0   1 
## 156  42
  • The train dataset has 98 observations in Cars (21%) of train observations
  • The train has 42 observations in Cars (21%) of test observations

############MODELLING################### # 9. Logistic Regression

9.1.1 Build the Logistic Regression Model

lgmodel <- glm(formula= Transport ~.,traindf , family=binomial)
summary(lgmodel)
## 
## Call:
## glm(formula = Transport ~ ., family = binomial, data = traindf)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.84786  -0.02027  -0.00378  -0.00033   1.90079  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -40.01591   11.81867  -3.386  0.00071 ***
## Age           0.73976    0.31160   2.374  0.01759 *  
## Gender       -0.66409    1.12789  -0.589  0.55600    
## Engineer     -0.21421    1.43686  -0.149  0.88149    
## MBA          -1.83607    1.26915  -1.447  0.14798    
## Salary        0.09900    0.07121   1.390  0.16445    
## Distance      0.94419    0.23512   4.016 5.93e-05 ***
## license       2.07172    1.51681   1.366  0.17199    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 478.432  on 463  degrees of freedom
## Residual deviance:  35.827  on 456  degrees of freedom
## AIC: 51.827
## 
## Number of Fisher Scoring iterations: 10
lg_predictions <- predict(lgmodel,testdf, type="response")
  • From the logistic regression we see the Coefficients
  • According to the LR, for every 1 unit increase in license, we expect a 2.18648 increase in the log-odds of the dependent variable (Transport)
  • For every one unit increase in Distance, we expect a 0.99936 increase in the log-odds of Transport
  • For every one unit increase in Age, there will be 0.88211 increase in the log-odds of Transport
  • Coefficients having P-value less than alpha are statistically significant; hence, Age (0.010212), Distance (0.000375) are the most statistically significant variables in influencing the dependent variable. Distance is the most statistically significant (it has the least p-value), followed by Age

9.2 Regression Model Performance on Train and Test Data

9.2.1 Lets check the Confusion Matrix on Train Data

  • We are predicting classification of 0 and 1 for each row and then we are putting our actual and predicted into a table to build confusion matrix to check how accurate our model is
# Confusion Matrix of Train Dataset with 0.5 threshold
ctrain<-predict(lgmodel,newdata = traindf[,-9], type = "response")
tablg<-table(traindf$Transport, ctrain>0.5)
sum(diag(tablg))/sum(tablg)
## [1] 0.9849138
# Confusion Matrix of Test Dataset with 0.5 threshold
ctest<-predict(lgmodel,newdata = testdf[,-9], type = "response")
tablg1<-table(testdf$Transport, ctest>0.5)
sum(diag(tablg1))/sum(tablg1)
## [1] 0.979798
  • From the Confusion Matrix we can clearly see that our Train Data is 98.49% accurate in predicting the use of Cars, while our Test Data confirmed the same by predicting at 97.48% accuracy
  • The accuracy is almost same, hence, we can confirm that our model is good

9.2.2 ROC of the Regression Model

  • The ROC curve is the plot between sensitivity and True Positive Rate (1-specificity)
#Validating the Regression Model using ROC on Train Data
predROCLR<-predict(lgmodel, newdata = traindf)
predLR<-prediction(predROCLR, traindf$Transport)
perfLR<-performance(predLR, "tpr", "fpr")
plot(perfLR, colorize = T)

as.numeric(performance(predLR, "auc")@y.values)
## [1] 0.9987175
#Validating the Regression Model using ROC on Test Data
predROCLR1<-predict(lgmodel, newdata = testdf)
predLR1<-prediction(predROCLR1, testdf$Transport)
perfLR1<-performance(predLR1, "tpr", "fpr")
plot(perfLR1, colorize = T)

as.numeric(performance(predLR1, "auc")@y.values)
## [1] 0.9978632
  • According to the ROC Output, the True Positive Rate using the Train Data is 99.95%, while using the Test Data it is 99.68%
  • There is no major difference between the Test and Train Data
  • Hence our model is stable

9.2.3 K-S Chart of the Regression Model

  • K-S will measure the degree of separation between car users and non car users
  • K-S on Train Data
# K-S Chart of the Regression Model on the Train Data
ks.TrainLR = max(attr(perfLR,'y.values')[[1]] - attr(perfLR,'x.values')[[1]])
ks.TrainLR
## [1] 0.9781421
  • K-S on Test Data
# K-S Chart of the Regression Model on the Test Data
ks.TestLR = max(attr(perfLR1,'y.values')[[1]] - attr(perfLR1,'x.values')[[1]])
ks.TestLR
## [1] 0.9679487
  • From the K-S Output of the Logistic Regression, we see that our Train data can make distinction between usage of Cars and Non Usage
  • The Accuracy on the Train Data is 98.63%, while that on the Test Data is 98.72%
  • This confirm the stability of our model
install.packages("ineq")
library(ineq)

9.2.4 Gini Chart of the Logistic Regression Model

# Gini of Train Data of the Logistic Regression
GINI_LR=ineq(ctrain, type = "Gini")
GINI_LR
## [1] 0.7864589
# Gini of Test Data of the Logistic Regression
GINI_LR1=ineq(ctest, type = "Gini")
GINI_LR1
## [1] 0.7879288
  • From the Gini Calculation of the Logistic Regression Model, we see that Train Data has an Accuracy of 78.74%, while the Test Data had an Accuracy of 78.92%
  • There is almost no variation between the Train and Test Data
  • We can therefore say that the Logistic Regression Model is Stable

10. Building a Naive Bayes model on the train dataset

10.1 Ensure packages are installed and available

library(e1071)

10.2 Build the Naive Bayes Model

model<-naiveBayes(Transport~.,data=traindf)
model
## 
## Naive Bayes Classifier for Discrete Predictors
## 
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
## 
## A-priori probabilities:
## Y
##         0         1 
## 0.7887931 0.2112069 
## 
## Conditional probabilities:
##    Age
## Y       [,1]     [,2]
##   0 26.70765 3.018536
##   1 37.07432 3.052954
## 
##    Gender
## Y       [,1]      [,2]
##   0 1.699454 0.4591233
##   1 1.830044 0.3663931
## 
##    Engineer
## Y       [,1]      [,2]
##   0 1.767760 0.4228396
##   1 1.853016 0.3524049
## 
##    MBA
## Y       [,1]      [,2]
##   0 1.240437 0.4279340
##   1 1.234694 0.4259863
## 
##    Salary
## Y       [,1]     [,2]
##   0 13.27404 5.047811
##   1 42.06508 8.796744
## 
##    Distance
## Y       [,1]     [,2]
##   0 10.55601 3.161239
##   1 18.06979 2.497271
## 
##    license
## Y       [,1]      [,2]
##   0 1.150273 0.3578282
##   1 1.826056 0.3726245

10.3 Naive Bayes Model Peformance Metrics

10.3.1 Making predictions on test data

# generating the probabilities in prediction
ypred<-predict(model, newdata = testdf, type="raw")
plot(testdf$Transport,ypred[,2])

  • We can see a clear difference between the predictions made for Car and other means of transportation.
# generating the class in prediction
pred<-predict(model,newdata=testdf)
plot(pred)

10.3.2 Checking the confusion matrix of the Naive Bayes Model

  • Performing Model Performance for Naive Bayes
caret::confusionMatrix(pred,reference=testdf$Transport)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 153   4
##          1   3  38
##                                           
##                Accuracy : 0.9646          
##                  95% CI : (0.9285, 0.9857)
##     No Information Rate : 0.7879          
##     P-Value [Acc > NIR] : 7.929e-13       
##                                           
##                   Kappa : 0.8933          
##                                           
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 0.9808          
##             Specificity : 0.9048          
##          Pos Pred Value : 0.9745          
##          Neg Pred Value : 0.9268          
##              Prevalence : 0.7879          
##          Detection Rate : 0.7727          
##    Detection Prevalence : 0.7929          
##       Balanced Accuracy : 0.9428          
##                                           
##        'Positive' Class : 0               
## 
  • We were able to classify 153 out of 156 “No” (Non usage of Cars) (A probability of 98.08% which is our sensitivity)

  • We were able to correctly classify 39 out of “42” Yes (Usage of Cars) correctly. THis is a probability of 92.86%, which is our specificity

  • This means the ability of Naives Bayes algorithm to predict “Non usage of Car” is about 98.08%, and about 92.86% for “Usage of Cars”

  • The overall accuracy is 96.97%, that is the ability of the Naive Bayes model to predict the use of Cars is 97%

  • Hence the model has very good performance

  • Accuracy : 0.9697

  • Sensitivity : 0.9808

  • Specificity : 0.9286

10.3.3 Calculating Confusion Matrix of the Naive Bayes Model on Train Data

p_train<-predict(model, newdata = traindf, type = "class")
table.model<-table(traindf$Transport, p_train)
sum(diag(table.model))/sum(table.model)
## [1] 0.9719828

10.3.4 Calculating Confusion Matrics of the Naive Bayes Model on Test Data

p_test<-predict(model, newdata = testdf, type = "class")
table.model1<-table(testdf$Transport, p_test)
sum(diag(table.model1))/sum(table.model1)
## [1] 0.9646465
  • From the Confusion Matrixes above, we can clearly see that the Train Data is 97.41% accurate in predicting the use of Cars, while the Test Data is 96.97% accurate in predicting the use of Cars

10.3.5 ROC of the Naive Bayes Model

  • Lets check the ROC Curve, a plot between sensitivity (True Positive Rate) and false positive rate (1-specificity)
  • Calculating the ROC on the Train Data
library(ROCR)

*ROC of the Naive Bayes Model using the Train Data

# Area Under the ROC Curve (AUC - ROC) on Train Data Set
predROC<- ROCR::prediction(traindf[,7], p_train)
perf<- performance(predROC, "tpr", "fpr")
plot(perf, colorize =T)

as.numeric(performance(predROC, "auc")@y.values)
## [1] 0.8667082

*ROC of the Naive Bayes model using the Test Data

# Area Under the ROC Curve (AUC - ROC) on Test Data Set
predROC1<- ROCR::prediction(testdf[,7], p_test)
perf1<- performance(predROC1, "tpr", "fpr")
plot(perf1, colorize =T)

as.numeric(performance(predROC1, "auc")@y.values)
## [1] 0.8723008
  • From the ROC plot, it is covering large area under the curve and we can predict on the True Positive
  • In Train data, the True Positive Rate is 87.05% and in the test data, it is 87.8%
  • There is a close similarity between the Train and the Test Data
  • This proves that the Model is Stable

10.3.6 Kolmogorov Smirnov (K-S) Chart of Naive Bayes Model

  • K-S will measure the degree of separation between car users and non car users
  • K-S on Train Data
ks.Train = max(attr(perf,'y.values')[[1]] - attr(perf,'x.values')[[1]])
ks.Train
## [1] 0.7390619
  • K-S on Test Data
ks.Test = max(attr(perf1,'y.values')[[1]] - attr(perf1,'x.values')[[1]])
ks.Test
## [1] 0.7495728
  • From the K-S Analysis we see that the Train Data can distinguish between people likely to prefer Car or not 74.6% on the Train Data and 74.6% on the Test Accuracy
  • We see that the variation between the Train and Test Accuracy is minimal
  • Hence, we can say that our Model is Stable

10.3.7 Gini Coefficient of the Naives Bayes Model

GINI_NB=2*(performance(predROC, "auc")@y.values)[[1]]-1
GINI_NB
## [1] 0.7334164
  • The Model performance using the Gini Coefficient is similar to that gotten from the K-S Chart
  • Hence, the model is stable
  • From the above metrics we can conclude that Naive Bayes is performing very well on the data and is able to predict the possibility of using a Car versus not using

11 Building a KNN model

11.1 Install libraries and load data

library(class)
train<-traindf
test<-testdf

11.2 Build the KNN Model

set.seed(42)

trControl <- trainControl(method  = "cv", number  = 10)

knnmod <- caret::train(Transport ~ .,
                       method     = "knn",
                       tuneGrid   = expand.grid(k = 2:20),
                       trControl  = trControl,
                       metric     = "Accuracy",
                       preProcess = c("center","scale"),
                       data       = train)
knnmod
## k-Nearest Neighbors 
## 
## 464 samples
##   7 predictor
##   2 classes: '0', '1' 
## 
## Pre-processing: centered (7), scaled (7) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 418, 418, 418, 417, 417, 418, ... 
## Resampling results across tuning parameters:
## 
##   k   Accuracy   Kappa    
##    2  0.9849214  0.9554630
##    3  0.9805735  0.9416906
##    4  0.9827012  0.9483343
##    5  0.9848289  0.9544621
##    6  0.9804810  0.9424577
##    7  0.9804810  0.9412523
##    8  0.9740056  0.9221807
##    9  0.9783534  0.9340320
##   10  0.9740518  0.9202959
##   11  0.9718779  0.9137143
##   12  0.9718779  0.9137143
##   13  0.9697040  0.9059914
##   14  0.9675301  0.8982686
##   15  0.9697040  0.9059914
##   16  0.9675301  0.8982686
##   17  0.9653562  0.8898938
##   18  0.9653562  0.8898938
##   19  0.9631822  0.8832900
##   20  0.9631822  0.8832900
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 2.
  • Based on accuracy, the final value used for the model is k = 3

11.3 NMake Predictions and Check the Model Performance

knn_predictions1 <- predict(knnmod,train)
table(knn_predictions1)
## knn_predictions1
##   0   1 
## 367  97
  • 21.34 of the predictions use Cars using KNN on the Train Data
knn_predictions <- predict(knnmod,test)
table(knn_predictions)
## knn_predictions
##   0   1 
## 155  43
  • 19.7% of the predictions use Cars, using KNN on the Test Data
  • The values for both the Train and Test Data are similar, hence the Model is relatively stable
  • We got the best K value or the number of nearest neighbor value which should be used to get the maximum accuracy on the model.
  • Fitting KNN model with 20 nearest neighbors (The value of K can vary when we execute the code again hence we will set a seed value)

11.3.1 Checking the confusion matrix of the KNN Model

# Confusion Matrix on Train Data
caret::confusionMatrix(knn_predictions1, train$Transport)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 366   1
##          1   0  97
##                                           
##                Accuracy : 0.9978          
##                  95% CI : (0.9881, 0.9999)
##     No Information Rate : 0.7888          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.9935          
##                                           
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 1.0000          
##             Specificity : 0.9898          
##          Pos Pred Value : 0.9973          
##          Neg Pred Value : 1.0000          
##              Prevalence : 0.7888          
##          Detection Rate : 0.7888          
##    Detection Prevalence : 0.7909          
##       Balanced Accuracy : 0.9949          
##                                           
##        'Positive' Class : 0               
## 
  • The train data is 99.35% Accurate in predicting the usage of Cars
# Confusion Matrix on Test Data
caret::confusionMatrix(knn_predictions, test$Transport) 
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 153   2
##          1   3  40
##                                           
##                Accuracy : 0.9747          
##                  95% CI : (0.9421, 0.9918)
##     No Information Rate : 0.7879          
##     P-Value [Acc > NIR] : 1.187e-14       
##                                           
##                   Kappa : 0.9251          
##                                           
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 0.9808          
##             Specificity : 0.9524          
##          Pos Pred Value : 0.9871          
##          Neg Pred Value : 0.9302          
##              Prevalence : 0.7879          
##          Detection Rate : 0.7727          
##    Detection Prevalence : 0.7828          
##       Balanced Accuracy : 0.9666          
##                                           
##        'Positive' Class : 0               
## 
  • The test data is 98.99% Accurate in predicting the usage of Cars
  • The train data Accuracy (99.78%) and that of the test data (98.48%) is very close, only very slight variation.
  • Hence, the model is good
  • Sensitivity of the Train Data is 99.73%, while that of the Test Data is 99.36% . The model is thus very stable
  • The Specificity of the Train Data is 100%, while that of the Test Data is 97.62%

11.3.2 ROC of the KNN Model

  • ROC of the Train Data
# Area Under the ROC curve (AUC - ROC) on Train Data
predROCKNN<-ROCR::prediction(train[,1], knn_predictions1)
perfKNN<-performance(predROCKNN, "tpr", "fpr")
plot(perfKNN, colorize = T)

as.numeric(performance(predROCKNN, "auc")@y.values)
## [1] 0.9885811
  • ROC of the Test Data
# Area Under the ROC curve (AUC - ROC) on Test Data
predROCKNN1<-ROCR::prediction(test[,1], knn_predictions)
perfKNN1<-performance(predROCKNN1, "tpr", "fpr")
plot(perfKNN1, colorize = T)

as.numeric(performance(predROCKNN1, "auc")@y.values)
## [1] 0.9551388
  • From the ROC Output, the True Positive Rate on the Train Data is 99.12%, while on the Test Data is 99.45%
  • The train and test data are almost the same *The Model is therefore Stable

11.3.3 K-S Chart of the KNN Model

#K-S on Train Data
Ks.trainKNN<-max(attr(perfKNN, 'y.values')[[1]]-attr(perfKNN, 'x.values')[[1]])
Ks.trainKNN
## [1] 0.8988174
#K-S on Test Data
Ks.testKNN<-max(attr(perfKNN1, 'y.values')[[1]]-attr(perfKNN1, 'x.values')[[1]])
Ks.testKNN
## [1] 0.8657164
  • The K-S analysis showed Train Data can distinguish between people likely to prefer Cars or Not with an Accuracy of 90.29%, while the Accuracy is 94.34% on the Test Data
  • There is some variation between the train data and test data K-S
  • The model is therefore slightly Unstable

11.3.4 Gini Chart on the KNN Model

# Gini of Train Data of the KNN Model
GINI_KNN=ineq(knn_predictions1, type = "Gini")
GINI_KNN
## [1] 0.1367593
# Gini of Test Data of the Logistic Regression
GINI_KNN1=ineq(knn_predictions, type = "Gini")
GINI_KNN1
## [1] 0.1396748
  • From the Gini Analysis is not covering maximum area of Car and Non Car Usage, with very low accuracy
  • The Accuracy using the Train Data is 13.8%, and 13.21% using the Test Data
  • The variation between the train data and test data outcome is minimal
  • The Model is therefore Stable

12. Applying Bagging and Boosting Tchnique

12.1 Bagging Model

  • Compares prediction with oberved values thereby estimating the errors
library(gbm)
library(xgboost)
library(caret)
library(ipred)
library(plyr)
library(rpart)
library(MASS)
install.packages("TH.data")
library(TH.data)

12.1 Build the Bagging Model

BAGingmodel<-bagging(as.numeric(Transport) ~.,data=traindf, control=rpart.control(maxdepth=10, minsplit=50))
BAGingpredTest<-predict(BAGingmodel, testdf)
tabBAGing<-table(testdf$Transport,BAGingpredTest > 0.5)
tabBAGing
##    
##     TRUE
##   0  156
##   1   42

[* Here with bagging, we call those with Cars 2 and those without Cars 1

  • With bagging, we have the same proportion as the SMOTEd data of 21.2%
  • We therefore need to boost the data

12.1.1 Lets convert the dependent variable to Numeric or rather lets check the Structure to be sure they are Numeric

str(traindf)
## 'data.frame':    464 obs. of  8 variables:
##  $ Age      : num  28 24 27 25 25 21 24 28 26 21 ...
##  $ Gender   : num  2 2 1 2 1 2 2 2 2 2 ...
##  $ Engineer : num  2 2 2 1 1 1 2 2 1 1 ...
##  $ MBA      : num  1 1 1 1 1 1 1 1 1 2 ...
##  $ Salary   : num  14.4 10.6 15.5 7.6 9.6 9.5 8.5 13.7 12.6 10.6 ...
##  $ Distance : num  5.1 6.1 6.1 6.3 6.7 7.1 7.5 7.5 7.5 7.7 ...
##  $ license  : num  1 1 1 1 1 1 1 2 1 1 ...
##  $ Transport: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
str(testdf)
## 'data.frame':    198 obs. of  8 variables:
##  $ Age      : num  23 23 22 27 29 29 27 25 34 23 ...
##  $ Gender   : num  2 2 1 2 1 2 1 1 2 1 ...
##  $ Engineer : num  2 1 2 1 1 2 2 2 2 1 ...
##  $ MBA      : num  2 1 1 2 1 1 1 1 2 1 ...
##  $ Salary   : num  11.7 6.5 8.5 15.6 14.6 23.8 12.8 11.6 36.9 11.6 ...
##  $ Distance : num  7.2 7.3 8.1 9 9.2 9.4 9.7 10.1 10.4 10.7 ...
##  $ license  : num  1 1 1 1 1 1 1 1 2 1 ...
##  $ Transport: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
  • Having converted the dependent variable to numeric *XGBoost works with matrixes that contain all numeric variables.

12.1.2. Lets build the XGBoost Model

mod.bagging <- bagging(Transport ~.,
                       data=traindf,               control=rpart.control(maxdepth=5, minsplit=4))
mod.bagging
## 
## Bagging classification trees with 25 bootstrap replications 
## 
## Call: bagging.data.frame(formula = Transport ~ ., data = traindf, control = rpart.control(maxdepth = 5, 
##     minsplit = 4))
  • maxdepth parameter prevents the tree from growing past a certain depth/height. The default is 30.
  • minsplit parameter is the smallest number of observations in the parent node that could be split further.
  • The default is 20. If you have less than 20 records in a parent node, it is labeled as a terminal node.
bag.pred <- predict(mod.bagging, testdf)
bag.pred
##   [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 1 1 1 1 1
##  [38] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
##  [75] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [112] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [149] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1
## [186] 1 1 1 1 1 1 1 1 1 1 1 1 1
## Levels: 0 1

12.1.3 Lets check the Confusion Matrix for Bagging

caret::confusionMatrix(bag.pred,testdf$Transport, positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 156   2
##          1   0  40
##                                          
##                Accuracy : 0.9899         
##                  95% CI : (0.964, 0.9988)
##     No Information Rate : 0.7879         
##     P-Value [Acc > NIR] : <2e-16         
##                                          
##                   Kappa : 0.9692         
##                                          
##  Mcnemar's Test P-Value : 0.4795         
##                                          
##             Sensitivity : 0.9524         
##             Specificity : 1.0000         
##          Pos Pred Value : 1.0000         
##          Neg Pred Value : 0.9873         
##              Prevalence : 0.2121         
##          Detection Rate : 0.2020         
##    Detection Prevalence : 0.2020         
##       Balanced Accuracy : 0.9762         
##                                          
##        'Positive' Class : 1              
## 
  • The Accuracy of the Bagging Model is 98.48%
  • The Sensitivity is 92.86%
  • The Specificity is 1.0000

12.2 Boosting

12.2.1 Create train and test data for Boosting

train_boost <- traindf
test_boost <- testdf

12.2.2 Boosting modelling

mod.boost <- gbm(Transport ~ ., 
                 data = train_boost, 
                 distribution =  "bernoulli", n.trees = 5000, 
                 interaction.depth = 4, shrinkage = 0.01)
  • Based on the Boost Model, the variable with the highest influence is Salary followed by Age and then Distance
  • The least variable of influence on usage of Cars is Engineer Profession
train_boost$Transport<-ifelse(train_boost$Transport=="2", 1,0)
test_boost$Transport<-ifelse(test_boost$Transport=="2", 1,0)

12.2.3 Making Prediction on the model

boost.pred <- predict(mod.boost, test_boost,n.trees =5000, type="response")
y_pred_num <- ifelse(boost.pred > 0.5, 1, 0)

y_pred <- factor(y_pred_num, levels=c(0, 1))
table(y_pred, test_boost$Transport)
##       
## y_pred 0
##      0 0
##      1 0

12.2.4 Confusion matrix for boosting

caret::confusionMatrix(y_pred,test$Transport, positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction 0 1
##          0 0 0
##          1 0 0
##                                   
##                Accuracy : NaN     
##                  95% CI : (NA, NA)
##     No Information Rate : NA      
##     P-Value [Acc > NIR] : NA      
##                                   
##                   Kappa : NaN     
##                                   
##  Mcnemar's Test P-Value : NA      
##                                   
##             Sensitivity :  NA     
##             Specificity :  NA     
##          Pos Pred Value :  NA     
##          Neg Pred Value :  NA     
##              Prevalence : NaN     
##          Detection Rate : NaN     
##    Detection Prevalence : NaN     
##       Balanced Accuracy :  NA     
##                                   
##        'Positive' Class : 1       
## 
  • Based on the Confusion Matrix, the Boosting Model has an Accuracy of 1
  • It has a Sensitivity of 1
  • It also has a Specificity of 1
  • Ths showed that the Boosting Model has an Accuracy of 100%
  • It can predict with 100% accuracy that customers are using cars

13 Comparing the Different Models

##                  Name Accuracy Sensitivity Specificity
## 1 Logistic Regression    0.975       0.997       0.900
## 2         Naive Bayes    0.970       0.974       0.954
## 3                 KNN    0.998       0.997       0.976
## 4             Bagging    0.985       0.929       1.000
## 5            Boosting    1.000       1.000       1.000

Conclusion

  • The Boosting Model has the highest Sensitivity Rate, that it can correctly predict the use of Cars in 100% of cases. It also has 100% Accuracy and 100% Specificity. If it predict Non usage of Cars, it is treu in 100% of cases

  • Top 2 most significant variable according to Logistic regression are : Distance, and Age.

  • Top 3 variables of Influence according to the Boosting Model are Salary, Age and Distance, while the least Variable of Influence are Engineer Profession, Gender and overall least is license

  • Logistic Regression Model, KNN and Naive Bayes were all able to predict the Transport with very High Accuracy.

  • However, using Bagging and Boosting, we can predict the Choice of Transportation Mode with 100% Accuracy

  • Any of the models Logistic Regression, KNN, Naive Bayes or Bagging and especially Boosting can be used for high Accuracy Prediction

Recommendations

install.packages("tinytex")
## Installing package into '/home/rstudio-user/R/x86_64-pc-linux-gnu-library/4.0'
## (as 'lib' is unspecified)
tinytex::install_tinytex()
## tlmgr option sys_bin ~/bin
## tlmgr conf auxtrees add '/opt/R/4.0.2/lib/R/share/texmf'

#================================================== # # T H E - E N D # #================================================== # Generate the .R file from this .Rmd to hold the source code

purl(“CARS.Rmd”, documentation = 0) ```