To understand the factors that influence the use of cars as a mode of transport, and to identify the model that best explains an employee's decision to use a car as the main means of transport.
Read the data and understand the structure
library(caret)
## Warning: package 'caret' was built under R version 3.4.4
## Loading required package: lattice
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 3.4.4
## Warning in as.POSIXlt.POSIXct(Sys.time()): unknown timezone 'zone/tz/2018g.
## 1.0/zoneinfo/Asia/Kolkata'
library(car)
## Warning: package 'car' was built under R version 3.4.4
## Loading required package: carData
## Warning: package 'carData' was built under R version 3.4.4
library(DMwR)
## Loading required package: grid
library(glmnet)
## Warning: package 'glmnet' was built under R version 3.4.4
## Loading required package: Matrix
## Warning: package 'Matrix' was built under R version 3.4.4
## Loading required package: foreach
## Warning: package 'foreach' was built under R version 3.4.3
## Loaded glmnet 2.0-16
setwd("~/Documents")
carsbasedata<-read.csv("cars.csv", header = TRUE)
str(carsbasedata)
## 'data.frame': 444 obs. of 9 variables:
## $ Age : int 28 23 29 28 27 26 28 26 22 27 ...
## $ Gender : Factor w/ 2 levels "Female","Male": 2 1 2 1 2 2 2 1 2 2 ...
## $ Engineer : int 0 1 1 1 1 1 1 1 1 1 ...
## $ MBA : int 0 0 0 1 0 0 0 0 0 0 ...
## $ Work.Exp : int 4 4 7 5 4 4 5 3 1 4 ...
## $ Salary : num 14.3 8.3 13.4 13.4 13.4 12.3 14.4 10.5 7.5 13.5 ...
## $ Distance : num 3.2 3.3 4.1 4.5 4.6 4.8 5.1 5.1 5.1 5.2 ...
## $ license : int 0 0 0 0 0 1 0 0 0 0 ...
## $ Transport: Factor w/ 3 levels "2Wheeler","Car",..: 3 3 3 3 3 3 1 3 3 3 ...
summary(carsbasedata)
## Age Gender Engineer MBA
## Min. :18.00 Female:128 Min. :0.0000 Min. :0.0000
## 1st Qu.:25.00 Male :316 1st Qu.:1.0000 1st Qu.:0.0000
## Median :27.00 Median :1.0000 Median :0.0000
## Mean :27.75 Mean :0.7545 Mean :0.2528
## 3rd Qu.:30.00 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :43.00 Max. :1.0000 Max. :1.0000
## NA's :1
## Work.Exp Salary Distance license
## Min. : 0.0 Min. : 6.50 Min. : 3.20 Min. :0.0000
## 1st Qu.: 3.0 1st Qu.: 9.80 1st Qu.: 8.80 1st Qu.:0.0000
## Median : 5.0 Median :13.60 Median :11.00 Median :0.0000
## Mean : 6.3 Mean :16.24 Mean :11.32 Mean :0.2342
## 3rd Qu.: 8.0 3rd Qu.:15.72 3rd Qu.:13.43 3rd Qu.:0.0000
## Max. :24.0 Max. :57.00 Max. :23.40 Max. :1.0000
##
## Transport
## 2Wheeler : 83
## Car : 61
## Public Transport:300
##
##
##
##
## The columns Engineer, MBA and license need to be converted to factors, and the missing value in MBA imputed
carsbasedata$Engineer<-as.factor(carsbasedata$Engineer)
carsbasedata$MBA<-as.factor(carsbasedata$MBA)
carsbasedata$license<-as.factor(carsbasedata$license)
carsbasedata<-knnImputation(carsbasedata)
str(carsbasedata)
## 'data.frame': 444 obs. of 9 variables:
## $ Age : int 28 23 29 28 27 26 28 26 22 27 ...
## $ Gender : Factor w/ 2 levels "Female","Male": 2 1 2 1 2 2 2 1 2 2 ...
## $ Engineer : Factor w/ 2 levels "0","1": 1 2 2 2 2 2 2 2 2 2 ...
## $ MBA : Factor w/ 2 levels "0","1": 1 1 1 2 1 1 1 1 1 1 ...
## $ Work.Exp : int 4 4 7 5 4 4 5 3 1 4 ...
## $ Salary : num 14.3 8.3 13.4 13.4 13.4 12.3 14.4 10.5 7.5 13.5 ...
## $ Distance : num 3.2 3.3 4.1 4.5 4.6 4.8 5.1 5.1 5.1 5.2 ...
## $ license : Factor w/ 2 levels "0","1": 1 1 1 1 1 2 1 1 1 1 ...
## $ Transport: Factor w/ 3 levels "2Wheeler","Car",..: 3 3 3 3 3 3 1 3 3 3 ...
summary(carsbasedata)
## Age Gender Engineer MBA Work.Exp
## Min. :18.00 Female:128 0:109 0:332 Min. : 0.0
## 1st Qu.:25.00 Male :316 1:335 1:112 1st Qu.: 3.0
## Median :27.00 Median : 5.0
## Mean :27.75 Mean : 6.3
## 3rd Qu.:30.00 3rd Qu.: 8.0
## Max. :43.00 Max. :24.0
## Salary Distance license Transport
## Min. : 6.50 Min. : 3.20 0:340 2Wheeler : 83
## 1st Qu.: 9.80 1st Qu.: 8.80 1:104 Car : 61
## Median :13.60 Median :11.00 Public Transport:300
## Mean :16.24 Mean :11.32
## 3rd Qu.:15.72 3rd Qu.:13.43
## Max. :57.00 Max. :23.40
Our primary interest, as per the problem statement, is to understand the factors influencing car usage. Hence we create a new column, CarUsage, which takes the value 0 for Public Transport and 2Wheeler and 1 for Car.
carsbasedata$CarUsage<-ifelse(carsbasedata$Transport =='Car',1,0)
table(carsbasedata$CarUsage)
##
## 0 1
## 383 61
sum(carsbasedata$CarUsage == 1)/nrow(carsbasedata)
## [1] 0.1373874
carsbasedata$CarUsage<-as.factor(carsbasedata$CarUsage)
People travelling by car are a minority in the data, so we need an appropriate sampling method on the training data; we will explore SMOTE. We will use logistic regression and decision trees to find the best-fitting model, and later explore a couple of black-box models for prediction.
Balancing the data
##Split the data into test and train
set.seed(400)
carindex<-createDataPartition(carsbasedata$CarUsage, p=0.7,list = FALSE,times = 1)
carsdatatrain<-carsbasedata[carindex,]
carsdatatest<-carsbasedata[-carindex,]
prop.table(table(carsdatatrain$CarUsage))
##
## 0 1
## 0.8621795 0.1378205
prop.table(table(carsdatatest$CarUsage))
##
## 0 1
## 0.8636364 0.1363636
carsdatatrain<-carsdatatrain[,c(1:8,10)]
carsdatatest<-carsdatatest[,c(1:8,10)]
## The train and test data have almost the same percentage of car usage as the base data
## Apply SMOTE on Training data set
library(DMwR)
attach(carsdatatrain)
carsdataSMOTE<-SMOTE(CarUsage~., carsdatatrain, perc.over = 250,perc.under = 150)
prop.table(table(carsdataSMOTE$CarUsage))
##
## 0 1
## 0.5 0.5
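For reference, the exact 50/50 balance follows from DMwR's SMOTE arithmetic; a rough sketch, assuming SMOTE creates floor(perc.over/100) synthetic cases per minority case and keeps (perc.under/100) majority cases per synthetic case created:
## With 43 minority (car user) cases in the training split:
## synthetic cases = floor(250/100) * 43 = 86; minority total = 43 + 86 = 129
## majority kept = (150/100) * 86 = 129, hence the even 129/129 split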
We now have an equal split in the data between car users and non car users. Let us proceed with building the models
Model Building
We will fit a logistic regression model on the SMOTE data to understand the factors influencing car usage. Since we have only a limited number of variables, we will use them all in model building.
##Create control parameter for GLM
outcomevar<-'CarUsage'
regressors<-c("Age","Work.Exp","Salary","Distance","license","Engineer","MBA","Gender")
trainctrl<-trainControl(method = 'repeatedcv',number = 10,repeats = 3)
carsglm<-train(carsdataSMOTE[,regressors],carsdataSMOTE[,outcomevar],method = "glm", family = "binomial",trControl = trainctrl)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## (the two warnings above repeated across the cross-validation fits; the
## fitted-probabilities warning is typically a sign of quasi-complete
## separation in a resample)
summary(carsglm$finalModel)
##
## Call:
## NULL
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.98326 -0.00024 0.00000 0.00000 1.11801
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -118.62275 51.84138 -2.288 0.0221 *
## Age 3.69904 1.69960 2.176 0.0295 *
## Work.Exp -1.20774 0.89958 -1.343 0.1794
## Salary 0.96031 0.72224 1.330 0.1836
## Distance -0.04112 0.32873 -0.125 0.9005
## license1 4.44597 2.89203 1.537 0.1242
## Engineer1 -0.72971 2.98679 -0.244 0.8070
## MBA1 -1.05817 1.70158 -0.622 0.5340
## GenderMale -3.03060 2.10593 -1.439 0.1501
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 357.664 on 257 degrees of freedom
## Residual deviance: 17.959 on 249 degrees of freedom
## AIC: 35.959
##
## Number of Fisher Scoring iterations: 12
carglmcoeff<-exp(coef(carsglm$finalModel))
write.csv(carglmcoeff,file = "Coeffs.csv")
varImp(object = carsglm)
## glm variable importance
##
## Overall
## Age 100.000
## license1 68.845
## GenderMale 64.056
## Work.Exp 59.351
## Salary 58.720
## MBA1 24.219
## Engineer1 5.813
## Distance 0.000
plot(varImp(object = carsglm), main="Variable Importance for Logistic Regression")
From the model we see that Age is the most significant predictor, with license next in effect size (though its p-value of 0.12 falls short of conventional significance). Looking at the odds and probabilities table, a one-year increase in Age multiplies the odds of car usage by about 40 (exp(3.70)), i.e. roughly a 98% probability when expressed as OR/(1+OR). As expected, holding a license raises the odds sharply (OR around 85, roughly 99% on the same scale), and a one-lakh increase in Salary multiplies the odds by about 2.6 (roughly 72%).
The null deviance of this model is 357.664 and the residual deviance is 17.959, giving a McFadden R-squared of about 0.95, a very good fit. The resampled Accuracy and Kappa values are also high.
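As a quick sketch of how the quoted probabilities follow from the odds ratios (using p = OR/(1+OR)) and of the McFadden R-squared computation:
or <- exp(coef(carsglm$finalModel)) # odds ratios, also written to Coeffs.csv above
round(or/(1 + or), 2)               # Age ~0.98, license1 ~0.99, Salary ~0.72
1 - (17.959/357.664)                # McFadden R-squared, ~0.95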
We shall do the prediction based on this model
carusageprediction<-predict.train(object = carsglm,carsdatatest[,regressors],type = "raw")
confusionMatrix(carusageprediction,carsdatatest[,outcomevar], positive='1')
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 108 1
## 1 6 17
##
## Accuracy : 0.947
## 95% CI : (0.8938, 0.9784)
## No Information Rate : 0.8636
## P-Value [Acc > NIR] : 0.001692
##
## Kappa : 0.7984
## Mcnemar's Test P-Value : 0.130570
##
## Sensitivity : 0.9444
## Specificity : 0.9474
## Pos Pred Value : 0.7391
## Neg Pred Value : 0.9908
## Prevalence : 0.1364
## Detection Rate : 0.1288
## Detection Prevalence : 0.1742
## Balanced Accuracy : 0.9459
##
## 'Positive' Class : 1
##
carusagepreddata<-carsdatatest
carusagepreddata$predictusage<-carusageprediction
We see that the overall accuracy of prediction is about 95%, with almost all non-users predicted correctly and 94% sensitivity in identifying car users.
Let us perform the prediction for the two given cases
carunknown<-read.csv("cars2.csv", header = TRUE)
## Warning in read.table(file = file, header = header, sep = sep, quote =
## quote, : incomplete final line found by readTableHeader on 'cars2.csv'
carunknown$license<-as.factor(carunknown$license)
carunknown$Engineer<-as.factor(carunknown$Engineer)
carunknown$MBA<-as.factor(carunknown$MBA)
carunknown$predictcaruse<-predict.train(object = carsglm,carunknown[,regressors],type = "raw")
print(carunknown)
## Age Gender Engineer MBA Work.Exp Salary Distance license predictcaruse
## 1 25 Male 0 0 2 10 5 1 0
## 2 25 Female 1 0 2 10 5 0 0
As seen, the model predicts that neither of the two new cases is a car user.
Let us now try the glmnet method of the caret package to fit a regularized (elastic-net) logistic regression model
trainctrlgn<-trainControl(method = 'cv',number = 10,returnResamp = 'none')
carsglmnet<-train(CarUsage~Age+Work.Exp+Salary+Distance+license, data = carsdataSMOTE, method = 'glmnet', trControl = trainctrlgn)
carsglmnet
## glmnet
##
## 258 samples
## 5 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 232, 232, 232, 232, 232, 232, ...
## Resampling results across tuning parameters:
##
## alpha lambda Accuracy Kappa
## 0.10 0.0008540753 0.9807692 0.9615385
## 0.10 0.0085407532 0.9730769 0.9461538
## 0.10 0.0854075320 0.9575385 0.9150897
## 0.55 0.0008540753 0.9807692 0.9615385
## 0.55 0.0085407532 0.9769231 0.9538462
## 0.55 0.0854075320 0.9536923 0.9073974
## 1.00 0.0008540753 0.9846154 0.9692308
## 1.00 0.0085407532 0.9846154 0.9692308
## 1.00 0.0854075320 0.9730769 0.9461538
##
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were alpha = 1 and lambda
## = 0.008540753.
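Note that the selected alpha = 1 corresponds to the lasso penalty rather than ridge. A pure ridge fit could be forced with an explicit tuning grid; a minimal sketch, with assumed lambda values:
ridgegrid <- expand.grid(alpha = 0, lambda = seq(0.001, 0.1, length.out = 10))
carsridge <- train(CarUsage~Age+Work.Exp+Salary+Distance+license, data = carsdataSMOTE,
                   method = 'glmnet', trControl = trainctrlgn, tuneGrid = ridgegrid)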
varImp(object = carsglmnet)
## glmnet variable importance
##
## Overall
## license1 100.000
## Age 99.068
## Distance 21.095
## Salary 8.684
## Work.Exp 0.000
plot(varImp(object = carsglmnet), main="Variable Importance for Regularized Logistic Regression")
We get license and Age as the most significant variables, followed by Distance in the importance ranking. Let us try prediction using the regularized model
carusagepredictiong<-predict.train(object = carsglmnet,carsdatatest[,regressors],type = "raw")
confusionMatrix(carusagepredictiong,carsdatatest[,outcomevar], positive='1')
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 109 2
## 1 5 16
##
## Accuracy : 0.947
## 95% CI : (0.8938, 0.9784)
## No Information Rate : 0.8636
## P-Value [Acc > NIR] : 0.001692
##
## Kappa : 0.7896
## Mcnemar's Test P-Value : 0.449692
##
## Sensitivity : 0.8889
## Specificity : 0.9561
## Pos Pred Value : 0.7619
## Neg Pred Value : 0.9820
## Prevalence : 0.1364
## Detection Rate : 0.1212
## Detection Prevalence : 0.1591
## Balanced Accuracy : 0.9225
##
## 'Positive' Class : 1
##
## Let us predict for unknown cases
carunknown$predictcarusegn<-predict.train(object = carsglmnet,carunknown[,regressors],type = "raw")
print(carunknown)
## Age Gender Engineer MBA Work.Exp Salary Distance license predictcaruse
## 1 25 Male 0 0 2 10 5 1 0
## 2 25 Female 1 0 2 10 5 0 0
## predictcarusegn
## 1 0
## 2 0
We see that the accuracy of prediction is 94.7%, with almost all non-users predicted correctly. However, sensitivity for car users is about 89%, slightly lower than for the plain GLM model. The unknown cases are again predicted as non car users.
Though the problem statement predominantly concerns the factors influencing car usage, the dataset is also a good candidate for a linear discriminant analysis to understand the factors driving the choice of transportation mode.
##Split the original base data into test and train samples again
carsbasedatalda<-read.csv("cars.csv", header = TRUE)
carsbasedatalda$Gender<-as.factor(carsbasedatalda$Gender)
carsbasedatalda$Engineer<-as.factor(carsbasedatalda$Engineer)
carsbasedatalda$MBA<-as.factor(carsbasedatalda$MBA)
carsbasedatalda<-knnImputation(carsbasedatalda)
set.seed(400)
carindexlda<-createDataPartition(carsbasedatalda$Transport, p=0.7,list = FALSE,times = 1)
carstrainlda<-carsbasedatalda[carindexlda,]
carstestlda<-carsbasedatalda[-carindexlda,]
carstrainlda$license<-as.factor(carstrainlda$license)
carstestlda$license<-as.factor(carstestlda$license)
cartrainlda.car<-carstrainlda[carstrainlda$Transport %in% c("Car", "Public Transport"),]
cartrainlda.twlr<-carstrainlda[carstrainlda$Transport %in% c("2Wheeler", "Public Transport"),]
cartrainlda.car$Transport<-as.character(cartrainlda.car$Transport)
cartrainlda.car$Transport<-as.factor(cartrainlda.car$Transport)
cartrainlda.twlr$Transport<-as.character(cartrainlda.twlr$Transport)
cartrainlda.twlr$Transport<-as.factor(cartrainlda.twlr$Transport)
prop.table(table(cartrainlda.car$Transport))
##
## Car Public Transport
## 0.1699605 0.8300395
prop.table(table(cartrainlda.twlr$Transport))
##
## 2Wheeler Public Transport
## 0.2193309 0.7806691
carldatwlrsm <- SMOTE(Transport~., data = cartrainlda.twlr, perc.over = 150, perc.under=200)
table(carldatwlrsm$Transport)
##
## 2Wheeler Public Transport
## 118 118
carldacarsm <- SMOTE(Transport~., data = cartrainlda.car, perc.over = 175, perc.under=200)
table(carldacarsm$Transport)
##
## Car Public Transport
## 86 86
carldacar<-carldacarsm[carldacarsm$Transport %in% c("Car"),]
carsdatatrainldasm<-rbind(carldatwlrsm,carldacar)
str(carsdatatrainldasm)
## 'data.frame': 322 obs. of 9 variables:
## $ Age : num 24 27 24 29 26 23 27 25 26 26 ...
## $ Gender : Factor w/ 2 levels "Female","Male": 2 2 2 2 2 2 1 2 1 1 ...
## $ Engineer : Factor w/ 2 levels "0","1": 2 1 1 2 2 2 2 1 2 2 ...
## $ MBA : Factor w/ 2 levels "0","1": 1 1 2 2 1 1 1 1 1 2 ...
## $ Work.Exp : num 4 5 2 6 4 2 3 3 6 4 ...
## $ Salary : num 10.9 13.8 8.5 14.6 12.7 8.5 10.7 9.9 17.8 12.8 ...
## $ Distance : num 9 15.2 8.5 10.2 9 6.1 11.7 15.9 10.4 10.8 ...
## $ license : Factor w/ 2 levels "0","1": 1 1 2 2 2 1 1 1 1 1 ...
## $ Transport: Factor w/ 3 levels "2Wheeler","Public Transport",..: 2 2 2 2 2 2 2 2 2 2 ...
table(carsdatatrainldasm$Transport)
##
## 2Wheeler Public Transport Car
## 118 118 86
## Build the model
attach(carsdatatrainldasm)
## The following objects are masked from carsdatatrain:
##
## Age, Distance, Engineer, Gender, MBA, Salary, Work.Exp,
## license
trainctrllda<-trainControl(method = 'cv',number = 10)
carslda<-train(Transport~Age+Work.Exp+Salary+Distance+license+Gender+Engineer+MBA ,data = carsdatatrainldasm, method="lda", trControl=trainctrllda)
carslda$finalModel
## Call:
## lda(x, grouping = y)
##
## Prior probabilities of groups:
## 2Wheeler Public Transport Car
## 0.3664596 0.3664596 0.2670807
##
## Group means:
## Age Work.Exp Salary Distance license1 GenderMale
## 2Wheeler 25.25983 3.726144 12.36309 12.23362 0.2627119 0.5000000
## Public Transport 26.55085 4.983051 13.51864 10.68729 0.1440678 0.6949153
## Car 35.75833 15.854594 36.15677 15.59036 0.7093023 0.7093023
## Engineer1 MBA1
## 2Wheeler 0.6949153 0.2542373
## Public Transport 0.6864407 0.2796610
## Car 0.8488372 0.2558140
##
## Coefficients of linear discriminants:
## LD1 LD2
## Age 0.210574627 -0.1462080
## Work.Exp 0.081188593 -0.0836168
## Salary 0.007534406 0.0476583
## Distance 0.074968787 0.1906158
## license1 0.416869466 1.2676211
## GenderMale -0.186709183 -1.0865173
## Engineer1 0.150855710 0.1757128
## MBA1 -0.072490532 -0.1201606
##
## Proportion of trace:
## LD1 LD2
## 0.9477 0.0523
plot(varImp(object = carslda),main="Variable Importance for Linear Discriminant Analysis" )
As per the output, the first discriminant function achieves about 95% of the separation (proportion of trace 0.9477), and on it Age, Work Experience and Salary play an important part in the choice of transport, followed by Distance and license.
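The proportion of trace quoted above can be recovered from the fitted model's singular values; a small sketch:
ldamodel <- carslda$finalModel
round(ldamodel$svd^2/sum(ldamodel$svd^2), 4) # LD1 0.9477, LD2 0.0523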
carsldapredict<-predict.train(object = carslda,newdata = carstestlda)
confusionMatrix(carsldapredict,carstestlda[,9])
## Warning in confusionMatrix.default(carsldapredict, carstestlda[, 9]):
## Levels are not in the same order for reference and data. Refactoring data
## to match.
## Confusion Matrix and Statistics
##
## Reference
## Prediction 2Wheeler Car Public Transport
## 2Wheeler 15 0 12
## Car 1 14 1
## Public Transport 8 4 77
##
## Overall Statistics
##
## Accuracy : 0.803
## 95% CI : (0.7249, 0.8671)
## No Information Rate : 0.6818
## P-Value [Acc > NIR] : 0.001323
##
## Kappa : 0.5952
## Mcnemar's Test P-Value : 0.308022
##
## Statistics by Class:
##
## Class: 2Wheeler Class: Car Class: Public Transport
## Sensitivity 0.6250 0.7778 0.8556
## Specificity 0.8889 0.9825 0.7143
## Pos Pred Value 0.5556 0.8750 0.8652
## Neg Pred Value 0.9143 0.9655 0.6977
## Prevalence 0.1818 0.1364 0.6818
## Detection Rate 0.1136 0.1061 0.5833
## Detection Prevalence 0.2045 0.1212 0.6742
## Balanced Accuracy 0.7569 0.8801 0.7849
The overall accuracy is about 80%, with car usage predicted at 78% sensitivity.
Let us look at the transportation choice predicted for the two unknown cases
predictTransportuk<-predict.train(object = carslda,newdata = carunknown)
carunknown$predictTransport<-predictTransportuk
print(carunknown)
## Age Gender Engineer MBA Work.Exp Salary Distance license predictcaruse
## 1 25 Male 0 0 2 10 5 1 0
## 2 25 Female 1 0 2 10 5 0 0
## predictcarusegn predictTransport
## 1 0 Public Transport
## 2 0 Public Transport
As per the LDA prediction, both employees will take public transport as their means of transportation.
Repeating the prediction for the two unknown cases and storing it in a separate column:
predictTransportuk1<-predict.train(object = carslda,newdata = carunknown)
carunknown$predictTransportpda<-predictTransportuk1
print(carunknown)
## Age Gender Engineer MBA Work.Exp Salary Distance license predictcaruse
## 1 25 Male 0 0 2 10 5 1 0
## 2 25 Female 1 0 2 10 5 0 0
## predictcarusegn predictTransport predictTransportpda
## 1 0 Public Transport Public Transport
## 2 0 Public Transport Public Transport
As before, LDA predicts that both cases will choose Public Transport.
Let us try a decision tree method, CART, to infer and understand from the dataset
carscart<-train(Transport~.,carsdatatrainldasm,method = 'rpart', trControl = trainControl(method = 'cv',number = 5,savePredictions = 'final'))
carscart$finalModel
## n= 322
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 322 204 2Wheeler (0.366459627 0.366459627 0.267080745)
## 2) Age< 30.02888 228 111 2Wheeler (0.513157895 0.482456140 0.004385965)
## 4) Distance>=10.90546 127 44 2Wheeler (0.653543307 0.338582677 0.007874016) *
## 5) Distance< 10.90546 101 34 Public Transport (0.336633663 0.663366337 0.000000000) *
## 3) Age>=30.02888 94 9 Car (0.010638298 0.085106383 0.904255319) *
library(rattle)
## Warning: package 'rattle' was built under R version 3.4.4
## Rattle: A free graphical interface for data science with R.
## Version 5.2.0 Copyright (c) 2006-2018 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.
fancyRpartPlot(carscart$finalModel)
Age and Distance are the major variables on which CART has split.
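The shallow three-leaf tree reflects caret's default search over the complexity parameter cp; a sketch of supplying a finer (assumed) grid that would allow a deeper tree:
cartgrid <- expand.grid(cp = seq(0.001, 0.05, by = 0.005))
carscart2 <- train(Transport~., carsdatatrainldasm, method = 'rpart', tuneGrid = cartgrid,
                   trControl = trainControl(method = 'cv', number = 5))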
Let us check the accuracy of the fitted model on the test data
predictions_CART<-predict(carscart,carstestlda)
confusionMatrix(predictions_CART,carstestlda$Transport)
## Warning in confusionMatrix.default(predictions_CART,
## carstestlda$Transport): Levels are not in the same order for reference and
## data. Refactoring data to match.
## Confusion Matrix and Statistics
##
## Reference
## Prediction 2Wheeler Car Public Transport
## 2Wheeler 15 0 31
## Car 1 17 5
## Public Transport 8 1 54
##
## Overall Statistics
##
## Accuracy : 0.6515
## 95% CI : (0.5637, 0.7323)
## No Information Rate : 0.6818
## P-Value [Acc > NIR] : 0.8007180
##
## Kappa : 0.4068
## Mcnemar's Test P-Value : 0.0006336
##
## Statistics by Class:
##
## Class: 2Wheeler Class: Car Class: Public Transport
## Sensitivity 0.6250 0.9444 0.6000
## Specificity 0.7130 0.9474 0.7857
## Pos Pred Value 0.3261 0.7391 0.8571
## Neg Pred Value 0.8953 0.9908 0.4783
## Prevalence 0.1818 0.1364 0.6818
## Detection Rate 0.1136 0.1288 0.4091
## Detection Prevalence 0.3485 0.1742 0.4773
## Balanced Accuracy 0.6690 0.9459 0.6929
We get an overall accuracy of about 65%, with car usage predicted at 94% sensitivity.
Let us predict for the unknown cases
predictTransportuk2<-predict.train(object = carscart,newdata = carunknown)
carunknown$predictTransportcart<-predictTransportuk2
print(carunknown)
## Age Gender Engineer MBA Work.Exp Salary Distance license predictcaruse
## 1 25 Male 0 0 2 10 5 1 0
## 2 25 Female 1 0 2 10 5 0 0
## predictcarusegn predictTransport predictTransportpda
## 1 0 Public Transport Public Transport
## 2 0 Public Transport Public Transport
## predictTransportcart
## 1 Public Transport
## 2 Public Transport
The CART model also predicts Public Transport for the two cases. Let us now try a gradient-boosted tree model (XGBoost) with a single fixed tuning grid.
boostcontrol <- trainControl(number=10)
xgbGrid <- expand.grid(
eta = 0.3,
max_depth = 1,
nrounds = 50,
gamma = 0,
colsample_bytree = 0.6,
min_child_weight = 1, subsample = 1
)
carsxgb <- train(Transport ~ .,carsdatatrainldasm,trControl = boostcontrol,tuneGrid = xgbGrid,metric = "Accuracy",method = "xgbTree")
carsxgb$finalModel
## ##### xgb.Booster
## raw: 38.5 Kb
## call:
## xgboost::xgb.train(params = list(eta = param$eta, max_depth = param$max_depth,
## gamma = param$gamma, colsample_bytree = param$colsample_bytree,
## min_child_weight = param$min_child_weight, subsample = param$subsample),
## data = x, nrounds = param$nrounds, num_class = length(lev),
## objective = "multi:softprob")
## params (as set within xgb.train):
## eta = "0.3", max_depth = "1", gamma = "0", colsample_bytree = "0.6", min_child_weight = "1", subsample = "1", num_class = "3", objective = "multi:softprob", silent = "1"
## xgb.attributes:
## niter
## callbacks:
## cb.print.evaluation(period = print_every_n)
## # of features: 8
## niter: 50
## nfeatures : 8
## xNames : Age GenderMale Engineer1 MBA1 Work.Exp Salary Distance license1
## problemType : Classification
## tuneValue :
## nrounds max_depth eta gamma colsample_bytree min_child_weight subsample
## 1 50 1 0.3 0 0.6 1 1
## obsLevels : 2Wheeler Public Transport Car
## param :
## list()
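The grid above fixes every tuning parameter to a single value (depth-1 stumps, eta = 0.3, 50 rounds). A sketch of a wider, assumed grid that would let caret tune the depth and learning rate as well:
xgbGrid2 <- expand.grid(eta = c(0.1, 0.3), max_depth = c(1, 3), nrounds = c(50, 100),
                        gamma = 0, colsample_bytree = 0.6, min_child_weight = 1, subsample = 1)
carsxgb2 <- train(Transport ~ ., carsdatatrainldasm, trControl = boostcontrol,
                  tuneGrid = xgbGrid2, metric = "Accuracy", method = "xgbTree")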
Let us predict using the test dataset
predictions_xgb<-predict(carsxgb,carstestlda)
confusionMatrix(predictions_xgb,carstestlda$Transport)
## Warning in confusionMatrix.default(predictions_xgb, carstestlda$Transport):
## Levels are not in the same order for reference and data. Refactoring data
## to match.
## Confusion Matrix and Statistics
##
## Reference
## Prediction 2Wheeler Car Public Transport
## 2Wheeler 17 0 31
## Car 1 17 5
## Public Transport 6 1 54
##
## Overall Statistics
##
## Accuracy : 0.6667
## 95% CI : (0.5794, 0.7463)
## No Information Rate : 0.6818
## P-Value [Acc > NIR] : 0.68290
##
## Kappa : 0.4398
## Mcnemar's Test P-Value : 0.00013
##
## Statistics by Class:
##
## Class: 2Wheeler Class: Car Class: Public Transport
## Sensitivity 0.7083 0.9444 0.6000
## Specificity 0.7130 0.9474 0.8333
## Pos Pred Value 0.3542 0.7391 0.8852
## Neg Pred Value 0.9167 0.9908 0.4930
## Prevalence 0.1818 0.1364 0.6818
## Detection Rate 0.1288 0.1288 0.4091
## Detection Prevalence 0.3636 0.1742 0.4621
## Balanced Accuracy 0.7106 0.9459 0.7167
The overall accuracy is about 67%, with car usage predicted at 94% sensitivity.
Let us try and predict for the two unknown cases
predictTransportuk3<-predict.train(object = carsxgb,newdata = carunknown)
carunknown$predictTransportxgb<-predictTransportuk3
print(carunknown)
## Age Gender Engineer MBA Work.Exp Salary Distance license predictcaruse
## 1 25 Male 0 0 2 10 5 1 0
## 2 25 Female 1 0 2 10 5 0 0
## predictcarusegn predictTransport predictTransportpda
## 1 0 Public Transport Public Transport
## 2 0 Public Transport Public Transport
## predictTransportcart predictTransportxgb
## 1 Public Transport 2Wheeler
## 2 Public Transport 2Wheeler
Interestingly, the XGBoost model predicts both cases, including the male employee with a license travelling 5 km, to be 2Wheeler users.
We will now use multinomial logistic regression to understand the factors driving the choice of transport mode. The data is relevelled with respect to one of the three classes and two independent logistic regressions are fitted against that baseline; in our case, the 2Wheeler class is taken as the reference.
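nnet::multinom, which caret's "multinom" method wraps, uses the first factor level as the baseline. A sketch of making the reference class explicit before fitting (a no-op here, since 2Wheeler is already the first level):
carsdatatrainldasm$Transport <- relevel(carsdatatrainldasm$Transport, ref = "2Wheeler")
levels(carsdatatrainldasm$Transport)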
carsmlr<-train(Transport ~.,carsdatatrainldasm,method = "multinom")
## # weights: 30 (18 variable)
## initial value 353.753157
## (iteration traces for the resampled multinom fits omitted; nearly all runs
## converged, with final deviance values between roughly 131 and 200, and a
## few stopped after 100 iterations)
carsmlr$finalModel
## Call:
## nnet::multinom(formula = .outcome ~ ., data = dat, decay = param$decay)
##
## Coefficients:
## (Intercept) Age GenderMale Engineer1 MBA1
## Public Transport -3.009644 0.1904219 0.9071324 -0.1385301 0.1833559
## Car -78.448134 2.5821660 -0.9090194 1.6644457 -0.8002783
## Work.Exp Salary Distance license1
## Public Transport 0.1010928 -0.07163237 -0.1476360 -1.157672
## Car -1.1773870 0.18870405 0.3620958 1.502582
##
## Residual Deviance: 326.3498
## AIC: 362.3498
carmlrcoeff<-exp(coef(carsmlr$finalModel))
write.csv(carmlrcoeff,file = "Coeffsmlr.csv")
plot(varImp(object=carsmlr), main = "Variable Importance for Multinomial Logit")
The model has a residual deviance of 326.35. It implies that a one-year increase in Age multiplies the odds of taking Public Transport rather than a 2Wheeler by about 1.2 (exp(0.19)), whereas it multiplies the odds of choosing a Car by about 13 (exp(2.58)). Age and license are the two most important factors in deciding the mode of transport.
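A sketch of reading those two effects directly off the exponentiated coefficient matrix computed above:
exp(coef(carsmlr$finalModel))[, c("Age", "license1")]
## Age: ~1.21 for Public Transport, ~13.2 for Car (both relative to 2Wheeler)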
Let us try and predict using the test data
predictions_mlr<-predict(carsmlr,carstestlda)
confusionMatrix(predictions_mlr,carstestlda$Transport)
## Warning in confusionMatrix.default(predictions_mlr, carstestlda$Transport):
## Levels are not in the same order for reference and data. Refactoring data
## to match.
## Confusion Matrix and Statistics
##
## Reference
## Prediction 2Wheeler Car Public Transport
## 2Wheeler 16 0 8
## Car 1 17 3
## Public Transport 7 1 79
##
## Overall Statistics
##
## Accuracy : 0.8485
## 95% CI : (0.7757, 0.9049)
## No Information Rate : 0.6818
## P-Value [Acc > NIR] : 9.773e-06
##
## Kappa : 0.6944
## Mcnemar's Test P-Value : 0.5587
##
## Statistics by Class:
##
## Class: 2Wheeler Class: Car Class: Public Transport
## Sensitivity 0.6667 0.9444 0.8778
## Specificity 0.9259 0.9649 0.8095
## Pos Pred Value 0.6667 0.8095 0.9080
## Neg Pred Value 0.9259 0.9910 0.7556
## Prevalence 0.1818 0.1364 0.6818
## Detection Rate 0.1212 0.1288 0.5985
## Detection Prevalence 0.1818 0.1591 0.6591
## Balanced Accuracy 0.7963 0.9547 0.8437
The overall accuracy is about 85%, with car usage predicted at 94% sensitivity.
Let us try and predict for the two unknown cases
predictTransportuk4<-predict.train(object = carsmlr,newdata = carunknown)
carunknown$predictTransportmlr<-predictTransportuk4
print(carunknown)
## Age Gender Engineer MBA Work.Exp Salary Distance license predictcaruse
## 1 25 Male 0 0 2 10 5 1 0
## 2 25 Female 1 0 2 10 5 0 0
## predictcarusegn predictTransport predictTransportpda
## 1 0 Public Transport Public Transport
## 2 0 Public Transport Public Transport
## predictTransportcart predictTransportxgb predictTransportmlr
## 1 Public Transport 2Wheeler Public Transport
## 2 Public Transport 2Wheeler Public Transport
The model has predicted both cases to be using public transport
rftrcontrol <- trainControl(method="repeatedcv", number=10, repeats=3)
mtry<-sqrt(ncol(carsdatatrainldasm))
tunegridrf <- expand.grid(.mtry=mtry)
carsrf<-train(Transport ~.,carsdatatrainldasm,method = "rf", trControl=rftrcontrol, tuneGrid = tunegridrf)
carsrf$finalModel
##
## Call:
## randomForest(x = x, y = y, mtry = param$mtry)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 15.84%
## Confusion matrix:
## 2Wheeler Public Transport Car class.error
## 2Wheeler 86 30 2 0.27118644
## Public Transport 14 100 4 0.15254237
## Car 1 0 85 0.01162791
plot(varImp(object=carsrf), main = "Variable Importance for Random Forest")
The out-of-bag error estimate is 15.84% on the training dataset. Age, Work Experience, Distance and Salary are the most significant variables.
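A note on the mtry value: the sqrt(p) heuristic above is applied to ncol of the full data frame, which still includes the Transport outcome column, so sqrt(9) = 3 variables are tried at each split (as the output confirms) rather than sqrt(8) of the predictors alone:
sqrt(ncol(carsdatatrainldasm))     # 3, counting the outcome column
sqrt(ncol(carsdatatrainldasm) - 1) # ~2.83 for the 8 predictors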
Let us try and predict for test data
predictions_rf<-predict(carsrf,carstestlda)
confusionMatrix(predictions_rf,carstestlda$Transport)
## Warning in confusionMatrix.default(predictions_rf, carstestlda$Transport):
## Levels are not in the same order for reference and data. Refactoring data
## to match.
## Confusion Matrix and Statistics
##
## Reference
## Prediction 2Wheeler Car Public Transport
## 2Wheeler 18 0 26
## Car 1 17 1
## Public Transport 5 1 63
##
## Overall Statistics
##
## Accuracy : 0.7424
## 95% CI : (0.6591, 0.8146)
## No Information Rate : 0.6818
## P-Value [Acc > NIR] : 0.078662
##
## Kappa : 0.5428
## Mcnemar's Test P-Value : 0.001634
##
## Statistics by Class:
##
## Class: 2Wheeler Class: Car Class: Public Transport
## Sensitivity 0.7500 0.9444 0.7000
## Specificity 0.7593 0.9825 0.8571
## Pos Pred Value 0.4091 0.8947 0.9130
## Neg Pred Value 0.9318 0.9912 0.5714
## Prevalence 0.1818 0.1364 0.6818
## Detection Rate 0.1364 0.1288 0.4773
## Detection Prevalence 0.3333 0.1439 0.5227
## Balanced Accuracy 0.7546 0.9635 0.7786
We have an overall accuracy of about 74% and 94% sensitivity for prediction of car usage.
Let us now predict the choice of transport for the two unknown cases
predictTransportuk5<-predict.train(object = carsrf,newdata = carunknown)
carunknown$predictTransportrf<-predictTransportuk5
print(carunknown)
## Age Gender Engineer MBA Work.Exp Salary Distance license predictcaruse
## 1 25 Male 0 0 2 10 5 1 0
## 2 25 Female 1 0 2 10 5 0 0
## predictcarusegn predictTransport predictTransportpda
## 1 0 Public Transport Public Transport
## 2 0 Public Transport Public Transport
## predictTransportcart predictTransportxgb predictTransportmlr
## 1 Public Transport 2Wheeler Public Transport
## 2 Public Transport 2Wheeler Public Transport
## predictTransportrf
## 1 2Wheeler
## 2 2Wheeler
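Like XGBoost, the random forest predicts 2Wheeler for both unknown cases, while LDA, CART and the multinomial model predict Public Transport, and the binary logistic models classify both as non car users. All models agree that neither employee is likely to commute by car.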