The classification models used in this project are:

  1. Logistic regression
  2. K-Nearest Neighbor (KNN)
  3. Naive Bayes
  4. Decision (classification) Tree
  5. Random Forest

Additional packages (aside from the base/default R packages) used are:

  1. ISLR2
  2. MASS
  3. e1071
  4. class
  5. dplyr
  6. ggplot2
  7. Hmisc
  8. rpart
  9. rpart.plot
  10. caret
  11. randomForest
  12. formattable

Loading required packages

library(ISLR2)
library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:ISLR2':
## 
##     Boston
library(e1071)
library(class)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:MASS':
## 
##     select
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
library(Hmisc)
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
## 
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:dplyr':
## 
##     src, summarize
## The following object is masked from 'package:e1071':
## 
##     impute
## The following objects are masked from 'package:base':
## 
##     format.pval, units
library(rpart)
library(rpart.plot)
library(caret)
## 
## Attaching package: 'caret'
## The following object is masked from 'package:survival':
## 
##     cluster
library(randomForest)
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
## 
##     margin
## The following object is masked from 'package:dplyr':
## 
##     combine
library(formattable)
## 
## Attaching package: 'formattable'
## The following object is masked from 'package:MASS':
## 
##     area

Data Description

Auto Data Set

Description Gas mileage, horsepower, and other information for 392 vehicles. A data frame with 392 observations on the following 9 variables.

mpg miles per gallon

cylinders Number of cylinders between 4 and 8

displacement Engine displacement (cu. inches)

horsepower Engine horsepower

weight Vehicle weight (lbs.)

acceleration Time to accelerate from 0 to 60 mph (sec.)

year Model year (modulo 100)

origin Origin of car (1. American, 2. European, 3. Japanese)

name Vehicle name

Source This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University. The dataset was used in the 1983 American Statistical Association Exposition. The original dataset has 397 observations, of which 5 have missing values for the variable “horsepower”. These rows are removed here. The original dataset is available as a CSV file in the docs directory, as well as at https://www.statlearning.com.

References James, G., Witten, D., Hastie, T., and Tibshirani, R. (2013) An Introduction to Statistical Learning with applications in R, https://www.statlearning.com, Springer-Verlag, New York

Data loading, prep, and cleaning

df <- data.frame(Auto)
head(df, 5)
##   mpg cylinders displacement horsepower weight acceleration year origin
## 1  18         8          307        130   3504         12.0   70      1
## 2  15         8          350        165   3693         11.5   70      1
## 3  18         8          318        150   3436         11.0   70      1
## 4  16         8          304        150   3433         12.0   70      1
## 5  17         8          302        140   3449         10.5   70      1
##                        name
## 1 chevrolet chevelle malibu
## 2         buick skylark 320
## 3        plymouth satellite
## 4             amc rebel sst
## 5               ford torino

We check whether there are any missing values in our data frame and, if there are, print all records with at least one missing column.

which(is.na(df))
## integer(0)
df[!complete.cases(df),]
## [1] mpg          cylinders    displacement horsepower   weight      
## [6] acceleration year         origin       name        
## <0 rows> (or 0-length row.names)

From the above, we see that there are no missing/null values.

Converting variables cylinders and origin to factors. Note: for origin, 1 - American, 2 - European, 3 - Japanese.

df$cylinders <- as.factor(df$cylinders)
df$origin <- as.factor(df$origin)

Exploratory data analysis

Variables cylinders, origin, year, and name were excluded, as their summaries would be redundant here.

summary(df[-c(2,7,8,9)])
##       mpg         displacement     horsepower        weight      acceleration  
##  Min.   : 9.00   Min.   : 68.0   Min.   : 46.0   Min.   :1613   Min.   : 8.00  
##  1st Qu.:17.00   1st Qu.:105.0   1st Qu.: 75.0   1st Qu.:2225   1st Qu.:13.78  
##  Median :22.75   Median :151.0   Median : 93.5   Median :2804   Median :15.50  
##  Mean   :23.45   Mean   :194.4   Mean   :104.5   Mean   :2978   Mean   :15.54  
##  3rd Qu.:29.00   3rd Qu.:275.8   3rd Qu.:126.0   3rd Qu.:3615   3rd Qu.:17.02  
##  Max.   :46.60   Max.   :455.0   Max.   :230.0   Max.   :5140   Max.   :24.80

From the above summary of the quantitative variables (inputs: displacement, horsepower, weight, & acceleration; output: mpg), we see that the mean and median are quite different for each variable except acceleration. We can plot histograms for these variables to see the shape of their respective distributions.

hist.data.frame(df[,-c(2,7,8,9)])

Scatter plots of each quantitative variable against the others were also made to get an overview of their likely relationships with one another.

pairs(~mpg + displacement + weight + horsepower + acceleration, data=df)

From the above, we see that some of the input variables are collinear (that is, linearly related to one another), e.g. weight & horsepower and displacement & weight. We can also compute the correlation matrix to show this.

cor(df[,-c(2,7,8,9)])
##                     mpg displacement horsepower     weight acceleration
## mpg           1.0000000   -0.8051269 -0.7784268 -0.8322442    0.4233285
## displacement -0.8051269    1.0000000  0.8972570  0.9329944   -0.5438005
## horsepower   -0.7784268    0.8972570  1.0000000  0.8645377   -0.6891955
## weight       -0.8322442    0.9329944  0.8645377  1.0000000   -0.4168392
## acceleration  0.4233285   -0.5438005 -0.6891955 -0.4168392    1.0000000

From the histogram above, we see that the variable miles per gallon (mpg) is skewed right, making the median the best metric to use as the cutoff point in creating a new variable mpg01 with two classes: 0 if mpg is less than or equal to the median, and 1 if mpg is greater than the median. This new variable, mpg01, then serves as the target variable on which we will build the classification models.

paste("The median of mpg is", median(df$mpg))
## [1] "The median of mpg is 22.75"
df$mpg01 <- rep(0, length(df$mpg))
df$mpg01[df$mpg > median(df$mpg)] <- 1
df$mpg01 <- as.factor(df$mpg01)

A quick check that the new variable was created correctly

df[c(10:19), c("mpg","mpg01")]
##    mpg mpg01
## 10  15     0
## 11  15     0
## 12  14     0
## 13  15     0
## 14  14     0
## 15  24     1
## 16  22     0
## 17  18     0
## 18  21     0
## 19  27     1

For the factor (qualitative) variables, we first make contingency tables of cylinders & origin against the new target variable mpg01.

table(df$cylinders, df$mpg01)
##    
##       0   1
##   3   3   1
##   4  20 179
##   5   1   2
##   6  72  11
##   8 100   3
table(df$origin, df$mpg01)
##    
##       0   1
##   1 173  72
##   2  14  54
##   3   9  70

To check the relationship between our factor variables cylinders & origin and the new target variable mpg01, we perform a chi-squared test on each contingency table created above.

chisq.test(df$cylinders, df$mpg01)
## Warning in chisq.test(df$cylinders, df$mpg01): Chi-squared approximation may be
## incorrect
## 
##  Pearson's Chi-squared test
## 
## data:  df$cylinders and df$mpg01
## X-squared = 264.55, df = 4, p-value < 2.2e-16
chisq.test(df$origin, df$mpg01)
## 
##  Pearson's Chi-squared test
## 
## data:  df$origin and df$mpg01
## X-squared = 112.27, df = 2, p-value < 2.2e-16

From the above, we see that mpg01 is dependent on both factor variables, as each p-value obtained is very small (p-value < 2.2e-16).

Now we check the relationship of the continuous input variables with this new categorical variable, first by summarising each variable within the two classes of mpg01; a one-way ANOVA sketch follows the boxplots below.

data.frame(group_by(df, mpg01) %>%
  summarise(count = n(),
    avg_disp = mean(displacement, na.rm = TRUE),
    sd_disp = sd(displacement, na.rm = TRUE),
    avg_hrp = mean(horsepower, na.rm = TRUE),
    sd_hrp = sd(horsepower, na.rm = TRUE),
    avg_wgt = mean(weight, na.rm = TRUE),
    sd_wgt = sd(weight, na.rm = TRUE),
    avg_acc = mean(acceleration, na.rm = TRUE),
    sd_acc = sd(acceleration, na.rm = TRUE), .groups="drop"
  ))
##   mpg01 count avg_disp  sd_disp   avg_hrp   sd_hrp  avg_wgt   sd_wgt  avg_acc
## 1     0   196 273.1582 89.52399 130.11224 37.35564 3620.403 676.9322 14.58571
## 2     1   196 115.6658 38.42951  78.82653 15.91969 2334.765 397.1924 16.49694
##     sd_acc
## 1 2.685154
## 2 2.493168

From the above, we have equal counts (196 each) for the two classes of mpg01, showing that the data is balanced.

We can also visualize the above using the boxplots below

boxplot(df$displacement~df$mpg01, main="displacement vs mpg01", col = c("#00AFBB", "#FC4E07"))

boxplot(df$horsepower~df$mpg01, main="horsepower vs mpg01", col = c("#00AFBB", "#FC4E07"))

boxplot(df$weight~df$mpg01, main="weight vs mpg01", col = c("#00AFBB", "#FC4E07"))

boxplot(df$acceleration~df$mpg01, main="acceleration vs mpg01", col = c("#00AFBB", "#FC4E07"))

From the box plots above, we can see that each continuous input variable differs markedly between the two levels of the target variable mpg01; hence there is a significant association between the continuous independent variables and the target variable mpg01.
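
A one-way ANOVA per continuous input formalizes this visual impression. A minimal sketch (not run as part of the original analysis; with only two classes each test is equivalent to a pooled two-sample t-test):

for (v in c("displacement", "horsepower", "weight", "acceleration")) {
  # build the formula v ~ mpg01 and fit a one-way ANOVA
  fit <- aov(reformulate("mpg01", response = v), data = df)
  cat("\n---", v, "---\n")
  print(summary(fit))
}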

Ranking all input variables in descending order of apparent importance to the target variable mpg01:

  1. For the factor input variables, cylinders has a more significant association with mpg01 than origin, as it has a higher chi-squared test statistic.

  2. For the quantitative input variables, weight and displacement have very distinct boxplots for each level of mpg01, followed by horsepower, and then acceleration.

Classification models

Train-test split

An 80-20 train-test split was used for this project.

set.seed(12345)
split <- sample(c(rep(0, 0.8 * nrow(df)), rep(1, 0.2 * nrow(df))))
df.train <- df[split == 0, ]
df.test <- df[split == 1, ]
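
Note that rep(0, 0.8 * nrow(df)) truncates 313.6 to 313, so the split vector above has 391 entries and gets recycled against the 392 rows, yielding a 314/78 split. A sketch of an index-based alternative that avoids the recycling (not the split used here, so the results below would differ; the _alt names are for illustration only):

set.seed(12345)
train_idx <- sample(nrow(df), size = round(0.8 * nrow(df)))  # 314 training rows
df.train_alt <- df[train_idx, ]
df.test_alt  <- df[-train_idx, ]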

Let’s have a look at the first rows of our train data:

head(df.train)
##   mpg cylinders displacement horsepower weight acceleration year origin
## 1  18         8          307        130   3504         12.0   70      1
## 2  15         8          350        165   3693         11.5   70      1
## 3  18         8          318        150   3436         11.0   70      1
## 4  16         8          304        150   3433         12.0   70      1
## 5  17         8          302        140   3449         10.5   70      1
## 6  15         8          429        198   4341         10.0   70      1
##                        name mpg01
## 1 chevrolet chevelle malibu     0
## 2         buick skylark 320     0
## 3        plymouth satellite     0
## 4             amc rebel sst     0
## 5               ford torino     0
## 6          ford galaxie 500     0

Let’s have a look at the first rows of our test data:

head(df.test)
##    mpg cylinders displacement horsepower weight acceleration year origin
## 7   14         8          454        220   4354          9.0   70      1
## 10  15         8          390        190   3850          8.5   70      1
## 14  14         8          455        225   3086         10.0   70      1
## 17  18         6          199         97   2774         15.5   70      1
## 24  26         4          121        113   2234         12.5   70      2
## 36  17         6          250        100   3329         15.5   71      1
##                         name mpg01
## 7           chevrolet impala     0
## 10        amc ambassador dpl     0
## 14   buick estate wagon (sw)     0
## 17                amc hornet     0
## 24                  bmw 2002     1
## 36 chevrolet chevelle malibu     0

Now we create vectors holding just the target variable from both the training and test data:

Actual.mpg01_train <- df.train$mpg01
Actual.mpg01_test <- df.test$mpg01

1. Logistic Regression model

log_reg <- glm(mpg01~cylinders + displacement + horsepower + weight + acceleration + origin, data = df.train, family=binomial)
summary(log_reg)
## 
## Call:
## glm(formula = mpg01 ~ cylinders + displacement + horsepower + 
##     weight + acceleration + origin, family = binomial, data = df.train)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.40184  -0.25539   0.06431   0.28834   2.84926  
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  10.1422805  3.7717084   2.689  0.00717 **
## cylinders4    4.5454003  1.5408100   2.950  0.00318 **
## cylinders5    3.9376520  2.2225913   1.772  0.07645 . 
## cylinders6    2.9795082  1.7936226   1.661  0.09668 . 
## cylinders8    5.8969995  2.4559551   2.401  0.01635 * 
## displacement -0.0213256  0.0129690  -1.644  0.10010   
## horsepower   -0.0790939  0.0268315  -2.948  0.00320 **
## weight        0.0004541  0.0011932   0.381  0.70354   
## acceleration -0.2586885  0.1589896  -1.627  0.10372   
## origin2      -0.4008971  0.7554499  -0.531  0.59565   
## origin3       1.0483281  0.8720313   1.202  0.22930   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 435.28  on 313  degrees of freedom
## Residual deviance: 137.60  on 303  degrees of freedom
## AIC: 159.6
## 
## Number of Fisher Scoring iterations: 7

In the above logistic regression model, no interaction between predictor variables was considered.

From the above, we see that at the various significance levels (0.001, 0.01, 0.05, 0.1, etc.), some variables are not significant (less important for the model); for example, at 0.05, cylinders6 (representing level 6 of the variable cylinders) is not significant.

Now we perform stepwise selection using the function stepAIC to select the most important variables affecting the target variable mpg01.

final.log_reg <- stepAIC(log_reg)
## Start:  AIC=159.6
## mpg01 ~ cylinders + displacement + horsepower + weight + acceleration + 
##     origin
## 
##                Df Deviance    AIC
## - weight        1   137.75 157.75
## - origin        2   140.60 158.60
## <none>              137.60 159.60
## - acceleration  1   140.33 160.33
## - displacement  1   140.50 160.50
## - horsepower    1   147.21 167.21
## - cylinders     4   159.85 173.85
## 
## Step:  AIC=157.75
## mpg01 ~ cylinders + displacement + horsepower + acceleration + 
##     origin
## 
##                Df Deviance    AIC
## - origin        2   140.66 156.66
## <none>              137.75 157.75
## - acceleration  1   140.98 158.98
## - displacement  1   141.13 159.13
## - horsepower    1   150.12 168.12
## - cylinders     4   160.42 172.42
## 
## Step:  AIC=156.65
## mpg01 ~ cylinders + displacement + horsepower + acceleration
## 
##                Df Deviance    AIC
## <none>              140.66 156.66
## - acceleration  1   144.71 158.71
## - displacement  1   148.18 162.18
## - horsepower    1   152.79 166.79
## - cylinders     4   161.09 169.09
coef(final.log_reg)
##  (Intercept)   cylinders4   cylinders5   cylinders6   cylinders8 displacement 
##  11.40448898   3.75890305   2.98239500   2.54297883   5.51538124  -0.02281515 
##   horsepower acceleration 
##  -0.07060382  -0.24725339

From the above, we see the steps taken to arrive at the final optimal logistic regression model. If you do not want these steps displayed, include the argument trace=FALSE, as shown below.

final.log_reg <- stepAIC(log_reg, trace=FALSE)
coef(final.log_reg)
##  (Intercept)   cylinders4   cylinders5   cylinders6   cylinders8 displacement 
##  11.40448898   3.75890305   2.98239500   2.54297883   5.51538124  -0.02281515 
##   horsepower acceleration 
##  -0.07060382  -0.24725339

From the final optimal logistic regression model above, we see that the variables weight and origin were removed.

Now we check the prediction accuracy of the final/optimal model on the test data and compare it with the initial logistic regression model. The cutoff threshold chosen was 0.5: based on the input features (cylinders, displacement, horsepower & acceleration), a car is classified as 1 (mpg greater than 22.75) if its predicted probability exceeds 0.5, and as 0 (mpg less than or equal to 22.75) otherwise.
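
For reference, type = "response" in the prediction below returns the inverse logit of the linear predictor; a minimal equivalent sketch:

eta <- predict(final.log_reg, df.test, type = "link")   # linear predictor X %*% beta
p   <- 1 / (1 + exp(-eta))                              # same as plogis(eta)
all.equal(p, predict(final.log_reg, df.test, type = "response"))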

pr.final_log_reg <- predict(final.log_reg, df.test, type="response")
pred.final_log_reg <- ifelse(pr.final_log_reg > 0.5, "1", "0")
mean(pred.final_log_reg == Actual.mpg01_test)
## [1] 0.8589744
pr.log_reg <- predict(log_reg, df.test, type="response")
pred.log_reg <- ifelse(pr.log_reg > 0.5, "1", "0")
mean(pred.log_reg == Actual.mpg01_test)
## [1] 0.8589744

From the above, we can see that the final/optimal logistic regression model gives the same prediction accuracy and is also less complex than the initial model (as it uses fewer input variables).

Now, examining the classification performance of the final/optimal logistic regression model, we have:

cm_final_log_reg <- confusionMatrix(as.factor(pred.final_log_reg), Actual.mpg01_test, positive='1', mode='everything')
cm_final_log_reg
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 31  2
##          1  9 36
##                                           
##                Accuracy : 0.859           
##                  95% CI : (0.7617, 0.9274)
##     No Information Rate : 0.5128          
##     P-Value [Acc > NIR] : 1.266e-10       
##                                           
##                   Kappa : 0.7191          
##                                           
##  Mcnemar's Test P-Value : 0.07044         
##                                           
##             Sensitivity : 0.9474          
##             Specificity : 0.7750          
##          Pos Pred Value : 0.8000          
##          Neg Pred Value : 0.9394          
##               Precision : 0.8000          
##                  Recall : 0.9474          
##                      F1 : 0.8675          
##              Prevalence : 0.4872          
##          Detection Rate : 0.4615          
##    Detection Prevalence : 0.5769          
##       Balanced Accuracy : 0.8612          
##                                           
##        'Positive' Class : 1               
## 
paste("The test data misclassification rate is:", mean(pred.final_log_reg != Actual.mpg01_test))
## [1] "The test data misclassification rate is: 0.141025641025641"

From the above, we have

True Positive (TP) = 36, True Negative (TN) = 31, False Positive (FP) = 9, False Negative (FN) = 2

We could also use the formula (FP + FN) / (TP + TN + FP + FN), or equivalently (1 - Accuracy), and arrive at the same misclassification rate.
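
For example, with the counts above:

(9 + 2) / (36 + 31 + 9 + 2)   # = 11/78 = 0.1410256, i.e. 1 - 0.859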

2. K-Nearest Neighbor (KNN) model

For the KNN model, we will try various values of k (1, 5, 10, and 100) and select the one with the lowest test misclassification rate.

I also remove the variables mpg, year, name, and mpg01, using their column positions in the respective data frames, as the function knn() of the package class requires a training data frame/matrix and a test data frame/matrix containing only the inputs, plus a vector of the target factor from the training data.
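
As a compact sketch, the four fits below could be run in one loop (knn() breaks distance ties at random, so exact counts may vary slightly between runs):

for (k in c(1, 5, 10, 100)) {
  p <- knn(df.train[, -c(1,7,9,10)], df.test[, -c(1,7,9,10)],
           Actual.mpg01_train, k = k)
  cat("k =", k, "test misclassification rate:", mean(p != Actual.mpg01_test), "\n")
}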

k = 1

pred.knn1 <- knn(df.train[,-c(1,7,9,10)], df.test[,-c(1,7,9,10)], Actual.mpg01_train, k = 1)
cm_knn1 <- confusionMatrix(pred.knn1, Actual.mpg01_test, positive='1', mode='everything')
cm_knn1
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 30  5
##          1 10 33
##                                           
##                Accuracy : 0.8077          
##                  95% CI : (0.7027, 0.8882)
##     No Information Rate : 0.5128          
##     P-Value [Acc > NIR] : 6.353e-08       
##                                           
##                   Kappa : 0.6164          
##                                           
##  Mcnemar's Test P-Value : 0.3017          
##                                           
##             Sensitivity : 0.8684          
##             Specificity : 0.7500          
##          Pos Pred Value : 0.7674          
##          Neg Pred Value : 0.8571          
##               Precision : 0.7674          
##                  Recall : 0.8684          
##                      F1 : 0.8148          
##              Prevalence : 0.4872          
##          Detection Rate : 0.4231          
##    Detection Prevalence : 0.5513          
##       Balanced Accuracy : 0.8092          
##                                           
##        'Positive' Class : 1               
## 
paste("The test data misclassification rate is:", mean(pred.knn1 != Actual.mpg01_test))
## [1] "The test data misclassification rate is: 0.192307692307692"

From the above, we have

True Positive (TP) = 33, True Negative (TN) = 30, False Positive (FP) = 10, False Negative (FN) = 5

We could also use the formula (FP + FN) / (TP + TN + FP + FN), or equivalently (1 - Accuracy), and arrive at the same misclassification rate.

k = 5

pred.knn5 <- knn(df.train[,-c(1,7,9,10)], df.test[,-c(1,7,9,10)], Actual.mpg01_train, k = 5)
cm_knn5 <- confusionMatrix(pred.knn5, Actual.mpg01_test, positive='1', mode='everything')
cm_knn5
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 31  1
##          1  9 37
##                                           
##                Accuracy : 0.8718          
##                  95% CI : (0.7768, 0.9368)
##     No Information Rate : 0.5128          
##     P-Value [Acc > NIR] : 2.112e-11       
##                                           
##                   Kappa : 0.7448          
##                                           
##  Mcnemar's Test P-Value : 0.02686         
##                                           
##             Sensitivity : 0.9737          
##             Specificity : 0.7750          
##          Pos Pred Value : 0.8043          
##          Neg Pred Value : 0.9688          
##               Precision : 0.8043          
##                  Recall : 0.9737          
##                      F1 : 0.8810          
##              Prevalence : 0.4872          
##          Detection Rate : 0.4744          
##    Detection Prevalence : 0.5897          
##       Balanced Accuracy : 0.8743          
##                                           
##        'Positive' Class : 1               
## 
paste("The test data misclassification rate is:", mean(pred.knn5 != Actual.mpg01_test))
## [1] "The test data misclassification rate is: 0.128205128205128"

From the above, we have

True Positive (TP) = 37, True Negative (TN) = 31, False Positive (FP) = 9, False Negative (FN) = 1

We could also use the formula (FP + FN) / (TP + TN + FP + FN), or equivalently (1 - Accuracy), and arrive at the same misclassification rate.

k = 10

pred.knn10 <- knn(df.train[,-c(1,7,9,10)], df.test[,-c(1,7,9,10)], Actual.mpg01_train, k = 10)
cm_knn10 <- confusionMatrix(pred.knn10, Actual.mpg01_test, positive='1', mode='everything')
cm_knn10
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 32  1
##          1  8 37
##                                           
##                Accuracy : 0.8846          
##                  95% CI : (0.7922, 0.9459)
##     No Information Rate : 0.5128          
##     P-Value [Acc > NIR] : 3.16e-12        
##                                           
##                   Kappa : 0.7701          
##                                           
##  Mcnemar's Test P-Value : 0.0455          
##                                           
##             Sensitivity : 0.9737          
##             Specificity : 0.8000          
##          Pos Pred Value : 0.8222          
##          Neg Pred Value : 0.9697          
##               Precision : 0.8222          
##                  Recall : 0.9737          
##                      F1 : 0.8916          
##              Prevalence : 0.4872          
##          Detection Rate : 0.4744          
##    Detection Prevalence : 0.5769          
##       Balanced Accuracy : 0.8868          
##                                           
##        'Positive' Class : 1               
## 
paste("The test data misclassification rate is:", mean(pred.knn10 != Actual.mpg01_test))
## [1] "The test data misclassification rate is: 0.115384615384615"

From the above, we have

True Positive (TP) = 37, True Negative (TN) = 32, False Positive (FP) = 8, False Negative (FN) = 1

We could also use the formula (FP + FN) / (TP + TN + FP + FN), or equivalently (1 - Accuracy), and arrive at the same misclassification rate.

k = 100

pred.knn100 <- knn(df.train[,-c(1,7,9,10)], df.test[,-c(1,7,9,10)], Actual.mpg01_train, k = 100)
cm_knn100 <- confusionMatrix(pred.knn100, Actual.mpg01_test, positive='1', mode='everything')
cm_knn100
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 33  3
##          1  7 35
##                                           
##                Accuracy : 0.8718          
##                  95% CI : (0.7768, 0.9368)
##     No Information Rate : 0.5128          
##     P-Value [Acc > NIR] : 2.112e-11       
##                                           
##                   Kappa : 0.7441          
##                                           
##  Mcnemar's Test P-Value : 0.3428          
##                                           
##             Sensitivity : 0.9211          
##             Specificity : 0.8250          
##          Pos Pred Value : 0.8333          
##          Neg Pred Value : 0.9167          
##               Precision : 0.8333          
##                  Recall : 0.9211          
##                      F1 : 0.8750          
##              Prevalence : 0.4872          
##          Detection Rate : 0.4487          
##    Detection Prevalence : 0.5385          
##       Balanced Accuracy : 0.8730          
##                                           
##        'Positive' Class : 1               
## 
paste("The test data misclassification rate is:", mean(pred.knn100 != Actual.mpg01_test))
## [1] "The test data misclassification rate is: 0.128205128205128"

From the above, we have

True Positive (TP) = 35, True Negative (TN) = 33, False Positive (FP) = 7, False Negative (FN) = 3

We could also use the formula (FP + FN) / (TP + TN + FP + FN), or equivalently (1 - Accuracy), and arrive at the same misclassification rate.
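
One caveat worth noting before moving on: knn() is distance-based, so unscaled inputs let large-scale variables such as weight dominate the distance. A minimal sketch of standardizing the numeric inputs (not used in the results above; the factor columns cylinders and origin are dropped here for simplicity):

num_cols <- c("displacement", "horsepower", "weight", "acceleration")
train_s <- scale(df.train[, num_cols])
test_s  <- scale(df.test[, num_cols],
                 center = attr(train_s, "scaled:center"),  # reuse training centering
                 scale  = attr(train_s, "scaled:scale"))
pred.knn_s <- knn(train_s, test_s, Actual.mpg01_train, k = 10)
mean(pred.knn_s != Actual.mpg01_test)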

3. Naive Bayes

nb_mod <- naiveBayes(mpg01 ~ cylinders + displacement + horsepower + weight + acceleration + origin, data = df.train)
nb_mod
## 
## Naive Bayes Classifier for Discrete Predictors
## 
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
## 
## A-priori probabilities:
## Y
##         0         1 
## 0.4968153 0.5031847 
## 
## Conditional probabilities:
##    cylinders
## Y             3           4           5           6           8
##   0 0.012820513 0.076923077 0.006410256 0.365384615 0.538461538
##   1 0.006329114 0.911392405 0.006329114 0.056962025 0.018987342
## 
##    displacement
## Y       [,1]     [,2]
##   0 277.4359 84.40082
##   1 115.1108 40.84656
## 
##    horsepower
## Y        [,1]     [,2]
##   0 130.42308 35.42207
##   1  78.67089 16.54965
## 
##    weight
## Y       [,1]     [,2]
##   0 3657.397 678.9579
##   1 2330.158 408.3661
## 
##    acceleration
## Y       [,1]     [,2]
##   0 14.56346 2.617290
##   1 16.60633 2.536101
## 
##    origin
## Y            1          2          3
##   0 0.90384615 0.05128205 0.04487179
##   1 0.35443038 0.27215190 0.37341772
pred.nb <- predict(nb_mod, df.test)
cm_nb <- confusionMatrix(pred.nb, Actual.mpg01_test, positive='1', mode='everything')
cm_nb
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 31  2
##          1  9 36
##                                           
##                Accuracy : 0.859           
##                  95% CI : (0.7617, 0.9274)
##     No Information Rate : 0.5128          
##     P-Value [Acc > NIR] : 1.266e-10       
##                                           
##                   Kappa : 0.7191          
##                                           
##  Mcnemar's Test P-Value : 0.07044         
##                                           
##             Sensitivity : 0.9474          
##             Specificity : 0.7750          
##          Pos Pred Value : 0.8000          
##          Neg Pred Value : 0.9394          
##               Precision : 0.8000          
##                  Recall : 0.9474          
##                      F1 : 0.8675          
##              Prevalence : 0.4872          
##          Detection Rate : 0.4615          
##    Detection Prevalence : 0.5769          
##       Balanced Accuracy : 0.8612          
##                                           
##        'Positive' Class : 1               
## 
paste("The test data misclassification rate is:", mean(pred.nb != Actual.mpg01_test))
## [1] "The test data misclassification rate is: 0.141025641025641"

From the above, we have

True Positive (TP) = 36, True Negative (TN) = 31, False Positive (FP) = 9, False Negative (FN) = 2

We could also use the formula (FP + FN) / (TP + TN + FP + FN), or equivalently (1 - Accuracy), and arrive at the same misclassification rate.
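
Under the naive Bayes assumption, the posterior for each class is proportional to the prior times a product of per-feature likelihoods; for numeric inputs, naiveBayes() uses a Gaussian with the per-class mean and standard deviation shown in the conditional-probability tables above. A minimal sketch for a single feature (weight) and the first test car:

tab <- nb_mod$tables$weight                      # rows: classes 0/1; cols: mean, sd
x   <- df.test$weight[1]
lik <- dnorm(x, mean = tab[, 1], sd = tab[, 2])  # Gaussian likelihood per class
post <- (nb_mod$apriori / sum(nb_mod$apriori)) * lik
post / sum(post)                                 # posterior from this one feature alone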

4. Decision (classification) Tree

set.seed(111)
dtree_mod <- rpart(mpg01 ~ .-mpg-year-name, data = df.train, method = 'class')
rpart.plot(dtree_mod)

pred.dtree <- predict(dtree_mod, df.test, type='class')
cm_dtree <- confusionMatrix(pred.dtree, Actual.mpg01_test, positive='1', mode='everything')
cm_dtree
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 29  1
##          1 11 37
##                                           
##                Accuracy : 0.8462          
##                  95% CI : (0.7467, 0.9179)
##     No Information Rate : 0.5128          
##     P-Value [Acc > NIR] : 6.861e-10       
##                                           
##                   Kappa : 0.6941          
##                                           
##  Mcnemar's Test P-Value : 0.009375        
##                                           
##             Sensitivity : 0.9737          
##             Specificity : 0.7250          
##          Pos Pred Value : 0.7708          
##          Neg Pred Value : 0.9667          
##               Precision : 0.7708          
##                  Recall : 0.9737          
##                      F1 : 0.8605          
##              Prevalence : 0.4872          
##          Detection Rate : 0.4744          
##    Detection Prevalence : 0.6154          
##       Balanced Accuracy : 0.8493          
##                                           
##        'Positive' Class : 1               
## 
paste("The test data misclassification rate is:", mean(pred.dtree != Actual.mpg01_test))
## [1] "The test data misclassification rate is: 0.153846153846154"

From the above, we have

True Positive (TP) = 37, True Negative (TN) = 29, False Positive (FP) = 11, False Negative (FN) = 1

We could also use the formula (FP + FN) / (TP + TN + FP + FN), or equivalently (1 - Accuracy), and arrive at the same misclassification rate.
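
rpart() internally cross-validates a sequence of subtrees, so one could also inspect the complexity-parameter table and prune to the subtree with the lowest cross-validated error. A sketch (not applied to the results above):

printcp(dtree_mod)   # CP table with cross-validated error (xerror)
best_cp <- dtree_mod$cptable[which.min(dtree_mod$cptable[, "xerror"]), "CP"]
dtree_pruned <- prune(dtree_mod, cp = best_cp)
rpart.plot(dtree_pruned)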

5. Random Forest

set.seed(111)
rfor_mod <- randomForest(mpg01 ~ .-mpg-year-name, data = df.train, importance=TRUE, proximity=TRUE)
rfor_mod
## 
## Call:
##  randomForest(formula = mpg01 ~ . - mpg - year - name, data = df.train,      importance = TRUE, proximity = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 7.96%
## Confusion matrix:
##     0   1 class.error
## 0 139  17  0.10897436
## 1   8 150  0.05063291
plot(rfor_mod)

pred.rfor <- predict(rfor_mod, df.test)
cm_rfor <- confusionMatrix(pred.rfor, Actual.mpg01_test, positive='1', mode='everything')
cm_rfor
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 35  2
##          1  5 36
##                                           
##                Accuracy : 0.9103          
##                  95% CI : (0.8238, 0.9632)
##     No Information Rate : 0.5128          
##     P-Value [Acc > NIR] : 4.89e-14        
##                                           
##                   Kappa : 0.8207          
##                                           
##  Mcnemar's Test P-Value : 0.4497          
##                                           
##             Sensitivity : 0.9474          
##             Specificity : 0.8750          
##          Pos Pred Value : 0.8780          
##          Neg Pred Value : 0.9459          
##               Precision : 0.8780          
##                  Recall : 0.9474          
##                      F1 : 0.9114          
##              Prevalence : 0.4872          
##          Detection Rate : 0.4615          
##    Detection Prevalence : 0.5256          
##       Balanced Accuracy : 0.9112          
##                                           
##        'Positive' Class : 1               
## 
paste("The test data misclassification rate is:", mean(pred.rfor != Actual.mpg01_test))
## [1] "The test data misclassification rate is: 0.0897435897435897"

From the above, we have

True Positive (TP) = 36, True Negative (TN) = 35, False Positive (FP) = 5, False Negative (FN) = 2

We could also use the formula (FP + FN) / (TP + TN + FP + FN), or equivalently (1 - Accuracy), and arrive at the same misclassification rate.
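
Since the forest was fit with importance=TRUE, the per-variable importance measures can also be inspected directly, complementing the informal ranking given in the exploratory analysis:

importance(rfor_mod)   # mean decrease in accuracy and in Gini, per variable
varImpPlot(rfor_mod)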

Model Comparison

For model comparison, only two metrics, sensitivity (recall) and specificity, were considered when choosing the best model, as together these two metrics account for every entry in the confusion matrix. Sensitivity deals with true positives and false negatives, while specificity deals with false positives and true negatives.
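
As a concrete sketch, both metrics can be read off any of the confusion matrices above, e.g. for the random forest:

tp <- cm_rfor$table["1", "1"]; fn <- cm_rfor$table["0", "1"]
tn <- cm_rfor$table["0", "0"]; fp <- cm_rfor$table["1", "0"]
c(Sensitivity = tp / (tp + fn),   # share of true 1s caught: 36/38
  Specificity = tn / (tn + fp))   # share of true 0s caught: 35/40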

For this data, since classifying a car's mpg correctly as 1 (mpg > 22.75) and as 0 (mpg <= 22.75) are both important, the combination of sensitivity and specificity, which together form a holistic pair of metrics, is considered. I also computed other common metrics used in model selection/comparison; the code for these and a summary are shown below.

Model_sens_or_recall <- rbind(cm_final_log_reg$byClass["Sensitivity"], cm_knn1$byClass["Sensitivity"], cm_knn5$byClass["Sensitivity"], cm_knn10$byClass["Sensitivity"], cm_knn100$byClass["Sensitivity"], cm_nb$byClass["Sensitivity"], cm_dtree$byClass["Sensitivity"], cm_rfor$byClass["Sensitivity"])

rownames(Model_sens_or_recall) <- c("Log_reg", "Knn1", "Knn5", "Knn10", "Knn100", "Naive_bayes", "Decision_tree", "Random_forest")
colnames(Model_sens_or_recall) <- c("Sensitivity_or_Recall")

Model_sens_or_recall
##               Sensitivity_or_Recall
## Log_reg                   0.9473684
## Knn1                      0.8684211
## Knn5                      0.9736842
## Knn10                     0.9736842
## Knn100                    0.9210526
## Naive_bayes               0.9473684
## Decision_tree             0.9736842
## Random_forest             0.9473684
Model_spec <- rbind(cm_final_log_reg$byClass["Specificity"], cm_knn1$byClass["Specificity"], cm_knn5$byClass["Specificity"], cm_knn10$byClass["Specificity"], cm_knn100$byClass["Specificity"], cm_nb$byClass["Specificity"], cm_dtree$byClass["Specificity"], cm_rfor$byClass["Specificity"])

rownames(Model_spec) <- c("Log_reg", "Knn1", "Knn5", "Knn10", "Knn100", "Naive_bayes", "Decision_tree", "Random_forest")
colnames(Model_spec) <- c("Specificity")

Model_spec
##               Specificity
## Log_reg             0.775
## Knn1                0.750
## Knn5                0.775
## Knn10               0.800
## Knn100              0.825
## Naive_bayes         0.775
## Decision_tree       0.725
## Random_forest       0.875
Model_acc <- rbind(cm_final_log_reg$overall["Accuracy"], cm_knn1$overall["Accuracy"], cm_knn5$overall["Accuracy"], cm_knn10$overall["Accuracy"], cm_knn100$overall["Accuracy"], cm_nb$overall["Accuracy"], cm_dtree$overall["Accuracy"], cm_rfor$overall["Accuracy"])

rownames(Model_acc) <- c("Log_reg", "Knn1", "Knn5", "Knn10", "Knn100", "Naive_bayes", "Decision_tree", "Random_forest")
colnames(Model_acc) <- c("Accuracy")

Model_acc
##                Accuracy
## Log_reg       0.8589744
## Knn1          0.8076923
## Knn5          0.8717949
## Knn10         0.8846154
## Knn100        0.8717949
## Naive_bayes   0.8589744
## Decision_tree 0.8461538
## Random_forest 0.9102564
Model_misclassifation_rate <- rbind(1-cm_final_log_reg$overall["Accuracy"], 1-cm_knn1$overall["Accuracy"], 1-cm_knn5$overall["Accuracy"], 1-cm_knn10$overall["Accuracy"], 1-cm_knn100$overall["Accuracy"], 1-cm_nb$overall["Accuracy"], 1-cm_dtree$overall["Accuracy"], 1-cm_rfor$overall["Accuracy"])

rownames(Model_misclassifation_rate) <- c("Log_reg", "Knn1", "Knn5", "Knn10", "Knn100", "Naive_bayes", "Decision_tree", "Random_forest")
colnames(Model_misclassifation_rate) <- c("Misclassification_rate")

Model_misclassifation_rate
##               Misclassification_rate
## Log_reg                   0.14102564
## Knn1                      0.19230769
## Knn5                      0.12820513
## Knn10                     0.11538462
## Knn100                    0.12820513
## Naive_bayes               0.14102564
## Decision_tree             0.15384615
## Random_forest             0.08974359
Model_precision <- rbind(cm_final_log_reg$byClass["Precision"], cm_knn1$byClass["Precision"], cm_knn5$byClass["Precision"], cm_knn10$byClass["Precision"], cm_knn100$byClass["Precision"], cm_nb$byClass["Precision"], cm_dtree$byClass["Precision"], cm_rfor$byClass["Precision"])

rownames(Model_precision) <- c("Log_reg", "Knn1", "Knn5", "Knn10", "Knn100", "Naive_bayes", "Decision_tree", "Random_forest")
colnames(Model_precision) <- c("Precision")

Model_precision
##               Precision
## Log_reg       0.8000000
## Knn1          0.7674419
## Knn5          0.8043478
## Knn10         0.8222222
## Knn100        0.8333333
## Naive_bayes   0.8000000
## Decision_tree 0.7708333
## Random_forest 0.8780488
Model_f1 <- rbind(cm_final_log_reg$byClass["F1"], cm_knn1$byClass["F1"], cm_knn5$byClass["F1"], cm_knn10$byClass["F1"], cm_knn100$byClass["F1"], cm_nb$byClass["F1"], cm_dtree$byClass["F1"], cm_rfor$byClass["F1"])

rownames(Model_f1) <- c("Log_reg", "Knn1", "Knn5", "Knn10", "Knn100", "Naive_bayes", "Decision_tree", "Random_forest")
colnames(Model_f1) <- c("F1")

Model_f1
##                      F1
## Log_reg       0.8674699
## Knn1          0.8148148
## Knn5          0.8809524
## Knn10         0.8915663
## Knn100        0.8750000
## Naive_bayes   0.8674699
## Decision_tree 0.8604651
## Random_forest 0.9113924
Model_summary <- data.frame(Model_sens_or_recall, Model_spec, Model_acc, Model_misclassifation_rate, Model_precision, Model_f1)
Model_summary
##               Sensitivity_or_Recall Specificity  Accuracy
## Log_reg                   0.9473684       0.775 0.8589744
## Knn1                      0.8684211       0.750 0.8076923
## Knn5                      0.9736842       0.775 0.8717949
## Knn10                     0.9736842       0.800 0.8846154
## Knn100                    0.9210526       0.825 0.8717949
## Naive_bayes               0.9473684       0.775 0.8589744
## Decision_tree             0.9736842       0.725 0.8461538
## Random_forest             0.9473684       0.875 0.9102564
##               Misclassification_rate Precision        F1
## Log_reg                   0.14102564 0.8000000 0.8674699
## Knn1                      0.19230769 0.7674419 0.8148148
## Knn5                      0.12820513 0.8043478 0.8809524
## Knn10                     0.11538462 0.8222222 0.8915663
## Knn100                    0.12820513 0.8333333 0.8750000
## Naive_bayes               0.14102564 0.8000000 0.8674699
## Decision_tree             0.15384615 0.7708333 0.8604651
## Random_forest             0.08974359 0.8780488 0.9113924

For visual effect, I decided to use color shades on the Model_summary data frame.

For metrics where higher values mean a better model, green shades are used (the darkest green marking the best model), and for metrics where lower values are better, red shades are used (the darkest red marking the worst model).

formattable(Model_summary, list(
  Sensitivity_or_Recall = color_tile("white", "green"), 
  Specificity = color_tile("white", "green"), 
  Accuracy = color_tile("white", "green"), 
  Misclassification_rate = color_tile("white", "red"), 
  Precision = color_tile("white", "green"), 
  F1 = color_tile("white", "green")
  ))

From the above, we see that the Random_forest model has the second highest sensitivity and the highest specificity, the two metrics we decided to use for selecting the best model. Hence, the Random_forest model is our optimal model for classifying cars' mpg for this data.