Introduction

Performance evaluation is an important part of machine learning process. We need to know how well the model works on the unseen data set.
In this study, I will evaluate my model based on the following Classification metrics: 1. Accuracy Score: How well the model accurately predict the actual class
2. Missclassification Score:1- Accuracy Score
3. Confusion Matrix: Confusion matrix describe the performance for all classes

4. Precision:What proportion of positive identifications was actually correct?

\[ Precision = \frac{True Pos}{(True Pos+False Pos)} \]

  1. Recall: What proportion of actual positives was identified correctly?

\[ Recall = \frac{True Pos}{(True Pos+False Neg)} \] 6. Specificity: how good a test is at avoiding false alarms. \[ Recall = \frac{True Neg}{(True Neg+False Pos)} \]

7.F1 Score:

\[ Recall = 2 * \frac{Precision * Recall}{(Precision + Recall)} \]

Note that for skewed data set accuracy score may be highly misleading.

Data:

Data:

Auto data set: A data frame with 392 observations on the following 9 variables.

mpg:miles per gallon

cylinders: Number of cylinders between 4 and 8

displacement:Engine displacement (cu. inches)

horsepower:Engine horsepower

weight:Vehicle weight (lbs.)

acceleration:Time to accelerate from 0 to 60 mph (sec.)

year:Model year (modulo 100)

origin:Origin of car (1. American, 2. European, 3. Japanese)

name:Vehicle name

The orginal data contained 408 observations but 16 observations with missing values were removed.

Objective:

Carry out the logistic regression and evaluate model and choose the best cut-off point using ROC Curve.



Data Preparation

#Loading necessary libraries
library(tidyverse) # For data manipulation
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.4     v dplyr   1.0.7
## v tidyr   1.1.3     v stringr 1.4.0
## v readr   2.0.1     v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(dplyr) 
library(caret)  # confusion matrix
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
## 
##     lift
library(e1071)
library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
library(rpart)
library(ROCR)
library(rpart.plot)
library(kableExtra)
## 
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
## 
##     group_rows
library(rattle)
## Loading required package: bitops
## Rattle: A free graphical interface for data science with R.
## Version 5.4.0 Copyright (c) 2006-2020 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.
library(yardstick)
## For binary classification, the first factor level is assumed to be the event.
## Use the argument `event_level = "second"` to alter this as needed.
## 
## Attaching package: 'yardstick'
## The following objects are masked from 'package:caret':
## 
##     precision, recall, sensitivity, specificity
## The following object is masked from 'package:readr':
## 
##     spec
library(formattable)
#PREPARING WORK SPAcE
# Clear the workspace: 
rm(list = ls())
# Load data
df <- read.csv("auto-mpg.csv", header = TRUE)

head(df)
##   mpg cylinders displacement horsepower weight acceleration model.year origin
## 1  18         8          307        130   3504         12.0         70      1
## 2  15         8          350        165   3693         11.5         70      1
## 3  18         8          318        150   3436         11.0         70      1
## 4  16         8          304        150   3433         12.0         70      1
## 5  17         8          302        140   3449         10.5         70      1
## 6  15         8          429        198   4341         10.0         70      1
##                    car.name
## 1 chevrolet chevelle malibu
## 2         buick skylark 320
## 3        plymouth satellite
## 4             amc rebel sst
## 5               ford torino
## 6          ford galaxie 500
nrow(df)
## [1] 398
col <-colnames(df)

col
## [1] "mpg"          "cylinders"    "displacement" "horsepower"   "weight"      
## [6] "acceleration" "model.year"   "origin"       "car.name"
#Summary Statistics
summary(df)
##       mpg          cylinders      displacement    horsepower       
##  Min.   : 9.00   Min.   :3.000   Min.   : 68.0   Length:398        
##  1st Qu.:17.50   1st Qu.:4.000   1st Qu.:104.2   Class :character  
##  Median :23.00   Median :4.000   Median :148.5   Mode  :character  
##  Mean   :23.51   Mean   :5.455   Mean   :193.4                     
##  3rd Qu.:29.00   3rd Qu.:8.000   3rd Qu.:262.0                     
##  Max.   :46.60   Max.   :8.000   Max.   :455.0                     
##      weight      acceleration     model.year        origin     
##  Min.   :1613   Min.   : 8.00   Min.   :70.00   Min.   :1.000  
##  1st Qu.:2224   1st Qu.:13.82   1st Qu.:73.00   1st Qu.:1.000  
##  Median :2804   Median :15.50   Median :76.00   Median :1.000  
##  Mean   :2970   Mean   :15.57   Mean   :76.01   Mean   :1.573  
##  3rd Qu.:3608   3rd Qu.:17.18   3rd Qu.:79.00   3rd Qu.:2.000  
##  Max.   :5140   Max.   :24.80   Max.   :82.00   Max.   :3.000  
##    car.name        
##  Length:398        
##  Class :character  
##  Mode  :character  
##                    
##                    
## 
sapply(df, sd)
##          mpg    cylinders displacement   horsepower       weight acceleration 
##    7.8159843    1.7010042  104.2698382           NA  846.8417742    2.7576889 
##   model.year       origin     car.name 
##    3.6976266    0.8020549           NA
#Data Structure
str(df)
## 'data.frame':    398 obs. of  9 variables:
##  $ mpg         : num  18 15 18 16 17 15 14 14 14 15 ...
##  $ cylinders   : int  8 8 8 8 8 8 8 8 8 8 ...
##  $ displacement: num  307 350 318 304 302 429 454 440 455 390 ...
##  $ horsepower  : chr  "130" "165" "150" "150" ...
##  $ weight      : int  3504 3693 3436 3433 3449 4341 4354 4312 4425 3850 ...
##  $ acceleration: num  12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
##  $ model.year  : int  70 70 70 70 70 70 70 70 70 70 ...
##  $ origin      : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ car.name    : chr  "chevrolet chevelle malibu" "buick skylark 320" "plymouth satellite" "amc rebel sst" ...
#Convert to Categorical Variable & to Numeric
df <- df %>%
  select(-car.name) %>%
  mutate_at(vars(cylinders,model.year,origin ), funs(factor)) %>%
  mutate_at(vars(horsepower),funs(as.numeric))

str(df)
## 'data.frame':    398 obs. of  8 variables:
##  $ mpg         : num  18 15 18 16 17 15 14 14 14 15 ...
##  $ cylinders   : Factor w/ 5 levels "3","4","5","6",..: 5 5 5 5 5 5 5 5 5 5 ...
##  $ displacement: num  307 350 318 304 302 429 454 440 455 390 ...
##  $ horsepower  : num  130 165 150 150 140 198 220 215 225 190 ...
##  $ weight      : int  3504 3693 3436 3433 3449 4341 4354 4312 4425 3850 ...
##  $ acceleration: num  12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
##  $ model.year  : Factor w/ 13 levels "70","71","72",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ origin      : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
#Create binary class based on MPG
m0 <- min(df$mpg)
m1 <- mean(df$mpg)
m2 <- max(df$mpg)

df$mpg_con <- cut(x=df$mpg, breaks=c(min(df$mpg),mean(df$mpg), max(df$mpg)), labels=c("L", "H"))

df <- df %>%
  select(-mpg)

Data Partition

ind <- sample(2, nrow(df), replace = T, prob=c(0.8,0.2))
train <- df[ind==1,]
test <- df[ind==2,]
head(train)
##   cylinders displacement horsepower weight acceleration model.year origin
## 1         8          307        130   3504         12.0         70      1
## 3         8          318        150   3436         11.0         70      1
## 4         8          304        150   3433         12.0         70      1
## 5         8          302        140   3449         10.5         70      1
## 6         8          429        198   4341         10.0         70      1
## 7         8          454        220   4354          9.0         70      1
##   mpg_con
## 1       L
## 3       L
## 4       L
## 5       L
## 6       L
## 7       L
head(test)
##    cylinders displacement horsepower weight acceleration model.year origin
## 2          8          350        165   3693         11.5         70      1
## 8          8          440        215   4312          8.5         70      1
## 13         8          400        150   3761          9.5         70      1
## 23         4          104         95   2375         17.5         70      2
## 24         4          121        113   2234         12.5         70      2
## 25         6          199         90   2648         15.0         70      1
##    mpg_con
## 2        L
## 8        L
## 13       L
## 23       H
## 24       H
## 25       L

Checking Stratified Sampling

#Overall
prop.table(table(df$mpg_con))
## 
##         L         H 
## 0.5239295 0.4760705
#Train
prop.table(table(train$mpg_con))
## 
##         L         H 
## 0.5216049 0.4783951
#Test
prop.table(table(test$mpg_con))
## 
##         L         H 
## 0.5342466 0.4657534

Decision Tree Model

model_fit1 <- rpart(mpg_con ~ ., data=train)

rpart.plot(model_fit1)

# For probabilities
p <- predict (model_fit1, newdata=test, type="prob")

head(p)
##             L        H
## 2  1.00000000 0.000000
## 8  1.00000000 0.000000
## 13 1.00000000 0.000000
## 23 0.05755396 0.942446
## 24 0.05755396 0.942446
## 25 1.00000000 0.000000
#For classes
pred <- predict (model_fit1, newdata=test, type="class")

Accuracy rate for mimimum level

t<-table(test$mpg_con)
t
## 
##  L  H 
## 39 34
40/ 76
## [1] 0.5263158

We should be able to get a higher accuracy rate to say our model is succesful.

Accuracy Score

acc<-mean(test$mpg_con == pred)
acc
## [1] 0.9041096

Misclassification Score

1- acc
## [1] 0.09589041

Confusion Matrix

table(pred, test$mpg_con)
##     
## pred  L  H
##    L 37  5
##    H  2 29
#prediciton data frame
df1 <- data.frame(pred, mpg_con=test$mpg_con)
df1
##     pred mpg_con
## 2      L       L
## 8      L       L
## 13     L       L
## 23     H       H
## 24     H       H
## 25     L       L
## 34     L       L
## 38     L       L
## 40     L       L
## 55     H       H
## 61     L       L
## 65     L       L
## 68     L       L
## 69     L       L
## 109    L       L
## 114    L       L
## 115    L       H
## 120    L       L
## 123    L       H
## 127    L       L
## 133    H       H
## 135    L       L
## 138    L       L
## 146    H       H
## 150    H       H
## 154    L       L
## 157    L       L
## 161    L       L
## 179    L       L
## 181    L       H
## 185    H       H
## 192    L       L
## 194    L       H
## 199    H       H
## 201    L       L
## 210    H       L
## 219    H       H
## 223    L       L
## 228    L       L
## 240    H       H
## 244    L       L
## 246    H       H
## 250    L       L
## 253    L       L
## 254    L       L
## 261    L       L
## 265    L       L
## 267    H       H
## 272    H       L
## 276    L       L
## 284    L       L
## 285    L       L
## 287    L       L
## 289    L       L
## 295    H       H
## 296    H       H
## 297    H       H
## 298    H       H
## 299    L       L
## 306    H       H
## 310    H       H
## 311    H       H
## 312    H       H
## 326    H       H
## 327    H       H
## 328    H       H
## 335    H       H
## 347    H       H
## 348    H       H
## 381    H       H
## 385    H       H
## 388    L       H
## 394    H       H
#with Yardstick library
conf_mat(estimate=pred, truth= mpg_con, data=df1)
##           Truth
## Prediction  L  H
##          L 37  5
##          H  2 29

Precision

pre <- precision(data=df1, estimate=pred, truth=mpg_con)
pre
## # A tibble: 1 x 3
##   .metric   .estimator .estimate
##   <chr>     <chr>          <dbl>
## 1 precision binary         0.881

Recall (Sensitivity)

rec <- recall(data=df1, estimate=pred, truth=mpg_con)
rec
## # A tibble: 1 x 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 recall  binary         0.949

Specificity

spe <- spec(data=df1, estimate=pred, truth=mpg_con)
spe
## # A tibble: 1 x 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 spec    binary         0.853

F1 Score

f1 <- f_meas(data=df1, estimate=pred, truth=mpg_con)
f1
## # A tibble: 1 x 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 f_meas  binary         0.914
df<-data_frame(
  id = 1:5,
  Model = c('Accuracy', 'Precision', 'Sensitivity','Specificity', 'F1 Score'),
  Scores =  c(acc, pre[3], rec[3], spe[3], f1[3]), 
)


formattable(df, 
            list(
              Model = formatter("span",
                                style = ~ style(color = ifelse((Scores) > mean(Scores), 
                                                               "green", "red"))),
              area(col = c(Scores)) ~ normalize_bar("lightgreen", 0.2)
              
            ))
id Model Scores
1 Accuracy 0.9041096
2 Precision 0.8809524
3 Sensitivity 0.9487179
4 Specificity 0.8529412
5 F1 Score 0.9135802

References:

  1. CSU Applied Statistics Course Notes

  2. https://developers.google.com/machine-learning/crash-course/classification/precision-and-recall </br

  3. https://uberpython.wordpress.com/2012/01/01/precision-recall-sensitivity-and-specificity/



*************************