The classification models used in this project are logistic regression, k-nearest neighbors (KNN), naive Bayes, decision trees, and random forests.
Additional packages (aside from the base/default R packages) used are:
library(ISLR2)
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:ISLR2':
##
## Boston
library(e1071)
library(class)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:MASS':
##
## select
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(Hmisc)
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:dplyr':
##
## src, summarize
## The following object is masked from 'package:e1071':
##
## impute
## The following objects are masked from 'package:base':
##
## format.pval, units
library(rpart)
library(rpart.plot)
library(caret)
##
## Attaching package: 'caret'
## The following object is masked from 'package:survival':
##
## cluster
library(randomForest)
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
## The following object is masked from 'package:dplyr':
##
## combine
library(formattable)
##
## Attaching package: 'formattable'
## The following object is masked from 'package:MASS':
##
## area
Auto Data Set
Description: Gas mileage, horsepower, and other information for 392 vehicles; a data frame with 392 observations on the following 9 variables.
mpg miles per gallon
cylinders Number of cylinders between 4 and 8
displacement Engine displacement (cu. inches)
horsepower Engine horsepower
weight Vehicle weight (lbs.)
acceleration Time to accelerate from 0 to 60 mph (sec.)
year Model year (modulo 100)
origin Origin of car (1. American, 2. European, 3. Japanese)
name Vehicle name
Source: This dataset was taken from the StatLib library maintained at Carnegie Mellon University. The dataset was used in the 1983 American Statistical Association Exposition. The original dataset has 397 observations, of which 5 have missing values for the variable “horsepower”; these rows are removed here. The original dataset is available as a CSV file in the docs directory, as well as at https://www.statlearning.com.
References: James, G., Witten, D., Hastie, T., and Tibshirani, R. (2013) An Introduction to Statistical Learning with Applications in R, Springer-Verlag, New York, https://www.statlearning.com
df <- data.frame(Auto)
head(df, 5)
## mpg cylinders displacement horsepower weight acceleration year origin
## 1 18 8 307 130 3504 12.0 70 1
## 2 15 8 350 165 3693 11.5 70 1
## 3 18 8 318 150 3436 11.0 70 1
## 4 16 8 304 150 3433 12.0 70 1
## 5 17 8 302 140 3449 10.5 70 1
## name
## 1 chevrolet chevelle malibu
## 2 buick skylark 320
## 3 plymouth satellite
## 4 amc rebel sst
## 5 ford torino
We check whether there are any missing values in our data frame and, if there are, print all records with at least one missing column.
which(is.na(df))
## integer(0)
df[!complete.cases(df),]
## [1] mpg cylinders displacement horsepower weight
## [6] acceleration year origin name
## <0 rows> (or 0-length row.names)
From the above, we see that there are no missing/null values.
We convert the variables cylinders and origin to factors. Note: for origin, 1 = American, 2 = European, 3 = Japanese.
df$cylinders <- as.factor(df$cylinders)
df$origin <- as.factor(df$origin)
The variables cylinders, origin, year, and name were excluded, as their summaries would be redundant.
summary(df[-c(2,7,8,9)])
## mpg displacement horsepower weight acceleration
## Min. : 9.00 Min. : 68.0 Min. : 46.0 Min. :1613 Min. : 8.00
## 1st Qu.:17.00 1st Qu.:105.0 1st Qu.: 75.0 1st Qu.:2225 1st Qu.:13.78
## Median :22.75 Median :151.0 Median : 93.5 Median :2804 Median :15.50
## Mean :23.45 Mean :194.4 Mean :104.5 Mean :2978 Mean :15.54
## 3rd Qu.:29.00 3rd Qu.:275.8 3rd Qu.:126.0 3rd Qu.:3615 3rd Qu.:17.02
## Max. :46.60 Max. :455.0 Max. :230.0 Max. :5140 Max. :24.80
From the above summary of the quantitative variables (inputs: displacement, horsepower, weight, & acceleration; output: mpg), we see that the mean and median are quite different for all of these variables except acceleration. We can plot histograms of these variables to see the shape of their respective distributions.
hist.data.frame(df[,-c(2,7,8,9)])
Scatter plots of each quantitative variable against the others were also made to get an overview of their likely relationships with one another.
pairs(~mpg + displacement + weight + horsepower + acceleration, data=df)
From the above, we see that some of the input variables are collinear (that is, linearly related to one another), e.g. weight & horsepower and displacement & weight. We can also compute the correlation matrix to show this.
cor(df[,-c(2,7,8,9)])
## mpg displacement horsepower weight acceleration
## mpg 1.0000000 -0.8051269 -0.7784268 -0.8322442 0.4233285
## displacement -0.8051269 1.0000000 0.8972570 0.9329944 -0.5438005
## horsepower -0.7784268 0.8972570 1.0000000 0.8645377 -0.6891955
## weight -0.8322442 0.9329944 0.8645377 1.0000000 -0.4168392
## acceleration 0.4233285 -0.5438005 -0.6891955 -0.4168392 1.0000000
From the histograms above, we see that the variable miles per gallon (mpg) is skewed right, making the median the best metric to use as the cutoff point in creating a new variable mpg01 with two classes: 0 if mpg is less than or equal to the median, and 1 if mpg is greater than the median. This new variable, mpg01, then serves as the target variable on which we build the classification models.
paste("The median of mpg is", median(df$mpg))
## [1] "The median of mpg is 22.75"
df$mpg01 <- rep(0, length(df$mpg))
df$mpg01[df$mpg > median(df$mpg)] <- 1
df$mpg01 <- as.factor(df$mpg01)
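Equivalently, the same variable can be built in one step; a minimal sketch (the temporary name mpg01_alt is only for illustration):
# One-step construction of the same factor (sketch; output not shown).
mpg01_alt <- factor(ifelse(df$mpg > median(df$mpg), 1, 0))
identical(mpg01_alt, df$mpg01)  # expected TRUE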
A quick check that the new variable was created correctly:
df[c(10:19), c("mpg","mpg01")]
## mpg mpg01
## 10 15 0
## 11 15 0
## 12 14 0
## 13 15 0
## 14 14 0
## 15 24 1
## 16 22 0
## 17 18 0
## 18 21 0
## 19 27 1
For the factor (qualitative) variables, we first make contingency tables of cylinders & origin against the new target variable mpg01.
table(df$cylinders, df$mpg01)
##
## 0 1
## 3 3 1
## 4 20 179
## 5 1 2
## 6 72 11
## 8 100 3
table(df$origin, df$mpg01)
##
## 0 1
## 1 173 72
## 2 14 54
## 3 9 70
To check the relationship between our factor variables cylinders & origin and the new target variable mpg01, we perform chi-squared tests on the contingency tables created above.
chisq.test(df$cylinders, df$mpg01)
## Warning in chisq.test(df$cylinders, df$mpg01): Chi-squared approximation may be
## incorrect
##
## Pearson's Chi-squared test
##
## data: df$cylinders and df$mpg01
## X-squared = 264.55, df = 4, p-value < 2.2e-16
chisq.test(df$origin, df$mpg01)
##
## Pearson's Chi-squared test
##
## data: df$origin and df$mpg01
## X-squared = 112.27, df = 2, p-value < 2.2e-16
From the above, we see that mpg01 is dependent on both factor variables, as each p-value obtained above is very small (p-value < 2.2e-16).
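The warning for cylinders arises because some cells (cylinders 3 and 5) have very small expected counts. One hedge, as a sketch only (the conclusions above do not depend on it), is a Monte-Carlo p-value, which avoids the large-sample approximation:
# Simulated p-value sidesteps the small-expected-count warning (sketch).
set.seed(1)
chisq.test(df$cylinders, df$mpg01, simulate.p.value = TRUE, B = 10000)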
Now we check the relationship of the continuous input variables with this new categorical variable, starting with group-wise summary statistics (counts, means, and standard deviations); a one-way ANOVA sketch follows the summary.
data.frame(group_by(df, mpg01) %>%
summarise(count = n(),
avg_disp = mean(displacement, na.rm = TRUE),
sd_disp = sd(displacement, na.rm = TRUE),
avg_hrp = mean(horsepower, na.rm = TRUE),
sd_hrp = sd(horsepower, na.rm = TRUE),
avg_wgt = mean(weight, na.rm = TRUE),
sd_wgt = sd(weight, na.rm = TRUE),
avg_acc = mean(acceleration, na.rm = TRUE),
sd_acc = sd(acceleration, na.rm = TRUE), .groups="drop"
))
## mpg01 count avg_disp sd_disp avg_hrp sd_hrp avg_wgt sd_wgt avg_acc
## 1 0 196 273.1582 89.52399 130.11224 37.35564 3620.403 676.9322 14.58571
## 2 1 196 115.6658 38.42951 78.82653 15.91969 2334.765 397.1924 16.49694
## sd_acc
## 1 2.685154
## 2 2.493168
From the above, we have equal counts between the two classes of mpg01, showing that the data is balanced.
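The one-way ANOVA mentioned above can be run explicitly with aov(); a minimal sketch for two of the continuous inputs (output not shown in the original):
# One-way ANOVA of a continuous input across the two mpg01 groups (sketch).
summary(aov(displacement ~ mpg01, data = df))
summary(aov(weight ~ mpg01, data = df))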
We can also visualize these group differences using the boxplots below:
boxplot(df$displacement~df$mpg01, main="displacement vs mpg01", col = c("#00AFBB", "#FC4E07"))
boxplot(df$horsepower~df$mpg01, main="horsepower vs mpg01", col = c("#00AFBB", "#FC4E07"))
boxplot(df$weight~df$mpg01, main="weight vs mpg01", col = c("#00AFBB", "#FC4E07"))
boxplot(df$acceleration~df$mpg01, main="acceleration vs mpg01", col = c("#00AFBB", "#FC4E07"))
From the box plots above, we can conclude that each continuous input variable differs markedly between the two levels of the target variable mpg01; hence there is a significant association between the continuous independent variables and the target variable mpg01.
Ranking all input variables by their importance to the target variable mpg01, in descending order of importance:
For the factor input variables, cylinders has a more significant association with mpg01 than origin, as it has a higher chi-squared test statistic.
For the quantitative input variables, weight and displacement have very distinct boxplots across the levels of mpg01, followed by horsepower, and then acceleration.
An 80-20 train-test split was used for this project.
set.seed(12345)
# 0 marks training rows (~80%), 1 marks test rows (~20%)
split <- sample(c(rep(0, 0.8 * nrow(df)), rep(1, 0.2 * nrow(df))))
df.train <- df[split == 0, ]
df.test  <- df[split == 1, ]
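Note that 0.8 * nrow(df) = 313.6 here, and rep() silently truncates fractional counts, so split has length 391 while df has 392 rows; the logical index split == 0 is therefore recycled when subsetting. An index-based split keeps the counts exact; a sketch only, not used in this project (it would sample different rows):
# Hypothetical alternative: sample row indices so counts are exact integers.
train_idx <- sample(nrow(df), size = round(0.8 * nrow(df)))
df.train_alt <- df[train_idx, ]
df.test_alt  <- df[-train_idx, ]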
Let’s have a look at the first rows of our train data:
head(df.train)
## mpg cylinders displacement horsepower weight acceleration year origin
## 1 18 8 307 130 3504 12.0 70 1
## 2 15 8 350 165 3693 11.5 70 1
## 3 18 8 318 150 3436 11.0 70 1
## 4 16 8 304 150 3433 12.0 70 1
## 5 17 8 302 140 3449 10.5 70 1
## 6 15 8 429 198 4341 10.0 70 1
## name mpg01
## 1 chevrolet chevelle malibu 0
## 2 buick skylark 320 0
## 3 plymouth satellite 0
## 4 amc rebel sst 0
## 5 ford torino 0
## 6 ford galaxie 500 0
Let’s have a look at the first rows of our test data:
head(df.test)
## mpg cylinders displacement horsepower weight acceleration year origin
## 7 14 8 454 220 4354 9.0 70 1
## 10 15 8 390 190 3850 8.5 70 1
## 14 14 8 455 225 3086 10.0 70 1
## 17 18 6 199 97 2774 15.5 70 1
## 24 26 4 121 113 2234 12.5 70 2
## 36 17 6 250 100 3329 15.5 71 1
## name mpg01
## 7 chevrolet impala 0
## 10 amc ambassador dpl 0
## 14 buick estate wagon (sw) 0
## 17 amc hornet 0
## 24 bmw 2002 1
## 36 chevrolet chevelle malibu 0
Now we create vectors holding just the target variable from the training and test data.
Actual.mpg01_train <- df.train$mpg01
Actual.mpg01_test <- df.test$mpg01
log_reg <- glm(mpg01~cylinders + displacement + horsepower + weight + acceleration + origin, data = df.train, family=binomial)
summary(log_reg)
##
## Call:
## glm(formula = mpg01 ~ cylinders + displacement + horsepower +
## weight + acceleration + origin, family = binomial, data = df.train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.40184 -0.25539 0.06431 0.28834 2.84926
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 10.1422805 3.7717084 2.689 0.00717 **
## cylinders4 4.5454003 1.5408100 2.950 0.00318 **
## cylinders5 3.9376520 2.2225913 1.772 0.07645 .
## cylinders6 2.9795082 1.7936226 1.661 0.09668 .
## cylinders8 5.8969995 2.4559551 2.401 0.01635 *
## displacement -0.0213256 0.0129690 -1.644 0.10010
## horsepower -0.0790939 0.0268315 -2.948 0.00320 **
## weight 0.0004541 0.0011932 0.381 0.70354
## acceleration -0.2586885 0.1589896 -1.627 0.10372
## origin2 -0.4008971 0.7554499 -0.531 0.59565
## origin3 1.0483281 0.8720313 1.202 0.22930
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 435.28 on 313 degrees of freedom
## Residual deviance: 137.60 on 303 degrees of freedom
## AIC: 159.6
##
## Number of Fisher Scoring iterations: 7
In the above logistic regression model, no interaction between predictor variables was considered.
From the above, we see that at various significance levels (0.001, 0.01, 0.05, 0.1, etc.), some variables are not significant (less important for the model); for instance, at the 0.05 level, cylinders6 (representing level 6 of the variable cylinders) is not significant.
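Since the coefficients above are on the log-odds scale, exponentiating them gives odds ratios, which are often easier to interpret; a quick sketch (output not shown):
# Odds ratios for the fitted logistic regression (sketch).
round(exp(coef(log_reg)), 4)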
Now we perform stepwise selection using the function stepAIC (from the MASS package) to select the most important variables affecting the target variable mpg01.
final.log_reg <- stepAIC(log_reg)
## Start: AIC=159.6
## mpg01 ~ cylinders + displacement + horsepower + weight + acceleration +
## origin
##
## Df Deviance AIC
## - weight 1 137.75 157.75
## - origin 2 140.60 158.60
## <none> 137.60 159.60
## - acceleration 1 140.33 160.33
## - displacement 1 140.50 160.50
## - horsepower 1 147.21 167.21
## - cylinders 4 159.85 173.85
##
## Step: AIC=157.75
## mpg01 ~ cylinders + displacement + horsepower + acceleration +
## origin
##
## Df Deviance AIC
## - origin 2 140.66 156.66
## <none> 137.75 157.75
## - acceleration 1 140.98 158.98
## - displacement 1 141.13 159.13
## - horsepower 1 150.12 168.12
## - cylinders 4 160.42 172.42
##
## Step: AIC=156.65
## mpg01 ~ cylinders + displacement + horsepower + acceleration
##
## Df Deviance AIC
## <none> 140.66 156.66
## - acceleration 1 144.71 158.71
## - displacement 1 148.18 162.18
## - horsepower 1 152.79 166.79
## - cylinders 4 161.09 169.09
coef(final.log_reg)
## (Intercept) cylinders4 cylinders5 cylinders6 cylinders8 displacement
## 11.40448898 3.75890305 2.98239500 2.54297883 5.51538124 -0.02281515
## horsepower acceleration
## -0.07060382 -0.24725339
From the above, we see the steps taken to reach the final optimal logistic regression model. If you do not want these steps displayed and only want the final model, include the argument trace=FALSE, as shown below.
final.log_reg <- stepAIC(log_reg, trace=FALSE)
coef(final.log_reg)
## (Intercept) cylinders4 cylinders5 cylinders6 cylinders8 displacement
## 11.40448898 3.75890305 2.98239500 2.54297883 5.51538124 -0.02281515
## horsepower acceleration
## -0.07060382 -0.24725339
From the above final optimal logistic regression model, we see that the variables weight and origin were removed.
Now we check the prediction accuracy of the final/optimal model on the test data and compare it with the initial logistic regression model. The cutoff threshold chosen was 0.5: based on the input features (cylinders, displacement, horsepower & acceleration), a car is classified as 1 (mpg greater than 22.75) if its predicted probability exceeds 0.5, and as 0 (mpg less than or equal to 22.75) otherwise.
pr.final_log_reg <- predict(final.log_reg, df.test, type="response")
pred.final_log_reg <- ifelse(pr.final_log_reg > 0.5, "1", "0")
mean(pred.final_log_reg == Actual.mpg01_test)
## [1] 0.8589744
pr.log_reg <- predict(log_reg, df.test, type="response")
pred.log_reg <- ifelse(pr.log_reg > 0.5, "1", "0")
mean(pred.log_reg == Actual.mpg01_test)
## [1] 0.8589744
From the above, we can see that the final/optimal logistic regression model gives the same prediction accuracy and is also less complex than the initial model (as it uses fewer input variables).
Now, making classifications with the final/optimal logistic regression model, we have:
cm_final_log_reg <- confusionMatrix(as.factor(pred.final_log_reg), Actual.mpg01_test, positive='1', mode='everything')
cm_final_log_reg
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 31 2
## 1 9 36
##
## Accuracy : 0.859
## 95% CI : (0.7617, 0.9274)
## No Information Rate : 0.5128
## P-Value [Acc > NIR] : 1.266e-10
##
## Kappa : 0.7191
##
## Mcnemar's Test P-Value : 0.07044
##
## Sensitivity : 0.9474
## Specificity : 0.7750
## Pos Pred Value : 0.8000
## Neg Pred Value : 0.9394
## Precision : 0.8000
## Recall : 0.9474
## F1 : 0.8675
## Prevalence : 0.4872
## Detection Rate : 0.4615
## Detection Prevalence : 0.5769
## Balanced Accuracy : 0.8612
##
## 'Positive' Class : 1
##
paste("The test data misclassification rate is:", mean(pred.final_log_reg != Actual.mpg01_test))
## [1] "The test data misclassification rate is: 0.141025641025641"
From the above, we have:
True Positive (TP) = 36, True Negative (TN) = 31, False Positive (FP) = 9, False Negative (FN) = 2
We could also compute the misclassification rate with the formula (FP + FN) / (TP + TN + FP + FN), or as (1 - Accuracy), and arrive at the same result as above.
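As a sketch, the same rate can be recomputed directly from the confusion-matrix cells (in cm_final_log_reg$table, rows are predictions and columns are the reference classes):
# (FP + FN) / total, taken straight from the caret confusion matrix (sketch).
tab <- cm_final_log_reg$table
(tab["1", "0"] + tab["0", "1"]) / sum(tab)  # equals 1 - accuracy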
For the KNN model, we will try various values of k (1, 5, 10, and 100) and select the one with the lowest test-data misclassification rate.
I also removed the variables mpg, year, name, and mpg01 by their column positions in the respective data frames, as the function knn() of the class package requires a training data frame/matrix and a test data frame/matrix containing only the inputs, together with a vector of target factors from the training data. A scaling refinement is sketched below, before the individual runs.
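Note that knn() computes plain Euclidean distances on the raw values, so variables on large scales (e.g. weight) dominate the distance. A common refinement, sketched here for the numeric inputs only and not used in the runs below, is to standardize with the training means and standard deviations:
# Hypothetical scaling refinement for KNN (not used in the runs below).
num_cols <- c("displacement", "horsepower", "weight", "acceleration")
train_std <- scale(df.train[, num_cols])
test_std  <- scale(df.test[, num_cols],
                   center = attr(train_std, "scaled:center"),
                   scale  = attr(train_std, "scaled:scale"))
pred.knn_std <- knn(train_std, test_std, Actual.mpg01_train, k = 10)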
k = 1
pred.knn1 <- knn(df.train[,-c(1,7,9,10)], df.test[,-c(1,7,9,10)], Actual.mpg01_train, k = 1)
cm_knn1 <- confusionMatrix(pred.knn1, Actual.mpg01_test, positive='1', mode='everything')
cm_knn1
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 30 5
## 1 10 33
##
## Accuracy : 0.8077
## 95% CI : (0.7027, 0.8882)
## No Information Rate : 0.5128
## P-Value [Acc > NIR] : 6.353e-08
##
## Kappa : 0.6164
##
## Mcnemar's Test P-Value : 0.3017
##
## Sensitivity : 0.8684
## Specificity : 0.7500
## Pos Pred Value : 0.7674
## Neg Pred Value : 0.8571
## Precision : 0.7674
## Recall : 0.8684
## F1 : 0.8148
## Prevalence : 0.4872
## Detection Rate : 0.4231
## Detection Prevalence : 0.5513
## Balanced Accuracy : 0.8092
##
## 'Positive' Class : 1
##
paste("The test data misclassification rate is:", mean(pred.knn1 != Actual.mpg01_test))
## [1] "The test data misclassification rate is: 0.192307692307692"
From the above, we have:
True Positive (TP) = 33, True Negative (TN) = 30, False Positive (FP) = 10, False Negative (FN) = 5
We could also compute the misclassification rate with the formula (FP + FN) / (TP + TN + FP + FN), or as (1 - Accuracy), and arrive at the same result as above.
k = 5
pred.knn5 <- knn(df.train[,-c(1,7,9,10)], df.test[,-c(1,7,9,10)], Actual.mpg01_train, k = 5)
cm_knn5 <- confusionMatrix(pred.knn5, Actual.mpg01_test, positive='1', mode='everything')
cm_knn5
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 31 1
## 1 9 37
##
## Accuracy : 0.8718
## 95% CI : (0.7768, 0.9368)
## No Information Rate : 0.5128
## P-Value [Acc > NIR] : 2.112e-11
##
## Kappa : 0.7448
##
## Mcnemar's Test P-Value : 0.02686
##
## Sensitivity : 0.9737
## Specificity : 0.7750
## Pos Pred Value : 0.8043
## Neg Pred Value : 0.9688
## Precision : 0.8043
## Recall : 0.9737
## F1 : 0.8810
## Prevalence : 0.4872
## Detection Rate : 0.4744
## Detection Prevalence : 0.5897
## Balanced Accuracy : 0.8743
##
## 'Positive' Class : 1
##
paste("The test data misclassification rate is:", mean(pred.knn5 != Actual.mpg01_test))
## [1] "The test data misclassification rate is: 0.128205128205128"
From the above, we have:
True Positive (TP) = 37, True Negative (TN) = 31, False Positive (FP) = 9, False Negative (FN) = 1
We could also compute the misclassification rate with the formula (FP + FN) / (TP + TN + FP + FN), or as (1 - Accuracy), and arrive at the same result as above.
k = 10
pred.knn10 <- knn(df.train[,-c(1,7,9,10)], df.test[,-c(1,7,9,10)], Actual.mpg01_train, k = 10)
cm_knn10 <- confusionMatrix(pred.knn10, Actual.mpg01_test, positive='1', mode='everything')
cm_knn10
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 32 1
## 1 8 37
##
## Accuracy : 0.8846
## 95% CI : (0.7922, 0.9459)
## No Information Rate : 0.5128
## P-Value [Acc > NIR] : 3.16e-12
##
## Kappa : 0.7701
##
## Mcnemar's Test P-Value : 0.0455
##
## Sensitivity : 0.9737
## Specificity : 0.8000
## Pos Pred Value : 0.8222
## Neg Pred Value : 0.9697
## Precision : 0.8222
## Recall : 0.9737
## F1 : 0.8916
## Prevalence : 0.4872
## Detection Rate : 0.4744
## Detection Prevalence : 0.5769
## Balanced Accuracy : 0.8868
##
## 'Positive' Class : 1
##
paste("The test data misclassification rate is:", mean(pred.knn10 != Actual.mpg01_test))
## [1] "The test data misclassification rate is: 0.115384615384615"
From the above, we have:
True Positive (TP) = 37, True Negative (TN) = 32, False Positive (FP) = 8, False Negative (FN) = 1
We could also compute the misclassification rate with the formula (FP + FN) / (TP + TN + FP + FN), or as (1 - Accuracy), and arrive at the same result as above.
k = 100
pred.knn100 <- knn(df.train[,-c(1,7,9,10)], df.test[,-c(1,7,9,10)], Actual.mpg01_train, k = 100)
cm_knn100 <- confusionMatrix(pred.knn100, Actual.mpg01_test, positive='1', mode='everything')
cm_knn100
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 33 3
## 1 7 35
##
## Accuracy : 0.8718
## 95% CI : (0.7768, 0.9368)
## No Information Rate : 0.5128
## P-Value [Acc > NIR] : 2.112e-11
##
## Kappa : 0.7441
##
## Mcnemar's Test P-Value : 0.3428
##
## Sensitivity : 0.9211
## Specificity : 0.8250
## Pos Pred Value : 0.8333
## Neg Pred Value : 0.9167
## Precision : 0.8333
## Recall : 0.9211
## F1 : 0.8750
## Prevalence : 0.4872
## Detection Rate : 0.4487
## Detection Prevalence : 0.5385
## Balanced Accuracy : 0.8730
##
## 'Positive' Class : 1
##
paste("The test data misclassification rate is:", mean(pred.knn100 != Actual.mpg01_test))
## [1] "The test data misclassification rate is: 0.128205128205128"
From the above, we have:
True Positive (TP) = 35, True Negative (TN) = 33, False Positive (FP) = 7, False Negative (FN) = 3
We could also compute the misclassification rate with the formula (FP + FN) / (TP + TN + FP + FN), or as (1 - Accuracy), and arrive at the same result as above.
nb_mod <- naiveBayes(mpg01 ~ cylinders + displacement + horsepower + weight + acceleration + origin, data = df.train)
nb_mod
##
## Naive Bayes Classifier for Discrete Predictors
##
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
##
## A-priori probabilities:
## Y
## 0 1
## 0.4968153 0.5031847
##
## Conditional probabilities:
## cylinders
## Y 3 4 5 6 8
## 0 0.012820513 0.076923077 0.006410256 0.365384615 0.538461538
## 1 0.006329114 0.911392405 0.006329114 0.056962025 0.018987342
##
## displacement
## Y [,1] [,2]
## 0 277.4359 84.40082
## 1 115.1108 40.84656
##
## horsepower
## Y [,1] [,2]
## 0 130.42308 35.42207
## 1 78.67089 16.54965
##
## weight
## Y [,1] [,2]
## 0 3657.397 678.9579
## 1 2330.158 408.3661
##
## acceleration
## Y [,1] [,2]
## 0 14.56346 2.617290
## 1 16.60633 2.536101
##
## origin
## Y 1 2 3
## 0 0.90384615 0.05128205 0.04487179
## 1 0.35443038 0.27215190 0.37341772
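In the conditional probability tables above, each numeric predictor is modeled as a within-class Gaussian, so columns [,1] and [,2] are the class-conditional mean and standard deviation, while the factor predictors get ordinary conditional proportions. If class probabilities rather than hard labels are wanted, predict() also accepts type = "raw"; a sketch (output not shown):
# Posterior class probabilities instead of hard labels (sketch).
head(predict(nb_mod, df.test, type = "raw"))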
pred.nb <- predict(nb_mod, df.test)
cm_nb <- confusionMatrix(pred.nb, Actual.mpg01_test, positive='1', mode='everything')
cm_nb
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 31 2
## 1 9 36
##
## Accuracy : 0.859
## 95% CI : (0.7617, 0.9274)
## No Information Rate : 0.5128
## P-Value [Acc > NIR] : 1.266e-10
##
## Kappa : 0.7191
##
## Mcnemar's Test P-Value : 0.07044
##
## Sensitivity : 0.9474
## Specificity : 0.7750
## Pos Pred Value : 0.8000
## Neg Pred Value : 0.9394
## Precision : 0.8000
## Recall : 0.9474
## F1 : 0.8675
## Prevalence : 0.4872
## Detection Rate : 0.4615
## Detection Prevalence : 0.5769
## Balanced Accuracy : 0.8612
##
## 'Positive' Class : 1
##
paste("The test data misclassification rate is:", mean(pred.nb != Actual.mpg01_test))
## [1] "The test data misclassification rate is: 0.141025641025641"
From the above, we have:
True Positive (TP) = 36, True Negative (TN) = 31, False Positive (FP) = 9, False Negative (FN) = 2
We could also compute the misclassification rate with the formula (FP + FN) / (TP + TN + FP + FN), or as (1 - Accuracy), and arrive at the same result as above.
set.seed(111)
dtree_mod <- rpart(mpg01 ~ .-mpg-year-name, data = df.train, method = 'class')
rpart.plot(dtree_mod)
pred.dtree <- predict(dtree_mod, df.test, type='class')
cm_dtree <- confusionMatrix(pred.dtree, Actual.mpg01_test, positive='1', mode='everything')
cm_dtree
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 29 1
## 1 11 37
##
## Accuracy : 0.8462
## 95% CI : (0.7467, 0.9179)
## No Information Rate : 0.5128
## P-Value [Acc > NIR] : 6.861e-10
##
## Kappa : 0.6941
##
## Mcnemar's Test P-Value : 0.009375
##
## Sensitivity : 0.9737
## Specificity : 0.7250
## Pos Pred Value : 0.7708
## Neg Pred Value : 0.9667
## Precision : 0.7708
## Recall : 0.9737
## F1 : 0.8605
## Prevalence : 0.4872
## Detection Rate : 0.4744
## Detection Prevalence : 0.6154
## Balanced Accuracy : 0.8493
##
## 'Positive' Class : 1
##
paste("The test data misclassification rate is:", mean(pred.dtree != Actual.mpg01_test))
## [1] "The test data misclassification rate is: 0.153846153846154"
From the above, we have:
True Positive (TP) = 37, True Negative (TN) = 29, False Positive (FP) = 11, False Negative (FN) = 1
We could also compute the misclassification rate with the formula (FP + FN) / (TP + TN + FP + FN), or as (1 - Accuracy), and arrive at the same result as above.
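Note that rpart grows the tree with its default complexity parameter (cp = 0.01); before moving on, one could also inspect the cross-validated error table and prune, a sketch only (not applied here):
# Inspect CV error vs. complexity, then prune to the best cp (sketch).
printcp(dtree_mod)
best_cp <- dtree_mod$cptable[which.min(dtree_mod$cptable[, "xerror"]), "CP"]
pruned_dtree <- prune(dtree_mod, cp = best_cp)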
set.seed(111)
rfor_mod <- randomForest(mpg01 ~ .-mpg-year-name, data = df.train, importance=TRUE, proximity=TRUE)
rfor_mod
##
## Call:
## randomForest(formula = mpg01 ~ . - mpg - year - name, data = df.train, importance = TRUE, proximity = TRUE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 7.96%
## Confusion matrix:
## 0 1 class.error
## 0 139 17 0.10897436
## 1 8 150 0.05063291
plot(rfor_mod)
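Because the forest was fit with importance=TRUE, the variable-importance ranking discussed earlier can be checked directly; a sketch (output not shown):
# Permutation and Gini importance per input variable (sketch).
importance(rfor_mod)
varImpPlot(rfor_mod)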
pred.rfor <- predict(rfor_mod, df.test)
cm_rfor <- confusionMatrix(pred.rfor, Actual.mpg01_test, positive='1', mode='everything')
cm_rfor
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 35 2
## 1 5 36
##
## Accuracy : 0.9103
## 95% CI : (0.8238, 0.9632)
## No Information Rate : 0.5128
## P-Value [Acc > NIR] : 4.89e-14
##
## Kappa : 0.8207
##
## Mcnemar's Test P-Value : 0.4497
##
## Sensitivity : 0.9474
## Specificity : 0.8750
## Pos Pred Value : 0.8780
## Neg Pred Value : 0.9459
## Precision : 0.8780
## Recall : 0.9474
## F1 : 0.9114
## Prevalence : 0.4872
## Detection Rate : 0.4615
## Detection Prevalence : 0.5256
## Balanced Accuracy : 0.9112
##
## 'Positive' Class : 1
##
paste("The test data misclassification rate is:", mean(pred.rfor != Actual.mpg01_test))
## [1] "The test data misclassification rate is: 0.0897435897435897"
From the above, we have:
True Positive (TP) = 36, True Negative (TN) = 35, False Positive (FP) = 5, False Negative (FN) = 2
We could also compute the misclassification rate with the formula (FP + FN) / (TP + TN + FP + FN), or as (1 - Accuracy), and arrive at the same result as above.
For model comparison, only two metrics, Sensitivity (Recall) and Specificity, were considered when choosing the best model, as these two metrics together consider all entries in the dataset: sensitivity deals with true positives and false negatives, while specificity deals with false positives and true negatives.
For this data, since classifying a car's mpg as 1 (mpg > 22.75) and as 0 (mpg <= 22.75) are both important, the combination of sensitivity and specificity, which together form a holistic pair of metrics, is considered. I also computed other common metrics used in model selection/comparison; the code for these and a summary are shown below.
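As a worked check of the two definitions, sensitivity = TP / (TP + FN) and specificity = TN / (TN + FP); applied to the random-forest confusion matrix above (a sketch):
# Sensitivity and specificity recomputed from cm_rfor$table (sketch).
tab <- cm_rfor$table
tab["1", "1"] / (tab["1", "1"] + tab["0", "1"])  # sensitivity: 36/38
tab["0", "0"] / (tab["0", "0"] + tab["1", "0"])  # specificity: 35/40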
Model_sens_or_recall <- rbind(cm_final_log_reg$byClass["Sensitivity"], cm_knn1$byClass["Sensitivity"], cm_knn5$byClass["Sensitivity"], cm_knn10$byClass["Sensitivity"], cm_knn100$byClass["Sensitivity"], cm_nb$byClass["Sensitivity"], cm_dtree$byClass["Sensitivity"], cm_rfor$byClass["Sensitivity"])
rownames(Model_sens_or_recall) <- c("Log_reg", "Knn1", "Knn5", "Knn10", "Knn100", "Naive_bayes", "Decision_tree", "Random_forest")
colnames(Model_sens_or_recall) <- c("Sensitivity_or_Recall")
Model_sens_or_recall
## Sensitivity_or_Recall
## Log_reg 0.9473684
## Knn1 0.8684211
## Knn5 0.9736842
## Knn10 0.9736842
## Knn100 0.9210526
## Naive_bayes 0.9473684
## Decision_tree 0.9736842
## Random_forest 0.9473684
Model_spec <- rbind(cm_final_log_reg$byClass["Specificity"], cm_knn1$byClass["Specificity"], cm_knn5$byClass["Specificity"], cm_knn10$byClass["Specificity"], cm_knn100$byClass["Specificity"], cm_nb$byClass["Specificity"], cm_dtree$byClass["Specificity"], cm_rfor$byClass["Specificity"])
rownames(Model_spec) <- c("Log_reg", "Knn1", "Knn5", "Knn10", "Knn100", "Naive_bayes", "Decision_tree", "Random_forest")
colnames(Model_spec) <- c("Specificity")
Model_spec
## Specificity
## Log_reg 0.775
## Knn1 0.750
## Knn5 0.775
## Knn10 0.800
## Knn100 0.825
## Naive_bayes 0.775
## Decision_tree 0.725
## Random_forest 0.875
Model_acc <- rbind(cm_final_log_reg$overall["Accuracy"], cm_knn1$overall["Accuracy"], cm_knn5$overall["Accuracy"], cm_knn10$overall["Accuracy"], cm_knn100$overall["Accuracy"], cm_nb$overall["Accuracy"], cm_dtree$overall["Accuracy"], cm_rfor$overall["Accuracy"])
rownames(Model_acc) <- c("Log_reg", "Knn1", "Knn5", "Knn10", "Knn100", "Naive_bayes", "Decision_tree", "Random_forest")
colnames(Model_acc) <- c("Accuracy")
Model_acc
## Accuracy
## Log_reg 0.8589744
## Knn1 0.8076923
## Knn5 0.8717949
## Knn10 0.8846154
## Knn100 0.8717949
## Naive_bayes 0.8589744
## Decision_tree 0.8461538
## Random_forest 0.9102564
Model_misclassifation_rate <- rbind(1-cm_final_log_reg$overall["Accuracy"], 1-cm_knn1$overall["Accuracy"], 1-cm_knn5$overall["Accuracy"], 1-cm_knn10$overall["Accuracy"], 1-cm_knn100$overall["Accuracy"], 1-cm_nb$overall["Accuracy"], 1-cm_dtree$overall["Accuracy"], 1-cm_rfor$overall["Accuracy"])
rownames(Model_misclassifation_rate) <- c("Log_reg", "Knn1", "Knn5", "Knn10", "Knn100", "Naive_bayes", "Decision_tree", "Random_forest")
colnames(Model_misclassifation_rate) <- c("Misclassification_rate")
Model_misclassifation_rate
## Misclassification_rate
## Log_reg 0.14102564
## Knn1 0.19230769
## Knn5 0.12820513
## Knn10 0.11538462
## Knn100 0.12820513
## Naive_bayes 0.14102564
## Decision_tree 0.15384615
## Random_forest 0.08974359
Model_precision <- rbind(cm_final_log_reg$byClass["Precision"], cm_knn1$byClass["Precision"], cm_knn5$byClass["Precision"], cm_knn10$byClass["Precision"], cm_knn100$byClass["Precision"], cm_nb$byClass["Precision"], cm_dtree$byClass["Precision"], cm_rfor$byClass["Precision"])
rownames(Model_precision) <- c("Log_reg", "Knn1", "Knn5", "Knn10", "Knn100", "Naive_bayes", "Decision_tree", "Random_forest")
colnames(Model_precision) <- c("Precision")
Model_precision
## Precision
## Log_reg 0.8000000
## Knn1 0.7674419
## Knn5 0.8043478
## Knn10 0.8222222
## Knn100 0.8333333
## Naive_bayes 0.8000000
## Decision_tree 0.7708333
## Random_forest 0.8780488
Model_f1 <- rbind(cm_final_log_reg$byClass["F1"], cm_knn1$byClass["F1"], cm_knn5$byClass["F1"], cm_knn10$byClass["F1"], cm_knn100$byClass["F1"], cm_nb$byClass["F1"], cm_dtree$byClass["F1"], cm_rfor$byClass["F1"])
rownames(Model_f1) <- c("Log_reg", "Knn1", "Knn5", "Knn10", "Knn100", "Naive_bayes", "Decision_tree", "Random_forest")
colnames(Model_f1) <- c("F1")
Model_f1
## F1
## Log_reg 0.8674699
## Knn1 0.8148148
## Knn5 0.8809524
## Knn10 0.8915663
## Knn100 0.8750000
## Naive_bayes 0.8674699
## Decision_tree 0.8604651
## Random_forest 0.9113924
Model_summary <- data.frame(Model_sens_or_recall, Model_spec, Model_acc, Model_misclassifation_rate, Model_precision, Model_f1)
Model_summary
## Sensitivity_or_Recall Specificity Accuracy
## Log_reg 0.9473684 0.775 0.8589744
## Knn1 0.8684211 0.750 0.8076923
## Knn5 0.9736842 0.775 0.8717949
## Knn10 0.9736842 0.800 0.8846154
## Knn100 0.9210526 0.825 0.8717949
## Naive_bayes 0.9473684 0.775 0.8589744
## Decision_tree 0.9736842 0.725 0.8461538
## Random_forest 0.9473684 0.875 0.9102564
## Misclassification_rate Precision F1
## Log_reg 0.14102564 0.8000000 0.8674699
## Knn1 0.19230769 0.7674419 0.8148148
## Knn5 0.12820513 0.8043478 0.8809524
## Knn10 0.11538462 0.8222222 0.8915663
## Knn100 0.12820513 0.8333333 0.8750000
## Naive_bayes 0.14102564 0.8000000 0.8674699
## Decision_tree 0.15384615 0.7708333 0.8604651
## Random_forest 0.08974359 0.8780488 0.9113924
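The repeated rbind/rownames/colnames pattern above could also be collapsed; a more compact, hypothetical assembly of the same table (the names cms, grab, and Model_summary_alt are chosen here for illustration):
# Hypothetical compact assembly of the same summary table (sketch).
cms <- list(Log_reg = cm_final_log_reg, Knn1 = cm_knn1, Knn5 = cm_knn5,
            Knn10 = cm_knn10, Knn100 = cm_knn100, Naive_bayes = cm_nb,
            Decision_tree = cm_dtree, Random_forest = cm_rfor)
grab <- function(what, field = "byClass")
  sapply(cms, function(m) unname(m[[field]][what]))
Model_summary_alt <- data.frame(
  Sensitivity_or_Recall  = grab("Sensitivity"),
  Specificity            = grab("Specificity"),
  Accuracy               = grab("Accuracy", "overall"),
  Misclassification_rate = 1 - grab("Accuracy", "overall"),
  Precision              = grab("Precision"),
  F1                     = grab("F1"))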
For visual effect, I decided to use color shades on the Model_summary data frame.
For metrics where higher values indicate a better model, green shades are used (the darkest green marking the best model); for metrics where lower values are better, red shades are used (the darkest red marking the worst).
formattable(Model_summary, list(
Sensitivity_or_Recall = color_tile("white", "green"),
Specificity = color_tile("white", "green"),
Accuracy = color_tile("white", "green"),
Misclassification_rate = color_tile("white", "red"),
Precision = color_tile("white", "green"),
F1 = color_tile("white", "green")
))
| Model | Sensitivity_or_Recall | Specificity | Accuracy | Misclassification_rate | Precision | F1 |
|---|---|---|---|---|---|---|
| Log_reg | 0.9473684 | 0.775 | 0.8589744 | 0.14102564 | 0.8000000 | 0.8674699 |
| Knn1 | 0.8684211 | 0.750 | 0.8076923 | 0.19230769 | 0.7674419 | 0.8148148 |
| Knn5 | 0.9736842 | 0.775 | 0.8717949 | 0.12820513 | 0.8043478 | 0.8809524 |
| Knn10 | 0.9736842 | 0.800 | 0.8846154 | 0.11538462 | 0.8222222 | 0.8915663 |
| Knn100 | 0.9210526 | 0.825 | 0.8717949 | 0.12820513 | 0.8333333 | 0.8750000 |
| Naive_bayes | 0.9473684 | 0.775 | 0.8589744 | 0.14102564 | 0.8000000 | 0.8674699 |
| Decision_tree | 0.9736842 | 0.725 | 0.8461538 | 0.15384615 | 0.7708333 | 0.8604651 |
| Random_forest | 0.9473684 | 0.875 | 0.9102564 | 0.08974359 | 0.8780488 | 0.9113924 |
From the above, we see that the Random_forest model has the second highest sensitivity and the highest specificity, the two metrics we decided to use for selecting the best model. Hence, the Random_forest model is our optimal model for classifying cars' mpg for this data.