Section 5: Machine Learning

knitr::opts_chunk$set(echo = TRUE)

Overview

Using the dataset from https://archive.ics.uci.edu/ml/datasets/Car+Evaluation, create a machine learning model to predict the buying price of a car given the following parameters:

Maintenance = High
Number of Doors = 4
Lug Boot Size = Big
Safety = High
Class Value = Good

Loading of Libraries

# Loading of Libraries
library(caret)

## Loading required package: ggplot2

## Loading required package: lattice

library(RColorBrewer)
library(corrplot)

## corrplot 0.92 loaded

library(rpart)
library(rpart.plot)
library(rattle)

## Loading required package: tibble

## Loading required package: bitops

## Rattle: A free graphical interface for data science with R.
## Version 5.5.1 Copyright (c) 2006-2021 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.

library(randomForest)

## randomForest 4.7-1.1

## Type rfNews() to see new features/changes/bug fixes.

## 
## Attaching package: 'randomForest'

## The following object is masked from 'package:rattle':
## 
##     importance

## The following object is masked from 'package:ggplot2':
## 
##     margin

library(gbm)

## Loaded gbm 2.1.8

library(stringr)
library(latex2exp)

Loading and Cleaning of Data

# Loading the Data
train_in <- read.csv('./car_train.csv', header=T)
test_in <- read.csv('./car_test.csv', header=T)

train_in[c('buying_price', 'maintenance', 'no_of_doors', 'capacity_persons', 'lug_boot_size', 'safety', 'class')] <- str_split_fixed(train_in$car_train, ',', 7)
train_in <- train_in[c('buying_price', 'maintenance', 'no_of_doors', 'capacity_persons', 'lug_boot_size', 'safety', 'class')]

test_in[c('buying_price', 'maintenance', 'no_of_doors', 'capacity_persons', 'lug_boot_size', 'safety', 'class')] <- str_split_fixed(test_in$car_test, ',', 7)
test_in <- test_in[c('buying_price', 'maintenance', 'no_of_doors', 'capacity_persons', 'lug_boot_size', 'safety', 'class')]

dim(train_in)

## [1] 1728    7

dim(test_in)

## [1] 1 7

# Cleaning the Data
trainData <- train_in[, colSums(is.na(train_in)) == 0]
testData <- test_in[, colSums(is.na(test_in)) == 0]

dim(trainData)

## [1] 1728    7

dim(testData)

## [1] 1 7

Preparing Datasets for Prediction

We split the training data (trainData) into 50% for training (trainData) and 50% for cross validation (validData). This will help us to determine out-of-sample errors. We will then use our prediction models to predict the buying price of the car for our test case (testData).

# Splitting the Training Data
set.seed(1234)
inTrain <- createDataPartition(trainData$buying_price, p = 0.5, list = FALSE)
trainData <- trainData[inTrain, ]
validData <- trainData[-inTrain, ]

dim(trainData)

## [1] 864   7

dim(validData)

## [1] 432   7

trainData <- as.data.frame(lapply(trainData, as.numeric))
validData <- as.data.frame(lapply(validData, as.numeric))

Plotting a Correlation Plot for Training Data

# Plotting a Correlation Plot for Training Data
cor_mat <- cor(trainData)
corrplot(cor_mat, order = "FPC", method = "color",
         type = "upper", tl.cex = 0.8, tl.col = rgb(0, 0, 0))

In the Correlation Plot shown above, the variables that are highly correlated are highlighted at the dark blue intersections. We used a threshold value of 0.95 to determine these highly correlated variables.

Building our Prediction Models

For this Section, we will use 3 different algorithms to predict the outcome (buying_price). The algorithms are as follows:

Classification Tree
Generalized Boosted Models
Random Decision Forests

Prediction with Classification Tree

# Building our Classification Tree Model with Training Data
set.seed(12345)
decisionTreeMod1 <- rpart(buying_price ~ ., data = trainData, method = "class")
fancyRpartPlot(decisionTreeMod1)

Next, we cross validate our Classification Tree Model with our validation data (validData), to determine the accuracy of this prediction model.

# Cross Validating the Classification Tree Model with Validation Data
Prediction_Matrix_CT <- predict(decisionTreeMod1, validData, type = "class")
cmtree <- confusionMatrix(table(Prediction_Matrix_CT, validData$buying_price))
cmtree

## Confusion Matrix and Statistics
## 
##                     
## Prediction_Matrix_CT  1  2  3  4
##                    1 68 56 41 36
##                    2 10 19  8  0
##                    3  0  3 12 10
##                    4 22 38 49 60
## 
## Overall Statistics
##                                           
##                Accuracy : 0.3681          
##                  95% CI : (0.3225, 0.4155)
##     No Information Rate : 0.2685          
##     P-Value [Acc > NIR] : 3.847e-06       
##                                           
##                   Kappa : 0.1669          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
## 
## Statistics by Class:
## 
##                      Class: 1 Class: 2 Class: 3 Class: 4
## Sensitivity            0.6800  0.16379  0.10909   0.5660
## Specificity            0.5994  0.94304  0.95963   0.6656
## Pos Pred Value         0.3383  0.51351  0.48000   0.3550
## Neg Pred Value         0.8615  0.75443  0.75921   0.8251
## Prevalence             0.2315  0.26852  0.25463   0.2454
## Detection Rate         0.1574  0.04398  0.02778   0.1389
## Detection Prevalence   0.4653  0.08565  0.05787   0.3912
## Balanced Accuracy      0.6397  0.55342  0.53436   0.6158

# Plotting Results in a Matrix
plot(cmtree$table, col = cmtree$byClass,
     main = paste("Classification Tree: Accuracy =",
                  round(cmtree$overall['Accuracy'], 4)))

From the Classification Tree Matrix shown above, the accuracy of our Classification Tree Model is 0.3681. Therefore, its out-of-sample error is 0.6319.

Prediction with Generalized Boosted Models

# Building our Generalized Boosted Models with Training Data
set.seed(12345)
modGBM <- gbm(formula = as.factor(buying_price) ~ ., distribution = "gaussian",
              data = trainData, n.trees = 1000, interaction.depth = 3,
              shrinkage = 0.1, cv.folds = 5, n.cores = NULL, verbose = FALSE)
print(modGBM)

## gbm(formula = as.factor(buying_price) ~ ., distribution = "gaussian", 
##     data = trainData, n.trees = 1000, interaction.depth = 3, 
##     shrinkage = 0.1, cv.folds = 5, verbose = FALSE, n.cores = NULL)
## A gradient boosted model with gaussian loss function.
## 1000 iterations were performed.
## The best cross-validation iteration was 81.
## There were 6 predictors of which 6 had non-zero influence.

# Plotting our Generalized Boosted Models
gbm.perf(modGBM, method = "cv")

## [1] 81

Next, we cross validate our Generalized Boosted Models with our validation data (validData), to determine the accuracy of this prediction model.

# Cross Validating the Generalized Boosted Models with Validation Data
Prediction_Matrix_GBM <- round(predict(modGBM, newdata=validData, n.trees=1000), digits=0)
cmGBM <- confusionMatrix(table(Prediction_Matrix_GBM, validData$buying_price))
cmGBM

## Confusion Matrix and Statistics
## 
##                      
## Prediction_Matrix_GBM  1  2  3  4
##                     1 29 14  0  0
##                     2 51 58 36 20
##                     3 20 44 68 59
##                     4  0  0  6 27
## 
## Overall Statistics
##                                           
##                Accuracy : 0.4213          
##                  95% CI : (0.3743, 0.4694)
##     No Information Rate : 0.2685          
##     P-Value [Acc > NIR] : 5.283e-12       
##                                           
##                   Kappa : 0.2212          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: 1 Class: 2 Class: 3 Class: 4
## Sensitivity           0.29000   0.5000   0.6182  0.25472
## Specificity           0.95783   0.6614   0.6180  0.98160
## Pos Pred Value        0.67442   0.3515   0.3560  0.81818
## Neg Pred Value        0.81748   0.7828   0.8257  0.80201
## Prevalence            0.23148   0.2685   0.2546  0.24537
## Detection Rate        0.06713   0.1343   0.1574  0.06250
## Detection Prevalence  0.09954   0.3819   0.4421  0.07639
## Balanced Accuracy     0.62392   0.5807   0.6181  0.61816

# Plotting Results in a Matrix
plot(cmGBM$table, col = cmGBM$byClass,
     main = paste("Generalized Boosted Models: Accuracy =",
                  round(cmGBM$overall['Accuracy'], 4)))

From the Generalized Boosted Models Matrix shown above, the accuracy of our Generalized Boosted Models is 0.4213. Therefore, its out-of-sample error is 0.5787.

Prediction with Random Decision Forests

# Building our Random Decision Forests Model with Training Data
set.seed(12345)
controlRF <- trainControl(method="cv", number=10, verboseIter=FALSE)
modRF1 <- train(as.factor(buying_price) ~ ., data=trainData,
                method="rf", ntree=1000, trControl=controlRF)
modRF1$finalModel

## 
## Call:
##  randomForest(x = x, y = y, ntree = 1000, mtry = min(param$mtry,      ncol(x))) 
##                Type of random forest: classification
##                      Number of trees: 1000
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 83.1%
## Confusion matrix:
##    1  2  3  4 class.error
## 1 52 79 42 43   0.7592593
## 2 93 16 55 52   0.9259259
## 3 46 51 29 90   0.8657407
## 4 41 40 86 49   0.7731481

# Plotting our Random Decision Forests Model
plot(modRF1)

Next, we cross validate our Random Decision Forests Model with our validation data (validData), to determine the accuracy of this prediction model.

# Cross Validating the Random Decision Forests Model with Validation Data
Prediction_Matrix_RDF <- predict(modRF1, newdata=validData, type = "raw")
cmrf <- confusionMatrix(table(Prediction_Matrix_RDF, validData$buying_price))
cmrf

## Confusion Matrix and Statistics
## 
##                      
## Prediction_Matrix_RDF  1  2  3  4
##                     1 65 28 13  9
##                     2 14 51  9 13
##                     3  9 20 66 18
##                     4 12 17 22 66
## 
## Overall Statistics
##                                           
##                Accuracy : 0.5741          
##                  95% CI : (0.5259, 0.6212)
##     No Information Rate : 0.2685          
##     P-Value [Acc > NIR] : < 2e-16         
##                                           
##                   Kappa : 0.433           
##                                           
##  Mcnemar's Test P-Value : 0.09062         
## 
## Statistics by Class:
## 
##                      Class: 1 Class: 2 Class: 3 Class: 4
## Sensitivity            0.6500   0.4397   0.6000   0.6226
## Specificity            0.8494   0.8861   0.8540   0.8436
## Pos Pred Value         0.5652   0.5862   0.5841   0.5641
## Neg Pred Value         0.8896   0.8116   0.8621   0.8730
## Prevalence             0.2315   0.2685   0.2546   0.2454
## Detection Rate         0.1505   0.1181   0.1528   0.1528
## Detection Prevalence   0.2662   0.2014   0.2616   0.2708
## Balanced Accuracy      0.7497   0.6629   0.7270   0.7331

# Plotting Results in a Matrix
plot(cmrf$table, col = cmrf$byClass,
     main = paste("Random Decision Forests: Accuracy =",
                  round(cmrf$overall['Accuracy'], 4)))

From the Random Decision Forests Matrix shown above, the accuracy of our Random Decision Forests Model is 0.5741. Therefore, its out-of-sample error is 0.4259.

Best Prediction Model

The accuracy values of the 3 prediction models are as follow:

Classification Tree = 0.3681
Generalized Boosted Models = 0.4213
Random Decision Forests = 0.5741

From this comparison, we concluded that the Random Decision Forests Model is the best prediction model for our analysis.

# Using our Random Decision Forests Model on Test Data
testData <- as.data.frame(lapply(testData, as.numeric))
Results <- predict(modRF1, newdata=testData, type = "raw")
testData[, "buying_price"] <- Results
write.csv(testData,'./car_result.csv', row.names = FALSE)

Our Random Forests Model was able to predict the buying price of the car for our test case (testData).

Essentially, for a car that has a high maintenance price, with 4 doors, a 4-seater capacity, a big luggage boot size, high estimated safety, and a good class value, its buying price was predicted to be low. This could be due to many various factors that could attribute to this correlation between buying price and the other predictive attributes used in this prediction model.

The generated output (Results) contains this information and it has been added back into the test data accordingly. You may refer to the new test data file (car_result.csv) for more information.