Introduction

Evaluating the condition of a car before purchase plays a crucial role in the buying decision. Manually separating cars in good or acceptable condition from unacceptable ones is time-consuming and labor-intensive, so we can leverage machine learning techniques to build an automatic car evaluation system. This project discusses the decision to buy a car or not based on its characteristics. The dataset is taken from Kaggle. As loaded here, it has 1727 rows and 7 columns: six attributes and the decision. Based on the six attributes, each car is classified as unacceptable, acceptable, good or very good. The variables of the dataset are as follows:

  • Buying Price (buying) : vhigh, high, med, low
  • Maintenance Cost (maint) : vhigh, high, med, low
  • Number of doors (doors) : 2, 3, 4, 5more
  • Number of persons (persons) : 2, 4, more
  • Luggage boot (lug_boot) : small, med, big
  • Safety (safety) : low, med, high
  • Decision (class) : unacceptable (unacc), acceptable (acc), good, very good (vgood)

Through this project we aim to :

  • Analyse the different parameters for the car evaluation
  • Plot the visualizations for better understanding of the dataset
  • Build the model using supervised learning methods like Decision Tree and Random Forest
  • Understand and compare the accuracy of both models

Packages Required

library(ggplot2)
library(gplots)
library(dplyr)
library(tidyverse)
library(reshape2)
library(rpart)
library(rpart.plot)
library(caret)
library(randomForest)

Importing the Dataset

car_data <- read.csv("/Users/sindhuherle/Documents/Data mining/car_evaluation.csv")
dim(car_data)
## [1] 1727    7
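
The column names printed by head() in the next section (vhigh, vhigh.1, X2, ...) suggest that the CSV has no header row, in which case read.csv() with its default header = TRUE uses the first record as column names. A minimal sketch of reading such a file with explicit names follows. The temporary file and its three sample rows are stand-ins made up for illustration, not the real dataset.

```r
# Stand-in for a header-less CSV such as car_evaluation.csv (three illustrative rows).
tmp <- tempfile(fileext = ".csv")
writeLines(c("vhigh,vhigh,2,2,small,low,unacc",
             "vhigh,vhigh,2,2,small,med,unacc",
             "vhigh,vhigh,2,2,small,high,unacc"), tmp)

col_names <- c("buying", "maint", "doors", "persons", "lug_boot", "safety", "class")

# Default header = TRUE: the first record is consumed as column names.
with_header <- read.csv(tmp)

# header = FALSE keeps every record; col.names supplies descriptive names up front.
no_header <- read.csv(tmp, header = FALSE, col.names = col_names,
                      colClasses = "character")

nrow(with_header)  # one row fewer than the file contains
nrow(no_header)    # every record kept
names(no_header)
```

Applied to the real file, this approach would be expected to load one more row than the 1727 reported here, so the figures in the rest of this document could shift slightly.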

Exploratory Data Analysis

Before analyzing, let us examine the dataset using the head() and str() functions.

head(car_data,10)
##    vhigh vhigh.1 X2 X2.1 small  low unacc
## 1  vhigh   vhigh  2    2 small  med unacc
## 2  vhigh   vhigh  2    2 small high unacc
## 3  vhigh   vhigh  2    2   med  low unacc
## 4  vhigh   vhigh  2    2   med  med unacc
## 5  vhigh   vhigh  2    2   med high unacc
## 6  vhigh   vhigh  2    2   big  low unacc
## 7  vhigh   vhigh  2    2   big  med unacc
## 8  vhigh   vhigh  2    2   big high unacc
## 9  vhigh   vhigh  2    4 small  low unacc
## 10 vhigh   vhigh  2    4 small  med unacc
str(car_data)
## 'data.frame':    1727 obs. of  7 variables:
##  $ vhigh  : chr  "vhigh" "vhigh" "vhigh" "vhigh" ...
##  $ vhigh.1: chr  "vhigh" "vhigh" "vhigh" "vhigh" ...
##  $ X2     : chr  "2" "2" "2" "2" ...
##  $ X2.1   : chr  "2" "2" "2" "2" ...
##  $ small  : chr  "small" "small" "med" "med" ...
##  $ low    : chr  "med" "high" "low" "med" ...
##  $ unacc  : chr  "unacc" "unacc" "unacc" "unacc" ...

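The column names are not descriptive: the file has no header row, so read.csv() used the first record as column names, which means one observation was lost from the data. We assign descriptive column names and also check whether there are any missing values in the dataset.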

colnames(car_data)=c("buying","maint","doors","persons","lug_boot","safety","class")
colSums(is.na(car_data))
##   buying    maint    doors  persons lug_boot   safety    class 
##        0        0        0        0        0        0        0

The dataset looks clean with no missing values. Basic insights of the data can be obtained by exploring the data through visualizations.
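
All seven columns are stored as character, although most of them have a natural order (for example low, med, high). One possible way to make that order explicit before plotting or tabulating is to convert the columns to factors with chosen levels. The sketch below uses a small made-up data frame; toy_cars and the level orders are assumptions for illustration and are not part of the original analysis.

```r
# Made-up rows that mimic the structure of car_data.
toy_cars <- data.frame(
  safety = c("low", "high", "med", "high", "low"),
  class  = c("unacc", "vgood", "acc", "good", "unacc"),
  stringsAsFactors = FALSE
)

# Explicit levels keep categories in a meaningful order in tables and plots
# (the default would be alphabetical: high, low, med).
toy_cars$safety <- factor(toy_cars$safety, levels = c("low", "med", "high"))
toy_cars$class  <- factor(toy_cars$class,  levels = c("unacc", "acc", "good", "vgood"))

# Class distribution as counts and as proportions.
class_counts <- table(toy_cars$class)
class_counts
round(prop.table(class_counts), 2)
```

With the real data the same factor() calls would be applied to the columns of car_data, and table(car_data$class) would show how unevenly the four classes are represented.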

Bar charts

Let us examine how the cars are classified as good, acceptable or unacceptable based on different car parameters using bar charts.

ggplot(car_data, aes(x = class, fill = lug_boot)) +
  geom_bar() +
  labs(title = "Class vs Luggage boot", subtitle = "Bar chart", y = "Number of cars", x = "Class")

ggplot(car_data, aes(class , fill = safety )) +
  geom_bar(position = position_dodge()) + 
  ggtitle("Car class vs Safety") +
  xlab("Class") + 
  ylab("Number of cars")

ggplot(car_data, aes(class , fill = buying )) +
  geom_bar(position = position_dodge()) + 
  ggtitle("Car class vs Buying Price") +
  xlab("Class") + 
  ylab("Number of cars")
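
The counts behind bar charts like these can also be read off a cross-tabulation, and row-wise proportions make groups of very different sizes comparable. A small sketch on made-up data follows; toy is hypothetical, and with the real data the same calls would use car_data$safety and car_data$class.

```r
toy <- data.frame(
  safety = c("low", "low", "low", "med", "med", "high", "high", "high"),
  class  = c("unacc", "unacc", "unacc", "unacc", "acc", "acc", "good", "unacc")
)

# Counts: rows are safety levels, columns are classes.
counts <- table(toy$safety, toy$class)
counts

# Share of each class within every safety level (each row sums to 1).
shares <- prop.table(counts, margin = 1)
round(shares, 2)
```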

Density Plots

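A density plot visualizes the distribution of data over a continuous interval or time period. Let us see what the density plots look like for different parameters. Since every variable in this dataset is categorical, these curves are best read as a rough visual summary rather than as true densities.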

ggplot(data = car_data,aes(fill=as.factor(doors),x=persons))+geom_density(alpha=0.3)

ggplot(data = car_data,aes(fill=as.factor(maint),x=class))+geom_density(alpha=0.3)+facet_wrap(~class)

Decision Tree

Decision trees produce classification models in the form of a tree. This form makes the decision hierarchy and the relations between the attributes easy to follow, since each possible outcome of an attribute becomes a branch of the tree. Let's start by splitting the dataset into training and test sets: 70% of the data is used for training and 30% for testing.

set.seed(100)
classValues<-as.vector(car_data$class)
train_test_split <- createDataPartition(y=classValues, p=0.7,list =FALSE)
train_data <-car_data[train_test_split,]
test_data <- car_data[-train_test_split,]
summary(train_data)
##     buying             maint              doors             persons         
##  Length:1211        Length:1211        Length:1211        Length:1211       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##    lug_boot            safety             class          
##  Length:1211        Length:1211        Length:1211       
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character
summary(test_data)
##     buying             maint              doors             persons         
##  Length:516         Length:516         Length:516         Length:516        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##    lug_boot            safety             class          
##  Length:516         Length:516         Length:516        
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character
train_control <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
set.seed(3333)
decision_tree <- train(class ~., data = train_data, method = "rpart", parms = list(split = "information"), trControl = train_control, tuneLength = 10)
decision_tree
## CART 
## 
## 1211 samples
##    6 predictor
##    4 classes: 'acc', 'good', 'unacc', 'vgood' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 1090, 1092, 1089, 1089, 1089, 1090, ... 
## Resampling results across tuning parameters:
## 
##   cp           Accuracy   Kappa    
##   0.002747253  0.8353544  0.6469669
##   0.005494505  0.8411466  0.6622520
##   0.008241758  0.8356138  0.6422486
##   0.009615385  0.8215842  0.6060176
##   0.010989011  0.8072517  0.5771825
##   0.013736264  0.8089069  0.5825292
##   0.016483516  0.8075408  0.5780949
##   0.019230769  0.8061656  0.5742093
##   0.063186813  0.7890372  0.5559210
##   0.075549451  0.7106902  0.1995604
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.005494505.
# plotting decision tree
prp(decision_tree$finalModel, type=3, main= "Probabilities per class")
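
With split = "information", rpart chooses splits using an entropy-based criterion instead of its default Gini index. The idea can be sketched in a few lines of base R. The helpers entropy() and info_gain() are illustrative, not rpart functions, and the toy vectors are made up; rpart itself only evaluates binary splits, so this shows the principle rather than its exact procedure.

```r
# Shannon entropy (in bits) of a vector of class labels.
entropy <- function(labels) {
  p <- table(labels) / length(labels)
  p <- p[p > 0]
  -sum(p * log2(p))
}

# Information gain of splitting `labels` by the groups in `feature`:
# parent entropy minus the size-weighted entropy of the child nodes.
info_gain <- function(labels, feature) {
  children <- split(labels, feature)
  weighted <- sum(vapply(children,
                         function(ch) length(ch) / length(labels) * entropy(ch),
                         numeric(1)))
  entropy(labels) - weighted
}

cls    <- c("unacc", "unacc", "unacc", "unacc", "acc", "acc", "acc", "acc")
safety <- c("low", "low", "low", "low", "high", "high", "high", "high")
doors  <- c("2", "4", "2", "4", "2", "4", "2", "4")

info_gain(cls, safety)  # safety separates the two classes perfectly
info_gain(cls, doors)   # doors carries no information about the class
```

In this toy example the split on safety removes all uncertainty about the class, while the split on doors removes none, so an information-based tree would split on safety first.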

The classification is done, but it is not yet clear how accurate the model is. The predictions for the train and test sets are compared with the true classes of each set, giving an accuracy of 87.4% on the train set and 87.6% on the test set. The test-set accuracy is the basis for evaluating how well the model performs on data it has not seen before.

# prediction of train data
train_pred <- predict(decision_tree, train_data)
head(train_pred)
## [1] unacc unacc unacc unacc unacc unacc
## Levels: acc good unacc vgood
table(train_pred, train_data$class)
##           
## train_pred acc good unacc vgood
##      acc   227   18    75     4
##      good   13   24     4     3
##      unacc  18    0   768     0
##      vgood  11    7     0    39
mean(train_pred  == train_data$class)
## [1] 0.8736581
# prediction of test data
test_pred <- predict(decision_tree, test_data)
head(test_pred)
## [1] unacc unacc unacc unacc unacc unacc
## Levels: acc good unacc vgood
table(test_pred, test_data$class)
##          
## test_pred acc good unacc vgood
##     acc    99   10    25     2
##     good    3    2     0     3
##     unacc   8    0   337     0
##     vgood   5    8     0    14
mean(test_pred  == test_data$class)
## [1] 0.875969
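
The accuracy above is simply the share of cars on the diagonal of the table. The same table also yields the majority-class baseline and Cohen's kappa, which confusionMatrix() reports as No Information Rate and Kappa. The sketch below re-enters the printed test-set table by hand; the object names cm and cohen_kappa are assumptions made for this example.

```r
classes <- c("acc", "good", "unacc", "vgood")

# Rows are predictions, columns are true classes, as in table(test_pred, test_data$class).
cm <- matrix(c(99, 10,  25,  2,
                3,  2,   0,  3,
                8,  0, 337,  0,
                5,  8,   0, 14),
             nrow = 4, byrow = TRUE,
             dimnames = list(pred = classes, truth = classes))

n        <- sum(cm)
accuracy <- sum(diag(cm)) / n

# Baseline: always predict the most frequent true class.
baseline <- max(colSums(cm)) / n

# Cohen's kappa: agreement beyond what the row and column totals give by chance.
expected    <- sum(rowSums(cm) * colSums(cm)) / n^2
cohen_kappa <- (accuracy - expected) / (1 - expected)

round(c(accuracy = accuracy, baseline = baseline, kappa = cohen_kappa), 4)
```

Comparing accuracy with the baseline matters here because about 70% of the test cars are unacceptable, so a model that always answers unacc would already look fairly accurate.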

Further statistics are calculated by building a confusion matrix, to understand how well the model performs.

confusionMatrix(test_pred, as.factor(test_data$class))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction acc good unacc vgood
##      acc    99   10    25     2
##      good    3    2     0     3
##      unacc   8    0   337     0
##      vgood   5    8     0    14
## 
## Overall Statistics
##                                           
##                Accuracy : 0.876           
##                  95% CI : (0.8444, 0.9032)
##     No Information Rate : 0.7016          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.7359          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: acc Class: good Class: unacc Class: vgood
## Sensitivity              0.8609    0.100000       0.9309      0.73684
## Specificity              0.9077    0.987903       0.9481      0.97384
## Pos Pred Value           0.7279    0.250000       0.9768      0.51852
## Neg Pred Value           0.9579    0.964567       0.8538      0.98978
## Prevalence               0.2229    0.038760       0.7016      0.03682
## Detection Rate           0.1919    0.003876       0.6531      0.02713
## Detection Prevalence     0.2636    0.015504       0.6686      0.05233
## Balanced Accuracy        0.8843    0.543952       0.9395      0.85534
confusionMatrix(test_pred, as.factor(test_data$class), mode = "prec_recall")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction acc good unacc vgood
##      acc    99   10    25     2
##      good    3    2     0     3
##      unacc   8    0   337     0
##      vgood   5    8     0    14
## 
## Overall Statistics
##                                           
##                Accuracy : 0.876           
##                  95% CI : (0.8444, 0.9032)
##     No Information Rate : 0.7016          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.7359          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: acc Class: good Class: unacc Class: vgood
## Precision                0.7279    0.250000       0.9768      0.51852
## Recall                   0.8609    0.100000       0.9309      0.73684
## F1                       0.7888    0.142857       0.9533      0.60870
## Prevalence               0.2229    0.038760       0.7016      0.03682
## Detection Rate           0.1919    0.003876       0.6531      0.02713
## Detection Prevalence     0.2636    0.015504       0.6686      0.05233
## Balanced Accuracy        0.8843    0.543952       0.9395      0.85534
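
The per-class figures can be reproduced from the confusion matrix itself: precision divides each diagonal entry by its row total (all cars predicted as that class), and recall divides it by its column total (all cars truly in that class). A sketch follows, re-entering the printed test-set table by hand; cm is an assumed object name.

```r
classes <- c("acc", "good", "unacc", "vgood")

# Rows are predictions, columns are true classes (the Reference).
cm <- matrix(c(99, 10,  25,  2,
                3,  2,   0,  3,
                8,  0, 337,  0,
                5,  8,   0, 14),
             nrow = 4, byrow = TRUE,
             dimnames = list(pred = classes, truth = classes))

tp        <- diag(cm)                 # correctly classified cars per class
precision <- tp / rowSums(cm)         # of the cars predicted as a class, how many are right
recall    <- tp / colSums(cm)         # of the cars truly in a class, how many are found
f1        <- 2 * precision * recall / (precision + recall)

round(rbind(precision, recall, f1), 4)
```

For the good class only 2 of the 20 true cases are recovered, which explains its recall of 0.10 despite the high overall accuracy.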

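We see that the accuracy of the model is 87.6%. The per-class figures show that the model struggles with the good class, where recall is only 0.10.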

Random Forest

A random forest (or random decision forest) is an ensemble learning method for classification, regression and other tasks that works by constructing a multitude of decision trees at training time. To improve on the accuracy of the single decision tree, we next build a model using the random forest method.

random_forest <- randomForest(as.factor(class)~., data = train_data, importance = TRUE)
random_forest
## 
## Call:
##  randomForest(formula = as.factor(class) ~ ., data = train_data,      importance = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 3.96%
## Confusion matrix:
##       acc good unacc vgood class.error
## acc   252    1    15     1 0.063197026
## good   21   28     0     0 0.428571429
## unacc   7    0   840     0 0.008264463
## vgood   3    0     0    43 0.065217391
# tuning the model: try 3 variables at each split (mtry = 3) instead of the default 2
random_forest_1 <- randomForest(as.factor(class)~., data = train_data, ntree = 500, mtry = 3, importance = TRUE)
random_forest_1
## 
## Call:
##  randomForest(formula = as.factor(class) ~ ., data = train_data,      ntree = 500, mtry = 3, importance = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 2.48%
## Confusion matrix:
##       acc good unacc vgood class.error
## acc   257    4     7     1  0.04460967
## good    7   41     0     1  0.16326531
## unacc   9    0   838     0  0.01062574
## vgood   1    0     0    45  0.02173913
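
The OOB (out-of-bag) error printed above is estimated from rows that a given tree never saw: each tree is grown on a bootstrap sample, which leaves out roughly 37% of the rows. That share can be checked with a short simulation in base R. The sample size mirrors the 1211 training rows; the seed and the number of repetitions are arbitrary choices.

```r
set.seed(1)
n_rows <- 1211   # size of the training set used in this document
n_reps <- 200    # number of simulated bootstrap samples (arbitrary)

# For each simulated tree, draw a bootstrap sample and measure the share of rows left out.
oob_share <- replicate(n_reps, {
  in_bag <- sample(n_rows, size = n_rows, replace = TRUE)
  1 - length(unique(in_bag)) / n_rows
})

mean(oob_share)          # simulated out-of-bag share
(1 - 1 / n_rows)^n_rows  # theoretical value, close to exp(-1)
```

Because every row is out of bag for a sizeable fraction of the trees, the OOB error behaves like a built-in validation estimate and is a reasonable quantity to compare when trying different mtry values.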

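We again compare the predictions for the train and test sets with the true classes of each set. Note that accuracy on the train set is optimistic for a random forest, since every tree has seen most of those rows; the OOB error and the test set give a fairer picture.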

#prediction on train data set
train_pred1 <-predict(random_forest_1, train_data, type = "class")
table(train_pred1, train_data$class)
##            
## train_pred1 acc good unacc vgood
##       acc   269    0     0     0
##       good    0   49     0     0
##       unacc   0    0   847     0
##       vgood   0    0     0    46
mean(train_pred1 == train_data$class)
## [1] 1
#prediction on test data set
test_pred1 <-predict(random_forest_1, test_data, type = "class")
table(test_pred1, test_data$class)
##           
## test_pred1 acc good unacc vgood
##      acc   111    2     1     1
##      good    0   18     0     3
##      unacc   4    0   361     0
##      vgood   0    0     0    15
mean(test_pred1==test_data$class)
## [1] 0.9786822

Further statistics are calculated by building a confusion matrix, to understand how well the model performs.

confusionMatrix(test_pred1, as.factor(test_data$class))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction acc good unacc vgood
##      acc   111    2     1     1
##      good    0   18     0     3
##      unacc   4    0   361     0
##      vgood   0    0     0    15
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9787          
##                  95% CI : (0.9622, 0.9893)
##     No Information Rate : 0.7016          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9528          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: acc Class: good Class: unacc Class: vgood
## Sensitivity              0.9652     0.90000       0.9972      0.78947
## Specificity              0.9900     0.99395       0.9740      1.00000
## Pos Pred Value           0.9652     0.85714       0.9890      1.00000
## Neg Pred Value           0.9900     0.99596       0.9934      0.99202
## Prevalence               0.2229     0.03876       0.7016      0.03682
## Detection Rate           0.2151     0.03488       0.6996      0.02907
## Detection Prevalence     0.2229     0.04070       0.7074      0.02907
## Balanced Accuracy        0.9776     0.94698       0.9856      0.89474
confusionMatrix(test_pred1, as.factor(test_data$class), mode = "prec_recall")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction acc good unacc vgood
##      acc   111    2     1     1
##      good    0   18     0     3
##      unacc   4    0   361     0
##      vgood   0    0     0    15
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9787          
##                  95% CI : (0.9622, 0.9893)
##     No Information Rate : 0.7016          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9528          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: acc Class: good Class: unacc Class: vgood
## Precision                0.9652     0.85714       0.9890      1.00000
## Recall                   0.9652     0.90000       0.9972      0.78947
## F1                       0.9652     0.87805       0.9931      0.88235
## Prevalence               0.2229     0.03876       0.7016      0.03682
## Detection Rate           0.2151     0.03488       0.6996      0.02907
## Detection Prevalence     0.2229     0.04070       0.7074      0.02907
## Balanced Accuracy        0.9776     0.94698       0.9856      0.89474

We see that the accuracy of the model has improved to 97.9%.
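
Both forests were fitted with importance = TRUE, so the fitted object can also be asked which attributes contribute most to the predictions. The sketch below uses synthetic data: toy, its labelling rule and the seed are assumptions made for illustration only. With the real model, the corresponding calls would be importance(random_forest_1) and varImpPlot(random_forest_1).

```r
library(randomForest)

set.seed(42)
n <- 600
toy <- data.frame(
  safety  = factor(sample(c("low", "med", "high"), n, replace = TRUE)),
  persons = factor(sample(c("2", "4", "more"), n, replace = TRUE)),
  doors   = factor(sample(c("2", "3", "4", "5more"), n, replace = TRUE))
)

# Hypothetical labelling rule: unsafe or two-seat cars are unacceptable; doors is pure noise.
toy$class <- factor(ifelse(toy$safety == "low" | toy$persons == "2", "unacc", "acc"))

rf_toy <- randomForest(class ~ ., data = toy, ntree = 200, importance = TRUE)

# One row per predictor; MeanDecreaseAccuracy and MeanDecreaseGini summarise importance.
imp <- importance(rf_toy)
imp[order(imp[, "MeanDecreaseGini"], decreasing = TRUE), ]
```

In this synthetic setting the noise variable doors should rank last. On the real data, such a ranking would be one way to check which attributes the forest relies on most.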

Conclusion

The cars in this dataset fall into four classes (very good, good, acceptable and unacceptable) based on six attributes: buying price, maintenance cost, number of doors, passenger capacity, size of the luggage boot and estimated safety. The decision tree model reached an accuracy of 87.6% on the test set. To improve on this we built a random forest model, which raised the test accuracy to 97.9%.

According to the results, safety is the key attribute for car buyers: if a customer thinks a car is not safe, they do not buy it. Passenger capacity comes next: a car that seats only 2 people is rejected. Otherwise, maintenance cost is considered. If the maintenance cost is low, the buying price comes into the evaluation, and if that is acceptably low as well, luggage capacity is the final consideration in evaluating the car.