Objective

To develop a suitable machine learning model for the assigned dataset and compare the results of at least two machine learning algorithms.

Literature Survey of Given Dataset

Dataset: https://archive.ics.uci.edu/ml/datasets/Automobile

The dataset describes automobiles in three ways:

The first attribute, “symboling”, corresponds to a car’s insurance risk level. Cars are initially assigned a risk factor symbol that corresponds to their price. If an automobile is more (or less) risky than its price suggests, this symbol is adjusted by moving it up (or down) the scale. A value of +3 indicates that the vehicle is risky, while -3 indicates that it is likely safe to insure.

The second attribute, “normalized-losses”, is the relative average loss payment per insured vehicle year. This figure is normalised across all vehicles within a given size category (two-door, small, station wagons, sports/specialty, etc.).

All the other attributes are self-explanatory and give the price and technical specifications of the vehicles, such as size, weight, horsepower and engine type. Many of these attributes could serve as a prediction target.

In this case, we will try to predict the symboling level of an automobile.

Attribute Information:

     Attribute:                Attribute Range:
     ------------------        -----------------------------------------------
  1. symboling:                -3, -2, -1, 0, 1, 2, 3.
  2. normalized-losses:        continuous from 65 to 256.
  3. make:                     alfa-romero, audi, bmw, chevrolet, dodge, honda,
                               isuzu, jaguar, mazda, mercedes-benz, mercury,
                               mitsubishi, nissan, peugot, plymouth, porsche,
                               renault, saab, subaru, toyota, volkswagen, volvo
  4. fuel-type:                diesel, gas.
  5. aspiration:               std, turbo.
  6. num-of-doors:             four, two.
  7. body-style:               hardtop, wagon, sedan, hatchback, convertible.
  8. drive-wheels:             4wd, fwd, rwd.
  9. engine-location:          front, rear.
 10. wheel-base:               continuous from 86.6 to 120.9.
 11. length:                   continuous from 141.1 to 208.1.
 12. width:                    continuous from 60.3 to 72.3.
 13. height:                   continuous from 47.8 to 59.8.
 14. curb-weight:              continuous from 1488 to 4066.
 15. engine-type:              dohc, dohcv, l, ohc, ohcf, ohcv, rotor.
 16. num-of-cylinders:         eight, five, four, six, three, twelve, two.
 17. engine-size:              continuous from 61 to 326.
 18. fuel-system:              1bbl, 2bbl, 4bbl, idi, mfi, mpfi, spdi, spfi.
 19. bore:                     continuous from 2.54 to 3.94.
 20. stroke:                   continuous from 2.07 to 4.17.
 21. compression-ratio:        continuous from 7 to 23.
 22. horsepower:               continuous from 48 to 288.
 23. peak-rpm:                 continuous from 4150 to 6600.
 24. city-mpg:                 continuous from 13 to 49.
 25. highway-mpg:              continuous from 16 to 54.
 26. price:                    continuous from 5118 to 45400.

Processing the dataset

Loading the dataset

library(dplyr)
data = read.csv('imports-85.data', sep=',', 
                header=F,
                col.names=c('symboling', 'normalized.losses','make',
                         'fuel.type','aspiration','num.of.doors',
                         'body.style','drive.wheels','engine.location',
                         'wheel.base','length','width','height','curb.weight',
                         'engine.type','num.of.cylinders','engine.size',
                         'fuel.system','bore','stroke','compression.ratio',
                         'horsepower','peak.rpm','city.mpg','highway.mpg',
                         'price'))

Taking a glimpse at the data

glimpse(data)
## Rows: 205
## Columns: 26
## $ symboling         <int> 3, 3, 1, 2, 2, 2, 1, 1, 1, 0, 2, 0, 0, 0, 1, 0, 0, 0~
## $ normalized.losses <chr> "?", "?", "?", "164", "164", "?", "158", "?", "158",~
## $ make              <chr> "alfa-romero", "alfa-romero", "alfa-romero", "audi",~
## $ fuel.type         <chr> "gas", "gas", "gas", "gas", "gas", "gas", "gas", "ga~
## $ aspiration        <chr> "std", "std", "std", "std", "std", "std", "std", "st~
## $ num.of.doors      <chr> "two", "two", "two", "four", "four", "two", "four", ~
## $ body.style        <chr> "convertible", "convertible", "hatchback", "sedan", ~
## $ drive.wheels      <chr> "rwd", "rwd", "rwd", "fwd", "4wd", "fwd", "fwd", "fw~
## $ engine.location   <chr> "front", "front", "front", "front", "front", "front"~
## $ wheel.base        <dbl> 88.6, 88.6, 94.5, 99.8, 99.4, 99.8, 105.8, 105.8, 10~
## $ length            <dbl> 168.8, 168.8, 171.2, 176.6, 176.6, 177.3, 192.7, 192~
## $ width             <dbl> 64.1, 64.1, 65.5, 66.2, 66.4, 66.3, 71.4, 71.4, 71.4~
## $ height            <dbl> 48.8, 48.8, 52.4, 54.3, 54.3, 53.1, 55.7, 55.7, 55.9~
## $ curb.weight       <int> 2548, 2548, 2823, 2337, 2824, 2507, 2844, 2954, 3086~
## $ engine.type       <chr> "dohc", "dohc", "ohcv", "ohc", "ohc", "ohc", "ohc", ~
## $ num.of.cylinders  <chr> "four", "four", "six", "four", "five", "five", "five~
## $ engine.size       <int> 130, 130, 152, 109, 136, 136, 136, 136, 131, 131, 10~
## $ fuel.system       <chr> "mpfi", "mpfi", "mpfi", "mpfi", "mpfi", "mpfi", "mpf~
## $ bore              <chr> "3.47", "3.47", "2.68", "3.19", "3.19", "3.19", "3.1~
## $ stroke            <chr> "2.68", "2.68", "3.47", "3.40", "3.40", "3.40", "3.4~
## $ compression.ratio <dbl> 9.00, 9.00, 9.00, 10.00, 8.00, 8.50, 8.50, 8.50, 8.3~
## $ horsepower        <chr> "111", "111", "154", "102", "115", "110", "110", "11~
## $ peak.rpm          <chr> "5000", "5000", "5000", "5500", "5500", "5500", "550~
## $ city.mpg          <int> 21, 21, 19, 24, 18, 19, 19, 19, 17, 16, 23, 23, 21, ~
## $ highway.mpg       <int> 27, 27, 26, 30, 22, 25, 25, 25, 20, 22, 29, 29, 28, ~
## $ price             <chr> "13495", "16500", "16500", "13950", "17450", "15250"~

Cleaning the data

We can see that many attributes do not have the correct datatype. The data also uses ? to mark missing values, so these need to be replaced with NA.

data[data == '?'] <- NA
NAsByFeature<-apply(data,2,function(x){length(which(is.na(x)))})
NAsByFeature
##         symboling normalized.losses              make         fuel.type 
##                 0                41                 0                 0 
##        aspiration      num.of.doors        body.style      drive.wheels 
##                 0                 2                 0                 0 
##   engine.location        wheel.base            length             width 
##                 0                 0                 0                 0 
##            height       curb.weight       engine.type  num.of.cylinders 
##                 0                 0                 0                 0 
##       engine.size       fuel.system              bore            stroke 
##                 0                 0                 4                 4 
## compression.ratio        horsepower          peak.rpm          city.mpg 
##                 0                 2                 2                 0 
##       highway.mpg             price 
##                 0                 4
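As an alternative, read.csv can mark ? as missing while the file is being read, so that numeric columns are parsed with the right type from the start. The sketch below illustrates the idea on a small inline sample; the five columns and their values are made up for illustration and only stand in for 'imports-85.data'.

```r
# Hypothetical three-row sample standing in for 'imports-85.data'
sample_txt <- "3,?,alfa-romero,111,13495
1,164,audi,102,?
2,158,audi,?,17450"

demo <- read.csv(text = sample_txt, header = FALSE,
                 col.names = c("symboling", "normalized.losses",
                               "make", "horsepower", "price"),
                 na.strings = "?")   # treat "?" as NA on input

str(demo)             # horsepower and price are parsed as integers
colSums(is.na(demo))  # number of missing values per column
```

With this approach the separate replacement step and most of the later type conversions for numeric columns would not be needed.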

The normalized-losses attribute is missing in 41 of the 205 rows (20%), so we exclude this feature from the analysis.

data = data[-2]

Omit any remaining rows with NA values. This drops 12 of the 205 rows, leaving 193.

data = data %>% na.omit()
NAsByFeature<-apply(data,2,function(x){length(which(is.na(x)))})
NAsByFeature
##         symboling              make         fuel.type        aspiration 
##                 0                 0                 0                 0 
##      num.of.doors        body.style      drive.wheels   engine.location 
##                 0                 0                 0                 0 
##        wheel.base            length             width            height 
##                 0                 0                 0                 0 
##       curb.weight       engine.type  num.of.cylinders       engine.size 
##                 0                 0                 0                 0 
##       fuel.system              bore            stroke compression.ratio 
##                 0                 0                 0                 0 
##        horsepower          peak.rpm          city.mpg       highway.mpg 
##                 0                 0                 0                 0 
##             price 
##                 0

Correcting datatypes for features.

factorCols = c('symboling','make',
               'fuel.type','aspiration','num.of.doors',
               'body.style','drive.wheels','engine.location',
               'engine.type','num.of.cylinders',
               'fuel.system')
intCols =c('horsepower','peak.rpm','city.mpg','highway.mpg',
           'price','curb.weight','engine.size')
numCols = c('bore','stroke','compression.ratio','wheel.base','length','width','height')
data = data %>% mutate_at(factorCols, factor) %>% 
  mutate_at(intCols, as.integer) %>% mutate_at(numCols, as.numeric)
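The conversions above work because horsepower, peak.rpm and price are still character columns when as.integer is applied. A common pitfall, shown in the sketch below with made-up values, is applying as.integer to a column that has already become a factor: the result is then the internal level codes rather than the original numbers.

```r
x_chr <- c("111", "154", "102")

# Character to integer: the values are preserved
from_chr <- as.integer(x_chr)

# Factor to integer: returns the level codes, not the values
x_fct <- factor(x_chr)
from_fct_codes <- as.integer(x_fct)

# Safe route for a factor: go through character first
from_fct_safe <- as.integer(as.character(x_fct))

from_chr
from_fct_codes
from_fct_safe
```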

After cleaning, the data looks as follows.

glimpse(data)
## Rows: 193
## Columns: 25
## $ symboling         <fct> 3, 3, 1, 2, 2, 2, 1, 1, 1, 2, 0, 0, 0, 1, 0, 0, 0, 2~
## $ make              <fct> alfa-romero, alfa-romero, alfa-romero, audi, audi, a~
## $ fuel.type         <fct> gas, gas, gas, gas, gas, gas, gas, gas, gas, gas, ga~
## $ aspiration        <fct> std, std, std, std, std, std, std, std, turbo, std, ~
## $ num.of.doors      <fct> two, two, two, four, four, two, four, four, four, tw~
## $ body.style        <fct> convertible, convertible, hatchback, sedan, sedan, s~
## $ drive.wheels      <fct> rwd, rwd, rwd, fwd, 4wd, fwd, fwd, fwd, fwd, rwd, rw~
## $ engine.location   <fct> front, front, front, front, front, front, front, fro~
## $ wheel.base        <dbl> 88.6, 88.6, 94.5, 99.8, 99.4, 99.8, 105.8, 105.8, 10~
## $ length            <dbl> 168.8, 168.8, 171.2, 176.6, 176.6, 177.3, 192.7, 192~
## $ width             <dbl> 64.1, 64.1, 65.5, 66.2, 66.4, 66.3, 71.4, 71.4, 71.4~
## $ height            <dbl> 48.8, 48.8, 52.4, 54.3, 54.3, 53.1, 55.7, 55.7, 55.9~
## $ curb.weight       <int> 2548, 2548, 2823, 2337, 2824, 2507, 2844, 2954, 3086~
## $ engine.type       <fct> dohc, dohc, ohcv, ohc, ohc, ohc, ohc, ohc, ohc, ohc,~
## $ num.of.cylinders  <fct> four, four, six, four, five, five, five, five, five,~
## $ engine.size       <int> 130, 130, 152, 109, 136, 136, 136, 136, 131, 108, 10~
## $ fuel.system       <fct> mpfi, mpfi, mpfi, mpfi, mpfi, mpfi, mpfi, mpfi, mpfi~
## $ bore              <dbl> 3.47, 3.47, 2.68, 3.19, 3.19, 3.19, 3.19, 3.19, 3.13~
## $ stroke            <dbl> 2.68, 2.68, 3.47, 3.40, 3.40, 3.40, 3.40, 3.40, 3.40~
## $ compression.ratio <dbl> 9.00, 9.00, 9.00, 10.00, 8.00, 8.50, 8.50, 8.50, 8.3~
## $ horsepower        <int> 111, 111, 154, 102, 115, 110, 110, 110, 140, 101, 10~
## $ peak.rpm          <int> 5000, 5000, 5000, 5500, 5500, 5500, 5500, 5500, 5500~
## $ city.mpg          <int> 21, 21, 19, 24, 18, 19, 19, 19, 17, 23, 23, 21, 21, ~
## $ highway.mpg       <int> 27, 27, 26, 30, 22, 25, 25, 25, 20, 29, 29, 28, 28, ~
## $ price             <int> 13495, 16500, 16500, 13950, 17450, 15250, 17710, 189~
summary(data)
##  symboling         make     fuel.type   aspiration  num.of.doors
##  -2: 3     toyota    :32   diesel: 19   std  :158   four:112    
##  -1:22     nissan    :18   gas   :174   turbo: 35   two : 81    
##  0 :63     honda     :13                                        
##  1 :51     mitsubishi:13                                        
##  2 :31     mazda     :12                                        
##  3 :23     subaru    :12                                        
##            (Other)   :93                                        
##        body.style drive.wheels engine.location   wheel.base         length     
##  convertible: 6   4wd:  8      front:190       Min.   : 86.60   Min.   :141.1  
##  hardtop    : 8   fwd:114      rear :  3       1st Qu.: 94.50   1st Qu.:166.3  
##  hatchback  :63   rwd: 71                      Median : 97.00   Median :173.2  
##  sedan      :92                                Mean   : 98.92   Mean   :174.3  
##  wagon      :24                                3rd Qu.:102.40   3rd Qu.:184.6  
##                                                Max.   :120.90   Max.   :208.1  
##                                                                                
##      width           height       curb.weight   engine.type num.of.cylinders
##  Min.   :60.30   Min.   :47.80   Min.   :1488   dohc: 12    eight :  4      
##  1st Qu.:64.10   1st Qu.:52.00   1st Qu.:2145   l   : 12    five  : 10      
##  Median :65.40   Median :54.10   Median :2414   ohc :141    four  :153      
##  Mean   :65.89   Mean   :53.87   Mean   :2562   ohcf: 15    six   : 24      
##  3rd Qu.:66.90   3rd Qu.:55.70   3rd Qu.:2952   ohcv: 13    three :  1      
##  Max.   :72.00   Max.   :59.80   Max.   :4066               twelve:  1      
##                                                                             
##   engine.size    fuel.system      bore           stroke      compression.ratio
##  Min.   : 61.0   1bbl:11     Min.   :2.540   Min.   :2.070   Min.   : 7.00    
##  1st Qu.: 98.0   2bbl:64     1st Qu.:3.150   1st Qu.:3.110   1st Qu.: 8.50    
##  Median :120.0   idi :19     Median :3.310   Median :3.290   Median : 9.00    
##  Mean   :128.1   mfi : 1     Mean   :3.331   Mean   :3.249   Mean   :10.14    
##  3rd Qu.:146.0   mpfi:88     3rd Qu.:3.590   3rd Qu.:3.410   3rd Qu.: 9.40    
##  Max.   :326.0   spdi: 9     Max.   :3.940   Max.   :4.170   Max.   :23.00    
##                  spfi: 1                                                      
##    horsepower       peak.rpm       city.mpg      highway.mpg        price      
##  Min.   : 48.0   Min.   :4150   Min.   :13.00   Min.   :16.00   Min.   : 5118  
##  1st Qu.: 70.0   1st Qu.:4800   1st Qu.:19.00   1st Qu.:25.00   1st Qu.: 7738  
##  Median : 95.0   Median :5100   Median :25.00   Median :30.00   Median :10245  
##  Mean   :103.5   Mean   :5100   Mean   :25.33   Mean   :30.79   Mean   :13285  
##  3rd Qu.:116.0   3rd Qu.:5500   3rd Qu.:30.00   3rd Qu.:34.00   3rd Qu.:16515  
##  Max.   :262.0   Max.   :6600   Max.   :49.00   Max.   :54.00   Max.   :45400  
## 

Visualizing the data

Importing ggplot2 to visualize the data

library(ggplot2)
#Density Plot
filter_cyl <- data %>% filter(num.of.cylinders %in% c('four', 'five' ,'six'))
ggplot(data = filter_cyl,
       aes(x=city.mpg, fill=num.of.cylinders)) + geom_density(alpha=0.4)+
  labs(title="Density Plot     20MID0006")

#Histogram
ggplot(data = data,aes(x=horsepower, fill=fuel.type)) + geom_histogram(bins = 30) + 
  facet_wrap(~fuel.type)+labs(title="Histogram     20MID0006")

#Box Plot
data = data %>% mutate(volume = length*width*height/1000)
ggplot(data = data,aes(x=body.style, y=volume)) + geom_boxplot()+
  labs(title="Box Plot     20MID0006")

#Scatter Plot
ggplot(data = data,aes(x=curb.weight, y=city.mpg)) + 
  geom_point()+labs(title="Scatter Plot     20MID0006")

#Bar Plot
ggplot(data = data,aes(x=make, y=city.mpg)) + geom_bar(stat = "identity") +
  coord_flip()+labs(title="Bar Plot     20MID0006")
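With stat = "identity", the rows for each make are stacked, so the length of a bar is the sum of city.mpg over all cars of that make and partly reflects how many cars each make has in the data. If the average mileage per make is of interest, it can be computed before plotting. A minimal sketch with a made-up data frame (the name cars and its values are hypothetical):

```r
# Hypothetical toy data: two makes with different numbers of cars
cars <- data.frame(
  make     = c("audi", "audi", "bmw", "bmw", "bmw"),
  city.mpg = c(24, 18, 23, 21, 16)
)

# What a stacked identity bar shows: the total per make
mpg_sum <- aggregate(city.mpg ~ make, data = cars, FUN = sum)

# The average per make, which is independent of the number of cars
mpg_mean <- aggregate(city.mpg ~ make, data = cars, FUN = mean)

mpg_sum
mpg_mean
```

In this toy example the make with more cars has the larger total but the smaller mean, which is the kind of difference the summed bars can hide.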

#Bubble Plot
ggplot(data = data, aes(x=peak.rpm, y=horsepower, size=city.mpg, 
                        color=symboling)) + geom_point(alpha=0.4)+
  labs(title="Bubble Plot     20MID0006")

#Stacked Area Chart
ggplot(data = data, aes(x=curb.weight, y=city.mpg, fill=drive.wheels)) + 
  geom_area(alpha = 0.4,size=0.5, colour="black")+
  labs(title="Stacked Area Chart     20MID0006")

#Correlation Plot
library(corrplot)
corrplot(cor(data[c(intCols,numCols)]), method="square")

Modelling the Data

K-Nearest Neighbour Classifier

The k-nearest neighbors algorithm, also known as KNN or k-NN, is a non-parametric, supervised learning classifier that uses proximity to classify an individual data point: the point is assigned the class that is most common among its k closest training points.
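The idea can be sketched in a few lines of base R. This is only an illustration on made-up points, not the implementation used by the class package; the function name knn_predict and the toy data are hypothetical, and ties are resolved here by simply taking the first class with the highest vote count.

```r
# Classify one query point by majority vote among its k nearest neighbours
knn_predict <- function(train_x, train_y, query, k = 3) {
  # Euclidean distance from the query to every training row
  d <- sqrt(rowSums(sweep(train_x, 2, query)^2))
  # Indices of the k closest training rows
  nearest <- order(d)[seq_len(k)]
  # Count the class labels among those neighbours
  votes <- table(train_y[nearest])
  names(votes)[which.max(votes)]
}

# Two well-separated toy clusters
train_x <- matrix(c(0, 0,
                    0, 1,
                    1, 0,
                    5, 5,
                    5, 6,
                    6, 5), ncol = 2, byrow = TRUE)
train_y <- factor(c("low", "low", "low", "high", "high", "high"))

pred_near_origin <- knn_predict(train_x, train_y, c(0.5, 0.5))
pred_far         <- knn_predict(train_x, train_y, c(5.5, 5.5))
```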

Preparing the data

We need to use only numeric data because we will use the Euclidean distance metric to find the nearest neighbours.
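For two cars a and b with p numeric features, the Euclidean distance is sqrt(sum((a_i - b_i)^2)) over i = 1..p. Because the differences are squared and summed directly, a feature measured on a large scale dominates the result, which is the reason for scaling the features before fitting KNN. The sketch below uses the horsepower and price of the first and third cars shown in the glimpse output; the helper name euclid is an assumption made for illustration.

```r
# Euclidean distance between two numeric vectors
euclid <- function(a, b) sqrt(sum((a - b)^2))

# Horsepower and price of the first and third cars in the dataset
car1 <- c(horsepower = 111, price = 13495)
car3 <- c(horsepower = 154, price = 16500)

d_both  <- euclid(car1, car3)                    # both features, unscaled
d_price <- euclid(car1["price"], car3["price"])  # price alone

# Share of the distance explained by price alone (very close to 1)
d_price / d_both
```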

#Filtering for numeric and integer data
numData = data[c(intCols,numCols)]
numData = numData %>% mutate(symboling = data$symboling)
glimpse(numData)
## Rows: 193
## Columns: 15
## $ horsepower        <int> 111, 111, 154, 102, 115, 110, 110, 110, 140, 101, 10~
## $ peak.rpm          <int> 5000, 5000, 5000, 5500, 5500, 5500, 5500, 5500, 5500~
## $ city.mpg          <int> 21, 21, 19, 24, 18, 19, 19, 19, 17, 23, 23, 21, 21, ~
## $ highway.mpg       <int> 27, 27, 26, 30, 22, 25, 25, 25, 20, 29, 29, 28, 28, ~
## $ price             <int> 13495, 16500, 16500, 13950, 17450, 15250, 17710, 189~
## $ curb.weight       <int> 2548, 2548, 2823, 2337, 2824, 2507, 2844, 2954, 3086~
## $ engine.size       <int> 130, 130, 152, 109, 136, 136, 136, 136, 131, 108, 10~
## $ bore              <dbl> 3.47, 3.47, 2.68, 3.19, 3.19, 3.19, 3.19, 3.19, 3.13~
## $ stroke            <dbl> 2.68, 2.68, 3.47, 3.40, 3.40, 3.40, 3.40, 3.40, 3.40~
## $ compression.ratio <dbl> 9.00, 9.00, 9.00, 10.00, 8.00, 8.50, 8.50, 8.50, 8.3~
## $ wheel.base        <dbl> 88.6, 88.6, 94.5, 99.8, 99.4, 99.8, 105.8, 105.8, 10~
## $ length            <dbl> 168.8, 168.8, 171.2, 176.6, 176.6, 177.3, 192.7, 192~
## $ width             <dbl> 64.1, 64.1, 65.5, 66.2, 66.4, 66.3, 71.4, 71.4, 71.4~
## $ height            <dbl> 48.8, 48.8, 52.4, 54.3, 54.3, 53.1, 55.7, 55.7, 55.9~
## $ symboling         <fct> 3, 3, 1, 2, 2, 2, 1, 1, 1, 2, 0, 0, 0, 1, 0, 0, 0, 2~
#Splitting Data
library(caTools)
set.seed(200)
split=sample.split(Y=numData$symboling,SplitRatio=0.7)
KNN.train_set=subset(x=numData,split==T) 
KNN.test_set=subset(x=numData,split==F)

#Feature Scaling
targetCol=length(numData) #target feature
KNN.train_set[-targetCol]=scale(x=KNN.train_set[-targetCol])
KNN.test_set[-targetCol]=scale(x=KNN.test_set[-targetCol])
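In the code above, the test set is scaled with its own means and standard deviations, so the same raw value can map to different scaled values in the training and test sets. A common alternative is to reuse the training-set parameters for the test set. The sketch below shows this on two made-up data frames (train and test are hypothetical stand-ins for the feature columns of KNN.train_set and KNN.test_set); applying it to the report would change the results shown later.

```r
# Hypothetical feature columns
train <- data.frame(hp = c(100, 120, 140), wt = c(2000, 2500, 3000))
test  <- data.frame(hp = c(110, 150),      wt = c(2200, 3100))

# Scale the training set and keep the parameters that were used
train_scaled <- scale(train)
centers <- attr(train_scaled, "scaled:center")  # training means
sds     <- attr(train_scaled, "scaled:scale")   # training standard deviations

# Apply the training parameters to the test set
test_scaled <- scale(test, center = centers, scale = sds)
test_scaled
```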

Building the Classifier and Prediction

#Building the K-NN Classifier and predicting test data
library(class)
KNN.pred=knn(train=KNN.train_set[-targetCol],
           test=KNN.test_set[-targetCol],
           cl=KNN.train_set[,targetCol],
           k=3)
KNN.actual = KNN.test_set[,targetCol]

Performance Evaluation Metrics

#KNN Confusion matrix
KNN.cm=table(KNN.actual,KNN.pred)
KNN.cm
##           KNN.pred
## KNN.actual -2 -1  0  1  2  3
##         -2  0  1  0  0  0  0
##         -1  1  2  1  1  2  0
##         0   0  1 14  3  1  0
##         1   0  0  0 11  3  1
##         2   0  0  0  3  6  0
##         3   0  0  2  1  1  3
# Accuracy
KNN.accuracy = sum(diag(KNN.cm))/sum(KNN.cm)
KNN.accuracy
## [1] 0.6206897
#Precision
KNN.precision = diag(KNN.cm)/colSums(KNN.cm)
KNN.precision
##        -2        -1         0         1         2         3 
## 0.0000000 0.5000000 0.8235294 0.5789474 0.4615385 0.7500000
#Recall
KNN.recall = diag(KNN.cm)/rowSums(KNN.cm)
KNN.recall
##        -2        -1         0         1         2         3 
## 0.0000000 0.2857143 0.7368421 0.7333333 0.6666667 0.4285714
#F1-score
KNN.F1 = 2*(KNN.precision *KNN.recall)/(KNN.precision+KNN.recall)
KNN.F1
##        -2        -1         0         1         2         3 
##       NaN 0.3636364 0.7777778 0.6470588 0.5454545 0.5454545
#Mean F1-score (na.rm = TRUE drops class -2, whose F1 is undefined)
KNN.F1_Mean = mean(KNN.F1,na.rm = TRUE)
KNN.F1_Mean
## [1] 0.5758764
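Dropping the undefined F1 value with na.rm = TRUE averages over five classes only. Another convention, sketched below as an assumption rather than as the method of this report, is to count an undefined F1 as 0 so that a class the model never predicts correctly lowers the average. The matrix is the KNN confusion matrix printed above, entered by hand.

```r
labels <- as.character(-2:3)

# KNN confusion matrix from the output above (rows = actual, columns = predicted)
cm <- matrix(c(0, 1,  0,  0, 0, 0,
               1, 2,  1,  1, 2, 0,
               0, 1, 14,  3, 1, 0,
               0, 0,  0, 11, 3, 1,
               0, 0,  0,  3, 6, 0,
               0, 0,  2,  1, 1, 3),
             nrow = 6, byrow = TRUE,
             dimnames = list(actual = labels, predicted = labels))

accuracy  <- sum(diag(cm)) / sum(cm)
precision <- diag(cm) / colSums(cm)   # correct predictions per predicted class
recall    <- diag(cm) / rowSums(cm)   # correct predictions per actual class
f1        <- 2 * precision * recall / (precision + recall)

# Treat an undefined F1 (0/0) as 0 instead of dropping the class
f1[is.nan(f1)] <- 0
macro_f1 <- mean(f1)
macro_f1
```

Under this convention the mean F1-score is lower than the value reported above, because class -2 contributes a zero instead of being left out.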

Decision Tree Classifier

A decision tree is a non-parametric, supervised learning algorithm which uses a tree structure to predict a result from a series of feature-based splits.
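Each split is chosen to make the resulting groups as pure as possible. For classification, rpart uses the Gini index by default; the sketch below computes the Gini impurity and the impurity reduction of a single threshold split on made-up data. The function names gini and split_gain are hypothetical, and the code illustrates the criterion only, not rpart's internal search.

```r
# Gini impurity of a set of class labels: 1 - sum of squared class shares
gini <- function(labels) {
  p <- table(labels) / length(labels)
  1 - sum(p^2)
}

# Impurity reduction obtained by splitting the numeric feature x at a threshold
split_gain <- function(x, y, threshold) {
  left  <- y[x <  threshold]
  right <- y[x >= threshold]
  w_left <- length(left) / length(y)
  gini(y) - (w_left * gini(left) + (1 - w_left) * gini(right))
}

# Toy data: the class changes between x = 3 and x = 4
x <- c(1, 2, 3, 4, 5, 6)
y <- c("a", "a", "a", "b", "b", "b")

gain_good <- split_gain(x, y, 3.5)  # separates the classes perfectly
gain_poor <- split_gain(x, y, 2.5)  # leaves one "a" on the wrong side
```

A tree-growing procedure would evaluate such gains over many candidate features and thresholds and keep the best one at each node.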

#Splitting the data
set.seed(202)
split = sample.split(Y=data$symboling, SplitRatio=0.75)
DT.train_set = subset(data, split==T)
DT.test_set=subset(data, split==F)
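The KNN model was split with seed 200 and ratio 0.7, while this split uses seed 202 and ratio 0.75, so the two classifiers are tested on different rows. For a like-for-like comparison, one split can be computed once and reused for both models. The sketch below is a base-R stand-in for a stratified split, not the algorithm used by caTools; the function name stratified_split and the toy labels are hypothetical.

```r
# Return a logical vector marking training rows, sampled within each class
stratified_split <- function(y, ratio, seed) {
  set.seed(seed)
  in_train <- logical(length(y))
  for (cls in unique(y)) {
    idx <- which(y == cls)
    n_train <- round(ratio * length(idx))
    # sample.int avoids the surprise of sample() on a length-one vector
    chosen <- idx[sample.int(length(idx), n_train)]
    in_train[chosen] <- TRUE
  }
  in_train
}

# Toy labels: 10 rows of class "a" and 20 rows of class "b"
y <- rep(c("a", "b"), times = c(10, 20))
shared_split <- stratified_split(y, ratio = 0.7, seed = 200)

table(y[shared_split])   # class counts in the training part
```

The same logical vector could then be used to subset both the numeric data for KNN and the full data for the decision tree.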

Building the Classifier and Prediction

#DT Classifier
library(rpart)
DT = rpart(formula=symboling~.,
            data=DT.train_set, method ='class')
DT.pred = predict(object = DT,
                  newdata= DT.test_set, type='class')
#Plotting DT
library(rpart.plot)
rpart.plot(DT)

Performance Evaluation Metrics

# DT confusion matrix
DT.cm = table(DT.test_set$symboling,DT.pred)
DT.cm
##     DT.pred
##      -2 -1  0  1  2  3
##   -2  0  1  0  0  0  0
##   -1  0  3  0  3  0  0
##   0   0  0 14  1  1  0
##   1   0  1  3  7  2  0
##   2   0  0  0  0  8  0
##   3   0  0  0  0  3  3
# Accuracy
DT.accuracy = sum(diag(DT.cm))/sum(DT.cm)
DT.accuracy
## [1] 0.7
# Precision
DT.precision = diag(DT.cm)/colSums(DT.cm)
DT.precision
##        -2        -1         0         1         2         3 
##       NaN 0.6000000 0.8235294 0.6363636 0.5714286 1.0000000
# Recall 
DT.recall = diag(DT.cm)/rowSums(DT.cm)
DT.recall
##        -2        -1         0         1         2         3 
## 0.0000000 0.5000000 0.8750000 0.5384615 1.0000000 0.5000000
# F1-score
DT.F1 = 2*(DT.precision *DT.recall)/(DT.precision+DT.recall)
DT.F1
##        -2        -1         0         1         2         3 
##       NaN 0.5454545 0.8484848 0.5833333 0.7272727 0.6666667
#Mean F1-score (na.rm = TRUE drops class -2, whose F1 is undefined)
DT.F1_Mean = mean(DT.F1,na.rm = TRUE)
DT.F1_Mean
## [1] 0.6742424

Comparing the models based on calculated metrics

pem1 = data.frame(Metrics = c(KNN.accuracy,DT.accuracy,KNN.F1_Mean,DT.F1_Mean), 
                  MetricName=c("Accuracy","Accuracy","F1-score","F1-score"),
                  Classifier=c("KNN","DT","KNN","DT"))
ggplot(pem1,aes(x=Classifier, y=Metrics, fill=MetricName)) + geom_bar(stat = "identity") +
  labs(title="Metric Comparison     20MID0006")+facet_wrap(~MetricName)

As we can see, both the accuracy and the mean F1-score are higher for the Decision Tree classifier than for the K-NN classifier.

Conclusion

We modelled the given dataset using two classification algorithms, K-Nearest Neighbours and Decision Tree. On their respective test sets, they predicted the insurance risk level of an automobile with accuracies of about 62% and 70% and mean F1-scores of about 0.58 and 0.67. The two models were evaluated on different random splits (58 and 50 test rows), so the comparison is indicative rather than exact. On these results, the Decision Tree classifier is the better model for the given dataset.