To develop a suitable machine learning model for the assigned dataset and compare the results of at least two machine learning algorithms.
Dataset: https://archive.ics.uci.edu/ml/datasets/Automobile
The dataset describes automobiles in three ways:
The technical specification of the automobile
The loss per vehicle per year given as “normalized-losses”
The insurance risk rating of the automobile given as “symboling”
The “symboling” attribute corresponds to a car’s insurance risk level. Cars are initially assigned a risk factor symbol associated with their price; if an automobile is more risky than its price suggests, this symbol is adjusted upwards. A value of +3 indicates that the vehicle is risky, while -3 indicates that it is likely safe to insure.
The second attribute, “normalized-losses”, is the relative average loss payment per insured vehicle year. This figure is normalized across all vehicles within a given size category (two-door, small, station wagons, sports/specialty, etc.) and represents the average loss per vehicle per year.
The other attributes are self-explanatory and describe the price and technical specifications of the vehicles, such as size, weight, horsepower and engine type. From the dataset, we can see that many of the attributes can be used for prediction.
In this case, we will try to predict the symboling level of an automobile.
Attribute: Attribute Range:
------------------ -----------------------------------------------
1. symboling: -3, -2, -1, 0, 1, 2, 3.
2. normalized-losses: continuous from 65 to 256.
3. make: alfa-romero, audi, bmw, chevrolet, dodge, honda,
isuzu, jaguar, mazda, mercedes-benz, mercury,
mitsubishi, nissan, peugot, plymouth, porsche,
renault, saab, subaru, toyota, volkswagen, volvo
4. fuel-type: diesel, gas.
5. aspiration: std, turbo.
6. num-of-doors: four, two.
7. body-style: hardtop, wagon, sedan, hatchback, convertible.
8. drive-wheels: 4wd, fwd, rwd.
9. engine-location: front, rear.
10. wheel-base: continuous from 86.6 to 120.9.
11. length: continuous from 141.1 to 208.1.
12. width: continuous from 60.3 to 72.3.
13. height: continuous from 47.8 to 59.8.
14. curb-weight: continuous from 1488 to 4066.
15. engine-type: dohc, dohcv, l, ohc, ohcf, ohcv, rotor.
16. num-of-cylinders: eight, five, four, six, three, twelve, two.
17. engine-size: continuous from 61 to 326.
18. fuel-system: 1bbl, 2bbl, 4bbl, idi, mfi, mpfi, spdi, spfi.
19. bore: continuous from 2.54 to 3.94.
20. stroke: continuous from 2.07 to 4.17.
21. compression-ratio: continuous from 7 to 23.
22. horsepower: continuous from 48 to 288.
23. peak-rpm: continuous from 4150 to 6600.
24. city-mpg: continuous from 13 to 49.
25. highway-mpg: continuous from 16 to 54.
26. price: continuous from 5118 to 45400.
library(dplyr)
data = read.csv('imports-85.data', sep=',',
header=F,
col.names=c('symboling', 'normalized.losses','make',
'fuel.type','aspiration','num.of.doors',
'body.style','drive.wheels','engine.location',
'wheel.base','length','width','height','curb.weight',
'engine.type','num.of.cylinders','engine.size',
'fuel.system','bore','stroke','compression.ratio',
'horsepower','peak.rpm','city.mpg','highway.mpg',
'price'))
glimpse(data)
## Rows: 205
## Columns: 26
## $ symboling <int> 3, 3, 1, 2, 2, 2, 1, 1, 1, 0, 2, 0, 0, 0, 1, 0, 0, 0~
## $ normalized.losses <chr> "?", "?", "?", "164", "164", "?", "158", "?", "158",~
## $ make <chr> "alfa-romero", "alfa-romero", "alfa-romero", "audi",~
## $ fuel.type <chr> "gas", "gas", "gas", "gas", "gas", "gas", "gas", "ga~
## $ aspiration <chr> "std", "std", "std", "std", "std", "std", "std", "st~
## $ num.of.doors <chr> "two", "two", "two", "four", "four", "two", "four", ~
## $ body.style <chr> "convertible", "convertible", "hatchback", "sedan", ~
## $ drive.wheels <chr> "rwd", "rwd", "rwd", "fwd", "4wd", "fwd", "fwd", "fw~
## $ engine.location <chr> "front", "front", "front", "front", "front", "front"~
## $ wheel.base <dbl> 88.6, 88.6, 94.5, 99.8, 99.4, 99.8, 105.8, 105.8, 10~
## $ length <dbl> 168.8, 168.8, 171.2, 176.6, 176.6, 177.3, 192.7, 192~
## $ width <dbl> 64.1, 64.1, 65.5, 66.2, 66.4, 66.3, 71.4, 71.4, 71.4~
## $ height <dbl> 48.8, 48.8, 52.4, 54.3, 54.3, 53.1, 55.7, 55.7, 55.9~
## $ curb.weight <int> 2548, 2548, 2823, 2337, 2824, 2507, 2844, 2954, 3086~
## $ engine.type <chr> "dohc", "dohc", "ohcv", "ohc", "ohc", "ohc", "ohc", ~
## $ num.of.cylinders <chr> "four", "four", "six", "four", "five", "five", "five~
## $ engine.size <int> 130, 130, 152, 109, 136, 136, 136, 136, 131, 131, 10~
## $ fuel.system <chr> "mpfi", "mpfi", "mpfi", "mpfi", "mpfi", "mpfi", "mpf~
## $ bore <chr> "3.47", "3.47", "2.68", "3.19", "3.19", "3.19", "3.1~
## $ stroke <chr> "2.68", "2.68", "3.47", "3.40", "3.40", "3.40", "3.4~
## $ compression.ratio <dbl> 9.00, 9.00, 9.00, 10.00, 8.00, 8.50, 8.50, 8.50, 8.3~
## $ horsepower <chr> "111", "111", "154", "102", "115", "110", "110", "11~
## $ peak.rpm <chr> "5000", "5000", "5000", "5500", "5500", "5500", "550~
## $ city.mpg <int> 21, 21, 19, 24, 18, 19, 19, 19, 17, 16, 23, 23, 21, ~
## $ highway.mpg <int> 27, 27, 26, 30, 22, 25, 25, 25, 20, 22, 29, 29, 28, ~
## $ price <chr> "13495", "16500", "16500", "13950", "17450", "15250"~
We can see that many attributes do not have the correct data type. The data also
contains "?" instead of NA values, so these need to be replaced with NA.
data[data == '?'] <- NA
NAsByFeature<-apply(data,2,function(x){length(which(is.na(x)))})
NAsByFeature
## symboling normalized.losses make fuel.type
## 0 41 0 0
## aspiration num.of.doors body.style drive.wheels
## 0 2 0 0
## engine.location wheel.base length width
## 0 0 0 0
## height curb.weight engine.type num.of.cylinders
## 0 0 0 0
## engine.size fuel.system bore stroke
## 0 0 4 4
## compression.ratio horsepower peak.rpm city.mpg
## 0 2 2 0
## highway.mpg price
## 0 4
We see that a high number of NA values (41) is present in the normalized-losses attribute. Hence we will not consider this feature for the analysis.
data = data[-2]
Omit any remaining rows with NA values.
data = data %>% na.omit()
NAsByFeature<-apply(data,2,function(x){length(which(is.na(x)))})
NAsByFeature
## symboling make fuel.type aspiration
## 0 0 0 0
## num.of.doors body.style drive.wheels engine.location
## 0 0 0 0
## wheel.base length width height
## 0 0 0 0
## curb.weight engine.type num.of.cylinders engine.size
## 0 0 0 0
## fuel.system bore stroke compression.ratio
## 0 0 0 0
## horsepower peak.rpm city.mpg highway.mpg
## 0 0 0 0
## price
## 0
Correcting the data types of the features.
factorCols = c('symboling','make',
'fuel.type','aspiration','num.of.doors',
'body.style','drive.wheels','engine.location',
'engine.type','num.of.cylinders',
'fuel.system')
intCols =c('horsepower','peak.rpm','city.mpg','highway.mpg',
'price','curb.weight','engine.size')
numCols = c('bore','stroke','compression.ratio','wheel.base','length','width','height')
data = data %>% mutate_at(factorCols, factor) %>%
mutate_at(intCols, as.integer) %>% mutate_at(numCols, as.numeric)
Finally, the cleaned data is as follows.
glimpse(data)
## Rows: 193
## Columns: 25
## $ symboling <fct> 3, 3, 1, 2, 2, 2, 1, 1, 1, 2, 0, 0, 0, 1, 0, 0, 0, 2~
## $ make <fct> alfa-romero, alfa-romero, alfa-romero, audi, audi, a~
## $ fuel.type <fct> gas, gas, gas, gas, gas, gas, gas, gas, gas, gas, ga~
## $ aspiration <fct> std, std, std, std, std, std, std, std, turbo, std, ~
## $ num.of.doors <fct> two, two, two, four, four, two, four, four, four, tw~
## $ body.style <fct> convertible, convertible, hatchback, sedan, sedan, s~
## $ drive.wheels <fct> rwd, rwd, rwd, fwd, 4wd, fwd, fwd, fwd, fwd, rwd, rw~
## $ engine.location <fct> front, front, front, front, front, front, front, fro~
## $ wheel.base <dbl> 88.6, 88.6, 94.5, 99.8, 99.4, 99.8, 105.8, 105.8, 10~
## $ length <dbl> 168.8, 168.8, 171.2, 176.6, 176.6, 177.3, 192.7, 192~
## $ width <dbl> 64.1, 64.1, 65.5, 66.2, 66.4, 66.3, 71.4, 71.4, 71.4~
## $ height <dbl> 48.8, 48.8, 52.4, 54.3, 54.3, 53.1, 55.7, 55.7, 55.9~
## $ curb.weight <int> 2548, 2548, 2823, 2337, 2824, 2507, 2844, 2954, 3086~
## $ engine.type <fct> dohc, dohc, ohcv, ohc, ohc, ohc, ohc, ohc, ohc, ohc,~
## $ num.of.cylinders <fct> four, four, six, four, five, five, five, five, five,~
## $ engine.size <int> 130, 130, 152, 109, 136, 136, 136, 136, 131, 108, 10~
## $ fuel.system <fct> mpfi, mpfi, mpfi, mpfi, mpfi, mpfi, mpfi, mpfi, mpfi~
## $ bore <dbl> 3.47, 3.47, 2.68, 3.19, 3.19, 3.19, 3.19, 3.19, 3.13~
## $ stroke <dbl> 2.68, 2.68, 3.47, 3.40, 3.40, 3.40, 3.40, 3.40, 3.40~
## $ compression.ratio <dbl> 9.00, 9.00, 9.00, 10.00, 8.00, 8.50, 8.50, 8.50, 8.3~
## $ horsepower <int> 111, 111, 154, 102, 115, 110, 110, 110, 140, 101, 10~
## $ peak.rpm <int> 5000, 5000, 5000, 5500, 5500, 5500, 5500, 5500, 5500~
## $ city.mpg <int> 21, 21, 19, 24, 18, 19, 19, 19, 17, 23, 23, 21, 21, ~
## $ highway.mpg <int> 27, 27, 26, 30, 22, 25, 25, 25, 20, 29, 29, 28, 28, ~
## $ price <int> 13495, 16500, 16500, 13950, 17450, 15250, 17710, 189~
summary(data)
## symboling make fuel.type aspiration num.of.doors
## -2: 3 toyota :32 diesel: 19 std :158 four:112
## -1:22 nissan :18 gas :174 turbo: 35 two : 81
## 0 :63 honda :13
## 1 :51 mitsubishi:13
## 2 :31 mazda :12
## 3 :23 subaru :12
## (Other) :93
## body.style drive.wheels engine.location wheel.base length
## convertible: 6 4wd: 8 front:190 Min. : 86.60 Min. :141.1
## hardtop : 8 fwd:114 rear : 3 1st Qu.: 94.50 1st Qu.:166.3
## hatchback :63 rwd: 71 Median : 97.00 Median :173.2
## sedan :92 Mean : 98.92 Mean :174.3
## wagon :24 3rd Qu.:102.40 3rd Qu.:184.6
## Max. :120.90 Max. :208.1
##
## width height curb.weight engine.type num.of.cylinders
## Min. :60.30 Min. :47.80 Min. :1488 dohc: 12 eight : 4
## 1st Qu.:64.10 1st Qu.:52.00 1st Qu.:2145 l : 12 five : 10
## Median :65.40 Median :54.10 Median :2414 ohc :141 four :153
## Mean :65.89 Mean :53.87 Mean :2562 ohcf: 15 six : 24
## 3rd Qu.:66.90 3rd Qu.:55.70 3rd Qu.:2952 ohcv: 13 three : 1
## Max. :72.00 Max. :59.80 Max. :4066 twelve: 1
##
## engine.size fuel.system bore stroke compression.ratio
## Min. : 61.0 1bbl:11 Min. :2.540 Min. :2.070 Min. : 7.00
## 1st Qu.: 98.0 2bbl:64 1st Qu.:3.150 1st Qu.:3.110 1st Qu.: 8.50
## Median :120.0 idi :19 Median :3.310 Median :3.290 Median : 9.00
## Mean :128.1 mfi : 1 Mean :3.331 Mean :3.249 Mean :10.14
## 3rd Qu.:146.0 mpfi:88 3rd Qu.:3.590 3rd Qu.:3.410 3rd Qu.: 9.40
## Max. :326.0 spdi: 9 Max. :3.940 Max. :4.170 Max. :23.00
## spfi: 1
## horsepower peak.rpm city.mpg highway.mpg price
## Min. : 48.0 Min. :4150 Min. :13.00 Min. :16.00 Min. : 5118
## 1st Qu.: 70.0 1st Qu.:4800 1st Qu.:19.00 1st Qu.:25.00 1st Qu.: 7738
## Median : 95.0 Median :5100 Median :25.00 Median :30.00 Median :10245
## Mean :103.5 Mean :5100 Mean :25.33 Mean :30.79 Mean :13285
## 3rd Qu.:116.0 3rd Qu.:5500 3rd Qu.:30.00 3rd Qu.:34.00 3rd Qu.:16515
## Max. :262.0 Max. :6600 Max. :49.00 Max. :54.00 Max. :45400
##
Loading ggplot2 to visualize the data.
library(ggplot2)
#Density Plot
filter_cyl <- data %>% filter(num.of.cylinders %in% c('four', 'five' ,'six'))
ggplot(data = filter_cyl,
aes(x=city.mpg, fill=num.of.cylinders)) + geom_density(alpha=0.4)+
labs(title="Density Plot 20MID0006")
#Histogram
ggplot(data = data,aes(x=horsepower, fill=fuel.type)) + geom_histogram() +
facet_wrap(~fuel.type)+labs(title="Histogram 20MID0006")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#Box Plot
data = data %>% mutate(volume = length*width*height/1000)
ggplot(data = data,aes(x=body.style, y=volume)) + geom_boxplot()+
labs(title="Box Plot 20MID0006")
#Scatter Plot
ggplot(data = data,aes(x=curb.weight, y=city.mpg)) +
geom_point()+labs(title="Scatter Plot 20MID0006")
#Bar Plot
ggplot(data = data,aes(x=make, y=city.mpg)) + geom_bar(stat = "identity") +
coord_flip()+labs(title="Bar Plot 20MID0006")
#Bubble Plot
ggplot(data = data, aes(x=peak.rpm, y=horsepower, size=city.mpg,
color=symboling)) + geom_point(alpha=0.4)+
labs(title="Bubble Plot 20MID0006")
#Stacked Area Chart
ggplot(data = data, aes(x=curb.weight, y=city.mpg, fill=drive.wheels)) +
geom_area(alpha = 0.4,size=0.5, colour="black")+
labs(title="Stacked Area Chart 20MID0006")
#Correlation Plot
library(corrplot)
corrplot(cor(data[c(intCols,numCols)]), method="square")
The k-nearest neighbours algorithm, also known as KNN or k-NN, is a non-parametric, supervised learning classifier which uses proximity to make classifications or predictions about the grouping of an individual data point.
We use only the numeric attributes here because the Euclidean distance metric is used to find the nearest neighbours.
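To make the proximity idea concrete, the short sketch below (not part of the modelling pipeline) computes the Euclidean distance between the first two cars over the scaled numeric features; the knn() call further below performs this comparison between every test observation and every training observation. The euclid() helper and the choice of rows 1 and 2 are illustrative assumptions.
#Euclidean distance sketch (illustrative only)
scaled_feats = scale(data[c(intCols, numCols)]) #scale so all features contribute comparably
euclid = function(a, b) sqrt(sum((a - b)^2))
euclid(scaled_feats[1, ], scaled_feats[2, ]) #distance between car 1 and car 2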
#Filtering for numeric and integer data
numData = data[c(intCols,numCols)]
numData = numData %>% mutate(symboling = data$symboling)
glimpse(numData)
## Rows: 193
## Columns: 15
## $ horsepower <int> 111, 111, 154, 102, 115, 110, 110, 110, 140, 101, 10~
## $ peak.rpm <int> 5000, 5000, 5000, 5500, 5500, 5500, 5500, 5500, 5500~
## $ city.mpg <int> 21, 21, 19, 24, 18, 19, 19, 19, 17, 23, 23, 21, 21, ~
## $ highway.mpg <int> 27, 27, 26, 30, 22, 25, 25, 25, 20, 29, 29, 28, 28, ~
## $ price <int> 13495, 16500, 16500, 13950, 17450, 15250, 17710, 189~
## $ curb.weight <int> 2548, 2548, 2823, 2337, 2824, 2507, 2844, 2954, 3086~
## $ engine.size <int> 130, 130, 152, 109, 136, 136, 136, 136, 131, 108, 10~
## $ bore <dbl> 3.47, 3.47, 2.68, 3.19, 3.19, 3.19, 3.19, 3.19, 3.13~
## $ stroke <dbl> 2.68, 2.68, 3.47, 3.40, 3.40, 3.40, 3.40, 3.40, 3.40~
## $ compression.ratio <dbl> 9.00, 9.00, 9.00, 10.00, 8.00, 8.50, 8.50, 8.50, 8.3~
## $ wheel.base <dbl> 88.6, 88.6, 94.5, 99.8, 99.4, 99.8, 105.8, 105.8, 10~
## $ length <dbl> 168.8, 168.8, 171.2, 176.6, 176.6, 177.3, 192.7, 192~
## $ width <dbl> 64.1, 64.1, 65.5, 66.2, 66.4, 66.3, 71.4, 71.4, 71.4~
## $ height <dbl> 48.8, 48.8, 52.4, 54.3, 54.3, 53.1, 55.7, 55.7, 55.9~
## $ symboling <fct> 3, 3, 1, 2, 2, 2, 1, 1, 1, 2, 0, 0, 0, 1, 0, 0, 0, 2~
#Splitting Data
library(caTools)
## Warning: package 'caTools' was built under R version 4.1.3
set.seed(200)
split=sample.split(Y=numData$symboling,SplitRatio=0.7)
KNN.train_set=subset(x=numData,split==T)
KNN.test_set=subset(x=numData,split==F)
#Feature Scaling
targetCol=length(numData) #target feature
KNN.train_set[-targetCol]=scale(x=KNN.train_set[-targetCol])
KNN.test_set[-targetCol]=scale(x=KNN.test_set[-targetCol])
#Building the K-NN Classifier and predicting test data
library(class)
KNN.pred=knn(train=KNN.train_set[-targetCol],
test=KNN.test_set[-targetCol],
cl=KNN.train_set[,targetCol],
k=3)
KNN.actual = KNN.test_set[,targetCol]
#KNN Confusion matrix
KNN.cm=table(KNN.actual,KNN.pred)
KNN.cm
## KNN.pred
## KNN.actual -2 -1 0 1 2 3
## -2 0 1 0 0 0 0
## -1 1 2 1 1 2 0
## 0 0 1 14 3 1 0
## 1 0 0 0 11 3 1
## 2 0 0 0 3 6 0
## 3 0 0 2 1 1 3
# Accuracy
KNN.accuracy = sum(diag(KNN.cm))/sum(KNN.cm)
KNN.accuracy
## [1] 0.6206897
#Precision
KNN.precision = diag(KNN.cm)/colSums(KNN.cm)
KNN.precision
## -2 -1 0 1 2 3
## 0.0000000 0.5000000 0.8235294 0.5789474 0.4615385 0.7500000
#Recall
KNN.recall = diag(KNN.cm)/rowSums(KNN.cm)
KNN.recall
## -2 -1 0 1 2 3
## 0.0000000 0.2857143 0.7368421 0.7333333 0.6666667 0.4285714
#F1-score
KNN.F1 = 2*(KNN.precision *KNN.recall)/(KNN.precision+KNN.recall)
KNN.F1
## -2 -1 0 1 2 3
## NaN 0.3636364 0.7777778 0.6470588 0.5454545 0.5454545
#Mean F1-score
KNN.F1_Mean = mean(KNN.F1,na.rm = TRUE)
KNN.F1_Mean
## [1] 0.5758764
A decision tree is a non-parametric, supervised learning algorithm which uses a tree structure to predict an outcome from a series of feature-based splits.
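To illustrate how a single split is scored, the sketch below (not part of the modelling pipeline) computes the Gini impurity of the symboling labels before and after a hypothetical split on engine.location; rpart() below searches all features and cut points for the split with the largest impurity reduction. The gini() helper and the choice of engine.location as the split variable are illustrative assumptions, not taken from the fitted tree.
#Gini impurity sketch (illustrative only)
gini = function(y) { p = prop.table(table(y)); 1 - sum(p^2) }
gini(data$symboling) #impurity of the parent node
left = data$symboling[data$engine.location == "front"]
right = data$symboling[data$engine.location == "rear"]
#Weighted impurity of the two child nodes; a good split reduces this value
(length(left)*gini(left) + length(right)*gini(right)) / nrow(data)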
#Splitting the data
set.seed(202)
split = sample.split(Y=data$symboling, SplitRatio=0.75)
DT.train_set = subset(data, split==T)
DT.test_set=subset(data, split==F)
#DT Classifier
library(rpart)
DT = rpart(formula=symboling~.,
data=DT.train_set, method ='class')
DT.pred = predict(object = DT,
newdata= DT.test_set, type='class')
#Plotting DT
library(rpart.plot)
## Warning: package 'rpart.plot' was built under R version 4.1.3
rpart.plot(DT)
# DT confusion matrix
DT.cm = table(DT.test_set$symboling,DT.pred)
DT.cm
## DT.pred
## -2 -1 0 1 2 3
## -2 0 1 0 0 0 0
## -1 0 3 0 3 0 0
## 0 0 0 14 1 1 0
## 1 0 1 3 7 2 0
## 2 0 0 0 0 8 0
## 3 0 0 0 0 3 3
# Accuracy
DT.accuracy = sum(diag(DT.cm))/sum(DT.cm)
DT.accuracy
## [1] 0.7
# Precision
DT.precision = diag(DT.cm)/colSums(DT.cm)
DT.precision
## -2 -1 0 1 2 3
## NaN 0.6000000 0.8235294 0.6363636 0.5714286 1.0000000
# Recall
DT.recall = diag(DT.cm)/rowSums(DT.cm)
DT.recall
## -2 -1 0 1 2 3
## 0.0000000 0.5000000 0.8750000 0.5384615 1.0000000 0.5000000
# F1-score
DT.F1 = 2*(DT.precision *DT.recall)/(DT.precision+DT.recall)
DT.F1
## -2 -1 0 1 2 3
## NaN 0.5454545 0.8484848 0.5833333 0.7272727 0.6666667
#Mean F1-score
DT.F1_Mean = mean(DT.F1,na.rm = TRUE)
DT.F1_Mean
## [1] 0.6742424
pem1 = data.frame(Metrics = c(KNN.accuracy,DT.accuracy,KNN.F1_Mean,DT.F1_Mean),
MetricName=c("Accuracy","Accuracy","F1-score","F1-score"),
Classifier=c("KNN","DT","KNN","DT"))
ggplot(pem1,aes(x=Classifier, y=Metrics, fill=MetricName)) + geom_bar(stat = "identity") +
labs(title="Metric Comparison 20MID0006")+facet_wrap(~MetricName)
As we can see, both the accuracy and the average F1-score are higher for the Decision Tree classifier than for the k-NN classifier.
We modelled the given dataset using two classification algorithms, namely k-Nearest Neighbours and Decision Tree. Using them to classify the data, we found accuracies of 62% and 70% respectively in predicting the insurance risk of an automobile. From this we can conclude that the Decision Tree classifier is the better model for the given dataset.