To evaluate and compare two different classification models on the Wine dataset from the UCI Machine Learning Repository.
Data Set Characteristics:   Multivariate
Number of Instances:        178
Area:                       Physical
Attribute Characteristics:  Integer, Real
Number of Attributes:       13
Date Donated:               1991-07-01
Associated Tasks:           Classification
Missing Values?             No
Number of Web Hits:         2044123
These data are the results of a chemical analysis of wines grown in the same region in Italy but derived from three different cultivars. The analysis determined the quantities of 13 constituents found in each of the three types of wines.
I think that the initial data set had around 30 variables, but for some reason I only have the 13-dimensional version. I had a list of what the 30 or so variables were, but (a) I lost it, and (b) I would not know which 13 variables are included in the set.
The attributes are:
1) Alcohol
2) Malic acid
3) Ash
4) Alcalinity of ash
5) Magnesium
6) Total phenols
7) Flavanoids
8) Nonflavanoid phenols
9) Proanthocyanins
10) Color intensity
11) Hue
12) OD280/OD315 of diluted wines
13) Proline
In a classification context, this is a well-posed problem with “well-behaved” class structures. It is a good data set for first testing of a new classifier, but not very challenging.
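If wine.csv is not available locally, the same data can be read straight from the UCI repository; the sketch below assumes the standard UCI file URL and names the columns using the attribute list above, with the class label (stored first in the raw file) called Output to match the rest of this report.
# Sketch: fetch the wine data from UCI (the raw file has no header row)
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data"
dataset = read.csv(url, header = FALSE)
names(dataset) = c("Output", "Alcohol", "Malic.acid", "Ash",
                   "Alcalinity.of.ash", "Magnesium", "Total.phenols",
                   "Flavanoids", "Nonflavanoid.phenols", "Proanthocyanins",
                   "Color.intensity", "Hue",
                   "OD280.OD315.of.diluted.wines", "Proline")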
setwd("E://sujith//vit files//3-1//DSP")
dataset=read.csv("wine.csv")
head(dataset)
## Output Alcohol Malic.acid Ash Alcalinity.of.ash Magnesium Total.phenols
## 1 1 14.23 1.71 2.43 15.6 127 2.80
## 2 1 13.20 1.78 2.14 11.2 100 2.65
## 3 1 13.16 2.36 2.67 18.6 101 2.80
## 4 1 14.37 1.95 2.50 16.8 113 3.85
## 5 1 13.24 2.59 2.87 21.0 118 2.80
## 6 1 14.20 1.76 2.45 15.2 112 3.27
## Flavanoids Nonflavanoid.phenols Proanthocyanins Color.intensity Hue
## 1 3.06 0.28 2.29 5.64 1.04
## 2 2.76 0.26 1.28 4.38 1.05
## 3 3.24 0.30 2.81 5.68 1.03
## 4 3.49 0.24 2.18 7.80 0.86
## 5 2.69 0.39 1.82 4.32 1.04
## 6 3.39 0.34 1.97 6.75 1.05
## OD280.OD315.of.diluted.wines Proline
## 1 3.92 1065
## 2 3.40 1050
## 3 3.17 1185
## 4 3.45 1480
## 5 2.93 735
## 6 2.85 1450
str(dataset)
## 'data.frame': 178 obs. of 14 variables:
## $ Output : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Alcohol : num 14.2 13.2 13.2 14.4 13.2 ...
## $ Malic.acid : num 1.71 1.78 2.36 1.95 2.59 1.76 1.87 2.15 1.64 1.35 ...
## $ Ash : num 2.43 2.14 2.67 2.5 2.87 2.45 2.45 2.61 2.17 2.27 ...
## $ Alcalinity.of.ash : num 15.6 11.2 18.6 16.8 21 15.2 14.6 17.6 14 16 ...
## $ Magnesium : int 127 100 101 113 118 112 96 121 97 98 ...
## $ Total.phenols : num 2.8 2.65 2.8 3.85 2.8 3.27 2.5 2.6 2.8 2.98 ...
## $ Flavanoids : num 3.06 2.76 3.24 3.49 2.69 3.39 2.52 2.51 2.98 3.15 ...
## $ Nonflavanoid.phenols : num 0.28 0.26 0.3 0.24 0.39 0.34 0.3 0.31 0.29 0.22 ...
## $ Proanthocyanins : num 2.29 1.28 2.81 2.18 1.82 1.97 1.98 1.25 1.98 1.85 ...
## $ Color.intensity : num 5.64 4.38 5.68 7.8 4.32 6.75 5.25 5.05 5.2 7.22 ...
## $ Hue : num 1.04 1.05 1.03 0.86 1.04 1.05 1.02 1.06 1.08 1.01 ...
## $ OD280.OD315.of.diluted.wines: num 3.92 3.4 3.17 3.45 2.93 2.85 3.58 3.58 2.85 3.55 ...
## $ Proline : int 1065 1050 1185 1480 735 1450 1290 1295 1045 1045 ...
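As a quick sanity check against the “Missing Values? No” entry in the UCI summary, we can count NAs per column (a one-line sketch; output omitted, but every count should be zero):
colSums(is.na(dataset))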
The following scatterplot matrix shows how the variables in the dataset are correlated with one another:
pairs(dataset)
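To put numbers on the relationships visible in the pairs plot, the correlation matrix of the 13 attributes can be computed as well (sketch; output omitted here):
# Pairwise Pearson correlations, excluding the class column
round(cor(dataset[,-1]), 2)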
This is a histogram of the occurrences of each “Output” class (ggplot2 is loaded here for all the plots that follow):
library(ggplot2)
ggplot(data = dataset, aes(x = Output, fill = as.factor(Output))) + geom_histogram() + labs(title = "B.S.SUJITH 20MID0180")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
This is a density plot that shows the distribution of the Output classes:
ggplot(data = dataset, aes(x = Output, fill = as.factor(Output))) + geom_density(alpha = 0.4) + labs(title = "B.S.SUJITH 20MID0180")
This is a box plot of Alcohol for each Output class:
ggplot(data = dataset, aes(x = as.factor(Output), y = Alcohol)) + geom_boxplot() + labs(title = "B.S.SUJITH 20MID0180")
This is a line plot of Proline against Alcohol from our dataset:
ggplot(data = dataset, aes(x = Alcohol, y = Proline)) + geom_line() + labs(title = "B.S.SUJITH 20MID0180")
Now we make a scatterplot of Proline against Alcohol, colored by the Output class:
ggplot(data = dataset, aes(x = Proline, y = Alcohol, color = Output)) + geom_point(alpha = 0.4, size = 3) + labs(title = "B.S.SUJITH 20MID0180")
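Because Output is stored as an integer, mapping it to color here produces a continuous gradient; converting it to a factor gives three clearly separated colors instead (a variant sketch):
ggplot(data = dataset, aes(x = Proline, y = Alcohol, color = as.factor(Output))) + geom_point(alpha = 0.4, size = 3) + labs(title = "B.S.SUJITH 20MID0180")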
This is a histogram of Alcohol, faceted by the Output class:
ggplot(dataset, aes(x = Alcohol)) + geom_histogram(fill = "cornflowerblue", color = "white") + facet_wrap(~Output, ncol = 1) + labs(title = "B.S.SUJITH 20MID0180")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
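The stat_bin() message above is ggplot2’s hint that the default of 30 bins may not suit the data; the bin count (or a binwidth) can be set explicitly, for example:
ggplot(dataset, aes(x = Alcohol)) + geom_histogram(bins = 15, fill = "cornflowerblue", color = "white") + facet_wrap(~Output, ncol = 1) + labs(title = "B.S.SUJITH 20MID0180")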
We will first split the data into training and test sets:
library(caTools)
set.seed(123)  # for reproducibility
# sample.split() stratifies on the class labels, keeping the 2/3 : 1/3 ratio per class
split = sample.split(Y = dataset$Output, SplitRatio = 2/3)
train_set = subset(x = dataset, split == TRUE)
test_set = subset(x = dataset, split == FALSE)
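Since sample.split() stratifies on the labels, each of the three classes should keep roughly the same 2:1 train-to-test ratio; a quick check (sketch; output not shown):
table(train_set$Output)
table(test_set$Output)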
We now fit a decision tree model on the training set:
library(rpart)
# method = "class" fits a classification tree
fit = rpart(formula = Output ~ ., data = train_set, method = "class")
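To see what the tree actually learned, it can be drawn with the rpart.plot package (assumed installed; it is not used elsewhere in this report, and base plot(fit); text(fit) works as well):
library(rpart.plot)
rpart.plot(fit)  # one box per node, showing the predicted class and class proportions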
Now we predict the classes of the unseen test set:
predict_unseen = predict(object = fit, newdata = test_set, type = "class")
Let’s create a confusion matrix and check the accuracy:
cm=table(test_set$Output,predict_unseen)
cm
## predict_unseen
## 1 2 3
## 1 17 3 0
## 2 0 24 0
## 3 0 3 13
sum(diag(cm))/sum(cm)
## [1] 0.9
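Overall accuracy hides which classes the tree confuses. Since the rows of cm are the actual classes, per-class recall is a one-liner (from the matrix above this works out to 17/20, 24/24, and 13/16):
diag(cm) / rowSums(cm)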
Next, we train a Naive Bayes classifier using the naiveBayes() function from the ‘e1071’ package:
library(e1071)
## Warning: package 'e1071' was built under R version 4.2.2
classifier = naiveBayes(Output ~ ., data = train_set)
summary(classifier)
## Length Class Mode
## apriori 3 table numeric
## tables 13 -none- list
## levels 3 -none- character
## isnumeric 13 -none- logical
## call 4 -none- call
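For numeric attributes, the fitted naiveBayes object stores the per-class mean and standard deviation used by its Gaussian likelihoods; any one of them can be inspected directly (sketch; output omitted):
classifier$tables$Alcohol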
To test the accuracy of this classifier, let’s create a confusion matrix of the classifier’s predictions against the actual values. Note that, unlike the decision tree above, the predictions here are made on the full dataset, which includes the rows the model was trained on.
cm = table(predict(classifier,dataset[,-1]),dataset[,1],dnn=list("predicted","actual"))
cm
## actual
## predicted 1 2 3
## 1 58 0 0
## 2 1 69 0
## 3 0 2 48
The accuracy of this model is as follows:
accuracy = sum(diag(cm)) / sum(cm)
accuracy
## [1] 0.9831461
The accuracy of Naive Bayes is 0.9831 and the accuracy of the decision tree algorithm is 0.90, so on these numbers Naive Bayes is the better predictor. Bear in mind, however, that the Naive Bayes model was scored on the full dataset (including its own training rows) while the decision tree was scored only on the held-out test set, so this comparison is biased in favor of Naive Bayes.
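For a like-for-like comparison, the Naive Bayes model should also be scored on the same held-out test set used for the decision tree; a sketch (output not shown, since the exact figure depends on the split):
# Predict on the test set only, dropping the class column (column 1)
pred_nb = predict(classifier, test_set[,-1])
cm_nb = table(test_set$Output, pred_nb)
sum(diag(cm_nb)) / sum(cm_nb)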