To evaluate and compare two different classification models on the Wine dataset from the UCI Machine Learning Repository.
Data Set Characteristics:   Multivariate
Number of Instances:        178
Area:                       Physical
Attribute Characteristics:  Integer, Real
Number of Attributes:       13
Date Donated:               1991-07-01
Associated Tasks:           Classification
Missing Values?             No
Number of Web Hits:         2044123
These data are the results of a chemical analysis of wines grown in the same region in Italy but derived from three different cultivars. The analysis determined the quantities of 13 constituents found in each of the three types of wines.
I think that the initial data set had around 30 variables, but for some reason I only have the 13-dimensional version. I had a list of what the 30 or so variables were, but (a) I lost it, and (b) I would not know which 13 variables are included in the set.
The attributes are:
1) Alcohol
2) Malic acid
3) Ash
4) Alcalinity of ash
5) Magnesium
6) Total phenols
7) Flavanoids
8) Nonflavanoid phenols
9) Proanthocyanins
10) Color intensity
11) Hue
12) OD280/OD315 of diluted wines
13) Proline
In a classification context, this is a well-posed problem with “well-behaved” class structures. It is a good data set for first testing of a new classifier, but not very challenging.
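If wine.csv is not available locally, the same data can be read straight from the UCI repository; the sketch below assumes the standard UCI file URL and names the columns using the attribute list above, with the class label (stored first in the raw file) called Output to match the rest of this report.
# Sketch: fetch the wine data from UCI (the raw file has no header row)
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data"
dataset = read.csv(url, header = FALSE)
names(dataset) = c("Output", "Alcohol", "Malic.acid", "Ash",
                   "Alcalinity.of.ash", "Magnesium", "Total.phenols",
                   "Flavanoids", "Nonflavanoid.phenols", "Proanthocyanins",
                   "Color.intensity", "Hue",
                   "OD280.OD315.of.diluted.wines", "Proline")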
setwd("E://sujith//vit files//3-1//DSP")
dataset=read.csv("wine.csv")
head(dataset)
## Output Alcohol Malic.acid Ash Alcalinity.of.ash Magnesium Total.phenols
## 1 1 14.23 1.71 2.43 15.6 127 2.80
## 2 1 13.20 1.78 2.14 11.2 100 2.65
## 3 1 13.16 2.36 2.67 18.6 101 2.80
## 4 1 14.37 1.95 2.50 16.8 113 3.85
## 5 1 13.24 2.59 2.87 21.0 118 2.80
## 6 1 14.20 1.76 2.45 15.2 112 3.27
## Flavanoids Nonflavanoid.phenols Proanthocyanins Color.intensity Hue
## 1 3.06 0.28 2.29 5.64 1.04
## 2 2.76 0.26 1.28 4.38 1.05
## 3 3.24 0.30 2.81 5.68 1.03
## 4 3.49 0.24 2.18 7.80 0.86
## 5 2.69 0.39 1.82 4.32 1.04
## 6 3.39 0.34 1.97 6.75 1.05
## OD280.OD315.of.diluted.wines Proline
## 1 3.92 1065
## 2 3.40 1050
## 3 3.17 1185
## 4 3.45 1480
## 5 2.93 735
## 6 2.85 1450
str(dataset)
## 'data.frame': 178 obs. of 14 variables:
## $ Output : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Alcohol : num 14.2 13.2 13.2 14.4 13.2 ...
## $ Malic.acid : num 1.71 1.78 2.36 1.95 2.59 1.76 1.87 2.15 1.64 1.35 ...
## $ Ash : num 2.43 2.14 2.67 2.5 2.87 2.45 2.45 2.61 2.17 2.27 ...
## $ Alcalinity.of.ash : num 15.6 11.2 18.6 16.8 21 15.2 14.6 17.6 14 16 ...
## $ Magnesium : int 127 100 101 113 118 112 96 121 97 98 ...
## $ Total.phenols : num 2.8 2.65 2.8 3.85 2.8 3.27 2.5 2.6 2.8 2.98 ...
## $ Flavanoids : num 3.06 2.76 3.24 3.49 2.69 3.39 2.52 2.51 2.98 3.15 ...
## $ Nonflavanoid.phenols : num 0.28 0.26 0.3 0.24 0.39 0.34 0.3 0.31 0.29 0.22 ...
## $ Proanthocyanins : num 2.29 1.28 2.81 2.18 1.82 1.97 1.98 1.25 1.98 1.85 ...
## $ Color.intensity : num 5.64 4.38 5.68 7.8 4.32 6.75 5.25 5.05 5.2 7.22 ...
## $ Hue : num 1.04 1.05 1.03 0.86 1.04 1.05 1.02 1.06 1.08 1.01 ...
## $ OD280.OD315.of.diluted.wines: num 3.92 3.4 3.17 3.45 2.93 2.85 3.58 3.58 2.85 3.55 ...
## $ Proline : int 1065 1050 1185 1480 735 1450 1290 1295 1045 1045 ...
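As a quick sanity check against the “Missing Values? No” entry in the UCI summary, we can count NAs per column (a one-line sketch; output omitted, but every count should be zero):
colSums(is.na(dataset))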
The following scatterplot matrix shows how the variables in the dataset are correlated with one another:
pairs(dataset)
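To put numbers on the relationships visible in the pairs plot, the correlation matrix of the 13 attributes can be computed as well (sketch; output omitted here):
# Pairwise Pearson correlations, excluding the class column
round(cor(dataset[,-1]), 2)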
This is a histogram of the occurrences of each “Output” class (ggplot2 is loaded here for all the plots that follow):
library(ggplot2)
ggplot(data = dataset, aes(x = Output, fill = as.factor(Output))) + geom_histogram() + labs(title = "B.S.SUJITH 20MID0180")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
This is a density plot that shows the distribution of the Output classes:
ggplot(data = dataset, aes(x = Output, fill = as.factor(Output))) + geom_density(alpha = 0.4) + labs(title = "B.S.SUJITH 20MID0180")
This is a box plot of Alcohol for each Output class:
ggplot(data = dataset, aes(x = as.factor(Output), y = Alcohol)) + geom_boxplot() + labs(title = "B.S.SUJITH 20MID0180")
This is a line plot of Proline against Alcohol from our dataset:
ggplot(data = dataset, aes(x = Alcohol, y = Proline)) + geom_line() + labs(title = "B.S.SUJITH 20MID0180")
Now we make a scatterplot of Proline against Alcohol, colored by the Output class:
ggplot(data = dataset, aes(x = Proline, y = Alcohol, color = Output)) + geom_point(alpha = 0.4, size = 3) + labs(title = "B.S.SUJITH 20MID0180")
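Because Output is stored as an integer, mapping it to color here produces a continuous gradient; converting it to a factor gives three clearly separated colors instead (a variant sketch):
ggplot(data = dataset, aes(x = Proline, y = Alcohol, color = as.factor(Output))) + geom_point(alpha = 0.4, size = 3) + labs(title = "B.S.SUJITH 20MID0180")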
This is a histogram of Alcohol, faceted by the Output class:
ggplot(dataset, aes(x = Alcohol)) + geom_histogram(fill = "cornflowerblue", color = "white") + facet_wrap(~Output, ncol = 1) + labs(title = "B.S.SUJITH 20MID0180")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
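The stat_bin() message above is ggplot2’s hint that the default of 30 bins may not suit the data; the bin count (or a binwidth) can be set explicitly, for example:
ggplot(dataset, aes(x = Alcohol)) + geom_histogram(bins = 15, fill = "cornflowerblue", color = "white") + facet_wrap(~Output, ncol = 1) + labs(title = "B.S.SUJITH 20MID0180")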
We will first split the data into training and test sets:
library(caTools)
set.seed(123)  # for reproducibility
# sample.split() stratifies on the class labels, keeping the 2/3 : 1/3 ratio per class
split = sample.split(Y = dataset$Output, SplitRatio = 2/3)
train_set = subset(x = dataset, split == TRUE)
test_set = subset(x = dataset, split == FALSE)
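Since sample.split() stratifies on the labels, each of the three classes should keep roughly the same 2:1 train-to-test ratio; a quick check (sketch; output not shown):
table(train_set$Output)
table(test_set$Output)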
We now fit a decision tree model on the training set:
library(rpart)
# method = "class" fits a classification tree
fit = rpart(formula = Output ~ ., data = train_set, method = "class")
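To see what the tree actually learned, it can be drawn with the rpart.plot package (assumed installed; it is not used elsewhere in this report, and base plot(fit); text(fit) works as well):
library(rpart.plot)
rpart.plot(fit)  # one box per node, showing the predicted class and class proportions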
Now we predict the classes of the unseen test set:
predict_unseen = predict(object = fit, newdata = test_set, type = "class")
Let’s create a confusion matrix and check the accuracy:
cm=table(test_set$Output,predict_unseen)
cm
## predict_unseen
## 1 2 3
## 1 17 3 0
## 2 0 24 0
## 3 0 3 13
sum(diag(cm))/sum(cm)
## [1] 0.9
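Overall accuracy hides which classes the tree confuses. Since the rows of cm are the actual classes, per-class recall is a one-liner (from the matrix above this works out to 17/20, 24/24, and 13/16):
diag(cm) / rowSums(cm)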
Next, we train a Naive Bayes classifier using the naiveBayes() function from the ‘e1071’ package:
library(e1071)
## Warning: package 'e1071' was built under R version 4.2.2
classifier = naiveBayes(Output ~ ., data = train_set)
summary(classifier)
## Length Class Mode
## apriori 3 table numeric
## tables 13 -none- list
## levels 3 -none- character
## isnumeric 13 -none- logical
## call 4 -none- call
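For numeric attributes, the fitted naiveBayes object stores the per-class mean and standard deviation used by its Gaussian likelihoods; any one of them can be inspected directly (sketch; output omitted):
classifier$tables$Alcohol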
To test the accuracy of this classifier, let’s create a confusion matrix of the classifier’s predictions against the actual values. Note that, unlike the decision tree above, the predictions here are made on the full dataset, which includes the rows the model was trained on.
cm = table(predict(classifier,dataset[,-1]),dataset[,1],dnn=list("predicted","actual"))
cm
## actual
## predicted 1 2 3
## 1 58 0 0
## 2 1 69 0
## 3 0 2 48
The accuracy of this model is as follows:
accuracy = sum(diag(cm)) / sum(cm)
accuracy
## [1] 0.9831461
The accuracy of Naive Bayes is 0.9831 and the accuracy of the decision tree algorithm is 0.90, so on these numbers Naive Bayes is the better predictor. Bear in mind, however, that the Naive Bayes model was scored on the full dataset (including its own training rows) while the decision tree was scored only on the held-out test set, so this comparison is biased in favor of Naive Bayes.
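For a like-for-like comparison, the Naive Bayes model should also be scored on the same held-out test set used for the decision tree; a sketch (output not shown, since the exact figure depends on the split):
# Predict on the test set only, dropping the class column (column 1)
pred_nb = predict(classifier, test_set[,-1])
cm_nb = table(test_set$Output, pred_nb)
sum(diag(cm_nb)) / sum(cm_nb)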