The Wine-Quality data frame has 4898 rows and 12 columns. The dependent variable is a discrete score rather than a continuous measurement.
This data frame contains the following columns:
fixed.acidity :
most acids involved with wine are fixed or nonvolatile (they do not evaporate readily)
volatile.acidity :
the amount of acetic acid in wine, which at too high of levels can lead to an unpleasant, vinegar taste
citric.acid :
found in small quantities, citric acid can add ‘freshness’ and flavor to wines
residual.sugar :
the amount of sugar remaining after fermentation stops.
chlorides :
the amount of salt in the wine.
free.sulfur.dioxide :
the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion.
total.sulfur.dioxide :
amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine.
density :
the density of wine is close to that of water depending on the percent alcohol and sugar content
pH :
describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale.
sulphates :
a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant.
alcohol :
the percent alcohol content of the wine.
quality :
the dependent variable (based on sensory data, score between 0 and 10); 0 signifies very bad quality and 10 signifies very good quality.
Importing the dataset
dataset = read.csv("winequality-white.csv")
str(dataset)
## 'data.frame': 4898 obs. of 12 variables:
## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : int 6 6 6 6 6 6 6 6 6 6 ...
names(dataset)
## [1] "fixed.acidity" "volatile.acidity" "citric.acid"
## [4] "residual.sugar" "chlorides" "free.sulfur.dioxide"
## [7] "total.sulfur.dioxide" "density" "pH"
## [10] "sulphates" "alcohol" "quality"
Before converting the dependent variable into categories, I want to run a quick correlation check across all the variables. Since the dependent variable is stored as an integer, it first needs to be converted to numeric, because correlation is computed only on numeric variables.
dataset$quality = as.numeric(dataset$quality)
Checking correlation using Pearson as the default method
cr=cor(dataset)
Plotting correlation
library(corrplot)
## corrplot 0.84 loaded
corrplot(cr,method= "number")
Finding:
Quality is most strongly correlated with the amount of alcohol. No surprise there!
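To check this numerically, the quality column of the correlation matrix computed above can be sorted (quality correlates perfectly with itself, so the first entry can be ignored). A minimal sketch:
# Sort the correlation of every variable with quality, strongest first
sort(cr[, "quality"], decreasing = TRUE)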
Let's check the distribution of our dependent variable with a frequency table and a histogram
table(dataset$quality)
##
## 3 4 5 6 7 8 9
## 20 163 1457 2198 880 175 5
hist(dataset$quality)
So the dependent variable is highly concentrated: the minimum value is 3, the maximum is 9, and most of the values are 5, 6, or 7.
To build a good model we need to categorize our dependent variable properly.
dataset$quality = ifelse(dataset$quality <=5, "0", ifelse(dataset$quality <=7,"1","2"))
Dependent variable values of 5 or less indicate low-quality wine, 6 or 7 medium quality, and 8 or 9 high quality.
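As an aside, the same binning could also be written with cut(), which maps numeric ranges to labels in one call and returns a factor directly. A minimal sketch of the equivalent step (not run here, since the ifelse() above already did the job):
# Equivalent binning: (-Inf, 5] -> "0", (5, 7] -> "1", (7, Inf) -> "2"
dataset$quality = cut(dataset$quality, breaks = c(-Inf, 5, 7, Inf), labels = c("0", "1", "2"))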
Checking the distribution again
table(dataset$quality)
##
## 0 1 2
## 1640 3078 180
Now our dependent variable is reasonably well distributed. Let's encode it as a factor, since R currently stores it as character strings.
dataset$quality = factor(dataset$quality)
class(dataset$quality)
## [1] "factor"
Checking for NA values in the whole dataset
sum(is.na(dataset))
## [1] 0
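If there had been missing values, a per-column count would show where they sit. A quick sketch:
# Count NA values in each column
colSums(is.na(dataset))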
Splitting the dataset into the Training set and Test set
# install.packages('caTools')
library(caTools)
set.seed(150)
split = sample.split(dataset$quality, SplitRatio = 0.8)
Create training and testing sets
training_set = subset(dataset, split == TRUE)
test_set = subset(dataset, split == FALSE)
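sample.split() stratifies on the dependent variable, so the class proportions should be roughly the same in both sets. A quick sanity check (a sketch, not part of the original run):
# Compare class proportions in the training and test sets
prop.table(table(training_set$quality))
prop.table(table(test_set$quality))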
Decision Tree Classification
Fitting a decision tree classifier to the dataset. We don't need feature scaling here, because a decision tree doesn't rely on Euclidean distances to do the classification.
library(rpart)
classifier = rpart(formula = quality ~ .,
data = training_set)
Predicting the test set with the decision tree classifier
y_pred = predict(classifier, newdata = test_set[-12],type='class')
cm = as.matrix(table(actual=test_set$quality,predicted=y_pred))
(cm)
## predicted
## actual 0 1 2
## 0 203 125 0
## 1 112 504 0
## 2 0 36 0
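The accuracy quoted below can be read straight off the confusion matrix as the sum of the diagonal over the total number of test observations. A minimal sketch, reusable for the other models as well:
# Overall accuracy = correctly classified / total test observations
sum(diag(cm)) / sum(cm)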
So the decision tree gives an accuracy of 72.1%. We cannot plot the decision boundary on a graph because we have 11 independent variables in the dataset; we would need a dimensionality-reduction technique such as PCA to bring it down to 2 dimensions for plotting.
RANDOM FOREST
Fitting Random Forest Classification to the Training set
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
set.seed(123)
classifier1 = randomForest(x = training_set[-12],
y = training_set$quality,
ntree = 500)
Predicting the Test set results
y_pred1 = predict(classifier1, newdata = test_set[-12])
Making the Confusion Matrix
cm1 = table(actual=test_set[, 12],predicted= y_pred1)
(cm1)
## predicted
## actual 0 1 2
## 0 227 101 0
## 1 60 556 0
## 2 0 19 17
So random forest gives an accuracy of 81.6%.
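It is also worth checking which predictors the forest leans on most. randomForest provides a variable-importance plot; given the earlier correlation finding, alcohol would be expected near the top (a sketch, not run above):
# Plot the mean decrease in Gini impurity for each predictor
varImpPlot(classifier1)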
Kernel SVM
We need feature scaling for Kernel SVM
training_set1 = training_set
training_set1[-12] = scale(training_set[-12])
test_set1 = test_set
test_set1[-12] = scale(test_set[-12])
Fitting Kernel SVM to the Training set
library(e1071)
classifier2 = svm(formula = quality ~ .,
data = training_set1,
type = 'C-classification',
kernel = 'radial')
Predicting the Test set results
y_pred2 = predict(classifier2, newdata = test_set1[-12])
Making the Confusion Matrix
cm2 = table(actual=test_set1[, 12], predicted= y_pred2)
(cm2)
## predicted
## actual 0 1 2
## 0 198 130 0
## 1 77 539 0
## 2 1 35 0
So kernel SVM gives an accuracy of 75%.
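The radial kernel has its own hyperparameters (cost and gamma) that were left at their defaults here; e1071's tune() can grid-search them with cross-validation. A minimal sketch, where the grid values are illustrative assumptions and the search can be slow:
# Grid-search cost and gamma for the radial-kernel SVM (10-fold CV by default)
tuned = tune(svm, quality ~ ., data = training_set1,
             ranges = list(cost = c(0.1, 1, 10), gamma = c(0.01, 0.1, 1)))
tuned$best.parameters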
So the best model turns out to be the random forest, but there are a number of hyperparameters we could tune to try to improve it.
IMPROVING THE RANDOM FOREST MODEL
Applying Grid Search to find the best parameters
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:randomForest':
##
## margin
classifier3 = train(form = quality ~ ., data = training_set, method = 'rf')
(classifier3$bestTune)
## mtry
## 1 2
So according to the grid search, the random forest model is most optimized at mtry = 2.
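By default caret's train() only tries a handful of mtry values on bootstrap resamples; the search can be made explicit with trainControl() and a custom tuneGrid. A minimal sketch, where the fold count and grid values are illustrative assumptions:
# 5-fold cross-validation over a wider mtry grid
ctrl = trainControl(method = "cv", number = 5)
classifier_cv = train(quality ~ ., data = training_set, method = "rf",
                      trControl = ctrl, tuneGrid = expand.grid(mtry = c(2, 4, 6, 8)))
classifier_cv$bestTune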
set.seed(123)
classifier4 = randomForest(x = training_set[-12],
y = training_set$quality,
ntree = 1000, mtry = 2)
Predicting the Test set results
y_pred4 = predict(classifier4, newdata = test_set[-12])
Making the Confusion Matrix
cm3 = table(actual=test_set[, 12],predicted= y_pred4)
accuracy = ((223+554+17)/980)*100
(cm3)
## predicted
## actual 0 1 2
## 0 223 105 0
## 1 62 554 0
## 2 0 19 17
Conclusion
1. Random forest gives the highest accuracy at 81.6%, whereas kernel SVM and the decision tree give accuracies of 75% and 72.1% respectively.
2. The quality of wine is highly dependent on the amount of alcohol it contains.
3. Even after tuning the random forest model, the accuracy stays more or less the same (about 81.0% on the test set).
4. So a neural network might be better suited to this dataset for higher accuracy, which I have decided to try on a different dataset.