The Wine-Quality data frame has 4898 rows and 12 columns. The dependent variable is a discrete score rather than a continuous measurement.
This data frame contains the following columns:
fixed.acidity :
most acids involved with wine are fixed or nonvolatile (they do not evaporate readily)
volatile.acidity :
the amount of acetic acid in wine, which at too high of levels can lead to an unpleasant, vinegar taste
citric.acid :
found in small quantities, citric acid can add ‘freshness’ and flavor to wines
residual.sugar :
the amount of sugar remaining after fermentation stops.
chlorides :
the amount of salt in the wine.
free.sulfur.dioxide :
the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion.
total.sulfur.dioxide :
amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine.
density :
the density of wine is close to that of water depending on the percent alcohol and sugar content
pH :
describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale.
sulphates :
a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant.
alcohol :
the percent alcohol content of the wine.
quality :
the dependent variable (based on sensory data, score between 0 and 10); 0 signifies very bad quality and 10 signifies very good quality.
Importing the dataset
dataset = read.csv("winequality-white.csv")
str(dataset)
## 'data.frame': 4898 obs. of 12 variables:
## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : int 6 6 6 6 6 6 6 6 6 6 ...
names(dataset)
## [1] "fixed.acidity" "volatile.acidity" "citric.acid"
## [4] "residual.sugar" "chlorides" "free.sulfur.dioxide"
## [7] "total.sulfur.dioxide" "density" "pH"
## [10] "sulphates" "alcohol" "quality"
Before converting the dependent variable into categories, I want to run a quick correlation check across all the variables. Since the dependent variable is stored as an integer, it first needs to be converted to numeric, because correlation is computed only on numeric variables.
dataset$quality = as.numeric(dataset$quality)
Checking correlation using Pearson as the default method
cr=cor(dataset)
Plotting correlation
library(corrplot)
## corrplot 0.84 loaded
corrplot(cr,method= "number")
Finding:
Quality is most strongly correlated with the amount of alcohol. No surprise there!
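To check this numerically, the quality column of the correlation matrix computed above can be sorted (quality correlates perfectly with itself, so the first entry can be ignored). A minimal sketch:
# Sort the correlation of every variable with quality, strongest first
sort(cr[, "quality"], decreasing = TRUE)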
Let's check the distribution of our dependent variable with a frequency table and a histogram
table(dataset$quality)
##
## 3 4 5 6 7 8 9
## 20 163 1457 2198 880 175 5
hist(dataset$quality)
So the dependent variable is highly concentrated: the minimum value is 3, the maximum is 9, and most of the values are 5, 6, or 7.
To build a good model we need to categorize our dependent variable properly.
dataset$quality = ifelse(dataset$quality <=5, "0", ifelse(dataset$quality <=7,"1","2"))
Dependent variable values of 5 or less indicate low-quality wine, 6 or 7 medium quality, and 8 or 9 high quality.
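As an aside, the same binning could also be written with cut(), which maps numeric ranges to labels in one call and returns a factor directly. A minimal sketch of the equivalent step (not run here, since the ifelse() above already did the job):
# Equivalent binning: (-Inf, 5] -> "0", (5, 7] -> "1", (7, Inf) -> "2"
dataset$quality = cut(dataset$quality, breaks = c(-Inf, 5, 7, Inf), labels = c("0", "1", "2"))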
Checking the distribution again
table(dataset$quality)
##
## 0 1 2
## 1640 3078 180
Now our dependent variable is reasonably well distributed. Let's encode it as a factor, since R currently stores it as character strings.
dataset$quality = factor(dataset$quality)
class(dataset$quality)
## [1] "factor"
Checking for NA values in the whole dataset
sum(is.na(dataset))
## [1] 0
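If there had been missing values, a per-column count would show where they sit. A quick sketch:
# Count NA values in each column
colSums(is.na(dataset))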
Splitting the dataset into the Training set and Test set
# install.packages('caTools')
library(caTools)
set.seed(150)
split = sample.split(dataset$quality, SplitRatio = 0.8)
Create training and testing sets
training_set = subset(dataset, split == TRUE)
test_set = subset(dataset, split == FALSE)
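sample.split() stratifies on the dependent variable, so the class proportions should be roughly the same in both sets. A quick sanity check (a sketch, not part of the original run):
# Compare class proportions in the training and test sets
prop.table(table(training_set$quality))
prop.table(table(test_set$quality))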
Decision Tree Classification
Fitting a decision tree classifier to the dataset. We don't need feature scaling here, because a decision tree doesn't rely on Euclidean distances to do the classification.
library(rpart)
classifier = rpart(formula = quality ~ .,
data = training_set)
Predicting the test set with the decision tree classifier
y_pred = predict(classifier, newdata = test_set[-12],type='class')
cm = as.matrix(table(actual=test_set$quality,predicted=y_pred))
(cm)
## predicted
## actual 0 1 2
## 0 203 125 0
## 1 112 504 0
## 2 0 36 0
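The accuracy quoted below can be read straight off the confusion matrix as the sum of the diagonal over the total number of test observations. A minimal sketch, reusable for the other models as well:
# Overall accuracy = correctly classified / total test observations
sum(diag(cm)) / sum(cm)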
So the decision tree gives an accuracy of 72.1%. We cannot plot the decision boundary on a graph because we have 11 independent variables in the dataset; we would need a dimensionality-reduction technique such as PCA to bring it down to 2 dimensions for plotting.
RANDOM FOREST
Fitting Random Forest Classification to the Training set
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
set.seed(123)
classifier1 = randomForest(x = training_set[-12],
y = training_set$quality,
ntree = 500)
Predicting the Test set results
y_pred1 = predict(classifier1, newdata = test_set[-12])
Making the Confusion Matrix
cm1 = table(actual=test_set[, 12],predicted= y_pred1)
(cm1)
## predicted
## actual 0 1 2
## 0 227 101 0
## 1 60 556 0
## 2 0 19 17
So random forest gives an accuracy of 81.6%.
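It is also worth checking which predictors the forest leans on most. randomForest provides a variable-importance plot; given the earlier correlation finding, alcohol would be expected near the top (a sketch, not run above):
# Plot the mean decrease in Gini impurity for each predictor
varImpPlot(classifier1)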
Kernel SVM
We need feature scaling for Kernel SVM
training_set1 = training_set
training_set1[-12] = scale(training_set[-12])
test_set1 = test_set
test_set1[-12] = scale(test_set[-12])
Fitting Kernel SVM to the Training set
library(e1071)
classifier2 = svm(formula = quality ~ .,
data = training_set1,
type = 'C-classification',
kernel = 'radial')
Predicting the Test set results
y_pred2 = predict(classifier2, newdata = test_set1[-12])
Making the Confusion Matrix
cm2 = table(actual=test_set1[, 12], predicted= y_pred2)
(cm2)
## predicted
## actual 0 1 2
## 0 198 130 0
## 1 77 539 0
## 2 1 35 0
So kernel SVM gives an accuracy of 75%.
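The radial kernel has its own hyperparameters (cost and gamma) that were left at their defaults here; e1071's tune() can grid-search them with cross-validation. A minimal sketch, where the grid values are illustrative assumptions and the search can be slow:
# Grid-search cost and gamma for the radial-kernel SVM (10-fold CV by default)
tuned = tune(svm, quality ~ ., data = training_set1,
             ranges = list(cost = c(0.1, 1, 10), gamma = c(0.01, 0.1, 1)))
tuned$best.parameters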
So the best model turns out to be the random forest, but there are a number of hyperparameters we could tune to try to improve it.
IMPROVING THE RANDOM FOREST MODEL
Applying Grid Search to find the best parameters
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:randomForest':
##
## margin
classifier3 = train(form = quality ~ ., data = training_set, method = 'rf')
(classifier3$bestTune)
## mtry
## 1 2
So according to the grid search, the random forest model is most optimized at mtry = 2.
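By default caret's train() only tries a handful of mtry values on bootstrap resamples; the search can be made explicit with trainControl() and a custom tuneGrid. A minimal sketch, where the fold count and grid values are illustrative assumptions:
# 5-fold cross-validation over a wider mtry grid
ctrl = trainControl(method = "cv", number = 5)
classifier_cv = train(quality ~ ., data = training_set, method = "rf",
                      trControl = ctrl, tuneGrid = expand.grid(mtry = c(2, 4, 6, 8)))
classifier_cv$bestTune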
set.seed(123)
classifier4 = randomForest(x = training_set[-12],
y = training_set$quality,
ntree = 1000, mtry = 2)
Predicting the Test set results
y_pred4 = predict(classifier4, newdata = test_set[-12])
Making the Confusion Matrix
cm3 = table(actual=test_set[, 12],predicted= y_pred4)
accuracy = ((223+554+17)/980)*100
(cm3)
## predicted
## actual 0 1 2
## 0 223 105 0
## 1 62 554 0
## 2 0 19 17
Conclusion
1. Random forest gives the highest accuracy at 81.6%, whereas kernel SVM and the decision tree give accuracies of 75% and 72.1% respectively.
2. The quality of wine is highly dependent on the amount of alcohol it contains.
3. Even after tuning the random forest model, the accuracy stays more or less the same (about 81.0% on the test set).
4. So a neural network might be better suited to this dataset for higher accuracy, which I have decided to try on a different dataset.