Wine Data Analysis Part III

Introduction

This is the last part of my wine data analysis. In Part I and Part II, I analyzed the differences between red and white wines and explored the relationship between each chemical property and wine quality. In this part, I build models to predict wine quality from the wines' chemical properties. I use three methods: ordinal logistic regression, a decision tree, and a random forest. Every model includes all eleven chemical properties as independent variables.

Data Preparation

I put the wines into three groups based on their quality score: wines scoring below 6 form group 1, wines scoring exactly 6 form group 2, and wines scoring above 6 form group 3. I also split each data set into a training set (70%) and a test set (30%) so that I can check how the models perform on wines they were not fitted on.

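# Packages used in this post: caTools (sample.split), MASS (polr), rpart, randomForest
library(caTools)
library(MASS)
library(rpart)
library(randomForest)

# Load the data and drop the X column, which is not a chemical property
red <- read.csv("wineQualityReds.csv")
white <- read.csv("wineQualityWhites.csv")
red$X <- NULL
white$X <- NULL

# Group the quality scores: 1 = below 6, 2 = exactly 6, 3 = above 6
red$category[red$quality<6] <- 1
red$category[red$quality==6] <- 2
red$category[red$quality>6] <- 3
white$category[white$quality<6] <- 1
white$category[white$quality==6] <- 2
white$category[white$quality>6] <- 3
red$category <- as.factor(red$category)
white$category <- as.factor(white$category)

# 70/30 train/test split that keeps the category proportions in both parts
set.seed(3000)
spl <- sample.split(red$category, SplitRatio = 0.7)
redtrain <- subset(red, spl==TRUE)
redtest <- subset(red, spl==FALSE)
set.seed(3000)
spl <- sample.split(white$category, SplitRatio = 0.7)
whitetrain <- subset(white, spl==TRUE)
whitetest <- subset(white, spl==FALSE)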
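
As a side note, the three-way grouping described above could also be written in a single step with base R's cut(). The sketch below is only an illustration: the quality vector is made up (it is not read from the wine files), and it assumes the scores are whole numbers, so that "below 6" is the same as "at most 5".

```r
# Toy quality scores (hypothetical values, not taken from the wine data)
quality <- c(3, 5, 6, 6, 7, 8, 4)

# Right-closed intervals: (-Inf, 5] -> 1, (5, 6] -> 2, (6, Inf] -> 3
category <- cut(quality, breaks = c(-Inf, 5, 6, Inf), labels = c(1, 2, 3))

# Count how many toy wines fall into each group
table(category)
```

Like the step-by-step version, cut() returns a factor, so the result could be used as the response in the models below.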

Ordinal Logistic Regression

redordinal <- polr(category ~ fixed.acidity+volatile.acidity+citric.acid+residual.sugar+chlorides+free.sulfur.dioxide+total.sulfur.dioxide+density+pH+sulphates+alcohol, 
                  data=redtrain)
redordinalpred <- predict(redordinal, redtest, type="class")
table(redordinalpred, redtest$category)

##               
## redordinalpred   1   2   3
##              1 164  68   2
##              2  58 112  40
##              3   1  11  23

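In the table, rows are the predicted categories and columns are the actual categories, so the correct predictions lie on the diagonal. The accuracy of the ordinal logistic regression model on the red wine test set is (164+112+23)/479 = 0.6242171.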
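
The same figure can be computed directly from the confusion table instead of by hand. The sketch below re-enters the table printed above as a matrix so that it runs on its own; the name red_ordinal_tab is my own choice. With the real objects, the table would simply be table(redordinalpred, redtest$category).

```r
# Confusion table of the red-wine ordinal model, copied from the printed output
# (rows = predicted category, columns = actual category)
red_ordinal_tab <- matrix(c(164,  68,  2,
                             58, 112, 40,
                              1,  11, 23),
                          nrow = 3, byrow = TRUE,
                          dimnames = list(predicted = c("1", "2", "3"),
                                          actual    = c("1", "2", "3")))

# Accuracy = correct predictions (the diagonal) / all wines in the test set
accuracy <- sum(diag(red_ordinal_tab)) / sum(red_ordinal_tab)
accuracy
```

The same two lines work for every confusion table in this post.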

whiteordinal <- polr(category ~ fixed.acidity+volatile.acidity+citric.acid+residual.sugar+chlorides+free.sulfur.dioxide+total.sulfur.dioxide+density+pH+sulphates+alcohol, 
                   data=whitetrain)
whiteordinalpred <- predict(whiteordinal, whitetest, type="class")
table(whiteordinalpred, whitetest$category)

##                 
## whiteordinalpred   1   2   3
##                1 260 124  23
##                2 222 431 185
##                3  10 104 110

The accuracy of the ordinal logistic regression model on the white wine test set is (260+431+110)/1469 = 0.5452689.

Decision Tree Model

redtree = rpart(category ~ fixed.acidity+volatile.acidity+citric.acid+residual.sugar+chlorides+free.sulfur.dioxide+total.sulfur.dioxide+density+pH+sulphates+alcohol, 
                data=redtrain, method="class")
redtreepred <- predict(redtree, redtest, type="class")
table(redtreepred, redtest$category)

##            
## redtreepred   1   2   3
##           1 153  61   4
##           2  66 109  30
##           3   4  21  31

The accuracy of the decision tree model on the red wine test set is (153+109+31)/479 = 0.611691.

whitetree = rpart(category ~ fixed.acidity+volatile.acidity+citric.acid+residual.sugar+chlorides+free.sulfur.dioxide+total.sulfur.dioxide+density+pH+sulphates+alcohol, 
                data=whitetrain, method="class")
whitetreepred <- predict(whitetree, whitetest, type="class")
table(whitetreepred, whitetest$category)

##              
## whitetreepred   1   2   3
##             1 264 121  19
##             2 217 444 186
##             3  11  94 113

The accuracy of the decision tree model on the white wine test set is (264+444+113)/1469 = 0.5588836.

Random Forest Model

redrf = randomForest(category ~ fixed.acidity+volatile.acidity+citric.acid+residual.sugar+chlorides+free.sulfur.dioxide+total.sulfur.dioxide+density+pH+sulphates+alcohol, 
                     data=redtrain)
redrfpred <- predict(redrf, redtest, type="class")
table(redrfpred, redtest$category)

##          
## redrfpred   1   2   3
##         1 172  55   0
##         2  48 127  27
##         3   3   9  38

The accuracy of the random forest model on the red wine test set is (172+127+38)/479 = 0.7035491.
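
Overall accuracy can hide differences between the three groups. As a rough sketch, the share of each actual category that the random forest predicts correctly can be read off the same table by dividing the diagonal by the column totals. The matrix below is re-entered from the printed output, and the names red_rf_tab and per_class are my own.

```r
# Confusion table of the red-wine random forest, copied from the printed output
# (rows = predicted category, columns = actual category)
red_rf_tab <- matrix(c(172,  55,  0,
                        48, 127, 27,
                         3,   9, 38),
                     nrow = 3, byrow = TRUE,
                     dimnames = list(predicted = c("1", "2", "3"),
                                     actual    = c("1", "2", "3")))

# For each actual category: correctly predicted wines / all wines in that category
per_class <- diag(red_rf_tab) / colSums(red_rf_tab)
round(per_class, 3)
```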

whiterf = randomForest(category ~ fixed.acidity+volatile.acidity+citric.acid+residual.sugar+chlorides+free.sulfur.dioxide+total.sulfur.dioxide+density+pH+sulphates+alcohol, 
                     data=whitetrain)
whiterfpred <- predict(whiterf, whitetest, type="class")
table(whiterfpred, whitetest$category)

##            
## whiterfpred   1   2   3
##           1 337  86   7
##           2 145 506 116
##           3  10  67 195

The accuracy of the random forest model on the white wine test set is (337+506+195)/1469 = 0.7066031.

Summary

In short, the random forest is the best of the three models for predicting wine quality from chemical properties: its accuracy exceeds 0.70 for both red and white wines, compared with roughly 0.55 to 0.62 for ordinal logistic regression and the decision tree. Keep in mind, however, that I simplified the problem by sorting the quality scores into three groups. Predicting the exact score is a harder task; it would need better models, and the accuracy would likely be lower.
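
To compare the models side by side, the accuracy figures reported in the sections above can be collected in one small table. This is only a convenience sketch: the counts are typed in from the confusion tables in this post, and the data frame name results is my own.

```r
# Correct predictions per model (diagonals of the confusion tables above),
# divided by the test-set sizes: 479 red wines and 1469 white wines
results <- data.frame(
  model = c("ordinal logistic regression", "decision tree", "random forest"),
  red   = c(164 + 112 + 23, 153 + 109 + 31, 172 + 127 + 38) / 479,
  white = c(260 + 431 + 110, 264 + 444 + 113, 337 + 506 + 195) / 1469
)

# Sort from highest to lowest red-wine accuracy
results <- results[order(-results$red), ]
print(results, digits = 3, row.names = FALSE)
```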

Thanks for reading.