This is an analysis of the UCI Machine Learning Repository dataset Wine Quality. I combine the Red and White wine data and create 3 different models to predict wine quality.
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
library(nnet)
library(corrplot)
red.df <- read.csv(file.path("/Users/robertslattery/Desktop/Data Analytics Applications/winequality/winequality-red.csv"), header = TRUE, sep = ";", stringsAsFactors = FALSE)
white.df <- read.csv(file.path("/Users/robertslattery/Desktop/Data Analytics Applications/winequality/winequality-white.csv"), header = TRUE, sep = ";", stringsAsFactors = FALSE)
sample.df <- rbind(red.df, white.df)
So First off lets look at the correlations between the data.
M <- cor(sample.df)
corrplot(M, method = "square")
Density seems to be strongly correlated to Alcohol percent and Residual Sugar, Total Sulfur Dioxide is also strongly corelated to several others, we can leave these out or include them in the analysis.
Create a train and test set of data in 80/20 ratio and slightly cheat for our confusion matrix later on.
set.seed(42)
holdout <- sample(nrow(sample.df), 0.8*nrow(sample.df), replace = F)
sample.train <- sample.df[holdout,]
sample.test <- sample.df[-holdout,]
sample.test[61,12] <- 9
Time to make our first model, this is a simple linear model using 8 variables that were highly ranked on a Boruta Feature Importance analysis. We can keep quality as a continous variable in this first analysis.
#Create a linear model using 8 variables
#set.seed(42)
linearModel <- lm(quality ~ alcohol + volatile.acidity + free.sulfur.dioxide + sulphates +
+ residual.sugar + citric.acid + pH + chlorides, data = sample.train)
linearPred <- data.frame(predict(linearModel,newdata=sample.test))
summary(linearModel)
##
## Call:
## lm(formula = quality ~ alcohol + volatile.acidity + free.sulfur.dioxide +
## sulphates + +residual.sugar + citric.acid + pH + chlorides,
## data = sample.train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4696 -0.4786 -0.0308 0.4610 3.1132
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.740618 0.259513 6.707 2.19e-11 ***
## alcohol 0.337723 0.009949 33.944 < 2e-16 ***
## volatile.acidity -1.384264 0.078874 -17.550 < 2e-16 ***
## free.sulfur.dioxide 0.001780 0.000665 2.677 0.00745 **
## sulphates 0.729250 0.078017 9.347 < 2e-16 ***
## residual.sugar 0.016515 0.002583 6.393 1.77e-10 ***
## citric.acid -0.136683 0.082619 -1.654 0.09811 .
## pH 0.157837 0.071634 2.203 0.02761 *
## chlorides 0.105145 0.376323 0.279 0.77995
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7407 on 5188 degrees of freedom
## Multiple R-squared: 0.2724, Adjusted R-squared: 0.2713
## F-statistic: 242.8 on 8 and 5188 DF, p-value: < 2.2e-16
We can see that citric.acid and chlorides are not important to the model but performance actually drops when they are removed so this model is ok. Our r squared value is 0.27, not terribly great.
But lets look a little deeper, lets Calculate the Root Mean Square Error and round the decimal predictions to get a confusion matrix.
#Create function to calculate RMSE to check accuracy of model.
rmse <- function(error)
{
sqrt(mean(error^2))
}
error <- linearPred - sample.test$quality
predRMSE <- rmse(error)
predRMSE
## [1] 0.7495618
#set.seed(42)
linearPred <- predict(linearModel,newdata=sample.test)
linearPred <- round(linearPred)
confusionMatrix(data = linearPred, reference = sample.test$quality)
## Warning in levels(reference) != levels(data): longer object length is not a
## multiple of shorter object length
## Warning in confusionMatrix.default(data = linearPred, reference =
## sample.test$quality): Levels are not in the same order for reference and
## data. Refactoring data to match.
## Confusion Matrix and Statistics
##
## Reference
## Prediction 3 4 5 6 7 8 9
## 3 0 0 0 0 0 0 0
## 4 0 1 1 1 0 0 0
## 5 4 26 184 84 6 3 1
## 6 1 22 233 438 176 21 2
## 7 0 0 0 31 48 16 1
## 8 0 0 0 0 0 0 0
## 9 0 0 0 0 0 0 0
##
## Overall Statistics
##
## Accuracy : 0.5162
## 95% CI : (0.4886, 0.5436)
## No Information Rate : 0.4262
## P-Value [Acc > NIR] : 4.17e-11
##
## Kappa : 0.217
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: 3 Class: 4 Class: 5 Class: 6 Class: 7
## Sensitivity 0.000000 0.0204082 0.4402 0.7906 0.20870
## Specificity 1.000000 0.9984013 0.8594 0.3901 0.95514
## Pos Pred Value NaN 0.3333333 0.5974 0.4905 0.50000
## Neg Pred Value 0.996154 0.9629915 0.7641 0.7150 0.84884
## Prevalence 0.003846 0.0376923 0.3215 0.4262 0.17692
## Detection Rate 0.000000 0.0007692 0.1415 0.3369 0.03692
## Detection Prevalence 0.000000 0.0023077 0.2369 0.6869 0.07385
## Balanced Accuracy 0.500000 0.5094047 0.6498 0.5903 0.58192
## Class: 8 Class: 9
## Sensitivity 0.00000 0.000000
## Specificity 1.00000 1.000000
## Pos Pred Value NaN NaN
## Neg Pred Value 0.96923 0.996923
## Prevalence 0.03077 0.003077
## Detection Rate 0.00000 0.000000
## Detection Prevalence 0.00000 0.000000
## Balanced Accuracy 0.50000 0.500000
Lower is Better so this is marginal on the RMSE result. The confusion Matrix shows 51% accuracy which is similar to the results from Logistic and Ordinal Regression and hardly better than a coinflip. However the range of quality prediction is probably quite good.
Lets try a Multinomial Logistic Regression using nnet.
#Set Quality to Categorical
sample.train$quality <- factor(sample.train$quality)
sample.test$quality <- factor(sample.test$quality)
#Build a Multinomial Logistic Model
formula <- "quality ~ alcohol + volatile.acidity + free.sulfur.dioxide + sulphates +
+ residual.sugar + citric.acid + pH + chlorides"
mlogit <- nnet::multinom(formula, data = sample.train )
## # weights: 70 (54 variable)
## initial value 10112.895045
## iter 10 value 7032.256608
## iter 20 value 6319.054043
## iter 30 value 5761.978364
## iter 40 value 5605.467592
## iter 50 value 5590.720966
## iter 60 value 5587.201580
## iter 70 value 5586.414679
## iter 80 value 5586.152717
## iter 90 value 5586.092859
## iter 100 value 5586.043591
## final value 5586.043591
## stopped after 100 iterations
output <- summary(mlogit)
print(output)
## Call:
## nnet::multinom(formula = formula, data = sample.train)
##
## Coefficients:
## (Intercept) alcohol volatile.acidity free.sulfur.dioxide sulphates
## 4 14.350244 -0.4004995 -2.041950 -0.07103896 2.524403
## 5 15.857224 -0.6183478 -3.228687 -0.02927586 4.797024
## 6 6.122122 0.2375441 -7.015575 -0.02861481 6.486124
## 7 -2.402020 0.8613607 -9.451825 -0.02897915 8.033250
## 8 -8.719364 1.1468820 -9.257779 -0.01376682 7.353041
## 9 -24.720285 1.0134455 -9.991008 -0.04540124 -1.124997
## residual.sugar citric.acid pH chlorides
## 4 -0.03850832 -1.847471 -1.4149963 -23.082376
## 5 -0.01792816 -1.303123 -1.3451585 -9.150199
## 6 0.02260898 -2.107629 -0.8404690 -7.385859
## 7 0.04868897 -1.905383 -0.6297689 -13.210430
## 8 0.08558745 -1.999104 -0.2693408 -17.410263
## 9 0.11776476 2.335004 4.7983145 -15.206927
##
## Std. Errors:
## (Intercept) alcohol volatile.acidity free.sulfur.dioxide sulphates
## 4 1.7023924 0.2132641 1.0367783 0.009622085 1.9871044
## 5 0.9108664 0.1997596 0.9590786 0.006717728 1.8697428
## 6 0.7991185 0.1983432 0.9771836 0.006591257 1.8687365
## 7 0.9953686 0.2008955 1.0433268 0.006902895 1.8802093
## 8 1.7426061 0.2122842 1.2910147 0.007033635 1.9549638
## 9 0.4246168 0.5865401 0.3579240 0.050554913 0.9116172
## residual.sugar citric.acid pH chlorides
## 4 0.05230186 1.362778 0.8843354 2.70046535
## 5 0.04709909 1.236562 0.7564718 1.68136065
## 6 0.04698566 1.241277 0.7486893 1.66451872
## 7 0.04771358 1.274603 0.7660478 2.34952747
## 8 0.05019110 1.428220 0.8650459 0.93503383
## 9 0.09793323 2.467312 1.9685448 0.05780455
##
## Residual Deviance: 11172.09
## AIC: 11280.09
Neato
Lets look at the Z-Statistic and P-Values of each variable on the different quality levels.
#Calculate Statistical Paramaters for each Variable in Model
z <- output$coefficients/output$standard.errors
p <- (1 - pnorm(abs(z), 0, 1))*2
#View Tables of stats for each Quality Level
qual.3 <- rbind(output$coefficients[1,], output$standard.errors[1,], z[1,], p[1,])
rownames(qual.3) <- c("Coefficients", "Std. Errors", "Z Stat", "P-Value")
knitr::kable(qual.3)
| (Intercept) | alcohol | volatile.acidity | free.sulfur.dioxide | sulphates | residual.sugar | citric.acid | pH | chlorides | |
|---|---|---|---|---|---|---|---|---|---|
| Coefficients | 14.350244 | -0.4004995 | -2.0419495 | -0.0710390 | 2.5244026 | -0.0385083 | -1.8474709 | -1.4149963 | -23.082376 |
| Std. Errors | 1.702392 | 0.2132641 | 1.0367783 | 0.0096221 | 1.9871044 | 0.0523019 | 1.3627777 | 0.8843354 | 2.700465 |
| Z Stat | 8.429457 | -1.8779503 | -1.9695141 | -7.3829075 | 1.2703925 | -0.7362707 | -1.3556656 | -1.6000674 | -8.547555 |
| P-Value | 0.000000 | 0.0603880 | 0.0488941 | 0.0000000 | 0.2039449 | 0.4615660 | 0.1752056 | 0.1095836 | 0.000000 |
qual.4 <- rbind(output$coefficients[2,], output$standard.errors[2,], z[2,], p[2,])
rownames(qual.4) <- c("Coefficients", "Std. Errors", "Z Stat", "P-Value")
knitr::kable(qual.4)
| (Intercept) | alcohol | volatile.acidity | free.sulfur.dioxide | sulphates | residual.sugar | citric.acid | pH | chlorides | |
|---|---|---|---|---|---|---|---|---|---|
| Coefficients | 15.8572239 | -0.6183478 | -3.2286868 | -0.0292759 | 4.7970244 | -0.0179282 | -1.303123 | -1.3451585 | -9.1501986 |
| Std. Errors | 0.9108664 | 0.1997596 | 0.9590786 | 0.0067177 | 1.8697428 | 0.0470991 | 1.236562 | 0.7564718 | 1.6813607 |
| Z Stat | 17.4089463 | -3.0954603 | -3.3664465 | -4.3580003 | 2.5656066 | -0.3806477 | -1.053827 | -1.7782005 | -5.4421391 |
| P-Value | 0.0000000 | 0.0019651 | 0.0007614 | 0.0000131 | 0.0102996 | 0.7034647 | 0.291962 | 0.0753709 | 0.0000001 |
qual.5 <- rbind(output$coefficients[3,], output$standard.errors[3,], z[3,], p[3,])
rownames(qual.5) <- c("Coefficients", "Std. Errors", "Z Stat", "P-Value")
knitr::kable(qual.5)
| (Intercept) | alcohol | volatile.acidity | free.sulfur.dioxide | sulphates | residual.sugar | citric.acid | pH | chlorides | |
|---|---|---|---|---|---|---|---|---|---|
| Coefficients | 6.1221218 | 0.2375441 | -7.0155747 | -0.0286148 | 6.4861236 | 0.0226090 | -2.1076294 | -0.8404690 | -7.3858590 |
| Std. Errors | 0.7991185 | 0.1983432 | 0.9771836 | 0.0065913 | 1.8687365 | 0.0469857 | 1.2412774 | 0.7486893 | 1.6645187 |
| Z Stat | 7.6610933 | 1.1976419 | -7.1793826 | -4.3413279 | 3.4708605 | 0.4811890 | -1.6979519 | -1.1225872 | -4.4372340 |
| P-Value | 0.0000000 | 0.2310565 | 0.0000000 | 0.0000142 | 0.0005188 | 0.6303822 | 0.0895168 | 0.2616129 | 0.0000091 |
qual.6 <- rbind(output$coefficients[4,], output$standard.errors[4,], z[4,], p[4,])
rownames(qual.6) <- c("Coefficients", "Std. Errors", "Z Stat", "P-Value")
knitr::kable(qual.6)
| (Intercept) | alcohol | volatile.acidity | free.sulfur.dioxide | sulphates | residual.sugar | citric.acid | pH | chlorides | |
|---|---|---|---|---|---|---|---|---|---|
| Coefficients | -2.4020196 | 0.8613607 | -9.451825 | -0.0289792 | 8.0332504 | 0.0486890 | -1.9053831 | -0.6297689 | -13.210430 |
| Std. Errors | 0.9953686 | 0.2008955 | 1.043327 | 0.0069029 | 1.8802093 | 0.0477136 | 1.2746030 | 0.7660478 | 2.349528 |
| Z Stat | -2.4131961 | 4.2876062 | -9.059314 | -4.1981157 | 4.2725299 | 1.0204426 | -1.4948836 | -0.8221013 | -5.622590 |
| P-Value | 0.0158133 | 0.0000181 | 0.000000 | 0.0000269 | 0.0000193 | 0.3075186 | 0.1349448 | 0.4110192 | 0.000000 |
qual.7 <- rbind(output$coefficients[5,], output$standard.errors[5,], z[5,], p[5,])
rownames(qual.7) <- c("Coefficients", "Std. Errors", "Z Stat", "P-Value")
knitr::kable(qual.7)
| (Intercept) | alcohol | volatile.acidity | free.sulfur.dioxide | sulphates | residual.sugar | citric.acid | pH | chlorides | |
|---|---|---|---|---|---|---|---|---|---|
| Coefficients | -8.7193637 | 1.1468820 | -9.257779 | -0.0137668 | 7.3530413 | 0.0855875 | -1.9991039 | -0.2693408 | -17.4102629 |
| Std. Errors | 1.7426061 | 0.2122842 | 1.291015 | 0.0070336 | 1.9549638 | 0.0501911 | 1.4282195 | 0.8650459 | 0.9350338 |
| Z Stat | -5.0036342 | 5.4025783 | -7.170933 | -1.9572838 | 3.7612161 | 1.7052316 | -1.3997176 | -0.3113601 | -18.6199283 |
| P-Value | 0.0000006 | 0.0000001 | 0.000000 | 0.0503141 | 0.0001691 | 0.0881512 | 0.1615979 | 0.7555269 | 0.0000000 |
qual.8 <- rbind(output$coefficients[6,], output$standard.errors[6,], z[6,], p[6,])
rownames(qual.8) <- c("Coefficients", "Std. Errors", "Z Stat", "P-Value")
knitr::kable(qual.8)
| (Intercept) | alcohol | volatile.acidity | free.sulfur.dioxide | sulphates | residual.sugar | citric.acid | pH | chlorides | |
|---|---|---|---|---|---|---|---|---|---|
| Coefficients | -24.7202851 | 1.0134455 | -9.991008 | -0.0454012 | -1.1249967 | 0.1177648 | 2.3350039 | 4.7983145 | -15.2069274 |
| Std. Errors | 0.4246168 | 0.5865401 | 0.357924 | 0.0505549 | 0.9116172 | 0.0979332 | 2.4673122 | 1.9685448 | 0.0578046 |
| Z Stat | -58.2178670 | 1.7278365 | -27.913772 | -0.8980580 | -1.2340670 | 1.2025006 | 0.9463755 | 2.4374932 | -263.0749083 |
| P-Value | 0.0000000 | 0.0840175 | 0.000000 | 0.3691546 | 0.2171780 | 0.2291696 | 0.3439571 | 0.0147895 | 0.0000000 |
Kind of interesting but lets look at the predicted probabilites for each level.
#Check Accuracy of Predictions if column with highest probability is taken as the predicted quality
head(pp <- fitted(mlogit))
## 3 4 5 6 7 8
## 5944 0.015151363 0.005981380 0.67235453 0.2916282 0.01421369 0.0006560692
## 6088 0.009940440 0.028291233 0.08905514 0.4517590 0.32942848 0.0901994100
## 1859 0.004564583 0.145623310 0.23164831 0.4602252 0.14017292 0.0162664184
## 5393 0.004669775 0.005308907 0.33408266 0.5216070 0.10980013 0.0245007018
## 4167 0.002288883 0.012861662 0.11909610 0.5330841 0.27672917 0.0553680979
## 3370 0.006381129 0.025348612 0.27440152 0.5410762 0.12958106 0.0231034137
## 9
## 5944 1.476938e-05
## 6088 1.326291e-03
## 1859 1.499243e-03
## 5393 3.079611e-05
## 4167 5.720224e-04
## 3370 1.080698e-04
Now that is interesting, few of the probabilites are over 50% so thats not very good but it seems like the cumulative probabilities for qualities + and - 1 from the predicted quality point are usually over 70%. So we can’t predict the specific quality value very well but we can predict the value within + or - 1 point very nicely.
Now if we wanted to be more concrete we could find out which quality level had the highest probability and assume that is our predicted value. That is done below.
q1 <- colnames(pp)[max.col(pp, ties.method = "first")]
Results from the Confusion Matrix
confusionMatrix(data = q1, reference = sample.train$quality)
## Warning in levels(reference) != levels(data): longer object length is not a
## multiple of shorter object length
## Warning in confusionMatrix.default(data = q1, reference = sample.train
## $quality): Levels are not in the same order for reference and data.
## Refactoring data to match.
## Confusion Matrix and Statistics
##
## Reference
## Prediction 3 4 5 6 7 8 9
## 3 2 0 0 0 0 0 0
## 4 0 1 0 0 0 0 0
## 5 10 94 1024 532 57 14 0
## 6 13 71 686 1636 646 102 1
## 7 0 1 10 113 146 37 1
## 8 0 0 0 0 0 0 0
## 9 0 0 0 0 0 0 0
##
## Overall Statistics
##
## Accuracy : 0.5405
## 95% CI : (0.5268, 0.5541)
## No Information Rate : 0.4389
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.2512
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: 3 Class: 4 Class: 5 Class: 6 Class: 7
## Sensitivity 0.0800000 0.0059880 0.5953 0.7172 0.17197
## Specificity 1.0000000 1.0000000 0.7967 0.4791 0.96274
## Pos Pred Value 1.0000000 1.0000000 0.5916 0.5185 0.47403
## Neg Pred Value 0.9955727 0.9680523 0.7992 0.6841 0.85621
## Prevalence 0.0048105 0.0321339 0.3310 0.4389 0.16336
## Detection Rate 0.0003848 0.0001924 0.1970 0.3148 0.02809
## Detection Prevalence 0.0003848 0.0001924 0.3331 0.6071 0.05926
## Balanced Accuracy 0.5400000 0.5029940 0.6960 0.5982 0.56735
## Class: 8 Class: 9
## Sensitivity 0.00000 0.0000000
## Specificity 1.00000 1.0000000
## Pos Pred Value NaN NaN
## Neg Pred Value 0.97056 0.9996152
## Prevalence 0.02944 0.0003848
## Detection Rate 0.00000 0.0000000
## Detection Prevalence 0.00000 0.0000000
## Balanced Accuracy 0.50000 0.5000000
54% Accuracy, slightly better than a coinflip. However as said earlier if we were interested in the range of quality and if we set a cutoff for Bad, Good and Great wines we could probably predict those ranges quite well with this model.
Finally lets finish with a SVM using a Radial Basis Function in Caret. Rather than spend many hours waiting for this to finish running again I simply set the parameters Cost and Sigma to 12 and 2 respectively, these were found to be the best performing parameters using Caret’s Tune Grid search which tries each combination of given parameters and selects the best combination. This model is also cross validated using 5 folds repeated 5 times.
#Building a SVM Model with Caret
cv.ctrl <- trainControl(method = "repeatedcv", repeats = 5,number = 5)
train.grid <- expand.grid(.C = c(12), .sigma = c(2))
set.seed(42)
train.model <-train(quality ~.,
data=sample.train,
nrounds=10,
method="svmRadial",
metric = "Accuracy",
trControl=cv.ctrl,
tuneGrid=train.grid
)
## Loading required package: kernlab
##
## Attaching package: 'kernlab'
## The following object is masked from 'package:ggplot2':
##
## alpha
print(train.model)
## Support Vector Machines with Radial Basis Function Kernel
##
## 5197 samples
## 11 predictor
## 7 classes: '3', '4', '5', '6', '7', '8', '9'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 5 times)
## Summary of sample sizes: 4158, 4158, 4157, 4158, 4157, 4158, ...
## Resampling results:
##
## Accuracy Kappa
## 0.627712 0.3988437
##
## Tuning parameter 'sigma' was held constant at a value of 2
##
## Tuning parameter 'C' was held constant at a value of 12
##
Using the training data we get an accuracy of 63%.
Let’s look at the results on our holdout.
#Make predictions on Test Data and check accuracy
svmpredict <- predict(train.model, sample.test, type = "raw")
confusionMatrix(data = svmpredict, reference = sample.test$quality)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 3 4 5 6 7 8 9
## 3 0 0 0 0 0 0 0
## 4 0 4 2 0 0 0 0
## 5 0 10 247 70 7 0 0
## 6 5 35 166 464 105 23 3
## 7 0 0 3 19 117 3 1
## 8 0 0 0 1 1 14 0
## 9 0 0 0 0 0 0 0
##
## Overall Statistics
##
## Accuracy : 0.6508
## 95% CI : (0.6242, 0.6767)
## No Information Rate : 0.4262
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.4499
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: 3 Class: 4 Class: 5 Class: 6 Class: 7 Class: 8
## Sensitivity 0.000000 0.081633 0.5909 0.8375 0.5087 0.35000
## Specificity 1.000000 0.998401 0.9014 0.5483 0.9757 0.99841
## Pos Pred Value NaN 0.666667 0.7395 0.5793 0.8182 0.87500
## Neg Pred Value 0.996154 0.965224 0.8230 0.8196 0.9023 0.97975
## Prevalence 0.003846 0.037692 0.3215 0.4262 0.1769 0.03077
## Detection Rate 0.000000 0.003077 0.1900 0.3569 0.0900 0.01077
## Detection Prevalence 0.000000 0.004615 0.2569 0.6162 0.1100 0.01231
## Balanced Accuracy 0.500000 0.540017 0.7461 0.6929 0.7422 0.67421
## Class: 9
## Sensitivity 0.000000
## Specificity 1.000000
## Pos Pred Value NaN
## Neg Pred Value 0.996923
## Prevalence 0.003077
## Detection Rate 0.000000
## Detection Prevalence 0.000000
## Balanced Accuracy 0.500000
Accuracy of about 65% on the holdout data, not bad at all. With a larger CV, more expansive tune.grid, and more iterations we could probably get the accuracy above 70% but I enjoy having a non-melted CPU so this will have to do for now. Similar to the previous models, if we were predicting ranges of quality this would be a very good model. However predicting the exact quality of each wine seems to be very difficult with the data we have.
It’s also likely that transforming the data with the centering and scaling tools in Caret will improve the performance of the models since the data will be standardized and normal and this will improve the models.
We can also treat quality as an Ordinal ranking and try to predict the rank of each wine. We can use Ordinal Regression to do multiclass classification when each has class has a specific ranking in the group.
library(MASS)
library(ordinal)
##
## Attaching package: 'ordinal'
## The following object is masked from 'package:kernlab':
##
## convergence
library(erer)
## Loading required package: lmtest
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
sample.train$quality <- ordered(sample.train$quality, levels = c(1,2,3,4,5,6,7,8,9))
sample.test$quality <- ordered(sample.test$quality, levels = c(1,2,3,4,5,6,7,8,9))
om1 <- polr(quality ~ alcohol + volatile.acidity + free.sulfur.dioxide + sulphates +
+ residual.sugar + citric.acid + pH + chlorides, data = sample.train, Hess = TRUE)
summary(om1)
## Call:
## polr(formula = quality ~ alcohol + volatile.acidity + free.sulfur.dioxide +
## sulphates + +residual.sugar + citric.acid + pH + chlorides,
## data = sample.train, Hess = TRUE)
##
## Coefficients:
## Value Std. Error t value
## alcohol 0.909147 0.028873 31.4883
## volatile.acidity -3.758080 0.217134 -17.3076
## free.sulfur.dioxide 0.005056 0.001807 2.7974
## sulphates 2.001866 0.205798 9.7273
## residual.sugar 0.041353 0.006747 6.1295
## citric.acid -0.351863 0.214352 -1.6415
## pH 0.507271 0.188250 2.6947
## chlorides 0.309260 0.974792 0.3173
##
## Intercepts:
## Value Std. Error t value
## 1|2 -1.2583 1.4520 -0.8666
## 2|3 -0.4521 1.3749 -0.3288
## 3|4 5.2093 0.7198 7.2376
## 4|5 7.3295 0.6967 10.5201
## 5|6 10.5080 0.6993 15.0271
## 6|7 13.0736 0.7094 18.4299
## 7|8 15.3828 0.7187 21.4025
## 8|9 19.8207 1.0080 19.6639
##
## Residual Deviance: 11397.02
## AIC: 11429.02
OrdinalPred <- predict(om1,newdata=sample.test)
confusionMatrix(data = OrdinalPred, reference = sample.test$quality)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 2 3 4 5 6 7 8 9
## 1 0 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0 0
## 3 0 0 0 0 0 0 0 0 0
## 4 0 0 0 0 1 0 0 0 0
## 5 0 0 4 29 230 124 10 3 1
## 6 0 0 1 20 187 402 175 23 2
## 7 0 0 0 0 0 28 45 14 1
## 8 0 0 0 0 0 0 0 0 0
## 9 0 0 0 0 0 0 0 0 0
##
## Overall Statistics
##
## Accuracy : 0.5208
## 95% CI : (0.4932, 0.5482)
## No Information Rate : 0.4262
## P-Value [Acc > NIR] : 4.318e-12
##
## Kappa : 0.2311
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: 1 Class: 2 Class: 3 Class: 4 Class: 5
## Sensitivity NA NA 0.000000 0.0000000 0.5502
## Specificity 1 1 1.000000 0.9992006 0.8061
## Pos Pred Value NA NA NaN 0.0000000 0.5736
## Neg Pred Value NA NA 0.996154 0.9622787 0.7909
## Prevalence 0 0 0.003846 0.0376923 0.3215
## Detection Rate 0 0 0.000000 0.0000000 0.1769
## Detection Prevalence 0 0 0.000000 0.0007692 0.3085
## Balanced Accuracy NA NA 0.500000 0.4996003 0.6782
## Class: 6 Class: 7 Class: 8 Class: 9
## Sensitivity 0.7256 0.19565 0.00000 0.000000
## Specificity 0.4531 0.95981 1.00000 1.000000
## Pos Pred Value 0.4963 0.51136 NaN NaN
## Neg Pred Value 0.6898 0.84736 0.96923 0.996923
## Prevalence 0.4262 0.17692 0.03077 0.003077
## Detection Rate 0.3092 0.03462 0.00000 0.000000
## Detection Prevalence 0.6231 0.06769 0.00000 0.000000
## Balanced Accuracy 0.5894 0.57773 0.50000 0.500000
52% Accuracy which is on par with the Linear and Logistic results.