This is the solution for your homework #1. The homework questions are available here: http://rpubs.com/malshe/224662
library(caret)
library(dplyr)
library(ggplot2)  # used later for plotting variable importance
red <- read.csv("/Volumes/Transcend/Dropbox/Work/Teaching/DA 6813/Course Documents/Data/Wine data/winequality-red.csv", stringsAsFactors = F)
red$wine <- "red"
white <- read.csv("/Volumes/Transcend/Dropbox/Work/Teaching/DA 6813/Course Documents/Data/Wine data/winequality-white.csv", stringsAsFactors = F)
white$wine <- "white"
wine <- rbind(red,white)
wine$wine <- as.factor(wine$wine)
Get the summary of the data
summary(wine)
## fixed_acidity volatile_acidity citric_acid residual_sugar
## Min. : 3.800 Min. :0.0800 Min. :0.0000 Min. : 0.600
## 1st Qu.: 6.400 1st Qu.:0.2300 1st Qu.:0.2500 1st Qu.: 1.800
## Median : 7.000 Median :0.2900 Median :0.3100 Median : 3.000
## Mean : 7.215 Mean :0.3397 Mean :0.3186 Mean : 5.443
## 3rd Qu.: 7.700 3rd Qu.:0.4000 3rd Qu.:0.3900 3rd Qu.: 8.100
## Max. :15.900 Max. :1.5800 Max. :1.6600 Max. :65.800
## chlorides free_sulfur_dioxide total_sulfur_dioxide
## Min. :0.00900 Min. : 1.00 Min. : 6.0
## 1st Qu.:0.03800 1st Qu.: 17.00 1st Qu.: 77.0
## Median :0.04700 Median : 29.00 Median :118.0
## Mean :0.05603 Mean : 30.53 Mean :115.7
## 3rd Qu.:0.06500 3rd Qu.: 41.00 3rd Qu.:156.0
## Max. :0.61100 Max. :289.00 Max. :440.0
## density pH sulphates alcohol
## Min. :0.9871 Min. :2.720 Min. :0.2200 Min. : 8.00
## 1st Qu.:0.9923 1st Qu.:3.110 1st Qu.:0.4300 1st Qu.: 9.50
## Median :0.9949 Median :3.210 Median :0.5100 Median :10.30
## Mean :0.9947 Mean :3.219 Mean :0.5313 Mean :10.49
## 3rd Qu.:0.9970 3rd Qu.:3.320 3rd Qu.:0.6000 3rd Qu.:11.30
## Max. :1.0390 Max. :4.010 Max. :2.0000 Max. :14.90
## quality wine
## Min. :3.000 red :1599
## 1st Qu.:5.000 white:4898
## Median :6.000
## Mean :5.818
## 3rd Qu.:6.000
## Max. :9.000
Our dependent variable is quality. As we will be using it as a categorical variable in 3 of the 4 models, let’s look at its frequency distribution.
table(wine$quality)
##
## 3 4 5 6 7 8 9
## 30 216 2138 2836 1079 193 5
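The same distribution can also be drawn as a bar chart; this is a small optional sketch of my own (ggplot2 is loaded at the top of this document).
# Optional: visualize the frequency table above as a bar chart
ggplot(wine, aes(x = factor(quality))) +
  geom_bar() +
  labs(x = "quality", y = "count")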
Clearly, the categories at the extremes have very few observations. This will lead to problems in correctly categorizing these values. To overcome this problem, I will create two new variables. For support vector machines (SVM) and the multinomial logistic model (MNL), I will create a new variable labeled quality.c, a factor variable with groups 3, 4, 8, and 9 combined into a single group. We can label this group anything we want because the labeling is meaningless for these methods. I will label it 3489, thus preserving the knowledge that it came from 4 separate groups.
For ordinal regression, we have a little more information about the ordering of the groups. I will create a new variable quality.o. In this variable, I will combine 3, 4, and 5 and label the result 5 to indicate that this group is 5 and lower. Similarly, I will combine 7, 8, and 9 and label the result 7 to indicate that this group is 7 and above. Thus, we will effectively have only 3 groups.
Clearly, this will make model comparison a little tougher, but we have to give each model the best chance to perform, even at this coarser level of analysis.
wine <- wine %>%
mutate(quality.c = quality) %>%
mutate(quality.o = quality) %>%
mutate(quality.c = replace(quality.c, quality %in% c(3,4,8,9), 3489)) %>%
mutate(quality.o = replace(quality.o, quality %in% c(3,4), 5)) %>%
mutate(quality.o = replace(quality.o, quality %in% c(8,9), 7)) %>%
mutate(quality.c = factor(quality.c)) %>%
mutate(quality.o = ordered(quality.o))
str(wine$quality.c)
## Factor w/ 4 levels "5","6","7","3489": 1 1 1 2 1 1 1 3 3 1 ...
str(wine$quality.o)
## Ord.factor w/ 3 levels "5"<"6"<"7": 1 1 1 2 1 1 1 3 3 1 ...
Now that we have new dependent variables, let’s take a look at the independent variables. As we are doing a predictive analysis (we have no plan to do inference), let’s understand the distribution and correlations of the variables and explore the need for transformation.
For this we will first get the descriptive statistics and correlations for all the numeric variables.
cormat <- round(cor(as.matrix(wine[,-c(13,14,15)])),2)
cormat[upper.tri(cormat)] <- ""
as.data.frame(cormat)
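The heat map referred to below was produced separately; here is a minimal sketch of how it could be drawn, assuming the corrplot package (crosses mark correlations that are not significant at the 5% level).
# Heat map of the correlations; non-significant cells are crossed out
library(corrplot)
num_vars <- wine[,-c(13,14,15)]                     # numeric columns only
p_mat <- cor.mtest(num_vars, conf.level = 0.95)$p   # pairwise p-values
corrplot(cor(num_vars), method = "color", type = "lower",
         p.mat = p_mat, sig.level = 0.05, insig = "pch")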
In the heat map above, the crosses indicate non-significant correlations. From the correlations it seems that most variables carry their own unique information. However, it appears that quality is strongly related to only a few variables, which is not great news. At this point one could consider transformations to increase these correlations, but I am not going to do that as it would make this exercise broader than I intended.
Let’s get the descriptive statistics.
library(moments)
desc <- as.data.frame(cbind(Mean = sapply(wine[,-c(13,14,15)], mean),
Median = sapply(wine[,-c(13,14,15)], median),
Std.Dev = sapply(wine[,-c(13,14,15)], sd),
CV = sapply(wine[,-c(13,14,15)], sd)/ sapply(wine[,-c(13,14,15)], mean),
Skewness = sapply(wine[,-c(13,14,15)], skewness),
Kurtosis = sapply(wine[,-c(13,14,15)], kurtosis)))
round(desc,2)
## Mean Median Std.Dev CV Skewness Kurtosis
## fixed_acidity 7.22 7.00 1.30 0.18 1.72 8.06
## volatile_acidity 0.34 0.29 0.16 0.48 1.49 5.82
## citric_acid 0.32 0.31 0.15 0.46 0.47 5.39
## residual_sugar 5.44 3.00 4.76 0.87 1.44 7.35
## chlorides 0.06 0.05 0.04 0.63 5.40 53.86
## free_sulfur_dioxide 30.53 29.00 17.75 0.58 1.22 10.90
## total_sulfur_dioxide 115.74 118.00 56.52 0.49 0.00 2.63
## density 0.99 0.99 0.00 0.00 0.50 9.60
## pH 3.22 3.21 0.16 0.05 0.39 3.37
## sulphates 0.53 0.51 0.15 0.28 1.80 11.65
## alcohol 10.49 10.30 1.19 0.11 0.57 2.47
## quality 5.82 6.00 0.87 0.15 0.19 3.23
The most interesting column for me is CV (the coefficient of variation), the ratio of the standard deviation to the mean. Certain variables have a very low CV (e.g., 0 or 0.05), meaning that the standard deviation is extremely small compared to the mean. Clearly we need some scaling here to remove the effect of the mean. One way to do that is to mean-center all the variables so that every variable has a mean of zero. That still retains the low standard deviations, however. To overcome this issue, we can also divide each variable by its standard deviation. This way we normalize the data so that every variable has mean = 0 and standard deviation = 1.
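As a quick check (my own addition), R’s scale() does exactly this centering and dividing; alcohol is used here as an example.
# scale() is equivalent to subtracting the mean and dividing by the SD
z_manual <- (wine$alcohol - mean(wine$alcohol)) / sd(wine$alcohol)
all.equal(as.numeric(scale(wine$alcohol)), z_manual)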
Will that help with reducing skewness and kurtosis? Think about it for a moment before you read on.
I am going to scale the numeric variables.
wine2 <- wine
wine2[,c(1:12)] <- scale(wine[,c(1:12)])
desc2 <- as.data.frame(cbind(Mean = sapply(wine2[,c(1:12)], mean),
Median = sapply(wine2[,c(1:12)], median),
Std.Dev = sapply(wine2[,c(1:12)], sd),
Skewness = sapply(wine2[,c(1:12)], skewness),
Kurtosis = sapply(wine2[,c(1:12)], kurtosis)))
round(desc2,2)
## Mean Median Std.Dev Skewness Kurtosis
## fixed_acidity 0 -0.17 1 1.72 8.06
## volatile_acidity 0 -0.30 1 1.49 5.82
## citric_acid 0 -0.06 1 0.47 5.39
## residual_sugar 0 -0.51 1 1.44 7.35
## chlorides 0 -0.26 1 5.40 53.86
## free_sulfur_dioxide 0 -0.09 1 1.22 10.90
## total_sulfur_dioxide 0 0.04 1 0.00 2.63
## density 0 0.06 1 0.50 9.60
## pH 0 -0.05 1 0.39 3.37
## sulphates 0 -0.14 1 1.80 11.65
## alcohol 0 -0.16 1 0.57 2.47
## quality 0 0.21 1 0.19 3.23
Scaling doesn’t affect skewness or kurtosis. In order to alter these two moments, we need non-linear transformations such as logarithmic or square-root transformations. I’m going to do this through trial and error.
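As an illustration of that trial and error (my own addition), here is how one might compare a log and a square-root transform for a single variable, chlorides:
# Compare skewness and kurtosis of chlorides under candidate transformations
library(moments)
trial <- sapply(list(raw = wine$chlorides,
                     log = log(wine$chlorides),
                     sqrt = sqrt(wine$chlorides)),
                function(x) c(Skewness = skewness(x), Kurtosis = kurtosis(x)))
round(trial, 2)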
A normal distribution has skewness = 0 and kurtosis = 3. total_sulfur_dioxide, pH, alcohol, and quality seem to have this shape (a good idea is to plot the distributions). I am concerned about fixed_acidity, chlorides, free_sulfur_dioxide, density, and sulphates due to high kurtosis (and skewness in some cases). Let’s take their log transform first and then scale these variables.
wine2[,c(1,5,6,8,10)] <- scale(log(wine[,c(1,5,6,8,10)]))
moments::skewness(wine2[,c(1,5,6,8,10)])
## fixed_acidity chlorides free_sulfur_dioxide
## 0.8889319 0.8762698 -0.8340045
## density sulphates
## 0.4672599 0.4048986
moments::kurtosis(wine2[,c(1,5,6,8,10)])
## fixed_acidity chlorides free_sulfur_dioxide
## 4.896783 5.305355 3.429675
## density sulphates
## 9.008338 3.701296
Except for density, the remaining 4 variables benefited from the log transformation. After plotting the distribution of density, it appears that its high kurtosis might be due to a few extreme values. Although the log transformation should have tamed them, it didn’t work out. So I replaced the extreme values (there were 3 such observations) with the next highest observation and then took the log. As it turns out, the transformation paid off.
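For reference, the density plot mentioned above can be drawn with a quick base-R sketch:
# Distribution of density; the long right tail comes from a few extreme values
hist(wine$density, breaks = 100, main = "Distribution of density", xlab = "density")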
wine2$density <- ifelse(wine$density > 1.00369, 1.00369, wine$density) # replace the values > 1.00369 by 1.00369
wine2$density <- scale(log(wine2$density))
moments::skewness(wine2$density)
## [1] -0.02258351
moments::kurtosis(wine2$density)
## [1] 2.254121
Now we have a dataset wine2 which has all the transformed variables. I am going to use it for the rest of the analysis. I am using the caret package for a few of the models; it doesn’t matter which packages you used.
model.lm <- caret::train(quality ~ ., data = wine2[,-c(14,15)], method = "lm")
summary(model.lm)
##
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.8267 -0.5347 -0.0440 0.5187 3.4109
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.317675 0.051352 6.186 6.54e-10 ***
## fixed_acidity 0.123372 0.023010 5.362 8.54e-08 ***
## volatile_acidity -0.275112 0.015059 -18.269 < 2e-16 ***
## citric_acid -0.004644 0.012951 -0.359 0.7199
## residual_sugar 0.304732 0.031516 9.669 < 2e-16 ***
## chlorides -0.039373 0.015838 -2.486 0.0129 *
## free_sulfur_dioxide 0.194606 0.016048 12.126 < 2e-16 ***
## total_sulfur_dioxide -0.155376 0.020783 -7.476 8.65e-14 ***
## density -0.317525 0.050289 -6.314 2.90e-10 ***
## pH 0.080127 0.016593 4.829 1.40e-06 ***
## sulphates 0.120871 0.012814 9.433 < 2e-16 ***
## alcohol 0.305010 0.025422 11.998 < 2e-16 ***
## winewhite -0.421383 0.066724 -6.315 2.87e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8328 on 6484 degrees of freedom
## Multiple R-squared: 0.3077, Adjusted R-squared: 0.3065
## F-statistic: 240.2 on 12 and 6484 DF, p-value: < 2.2e-16
I get a decent model with adjusted R-squared = 0.3065. The model is highly significant, with the p-value of the F-statistic < 2.2e-16. Except for citric_acid, all the other variables are statistically significant at conventional levels. Note that I also included the variable wine in the data set while estimating the model. As it turns out, white wines have on average a lower rating than red wines (all else equal).
I would have liked to tweak this model a little to check whether any interactions are present. However, that was not in the scope of the homework, so I will leave it for you to explore in the future.
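Purely as an illustration of such a tweak (not part of the homework), one could add an interaction, say between alcohol and the wine type, and refit:
# Hypothetical example: allow the effect of alcohol to differ by wine type
model.lm.int <- caret::train(quality ~ . + alcohol:wine,
                             data = wine2[,-c(14,15)], method = "lm")
summary(model.lm.int)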
For MNL, we will use quality.c as the dependent variable. Recall that this is a categorical variable with groups 3, 4, 8, and 9 bundled together.
table(wine2$quality.c)
##
## 5 6 7 3489
## 2138 2836 1079 444
Right up front I can tell you that this bundling is going to cause some problems, because we have assumed no ordering in quality when in reality there is an ordering. We will repeat this analysis with quality.o later.
model.mnl.c <- nnet::multinom(quality.c ~ ., data = wine2[,-c(12,15)], maxit = 100)
## # weights: 56 (39 variable)
## initial value 9006.754464
## iter 10 value 7029.581084
## iter 20 value 6959.947641
## iter 30 value 6788.213119
## iter 40 value 6728.847472
## final value 6698.104301
## converged
sum.mnl.c <- summary(model.mnl.c)
Get the z-statistics and p-values
z <- sum.mnl.c$coefficients/sum.mnl.c$standard.errors
p <- (1 - pnorm(abs(z), 0, 1))*2
# Coefficients for quality = 6
coeff.mnl.c1 <- rbind(sum.mnl.c$coefficients[1,],sum.mnl.c$standard.errors[1,],z[1,],p[1,])
rownames(coeff.mnl.c1) <- c("Coefficient","Std. Errors","z stat","p value")
# Coefficients for quality = 7
coeff.mnl.c2 <- rbind(sum.mnl.c$coefficients[2,],sum.mnl.c$standard.errors[2,],z[2,],p[2,])
rownames(coeff.mnl.c2) <- c("Coefficient","Std. Errors","z stat","p value")
# Coefficients for quality = 3489
coeff.mnl.c3 <- rbind(sum.mnl.c$coefficients[3,],sum.mnl.c$standard.errors[3,],z[3,],p[3,])
rownames(coeff.mnl.c3) <- c("Coefficient","Std. Errors","z stat","p value")
Coefficient tables for quality.c = 6, quality.c = 7, and quality.c = 3489 (coeff.mnl.c1, coeff.mnl.c2, and coeff.mnl.c3, respectively).
It’s a good exercise to compare the coefficients of the MNL and linear regressions. However, the MNL has 3 equations, with different variables showing up as significant across the 3 equations. What can you do then? One way to handle this is to first identify the variables that are significant across the board and retain them. Then look at the ones that are non-significant in all 3 equations; most likely these variables can be dropped from the next iteration of the MNL. The variables that are significant in only some of the equations are the most difficult ones. You will have to make a call depending on their overall importance: if you drop them and the performance of the model deteriorates severely, perhaps adding them back is the better choice. You can also be conservative and keep any variable that is significant at least once across the 3 equations.
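A sketch of such a second pass (citric_acid is dropped here purely for illustration; check your own coefficient tables before deciding):
# Hypothetical second pass: refit the MNL without one candidate variable
model.mnl.c2 <- nnet::multinom(quality.c ~ . - citric_acid,
                               data = wine2[,-c(12,15)], maxit = 100)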
MNL model performance can be assessed on several different metrics; I will use classification accuracy as the relevant metric. Below I get the confusion matrix using the same sample that we used for estimating the model. You could instead do this with cross-validation (a sketch follows the output).
caret::confusionMatrix(reference = wine2$quality.c,
                       predict(model.mnl.c, newdata = wine2[,-c(12,15)], type = "class"))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 5 6 7 3489
## 5 1274 647 67 159
## 6 848 2017 782 217
## 7 9 161 229 51
## 3489 7 11 1 17
##
## Overall Statistics
##
## Accuracy : 0.5444
## 95% CI : (0.5322, 0.5566)
## No Information Rate : 0.4365
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.2649
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: 5 Class: 6 Class: 7 Class: 3489
## Sensitivity 0.5959 0.7112 0.21223 0.038288
## Specificity 0.7997 0.4955 0.95921 0.996861
## Pos Pred Value 0.5934 0.5220 0.50889 0.472222
## Neg Pred Value 0.8014 0.6889 0.85943 0.933911
## Prevalence 0.3291 0.4365 0.16608 0.068339
## Detection Rate 0.1961 0.3105 0.03525 0.002617
## Detection Prevalence 0.3305 0.5947 0.06926 0.005541
## Balanced Accuracy 0.6978 0.6034 0.58572 0.517575
The model accuracy is 54%, which is not too bad. Note that class 6 has a prevalence of around 44%, so you would be right about 44% of the time simply by labeling every wine a 6. However, the model did a particularly poor job of detecting classes 7 and 3489. But this is the first MNL model you have estimated; you would ideally tweak it to get better results.
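As noted above, the same accuracy could instead be estimated with cross-validation; here is a minimal sketch assuming caret’s multinom wrapper (which calls nnet::multinom under the hood):
# 10-fold cross-validated accuracy for the MNL
ctrl.cv <- caret::trainControl(method = "cv", number = 10)
set.seed(111)   # arbitrary seed for reproducibility
model.mnl.cv <- caret::train(quality.c ~ ., data = wine2[,-c(12,15)],
                             method = "multinom", trControl = ctrl.cv,
                             trace = FALSE)
model.mnl.cv$results   # accuracy and kappa averaged over the folds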
I am going to use a radial basis function (RBF) kernel for the SVM. The RBF kernel takes two hyperparameters, and we need to tune them before we can estimate the SVM. Tuning takes a long time, so in this example I won’t actually tune it because I have already done it previously using the e1071 package. I am also writing down the code using the caret package in case you are interested.
Use the following code to tune and estimate the SVM using the e1071 package.
library(e1071)
library(doParallel)
registerDoParallel(cores = 4)
# First pass
set.seed(1234)
svm.tune <- tune.svm(quality.c ~ ., data = wine2[,-c(12,15)],
gamma = c(0.05,0.1,0.5,1,2),
cost = 10^(0:3))
svm.tune$best.parameters
Using the best parameters from this first pass (gamma = 0.05, cost = 10), we could do a second pass to tune the hyperparameters further, but I won’t do it because of the time it takes to execute. The following code is sample code only and is not executed here.
set.seed(2345)
svm.tune2 <- tune.svm(quality.c ~ ., data = wine2[,-c(12,15)],
gamma = seq(svm.tune$best.parameters$gamma - 0.1,
svm.tune$best.parameters$gamma + 0.1,
by = 0.01),
cost = seq(svm.tune$best.parameters$cost - 1,
svm.tune$best.parameters$cost + 1,
by = .1))
And finally, using the best hyperparameters from the second pass we can estimate the model. In the next code I will use the hyperparameters from the first pass.
model.svm.c <- e1071::svm(quality.c ~ ., data = wine2[,-c(12,15)],
gamma = 0.05,
cost = 10)
summary(model.svm.c)
##
## Call:
## svm(formula = quality.c ~ ., data = wine2[, -c(12, 15)], gamma = 0.05,
## cost = 10)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 10
## gamma: 0.05
##
## Number of Support Vectors: 5510
##
## ( 1649 2392 1027 442 )
##
##
## Number of Classes: 4
##
## Levels:
## 5 6 7 3489
caret::confusionMatrix(reference = wine2$quality.c,
                       predict(model.svm.c, newdata = wine2[,-c(12,15)]))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 5 6 7 3489
## 5 1485 509 39 132
## 6 638 2214 672 179
## 7 10 109 368 72
## 3489 5 4 0 61
##
## Overall Statistics
##
## Accuracy : 0.6354
## 95% CI : (0.6235, 0.6471)
## No Information Rate : 0.4365
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.418
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: 5 Class: 6 Class: 7 Class: 3489
## Sensitivity 0.6946 0.7807 0.34106 0.137387
## Specificity 0.8440 0.5933 0.96475 0.998513
## Pos Pred Value 0.6859 0.5979 0.65832 0.871429
## Neg Pred Value 0.8493 0.7774 0.88026 0.940408
## Prevalence 0.3291 0.4365 0.16608 0.068339
## Detection Rate 0.2286 0.3408 0.05664 0.009389
## Detection Prevalence 0.3332 0.5700 0.08604 0.010774
## Balanced Accuracy 0.7693 0.6870 0.65290 0.567950
Well, it looks like SVM classification on these data is better than MNL! However, as I didn’t do a train-vs-test comparison, I can’t say with confidence whether there is overfitting. We will test this below with caret.
Which variables are important in the SVM? I couldn’t figure out a way to get variable importance from the e1071 package, so I am using the caret package for this. The following code would tune the model, but it is not actually run in this tutorial because it would take a lot of time.
ctrl <- caret::trainControl(method = "cv")
set.seed(1492)
grid <- expand.grid(sigma = 10^(-1:4),
C = 10^(0:4))
model.svm.c <- train(quality.c ~ ., data = wine2[,-c(12,15)],
method = "svmRadial",
tuneGrid = grid,
trControl=ctrl)
I will use sigma = 0.05 and cost = 10 and estimate the model. I will then use varImp in caret to get the variable importance.
ctrl <- caret::trainControl(method = "cv")
grid <- expand.grid(sigma = 0.05, C = 10)
set.seed(89753)
model.svm.c <- train(quality.c ~ ., data = wine2[,-c(12,15)],
method = "svmRadial", tuneGrid = grid, trControl=ctrl)
## Loading required package: kernlab
##
## Attaching package: 'kernlab'
## The following object is masked from 'package:ggplot2':
##
## alpha
var.imp <- varImp(model.svm.c, scale = T)
var.imp2 <- var.imp$importance
var.imp2$Chemistry <- rownames(var.imp2)
var.imp2 <- reshape2::melt(var.imp2, id = c("Chemistry"))
ggplot(data = var.imp2, aes(x = Chemistry, y = value)) +
geom_bar(stat = "identity", aes(fill = Chemistry)) + facet_wrap(~variable) +
theme_bw() + xlab(NULL) + ylab("Importance Weight") +
theme(axis.text.x = element_text(angle = 60, hjust = 1),
legend.position="none")
Similar to MNL, each predictor has a different effectiveness in predicting different classes. Clearly, alcohol content looks like the most important predictor for all the classes.
The next part is purely to satisfy my curiosity. You can ignore it and go directly to the ordinal logistic regression. I won’t explain this code, as you are capable of understanding it.
set.seed(1234)
trind <- caret::createDataPartition(wine2$quality.c, p = 4/5, list = FALSE)
svm.train <- wine2[trind,-c(12,15)]
svm.test <- wine2[-trind,-c(12,15)]
model.svm.c2 <- train(quality.c ~ ., data = svm.train,
method = "svmRadial",
tuneGrid = grid,
trControl=ctrl)
# In-sample accuracy
caret::confusionMatrix(reference = svm.train$quality.c,
predict(model.svm.c2, newdata = svm.train))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 5 6 7 3489
## 5 1188 387 25 111
## 6 506 1779 515 133
## 7 11 96 324 59
## 3489 6 7 0 53
##
## Overall Statistics
##
## Accuracy : 0.6431
## 95% CI : (0.6299, 0.6561)
## No Information Rate : 0.4363
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.4326
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: 5 Class: 6 Class: 7 Class: 3489
## Sensitivity 0.6943 0.7840 0.37500 0.14888
## Specificity 0.8501 0.6063 0.96172 0.99732
## Pos Pred Value 0.6943 0.6065 0.66122 0.80303
## Neg Pred Value 0.8501 0.7839 0.88535 0.94098
## Prevalence 0.3290 0.4363 0.16615 0.06846
## Detection Rate 0.2285 0.3421 0.06231 0.01019
## Detection Prevalence 0.3290 0.5640 0.09423 0.01269
## Balanced Accuracy 0.7722 0.6952 0.66836 0.57310
# Out-of sample accuracy
caret::confusionMatrix(reference = svm.test$quality.c,
predict(model.svm.c2, newdata = svm.test))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 5 6 7 3489
## 5 276 125 9 25
## 6 146 405 149 41
## 7 4 32 55 14
## 3489 1 5 2 8
##
## Overall Statistics
##
## Accuracy : 0.5736
## 95% CI : (0.5462, 0.6007)
## No Information Rate : 0.4372
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.3184
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: 5 Class: 6 Class: 7 Class: 3489
## Sensitivity 0.6464 0.7143 0.25581 0.090909
## Specificity 0.8172 0.5397 0.95379 0.993383
## Pos Pred Value 0.6345 0.5466 0.52381 0.500000
## Neg Pred Value 0.8248 0.7086 0.86577 0.937549
## Prevalence 0.3292 0.4372 0.16577 0.067849
## Detection Rate 0.2128 0.3123 0.04241 0.006168
## Detection Prevalence 0.3354 0.5713 0.08096 0.012336
## Balanced Accuracy 0.7318 0.6270 0.60480 0.542146
It looks like we didn’t have much overfitting. The model will not perform out of sample quite as well as it did in sample, but there is no dramatic drop in prediction accuracy.
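As a small additional check (my own), the two accuracies can be computed directly and put side by side:
# In-sample vs. out-of-sample classification accuracy of the tuned SVM
acc.in  <- mean(predict(model.svm.c2, newdata = svm.train) == svm.train$quality.c)
acc.out <- mean(predict(model.svm.c2, newdata = svm.test) == svm.test$quality.c)
round(c(in_sample = acc.in, out_of_sample = acc.out), 4)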
In this last section I will use quality.o for estimating an ordinal model. An ordinal model can be estimated using several link functions; I will use the logit link. We will use the ordinal package and its clm function.
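Before settling on the logit, one could compare a few links by AIC; the sketch below shows how (the probit and cloglog fits are my own addition, not part of the homework).
# Compare link functions for the ordinal model by AIC (lower is better)
library(ordinal)
fit.logit   <- clm(quality.o ~ ., data = wine2[,-c(12,14)], link = "logit")
fit.probit  <- clm(quality.o ~ ., data = wine2[,-c(12,14)], link = "probit")
fit.cloglog <- clm(quality.o ~ ., data = wine2[,-c(12,14)], link = "cloglog")
AIC(fit.logit, fit.probit, fit.cloglog)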
library(ordinal)
model.ordinal <- clm(quality.o ~ ., data = wine2[,-c(12,14)])
summary(model.ordinal)
## formula:
## quality.o ~ fixed_acidity + volatile_acidity + citric_acid + residual_sugar + chlorides + free_sulfur_dioxide + total_sulfur_dioxide + density + pH + sulphates + alcohol + wine
## data: wine2[, -c(12, 14)]
##
## link threshold nobs logLik AIC niter max.grad cond.H
## logit flexible 6497 -5581.85 11191.69 5(0) 6.80e-07 3.6e+02
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## fixed_acidity 0.29976 0.05644 5.311 1.09e-07 ***
## volatile_acidity -0.67167 0.04065 -16.525 < 2e-16 ***
## citric_acid -0.03372 0.03161 -1.067 0.286005
## residual_sugar 0.65024 0.07965 8.164 3.24e-16 ***
## chlorides -0.10306 0.03942 -2.614 0.008945 **
## free_sulfur_dioxide 0.40563 0.03990 10.165 < 2e-16 ***
## total_sulfur_dioxide -0.41053 0.05065 -8.106 5.25e-16 ***
## density -0.62716 0.12610 -4.974 6.57e-07 ***
## pH 0.21431 0.04062 5.277 1.32e-07 ***
## sulphates 0.31462 0.03122 10.078 < 2e-16 ***
## alcohol 0.81086 0.06387 12.695 < 2e-16 ***
## winewhite -0.61873 0.16572 -3.734 0.000189 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Threshold coefficients:
## Estimate Std. Error z value
## 5|6 -1.2200 0.1291 -9.45
## 6|7 1.4270 0.1293 11.04
We have a really nice ordinal model here. Similar to the linear regression model, all variables except citric_acid turn up significant at the 5% level. The two thresholds are also statistically significant, indicating that the model identified distinct cutoffs separating 6 from 5 and 7 from 6. Let’s study the model performance using a confusion matrix.
confusionMatrix(reference = wine2$quality.o,
unlist(predict(model.ordinal, newdata = wine2, type = "class")))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 5 6 7
## 5 1507 676 100
## 6 854 1858 772
## 7 23 302 405
##
## Overall Statistics
##
## Accuracy : 0.5803
## 95% CI : (0.5682, 0.5923)
## No Information Rate : 0.4365
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.3174
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: 5 Class: 6 Class: 7
## Sensitivity 0.6321 0.6551 0.31715
## Specificity 0.8113 0.5559 0.93774
## Pos Pred Value 0.6601 0.5333 0.55479
## Neg Pred Value 0.7919 0.6754 0.84879
## Prevalence 0.3669 0.4365 0.19655
## Detection Rate 0.2320 0.2860 0.06234
## Detection Prevalence 0.3514 0.5362 0.11236
## Balanced Accuracy 0.7217 0.6055 0.62744
As it turns out, at 58% the model accuracy is reasonable but not great. Ordinal regression is nevertheless the most appropriate model in this case, because quality is actually ordinal.
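Finally, to tie the classification models together (keeping in mind that the targets differ, so this is only a rough comparison, as discussed earlier), here is a small wrap-up sketch of my own computing each model’s in-sample accuracy:
# In-sample accuracy of each classifier; quality.c has 4 classes, quality.o has 3
acc <- c(
  MNL     = mean(predict(model.mnl.c, newdata = wine2[,-c(12,15)],
                         type = "class") == wine2$quality.c),
  SVM     = mean(predict(model.svm.c, newdata = wine2[,-c(12,15)]) == wine2$quality.c),
  Ordinal = mean(unlist(predict(model.ordinal, newdata = wine2,
                                type = "class")) == wine2$quality.o)
)
round(acc, 3)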