This is the solution for homework #1. The homework questions are available here: http://rpubs.com/malshe/224662

library(caret)
library(dplyr)
library(ggplot2)   # used later for the variable importance plot

Get the data in

red <- read.csv("/Volumes/Transcend/Dropbox/Work/Teaching/DA 6813/Course Documents/Data/Wine data/winequality-red.csv", stringsAsFactors = F)

red$wine <- "red"

white <- read.csv("/Volumes/Transcend/Dropbox/Work/Teaching/DA 6813/Course Documents/Data/Wine data/winequality-white.csv", stringsAsFactors = F)

white$wine <- "white"

wine <- rbind(red,white)
wine$wine <- as.factor(wine$wine)

Get the summary of the data

summary(wine)
##  fixed_acidity    volatile_acidity  citric_acid     residual_sugar  
##  Min.   : 3.800   Min.   :0.0800   Min.   :0.0000   Min.   : 0.600  
##  1st Qu.: 6.400   1st Qu.:0.2300   1st Qu.:0.2500   1st Qu.: 1.800  
##  Median : 7.000   Median :0.2900   Median :0.3100   Median : 3.000  
##  Mean   : 7.215   Mean   :0.3397   Mean   :0.3186   Mean   : 5.443  
##  3rd Qu.: 7.700   3rd Qu.:0.4000   3rd Qu.:0.3900   3rd Qu.: 8.100  
##  Max.   :15.900   Max.   :1.5800   Max.   :1.6600   Max.   :65.800  
##    chlorides       free_sulfur_dioxide total_sulfur_dioxide
##  Min.   :0.00900   Min.   :  1.00      Min.   :  6.0       
##  1st Qu.:0.03800   1st Qu.: 17.00      1st Qu.: 77.0       
##  Median :0.04700   Median : 29.00      Median :118.0       
##  Mean   :0.05603   Mean   : 30.53      Mean   :115.7       
##  3rd Qu.:0.06500   3rd Qu.: 41.00      3rd Qu.:156.0       
##  Max.   :0.61100   Max.   :289.00      Max.   :440.0       
##     density             pH          sulphates         alcohol     
##  Min.   :0.9871   Min.   :2.720   Min.   :0.2200   Min.   : 8.00  
##  1st Qu.:0.9923   1st Qu.:3.110   1st Qu.:0.4300   1st Qu.: 9.50  
##  Median :0.9949   Median :3.210   Median :0.5100   Median :10.30  
##  Mean   :0.9947   Mean   :3.219   Mean   :0.5313   Mean   :10.49  
##  3rd Qu.:0.9970   3rd Qu.:3.320   3rd Qu.:0.6000   3rd Qu.:11.30  
##  Max.   :1.0390   Max.   :4.010   Max.   :2.0000   Max.   :14.90  
##     quality         wine     
##  Min.   :3.000   red  :1599  
##  1st Qu.:5.000   white:4898  
##  Median :6.000               
##  Mean   :5.818               
##  3rd Qu.:6.000               
##  Max.   :9.000

Our dependent variable is quality. As we will be using it as a categorical variable in 3 of the 4 models, let’s look at its frequency distribution.

table(wine$quality)
## 
##    3    4    5    6    7    8    9 
##   30  216 2138 2836 1079  193    5

Clearly, the categories at the extremes have very few observations. This will lead to problems in correctly categorizing these values. In order to overcome this problem, I will create two new variables. For support vector machines (SVM) and the multinomial logistic model (MNL), I will create a new variable labeled quality.c, a factor variable with groups 3, 4, 8, and 9 combined into a single group. We can label this group anything we want as the labeling is meaningless for these methods. I will label it 3489, thus preserving the knowledge that it came from 4 separate groups.

For ordinal regression, we have a little bit more information about the ordering of the groups. I will create a new variable quality.o. In this variable, I will combine 3, 4, and 5 and label it 5 to indicate this is 5 and lower. Similarly, I will combine 7, 8, and 9 and label it 7 to indicate that this group is 7 and above. Thus, we will effectively have only 3 groups.

Clearly, this will make model comparison a little bit tough but we have to give each model the best chance to perform even at this lower level of analysis.

wine <- wine %>%
          mutate(quality.c = quality) %>%
          mutate(quality.o = quality) %>%
          mutate(quality.c = replace(quality.c, quality %in% c(3,4,8,9), 3489)) %>%
          mutate(quality.o = replace(quality.o, quality %in% c(3,4), 5)) %>%
          mutate(quality.o = replace(quality.o, quality %in% c(8,9), 7)) %>%
          mutate(quality.c = factor(quality.c)) %>%
          mutate(quality.o = ordered(quality.o))

str(wine$quality.c)
##  Factor w/ 4 levels "5","6","7","3489": 1 1 1 2 1 1 1 3 3 1 ...
str(wine$quality.o)
##  Ord.factor w/ 3 levels "5"<"6"<"7": 1 1 1 2 1 1 1 3 3 1 ...

Now that we have new dependent variables, let’s take a look at the independent variables. As we are doing a predictive analysis (we have no plan to do inference), let’s understand the distribution and correlations of the variables and explore the need for transformation.

For this we will first get the descriptive statistics and correlations for all the numeric variables.

cormat <- round(cor(as.matrix(wine[,-c(13,14,15)])),2)
cormat[upper.tri(cormat)] <- ""
as.data.frame(cormat)
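The correlation heat map referenced below was produced separately; a minimal sketch of how to draw it (assuming the corrplot package is installed) could look like this, where cor.mtest supplies the p-values used to cross out non-significant cells:

library(corrplot)
num.vars <- wine[, -c(13,14,15)]            # the 12 numeric columns
cor.p <- corrplot::cor.mtest(num.vars)      # p-values for each pairwise correlation
corrplot(cor(num.vars), method = "color",
         p.mat = cor.p$p, sig.level = 0.05, insig = "pch",   # "pch" draws crosses over non-significant cells
         tl.cex = 0.7)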

In the above heat map, the crosses indicate non-significant correlations. From the correlations it seems that most variables carry their own unique information. However, it appears that quality is strongly related to only a few variables. This is not great news. At this point one could think of transformations to increase the correlations between the variables, but I am not going to do that as it would make this exercise broader than I intended.

Let’s get the descriptive statistics.

library(moments)

desc <- as.data.frame(cbind(Mean = sapply(wine[,-c(13,14,15)], mean),
                      Median = sapply(wine[,-c(13,14,15)], median),
                      Std.Dev = sapply(wine[,-c(13,14,15)], sd),
                      CV = sapply(wine[,-c(13,14,15)], sd)/ sapply(wine[,-c(13,14,15)], mean),
                      Skewness = sapply(wine[,-c(13,14,15)], skewness),
                      Kurtosis = sapply(wine[,-c(13,14,15)], kurtosis)))

round(desc,2)
##                        Mean Median Std.Dev   CV Skewness Kurtosis
## fixed_acidity          7.22   7.00    1.30 0.18     1.72     8.06
## volatile_acidity       0.34   0.29    0.16 0.48     1.49     5.82
## citric_acid            0.32   0.31    0.15 0.46     0.47     5.39
## residual_sugar         5.44   3.00    4.76 0.87     1.44     7.35
## chlorides              0.06   0.05    0.04 0.63     5.40    53.86
## free_sulfur_dioxide   30.53  29.00   17.75 0.58     1.22    10.90
## total_sulfur_dioxide 115.74 118.00   56.52 0.49     0.00     2.63
## density                0.99   0.99    0.00 0.00     0.50     9.60
## pH                     3.22   3.21    0.16 0.05     0.39     3.37
## sulphates              0.53   0.51    0.15 0.28     1.80    11.65
## alcohol               10.49  10.30    1.19 0.11     0.57     2.47
## quality                5.82   6.00    0.87 0.15     0.19     3.23

The most interesting column for me is CV (coefficient of variation), the ratio of the standard deviation to the mean. Some variables have a very low CV (e.g., density at 0.00 and pH at 0.05), which means that the standard deviation is extremely small compared to the mean. Clearly we need some scaling here to remove the effect of the mean. One way to do that is to mean-center all the variables so that we have zero mean all across. That still retains the low standard deviation, however. To overcome this issue, we can also divide all the variables by their standard deviations. This way we standardize the data so that every variable has mean = 0 and standard deviation = 1.
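As a quick illustration (using alcohol as the example), standardizing by hand and with scale() gives the same result:

z.manual <- (wine$alcohol - mean(wine$alcohol)) / sd(wine$alcohol)   # mean-center, then divide by the SD
z.scale  <- as.numeric(scale(wine$alcohol))                          # scale() does both steps by default
all.equal(z.manual, z.scale)                                         # TRUE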

Will that help with reducing skewness and kurtosis? Think about it for a moment before you read on.

I am going to scale the numeric variables.

wine2 <- wine
wine2[,c(1:12)] <- scale(wine[,c(1:12)])

desc2 <- as.data.frame(cbind(Mean = sapply(wine2[,c(1:12)], mean),
                      Median = sapply(wine2[,c(1:12)], median),
                      Std.Dev = sapply(wine2[,c(1:12)], sd),
                      Skewness = sapply(wine2[,c(1:12)], skewness),
                      Kurtosis = sapply(wine2[,c(1:12)], kurtosis)))

round(desc2,2)
##                      Mean Median Std.Dev Skewness Kurtosis
## fixed_acidity           0  -0.17       1     1.72     8.06
## volatile_acidity        0  -0.30       1     1.49     5.82
## citric_acid             0  -0.06       1     0.47     5.39
## residual_sugar          0  -0.51       1     1.44     7.35
## chlorides               0  -0.26       1     5.40    53.86
## free_sulfur_dioxide     0  -0.09       1     1.22    10.90
## total_sulfur_dioxide    0   0.04       1     0.00     2.63
## density                 0   0.06       1     0.50     9.60
## pH                      0  -0.05       1     0.39     3.37
## sulphates               0  -0.14       1     1.80    11.65
## alcohol                 0  -0.16       1     0.57     2.47
## quality                 0   0.21       1     0.19     3.23

Scaling doesn’t affect skewness or kurtosis. In order to alter these two moments, we need a nonlinear transformation such as a logarithmic or square root transformation. I’m going to do this through trial and error.
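A quick sanity check (using chlorides as the example) shows that standardizing leaves skewness untouched, while a log transform changes it:

moments::skewness(wine$chlorides)           # skewness of the raw variable
moments::skewness(scale(wine$chlorides))    # identical: a linear rescaling doesn't change skewness
moments::skewness(log(wine$chlorides))      # different: log is a nonlinear transformation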

Normal distribution has skewness = 0 and kurtosis = 3. total_sulfur_dioxide, pH, alcohol, and quality seem to have this shape (a good idea is to plot distributions). I am concerned about fixed_acidity, chlorides, free_sulfur_dioxide, density, and sulphates due to high kurtosis (and skewness in some cases). Let’s take their log transform first and then scale these variables.

wine2[,c(1,5,6,8,10)] <- scale(log(wine[,c(1,5,6,8,10)]))

moments::skewness(wine2[,c(1,5,6,8,10)])
##       fixed_acidity           chlorides free_sulfur_dioxide 
##           0.8889319           0.8762698          -0.8340045 
##             density           sulphates 
##           0.4672599           0.4048986
moments::kurtosis(wine2[,c(1,5,6,8,10)])
##       fixed_acidity           chlorides free_sulfur_dioxide 
##            4.896783            5.305355            3.429675 
##             density           sulphates 
##            9.008338            3.701296

Except for density, the remaining 4 variables benefited from the log transformation. After plotting the distribution for density it appears that this might be because of a couple of extreme values. Although the log transformation should have dampened them, it didn’t work out here. So I replaced the extreme values (3 observations) with the next highest observation, 1.00369, and then took the log. As it turns out, the transformation paid off.
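The cutoff of 1.00369 is the next highest observation mentioned above; a quick way to locate it is to look at the top of the sorted density values:

tail(sort(unique(wine$density)), 5)   # the largest distinct density values; 1.00369 sits just below the extremes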

wine2$density <- ifelse(wine$density > 1.00369, 1.00369, wine$density) # replace the values > 1.00369 by 1.00369
wine2$density <- scale(log(wine2$density))

moments::skewness(wine2$density)
## [1] -0.02258351
moments::kurtosis(wine2$density)
## [1] 2.254121

Now we have a dataset wine2 which has all the transformed variables. I am going to use it for the rest of the analysis.

I am using the caret package for a few models. It doesn’t matter which packages you used.

Linear model

model.lm <- caret::train(quality ~ ., data = wine2[,-c(14,15)], method = "lm")
summary(model.lm)
## 
## Call:
## lm(formula = .outcome ~ ., data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.8267 -0.5347 -0.0440  0.5187  3.4109 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           0.317675   0.051352   6.186 6.54e-10 ***
## fixed_acidity         0.123372   0.023010   5.362 8.54e-08 ***
## volatile_acidity     -0.275112   0.015059 -18.269  < 2e-16 ***
## citric_acid          -0.004644   0.012951  -0.359   0.7199    
## residual_sugar        0.304732   0.031516   9.669  < 2e-16 ***
## chlorides            -0.039373   0.015838  -2.486   0.0129 *  
## free_sulfur_dioxide   0.194606   0.016048  12.126  < 2e-16 ***
## total_sulfur_dioxide -0.155376   0.020783  -7.476 8.65e-14 ***
## density              -0.317525   0.050289  -6.314 2.90e-10 ***
## pH                    0.080127   0.016593   4.829 1.40e-06 ***
## sulphates             0.120871   0.012814   9.433  < 2e-16 ***
## alcohol               0.305010   0.025422  11.998  < 2e-16 ***
## winewhite            -0.421383   0.066724  -6.315 2.87e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8328 on 6484 degrees of freedom
## Multiple R-squared:  0.3077, Adjusted R-squared:  0.3065 
## F-statistic: 240.2 on 12 and 6484 DF,  p-value: < 2.2e-16

I get a decent model with adjusted R2 = 0.3065. The model is highly significant with the p-value of the F-statistic < 2.2e-16. Except for citric_acid, all the other variables are statistically significant at conventional levels.

Note that I also included the variable wine in the data set while estimating the model. As it turns out, white wines have on average lower rating than red wines (all else equal).

I would have liked to tweak this model a little bit to understand whether there are any interactions present. However, that was not in the scope of the homework, so I will leave it for you guys to explore in the future.
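If you do want to explore interactions, a minimal sketch would be the following (the particular alcohol-by-wine-type interaction is only an illustration, not part of the assignment):

model.lm.int <- caret::train(quality ~ . + alcohol:wine, data = wine2[,-c(14,15)], method = "lm")
summary(model.lm.int)   # compare with model.lm to see whether the interaction adds anything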

Multinomial logistic regression (MNL)

For MNL, we will use quality.c as the dependent variable. Recall that this is a categorical variable with groups 3, 4, 8, and 9 bundled together.

table(wine2$quality.c)
## 
##    5    6    7 3489 
## 2138 2836 1079  444

Right up front I can tell you that this bundling is going to cause some problems. This is because we have assumed no ordering in quality. But in reality there is ordering. We will repeat this analysis with quality.o later.

model.mnl.c <- nnet::multinom(quality.c ~ ., data = wine2[,-c(12,15)], maxit = 100)
## # weights:  56 (39 variable)
## initial  value 9006.754464 
## iter  10 value 7029.581084
## iter  20 value 6959.947641
## iter  30 value 6788.213119
## iter  40 value 6728.847472
## final  value 6698.104301 
## converged
sum.mnl.c <- summary(model.mnl.c)

Get the z-statistics and p-values

z <- sum.mnl.c$coefficients/sum.mnl.c$standard.errors
p <- (1 - pnorm(abs(z), 0, 1))*2

# Coefficients for quality = 6

coeff.mnl.c1 <- rbind(sum.mnl.c$coefficients[1,],sum.mnl.c$standard.errors[1,],z[1,],p[1,])
rownames(coeff.mnl.c1) <- c("Coefficient","Std. Errors","z stat","p value")

# Coefficients for quality = 7

coeff.mnl.c2 <- rbind(sum.mnl.c$coefficients[2,],sum.mnl.c$standard.errors[2,],z[2,],p[2,])
rownames(coeff.mnl.c2) <- c("Coefficient","Std. Errors","z stat","p value")

# Coefficients for quality = 3489

coeff.mnl.c3 <- rbind(sum.mnl.c$coefficients[3,],sum.mnl.c$standard.errors[3,],z[3,],p[3,])
rownames(coeff.mnl.c3) <- c("Coefficient","Std. Errors","z stat","p value")

(The coefficient tables coeff.mnl.c1, coeff.mnl.c2, and coeff.mnl.c3, corresponding to quality.c = 6, quality.c = 7, and quality.c = 3489, are displayed here.)

It’s a good exercise to compare the coefficients of MNL and linear regression. However, MNL has 3 equations, with different variables showing up significant across the 3 equations. What can you do now? One way to handle this problem is to first identify the variables that are significant across the board; I will retain them. Then look at the ones that are non-significant in all 3 equations. Most likely these variables can be dropped from the next iteration of MNL. The variables that are significant in only some equations are the most difficult ones. You will have to take a call depending on their overall importance. If you drop them and the performance of the model deteriorates severely, perhaps adding them back is a better choice. You can also be conservative and keep any variable that’s significant at least once in the 3 equations.
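One way to operationalize this screening is to count, for each coefficient, in how many of the 3 equations it is significant, using the p-value matrix computed above (a sketch; the intercept column can be ignored):

sig.count <- colSums(p < 0.05)              # rows of p are the 3 equations, columns the coefficients
names(sig.count[sig.count == 3])            # significant in all 3 equations: retain
names(sig.count[sig.count == 0])            # significant in none: candidates to drop
names(sig.count[sig.count %in% c(1, 2)])    # significant in only some equations: a judgment call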

Performance of the model

MNL model performance can be assessed on several different metrics. I will select classification accuracy as the relevant metric. Here I get the confusion matrix using the same sample that we used for estimating the model. You could instead do this using cross-validation.
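If you went the cross-validation route, a sketch with caret would look like the following (the 5 folds and the seed are arbitrary choices; method = "multinom" wraps nnet::multinom):

set.seed(111)                                        # arbitrary seed, only for reproducibility
cv.ctrl <- caret::trainControl(method = "cv", number = 5)
model.mnl.cv <- caret::train(quality.c ~ ., data = wine2[,-c(12,15)],
                             method = "multinom", trControl = cv.ctrl,
                             trace = FALSE, maxit = 100)
model.mnl.cv$results                                 # average held-out accuracy across the folds

Below, however, I simply reuse the estimation sample.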

caret::confusionMatrix(reference = wine2$quality.c,
                       predict(model.mnl.c, newdata = wine2[,-c(12,15)], type = "class"))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    5    6    7 3489
##       5    1274  647   67  159
##       6     848 2017  782  217
##       7       9  161  229   51
##       3489    7   11    1   17
## 
## Overall Statistics
##                                           
##                Accuracy : 0.5444          
##                  95% CI : (0.5322, 0.5566)
##     No Information Rate : 0.4365          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.2649          
##  Mcnemar's Test P-Value : < 2.2e-16       
## 
## Statistics by Class:
## 
##                      Class: 5 Class: 6 Class: 7 Class: 3489
## Sensitivity            0.5959   0.7112  0.21223    0.038288
## Specificity            0.7997   0.4955  0.95921    0.996861
## Pos Pred Value         0.5934   0.5220  0.50889    0.472222
## Neg Pred Value         0.8014   0.6889  0.85943    0.933911
## Prevalence             0.3291   0.4365  0.16608    0.068339
## Detection Rate         0.1961   0.3105  0.03525    0.002617
## Detection Prevalence   0.3305   0.5947  0.06926    0.005541
## Balanced Accuracy      0.6978   0.6034  0.58572    0.517575

The model accuracy is 54%, which is not too bad. Note that class 6 has a prevalence of around 44%, so you would be right 44% of the time if you simply labeled all wines as 6. However, the model did a particularly poor job of detecting classes 7 and 3489. But this is the first MNL model you have estimated; you will ideally tweak it to yield better results.
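You can verify the no-information rate directly from the class frequencies:

round(prop.table(table(wine2$quality.c)), 2)   # class 6 accounts for roughly 44% of the wines
max(prop.table(table(wine2$quality.c)))        # this is the No Information Rate reported above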

Support Vector Machines (SVM)

I am going to use a radial basis function (RBF) kernel for the SVM. The RBF kernel takes two hyperparameters (gamma and cost) and we need to tune them before we can estimate the SVM. But tuning takes a long time. Therefore, in this example I won’t actually tune it here; I have already done it previously using the e1071 package. I am also writing down the code using the caret package in case you are interested.

Use the following code to tune the SVM using the e1071 package.

library(e1071)
library(doParallel)
registerDoParallel(cores = 4)
# First pass
set.seed(1234)
svm.tune <- tune.svm(quality.c ~ ., data = wine2[,-c(12,15)],
                     gamma = c(0.05,0.1,0.5,1,2),
                     cost = 10^(0:3))

svm.tune$best.parameters

Using the best parameters from this model (gamma = 0.05, cost = 10), we can do a second pass to tune the hyperparameters further, but I won’t do that because of the time it takes to execute the code.

The following code is sample code only; it is not executed in this tutorial.

set.seed(2345)
svm.tune2 <- tune.svm(quality.c ~ ., data = wine2[,-c(12,15)],
                     gamma = seq(svm.tune$best.parameters$gamma - 0.1,
                                 svm.tune$best.parameters$gamma + 0.1,
                                 by = 0.01),
                     cost = seq(svm.tune$best.parameters$cost - 1,
                                svm.tune$best.parameters$cost + 1,
                                by = .1))

And finally, using the best hyperparameters from the second pass we can estimate the model. In the next code I will use the hyperparameters from the first pass.

model.svm.c <- e1071::svm(quality.c ~ ., data = wine2[,-c(12,15)],
                   gamma = 0.05,
                   cost = 10)
summary(model.svm.c)
## 
## Call:
## svm(formula = quality.c ~ ., data = wine2[, -c(12, 15)], gamma = 0.05, 
##     cost = 10)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  10 
##       gamma:  0.05 
## 
## Number of Support Vectors:  5510
## 
##  ( 1649 2392 1027 442 )
## 
## 
## Number of Classes:  4 
## 
## Levels: 
##  5 6 7 3489

Assessing the model performance

caret::confusionMatrix(reference = wine2$quality.c,
                       predict(model.svm.c, newdata = wine2[,-c(12,15)]))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    5    6    7 3489
##       5    1485  509   39  132
##       6     638 2214  672  179
##       7      10  109  368   72
##       3489    5    4    0   61
## 
## Overall Statistics
##                                           
##                Accuracy : 0.6354          
##                  95% CI : (0.6235, 0.6471)
##     No Information Rate : 0.4365          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.418           
##  Mcnemar's Test P-Value : < 2.2e-16       
## 
## Statistics by Class:
## 
##                      Class: 5 Class: 6 Class: 7 Class: 3489
## Sensitivity            0.6946   0.7807  0.34106    0.137387
## Specificity            0.8440   0.5933  0.96475    0.998513
## Pos Pred Value         0.6859   0.5979  0.65832    0.871429
## Neg Pred Value         0.8493   0.7774  0.88026    0.940408
## Prevalence             0.3291   0.4365  0.16608    0.068339
## Detection Rate         0.2286   0.3408  0.05664    0.009389
## Detection Prevalence   0.3332   0.5700  0.08604    0.010774
## Balanced Accuracy      0.7693   0.6870  0.65290    0.567950

Well, it looks like the SVM classification on this data is better than MNL! However, as I didn’t do a train vs. test comparison, I can’t say with confidence whether there is overfitting. We will test this below with caret.

Which variables are important in the SVM? I couldn’t figure out a way to do that with the e1071 package, so I am using the caret package for this. The following code will tune the model, but it is not actually run in this tutorial because it takes a lot of time.

ctrl <- caret::trainControl(method = "cv")
 
set.seed(1492)
grid <- expand.grid(sigma = 10^(-1:4),
                    C = 10^(0:4))

model.svm.c <- train(quality.c ~ ., data = wine2[,-c(12,15)],
                    method = "svmRadial",
                    tuneGrid = grid,
                    trControl=ctrl)

I will use sigma = 0.05 and cost = 10 and estimate the model. I will then use varImp in caret to get the variable importance.

ctrl <- caret::trainControl(method = "cv")
grid <- expand.grid(sigma = 0.05, C = 10)
 
set.seed(89753)
model.svm.c <- train(quality.c ~ ., data = wine2[,-c(12,15)],
                     method = "svmRadial", tuneGrid = grid, trControl=ctrl)
## Loading required package: kernlab
## 
## Attaching package: 'kernlab'
## The following object is masked from 'package:ggplot2':
## 
##     alpha
var.imp <- varImp(model.svm.c, scale = T)

var.imp2 <- var.imp$importance
var.imp2$Chemistry <- rownames(var.imp2)
var.imp2 <- reshape2::melt(var.imp2, id = c("Chemistry"))
ggplot(data = var.imp2, aes(x = Chemistry, y = value)) +
      geom_bar(stat = "identity", aes(fill = Chemistry)) + facet_wrap(~variable) +
      theme_bw() + xlab(NULL) + ylab("Importance Weight") +
      theme(axis.text.x = element_text(angle = 60, hjust = 1),
            legend.position="none")

Similar to MNL, each predictor has a different effectiveness in predicting different classes. Clearly, alcohol content looks like the most important predictor for all the classes.

The next part is purely for satisfying my curiosity. You can ignore it and directly go to the ordinal logistic regression.

Testing for overfitting

I won’t explain this code as you are capable of understanding it.

set.seed(1234)
trind <- caret::createDataPartition(wine2$quality.c, p = 4/5, list = FALSE)
svm.train <- wine2[trind,-c(12,15)]
svm.test <- wine2[-trind,-c(12,15)]

model.svm.c2 <- train(quality.c ~ ., data = svm.train,
                    method = "svmRadial",
                    tuneGrid = grid,
                    trControl=ctrl)

# In-sample accuracy

caret::confusionMatrix(reference = svm.train$quality.c,
                       predict(model.svm.c2, newdata = svm.train))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    5    6    7 3489
##       5    1188  387   25  111
##       6     506 1779  515  133
##       7      11   96  324   59
##       3489    6    7    0   53
## 
## Overall Statistics
##                                           
##                Accuracy : 0.6431          
##                  95% CI : (0.6299, 0.6561)
##     No Information Rate : 0.4363          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.4326          
##  Mcnemar's Test P-Value : < 2.2e-16       
## 
## Statistics by Class:
## 
##                      Class: 5 Class: 6 Class: 7 Class: 3489
## Sensitivity            0.6943   0.7840  0.37500     0.14888
## Specificity            0.8501   0.6063  0.96172     0.99732
## Pos Pred Value         0.6943   0.6065  0.66122     0.80303
## Neg Pred Value         0.8501   0.7839  0.88535     0.94098
## Prevalence             0.3290   0.4363  0.16615     0.06846
## Detection Rate         0.2285   0.3421  0.06231     0.01019
## Detection Prevalence   0.3290   0.5640  0.09423     0.01269
## Balanced Accuracy      0.7722   0.6952  0.66836     0.57310
# Out-of-sample accuracy

caret::confusionMatrix(reference = svm.test$quality.c,
                       predict(model.svm.c2, newdata = svm.test))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   5   6   7 3489
##       5    276 125   9   25
##       6    146 405 149   41
##       7      4  32  55   14
##       3489   1   5   2    8
## 
## Overall Statistics
##                                           
##                Accuracy : 0.5736          
##                  95% CI : (0.5462, 0.6007)
##     No Information Rate : 0.4372          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.3184          
##  Mcnemar's Test P-Value : < 2.2e-16       
## 
## Statistics by Class:
## 
##                      Class: 5 Class: 6 Class: 7 Class: 3489
## Sensitivity            0.6464   0.7143  0.25581    0.090909
## Specificity            0.8172   0.5397  0.95379    0.993383
## Pos Pred Value         0.6345   0.5466  0.52381    0.500000
## Neg Pred Value         0.8248   0.7086  0.86577    0.937549
## Prevalence             0.3292   0.4372  0.16577    0.067849
## Detection Rate         0.2128   0.3123  0.04241    0.006168
## Detection Prevalence   0.3354   0.5713  0.08096    0.012336
## Balanced Accuracy      0.7318   0.6270  0.60480    0.542146

It looks like we didn’t have much overfitting. The model does not perform quite as well out of sample as it did in sample, but there is no dramatic reduction in prediction accuracy.

Ordinal Regression

In this last section I will use quality.o for estimating an ordinal model. An ordinal model can be estimated using several link functions; I will use a logit link.

We will use the ordinal package and its clm function.

library(ordinal)
model.ordinal <- clm(quality.o ~ ., data = wine2[,-c(12,14)])
summary(model.ordinal)
## formula: 
## quality.o ~ fixed_acidity + volatile_acidity + citric_acid + residual_sugar + chlorides + free_sulfur_dioxide + total_sulfur_dioxide + density + pH + sulphates + alcohol + wine
## data:    wine2[, -c(12, 14)]
## 
##  link  threshold nobs logLik   AIC      niter max.grad cond.H 
##  logit flexible  6497 -5581.85 11191.69 5(0)  6.80e-07 3.6e+02
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## fixed_acidity         0.29976    0.05644   5.311 1.09e-07 ***
## volatile_acidity     -0.67167    0.04065 -16.525  < 2e-16 ***
## citric_acid          -0.03372    0.03161  -1.067 0.286005    
## residual_sugar        0.65024    0.07965   8.164 3.24e-16 ***
## chlorides            -0.10306    0.03942  -2.614 0.008945 ** 
## free_sulfur_dioxide   0.40563    0.03990  10.165  < 2e-16 ***
## total_sulfur_dioxide -0.41053    0.05065  -8.106 5.25e-16 ***
## density              -0.62716    0.12610  -4.974 6.57e-07 ***
## pH                    0.21431    0.04062   5.277 1.32e-07 ***
## sulphates             0.31462    0.03122  10.078  < 2e-16 ***
## alcohol               0.81086    0.06387  12.695  < 2e-16 ***
## winewhite            -0.61873    0.16572  -3.734 0.000189 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Threshold coefficients:
##     Estimate Std. Error z value
## 5|6  -1.2200     0.1291   -9.45
## 6|7   1.4270     0.1293   11.04

We have a really nice ordinal model here. Similar to the linear regression model, all variables except citric_acid turn up significant at the 5% level. The two thresholds are also statistically significant, indicating that the model identified distinct cutpoints separating 6 from 5 and 7 from 6. Let’s study the model performance using a confusion matrix.

confusionMatrix(reference = wine2$quality.o,
                unlist(predict(model.ordinal, newdata = wine2, type = "class")))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    5    6    7
##          5 1507  676  100
##          6  854 1858  772
##          7   23  302  405
## 
## Overall Statistics
##                                           
##                Accuracy : 0.5803          
##                  95% CI : (0.5682, 0.5923)
##     No Information Rate : 0.4365          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.3174          
##  Mcnemar's Test P-Value : < 2.2e-16       
## 
## Statistics by Class:
## 
##                      Class: 5 Class: 6 Class: 7
## Sensitivity            0.6321   0.6551  0.31715
## Specificity            0.8113   0.5559  0.93774
## Pos Pred Value         0.6601   0.5333  0.55479
## Neg Pred Value         0.7919   0.6754  0.84879
## Prevalence             0.3669   0.4365  0.19655
## Detection Rate         0.2320   0.2860  0.06234
## Detection Prevalence   0.3514   0.5362  0.11236
## Balanced Accuracy      0.7217   0.6055  0.62744

As it turns out, at 58% the model accuracy is reasonable but not that great. However, ordinal regression is the most appropriate model in this case because quality is actually ordinal.
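As noted earlier, an ordinal model can be estimated with several link functions. If you want to check how sensitive the results are to the logit choice, a probit fit is a one-line change (a sketch; clm accepts link = "probit" among others):

model.ordinal.probit <- clm(quality.o ~ ., data = wine2[,-c(12,14)], link = "probit")
summary(model.ordinal.probit)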