I am using this wine quality data to demonstrate the application of some common machine learning tools. The dataset is downloaded from the UCI Machine Learning Repository https://archive.ics.uci.edu/ml/index.php. The original dataset is published in the paper: P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.
The description of this dataset can be obtained from the UCI website. I am copying it here:
Input variables (based on physicochemical tests):
1 - fixed acidity
2 - volatile acidity
3 - citric acid
4 - residual sugar
5 - chlorides
6 - free sulfur dioxide
7 - total sulfur dioxide
8 - density
9 - pH
10 - sulphates
11 - alcohol
Output variable (based on sensory data):
12 - quality (score between 0 and 10)
There are two questions in this analysis. The first is to predict wine quality from the input variables. The second is to specifically classify wines of excellent quality, defined as wines with quality >= 7.
To address the first question, I will treat the wine quality variable as continuous and build models such as linear regression, random forest and support vector machine to predict wine quality. This part is presented in Section 1 of this analysis.
To address the second question, I will transform the quality variable into a binary variable and build models such as logistic regression and LASSO for classification. This part is presented in Section 2 of this analysis.
rm(list = ls())
# For aesthetic purposes, I usually put this part in a separate .R file, which can be called through the source() function. I am displaying it here just for the purpose of showing all used code.
packages = c("tidyverse", "RCurl", "psych", "stats",
"randomForest", "glmnet", "caret","kernlab",
"rpart", "rpart.plot", "neuralnet", "C50",
"doParallel", "AUC", "ggfortify")
if (length(setdiff(packages, rownames(installed.packages()))) > 0) {
install.packages(setdiff(packages, rownames(installed.packages())))
}
invisible(lapply(packages, require, character.only = TRUE))
# customized function to evaluate model performance when predicting a continuous outcome
eval = function(pred, true, plot = F, title = "") {
rmse = sqrt(mean((pred - true)^2))
mae = mean(abs(pred - true))
cor = cor(pred, true)
if (plot == TRUE) {
par(mfrow = c(1,2), oma = c(0, 0, 2, 0))
diff = pred - true
plot(jitter(true, factor = 1),
jitter(pred, factor = 0.5), #jitter so that we can see overlapped dots
pch = 3, asp = 1,
xlab = "Truth", ylab = "Predicted")
abline(0,1, lty = 2)
hist(diff, breaks = 20, main = NULL)
mtext(paste0(title, " predicted vs. true using test set"), outer = TRUE)
par(mfrow = c(1,1))}
return(list(rmse = rmse,
mae = mae,
cor = cor))
}
# customized function to evaluate model performance when predicting a binary outcome
eval_class = function(prob, true, plot = F, title = "") {
# find cutoff with the best kappa
cuts = seq(0.01, 0.99, by=0.01)
kappa = c()
for (cut in cuts){
cat = factor(ifelse(prob >= cut, 1, 0), levels = c(0, 1)) # keep both levels even if one class is absent at this cutoff
cm = confusionMatrix(cat, true, positive = "1")
kappa = c(kappa, cm$overall[["Kappa"]])
}
opt.cut = cuts[which.max(kappa)]
# make predictions based on best kappa
pred = factor(ifelse(prob >= opt.cut, 1, 0), levels = c(0, 1))
confM = confusionMatrix(pred, true, positive = "1")
# calculate AUC
roc = roc(as.vector(prob), as.factor(true))
auc = round(AUC::auc(roc),3)
if (plot==T){
# plot the ROC curve and annotate its AUC
par(mfrow = c(1,2), oma = c(0, 0, 2, 0))
plot(roc, main = "ROC curve"); abline(0,1)
text(0.8, 0.2, paste0("AUC = ", auc))
# plot confusion matrix
tab = table(true, pred)
plot(tab,
xlab = "Truth",
ylab = "Predicted",
main = "Confusion Matrix")
text(0.9, 0.9, paste0('FN:', tab[2,1]))
text(0.9, 0.05, paste0('TP:', tab[2,2]))
text(0.1, 0.9, paste0('TN:', tab[1,1]))
text(0.1, 0.05, paste0('FP:', tab[1,2]))
mtext(paste0(title, " predicted vs. true using test set"), outer = TRUE)
par(mfrow = c(1,1))
}
return(list(auc=auc,
confusionMatrix = confM))
}
myfile = getURL('https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv')
raw = read.csv(textConnection(myfile), header = T, sep = ";")
n = nrow(raw); p = ncol(raw); dim(raw)
## [1] 4898 12
head(raw) # check the first few lines
## fixed.acidity volatile.acidity citric.acid residual.sugar chlorides free.sulfur.dioxide
## 1 7.0 0.27 0.36 20.7 0.045 45
## 2 6.3 0.30 0.34 1.6 0.049 14
## 3 8.1 0.28 0.40 6.9 0.050 30
## 4 7.2 0.23 0.32 8.5 0.058 47
## 5 7.2 0.23 0.32 8.5 0.058 47
## 6 8.1 0.28 0.40 6.9 0.050 30
## total.sulfur.dioxide density pH sulphates alcohol quality
## 1 170 1.0010 3.00 0.45 8.8 6
## 2 132 0.9940 3.30 0.49 9.5 6
## 3 97 0.9951 3.26 0.44 10.1 6
## 4 186 0.9956 3.19 0.40 9.9 6
## 5 186 0.9956 3.19 0.40 9.9 6
## 6 97 0.9951 3.26 0.44 10.1 6
str(raw) # check the general structure
## 'data.frame': 4898 obs. of 12 variables:
## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : int 6 6 6 6 6 6 6 6 6 6 ...
summary(raw) # check the summary
## fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## Min. : 3.800 Min. :0.0800 Min. :0.0000 Min. : 0.600 Min. :0.00900
## 1st Qu.: 6.300 1st Qu.:0.2100 1st Qu.:0.2700 1st Qu.: 1.700 1st Qu.:0.03600
## Median : 6.800 Median :0.2600 Median :0.3200 Median : 5.200 Median :0.04300
## Mean : 6.855 Mean :0.2782 Mean :0.3342 Mean : 6.391 Mean :0.04577
## 3rd Qu.: 7.300 3rd Qu.:0.3200 3rd Qu.:0.3900 3rd Qu.: 9.900 3rd Qu.:0.05000
## Max. :14.200 Max. :1.1000 Max. :1.6600 Max. :65.800 Max. :0.34600
## free.sulfur.dioxide total.sulfur.dioxide density pH sulphates
## Min. : 2.00 Min. : 9.0 Min. :0.9871 Min. :2.720 Min. :0.2200
## 1st Qu.: 23.00 1st Qu.:108.0 1st Qu.:0.9917 1st Qu.:3.090 1st Qu.:0.4100
## Median : 34.00 Median :134.0 Median :0.9937 Median :3.180 Median :0.4700
## Mean : 35.31 Mean :138.4 Mean :0.9940 Mean :3.188 Mean :0.4898
## 3rd Qu.: 46.00 3rd Qu.:167.0 3rd Qu.:0.9961 3rd Qu.:3.280 3rd Qu.:0.5500
## Max. :289.00 Max. :440.0 Max. :1.0390 Max. :3.820 Max. :1.0800
## alcohol quality
## Min. : 8.00 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.40 Median :6.000
## Mean :10.51 Mean :5.878
## 3rd Qu.:11.40 3rd Qu.:6.000
## Max. :14.20 Max. :9.000
It seems that the dataset is very clean, with no missing data and a clear structure. All variables are numeric. The ranges of the independent variables vary greatly, so when building the models I will normalize them to the same range.
Next, I will check the pairwise relationships among the variables. As the following figure shows, there is no clear linear relationship between the quality variable and the other covariates, indicating that a simple linear regression might not work well. In addition, there is some collinearity among the covariates. Both of these violate the assumptions of a linear model.
pairs.panels(raw)
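The pairs plot suggests some collinearity; a quick numeric check (an aside, not part of the original analysis) is to list the covariate pairs with an absolute correlation above an arbitrary threshold such as 0.5:
# Aside: list covariate pairs with |correlation| > 0.5 (each pair appears twice, since the matrix is symmetric)
cors = cor(raw[, -p])
high = which(abs(cors) > 0.5 & abs(cors) < 1, arr.ind = TRUE)
data.frame(var1 = rownames(cors)[high[, "row"]],
           var2 = colnames(cors)[high[, "col"]],
           cor  = round(cors[high], 2))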
I will split the dataset into a training and testing set, and normalize each set separately.
set.seed(1)
idx = sample(n, 0.9*n)
train = raw[idx,]; dim(train)
## [1] 4408 12
test = raw[-idx,]; dim(test)
## [1] 490 12
# normalize train set so that the range is 0 ~ 1
normalize_train = function(x) (x - min(x))/(max(x) - min(x))
train.norm = data.frame(apply(train[,-p], 2, normalize_train),
quality = train[,p])
summary(train.norm)
## fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.00000 Min. :0.00000
## 1st Qu.:0.2404 1st Qu.:0.1275 1st Qu.:0.1627 1st Qu.:0.01687 1st Qu.:0.07186
## Median :0.2885 Median :0.1765 Median :0.1867 Median :0.07055 Median :0.09281
## Mean :0.2939 Mean :0.1946 Mean :0.2014 Mean :0.08879 Mean :0.10075
## 3rd Qu.:0.3365 3rd Qu.:0.2353 3rd Qu.:0.2289 3rd Qu.:0.14264 3rd Qu.:0.11377
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.00000 Max. :1.00000
## free.sulfur.dioxide total.sulfur.dioxide density pH sulphates
## Min. :0.00000 Min. :0.0000 Min. :0.00000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.07317 1st Qu.:0.2297 1st Qu.:0.08849 1st Qu.:0.3372 1st Qu.:0.2405
## Median :0.11150 Median :0.2900 Median :0.12763 Median :0.4220 Median :0.3165
## Mean :0.11600 Mean :0.2999 Mean :0.13321 Mean :0.4298 Mean :0.3410
## 3rd Qu.:0.15331 3rd Qu.:0.3689 3rd Qu.:0.17332 3rd Qu.:0.5138 3rd Qu.:0.4177
## Max. :1.00000 Max. :1.0000 Max. :1.00000 Max. :1.0000 Max. :1.0000
## alcohol quality
## Min. :0.0000 Min. :3.000
## 1st Qu.:0.2419 1st Qu.:5.000
## Median :0.3871 Median :6.000
## Mean :0.4068 Mean :5.883
## 3rd Qu.:0.5484 3rd Qu.:6.000
## Max. :1.0000 Max. :9.000
# normalize test set using the values from train set to make prediction comparable
train.min = apply(train[,-p], 2, min)
train.max = apply(train[,-p], 2, max)
test.norm = data.frame(sweep(test, 2, c(train.min, 0)) %>%
sweep(2, c(train.max-train.min, 1), FUN = "/"))
summary(test.norm) # test.norm might have data out of range 0~1, since it's normalized against the training set.
## fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## Min. :0.05769 Min. :0.004902 Min. :0.0000 Min. :0.001534 Min. :-0.008982
## 1st Qu.:0.24038 1st Qu.:0.127451 1st Qu.:0.1627 1st Qu.:0.015337 1st Qu.: 0.071856
## Median :0.28846 Median :0.176471 Median :0.1928 Median :0.069018 Median : 0.089820
## Mean :0.29242 Mean :0.192047 Mean :0.2008 Mean :0.089140 Mean : 0.104369
## 3rd Qu.:0.33654 3rd Qu.:0.235294 3rd Qu.:0.2410 3rd Qu.:0.141104 3rd Qu.: 0.113772
## Max. :0.57692 Max. :0.666667 Max. :0.4458 Max. :0.475460 Max. : 0.682635
## free.sulfur.dioxide total.sulfur.dioxide density pH sulphates
## Min. :0.003484 Min. :0.04408 Min. :0.005591 Min. :0.06422 Min. :0.05063
## 1st Qu.:0.070557 1st Qu.:0.23492 1st Qu.:0.090418 1st Qu.:0.33945 1st Qu.:0.24051
## Median :0.108014 Median :0.29002 Median :0.128976 Median :0.42202 Median :0.32911
## Mean :0.116572 Mean :0.30213 Mean :0.134698 Mean :0.42756 Mean :0.34650
## 3rd Qu.:0.149826 3rd Qu.:0.36137 3rd Qu.:0.173077 3rd Qu.:0.51147 3rd Qu.:0.41772
## Max. :0.449477 Max. :0.70534 Max. :0.447079 Max. :1.00917 Max. :1.08861
## alcohol quality
## Min. :0.0000 Min. :3.000
## 1st Qu.:0.2419 1st Qu.:5.000
## Median :0.3710 Median :6.000
## Mean :0.3939 Mean :5.833
## 3rd Qu.:0.5161 3rd Qu.:6.000
## Max. :0.9516 Max. :8.000
I will start with a linear regression model. This is not always the best model, but it is a reasonable starting point that gives a basic sense of the relationship between the independent variables and the dependent variable.
First, I will check the normality of the dependent variable using the Shapiro-Wilk test.
hist(raw$quality)
shapiro.test(raw$quality) #Didn't pass normality test, so linear model may have a problem
##
## Shapiro-Wilk normality test
##
## data: raw$quality
## W = 0.88904, p-value < 2.2e-16
The dependent variable doesn't pass the normality test, so one assumption of linear regression is not met. In addition, as seen in the pairwise plot, the relationships between the independent variables and the dependent variable are not entirely linear, and there is some collinearity among the independent variables. Any of these could hurt the performance of the linear model.
I will then apply the linear model to the test set and visualize the predicted values against the true values. I will also evaluate model performance with 3 measures: RMSE (root mean squared error), MAE (mean absolute error) and cor (correlation). Smaller RMSE and MAE and a larger cor indicate a better prediction.
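For reference, these are the quantities computed by the eval() function defined above:
\[ \mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(\hat{y}_i - y_i)^2}, \qquad \mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\lvert \hat{y}_i - y_i \rvert, \qquad \mathrm{cor} = \mathrm{cor}(\hat{y}, y). \]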
tr.lm = lm(quality~., data = train.norm)
summary(tr.lm)
##
## Call:
## lm(formula = quality ~ ., data = train.norm)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.9052 -0.4946 -0.0378 0.4684 3.0895
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.522913 0.112623 49.039 < 2e-16 ***
## fixed.acidity 0.815891 0.227717 3.583 0.000343 ***
## volatile.acidity -1.937979 0.122947 -15.763 < 2e-16 ***
## citric.acid -0.007099 0.167319 -0.042 0.966157
## residual.sugar 5.497204 0.515538 10.663 < 2e-16 ***
## chlorides -0.075770 0.197262 -0.384 0.700916
## free.sulfur.dioxide 1.170572 0.259457 4.512 6.60e-06 ***
## total.sulfur.dioxide -0.148746 0.173382 -0.858 0.390990
## density -8.026818 1.032594 -7.773 9.45e-15 ***
## pH 0.762083 0.120353 6.332 2.66e-10 ***
## sulphates 0.536491 0.084862 6.322 2.84e-10 ***
## alcohol 1.194482 0.156881 7.614 3.23e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.756 on 4396 degrees of freedom
## Multiple R-squared: 0.2858, Adjusted R-squared: 0.284
## F-statistic: 159.9 on 11 and 4396 DF, p-value: < 2.2e-16
tr.lm.pred = predict(tr.lm, test.norm[,-p])
tr.lm.eval = eval(tr.lm.pred, test.norm$quality, plot = T, title = "lm: "); unlist(tr.lm.eval)
## rmse mae cor
## 0.7097731 0.5602135 0.4901282
As we can see in the model summary above, the \(R^2\) of the model is only 0.28, so it doesn't explain much of the variance. Comparing predicted quality with true quality, the RMSE is 0.71 and the MAE is 0.56; considering that the quality score ranges from 0 to 10, this level of error is not too bad. As the plot above shows, the model performs worse on extreme cases, i.e., wines with especially high or low quality.
To remedy this issue, there are a few options:
I could check whether there are many outliers or influential points.
For variables with a non-linear relationship to the y variable, I could try higher-order regression, such as quadratic regression. I could also break them into a few intervals and treat them as categorical variables.
I could test the interactions between pairs of independent variables, especially the pairs with high correlations in the pairs plot.
I could use non-linear models, such as random forest (rf), support vector machine (SVM), regression trees (rt) and neural networks (NN).
Wines with especially high or low quality are relatively rare, so a standard model does not put enough weight on them; I could up-weight these observations when training the model (a sketch of this idea follows below).
I will try some of these ideas in the following parts.
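As a sketch of the re-weighting idea only (it is not pursued further in this analysis), lm() accepts case weights, so the rare extreme-quality wines could be up-weighted; the weight of 3 and the quality cutoffs used below are arbitrary illustrative choices:
# Sketch: up-weight rare extreme-quality wines in the linear model (illustrative weights only)
w = ifelse(train.norm$quality <= 4 | train.norm$quality >= 8, 3, 1)
tr.lm.weighted = lm(quality ~ ., data = train.norm, weights = w)
summary(tr.lm.weighted)$r.squared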
par(mfrow=c(2,3))
lapply(1:6, function(x) plot(tr.lm, which=x, labels.id= 1:nrow(train.norm))) %>% invisible()
par(mfrow=c(1,1))
From the plots above, observations #245, #984 and #2466 appear as outliers in most of the plots, so I will remove them in the following analysis.
rm = c(245, 984, 2466)
removed = train.norm[rm, ];removed # these observations will be removed from the training set
## fixed.acidity volatile.acidity citric.acid residual.sugar chlorides free.sulfur.dioxide
## 1932 0.3173077 0.4019608 0.1325301 0.02147239 0.1047904 0.50348432
## 4746 0.2211538 0.1764706 0.1506024 0.03527607 0.1047904 1.00000000
## 2782 0.3846154 0.8676471 0.3614458 1.00000000 0.1856287 0.02090592
## total.sulfur.dioxide density pH sulphates alcohol quality
## 1932 0.6925754 0.1019857 0.4770642 0.1898734 0.4838710 3
## 4746 1.0000000 0.1162522 0.6605505 0.5316456 0.4032258 3
## 2782 0.3503480 1.0000000 0.6146789 0.5949367 0.5967742 6
train.norm = train.norm[-rm, ]
tr.lm.rmoutlier = lm(quality~., data = train.norm)
summary(tr.lm.rmoutlier)
##
## Call:
## lm(formula = quality ~ ., data = train.norm)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.3832 -0.5032 -0.0431 0.4640 3.0939
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.723668 0.122950 46.553 < 2e-16 ***
## fixed.acidity 1.367609 0.250063 5.469 4.78e-08 ***
## volatile.acidity -1.911980 0.122176 -15.649 < 2e-16 ***
## citric.acid -0.036915 0.166163 -0.222 0.824200
## residual.sugar 6.836281 0.591479 11.558 < 2e-16 ***
## chlorides -0.031412 0.196154 -0.160 0.872778
## free.sulfur.dioxide 1.482441 0.265093 5.592 2.38e-08 ***
## total.sulfur.dioxide -0.007955 0.174900 -0.045 0.963723
## density -11.583903 1.260367 -9.191 < 2e-16 ***
## pH 0.987459 0.127692 7.733 1.29e-14 ***
## sulphates 0.594696 0.085286 6.973 3.57e-12 ***
## alcohol 0.688053 0.188469 3.651 0.000265 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7503 on 4393 degrees of freedom
## Multiple R-squared: 0.2936, Adjusted R-squared: 0.2918
## F-statistic: 166 on 11 and 4393 DF, p-value: < 2.2e-16
By removing only 3 of the more than 4,000 observations, the \(R^2\) increases from 0.286 to 0.294, so these outliers have a noticeable influence on the model.
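As an additional check (an aside, not part of the original analysis), the influence of individual observations can be quantified directly with Cook's distance rather than read off the diagnostic plots:
# Aside: flag observations whose Cook's distance exceeds the common 4/n rule of thumb
cd = cooks.distance(tr.lm.rmoutlier)
head(sort(cd, decreasing = TRUE)) # most influential remaining observations
sum(cd > 4/length(cd))            # how many exceed the 4/n threshold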
Next, I will fit a quadratic regression model by feeding \(x + x^2\) into the model for each independent variable. As we can see from the model summary, compared with the linear model the \(R^2\) increases and both RMSE and MAE decrease, but \(R^2 = 0.325\) is still not very satisfying.
# 2nd order regression (quadratic model)
tr.qm = lm(quality~ poly(fixed.acidity, 2) +
poly(volatile.acidity,2) +
poly(citric.acid,2) +
poly(residual.sugar,2) +
poly(chlorides,2) +
poly(free.sulfur.dioxide,2) +
poly(total.sulfur.dioxide,2) +
poly(density,2) +
poly(pH,2) +
poly(sulphates,2) +
poly(alcohol,2),
data = train.norm)
summary(tr.qm)
##
## Call:
## lm(formula = quality ~ poly(fixed.acidity, 2) + poly(volatile.acidity,
## 2) + poly(citric.acid, 2) + poly(residual.sugar, 2) + poly(chlorides,
## 2) + poly(free.sulfur.dioxide, 2) + poly(total.sulfur.dioxide,
## 2) + poly(density, 2) + poly(pH, 2) + poly(sulphates, 2) +
## poly(alcohol, 2), data = train.norm)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.3719 -0.4980 -0.0193 0.4428 3.1643
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.88294 0.01108 530.850 < 2e-16 ***
## poly(fixed.acidity, 2)1 8.23372 1.34170 6.137 9.16e-10 ***
## poly(fixed.acidity, 2)2 -3.75624 0.76762 -4.893 1.03e-06 ***
## poly(volatile.acidity, 2)1 -12.01445 0.83110 -14.456 < 2e-16 ***
## poly(volatile.acidity, 2)2 2.64353 0.77623 3.406 0.000666 ***
## poly(citric.acid, 2)1 -0.19872 0.79667 -0.249 0.803034
## poly(citric.acid, 2)2 -2.98193 0.77025 -3.871 0.000110 ***
## poly(residual.sugar, 2)1 31.37846 2.99353 10.482 < 2e-16 ***
## poly(residual.sugar, 2)2 -8.72643 2.12088 -4.115 3.95e-05 ***
## poly(chlorides, 2)1 -0.89190 0.83490 -1.068 0.285457
## poly(chlorides, 2)2 1.60890 0.80462 2.000 0.045607 *
## poly(free.sulfur.dioxide, 2)1 4.30472 1.00592 4.279 1.91e-05 ***
## poly(free.sulfur.dioxide, 2)2 -4.20019 0.88676 -4.737 2.24e-06 ***
## poly(total.sulfur.dioxide, 2)1 -0.64706 1.14915 -0.563 0.573412
## poly(total.sulfur.dioxide, 2)2 -5.26681 0.88050 -5.982 2.39e-09 ***
## poly(density, 2)1 -42.88176 4.65481 -9.212 < 2e-16 ***
## poly(density, 2)2 12.96425 2.25765 5.742 9.97e-09 ***
## poly(pH, 2)1 10.00738 1.17318 8.530 < 2e-16 ***
## poly(pH, 2)2 1.45152 0.75887 1.913 0.055846 .
## poly(sulphates, 2)1 5.75641 0.79662 7.226 5.84e-13 ***
## poly(sulphates, 2)2 -0.48247 0.76401 -0.631 0.527747
## poly(alcohol, 2)1 5.30659 2.52432 2.102 0.035594 *
## poly(alcohol, 2)2 3.40788 0.82761 4.118 3.90e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7358 on 4385 degrees of freedom
## Multiple R-squared: 0.3251, Adjusted R-squared: 0.3218
## F-statistic: 96.03 on 22 and 4385 DF, p-value: < 2.2e-16
tr.qm.pred = predict(tr.qm, test.norm[,-p])
tr.qm.eval = eval(tr.qm.pred, test.norm$quality, plot=T, title="quadratic model: ");unlist(tr.qm.eval)
## rmse mae cor
## 0.6865258 0.5438849 0.5354802
I will categorize each independent variable according to its quantiles, splitting each variable into 6 categories:
0: below 10%
1: 10% to 25%
2: 25% to median
3: median to 75%
4: 75% to 90%
5: above 90%
I then fit a multi-level ANOVA model using the new categorical variables. As the model summary below shows, its performance is very similar to the quadratic model.
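The categorize() helper below implements this binning with case_when(). As an aside, a more compact base-R sketch using cut() with quantile break points could look like the following (the boundary handling differs slightly, since cut() uses right-closed intervals, and the original code applies the training-set quantiles to the test set as well):
# Sketch only: quantile-based binning of a single column via cut()
cut_by_quantile = function(x, q = quantile(x, c(0, 0.1, 0.25, 0.5, 0.75, 0.9, 1))) {
  cut(x, breaks = unique(q), labels = FALSE, include.lowest = TRUE) - 1
}
# e.g. table(cut_by_quantile(train.norm$alcohol))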
# categorize covariates by cutoff in the quantiles of c(0.1, 0.25, 0.5, 0.75, 0.9)
low10pct = apply(train.norm, 2, function(x) quantile(x, 0.1))
q1 = apply(train.norm, 2, function(x) quantile(x, 0.25))
q2 = apply(train.norm, 2, function(x) quantile(x, 0.5))
q3 = apply(train.norm, 2, function(x) quantile(x, 0.75))
top10pct = apply(train.norm, 2, function(x) quantile(x, 0.9))
categorize = function(dataset = train.norm) {
df.cat = dataset
for (i in 1:(p-1)){
col = dataset[,i]
cat = case_when(col<low10pct[i] ~ "0",
col>=low10pct[i] & col<q1[i] ~ "1",
col>=q1[i] & col<q2[i] ~ "2",
col>=q2[i] & col<q3[i] ~ "3",
col>=q3[i] & col<top10pct[i] ~ "4",
col>=top10pct[i] ~ "5")
df.cat[,i] = cat
}
return(df.cat)
}
train.cat = categorize(train.norm)
test.cat = categorize(test.norm)
head(train.cat)
## fixed.acidity volatile.acidity citric.acid residual.sugar chlorides free.sulfur.dioxide
## 1301 5 2 4 3 3 4
## 1823 5 5 2 2 1 2
## 2805 2 2 3 3 0 1
## 4446 0 4 1 3 1 2
## 988 4 3 2 3 4 3
## 4396 2 2 1 4 4 3
## total.sulfur.dioxide density pH sulphates alcohol quality
## 1301 3 3 0 0 2 6
## 1823 1 1 1 0 5 7
## 2805 0 2 2 4 3 8
## 4446 2 2 5 1 3 6
## 988 3 3 2 2 2 6
## 4396 3 4 2 5 1 5
tr.cat.lm = lm(quality~., data = train.cat)
summary(tr.cat.lm)
##
## Call:
## lm(formula = quality ~ ., data = train.cat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.2450 -0.4729 -0.0221 0.4634 3.2177
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.297782 0.118161 44.835 < 2e-16 ***
## fixed.acidity1 -0.061804 0.048363 -1.278 0.201346
## fixed.acidity2 0.031633 0.045541 0.695 0.487335
## fixed.acidity3 0.044674 0.047553 0.939 0.347552
## fixed.acidity4 0.118540 0.052870 2.242 0.025005 *
## fixed.acidity5 -0.006463 0.060968 -0.106 0.915582
## volatile.acidity1 -0.094588 0.048974 -1.931 0.053499 .
## volatile.acidity2 -0.284745 0.046121 -6.174 7.27e-10 ***
## volatile.acidity3 -0.443219 0.046554 -9.521 < 2e-16 ***
## volatile.acidity4 -0.451411 0.049706 -9.082 < 2e-16 ***
## volatile.acidity5 -0.539100 0.055774 -9.666 < 2e-16 ***
## citric.acid1 0.121115 0.047380 2.556 0.010615 *
## citric.acid2 0.376397 0.046651 8.068 9.13e-16 ***
## citric.acid3 0.265753 0.046331 5.736 1.04e-08 ***
## citric.acid4 0.198994 0.048989 4.062 4.95e-05 ***
## citric.acid5 0.183801 0.051776 3.550 0.000389 ***
## residual.sugar1 0.130548 0.052593 2.482 0.013094 *
## residual.sugar2 0.327616 0.052249 6.270 3.95e-10 ***
## residual.sugar3 0.528849 0.060597 8.727 < 2e-16 ***
## residual.sugar4 0.781175 0.076252 10.245 < 2e-16 ***
## residual.sugar5 0.910234 0.091810 9.914 < 2e-16 ***
## chlorides1 0.005692 0.046706 0.122 0.903002
## chlorides2 -0.091877 0.045153 -2.035 0.041931 *
## chlorides3 -0.130462 0.047873 -2.725 0.006453 **
## chlorides4 -0.192611 0.051705 -3.725 0.000198 ***
## chlorides5 -0.214358 0.056921 -3.766 0.000168 ***
## free.sulfur.dioxide1 0.288731 0.048612 5.939 3.08e-09 ***
## free.sulfur.dioxide2 0.445220 0.046431 9.589 < 2e-16 ***
## free.sulfur.dioxide3 0.481210 0.049009 9.819 < 2e-16 ***
## free.sulfur.dioxide4 0.468402 0.055183 8.488 < 2e-16 ***
## free.sulfur.dioxide5 0.417932 0.060993 6.852 8.30e-12 ***
## total.sulfur.dioxide1 0.147315 0.048440 3.041 0.002370 **
## total.sulfur.dioxide2 0.170678 0.047244 3.613 0.000306 ***
## total.sulfur.dioxide3 0.144391 0.051214 2.819 0.004834 **
## total.sulfur.dioxide4 0.062783 0.058621 1.071 0.284231
## total.sulfur.dioxide5 0.012124 0.065750 0.184 0.853716
## density1 -0.115054 0.053491 -2.151 0.031539 *
## density2 -0.264043 0.064328 -4.105 4.12e-05 ***
## density3 -0.609737 0.083033 -7.343 2.47e-13 ***
## density4 -0.895847 0.106250 -8.431 < 2e-16 ***
## density5 -0.934590 0.127772 -7.315 3.06e-13 ***
## pH1 -0.006227 0.047061 -0.132 0.894738
## pH2 0.031411 0.044958 0.699 0.484786
## pH3 0.038228 0.046640 0.820 0.412465
## pH4 0.186123 0.051526 3.612 0.000307 ***
## pH5 0.242725 0.058466 4.152 3.37e-05 ***
## sulphates1 0.059732 0.048340 1.236 0.216645
## sulphates2 0.079923 0.047004 1.700 0.089137 .
## sulphates3 0.126369 0.046352 2.726 0.006431 **
## sulphates4 0.191432 0.050493 3.791 0.000152 ***
## sulphates5 0.212021 0.054382 3.899 9.81e-05 ***
## alcohol1 -0.079556 0.053299 -1.493 0.135611
## alcohol2 -0.070079 0.056975 -1.230 0.218768
## alcohol3 0.110158 0.063874 1.725 0.084670 .
## alcohol4 0.276248 0.075264 3.670 0.000245 ***
## alcohol5 0.568417 0.088273 6.439 1.33e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7317 on 4352 degrees of freedom
## Multiple R-squared: 0.3376, Adjusted R-squared: 0.3293
## F-statistic: 40.33 on 55 and 4352 DF, p-value: < 2.2e-16
tr.cat.lm.pred = predict(tr.cat.lm, test.cat[,-p])
tr.cat.lm.eval = eval(tr.cat.lm.pred, test.cat$quality, plot=T, title="ANOVA: ");unlist(tr.cat.lm.eval)
## rmse mae cor
## 0.6808479 0.5421466 0.5484151
I will further examine the pairwise interactions between the independent variables. Given the relatively large training size and the relatively small number of covariates, I can examine all pairwise interactions at once. After feeding all interaction pairs into the model, I will perform variable selection using a stepwise method.
tr.lm.interract = lm(quality~ .^2, data = train.norm)
summary(tr.lm.interract)
##
## Call:
## lm(formula = quality ~ .^2, data = train.norm)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4333 -0.4898 -0.0191 0.4357 3.0801
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.07864 0.71660 8.483 < 2e-16 ***
## fixed.acidity 5.24962 1.79061 2.932 0.003388 **
## volatile.acidity -6.65602 1.06780 -6.233 5.00e-10 ***
## citric.acid -0.57329 1.75719 -0.326 0.744246
## residual.sugar 16.79688 3.33288 5.040 4.85e-07 ***
## chlorides -0.21317 2.42545 -0.088 0.929968
## free.sulfur.dioxide -7.85107 2.73798 -2.867 0.004158 **
## total.sulfur.dioxide 3.28613 1.59292 2.063 0.039176 *
## density -21.93411 6.06909 -3.614 0.000305 ***
## pH 1.77909 0.97573 1.823 0.068321 .
## sulphates 1.70433 0.89972 1.894 0.058253 .
## alcohol -1.86535 1.00300 -1.860 0.062986 .
## fixed.acidity:volatile.acidity -5.46377 2.25859 -2.419 0.015599 *
## fixed.acidity:citric.acid -5.06638 3.38202 -1.498 0.134197
## fixed.acidity:residual.sugar 8.69740 4.43860 1.959 0.050119 .
## fixed.acidity:chlorides -9.31087 5.09144 -1.829 0.067508 .
## fixed.acidity:free.sulfur.dioxide 7.66103 5.58028 1.373 0.169862
## fixed.acidity:total.sulfur.dioxide -4.12905 3.28640 -1.256 0.209037
## fixed.acidity:density -15.14948 8.63142 -1.755 0.079303 .
## fixed.acidity:pH 3.21056 1.07112 2.997 0.002738 **
## fixed.acidity:sulphates 1.22283 1.58411 0.772 0.440197
## fixed.acidity:alcohol -1.93108 1.80006 -1.073 0.283426
## volatile.acidity:citric.acid 2.03214 1.48243 1.371 0.170504
## volatile.acidity:residual.sugar -10.94961 4.26643 -2.566 0.010308 *
## volatile.acidity:chlorides 1.22016 2.19746 0.555 0.578747
## volatile.acidity:free.sulfur.dioxide 3.11721 2.71744 1.147 0.251399
## volatile.acidity:total.sulfur.dioxide 0.45197 1.63279 0.277 0.781943
## volatile.acidity:density 26.19721 8.34878 3.138 0.001713 **
## volatile.acidity:pH 0.25759 1.26190 0.204 0.838262
## volatile.acidity:sulphates -0.87199 0.87996 -0.991 0.321773
## volatile.acidity:alcohol 7.81496 1.28031 6.104 1.12e-09 ***
## citric.acid:residual.sugar -8.26222 7.92146 -1.043 0.296998
## citric.acid:chlorides -0.83501 2.27321 -0.367 0.713393
## citric.acid:free.sulfur.dioxide 1.73601 3.58565 0.484 0.628299
## citric.acid:total.sulfur.dioxide -1.89037 2.17663 -0.868 0.385176
## citric.acid:density 11.27982 17.26019 0.654 0.513458
## citric.acid:pH 2.47834 1.82024 1.362 0.173411
## citric.acid:sulphates -1.10665 1.31364 -0.842 0.399594
## citric.acid:alcohol 1.37612 2.59093 0.531 0.595355
## residual.sugar:chlorides -39.06460 11.32822 -3.448 0.000569 ***
## residual.sugar:free.sulfur.dioxide -20.71341 12.74082 -1.626 0.104075
## residual.sugar:total.sulfur.dioxide -7.35648 7.45304 -0.987 0.323676
## residual.sugar:density -4.77874 3.38649 -1.411 0.158281
## residual.sugar:pH -0.87727 2.65244 -0.331 0.740856
## residual.sugar:sulphates -3.44287 3.70841 -0.928 0.353254
## residual.sugar:alcohol 0.60194 1.99708 0.301 0.763116
## chlorides:free.sulfur.dioxide 6.99704 4.18534 1.672 0.094636 .
## chlorides:total.sulfur.dioxide -6.18974 3.64383 -1.699 0.089449 .
## chlorides:density 66.66484 23.97168 2.781 0.005443 **
## chlorides:pH -8.18641 2.40929 -3.398 0.000685 ***
## chlorides:sulphates -4.01208 1.78780 -2.244 0.024873 *
## chlorides:alcohol 4.64300 3.67653 1.263 0.206702
## free.sulfur.dioxide:total.sulfur.dioxide -11.04844 1.20751 -9.150 < 2e-16 ***
## free.sulfur.dioxide:density 30.81759 26.03299 1.184 0.236561
## free.sulfur.dioxide:pH 1.29952 2.76443 0.470 0.638316
## free.sulfur.dioxide:sulphates 6.45906 1.77946 3.630 0.000287 ***
## free.sulfur.dioxide:alcohol 10.40772 4.13162 2.519 0.011803 *
## total.sulfur.dioxide:density 20.92010 14.57759 1.435 0.151335
## total.sulfur.dioxide:pH -2.69600 1.89144 -1.425 0.154123
## total.sulfur.dioxide:sulphates -5.70039 1.32263 -4.310 1.67e-05 ***
## total.sulfur.dioxide:alcohol 2.24932 2.28701 0.984 0.325408
## density:pH -8.59629 5.63886 -1.524 0.127464
## density:sulphates 1.78314 7.68131 0.232 0.816440
## density:alcohol -6.33866 2.65750 -2.385 0.017113 *
## pH:sulphates 1.10848 0.84065 1.319 0.187374
## pH:alcohol -0.03463 1.05015 -0.033 0.973694
## sulphates:alcohol -0.65384 1.14231 -0.572 0.567088
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7192 on 4339 degrees of freedom
## Multiple R-squared: 0.3612, Adjusted R-squared: 0.3515
## F-statistic: 37.17 on 66 and 4339 DF, p-value: < 2.2e-16
# variable selection using stepwise methods
lm0 = lm(quality ~ 1, data = train.norm)
tr.lm.interract.step = step(lm0, ~ (fixed.acidity + volatile.acidity +
citric.acid + residual.sugar + chlorides + free.sulfur.dioxide +
total.sulfur.dioxide + density + pH + sulphates + alcohol)^2,
direction = "both", trace = 0)
summary(tr.lm.interract.step)
##
## Call:
## lm(formula = quality ~ alcohol + volatile.acidity + residual.sugar +
## free.sulfur.dioxide + density + sulphates + pH + fixed.acidity +
## alcohol:volatile.acidity + alcohol:free.sulfur.dioxide +
## alcohol:pH + free.sulfur.dioxide:pH + pH:fixed.acidity +
## volatile.acidity:density + alcohol:density + free.sulfur.dioxide:fixed.acidity +
## residual.sugar:sulphates + residual.sugar:free.sulfur.dioxide +
## sulphates:pH + density:pH + volatile.acidity:sulphates +
## alcohol:residual.sugar + alcohol:fixed.acidity + volatile.acidity:residual.sugar +
## free.sulfur.dioxide:density + volatile.acidity:fixed.acidity,
## data = train.norm)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.5507 -0.4919 -0.0159 0.4463 3.0600
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.60805 0.42637 17.844 < 2e-16 ***
## alcohol -2.44638 0.64584 -3.788 0.000154 ***
## volatile.acidity -5.82928 0.63713 -9.149 < 2e-16 ***
## residual.sugar 9.57116 1.63763 5.845 5.45e-09 ***
## free.sulfur.dioxide -6.07640 2.07232 -2.932 0.003383 **
## density -11.82430 3.20925 -3.684 0.000232 ***
## sulphates 0.57740 0.31707 1.821 0.068663 .
## pH 0.01711 0.66194 0.026 0.979382
## fixed.acidity -1.26703 0.77222 -1.641 0.100918
## alcohol:volatile.acidity 7.14474 0.78436 9.109 < 2e-16 ***
## alcohol:free.sulfur.dioxide 13.14449 3.06526 4.288 1.84e-05 ***
## alcohol:pH 1.58120 0.77714 2.035 0.041946 *
## free.sulfur.dioxide:pH -5.72802 2.09939 -2.728 0.006389 **
## pH:fixed.acidity 4.91345 0.95413 5.150 2.72e-07 ***
## volatile.acidity:density 20.74528 4.61036 4.500 6.98e-06 ***
## alcohol:density -11.46318 2.29834 -4.988 6.35e-07 ***
## free.sulfur.dioxide:fixed.acidity 5.91850 4.02891 1.469 0.141903
## residual.sugar:sulphates -3.14400 1.10898 -2.835 0.004603 **
## residual.sugar:free.sulfur.dioxide -24.17534 9.30787 -2.597 0.009427 **
## sulphates:pH 1.39014 0.54367 2.557 0.010593 *
## density:pH -6.56424 2.56322 -2.561 0.010472 *
## volatile.acidity:sulphates -1.87283 0.71996 -2.601 0.009319 **
## alcohol:residual.sugar 4.54911 1.67718 2.712 0.006707 **
## alcohol:fixed.acidity 1.66326 0.96210 1.729 0.083920 .
## volatile.acidity:residual.sugar -6.97992 3.12919 -2.231 0.025759 *
## free.sulfur.dioxide:density 37.00826 19.08737 1.939 0.052579 .
## volatile.acidity:fixed.acidity -2.14583 1.39820 -1.535 0.124927
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7325 on 4379 degrees of freedom
## Multiple R-squared: 0.3311, Adjusted R-squared: 0.3272
## F-statistic: 83.38 on 26 and 4379 DF, p-value: < 2.2e-16
tr.lm.interract.step.pred = predict(tr.lm.interract.step, test.norm[,-p])
tr.lm.interract.step.eval = eval(tr.lm.interract.step.pred, test.norm$quality, plot=T, title="lm with interaction and var selection: ");unlist(tr.lm.interract.step.eval)
## rmse mae cor
## 0.7073700 0.5441501 0.5004769
I will build a random forest (rf) model using the randomForest function from the randomForest package. This model ensembles 1000 decision trees, with mtry = sqrt(p) (about 3) variables tried at each split.
tr.rf = randomForest(quality~., data = train.norm, ntree = 1000, mtry = sqrt(p))
tr.rf
##
## Call:
## randomForest(formula = quality ~ ., data = train.norm, ntree = 1000, mtry = sqrt(p))
## Type of random forest: regression
## Number of trees: 1000
## No. of variables tried at each split: 3
##
## Mean of squared residuals: 0.3657041
## % Var explained: 54.13
tr.rf.pred = predict(tr.rf, test.norm[,-p])
tr.rf.eval = eval(tr.rf.pred, test.norm$quality, plot = T, title = "Random Forest: "); unlist(tr.rf.eval)
## rmse mae cor
## 0.5144320 0.3727410 0.7772215
I will use the train function in the caret package to automatically tune the model hyperparameters. Here I am using 10-fold cross-validation (cv) with 2 repeats, trying 2, 3 or 6 variables (mtry) at each split, and using RMSE to compare model performance. Due to limited computational power, I only try a few combinations. The methods available in the train function can be listed by typing names(getModelInfo()).
Since this step is computationally expensive, I will use parallel computation from the doParallel package.
ct = trainControl(method = "repeatedcv", number = 10, repeats = 2)
grid_rf = expand.grid(.mtry = c(2, 3, 6))
set.seed(1)
cl = makePSOCKcluster(4)
registerDoParallel(cl)
tr.cvrf = train(quality~., data = train.norm,
method = 'rf',
metric = "RMSE",
trControl = ct,
tuneGrid = grid_rf)
stopCluster(cl)
save(tr.cvrf, file = "~/Downloads/wine_train_cvrf.RData")
load(file = "~/Downloads/wine_train_cvrf.RData"); tr.cvrf
## Random Forest
##
## 4405 samples
## 11 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 2 times)
## Summary of sample sizes: 3965, 3964, 3965, 3964, 3964, 3965, ...
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared MAE
## 2 0.6104983 0.5430114 0.4448273
## 3 0.6093373 0.5416098 0.4417419
## 6 0.6096012 0.5383133 0.4403383
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 3.
tr.cvrf.pred = predict(tr.cvrf, test.norm[,-p])
tr.cvrf.eval = eval(tr.cvrf.pred, test.norm$quality, plot = T, title = "Random Forest with CV: "); unlist(tr.cvrf.eval)
## rmse mae cor
## 0.5144424 0.3724882 0.7772617
A comparison of linear regression, quadratic regression, ANOVA and the random forest is shown in the table below. The random forest model performs much better than the other models:
knitr::kable(cbind(lm = unlist(tr.lm.eval),
quadratic = unlist(tr.qm.eval),
ANOVA = unlist(tr.cat.lm.eval),
rf = unlist(tr.rf.eval),
rf.cv = unlist(tr.cvrf.eval)) %>% round(3),
title = "Comparing linear regression, quatratic regression, ANOVA and random forest")
|      | lm   | quadratic | ANOVA | rf    | rf.cv |
|------|------|-----------|-------|-------|-------|
| rmse | 0.71 | 0.687     | 0.681 | 0.513 | 0.516 |
| mae  | 0.56 | 0.544     | 0.542 | 0.372 | 0.375 |
| cor  | 0.49 | 0.535     | 0.548 | 0.778 | 0.775 |
Next, I will apply a support vector machine (SVM) model using the ksvm function from the kernlab package. I will use the default rbfdot radial basis (Gaussian) kernel. Other available kernel options can be found by typing ?ksvm.
tr.svm = ksvm(quality ~ .,
data = train.norm,
scaled = F,
kernel = "rbfdot",
C = 1)
tr.svm.pred = predict(tr.svm, test.norm[,-p])
tr.svm.eval = eval(tr.svm.pred, test.norm$quality, plot = T, title = "SVM: "); unlist(tr.svm.eval)
## rmse mae cor
## 0.6379821 0.4767281 0.6185084
Similarly, I will use the train function to optimize the model hyperparameters. This step is computationally expensive.
cl = makePSOCKcluster(4)
registerDoParallel(cl)
tr = trainControl(method = "repeatedcv", number = 10, repeats = 2)
set.seed(1)
tr.svmRadial = train(quality ~.,
data = train.norm,
method = "svmRadial",
trControl=tr,
preProcess = NULL,
tuneLength = 10)
stopCluster(cl)
save(tr.svmRadial, file="~/Downloads/wine_train_cv_svmRadial.RData")
load(file = "~/Downloads/wine_train_cv_svmRadial.RData"); tr.svmRadial
## Support Vector Machines with Radial Basis Function Kernel
##
## 4405 samples
## 11 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 2 times)
## Summary of sample sizes: 3965, 3964, 3965, 3964, 3964, 3965, ...
## Resampling results across tuning parameters:
##
## C RMSE Rsquared MAE
## 0.25 0.7075534 0.3734247 0.5371903
## 0.50 0.7010562 0.3846698 0.5312537
## 1.00 0.6959988 0.3934084 0.5257879
## 2.00 0.6934331 0.3984556 0.5218951
## 4.00 0.6922973 0.4019772 0.5185848
## 8.00 0.6928832 0.4038856 0.5163878
## 16.00 0.7022211 0.3958122 0.5202775
## 32.00 0.7162806 0.3852693 0.5253449
## 64.00 0.7363789 0.3711688 0.5317107
## 128.00 0.7661621 0.3526886 0.5423699
##
## Tuning parameter 'sigma' was held constant at a value of 0.08049664
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.08049664 and C = 4.
tr.svmRadial.pred = predict(tr.svmRadial, test.norm[,-p])
tr.svmRadial.eval = eval(tr.svmRadial.pred, test.norm$quality, plot = T, title = "SVM with CV: "); unlist(tr.svmRadial.eval)
## rmse mae cor
## 0.6285840 0.4672283 0.6370404
It seems that the SVM model with a Gaussian kernel doesn't perform as well as the random forest model.
knitr::kable(cbind(lm = unlist(tr.lm.eval),
quadratic = unlist(tr.qm.eval),
ANOVA = unlist(tr.cat.lm.eval),
rf = unlist(tr.rf.eval),
rf.cv = unlist(tr.cvrf.eval),
svm = unlist(tr.svm.eval),
svm.cv = unlist(tr.svmRadial.eval)) %>% round(3),
caption = "")
|      | lm   | quadratic | ANOVA | rf    | rf.cv | svm   | svm.cv |
|------|------|-----------|-------|-------|-------|-------|--------|
| rmse | 0.71 | 0.687     | 0.681 | 0.513 | 0.516 | 0.638 | 0.628  |
| mae  | 0.56 | 0.544     | 0.542 | 0.372 | 0.375 | 0.477 | 0.467  |
| cor  | 0.49 | 0.535     | 0.548 | 0.778 | 0.775 | 0.619 | 0.638  |
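One possible follow-up for the SVM (a sketch only, not run here): instead of holding sigma constant, tune sigma and C jointly with an explicit grid. The grid values below are illustrative guesses around the sigma estimated by caret:
# Sketch: joint tuning grid for the radial-kernel SVM; grid values are illustrative
grid_svm = expand.grid(sigma = c(0.02, 0.05, 0.08, 0.16),
                       C = c(1, 2, 4, 8))
tr.svmGrid = train(quality ~ ., data = train.norm,
                   method = "svmRadial",
                   metric = "RMSE",
                   trControl = trainControl(method = "cv", number = 10),
                   tuneGrid = grid_svm)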
Next, I will fit a regression tree using the rpart function from the rpart package.
tr.rpart = rpart(quality~., data=train.norm)
rpart.plot(tr.rpart)
tr.rpart.pred = predict(tr.rpart, test.norm[,-p])
tr.rpart.eval = eval(tr.rpart.pred, test.norm$quality, plot=T, title = "Regression Tree:"); unlist(tr.rpart.eval)
## rmse mae cor
## 0.7022890 0.5609858 0.5052963
I will train a simple neural network with one hidden layer (2 neurons) and a sigmoid (logistic) activation function, using the neuralnet function from the neuralnet package.
tr.nn = neuralnet(quality ~ fixed.acidity + volatile.acidity + citric.acid +
residual.sugar + chlorides + free.sulfur.dioxide +
total.sulfur.dioxide + density + pH + sulphates + alcohol,
data=train.norm,
hidden = 2,
act.fct = "logistic")
save(tr.nn, file="~/Downloads/wine_train_nn.RData")
load(file="~/Downloads/wine_train_nn.RData")
plot(tr.nn) # for some reason, the plot doesn't show up here, so I am plotting it separately below.
NeuralNet Plot
tr.nn.pred = compute(tr.nn, test.norm[,-p])
tr.nn.pred = tr.nn.pred$net.result
tr.nn.eval = eval(tr.nn.pred, test.norm$quality, plot=T, title = "Neural Net: "); unlist(tr.nn.eval)
## rmse mae cor
## 0.6722937 0.5306325 0.5618459
keras
I will use the keras package, with tensorflow as the backend, to build a neural network with three hidden layers of 64, 32 and 16 neurons and a single-neuron output layer. I will use the relu activation function and minimize the mean squared error (mse) loss. Similar to the SVM model, this model needs substantial effort to tune its hyper-parameters. As the model performance below shows, the hyper-parameters used here don't give a great fit; I may try other hyper-parameters in the future to improve performance.
# see more instructions under: https://keras.rstudio.com
# devtools::install_github("rstudio/keras")
library(keras); #install_keras()
cl = makePSOCKcluster(4)
registerDoParallel(cl)
build_model = function() {
model = keras_model_sequential() %>%
layer_dense(units = 64, activation = "relu",
input_shape = dim(train.norm[,-p])[2]) %>%
layer_dense(units = 32, activation = "relu") %>%
layer_dense(units = 16, activation = "relu") %>%
layer_dense(units = 1)
model %>% compile(
loss = "mse",
optimizer = optimizer_rmsprop(),
metrics = list("mean_absolute_error")
)
model
}
model = build_model()
model %>% summary() # show the network structure
## ____________________________________________________________________________________________________
## Layer (type) Output Shape Param #
## ====================================================================================================
## dense_1 (Dense) (None, 64) 768
## ____________________________________________________________________________________________________
## dense_2 (Dense) (None, 32) 2080
## ____________________________________________________________________________________________________
## dense_3 (Dense) (None, 16) 528
## ____________________________________________________________________________________________________
## dense_4 (Dense) (None, 1) 17
## ====================================================================================================
## Total params: 3,393
## Trainable params: 3,393
## Non-trainable params: 0
## ____________________________________________________________________________________________________
#layer1 number of parameter = 11 * 64 + 64 = 768
#layer2 number of parameter = 64 * 32 + 32 = 2080
#layer3 number of parameter = 32 * 16 + 16 = 528
#layer4 number of parameter = 16 * 1 + 1 = 17
# Display training progress by printing a single dot for each completed epoch.
print_dot_callback = callback_lambda(
on_epoch_end = function(epoch, logs) {
if (epoch %% 100 == 0) cat("\n")
cat(".")
}
)
epochs = 500
# Fit the model and store training stats
history = model %>% fit(
train.norm[,-p] %>% as.matrix(),
train.norm[,p],
epochs = epochs,
validation_split = 0.2,
verbose = 0,
callbacks = list(print_dot_callback)
)
##
## ....................................................................................................
## ....................................................................................................
## ....................................................................................................
## ....................................................................................................
## ....................................................................................................
# plot the training history
plot(history, metrics = "mean_absolute_error", smooth = FALSE) +
coord_cartesian(ylim = c(0, 1))
#As we can see from the training history above, although the training MAE keeps decreasing, the validation MAE stops decreasing before epoch 200. So in the next step, I will add early stopping, which helps prevent overfitting.
early_stop = callback_early_stopping(monitor = "val_loss", patience = 50)
model = build_model()
history = model %>% fit(
train.norm[,-p] %>% as.matrix(),
train.norm[,p],
epochs = epochs,
validation_split = 0.2,
verbose = 0,
callbacks = list(early_stop, print_dot_callback)
)
##
## ....................................................................................................
## .................................................................................
# plot the training history
plot(history, metrics = "mean_absolute_error", smooth = FALSE) +
coord_cartesian(xlim = c(0, 300), ylim = c(0, 1))
stopCluster(cl)
I will then make predictions and evaluate model performance using the eval() function.
# evaluate model performance using test set
tr.nnkeras.pred = model %>% predict(test.norm[, -p] %>% as.matrix())
tr.nnkeras.pred = tr.nnkeras.pred[ , 1]
tr.nnkeras.eval = eval(tr.nnkeras.pred, test.norm$quality, plot = T, title = "Neural Net using keras: "); unlist(tr.nnkeras.eval)
## rmse mae cor
## 0.6693359 0.5229692 0.5889432
Considering that the y variable is an integer with fewer than 10 distinct values, I will also try treating it as a categorical variable and fit a random forest classification model. For simplicity, I will not normalize the data here.
df = raw %>%
mutate(quality = as.factor(quality))
set.seed(1) #split into train & test set
idx = createDataPartition(df$quality, p = 0.9, list=F)
train = df[idx,]; table(train$quality)
##
## 3 4 5 6 7 8 9
## 18 147 1312 1979 792 158 5
test = df[-idx,]; table(test$quality)
##
## 3 4 5 6 7 8 9
## 2 16 145 219 88 17 0
# Visualizing the relationship between the Y variable (quality) and the other covariates
long = gather(train, key = "features", value = "value", 1:(p-1))
ggplot(long, aes(x=quality, y= value)) +
geom_boxplot(aes(fill = quality)) +
coord_flip() +
facet_wrap(~ features, ncol = 3, scales = "free") +
labs(title = "Visualizing relationship of the Y variable (quality) and other covariants")
# fit a random forest model
tr.rf.fac= randomForest(quality~., data=train, ntree=1000, mtry=sqrt(p))
tr.rf.fac
##
## Call:
## randomForest(formula = quality ~ ., data = train, ntree = 1000, mtry = sqrt(p))
## Type of random forest: classification
## Number of trees: 1000
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 29.43%
## Confusion matrix:
## 3 4 5 6 7 8 9 class.error
## 3 0 0 7 11 0 0 0 1.0000000
## 4 0 33 75 38 1 0 0 0.7755102
## 5 0 9 930 366 7 0 0 0.2911585
## 6 0 3 234 1636 102 4 0 0.1733199
## 7 0 0 15 328 443 6 0 0.4406566
## 8 0 0 1 53 33 71 0 0.5506329
## 9 0 0 0 3 2 0 0 1.0000000
# evaluate model performance using the test set
tr.rf.fac.prob = predict(tr.rf.fac, test[,-p])
confM = confusionMatrix(tr.rf.fac.prob, test[,p])
# Visualize the confusion matrix table
long = confM$table %>% tbl_df()
ggplot(long, aes(Reference, Prediction)) +
geom_tile(aes(fill = log1p(n)), color="black") +
geom_text(aes(label = n)) +
scale_fill_gradient(low = "white", high = "steelblue") +
guides(fill = FALSE) +
labs(title = "Confusion Matrix")
From the confusion table above, although the overall accuracy is only around 70%, most of the misclassified observations fall within a fairly close range of the true quality level. Again, the model doesn't do well on extreme cases: for wines with quality 8, only 6 out of (5 + 6 + 6) = 17 observations are classified correctly, and the 2 wines with quality 3 are both classified as quality 6. There are no wines with quality 9 in the test set, due to the rarity of such wines, but I could have assigned a few quality-9 observations to the test set to check the performance of the model.
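A sketch of how a few of the rare quality-9 wines could be forced into the test set (this was not done in the split above; reserving 2 of them is an arbitrary choice):
# Sketch: reserve some quality-9 wines for the test set before partitioning
idx9  = which(df$quality == "9")                             # the handful of quality-9 wines
hold9 = sample(idx9, 2)                                      # arbitrarily keep 2 for testing
rest  = setdiff(seq_len(nrow(df)), hold9)
idx2  = rest[createDataPartition(df$quality[rest], p = 0.9, list = FALSE)]
test2 = df[c(setdiff(rest, idx2), hold9), ]
table(test2$quality)                                         # quality 9 now appears in the test set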
Putting all results together in the following table, it seems that the random forest model has the best performance. The best MAE is 0.37 and the best RMSE is 0.51. Considering that the wine quality score ranges from 0 to 10, this performance is acceptable. However, none of these models performs well for the extreme cases, i.e., wines with especially high or low quality. To be able to specifically classify wines with superior quality, we will train classification models in the next section.
knitr::kable(cbind(lm = unlist(tr.lm.eval),
quadratic = unlist(tr.qm.eval),
ANOVA = unlist(tr.cat.lm.eval),
lm.interac = unlist(tr.lm.interract.step.eval),
rf = unlist(tr.rf.eval),
rf.cv = unlist(tr.cvrf.eval),
svm = unlist(tr.svm.eval),
svm.cv = unlist(tr.svmRadial.eval),
regression.tree = unlist(tr.rpart.eval),
neural.net = unlist(tr.nn.eval),
neural.net.keras = unlist(tr.nnkeras.eval)) ,
caption = "Comparing all models from Session 1: ")
|      | lm        | quadratic | ANOVA     | lm.interac | rf        | rf.cv     | svm       | svm.cv    | regression.tree | neural.net | neural.net.keras |
|------|-----------|-----------|-----------|------------|-----------|-----------|-----------|-----------|-----------------|------------|------------------|
| rmse | 0.7097731 | 0.6865258 | 0.6808479 | 0.7073700  | 0.5144320 | 0.5144424 | 0.6379821 | 0.6285840 | 0.7022890       | 0.6722937  | 0.6693359        |
| mae  | 0.5602135 | 0.5438849 | 0.5421466 | 0.5441501  | 0.3727410 | 0.3724882 | 0.4767281 | 0.4672283 | 0.5609858       | 0.5306325  | 0.5229692        |
| cor  | 0.4901282 | 0.5354802 | 0.5484151 | 0.5004769  | 0.7772215 | 0.7772617 | 0.6185084 | 0.6370404 | 0.5052963       | 0.5618459  | 0.5889432        |
Suppose that my clients have a special requirement to identify wines with quality >= 7, defined as excellent wines. In this situation, I will build a classification model to identify excellent wines.
I will create a new binary y variable called "excellent". Since this label is imbalanced, i.e., only about 1/5 of the wines are excellent, I will use the createDataPartition function to partition the data while preserving the label proportion in the training and testing sets. For simplicity, I will not normalize the datasets in this section.
df = raw %>%
mutate(excellent = ifelse(quality >= 7, "1", "0") %>% as.factor()) %>%
select(-quality); dim(df)
## [1] 4898 12
table(df$excellent)
##
## 0 1
## 3838 1060
set.seed(1)
idx = createDataPartition(df$excellent, p = 0.9, list=F)
train.x = df[idx, -p] %>% as.matrix(); dim(train.x)
## [1] 4409 11
train.y = df[idx, p]; table(train.y)
## train.y
## 0 1
## 3455 954
train = data.frame(train.x, excellent=train.y)
test.x = df[-idx, -p] %>% as.matrix(); dim(test.x)
## [1] 489 11
test.y = df[-idx, p]; table(test.y)
## test.y
## 0 1
## 383 106
test = data.frame(test.x, excellent=test.y)
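Since only about a fifth of the wines are labeled excellent, some classifiers might benefit from rebalancing the training set. One option (an aside, not used in the models below) is caret's upSample(), which resamples the minority class with replacement:
# Aside, not used below: oversample the minority class so both classes are equally represented
train.bal = upSample(x = train[, -p], y = train$excellent, yname = "excellent")
table(train.bal$excellent)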
The following plot shows the relationship between the covariates and the binary Y variable. As we can see from the plots, excellent wines appear to have smaller variation and fewer outliers in their covariates.
long = gather(train, key = "features", value = "value", 1:(p-1))
ggplot(long, aes(x=excellent, y= value)) +
geom_boxplot(aes(fill = excellent)) +
coord_flip() +
facet_wrap(~ features, ncol = 3, scales = "free") +
labs(title = "Visualizing relationship of the Y variable (excellent) and other covariates",
subtitle = "Note: Wine quality >= 7 is defined to be excellent wine")
Then I will perform a basic principal component analysis (PCA) to visualize the separability of the y variable on the first 2 PCs. From the result below, although the 1st PC accounts for 92.6% of the variance and the first 2 PCs together account for > 99% of the variance, the two classes are not separable by them. This indicates that a model using a linear combination of the independent variables will probably not work well.
pca = prcomp(test[,-p]) # use test set here for easy visualization (less data point overlap)
summary(pca)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9
## Standard deviation 44.8620 11.86605 4.38839 1.05198 0.76440 0.13914 0.11456 0.09686 0.08588
## Proportion of Variance 0.9256 0.06475 0.00886 0.00051 0.00027 0.00001 0.00001 0.00000 0.00000
## Cumulative Proportion 0.9256 0.99034 0.99920 0.99971 0.99998 0.99999 0.99999 1.00000 1.00000
## PC10 PC11
## Standard deviation 0.01824 0.0005052
## Proportion of Variance 0.00000 0.0000000
## Cumulative Proportion 1.00000 1.0000000
autoplot(pca, data = test, colour = 'excellent')
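Because the covariates were not normalized in this section, the unscaled PCA is dominated by the covariates with the largest numeric ranges. As an aside (not part of the original analysis), the PCA can be repeated on standardized covariates for a fairer picture:
# Aside: PCA on standardized covariates spreads the variance across more components
pca.scaled = prcomp(test[, -p], scale. = TRUE)
summary(pca.scaled)
autoplot(pca.scaled, data = test, colour = 'excellent')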
I will start with a basic logistic regression (glm).
tr.glm = glm(excellent ~. , data = train, family = binomial)
summary(tr.glm)
##
## Call:
## glm(formula = excellent ~ ., family = binomial, data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.0503 -0.6704 -0.4143 -0.1738 2.8420
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 6.058e+02 9.823e+01 6.167 6.95e-10 ***
## fixed.acidity 5.175e-01 9.407e-02 5.501 3.78e-08 ***
## volatile.acidity -4.040e+00 5.166e-01 -7.820 5.28e-15 ***
## citric.acid -6.788e-01 4.172e-01 -1.627 0.10368
## residual.sugar 2.861e-01 3.723e-02 7.683 1.56e-14 ***
## chlorides -1.275e+01 3.998e+00 -3.190 0.00142 **
## free.sulfur.dioxide 7.811e-03 3.261e-03 2.395 0.01660 *
## total.sulfur.dioxide -1.025e-04 1.586e-03 -0.065 0.94847
## density -6.281e+02 9.956e+01 -6.308 2.82e-10 ***
## pH 3.161e+00 4.470e-01 7.072 1.52e-12 ***
## sulphates 2.291e+00 3.647e-01 6.283 3.33e-10 ***
## alcohol 1.774e-01 1.193e-01 1.487 0.13700
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 4605.5 on 4408 degrees of freedom
## Residual deviance: 3734.5 on 4397 degrees of freedom
## AIC: 3758.5
##
## Number of Fisher Scoring iterations: 6
tr.glm.prob = predict(tr.glm, data.frame(test.x), type="response")
tr.glm.eval = eval_class(tr.glm.prob, test.y, plot=T, title="Logistic Reg: ");tr.glm.eval
## $auc
## [1] 0.803
##
## $confusionMatrix
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 305 35
## 1 78 71
##
## Accuracy : 0.7689
## 95% CI : (0.729, 0.8056)
## No Information Rate : 0.7832
## P-Value [Acc > NIR] : 0.7957
##
## Kappa : 0.4065
## Mcnemar's Test P-Value : 7.782e-05
##
## Sensitivity : 0.6698
## Specificity : 0.7963
## Pos Pred Value : 0.4765
## Neg Pred Value : 0.8971
## Prevalence : 0.2168
## Detection Rate : 0.1452
## Detection Prevalence : 0.3047
## Balanced Accuracy : 0.7331
##
## 'Positive' Class : 1
##
As we can see from the model summary and plots above, the AUC is 0.803, which is not bad. I chose the cutoff that maximizes the Kappa statistic. The model has an overall accuracy of 0.77, a sensitivity of 0.67, a precision (PPV) of 0.48 and a Kappa of 0.41.
Note: when choosing the cutoff, I don't necessarily need to use the one that maximizes the Kappa statistic. The choice also depends largely on the clients' requirements. For example, suppose clients care more about precision (PPV) than sensitivity: misclassifying a non-excellent wine as excellent costs much more than missing some excellent wines, because clients would waste the whole preparation process and end up with a non-excellent wine. In that case, I would be more stringent, i.e., raise the cutoff and sacrifice some sensitivity for higher precision.
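As a sketch of that idea (the 50% sensitivity floor below is an arbitrary illustrative requirement, not a client specification), one could scan the cutoffs and keep the one with the highest precision among those that still meet the sensitivity floor:
# Sketch: choose the cutoff that maximizes precision (PPV) subject to a minimum-sensitivity constraint
cuts = seq(0.01, 0.99, by = 0.01)
perf = sapply(cuts, function(cut) {
  pred = factor(ifelse(tr.glm.prob >= cut, 1, 0), levels = c(0, 1))
  cm = confusionMatrix(pred, test.y, positive = "1")
  c(sens = cm$byClass[["Sensitivity"]], ppv = cm$byClass[["Pos Pred Value"]])
})
ok = which(perf["sens", ] >= 0.5)        # arbitrary 50% sensitivity floor
cuts[ok[which.max(perf["ppv", ok])]]     # the most stringent acceptable cutoff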
Next, I will build a LASSO model with 10-fold CV using the cv.glmnet function from the glmnet package. As we can see from the summary below, the LASSO model doesn't outperform the logistic model.
set.seed(1)
tr.cvlasso = cv.glmnet(train.x, train.y,
family = "binomial",
type.measure = "auc")
coef(tr.cvlasso, s=tr.cvlasso$lambda.1se) # show model coefficients
## 12 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) -6.381504e+00
## fixed.acidity .
## volatile.acidity -3.399210e+00
## citric.acid -2.901506e-01
## residual.sugar 3.631815e-02
## chlorides -1.275182e+01
## free.sulfur.dioxide 6.831982e-03
## total.sulfur.dioxide -6.136448e-04
## density -5.368455e+00
## pH 7.389874e-01
## sulphates 1.049691e+00
## alcohol 8.067396e-01
tr.cvlasso.prob = predict(tr.cvlasso,
test.x,
type="response",
s=tr.cvlasso$lambda.1se)
tr.cvlasso.eval = eval_class(tr.cvlasso.prob, test.y, plot = T, title = "LASSO with CV");tr.cvlasso.eval
## $auc
## [1] 0.795
##
## $confusionMatrix
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 316 40
## 1 67 66
##
## Accuracy : 0.7812
## 95% CI : (0.7419, 0.8171)
## No Information Rate : 0.7832
## P-Value [Acc > NIR] : 0.56933
##
## Kappa : 0.4099
## Mcnemar's Test P-Value : 0.01195
##
## Sensitivity : 0.6226
## Specificity : 0.8251
## Pos Pred Value : 0.4962
## Neg Pred Value : 0.8876
## Prevalence : 0.2168
## Detection Rate : 0.1350
## Detection Prevalence : 0.2720
## Balanced Accuracy : 0.7239
##
## 'Positive' Class : 1
##
I will build a decision tree model, first as a single decision tree and then with boosting. The boosted decision tree outperforms all my previous models, with a Kappa of 0.59.
#decision tree
tr.dt = C5.0(train.x, train.y)
tr.dt
##
## Call:
## C5.0.default(x = train.x, y = train.y)
##
## Classification Tree
## Number of samples: 4409
## Number of predictors: 11
##
## Tree size: 76
##
## Non-standard options: attempt to group attributes
tr.dt.pred = predict(tr.dt, test.x)
confMat = confusionMatrix(tr.dt.pred, test.y, positive="1")
tr.dt.eval = list(auc = NA, confusionMatrix = confMat); tr.dt.eval
## $auc
## [1] NA
##
## $confusionMatrix
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 338 51
## 1 45 55
##
## Accuracy : 0.8037
## 95% CI : (0.7657, 0.838)
## No Information Rate : 0.7832
## P-Value [Acc > NIR] : 0.1483
##
## Kappa : 0.4098
## Mcnemar's Test P-Value : 0.6098
##
## Sensitivity : 0.5189
## Specificity : 0.8825
## Pos Pred Value : 0.5500
## Neg Pred Value : 0.8689
## Prevalence : 0.2168
## Detection Rate : 0.1125
## Detection Prevalence : 0.2045
## Balanced Accuracy : 0.7007
##
## 'Positive' Class : 1
##
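The AUC is reported as NA here (and for the boosted tree and random forest models below) because predict() returns hard class labels rather than probabilities. As a minimal sketch, assuming the type = "prob" option of C50's predict method, class probabilities could be passed to eval_class() so that an AUC is reported for the tree models as well.
# sketch: feed class probabilities to eval_class() to also obtain an AUC
tr.dt.prob = predict(tr.dt, test.x, type = "prob")[, "1"]  # estimated P(excellent) from the C5.0 tree
eval_class(tr.dt.prob, test.y)                             # now returns auc in addition to the confusion matrix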
# using boosting
tr.dtboost = C5.0(train.x, train.y, trials = 10)
tr.dtboost
##
## Call:
## C5.0.default(x = train.x, y = train.y, trials = 10)
##
## Classification Tree
## Number of samples: 4409
## Number of predictors: 11
##
## Number of boosting iterations: 10
## Average tree size: 73.9
##
## Non-standard options: attempt to group attributes
tr.dtboost.pred = predict(tr.dtboost, test.x)
confMat = confusionMatrix(tr.dtboost.pred, test.y, positive="1")
tr.dtboost.eval = list(auc = NA, confusionMatrix = confMat); tr.dtboost.eval
## $auc
## [1] NA
##
## $confusionMatrix
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 365 44
## 1 18 62
##
## Accuracy : 0.8732
## 95% CI : (0.8404, 0.9014)
## No Information Rate : 0.7832
## P-Value [Acc > NIR] : 2.091e-07
##
## Kappa : 0.5903
## Mcnemar's Test P-Value : 0.001498
##
## Sensitivity : 0.5849
## Specificity : 0.9530
## Pos Pred Value : 0.7750
## Neg Pred Value : 0.8924
## Prevalence : 0.2168
## Detection Rate : 0.1268
## Detection Prevalence : 0.1636
## Balanced Accuracy : 0.7690
##
## 'Positive' Class : 1
##
As in Session 1, I will build a random forest model and use the train function to choose the best hyper-parameters. The best kappa from this model reaches 0.64, with an overall accuracy of 0.89, an improvement over the previous models.
tr.rfclass = randomForest(excellent~., data=train, ntree=1000, mtry=sqrt(p))
tr.rfclass
##
## Call:
## randomForest(formula = excellent ~ ., data = train, ntree = 1000, mtry = sqrt(p))
## Type of random forest: classification
## Number of trees: 1000
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 10.95%
## Confusion matrix:
## 0 1 class.error
## 0 3339 116 0.03357453
## 1 367 587 0.38469602
tr.rfclass.prob = predict(tr.rfclass, test.x)
confMat = confusionMatrix(tr.rfclass.prob, test.y, positive="1")
tr.rfclass.eval = list(auc = NA,
confusionMatrix = confMat); tr.rfclass.eval
## $auc
## [1] NA
##
## $confusionMatrix
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 367 41
## 1 16 65
##
## Accuracy : 0.8834
## 95% CI : (0.8516, 0.9105)
## No Information Rate : 0.7832
## P-Value [Acc > NIR] : 6.216e-09
##
## Kappa : 0.6247
## Mcnemar's Test P-Value : 0.001478
##
## Sensitivity : 0.6132
## Specificity : 0.9582
## Pos Pred Value : 0.8025
## Neg Pred Value : 0.8995
## Prevalence : 0.2168
## Detection Rate : 0.1329
## Detection Prevalence : 0.1656
## Balanced Accuracy : 0.7857
##
## 'Positive' Class : 1
##
ct = trainControl(method = "repeatedcv", number = 10, repeats = 2)
grid_rf = expand.grid(.mtry = c(2, 3, 6))
set.seed(1)
cl = makePSOCKcluster(4)
registerDoParallel(cl)
tr.cvrfclass = train(excellent~., data = train,
method = 'rf',
metric = "Kappa",
trControl = ct,
tuneGrid = grid_rf)
stopCluster(cl)
save(tr.cvrfclass, file = "~/Downloads/wine_train_cvrfclass.RData")
load(file = "~/Downloads/wine_train_cvrfclass.RData"); tr.cvrfclass
## Random Forest
##
## 4409 samples
## 11 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 2 times)
## Summary of sample sizes: 3968, 3967, 3969, 3967, 3969, 3968, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.8827441 0.6146437
## 3 0.8842170 0.6216576
## 6 0.8835426 0.6216067
##
## Kappa was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 3.
tr.cvrfclass.pred = predict(tr.cvrfclass, test.x)
confMat = confusionMatrix(tr.cvrfclass.pred, test.y, positive = "1")
tr.cvrfclass.eval = list(auc = NA,
confusionMatrix = confMat); tr.cvrfclass.eval
## $auc
## [1] NA
##
## $confusionMatrix
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 369 40
## 1 14 66
##
## Accuracy : 0.8896
## 95% CI : (0.8584, 0.9159)
## No Information Rate : 0.7832
## P-Value [Acc > NIR] : 5.991e-10
##
## Kappa : 0.6431
## Mcnemar's Test P-Value : 0.0006688
##
## Sensitivity : 0.6226
## Specificity : 0.9634
## Pos Pred Value : 0.8250
## Neg Pred Value : 0.9022
## Prevalence : 0.2168
## Detection Rate : 0.1350
## Detection Prevalence : 0.1636
## Balanced Accuracy : 0.7930
##
## 'Positive' Class : 1
##
The SVM model is built in the same way as in Session 1. It does not perform as well as the random forest model here; however, with a different kernel and further hyper-parameter tuning it could potentially reach better performance (a tuning sketch follows the output below).
tr.svmclass = ksvm(excellent ~ . ,
data=train,
kernel='rbfdot',
prob.model = T,
C=1)
tr.svmclass
## Support Vector Machine object of class "ksvm"
##
## SV type: C-svc (classification)
## parameter : cost C = 1
##
## Gaussian Radial Basis kernel function.
## Hyperparameter : sigma = 0.0810442293731343
##
## Number of Support Vectors : 1876
##
## Objective Function Value : -1634.86
## Training error : 0.160581
## Probability model included.
tr.svm.pred = predict(tr.svmclass, test.x, type = "probabilities")[,2]
tr.svm.eval = eval_class(tr.svm.pred, test.y, plot = T, title = "SVM: "); tr.svm.eval
## $auc
## [1] 0.857
##
## $confusionMatrix
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 326 27
## 1 57 79
##
## Accuracy : 0.8282
## 95% CI : (0.7918, 0.8606)
## No Information Rate : 0.7832
## P-Value [Acc > NIR] : 0.007917
##
## Kappa : 0.5411
## Mcnemar's Test P-Value : 0.001555
##
## Sensitivity : 0.7453
## Specificity : 0.8512
## Pos Pred Value : 0.5809
## Neg Pred Value : 0.9235
## Prevalence : 0.2168
## Detection Rate : 0.1616
## Detection Prevalence : 0.2781
## Balanced Accuracy : 0.7982
##
## 'Positive' Class : 1
##
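For example, here is a minimal tuning sketch using caret's svmRadial method, which tunes the RBF kernel's sigma together with the cost C; the grid values and 5-fold CV setup are illustrative assumptions, not the settings used above, and other kernels (e.g. method = "svmPoly") could be tried the same way.
# sketch: tune sigma and C of the RBF-kernel SVM via caret (fits kernlab under the hood)
ct.svm = trainControl(method = "cv", number = 5)
grid.svm = expand.grid(.sigma = c(0.05, 0.08, 0.15),  # illustrative values around kernlab's sigma estimate
.C = c(0.5, 1, 2, 4))
set.seed(1)
tr.cvsvm = train(excellent ~ ., data = train,
method = "svmRadial",
metric = "Kappa",
trControl = ct.svm,
tuneGrid = grid.svm)
tr.cvsvm$bestTune  # the chosen sigma and C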
Finally, I collect all the classification performance measures in the table below. Consistent with my conclusion in Session 1, the random forest model out-performs all the other models, with the highest accuracy (0.89) and the highest kappa (0.64).
glm = c(auc = tr.glm.eval$auc,
tr.glm.eval$confusionMatrix$overall,
tr.glm.eval$confusionMatrix$byClass)
cv.lasso = c(auc = tr.cvlasso.eval$auc,
tr.cvlasso.eval$confusionMatrix$overall,
tr.cvlasso.eval$confusionMatrix$byClass)
decision.tree = c(auc = tr.dt.eval$auc,
tr.dt.eval$confusionMatrix$overall,
tr.dt.eval$confusionMatrix$byClass)
decision.tree.boost = c(auc = tr.dtboost.eval$auc,
tr.dtboost.eval$confusionMatrix$overall,
tr.dtboost.eval$confusionMatrix$byClass)
rf = c(auc = tr.rfclass.eval$auc,
tr.rfclass.eval$confusionMatrix$overall,
tr.rfclass.eval$confusionMatrix$byClass)
cv.rf = c(auc = tr.cvrfclass.eval$auc,
tr.cvrfclass.eval$confusionMatrix$overall,
tr.cvrfclass.eval$confusionMatrix$byClass)
svm = c(auc = tr.svm.eval$auc,
tr.svm.eval$confusionMatrix$overall,
tr.svm.eval$confusionMatrix$byClass)
all = cbind(glm, cv.lasso,
decision.tree, decision.tree.boost,
rf, cv.rf, svm) %>% data.frame()
knitr::kable(all %>% round(3),
caption = "comparing all models")
| | glm | cv.lasso | decision.tree | decision.tree.boost | rf | cv.rf | svm |
|---|---|---|---|---|---|---|---|
| auc | 0.803 | 0.795 | NA | NA | NA | NA | 0.857 |
| Accuracy | 0.769 | 0.781 | 0.804 | 0.873 | 0.883 | 0.890 | 0.828 |
| Kappa | 0.407 | 0.410 | 0.410 | 0.590 | 0.625 | 0.643 | 0.541 |
| AccuracyLower | 0.729 | 0.742 | 0.766 | 0.840 | 0.852 | 0.858 | 0.792 |
| AccuracyUpper | 0.806 | 0.817 | 0.838 | 0.901 | 0.911 | 0.916 | 0.861 |
| AccuracyNull | 0.783 | 0.783 | 0.783 | 0.783 | 0.783 | 0.783 | 0.783 |
| AccuracyPValue | 0.796 | 0.569 | 0.148 | 0.000 | 0.000 | 0.000 | 0.008 |
| McnemarPValue | 0.000 | 0.012 | 0.610 | 0.001 | 0.001 | 0.001 | 0.002 |
| Sensitivity | 0.670 | 0.623 | 0.519 | 0.585 | 0.613 | 0.623 | 0.745 |
| Specificity | 0.796 | 0.825 | 0.883 | 0.953 | 0.958 | 0.963 | 0.851 |
| Pos Pred Value | 0.477 | 0.496 | 0.550 | 0.775 | 0.802 | 0.825 | 0.581 |
| Neg Pred Value | 0.897 | 0.888 | 0.869 | 0.892 | 0.900 | 0.902 | 0.924 |
| Precision | 0.477 | 0.496 | 0.550 | 0.775 | 0.802 | 0.825 | 0.581 |
| Recall | 0.670 | 0.623 | 0.519 | 0.585 | 0.613 | 0.623 | 0.745 |
| F1 | 0.557 | 0.552 | 0.534 | 0.667 | 0.695 | 0.710 | 0.653 |
| Prevalence | 0.217 | 0.217 | 0.217 | 0.217 | 0.217 | 0.217 | 0.217 |
| Detection Rate | 0.145 | 0.135 | 0.112 | 0.127 | 0.133 | 0.135 | 0.162 |
| Detection Prevalence | 0.305 | 0.272 | 0.204 | 0.164 | 0.166 | 0.164 | 0.278 |
| Balanced Accuracy | 0.733 | 0.724 | 0.701 | 0.769 | 0.786 | 0.793 | 0.798 |

Table: comparing all models
all$evaluate = rownames(all)
long = gather(all, key="method", value = "measure", 1:7)
ggplot(long, aes(method, evaluate)) +
geom_tile(aes(fill = measure), colour = "white") +
scale_fill_gradient(low = "white", high = "steelblue") +
theme(axis.text.x = element_text(angle = 20, hjust = 1)) +
labs(title = "Visualizing model performance measures")+
geom_text(aes(label = round(measure,2))) +
guides(fill = FALSE)
After trying many models and optimizing their hyper-parameters, this study concludes that the random forest model performs best for both prediction and classification. For predicting wine quality, the random forest gives the best MAE of 0.37 and RMSE of 0.51, which is not bad considering that wine quality can range from 0 to 10. For classifying excellent wines, the random forest gives the best overall accuracy of 0.89, kappa of 0.64, sensitivity of 0.62, and precision of 0.82.
Although the random forest models built in this study perform the best, they are still far from perfect. To further improve model performance, I could try the following steps:
- Take a closer look at the misclassified observations in the confusion matrix (or, in the prediction case, at observations with large gaps between predicted and true quality) and try to understand why these wines are classified wrongly (a sketch follows this list).
- Learn some wine quality-control knowledge to get a better sense of the prior knowledge in the wine-producing business, so that I might properly transform some features or interpret interactions between features.
- Spend more time tuning model hyper-parameters, and ask colleagues and my manager for further suggestions.
- If computing power is a limit, ask my manager whether there is a way to gain more, for example by using on-demand cloud computing, upgrading my computer, or using a computing cluster.
- Communicate with the clients to understand their goal more clearly: whether it is to predict wine quality or to pick out wines of superior quality. Also try to understand whether they have more tolerance for type I or type II errors, so that I can set the model cutoff appropriately.
- Ask the clients whether more data are available, especially for under-represented classes such as wines with very high or very low quality. With more data, I could build a more complex model, such as a neural network with more layers.
- Ask the clients whether other related features are available; adding such features could further improve the model.
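As an example of the first step, here is a minimal sketch (assuming the tr.cvrfclass.pred, test.x, and test.y objects used above) of pulling out the test wines misclassified by the tuned random forest for closer inspection.
# sketch: collect the misclassified test wines together with their true and predicted labels
miscl = which(tr.cvrfclass.pred != test.y)
misclassified = data.frame(test.x[miscl, , drop = FALSE],
truth = test.y[miscl],
predicted = tr.cvrfclass.pred[miscl])
summary(misclassified)  # compare feature distributions with the correctly classified wines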