This report aims to build a machine learning model that predicts the alcohol content of white wines. Four types of regression models were considered, namely multiple linear regression and the regularized models Lasso, Ridge, and Elastic Net. The data was taken from the Wine Quality Datasets at http://www3.dsi.uminho.pt/pcortez/wine/.
set.seed(1234)
library(ggplot2) #plotting
library(RCurl) #download the csv file
library(glmnet) #ridge, lasso, and elastic net models
library(corrplot) #correlation plot
library(ModelMetrics) #error metrics
library(lattice) #densityplot
#load the data
url <- "https://raw.githubusercontent.com/GrejSegura/DataBank/master/wine.csv"
get <- getURL(url)
wine_data <- read.csv(text = get, sep = ",", header = T)
wine_data <- wine_data[-1, ] #drop the first row
str(wine_data)
## 'data.frame': 4431 obs. of 12 variables:
## $ fixed.acidity : num 6.8 7.2 7.2 8.6 6.5 6.6 7 5.5 7 7.4 ...
## $ volatile.acidity : num 0.32 0.39 0.29 0.37 0.38 0.24 0.17 0.16 0.24 0.28 ...
## $ citric.acid : num 0.21 0.62 0.53 0.7 0.34 0.29 0.31 0.22 0.51 0.36 ...
## $ residual.sugar : num 2.2 11 18.1 12.2 3.4 ...
## $ chlorides : num 0.044 0.047 0.047 0.039 0.036 0.023 0.034 0.03 0.029 0.028 ...
## $ free.sulfur.dioxide : num 15 66 59 21 34 19 34 30 55 42 ...
## $ total.sulfur.dioxide: num 68 178 182 158 200 86 132 102 227 105 ...
## $ density : num 0.993 0.998 0.999 0.998 0.994 ...
## $ pH : num 3.17 3.16 3.09 3 3.14 3.25 3.36 3.24 3.03 2.99 ...
## $ sulphates : num 0.39 0.5 0.52 0.73 0.76 0.45 0.48 0.36 0.61 0.39 ...
## $ alcohol : num 9.4 8.7 9.6 9.3 10 12.5 9.6 9.4 9.5 12.4 ...
## $ quality : int 6 5 5 6 5 6 7 6 5 7 ...
The data has 4431 observations and 12 variables.
Before building a prediction model, we first examine the data and its underlying summary statistics.
summary(wine_data$alcohol)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.00 9.50 10.40 10.56 11.40 14.20
var(wine_data$alcohol)
## [1] 1.531859
densityplot(wine_data$alcohol)
Alcohol has a mean of 10.56 and a variance of 1.53. The mean is slightly above the median (10.40), and the density plot shows that the distribution is skewed to the right.
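Since ggplot2 is already loaded, an equivalent density plot can be drawn with it as well (a quick sketch; the lattice densityplot above is the one referenced in this report):
#same density estimate drawn with ggplot2
ggplot(wine_data, aes(x = alcohol)) + geom_density()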
wine_data$quality <- as.factor(wine_data$quality)
plot(wine_data$quality)
Here we can see that the most common quality rating in the data is 6, followed by 5, 7, 8, 4, 3, and 9. Note that a higher number means a higher-quality wine.
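The exact counts behind this plot can be checked with a quick tabulation (a sketch using the factor created above; the counts depend on the loaded data):
#count the number of wines in each quality rating
table(wine_data$quality)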
Next we look at the correlations among the variables. Our focus here is the correlation of the other variables with alcohol. It is best to scale the data before proceeding with the calculation.
## FIND CORRELATIONS ##
wine_scaled <- scale(wine_data[,1:11]) #do not include quality
cor <- cor(wine_scaled)
corrplot(cor, method = "number")
Among the variables, density has the strongest (negative) correlation with alcohol at -0.78. No variable shows a notably strong positive correlation with alcohol.
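To see these correlations directly, we can pull out the alcohol column of the correlation matrix and sort it (a quick check using the objects already created above):
#correlations of each variable with alcohol, sorted from most negative to most positive
sort(cor[, "alcohol"]) #alcohol's own correlation of 1 appears last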
To examine the effects of these variables on the alcohol content, we conduct a linear regression analysis.
## IDENTIFY INFLUENTIAL FACTORS ##
model.1 <- lm(alcohol ~., as.data.frame(wine_data[, -12]))
summary(model.1)
##
## Call:
## lm(formula = alcohol ~ ., data = as.data.frame(wine_data[, -12]))
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.3959 -0.2601 -0.0381 0.2202 15.8417
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.840e+02 5.298e+00 129.091 < 2e-16 ***
## fixed.acidity 5.186e-01 1.027e-02 50.491 < 2e-16 ***
## volatile.acidity 9.567e-01 7.071e-02 13.531 < 2e-16 ***
## citric.acid 3.840e-01 5.953e-02 6.450 1.24e-10 ***
## residual.sugar 2.417e-01 2.927e-03 82.585 < 2e-16 ***
## chlorides -3.910e-01 3.380e-01 -1.157 0.2475
## free.sulfur.dioxide -3.424e-03 5.219e-04 -6.560 6.00e-11 ***
## total.sulfur.dioxide 5.698e-04 2.377e-04 2.397 0.0166 *
## density -6.914e+02 5.447e+00 -126.925 < 2e-16 ***
## pH 2.477e+00 5.409e-02 45.794 < 2e-16 ***
## sulphates 9.859e-01 6.023e-02 16.369 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4479 on 4420 degrees of freedom
## Multiple R-squared: 0.8693, Adjusted R-squared: 0.869
## F-statistic: 2941 on 10 and 4420 DF, p-value: < 2.2e-16
From the result above, we can say that at alpha = 0.05, chlorides is the only variable considered that has no significant effect on the alcohol content; the rest are highly significant based on their p-values. The R-squared values suggest this is a good model. However, we will not continue with model diagnostics, as we are more concerned with finding the best predictive model; this part only shows which variables are influential on the alcohol content.
Lastly, we would like to see whether alcohol content also differs among quality ratings. We use a one-way ANOVA this time.
## ANOVA FOR QUALITY ##
model.2 <- aov(alcohol ~ quality, data = wine_data)
summary(model.2)
## Df Sum Sq Mean Sq F value Pr(>F)
## quality 6 1471 245.2 204.1 <2e-16 ***
## Residuals 4424 5315 1.2
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The result shows a significant difference in alcohol content among quality ratings, evidenced by the very large F-statistic and small p-value.
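A boxplot of alcohol by quality (a quick visual check, not part of the model building below) makes the pattern behind the ANOVA visible:
#alcohol distribution within each quality rating
boxplot(alcohol ~ quality, data = wine_data, xlab = "quality", ylab = "alcohol")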
Having explored the data, we now proceed to building the best prediction model for alcohol content. To start, we prepare the data by creating dummy variables for quality and splitting it into training and test sets.
quality_dummy <- model.matrix(~ quality-1, wine_data)
#bind the dummy to the orig data set
wine_final <- data.frame(wine_scaled[, 1:10], quality_dummy[,-3], wine_data[,11]) #remove quality = 5 dummy to avoid dummy trap
names(wine_final)[17] <- 'alcohol'
#create training and test data
index <- sample(1:nrow(wine_final), round(nrow(wine_final)*.8))
train_wine <- wine_final[index,]
test_wine <- wine_final[-index,]
We will use the glmnet package to build the ridge, lasso, and elastic net regression models. We will then compare their respective Root Mean Square Errors (RMSE) on the test dataset.
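As a reminder, glmnet's alpha argument controls the penalty mix: alpha = 0 gives ridge, alpha = 1 gives lasso, and values in between give elastic net. The RMSE we compare is simply the square root of the mean squared prediction error; a small helper equivalent to the sqrt(mean(...)) expressions used below (the ModelMetrics package loaded earlier also provides an rmse() function) would look like this:
#root mean square error between observed and predicted values
rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}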
#create training/test data specifically for glmnet
train_a <- train_wine[,1:16]
train_b <- train_wine[,17]
test_a <- test_wine[,1:16]
test_b <- test_wine[,17]
##ridge
ridge_wine <- cv.glmnet(as.matrix(train_a), train_b, family = 'gaussian', alpha = 0)
#predict ridge
ridge <- predict(ridge_wine, s = ridge_wine$lambda.1se, newx = as.matrix(test_a))
mse_ridge_wine <- mean((test_b - ridge)^2)
ridge_error <- sqrt(mse_ridge_wine)
#plot residuals and prediction
ridge_resid_wine <- test_b - ridge
plot(ridge_resid_wine)
#################################
#################################
#lasso
lasso_wine <- cv.glmnet(as.matrix(train_a), train_b, family = 'gaussian', alpha = 1)
#predict lasso
lasso <- predict(lasso_wine, s = lasso_wine$lambda.1se, newx = as.matrix(test_a))
mse_lasso <- mean((test_b - lasso)^2)
lasso_error <- sqrt(mse_lasso)
#plot residuals
lasso_resid <- test_b - lasso
plot(lasso_resid)
##############################
##############################
#elastic net alpha = 0.1 - 0.9
n <- seq(0.1, 0.9, by = 0.1)
for (i in n){
  assign(paste('elastic', i, sep = ''), cv.glmnet(as.matrix(train_a), train_b, family = 'gaussian', alpha = i))
}
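An equivalent alternative, sketched here but not used below, is to collect the nine fits in a named list instead of creating separate objects with assign(); this makes looping over the fits for prediction easier:
#alternative sketch: store the nine fits in a named list
elastic_fits <- lapply(n, function(a) cv.glmnet(as.matrix(train_a), train_b, family = 'gaussian', alpha = a))
names(elastic_fits) <- paste0('elastic', n)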
#predict elasticnet
yhat0.1 <- predict(elastic0.1, s = elastic0.1$lambda.1se, newx = as.matrix(test_a))
yhat0.2 <- predict(elastic0.2, s = elastic0.2$lambda.1se, newx = as.matrix(test_a))
yhat0.3 <- predict(elastic0.3, s = elastic0.3$lambda.1se, newx = as.matrix(test_a))
yhat0.4 <- predict(elastic0.4, s = elastic0.4$lambda.1se, newx = as.matrix(test_a))
yhat0.5 <- predict(elastic0.5, s = elastic0.5$lambda.1se, newx = as.matrix(test_a))
yhat0.6 <- predict(elastic0.6, s = elastic0.6$lambda.1se, newx = as.matrix(test_a))
yhat0.7 <- predict(elastic0.7, s = elastic0.7$lambda.1se, newx = as.matrix(test_a))
yhat0.8 <- predict(elastic0.8, s = elastic0.8$lambda.1se, newx = as.matrix(test_a))
yhat0.9 <- predict(elastic0.9, s = elastic0.9$lambda.1se, newx = as.matrix(test_a))
mselas0.1 <- mean((test_b - yhat0.1)^2)
mselas0.2 <- mean((test_b - yhat0.2)^2)
mselas0.3 <- mean((test_b - yhat0.3)^2)
mselas0.4 <- mean((test_b - yhat0.4)^2)
mselas0.5 <- mean((test_b - yhat0.5)^2)
mselas0.6 <- mean((test_b - yhat0.6)^2)
mselas0.7 <- mean((test_b - yhat0.7)^2)
mselas0.8 <- mean((test_b - yhat0.8)^2)
mselas0.9 <- mean((test_b - yhat0.9)^2)
elastic_error0.1 <- sqrt(mselas0.1)
elastic_error0.2 <- sqrt(mselas0.2)
elastic_error0.3 <- sqrt(mselas0.3)
elastic_error0.4 <- sqrt(mselas0.4)
elastic_error0.5 <- sqrt(mselas0.5)
elastic_error0.6 <- sqrt(mselas0.6)
elastic_error0.7 <- sqrt(mselas0.7)
elastic_error0.8 <- sqrt(mselas0.8)
elastic_error0.9 <- sqrt(mselas0.9)
###alpha = 0.2 has the best rmse among the elastic net models
#################################
################################
#compare with Multiple Linear Regression
lm_wine_train <- cbind(train_a, train_b)
lm_wine_test <- cbind(test_a, test_b)
linear <- lm(train_b~., lm_wine_train)
#predict
pred_wine_lm <- predict(linear, lm_wine_test)
mselm_wine <- mean((lm_wine_test$test_b - pred_wine_lm)^2)
linear_error <- sqrt(mselm_wine)
#plot residuals
resid <- lm_wine_test$test_b - pred_wine_lm
plot(resid)
Altogether, we built 12 models: one for ridge, one for lasso, nine for elastic net (one for each level of alpha), and finally the multiple linear regression model. Let us check their errors to see which has the best predictive power.
errors <- rbind(ridge_error, lasso_error, elastic_error0.1, elastic_error0.2, elastic_error0.3, elastic_error0.4, elastic_error0.5, elastic_error0.6, elastic_error0.7, elastic_error0.8, elastic_error0.9, linear_error)
bestModel <- min(errors)
errors
## [,1]
## ridge_error 0.6582825
## lasso_error 0.6739909
## elastic_error0.1 0.6628782
## elastic_error0.2 0.6576649
## elastic_error0.3 0.6611946
## elastic_error0.4 0.6641517
## elastic_error0.5 0.6661493
## elastic_error0.6 0.6658192
## elastic_error0.7 0.6682074
## elastic_error0.8 0.6700373
## elastic_error0.9 0.6715392
## linear_error 0.6831689
bestModel
## [1] 0.6576649
Looking at the errors table, elastic0.2 gives the smallest RMSE, so the best model is the elastic net with alpha = 0.2. Multiple linear regression has the largest RMSE, which means it has the lowest prediction accuracy among the models.
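As a final check, assuming the elastic0.2 fit is still in the workspace, we can inspect which predictors the chosen model retains at the lambda used for prediction:
#coefficients of the selected elastic net model at lambda.1se
coef(elastic0.2, s = elastic0.2$lambda.1se)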