I chose data on wine from the UC Irvine Machine Learning Repository. The direct link to the dataset is: https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/.
I’m interested in seeing what characteristics influence wine quality.
# Load wine data
red <- read.csv("winequality-red.csv", sep = ";")
white <- read.csv("winequality-white.csv", sep = ";")
wine <- rbind(red, white)
# Print the first few rows
head(wine)
## fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 7.4 0.70 0.00 1.9 0.076
## 2 7.8 0.88 0.00 2.6 0.098
## 3 7.8 0.76 0.04 2.3 0.092
## 4 11.2 0.28 0.56 1.9 0.075
## 5 7.4 0.70 0.00 1.9 0.076
## 6 7.4 0.66 0.00 1.8 0.075
## free.sulfur.dioxide total.sulfur.dioxide density pH sulphates alcohol
## 1 11 34 0.998 3.51 0.56 9.4
## 2 25 67 0.997 3.20 0.68 9.8
## 3 15 54 0.997 3.26 0.65 9.8
## 4 17 60 0.998 3.16 0.58 9.8
## 5 11 34 0.998 3.51 0.56 9.4
## 6 13 40 0.998 3.51 0.56 9.4
## quality
## 1 5
## 2 5
## 3 5
## 4 6
## 5 5
## 6 5
# Print the variable names to see what variables we have to work with
colnames(wine)
## [1] "fixed.acidity" "volatile.acidity" "citric.acid"
## [4] "residual.sugar" "chlorides" "free.sulfur.dioxide"
## [7] "total.sulfur.dioxide" "density" "pH"
## [10] "sulphates" "alcohol" "quality"
Let’s look at some summary statistics for the quality (the dependent or response variable) and residual sugar (the independent or predictor variable) for our model.
attach(wine)
# Check min and max and mean
min(quality); max(quality); mean(quality)
## [1] 3
## [1] 9
## [1] 5.82
min(residual.sugar); max(residual.sugar); mean(residual.sugar)
## [1] 0.6
## [1] 65.8
## [1] 5.44
The stats for residual sugar look a little strange with a max that is far from the mean. Let’s plot it to see what the distribution looks like.
hist(residual.sugar)
Looks like we have some outliers on the high end, but otherwise a pretty smooth left skewed distribution.
Check if a linear model seems appropriate.
# Create scatter plot
plot(residual.sugar, quality, main = "Quality as a Function of Residual Sugar in Wine", ylab = "Quality", xlab = "Residual Sugar")
The relationship does not look linear. In fact quality looks to be normally distributed with wines at each quality with varying amounts of residual sugar. There is also one serious oulier with a value over 60 for residual sugar. I doubt removing it will improve the value of a linear model much however.
# quality as a function of residual.sugar
wine.lm <- lm(quality ~ residual.sugar)
wine.lm
##
## Call:
## lm(formula = quality ~ residual.sugar)
##
## Coefficients:
## (Intercept) residual.sugar
## 5.85532 -0.00679
Our linear regression model is: \[ quality = 5.855 -0.007 \times residual.sugar \]
plot(residual.sugar, quality, main = "Quality as a Function of Residual Sugar in Wine", ylab = "Quality", xlab = "Residual Sugar")
abline(wine.lm)
Again, not a good indication of a strong linear relationship.
summary(wine.lm)
##
## Call:
## lm(formula = quality ~ residual.sugar)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.851 -0.808 0.158 0.229 3.217
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.85532 0.01645 355.89 <0.0000000000000002 ***
## residual.sugar -0.00679 0.00228 -2.98 0.0029 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.873 on 6495 degrees of freedom
## Multiple R-squared: 0.00137, Adjusted R-squared: 0.00121
## F-statistic: 8.89 on 1 and 6495 DF, p-value: 0.00287
hist(wine.lm$residuals)
mean(wine.lm$residuals)
## [1] 0.000000000000000313
The residuals are close to normally distributed around a mean of 0.000000000000000313 (almost exactly zero) which would indicate a good fit.
According to our textbook typically we would want our standard error to be “at least five to ten times smaller than the corresponding coefficient”. In this case the standard error for residual.sugar, 0.00228, is -2.982 times smaller than the coefficient, -0.007. So this indicates a weak fit. The standard error for the intercept, 0.016, is 355.885 times smaller than the coefficient, 5.855.
The p-values for the coefficients, 0.00287 for residual.sugar and <0.0000000000000002 for intercept, indicate that it is highly likely that both the residual.sugar and this specific intercept value are relevant to the model, however, the \(R^2\) value of 0.00137 indicates that the model explains only 0.137% of the variation in quality so not a good fit at all.
plot(fitted(wine.lm), resid(wine.lm))
The plotted residuals are not randomly distributed and show a clear pattern indicating that the linear model is not a good fit.
qqnorm(resid(wine.lm))
qqline(resid(wine.lm))
We can see that overall the sample quantiles follow a linear pattern similar to the theoretical quantiles. The stepped shape of the plot is due to quality being a discrete rather than continuous variable, however you can see a large variation at each step.
Overall the linear model is not a good fit for this data. The initial visualization of quality vs. residual sugar did not indicate a linear relationship. Normally I would not have even bothered continuing with trying to use linear regression to model this relationship, but went ahead anyway just to practice and see what other indications there would be of a bad fit. There are clear patterns in the residuals and the \(R^2\) value is extremely low, both indicating a bad fit with this model. It was surprising to see such low p-values though which just shows how important it is to look at all the other indicators of model quality and not just p-values.