Using R, build a regression model for data that interests you. Conduct residual analysis. Was the linear model appropriate? Why or why not?

Load Data

I chose data on wine from the UC Irvine Machine Learning Repository. The direct link to the dataset is: https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/.

I’m interested in seeing what characteristics influence wine quality.

# Load wine data
red <- read.csv("winequality-red.csv", sep = ";")
white <- read.csv("winequality-white.csv", sep = ";")
wine <- rbind(red, white)
# Print the first few rows
head(wine)
##   fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1           7.4             0.70        0.00            1.9     0.076
## 2           7.8             0.88        0.00            2.6     0.098
## 3           7.8             0.76        0.04            2.3     0.092
## 4          11.2             0.28        0.56            1.9     0.075
## 5           7.4             0.70        0.00            1.9     0.076
## 6           7.4             0.66        0.00            1.8     0.075
##   free.sulfur.dioxide total.sulfur.dioxide density   pH sulphates alcohol
## 1                  11                   34   0.998 3.51      0.56     9.4
## 2                  25                   67   0.997 3.20      0.68     9.8
## 3                  15                   54   0.997 3.26      0.65     9.8
## 4                  17                   60   0.998 3.16      0.58     9.8
## 5                  11                   34   0.998 3.51      0.56     9.4
## 6                  13                   40   0.998 3.51      0.56     9.4
##   quality
## 1       5
## 2       5
## 3       5
## 4       6
## 5       5
## 6       5
# Print the variable names to see what variables we have to work with
colnames(wine)
##  [1] "fixed.acidity"        "volatile.acidity"     "citric.acid"         
##  [4] "residual.sugar"       "chlorides"            "free.sulfur.dioxide" 
##  [7] "total.sulfur.dioxide" "density"              "pH"                  
## [10] "sulphates"            "alcohol"              "quality"

Sanity Check

Let’s look at some summary statistics for the quality (the dependent or response variable) and residual sugar (the independent or predictor variable) for our model.

attach(wine)
# Check min and max and mean
min(quality); max(quality); mean(quality)
## [1] 3
## [1] 9
## [1] 5.82
min(residual.sugar); max(residual.sugar); mean(residual.sugar)
## [1] 0.6
## [1] 65.8
## [1] 5.44

The stats for residual sugar look a little strange with a max that is far from the mean. Let’s plot it to see what the distribution looks like.

hist(residual.sugar)

Looks like we have some outliers on the high end, but otherwise a pretty smooth left skewed distribution.

Visualize the Data

Check if a linear model seems appropriate.

# Create scatter plot
plot(residual.sugar, quality, main = "Quality as a Function of Residual Sugar in Wine", ylab = "Quality", xlab = "Residual Sugar")

The relationship does not look linear. In fact quality looks to be normally distributed with wines at each quality with varying amounts of residual sugar. There is also one serious oulier with a value over 60 for residual sugar. I doubt removing it will improve the value of a linear model much however.

Build linear Model

# quality as a function of residual.sugar
wine.lm <- lm(quality ~ residual.sugar)
wine.lm
## 
## Call:
## lm(formula = quality ~ residual.sugar)
## 
## Coefficients:
##    (Intercept)  residual.sugar  
##        5.85532        -0.00679

Our linear regression model is: \[ quality = 5.855 -0.007 \times residual.sugar \]

Plot with fitted line

plot(residual.sugar, quality, main = "Quality as a Function of Residual Sugar in Wine", ylab = "Quality", xlab = "Residual Sugar")
abline(wine.lm)

Again, not a good indication of a strong linear relationship.

Evaluate the Quality of the Model

summary(wine.lm)
## 
## Call:
## lm(formula = quality ~ residual.sugar)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -2.851 -0.808  0.158  0.229  3.217 
## 
## Coefficients:
##                Estimate Std. Error t value            Pr(>|t|)    
## (Intercept)     5.85532    0.01645  355.89 <0.0000000000000002 ***
## residual.sugar -0.00679    0.00228   -2.98              0.0029 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.873 on 6495 degrees of freedom
## Multiple R-squared:  0.00137,    Adjusted R-squared:  0.00121 
## F-statistic: 8.89 on 1 and 6495 DF,  p-value: 0.00287
hist(wine.lm$residuals)

mean(wine.lm$residuals)
## [1] 0.000000000000000313

The residuals are close to normally distributed around a mean of 0.000000000000000313 (almost exactly zero) which would indicate a good fit.

According to our textbook typically we would want our standard error to be “at least five to ten times smaller than the corresponding coefficient”. In this case the standard error for residual.sugar, 0.00228, is -2.982 times smaller than the coefficient, -0.007. So this indicates a weak fit. The standard error for the intercept, 0.016, is 355.885 times smaller than the coefficient, 5.855.

The p-values for the coefficients, 0.00287 for residual.sugar and <0.0000000000000002 for intercept, indicate that it is highly likely that both the residual.sugar and this specific intercept value are relevant to the model, however, the \(R^2\) value of 0.00137 indicates that the model explains only 0.137% of the variation in quality so not a good fit at all.

Plot the Residuals

plot(fitted(wine.lm), resid(wine.lm))

The plotted residuals are not randomly distributed and show a clear pattern indicating that the linear model is not a good fit.

Q-Q Plot

qqnorm(resid(wine.lm))
qqline(resid(wine.lm))

We can see that overall the sample quantiles follow a linear pattern similar to the theoretical quantiles. The stepped shape of the plot is due to quality being a discrete rather than continuous variable, however you can see a large variation at each step.

Conclusion

Overall the linear model is not a good fit for this data. The initial visualization of quality vs. residual sugar did not indicate a linear relationship. Normally I would not have even bothered continuing with trying to use linear regression to model this relationship, but went ahead anyway just to practice and see what other indications there would be of a bad fit. There are clear patterns in the residuals and the \(R^2\) value is extremely low, both indicating a bad fit with this model. It was surprising to see such low p-values though which just shows how important it is to look at all the other indicators of model quality and not just p-values.