Using R, build a multiple regression model for data that interests you.
Include in this model:
at least one quadratic term one dichotomous term one dichotomous vs. quantitative interaction term
Interpret all coefficients. Conduct residual analysis. Was the linear model appropriate? Why or why not?
The data set I choose to explore is called Wine Quality. The two datasets are related to red and white variants of the Portuguese “Vinho Verde” wine. Our goal is to build a multiple linear regression model to predict wine quality using the variables in the data set.
P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.
Available at:
Attribute information: Input variables (based on physicochemical tests): 1. fixed acidity 2. volatile acidity 3. citric acid 4. residual sugar 5. chlorides 6. free sulfur dioxide 7. total sulfur dioxide 8. density 9. pH 10. sulphates 11. alcohol Output variable (based on sensory data): 12. quality (score between 0 and 10)
library(tidyverse)
library(psych)
library(ggpubr)
library(GGally)
library(knitr)
white <- read_csv("wine+quality/winequality-white.csv")
red <- read_csv("wine+quality/winequality-red.csv")
Clean each data fram as the file contained all column names and values in one column each one separated but a semi-colon
white_colnames <- colnames(white) %>% str_split(pattern = ";") %>% unlist()
white_colnames <- white_colnames %>% str_replace_all(" ", "_")
white <- white %>% separate_wider_delim(
cols = "fixed acidity;volatile acidity;citric acid;residual sugar;chlorides;free sulfur dioxide;total sulfur dioxide;density;pH;sulphates;alcohol;quality",
delim = ";",
names = white_colnames
)
red_colnames <- colnames(red) %>% str_split(pattern = ";") %>% unlist()
red_colnames <- red_colnames %>% str_replace_all(" ", "_")
red <- red %>% separate_wider_delim(
cols = "fixed acidity;volatile acidity;citric acid;residual sugar;chlorides;free sulfur dioxide;total sulfur dioxide;density;pH;sulphates;alcohol;quality",
delim = ";",
names = red_colnames
)
head(white)
## # A tibble: 6 × 12
## fixed_acidity volatile_acidity citric_acid residual_sugar chlorides
## <chr> <chr> <chr> <chr> <chr>
## 1 7 0.27 0.36 20.7 0.045
## 2 6.3 0.3 0.34 1.6 0.049
## 3 8.1 0.28 0.4 6.9 0.05
## 4 7.2 0.23 0.32 8.5 0.058
## 5 7.2 0.23 0.32 8.5 0.058
## 6 8.1 0.28 0.4 6.9 0.05
## # ℹ 7 more variables: free_sulfur_dioxide <chr>, total_sulfur_dioxide <chr>,
## # density <chr>, pH <chr>, sulphates <chr>, alcohol <chr>, quality <chr>
head(red)
## # A tibble: 6 × 12
## fixed_acidity volatile_acidity citric_acid residual_sugar chlorides
## <chr> <chr> <chr> <chr> <chr>
## 1 7.4 0.7 0 1.9 0.076
## 2 7.8 0.88 0 2.6 0.098
## 3 7.8 0.76 0.04 2.3 0.092
## 4 11.2 0.28 0.56 1.9 0.075
## 5 7.4 0.7 0 1.9 0.076
## 6 7.4 0.66 0 1.8 0.075
## # ℹ 7 more variables: free_sulfur_dioxide <chr>, total_sulfur_dioxide <chr>,
## # density <chr>, pH <chr>, sulphates <chr>, alcohol <chr>, quality <chr>
We can see that all the data is a character so we would need to convert the appropriate columns to numeric.
# convert data types to numeric
red[1:12] <- lapply(red[1:12],as.numeric)
white[1:12] <- lapply(white[1:12],as.numeric)
Before we combine these data sets we need to denote which observation is red or white and we can create a new column with the respective label in each data frame. I also though it might be a good idea to look at some summary stats separately for red and white wines.
white$color <- "white"
red$color <- "red"
summary(white)
## fixed_acidity volatile_acidity citric_acid residual_sugar
## Min. : 3.800 Min. :0.0800 Min. :0.0000 Min. : 0.600
## 1st Qu.: 6.300 1st Qu.:0.2100 1st Qu.:0.2700 1st Qu.: 1.700
## Median : 6.800 Median :0.2600 Median :0.3200 Median : 5.200
## Mean : 6.855 Mean :0.2782 Mean :0.3342 Mean : 6.391
## 3rd Qu.: 7.300 3rd Qu.:0.3200 3rd Qu.:0.3900 3rd Qu.: 9.900
## Max. :14.200 Max. :1.1000 Max. :1.6600 Max. :65.800
## chlorides free_sulfur_dioxide total_sulfur_dioxide density
## Min. :0.00900 Min. : 2.00 Min. : 9.0 Min. :0.9871
## 1st Qu.:0.03600 1st Qu.: 23.00 1st Qu.:108.0 1st Qu.:0.9917
## Median :0.04300 Median : 34.00 Median :134.0 Median :0.9937
## Mean :0.04577 Mean : 35.31 Mean :138.4 Mean :0.9940
## 3rd Qu.:0.05000 3rd Qu.: 46.00 3rd Qu.:167.0 3rd Qu.:0.9961
## Max. :0.34600 Max. :289.00 Max. :440.0 Max. :1.0390
## pH sulphates alcohol quality
## Min. :2.720 Min. :0.2200 Min. : 8.00 Min. :3.000
## 1st Qu.:3.090 1st Qu.:0.4100 1st Qu.: 9.50 1st Qu.:5.000
## Median :3.180 Median :0.4700 Median :10.40 Median :6.000
## Mean :3.188 Mean :0.4898 Mean :10.51 Mean :5.878
## 3rd Qu.:3.280 3rd Qu.:0.5500 3rd Qu.:11.40 3rd Qu.:6.000
## Max. :3.820 Max. :1.0800 Max. :14.20 Max. :9.000
## color
## Length:4898
## Class :character
## Mode :character
##
##
##
summary(red)
## fixed_acidity volatile_acidity citric_acid residual_sugar
## Min. : 4.60 Min. :0.1200 Min. :0.000 Min. : 0.900
## 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090 1st Qu.: 1.900
## Median : 7.90 Median :0.5200 Median :0.260 Median : 2.200
## Mean : 8.32 Mean :0.5278 Mean :0.271 Mean : 2.539
## 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420 3rd Qu.: 2.600
## Max. :15.90 Max. :1.5800 Max. :1.000 Max. :15.500
## chlorides free_sulfur_dioxide total_sulfur_dioxide density
## Min. :0.01200 Min. : 1.00 Min. : 6.00 Min. :0.9901
## 1st Qu.:0.07000 1st Qu.: 7.00 1st Qu.: 22.00 1st Qu.:0.9956
## Median :0.07900 Median :14.00 Median : 38.00 Median :0.9968
## Mean :0.08747 Mean :15.87 Mean : 46.47 Mean :0.9967
## 3rd Qu.:0.09000 3rd Qu.:21.00 3rd Qu.: 62.00 3rd Qu.:0.9978
## Max. :0.61100 Max. :72.00 Max. :289.00 Max. :1.0037
## pH sulphates alcohol quality
## Min. :2.740 Min. :0.3300 Min. : 8.40 Min. :3.000
## 1st Qu.:3.210 1st Qu.:0.5500 1st Qu.: 9.50 1st Qu.:5.000
## Median :3.310 Median :0.6200 Median :10.20 Median :6.000
## Mean :3.311 Mean :0.6581 Mean :10.42 Mean :5.636
## 3rd Qu.:3.400 3rd Qu.:0.7300 3rd Qu.:11.10 3rd Qu.:6.000
## Max. :4.010 Max. :2.0000 Max. :14.90 Max. :8.000
## color
## Length:1599
## Class :character
## Mode :character
##
##
##
Combine data sets
wine_df <- rbind(red, white)
head(wine_df)
## # A tibble: 6 × 13
## fixed_acidity volatile_acidity citric_acid residual_sugar chlorides
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 7.4 0.7 0 1.9 0.076
## 2 7.8 0.88 0 2.6 0.098
## 3 7.8 0.76 0.04 2.3 0.092
## 4 11.2 0.28 0.56 1.9 0.075
## 5 7.4 0.7 0 1.9 0.076
## 6 7.4 0.66 0 1.8 0.075
## # ℹ 8 more variables: free_sulfur_dioxide <dbl>, total_sulfur_dioxide <dbl>,
## # density <dbl>, pH <dbl>, sulphates <dbl>, alcohol <dbl>, quality <dbl>,
## # color <chr>
dim(wine_df)
## [1] 6497 13
glimpse(wine_df)
## Rows: 6,497
## Columns: 13
## $ fixed_acidity <dbl> 7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5…
## $ volatile_acidity <dbl> 0.700, 0.880, 0.760, 0.280, 0.700, 0.660, 0.600, …
## $ citric_acid <dbl> 0.00, 0.00, 0.04, 0.56, 0.00, 0.00, 0.06, 0.00, 0…
## $ residual_sugar <dbl> 1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2.0, 6.1,…
## $ chlorides <dbl> 0.076, 0.098, 0.092, 0.075, 0.076, 0.075, 0.069, …
## $ free_sulfur_dioxide <dbl> 11, 25, 15, 17, 11, 13, 15, 15, 9, 17, 15, 17, 16…
## $ total_sulfur_dioxide <dbl> 34, 67, 54, 60, 34, 40, 59, 21, 18, 102, 65, 102,…
## $ density <dbl> 0.9978, 0.9968, 0.9970, 0.9980, 0.9978, 0.9978, 0…
## $ pH <dbl> 3.51, 3.20, 3.26, 3.16, 3.51, 3.51, 3.30, 3.39, 3…
## $ sulphates <dbl> 0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0…
## $ alcohol <dbl> 9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10.0, 9.5, 10.…
## $ quality <dbl> 5, 5, 5, 6, 5, 5, 5, 7, 7, 5, 5, 5, 5, 5, 5, 5, 7…
## $ color <chr> "red", "red", "red", "red", "red", "red", "red", …
Check for missing values - no missing values
any(is.na(wine_df))
## [1] FALSE
summary(wine_df)
## fixed_acidity volatile_acidity citric_acid residual_sugar
## Min. : 3.800 Min. :0.0800 Min. :0.0000 Min. : 0.600
## 1st Qu.: 6.400 1st Qu.:0.2300 1st Qu.:0.2500 1st Qu.: 1.800
## Median : 7.000 Median :0.2900 Median :0.3100 Median : 3.000
## Mean : 7.215 Mean :0.3397 Mean :0.3186 Mean : 5.443
## 3rd Qu.: 7.700 3rd Qu.:0.4000 3rd Qu.:0.3900 3rd Qu.: 8.100
## Max. :15.900 Max. :1.5800 Max. :1.6600 Max. :65.800
## chlorides free_sulfur_dioxide total_sulfur_dioxide density
## Min. :0.00900 Min. : 1.00 Min. : 6.0 Min. :0.9871
## 1st Qu.:0.03800 1st Qu.: 17.00 1st Qu.: 77.0 1st Qu.:0.9923
## Median :0.04700 Median : 29.00 Median :118.0 Median :0.9949
## Mean :0.05603 Mean : 30.53 Mean :115.7 Mean :0.9947
## 3rd Qu.:0.06500 3rd Qu.: 41.00 3rd Qu.:156.0 3rd Qu.:0.9970
## Max. :0.61100 Max. :289.00 Max. :440.0 Max. :1.0390
## pH sulphates alcohol quality
## Min. :2.720 Min. :0.2200 Min. : 8.00 Min. :3.000
## 1st Qu.:3.110 1st Qu.:0.4300 1st Qu.: 9.50 1st Qu.:5.000
## Median :3.210 Median :0.5100 Median :10.30 Median :6.000
## Mean :3.219 Mean :0.5313 Mean :10.49 Mean :5.818
## 3rd Qu.:3.320 3rd Qu.:0.6000 3rd Qu.:11.30 3rd Qu.:6.000
## Max. :4.010 Max. :2.0000 Max. :14.90 Max. :9.000
## color
## Length:6497
## Class :character
## Mode :character
##
##
##
Distribution of the quality scores between red and white wine
wine_df %>% ggplot(aes(x = color, y = quality, color = color)) +
geom_boxplot()
corrplot::corrplot(cor(wine_df[,1:12]), method = "square", type = 'lower', tl.srt = 45)
## Scatter plots - Relationships with dependent variable (quality)
wine_df %>%
select(where(is.numeric)) %>%
pivot_longer(cols = -quality, names_to = "variable", values_to = "value") %>%
ggplot(aes(x = value, y = quality)) +
geom_point(alpha=0.1) +
facet_wrap(~variable, scales = "free")
First we need to code our dichotomous variable color as a 0 for red and 1 for white.
wine_df <- wine_df %>% mutate(color2 = ifelse(color == "red", 0, 1))
wine_lm1 <- lm(quality ~ fixed_acidity + volatile_acidity + citric_acid + residual_sugar + chlorides +
free_sulfur_dioxide + total_sulfur_dioxide + density + pH + sulphates + alcohol + color2, data = wine_df)
summary(wine_lm1)
##
## Call:
## lm(formula = quality ~ fixed_acidity + volatile_acidity + citric_acid +
## residual_sugar + chlorides + free_sulfur_dioxide + total_sulfur_dioxide +
## density + pH + sulphates + alcohol + color2, data = wine_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.7796 -0.4671 -0.0444 0.4561 3.0211
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.048e+02 1.414e+01 7.411 1.42e-13 ***
## fixed_acidity 8.507e-02 1.576e-02 5.396 7.05e-08 ***
## volatile_acidity -1.492e+00 8.135e-02 -18.345 < 2e-16 ***
## citric_acid -6.262e-02 7.972e-02 -0.786 0.4322
## residual_sugar 6.244e-02 5.934e-03 10.522 < 2e-16 ***
## chlorides -7.573e-01 3.344e-01 -2.264 0.0236 *
## free_sulfur_dioxide 4.937e-03 7.662e-04 6.443 1.25e-10 ***
## total_sulfur_dioxide -1.403e-03 3.237e-04 -4.333 1.49e-05 ***
## density -1.039e+02 1.434e+01 -7.248 4.71e-13 ***
## pH 4.988e-01 9.058e-02 5.506 3.81e-08 ***
## sulphates 7.217e-01 7.624e-02 9.466 < 2e-16 ***
## alcohol 2.227e-01 1.807e-02 12.320 < 2e-16 ***
## color2 -3.613e-01 5.675e-02 -6.367 2.06e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7331 on 6484 degrees of freedom
## Multiple R-squared: 0.2965, Adjusted R-squared: 0.2952
## F-statistic: 227.8 on 12 and 6484 DF, p-value: < 2.2e-16
From the output of the model the only non significant variable was
citric acid
. The model was run adain with this removed.
wine_lm2 <- update(wine_lm1, .~. - citric_acid, data = wine_df)
summary(wine_lm2)
##
## Call:
## lm(formula = quality ~ fixed_acidity + volatile_acidity + residual_sugar +
## chlorides + free_sulfur_dioxide + total_sulfur_dioxide +
## density + pH + sulphates + alcohol + color2, data = wine_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.7708 -0.4696 -0.0418 0.4540 3.0202
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.054e+02 1.411e+01 7.471 9.02e-14 ***
## fixed_acidity 8.244e-02 1.541e-02 5.351 9.02e-08 ***
## volatile_acidity -1.471e+00 7.670e-02 -19.180 < 2e-16 ***
## residual_sugar 6.256e-02 5.932e-03 10.548 < 2e-16 ***
## chlorides -8.007e-01 3.298e-01 -2.428 0.0152 *
## free_sulfur_dioxide 4.941e-03 7.662e-04 6.449 1.21e-10 ***
## total_sulfur_dioxide -1.427e-03 3.222e-04 -4.428 9.66e-06 ***
## density -1.046e+02 1.431e+01 -7.307 3.05e-13 ***
## pH 5.044e-01 9.029e-02 5.587 2.40e-08 ***
## sulphates 7.186e-01 7.614e-02 9.439 < 2e-16 ***
## alcohol 2.210e-01 1.794e-02 12.315 < 2e-16 ***
## color2 -3.655e-01 5.651e-02 -6.468 1.07e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7331 on 6485 degrees of freedom
## Multiple R-squared: 0.2965, Adjusted R-squared: 0.2953
## F-statistic: 248.4 on 11 and 6485 DF, p-value: < 2.2e-16
Removing citric acid did not make that much of a change in the output statistics. The adjusted \(R^2\) improved by 0.0001.
I was curious about any interaction effects between the color of the
wine and residual sugars so the model was run again with interaction
term.
wine_lm3 <- update(wine_lm2, .~. + residual_sugar:color2, data = wine_df)
summary(wine_lm3)
##
## Call:
## lm(formula = quality ~ fixed_acidity + volatile_acidity + residual_sugar +
## chlorides + free_sulfur_dioxide + total_sulfur_dioxide +
## density + pH + sulphates + alcohol + color2 + residual_sugar:color2,
## data = wine_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.7760 -0.4725 -0.0430 0.4567 3.0166
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.047e+02 1.412e+01 7.418 1.34e-13 ***
## fixed_acidity 8.298e-02 1.541e-02 5.385 7.49e-08 ***
## volatile_acidity -1.472e+00 7.669e-02 -19.190 < 2e-16 ***
## residual_sugar 4.426e-02 1.442e-02 3.070 0.00215 **
## chlorides -7.742e-01 3.303e-01 -2.343 0.01914 *
## free_sulfur_dioxide 4.953e-03 7.662e-04 6.464 1.10e-10 ***
## total_sulfur_dioxide -1.411e-03 3.224e-04 -4.376 1.23e-05 ***
## density -1.039e+02 1.432e+01 -7.254 4.51e-13 ***
## pH 5.043e-01 9.028e-02 5.586 2.42e-08 ***
## sulphates 7.153e-01 7.617e-02 9.391 < 2e-16 ***
## alcohol 2.232e-01 1.801e-02 12.392 < 2e-16 ***
## color2 -4.124e-01 6.580e-02 -6.268 3.89e-10 ***
## residual_sugar:color2 1.856e-02 1.333e-02 1.393 0.16372
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.733 on 6484 degrees of freedom
## Multiple R-squared: 0.2967, Adjusted R-squared: 0.2954
## F-statistic: 227.9 on 12 and 6484 DF, p-value: < 2.2e-16
Adding an interaction effect between residual sugar and color did minimal to the adjusted \(R^2\) value, improving is only by 0.0001 to 0.2954. The coefficient also did not met the a significant threshold.
par(mfrow = c(2,2))
plot(wine_lm3)
In the residual plots we see some very interesting features of the residuals. First, the residual vs. fitted plot seems like it is distributed around 0 but the points clearly follow some pattern. The QQ plot also falls along a straight line but several values toward the left side of the slope veer below the diagonal.
With regards to the model of predicting wine quality for the Portuguese “Vinho Verde” wine the linear regression method used may not be the most appropriate. It could be a result of the dependent variable being a discrete data type. While the model passed a significance test the adjusted \(R^2\) at 0.2954 accounted for a small amount of variation. There was no interaction effect of color and residual sugars but perhaps a stronger one exists. This was explored as the white wine in this data set had a larger spread of values compared to the red type. I was unable to identify any quadratic term that made sense with this data. The color of the wine had a negative impact on the quality score with white wines tending to reduce the quality score. Volatility acidity seemed to have the most impact on quality, for every 1 unit of acidity it reduced the quality by 1.472 points. In our data set the mean volatility acidity score was 0.3397.