1 PROBLEM:

Using R, build a multiple regression model for data that interests you.

Include in this model:

at least one quadratic term one dichotomous term one dichotomous vs. quantitative interaction term

Interpret all coefficients. Conduct residual analysis. Was the linear model appropriate? Why or why not?

The data set I choose to explore is called Wine Quality. The two datasets are related to red and white variants of the Portuguese “Vinho Verde” wine. Our goal is to build a multiple linear regression model to predict wine quality using the variables in the data set.

P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.

Available at:

@Elsevier

Pre-press (pdf)

bib

Attribute information: Input variables (based on physicochemical tests): 1. fixed acidity 2. volatile acidity 3. citric acid 4. residual sugar 5. chlorides 6. free sulfur dioxide 7. total sulfur dioxide 8. density 9. pH 10. sulphates 11. alcohol Output variable (based on sensory data): 12. quality (score between 0 and 10)

2 Load and tidy data

library(tidyverse)
library(psych)
library(ggpubr)
library(GGally)
library(knitr)
white <- read_csv("wine+quality/winequality-white.csv")
red <- read_csv("wine+quality/winequality-red.csv")

Clean each data fram as the file contained all column names and values in one column each one separated but a semi-colon

white_colnames <- colnames(white) %>% str_split(pattern = ";") %>% unlist()
white_colnames <- white_colnames %>% str_replace_all(" ", "_")

white <- white %>% separate_wider_delim(
  cols = "fixed acidity;volatile acidity;citric acid;residual sugar;chlorides;free sulfur dioxide;total sulfur dioxide;density;pH;sulphates;alcohol;quality",
  delim = ";", 
  names = white_colnames
)
red_colnames <- colnames(red) %>% str_split(pattern = ";") %>% unlist()
red_colnames <- red_colnames %>% str_replace_all(" ", "_")

red <- red %>% separate_wider_delim(
  cols = "fixed acidity;volatile acidity;citric acid;residual sugar;chlorides;free sulfur dioxide;total sulfur dioxide;density;pH;sulphates;alcohol;quality",
  delim = ";", 
  names = red_colnames
)

3 Data exploration

head(white)
## # A tibble: 6 × 12
##   fixed_acidity volatile_acidity citric_acid residual_sugar chlorides
##   <chr>         <chr>            <chr>       <chr>          <chr>    
## 1 7             0.27             0.36        20.7           0.045    
## 2 6.3           0.3              0.34        1.6            0.049    
## 3 8.1           0.28             0.4         6.9            0.05     
## 4 7.2           0.23             0.32        8.5            0.058    
## 5 7.2           0.23             0.32        8.5            0.058    
## 6 8.1           0.28             0.4         6.9            0.05     
## # ℹ 7 more variables: free_sulfur_dioxide <chr>, total_sulfur_dioxide <chr>,
## #   density <chr>, pH <chr>, sulphates <chr>, alcohol <chr>, quality <chr>
head(red)
## # A tibble: 6 × 12
##   fixed_acidity volatile_acidity citric_acid residual_sugar chlorides
##   <chr>         <chr>            <chr>       <chr>          <chr>    
## 1 7.4           0.7              0           1.9            0.076    
## 2 7.8           0.88             0           2.6            0.098    
## 3 7.8           0.76             0.04        2.3            0.092    
## 4 11.2          0.28             0.56        1.9            0.075    
## 5 7.4           0.7              0           1.9            0.076    
## 6 7.4           0.66             0           1.8            0.075    
## # ℹ 7 more variables: free_sulfur_dioxide <chr>, total_sulfur_dioxide <chr>,
## #   density <chr>, pH <chr>, sulphates <chr>, alcohol <chr>, quality <chr>

We can see that all the data is a character so we would need to convert the appropriate columns to numeric.

# convert data types to numeric
red[1:12] <- lapply(red[1:12],as.numeric)
white[1:12] <- lapply(white[1:12],as.numeric)

Before we combine these data sets we need to denote which observation is red or white and we can create a new column with the respective label in each data frame. I also though it might be a good idea to look at some summary stats separately for red and white wines.

white$color <- "white"

red$color <- "red"

3.1 Summary - White wine

summary(white)
##  fixed_acidity    volatile_acidity  citric_acid     residual_sugar  
##  Min.   : 3.800   Min.   :0.0800   Min.   :0.0000   Min.   : 0.600  
##  1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700   1st Qu.: 1.700  
##  Median : 6.800   Median :0.2600   Median :0.3200   Median : 5.200  
##  Mean   : 6.855   Mean   :0.2782   Mean   :0.3342   Mean   : 6.391  
##  3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900   3rd Qu.: 9.900  
##  Max.   :14.200   Max.   :1.1000   Max.   :1.6600   Max.   :65.800  
##    chlorides       free_sulfur_dioxide total_sulfur_dioxide    density      
##  Min.   :0.00900   Min.   :  2.00      Min.   :  9.0        Min.   :0.9871  
##  1st Qu.:0.03600   1st Qu.: 23.00      1st Qu.:108.0        1st Qu.:0.9917  
##  Median :0.04300   Median : 34.00      Median :134.0        Median :0.9937  
##  Mean   :0.04577   Mean   : 35.31      Mean   :138.4        Mean   :0.9940  
##  3rd Qu.:0.05000   3rd Qu.: 46.00      3rd Qu.:167.0        3rd Qu.:0.9961  
##  Max.   :0.34600   Max.   :289.00      Max.   :440.0        Max.   :1.0390  
##        pH          sulphates         alcohol         quality     
##  Min.   :2.720   Min.   :0.2200   Min.   : 8.00   Min.   :3.000  
##  1st Qu.:3.090   1st Qu.:0.4100   1st Qu.: 9.50   1st Qu.:5.000  
##  Median :3.180   Median :0.4700   Median :10.40   Median :6.000  
##  Mean   :3.188   Mean   :0.4898   Mean   :10.51   Mean   :5.878  
##  3rd Qu.:3.280   3rd Qu.:0.5500   3rd Qu.:11.40   3rd Qu.:6.000  
##  Max.   :3.820   Max.   :1.0800   Max.   :14.20   Max.   :9.000  
##     color          
##  Length:4898       
##  Class :character  
##  Mode  :character  
##                    
##                    
## 

3.2 Summary - Red wine

summary(red)
##  fixed_acidity   volatile_acidity  citric_acid    residual_sugar  
##  Min.   : 4.60   Min.   :0.1200   Min.   :0.000   Min.   : 0.900  
##  1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090   1st Qu.: 1.900  
##  Median : 7.90   Median :0.5200   Median :0.260   Median : 2.200  
##  Mean   : 8.32   Mean   :0.5278   Mean   :0.271   Mean   : 2.539  
##  3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420   3rd Qu.: 2.600  
##  Max.   :15.90   Max.   :1.5800   Max.   :1.000   Max.   :15.500  
##    chlorides       free_sulfur_dioxide total_sulfur_dioxide    density      
##  Min.   :0.01200   Min.   : 1.00       Min.   :  6.00       Min.   :0.9901  
##  1st Qu.:0.07000   1st Qu.: 7.00       1st Qu.: 22.00       1st Qu.:0.9956  
##  Median :0.07900   Median :14.00       Median : 38.00       Median :0.9968  
##  Mean   :0.08747   Mean   :15.87       Mean   : 46.47       Mean   :0.9967  
##  3rd Qu.:0.09000   3rd Qu.:21.00       3rd Qu.: 62.00       3rd Qu.:0.9978  
##  Max.   :0.61100   Max.   :72.00       Max.   :289.00       Max.   :1.0037  
##        pH          sulphates         alcohol         quality     
##  Min.   :2.740   Min.   :0.3300   Min.   : 8.40   Min.   :3.000  
##  1st Qu.:3.210   1st Qu.:0.5500   1st Qu.: 9.50   1st Qu.:5.000  
##  Median :3.310   Median :0.6200   Median :10.20   Median :6.000  
##  Mean   :3.311   Mean   :0.6581   Mean   :10.42   Mean   :5.636  
##  3rd Qu.:3.400   3rd Qu.:0.7300   3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :4.010   Max.   :2.0000   Max.   :14.90   Max.   :8.000  
##     color          
##  Length:1599       
##  Class :character  
##  Mode  :character  
##                    
##                    
## 

Combine data sets

wine_df <- rbind(red, white)
head(wine_df)
## # A tibble: 6 × 13
##   fixed_acidity volatile_acidity citric_acid residual_sugar chlorides
##           <dbl>            <dbl>       <dbl>          <dbl>     <dbl>
## 1           7.4             0.7         0               1.9     0.076
## 2           7.8             0.88        0               2.6     0.098
## 3           7.8             0.76        0.04            2.3     0.092
## 4          11.2             0.28        0.56            1.9     0.075
## 5           7.4             0.7         0               1.9     0.076
## 6           7.4             0.66        0               1.8     0.075
## # ℹ 8 more variables: free_sulfur_dioxide <dbl>, total_sulfur_dioxide <dbl>,
## #   density <dbl>, pH <dbl>, sulphates <dbl>, alcohol <dbl>, quality <dbl>,
## #   color <chr>
dim(wine_df)
## [1] 6497   13
glimpse(wine_df)
## Rows: 6,497
## Columns: 13
## $ fixed_acidity        <dbl> 7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5…
## $ volatile_acidity     <dbl> 0.700, 0.880, 0.760, 0.280, 0.700, 0.660, 0.600, …
## $ citric_acid          <dbl> 0.00, 0.00, 0.04, 0.56, 0.00, 0.00, 0.06, 0.00, 0…
## $ residual_sugar       <dbl> 1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2.0, 6.1,…
## $ chlorides            <dbl> 0.076, 0.098, 0.092, 0.075, 0.076, 0.075, 0.069, …
## $ free_sulfur_dioxide  <dbl> 11, 25, 15, 17, 11, 13, 15, 15, 9, 17, 15, 17, 16…
## $ total_sulfur_dioxide <dbl> 34, 67, 54, 60, 34, 40, 59, 21, 18, 102, 65, 102,…
## $ density              <dbl> 0.9978, 0.9968, 0.9970, 0.9980, 0.9978, 0.9978, 0…
## $ pH                   <dbl> 3.51, 3.20, 3.26, 3.16, 3.51, 3.51, 3.30, 3.39, 3…
## $ sulphates            <dbl> 0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0…
## $ alcohol              <dbl> 9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10.0, 9.5, 10.…
## $ quality              <dbl> 5, 5, 5, 6, 5, 5, 5, 7, 7, 5, 5, 5, 5, 5, 5, 5, 7…
## $ color                <chr> "red", "red", "red", "red", "red", "red", "red", …

Check for missing values - no missing values

any(is.na(wine_df))
## [1] FALSE

3.3 Summary - Combined data set

summary(wine_df)
##  fixed_acidity    volatile_acidity  citric_acid     residual_sugar  
##  Min.   : 3.800   Min.   :0.0800   Min.   :0.0000   Min.   : 0.600  
##  1st Qu.: 6.400   1st Qu.:0.2300   1st Qu.:0.2500   1st Qu.: 1.800  
##  Median : 7.000   Median :0.2900   Median :0.3100   Median : 3.000  
##  Mean   : 7.215   Mean   :0.3397   Mean   :0.3186   Mean   : 5.443  
##  3rd Qu.: 7.700   3rd Qu.:0.4000   3rd Qu.:0.3900   3rd Qu.: 8.100  
##  Max.   :15.900   Max.   :1.5800   Max.   :1.6600   Max.   :65.800  
##    chlorides       free_sulfur_dioxide total_sulfur_dioxide    density      
##  Min.   :0.00900   Min.   :  1.00      Min.   :  6.0        Min.   :0.9871  
##  1st Qu.:0.03800   1st Qu.: 17.00      1st Qu.: 77.0        1st Qu.:0.9923  
##  Median :0.04700   Median : 29.00      Median :118.0        Median :0.9949  
##  Mean   :0.05603   Mean   : 30.53      Mean   :115.7        Mean   :0.9947  
##  3rd Qu.:0.06500   3rd Qu.: 41.00      3rd Qu.:156.0        3rd Qu.:0.9970  
##  Max.   :0.61100   Max.   :289.00      Max.   :440.0        Max.   :1.0390  
##        pH          sulphates         alcohol         quality     
##  Min.   :2.720   Min.   :0.2200   Min.   : 8.00   Min.   :3.000  
##  1st Qu.:3.110   1st Qu.:0.4300   1st Qu.: 9.50   1st Qu.:5.000  
##  Median :3.210   Median :0.5100   Median :10.30   Median :6.000  
##  Mean   :3.219   Mean   :0.5313   Mean   :10.49   Mean   :5.818  
##  3rd Qu.:3.320   3rd Qu.:0.6000   3rd Qu.:11.30   3rd Qu.:6.000  
##  Max.   :4.010   Max.   :2.0000   Max.   :14.90   Max.   :9.000  
##     color          
##  Length:6497       
##  Class :character  
##  Mode  :character  
##                    
##                    
## 

Distribution of the quality scores between red and white wine

wine_df %>% ggplot(aes(x = color, y = quality, color = color)) + 
  geom_boxplot()

3.4 Correlation plots

corrplot::corrplot(cor(wine_df[,1:12]), method = "square", type = 'lower', tl.srt = 45)

## Scatter plots - Relationships with dependent variable (quality)

wine_df %>%
  select(where(is.numeric)) %>%
  pivot_longer(cols = -quality, names_to = "variable", values_to = "value") %>%
  ggplot(aes(x = value, y = quality)) +
    geom_point(alpha=0.1) +
    facet_wrap(~variable, scales = "free")

4 Build model

First we need to code our dichotomous variable color as a 0 for red and 1 for white.

wine_df <- wine_df %>% mutate(color2 = ifelse(color == "red", 0, 1))
wine_lm1 <- lm(quality ~ fixed_acidity + volatile_acidity + citric_acid + residual_sugar + chlorides + 
                 free_sulfur_dioxide  + total_sulfur_dioxide + density + pH + sulphates + alcohol + color2, data = wine_df)

summary(wine_lm1)
## 
## Call:
## lm(formula = quality ~ fixed_acidity + volatile_acidity + citric_acid + 
##     residual_sugar + chlorides + free_sulfur_dioxide + total_sulfur_dioxide + 
##     density + pH + sulphates + alcohol + color2, data = wine_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.7796 -0.4671 -0.0444  0.4561  3.0211 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           1.048e+02  1.414e+01   7.411 1.42e-13 ***
## fixed_acidity         8.507e-02  1.576e-02   5.396 7.05e-08 ***
## volatile_acidity     -1.492e+00  8.135e-02 -18.345  < 2e-16 ***
## citric_acid          -6.262e-02  7.972e-02  -0.786   0.4322    
## residual_sugar        6.244e-02  5.934e-03  10.522  < 2e-16 ***
## chlorides            -7.573e-01  3.344e-01  -2.264   0.0236 *  
## free_sulfur_dioxide   4.937e-03  7.662e-04   6.443 1.25e-10 ***
## total_sulfur_dioxide -1.403e-03  3.237e-04  -4.333 1.49e-05 ***
## density              -1.039e+02  1.434e+01  -7.248 4.71e-13 ***
## pH                    4.988e-01  9.058e-02   5.506 3.81e-08 ***
## sulphates             7.217e-01  7.624e-02   9.466  < 2e-16 ***
## alcohol               2.227e-01  1.807e-02  12.320  < 2e-16 ***
## color2               -3.613e-01  5.675e-02  -6.367 2.06e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7331 on 6484 degrees of freedom
## Multiple R-squared:  0.2965, Adjusted R-squared:  0.2952 
## F-statistic: 227.8 on 12 and 6484 DF,  p-value: < 2.2e-16

From the output of the model the only non significant variable was citric acid. The model was run adain with this removed.

wine_lm2 <- update(wine_lm1, .~. - citric_acid, data = wine_df)
summary(wine_lm2)
## 
## Call:
## lm(formula = quality ~ fixed_acidity + volatile_acidity + residual_sugar + 
##     chlorides + free_sulfur_dioxide + total_sulfur_dioxide + 
##     density + pH + sulphates + alcohol + color2, data = wine_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.7708 -0.4696 -0.0418  0.4540  3.0202 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           1.054e+02  1.411e+01   7.471 9.02e-14 ***
## fixed_acidity         8.244e-02  1.541e-02   5.351 9.02e-08 ***
## volatile_acidity     -1.471e+00  7.670e-02 -19.180  < 2e-16 ***
## residual_sugar        6.256e-02  5.932e-03  10.548  < 2e-16 ***
## chlorides            -8.007e-01  3.298e-01  -2.428   0.0152 *  
## free_sulfur_dioxide   4.941e-03  7.662e-04   6.449 1.21e-10 ***
## total_sulfur_dioxide -1.427e-03  3.222e-04  -4.428 9.66e-06 ***
## density              -1.046e+02  1.431e+01  -7.307 3.05e-13 ***
## pH                    5.044e-01  9.029e-02   5.587 2.40e-08 ***
## sulphates             7.186e-01  7.614e-02   9.439  < 2e-16 ***
## alcohol               2.210e-01  1.794e-02  12.315  < 2e-16 ***
## color2               -3.655e-01  5.651e-02  -6.468 1.07e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7331 on 6485 degrees of freedom
## Multiple R-squared:  0.2965, Adjusted R-squared:  0.2953 
## F-statistic: 248.4 on 11 and 6485 DF,  p-value: < 2.2e-16

Removing citric acid did not make that much of a change in the output statistics. The adjusted \(R^2\) improved by 0.0001.

I was curious about any interaction effects between the color of the wine and residual sugars so the model was run again with interaction term.

wine_lm3 <- update(wine_lm2, .~. + residual_sugar:color2, data = wine_df)
summary(wine_lm3)
## 
## Call:
## lm(formula = quality ~ fixed_acidity + volatile_acidity + residual_sugar + 
##     chlorides + free_sulfur_dioxide + total_sulfur_dioxide + 
##     density + pH + sulphates + alcohol + color2 + residual_sugar:color2, 
##     data = wine_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.7760 -0.4725 -0.0430  0.4567  3.0166 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            1.047e+02  1.412e+01   7.418 1.34e-13 ***
## fixed_acidity          8.298e-02  1.541e-02   5.385 7.49e-08 ***
## volatile_acidity      -1.472e+00  7.669e-02 -19.190  < 2e-16 ***
## residual_sugar         4.426e-02  1.442e-02   3.070  0.00215 ** 
## chlorides             -7.742e-01  3.303e-01  -2.343  0.01914 *  
## free_sulfur_dioxide    4.953e-03  7.662e-04   6.464 1.10e-10 ***
## total_sulfur_dioxide  -1.411e-03  3.224e-04  -4.376 1.23e-05 ***
## density               -1.039e+02  1.432e+01  -7.254 4.51e-13 ***
## pH                     5.043e-01  9.028e-02   5.586 2.42e-08 ***
## sulphates              7.153e-01  7.617e-02   9.391  < 2e-16 ***
## alcohol                2.232e-01  1.801e-02  12.392  < 2e-16 ***
## color2                -4.124e-01  6.580e-02  -6.268 3.89e-10 ***
## residual_sugar:color2  1.856e-02  1.333e-02   1.393  0.16372    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.733 on 6484 degrees of freedom
## Multiple R-squared:  0.2967, Adjusted R-squared:  0.2954 
## F-statistic: 227.9 on 12 and 6484 DF,  p-value: < 2.2e-16

Adding an interaction effect between residual sugar and color did minimal to the adjusted \(R^2\) value, improving is only by 0.0001 to 0.2954. The coefficient also did not met the a significant threshold.

5 Residual analysis

par(mfrow = c(2,2))
plot(wine_lm3)

In the residual plots we see some very interesting features of the residuals. First, the residual vs. fitted plot seems like it is distributed around 0 but the points clearly follow some pattern. The QQ plot also falls along a straight line but several values toward the left side of the slope veer below the diagonal.

6 Conclusion

With regards to the model of predicting wine quality for the Portuguese “Vinho Verde” wine the linear regression method used may not be the most appropriate. It could be a result of the dependent variable being a discrete data type. While the model passed a significance test the adjusted \(R^2\) at 0.2954 accounted for a small amount of variation. There was no interaction effect of color and residual sugars but perhaps a stronger one exists. This was explored as the white wine in this data set had a larger spread of values compared to the red type. I was unable to identify any quadratic term that made sense with this data. The color of the wine had a negative impact on the quality score with white wines tending to reduce the quality score. Volatility acidity seemed to have the most impact on quality, for every 1 unit of acidity it reduced the quality by 1.472 points. In our data set the mean volatility acidity score was 0.3397.