DATA 605 Wk 12 Discussion

Background

The purpose of this week’s discussion topic is to build out a regression model and conduct residual analysis for any data set that interests us.

Dataset

The source of data is cited APA-style below:

UCI Machine Learning. (2018). Red Wine Quality [Data file]. Retrieved from https://www.kaggle.com/uciml/red-wine-quality-cortez-et-al-2009?select=winequality-red.csv

Being that last week I’d explored the tie between wine and happiness and there appeared to be a correlation, this week I thought I’d take it a step further and explore wine quality data.

What characteristic sets great wine apart from average wine?

Load data

After downloading the .csv file from Kaggle and uploading it to Github, we read the corresponding data (in raw form) and then familiarize ourselves with the dataset by displaying column names, column number, row number, the 1st 6 observations.

#Read .csv data
wine_quality <- read_csv("https://raw.githubusercontent.com/Magnus-PS/CUNY-SPS-DATA-605/Support-Files/winequality-red.csv")

## 
## -- Column specification --------------------------------------------------------------------
## cols(
##   `fixed acidity` = col_double(),
##   `volatile acidity` = col_double(),
##   `citric acid` = col_double(),
##   `residual sugar` = col_double(),
##   chlorides = col_double(),
##   `free sulfur dioxide` = col_double(),
##   `total sulfur dioxide` = col_double(),
##   density = col_double(),
##   pH = col_double(),
##   sulphates = col_double(),
##   alcohol = col_double(),
##   quality = col_double()
## )

#Familiarize ourselves with the dataset
colnames(wine_quality)

##  [1] "fixed acidity"        "volatile acidity"     "citric acid"         
##  [4] "residual sugar"       "chlorides"            "free sulfur dioxide" 
##  [7] "total sulfur dioxide" "density"              "pH"                  
## [10] "sulphates"            "alcohol"              "quality"

ncol(wine_quality)

## [1] 12

nrow(wine_quality)

## [1] 1599

head(wine_quality)

## # A tibble: 6 x 12
##   `fixed acidity` `volatile acidi~ `citric acid` `residual sugar` chlorides
##             <dbl>            <dbl>         <dbl>            <dbl>     <dbl>
## 1             7.4             0.7           0                 1.9     0.076
## 2             7.8             0.88          0                 2.6     0.098
## 3             7.8             0.76          0.04              2.3     0.092
## 4            11.2             0.28          0.56              1.9     0.075
## 5             7.4             0.7           0                 1.9     0.076
## 6             7.4             0.66          0                 1.8     0.075
## # ... with 7 more variables: `free sulfur dioxide` <dbl>, `total sulfur
## #   dioxide` <dbl>, density <dbl>, pH <dbl>, sulphates <dbl>, alcohol <dbl>,
## #   quality <dbl>

#visualize relationships in data
pairs(wine_quality, gap=0.5)

Our data has 12 columns and 1599 rows.

We’re going to explore the correlation between wine quality and any of the remaining 11 variables. The pairwise plot (above) provides a quick, high level view of these relationships.

Regression model

While we’re supposed to be able to see any obviously significant relationships between variables through the earlier plot, I had trouble doing so and thought the next step (backward elimination) might help paint a clearer picture regarding these relationships and the strength of our linear model:

#Start the backward elimination process. Put all potential predictors into a model for the wine_quality data frame using lm() function:
wine_quality.lm <- lm(quality ~ `fixed acidity` + `volatile acidity` + `citric acid` + `residual sugar` + chlorides + `free sulfur dioxide` + `total sulfur dioxide` + density + pH + sulphates + alcohol, data=wine_quality)

#Explore summary statistics
summary(wine_quality.lm)

## 
## Call:
## lm(formula = quality ~ `fixed acidity` + `volatile acidity` + 
##     `citric acid` + `residual sugar` + chlorides + `free sulfur dioxide` + 
##     `total sulfur dioxide` + density + pH + sulphates + alcohol, 
##     data = wine_quality)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.68911 -0.36652 -0.04699  0.45202  2.02498 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             2.197e+01  2.119e+01   1.036   0.3002    
## `fixed acidity`         2.499e-02  2.595e-02   0.963   0.3357    
## `volatile acidity`     -1.084e+00  1.211e-01  -8.948  < 2e-16 ***
## `citric acid`          -1.826e-01  1.472e-01  -1.240   0.2150    
## `residual sugar`        1.633e-02  1.500e-02   1.089   0.2765    
## chlorides              -1.874e+00  4.193e-01  -4.470 8.37e-06 ***
## `free sulfur dioxide`   4.361e-03  2.171e-03   2.009   0.0447 *  
## `total sulfur dioxide` -3.265e-03  7.287e-04  -4.480 8.00e-06 ***
## density                -1.788e+01  2.163e+01  -0.827   0.4086    
## pH                     -4.137e-01  1.916e-01  -2.159   0.0310 *  
## sulphates               9.163e-01  1.143e-01   8.014 2.13e-15 ***
## alcohol                 2.762e-01  2.648e-02  10.429  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.648 on 1587 degrees of freedom
## Multiple R-squared:  0.3606, Adjusted R-squared:  0.3561 
## F-statistic: 81.35 on 11 and 1587 DF,  p-value: < 2.2e-16

Our R^2 value is relatively low @ 0.3606 so it does not appear promising but we’ll continue exploring, eliminating variables that appear to be poor predictors (those with high p-values), and then revisit our summary statistics.

We’ll apply backward elimination and eliminate all predictors with p-values greater than 0.05. The variables with stars marked along the right side appear to hold promise as predictors:

#We're going to remove one colum
wine_quality.lm <- update(wine_quality.lm, .~. - `fixed acidity`, data = wine_quality)
wine_quality.lm <- update(wine_quality.lm, .~. - `residual sugar`, data = wine_quality)
wine_quality.lm <- update(wine_quality.lm, .~. - density, data = wine_quality)
wine_quality.lm <- update(wine_quality.lm, .~. - `citric acid`, data = wine_quality)

summary(wine_quality.lm)

## 
## Call:
## lm(formula = quality ~ `volatile acidity` + chlorides + `free sulfur dioxide` + 
##     `total sulfur dioxide` + pH + sulphates + alcohol, data = wine_quality)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.68918 -0.36757 -0.04653  0.46081  2.02954 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             4.4300987  0.4029168  10.995  < 2e-16 ***
## `volatile acidity`     -1.0127527  0.1008429 -10.043  < 2e-16 ***
## chlorides              -2.0178138  0.3975417  -5.076 4.31e-07 ***
## `free sulfur dioxide`   0.0050774  0.0021255   2.389    0.017 *  
## `total sulfur dioxide` -0.0034822  0.0006868  -5.070 4.43e-07 ***
## pH                     -0.4826614  0.1175581  -4.106 4.23e-05 ***
## sulphates               0.8826651  0.1099084   8.031 1.86e-15 ***
## alcohol                 0.2893028  0.0167958  17.225  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6477 on 1591 degrees of freedom
## Multiple R-squared:  0.3595, Adjusted R-squared:  0.3567 
## F-statistic: 127.6 on 7 and 1591 DF,  p-value: < 2.2e-16

While the p-value associated with our F-statistic is quite low @ 2.2e^-16, the corresponding R^2 is indicative of the opposite. The R^2 value is relatively low @ 0.3595.

At this point, we may have enough to reject the null hypothesis but we’ll explore residual plots to observe whether or not they support this early conclusion.

Residual analysis

#Residuals plot
plot(fitted(wine_quality.lm),resid(wine_quality.lm))

#Q-Q plot
qqnorm(resid(wine_quality.lm))
qqline(resid(wine_quality.lm))

From our residuals plot, we see obvious patterns. We were exploring the relationship between quality and 7 different variables and what we observe here is 6 different lines with a consistent negative slope.

The residuals are not well behaved. They consistently decrease from left to right and are spread about 6 different lines with those toward the middle holding a greater proportion of residuals. The residuals plot gives us reason to believe that we’ve produced a poor model.

From our Q-Q plot, we observe a slight bend below the plotted line early and a slight bend above the plotted line later on but in general our residuals roughly follow the plotted line and support that we may have a strong model.

Conclusion

Was the linear model appropriate? Why or why not?

No.

While our p-value and Q-Q plot supported the strength of our linear model, the fact that our R^2 value was so low (@ 0.3595) and our residuals plot had patterns provides enough of a basis to throw this model out.

Industry specific knowledge likely would have been useful for electing predictors and plotting variables against one another.

Maybe plotting all of these variables against quality was not the manner to do it, maybe certain variables needed to be transformed to be properly considered, or maybe there’s some other explanation. Needless to say, this was a poor model that did not answer the question at hand: