Week 2 Challenge

Import Dataset and View

library(readr)
WineQT <- read_csv("WineQT.csv")

## Rows: 1143 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (13): fixed acidity, volatile acidity, citric acid, residual sugar, chlo...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

wineData <- data.frame(WineQT)
head(wineData)

##   fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1           7.4             0.70        0.00            1.9     0.076
## 2           7.8             0.88        0.00            2.6     0.098
## 3           7.8             0.76        0.04            2.3     0.092
## 4          11.2             0.28        0.56            1.9     0.075
## 5           7.4             0.70        0.00            1.9     0.076
## 6           7.4             0.66        0.00            1.8     0.075
##   free.sulfur.dioxide total.sulfur.dioxide density   pH sulphates alcohol
## 1                  11                   34  0.9978 3.51      0.56     9.4
## 2                  25                   67  0.9968 3.20      0.68     9.8
## 3                  15                   54  0.9970 3.26      0.65     9.8
## 4                  17                   60  0.9980 3.16      0.58     9.8
## 5                  11                   34  0.9978 3.51      0.56     9.4
## 6                  13                   40  0.9978 3.51      0.56     9.4
##   quality Id
## 1       5  0
## 2       5  1
## 3       5  2
## 4       6  3
## 5       5  4
## 6       5  5

Data Cleanup

wineData <- na.omit(wineData)
str(wineData)

## 'data.frame':    1143 obs. of  13 variables:
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 6.7 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.58 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.08 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 1.8 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.097 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 15 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 65 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.28 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.54 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 9.2 ...
##  $ quality             : num  5 5 5 6 5 5 5 7 7 5 ...
##  $ Id                  : num  0 1 2 3 4 5 6 7 8 10 ...
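Since str() still reports 1143 observations after na.omit(), no rows were dropped here. A quick way to confirm that before cleaning is to count missing values per column. The sketch below uses a small made-up data frame (the name toy and its values are illustrative, not taken from the wine data):

```r
# Toy data frame with a few missing values (illustrative only, not the wine data)
toy <- data.frame(
  alcohol = c(9.4, NA, 9.8, 10.0),
  pH      = c(3.51, 3.20, NA, 3.16),
  quality = c(5, 5, 6, 7)
)

# Number of missing values in each column
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# Number of rows that na.omit() would drop
rows_dropped <- nrow(toy) - nrow(na.omit(toy))
print(rows_dropped)
```

Applied to this report, the same check would be colSums(is.na(wineData)) run before the na.omit() call.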

We can now fit a preliminary model and view its summary statistics (residuals, coefficients, etc.):

lm_model <- lm(quality ~ ., data = wineData)
summary(lm_model)

## 
## Call:
## lm(formula = quality ~ ., data = wineData)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.49730 -0.37125 -0.04815  0.44220  1.97744 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           2.303e+01  2.482e+01   0.928 0.353614    
## fixed.acidity         1.882e-02  3.054e-02   0.616 0.537799    
## volatile.acidity     -1.125e+00  1.408e-01  -7.994 3.20e-15 ***
## citric.acid          -1.221e-01  1.733e-01  -0.704 0.481357    
## residual.sugar        1.400e-02  1.846e-02   0.758 0.448432    
## chlorides            -1.721e+00  4.976e-01  -3.458 0.000564 ***
## free.sulfur.dioxide   2.890e-03  2.607e-03   1.109 0.267723    
## total.sulfur.dioxide -2.977e-03  8.608e-04  -3.458 0.000564 ***
## density              -1.880e+01  2.532e+01  -0.743 0.457880    
## pH                   -4.342e-01  2.244e-01  -1.935 0.053255 .  
## sulphates             8.643e-01  1.340e-01   6.452 1.64e-10 ***
## alcohol               2.830e-01  3.139e-02   9.016  < 2e-16 ***
## Id                   -4.528e-05  4.576e-05  -0.990 0.322604    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6405 on 1130 degrees of freedom
## Multiple R-squared:  0.3748, Adjusted R-squared:  0.3681 
## F-statistic: 56.45 on 12 and 1130 DF,  p-value: < 2.2e-16
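Note that quality ~ . uses every other column as a predictor, including Id, which looks like a row identifier rather than a property of the wine. One option is to exclude such a column inside the formula. The sketch below shows the idea on the built-in mtcars data with an added, hypothetical Id column; it is a stand-in, not a refit of the wine model:

```r
# Stand-in data: built-in mtcars plus a hypothetical identifier column
cars <- mtcars
cars$Id <- seq_len(nrow(cars)) - 1

# "." expands to every column except the response; "- Id" removes the identifier
fit_no_id <- lm(mpg ~ . - Id, data = cars)

# Id should not appear among the coefficient names
print(names(coef(fit_no_id)))
```

For the wine data the corresponding call would be lm(quality ~ . - Id, data = wineData).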

The summary above reports the residual quantiles, which describe the distribution of the model's errors. The median (-0.048) is close to 0, which is ideal, but the extremes are not balanced: the minimum (-2.497) lies farther from 0 than the maximum (1.977). This suggests that the residuals are not symmetrically distributed and have a longer left tail.

We can check this visually with a normal Q-Q plot of the residuals:

qqnorm(resid(lm_model))
qqline(resid(lm_model))
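As a numeric complement to the Q-Q plot, the asymmetry of the residuals can be summarized with a sample skewness value. The helper sample_skewness below is a hypothetical function written for this sketch (base R has no built-in skewness function), and the model is a stand-in fitted on mtcars:

```r
# Sample skewness: third central moment divided by the variance raised to 1.5
sample_skewness <- function(x) {
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

# Stand-in model on built-in data (not the wine model)
fit <- lm(mpg ~ wt + hp, data = mtcars)
res <- resid(fit)

# Negative values indicate a longer left tail, positive values a longer right tail
print(sample_skewness(res))
print(summary(res))
```

With the objects from this report, the call would be sample_skewness(resid(lm_model)).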

We can refine the regression model by building a correlation matrix and checking which variables are most strongly correlated with quality:

library(corrgram)
corrgram(wineData, lower.panel=panel.shade, upper.panel=panel.ellipse)

From the correlogram we see that alcohol, volatile.acidity, citric.acid, and sulphates show the strongest correlations with quality.
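Reading exact values off a correlogram can be difficult, so it may help to print the correlations with the response directly and rank them by magnitude. A minimal sketch, using the built-in mtcars data with mpg standing in for quality:

```r
# Stand-in data: built-in mtcars, with mpg playing the role of the response
cor_matrix <- cor(mtcars)

# Correlation of every predictor with the response (the response itself is dropped)
cor_with_response <- cor_matrix[, "mpg"]
cor_with_response <- cor_with_response[names(cor_with_response) != "mpg"]

# Rank by absolute correlation, strongest first
ranked <- cor_with_response[order(abs(cor_with_response), decreasing = TRUE)]
print(round(ranked, 3))
```

For the wine data, the starting point would be cor(wineData)[, "quality"].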

We can also test for multicollinearity, which we want to be low. R supports this through the Variance Inflation Factor (VIF), computed here with vif(). We want each variable's VIF to be below 10, and lower values are better.

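# vif() is assumed to come from the car package; the original does not show the library() call
library(car)
vif_values <- vif(lm_model)
print(vif_values)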

##        fixed.acidity     volatile.acidity          citric.acid 
##             7.930042             1.779962             3.233412 
##       residual.sugar            chlorides  free.sulfur.dioxide 
##             1.744487             1.539453             1.987134 
## total.sulfur.dioxide              density                   pH 
##             2.216224             6.614473             3.440539 
##            sulphates              alcohol                   Id 
##             1.450501             3.211699             1.254879

We can see that none of the variables shows concerning multicollinearity, as every VIF is below 10. The largest values belong to fixed.acidity (7.93) and density (6.61).
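To make the VIF less of a black box: for predictor j it equals 1 / (1 - R_j^2), where R_j^2 is the R-squared from regressing predictor j on all the other predictors. The sketch below computes this with base R only. The function manual_vif is a hypothetical helper written for illustration, and it is shown on mtcars rather than the wine data:

```r
# VIF for predictor j is 1 / (1 - R_j^2), where R_j^2 comes from regressing
# predictor j on all the other predictors.
manual_vif <- function(data, predictors) {
  sapply(predictors, function(p) {
    others <- setdiff(predictors, p)
    aux_fit <- lm(reformulate(others, response = p), data = data)
    1 / (1 - summary(aux_fit)$r.squared)
  })
}

# Stand-in predictors from built-in data
predictors <- c("wt", "hp", "disp")
vif_by_hand <- manual_vif(mtcars, predictors)
print(round(vif_by_hand, 3))
```

A VIF of 1 means a predictor is uncorrelated with the others, and the value grows as the other predictors explain more of its variance.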

Now we can fit a new model using only alcohol, volatile.acidity, citric.acid, and sulphates to see whether the summary improves:

linearTestNew <- lm(quality ~ alcohol + volatile.acidity + citric.acid + sulphates, data = wineData)
summary(linearTestNew)

## 
## Call:
## lm(formula = quality ~ alcohol + volatile.acidity + citric.acid + 
##     sulphates, data = wineData)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.42794 -0.38637 -0.06418  0.46223  2.15857 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       2.69392    0.23250  11.587  < 2e-16 ***
## alcohol           0.30795    0.01816  16.959  < 2e-16 ***
## volatile.acidity -1.28979    0.13036  -9.894  < 2e-16 ***
## citric.acid      -0.02632    0.11954  -0.220    0.826    
## sulphates         0.66874    0.12054   5.548  3.6e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6495 on 1138 degrees of freedom
## Multiple R-squared:  0.3526, Adjusted R-squared:  0.3503 
## F-statistic: 154.9 on 4 and 1138 DF,  p-value: < 2.2e-16

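We see that while the median moved further from zero (-0.064 versus -0.048), the tails of the residuals (the min and max values) are more balanced, so the residuals are more evenly distributed. The fit itself did not improve, however: adjusted R-squared fell from 0.3681 to 0.3503, the residual standard error rose from 0.6405 to 0.6495, and citric.acid is not significant in this model (p = 0.826). We can again check the residuals visually using the functions below: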

qqnorm(resid(linearTestNew))
qqline(resid(linearTestNew))

SUMMARY

There are other ways to fit and interpret multiple linear regression models on this dataset. From the models fitted here, alcohol, volatile.acidity, and sulphates play a key role in determining wine quality, while citric.acid, despite its correlation with quality, is not significant once the other three are included. This can be analyzed further with other packages and statistics that R offers, but the current analysis already shows how capable R is for this kind of modeling.

I hope to learn more about the packages and statistical techniques that are available in R.


2023-09-12
