library(readr)
WineQT <- read_csv("WineQT.csv")
## Rows: 1143 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (13): fixed acidity, volatile acidity, citric acid, residual sugar, chlo...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
wineData <- data.frame(WineQT)
head(wineData)
## fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 7.4 0.70 0.00 1.9 0.076
## 2 7.8 0.88 0.00 2.6 0.098
## 3 7.8 0.76 0.04 2.3 0.092
## 4 11.2 0.28 0.56 1.9 0.075
## 5 7.4 0.70 0.00 1.9 0.076
## 6 7.4 0.66 0.00 1.8 0.075
## free.sulfur.dioxide total.sulfur.dioxide density pH sulphates alcohol
## 1 11 34 0.9978 3.51 0.56 9.4
## 2 25 67 0.9968 3.20 0.68 9.8
## 3 15 54 0.9970 3.26 0.65 9.8
## 4 17 60 0.9980 3.16 0.58 9.8
## 5 11 34 0.9978 3.51 0.56 9.4
## 6 13 40 0.9978 3.51 0.56 9.4
## quality Id
## 1 5 0
## 2 5 1
## 3 5 2
## 4 6 3
## 5 5 4
## 6 5 5
DATA CLEANUP
wineData <- na.omit(wineData)
str(wineData)
## 'data.frame': 1143 obs. of 13 variables:
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 6.7 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.58 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.08 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 1.8 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.097 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 15 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 65 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.28 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.54 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 9.2 ...
## $ quality : num 5 5 5 6 5 5 5 7 7 5 ...
## $ Id : num 0 1 2 3 4 5 6 7 8 10 ...
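In the output above, str() still reports 1143 observations, the same count read_csv() reported, so na.omit() removed no rows here. A minimal sketch of how the missing values could be counted before dropping them, using a small made-up data frame (toy is a hypothetical stand-in, not the wine data):

```r
# Hypothetical stand-in for wineData: a small data frame with two missing values.
toy <- data.frame(
  alcohol = c(9.4, 9.8, NA, 10.0),
  pH      = c(3.51, 3.20, 3.26, NA),
  quality = c(5, 5, 6, 7)
)

# Count missing values per column before dropping anything.
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# na.omit() drops every row that contains at least one NA.
toy_clean <- na.omit(toy)
rows_dropped <- nrow(toy) - nrow(toy_clean)
print(rows_dropped)
```

Running the same two checks on wineData would show whether the cleanup step changes anything.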
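We can now fit a preliminary model that regresses quality on every other column and view its summary (residuals, coefficients, and so on). Note that the formula quality ~ . also uses the row identifier Id as a predictor.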
lm_model <- lm(quality ~ ., data = wineData)
summary(lm_model)
##
## Call:
## lm(formula = quality ~ ., data = wineData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.49730 -0.37125 -0.04815 0.44220 1.97744
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.303e+01 2.482e+01 0.928 0.353614
## fixed.acidity 1.882e-02 3.054e-02 0.616 0.537799
## volatile.acidity -1.125e+00 1.408e-01 -7.994 3.20e-15 ***
## citric.acid -1.221e-01 1.733e-01 -0.704 0.481357
## residual.sugar 1.400e-02 1.846e-02 0.758 0.448432
## chlorides -1.721e+00 4.976e-01 -3.458 0.000564 ***
## free.sulfur.dioxide 2.890e-03 2.607e-03 1.109 0.267723
## total.sulfur.dioxide -2.977e-03 8.608e-04 -3.458 0.000564 ***
## density -1.880e+01 2.532e+01 -0.743 0.457880
## pH -4.342e-01 2.244e-01 -1.935 0.053255 .
## sulphates 8.643e-01 1.340e-01 6.452 1.64e-10 ***
## alcohol 2.830e-01 3.139e-02 9.016 < 2e-16 ***
## Id -4.528e-05 4.576e-05 -0.990 0.322604
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6405 on 1130 degrees of freedom
## Multiple R-squared: 0.3748, Adjusted R-squared: 0.3681
## F-statistic: 56.45 on 12 and 1130 DF, p-value: < 2.2e-16
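The Id row in the coefficient table is a row identifier rather than a property of the wine. One possible way to keep such a column out of a `quality ~ .` model is to subtract it in the formula. The sketch below uses simulated data (toy and its coefficients are invented for illustration) and is not the model fitted above:

```r
set.seed(1)
# Hypothetical stand-in for wineData with an identifier column.
toy <- data.frame(
  Id      = 0:49,
  alcohol = rnorm(50, mean = 10, sd = 1),
  pH      = rnorm(50, mean = 3.3, sd = 0.15)
)
toy$quality <- 2 + 0.3 * toy$alcohol + rnorm(50, sd = 0.5)

# "." expands to every column except the response; "- Id" removes the identifier.
fit_no_id <- lm(quality ~ . - Id, data = toy)
print(names(coef(fit_no_id)))
```

The same pattern, `lm(quality ~ . - Id, data = wineData)`, would apply to the wine data, although its coefficients would differ slightly from the output shown here.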
The summary above reports the residual quantiles, which describe the distribution of the model's errors. The median (-0.048) is close to 0, which is ideal, but the extremes are not balanced: the minimum (-2.497) lies further from 0 than the maximum (1.977). The residuals therefore have a longer left tail, so their distribution is not symmetric and is skewed to the left.
We can check this visually with a normal Q-Q plot of the residuals:
qqnorm(resid(lm_model))
qqline(resid(lm_model))
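A Q-Q plot is a visual check. As a complement, the skew could be put into numbers. The sketch below defines a small helper (sample_skewness is a hypothetical name, not a base R function) and pairs it with the Shapiro-Wilk test from base R. It uses a stand-in model on the built-in mtcars data so that it runs on its own:

```r
# Sample skewness: third central moment divided by the 1.5 power of the
# second central moment. Negative values indicate a longer left tail.
sample_skewness <- function(x) {
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

# Stand-in model on a built-in dataset; in this analysis one would pass lm_model.
fit <- lm(mpg ~ wt + hp, data = mtcars)
res <- resid(fit)

skew <- sample_skewness(res)
sw <- shapiro.test(res)  # Shapiro-Wilk normality test (stats package)
print(skew)
print(sw$p.value)
```

Applied to resid(lm_model), a negative skewness would agree with the left skew read from the residual quantiles.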
We can refine the regression model by building a correlation matrix and checking which variables are most strongly correlated with quality:
library(corrgram)  # provides corrgram(), panel.shade, and panel.ellipse
corrgram(wineData, lower.panel=panel.shade, upper.panel=panel.ellipse)
From the correlogram, the variables alcohol, volatile.acidity, citric.acid, and sulphates show the strongest association with quality.
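Reading strengths off a correlogram can be imprecise, so it may help to rank the correlations numerically. A minimal sketch with base R's cor(), using simulated data (the toy data frame and its effect sizes are invented for illustration):

```r
set.seed(42)
n <- 200
# Hypothetical stand-in for wineData.
toy <- data.frame(
  alcohol          = rnorm(n, mean = 10.4, sd = 1),
  volatile.acidity = rnorm(n, mean = 0.53, sd = 0.18),
  pH               = rnorm(n, mean = 3.3, sd = 0.15)
)
toy$quality <- 3 + 0.3 * toy$alcohol - 1.2 * toy$volatile.acidity +
  rnorm(n, sd = 0.6)

# Pearson correlation of every predictor with the response.
cors <- cor(toy)[, "quality"]
cors <- cors[names(cors) != "quality"]

# Order by absolute strength, strongest first.
ranked <- cors[order(abs(cors), decreasing = TRUE)]
print(round(ranked, 2))
```

With the wine data, `cor(wineData)[, "quality"]` would give the values that the correlogram encodes as shading and ellipses.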
We can also test for multicollinearity, which we want to be low. One common measure is the Variance Inflation Factor (VIF). A frequent rule of thumb is that each variable should have a VIF below 10, and lower values are better.
library(car)  # vif() is assumed here to come from the car package
vif_values <- vif(lm_model)
print(vif_values)
## fixed.acidity volatile.acidity citric.acid
## 7.930042 1.779962 3.233412
## residual.sugar chlorides free.sulfur.dioxide
## 1.744487 1.539453 1.987134
## total.sulfur.dioxide density pH
## 2.216224 6.614473 3.440539
## sulphates alcohol Id
## 1.450501 3.211699 1.254879
None of the variables shows concerning multicollinearity, as every VIF is below 10; the largest are fixed.acidity (7.93) and density (6.61).
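For intuition, the VIF of predictor j is 1 / (1 - R_j^2), where R_j^2 is the R-squared from regressing that predictor on all the other predictors. The sketch below computes this by hand in base R on the built-in mtcars data; manual_vif is a hypothetical helper written for this illustration, not a library function:

```r
# VIF for each predictor: regress it on the other predictors and use that R^2.
manual_vif <- function(data, predictors) {
  sapply(predictors, function(p) {
    others <- setdiff(predictors, p)
    r2 <- summary(lm(reformulate(others, response = p), data = data))$r.squared
    1 / (1 - r2)
  })
}

vifs <- manual_vif(mtcars, c("wt", "hp", "disp"))
print(round(vifs, 2))
```

A VIF of 1 means a predictor is uncorrelated with the others, and values grow as R_j^2 approaches 1, which is why a threshold such as 10 (R_j^2 of 0.9) is often used.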
Now we can fit a new model that uses only alcohol, volatile.acidity, citric.acid, and sulphates to see whether the summary improves:
linearTestNew <- lm(quality ~ alcohol + volatile.acidity + citric.acid + sulphates, data = wineData)
summary(linearTestNew)
##
## Call:
## lm(formula = quality ~ alcohol + volatile.acidity + citric.acid +
## sulphates, data = wineData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.42794 -0.38637 -0.06418 0.46223 2.15857
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.69392 0.23250 11.587 < 2e-16 ***
## alcohol 0.30795 0.01816 16.959 < 2e-16 ***
## volatile.acidity -1.28979 0.13036 -9.894 < 2e-16 ***
## citric.acid -0.02632 0.11954 -0.220 0.826
## sulphates 0.66874 0.12054 5.548 3.6e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6495 on 1138 degrees of freedom
## Multiple R-squared: 0.3526, Adjusted R-squared: 0.3503
## F-statistic: 154.9 on 4 and 1138 DF, p-value: < 2.2e-16
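The median residual moved slightly further from zero (-0.064 versus -0.048), but the tails are more balanced (minimum -2.428, maximum 2.159), so the residuals are more symmetric. The fit itself did not improve: the adjusted R-squared fell from 0.3681 to 0.3503, the residual standard error rose from 0.6405 to 0.6495, and citric.acid is not significant in this model (p = 0.826). We can again check the residuals visually: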
qqnorm(resid(linearTestNew))
qqline(resid(linearTestNew))
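Because the reduced model uses a subset of the full model's predictors, the two are nested and could be compared formally rather than by eye. A minimal sketch using base R's anova() and AIC(), shown on stand-in models fitted to the built-in mtcars data (the variable choices are arbitrary):

```r
# Stand-in nested models; in this analysis the pair would be
# linearTestNew (reduced) and lm_model (full).
reduced <- lm(mpg ~ wt + hp, data = mtcars)
full    <- lm(mpg ~ wt + hp + disp + drat + qsec, data = mtcars)

# Partial F-test: do the extra predictors significantly reduce the residual
# sum of squares?
f_test <- anova(reduced, full)
print(f_test)

# AIC penalises extra parameters; lower is better.
aic_table <- AIC(reduced, full)
print(aic_table)

# Adjusted R-squared for both models, side by side.
adj_r2 <- c(reduced = summary(reduced)$adj.r.squared,
            full    = summary(full)$adj.r.squared)
print(round(adj_r2, 3))
```

For the wine models, `anova(linearTestNew, lm_model)` would test whether the predictors that were dropped carry information the four-variable model misses.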
SUMMARY
There are other ways to fit and interpret multiple linear regression models for forecasting. From the models fitted here, alcohol, volatile.acidity, and sulphates stand out as significant predictors of wine quality, while citric.acid, although correlated with quality, is not significant once the other three are in the model. The analysis could be extended with other packages and statistics that R offers, but even this short exercise shows how much R provides for building forecasting models.
I hope to learn more of the packages and statistical techniques that are available.