mod = lm(quality ~ alcohol, data = wine)
summary(mod)
##
## Call:
## lm(formula = quality ~ alcohol, data = wine)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -2.8442 -0.4112 -0.1690  0.5166  2.5888
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  1.87497    0.17471   10.73   <2e-16 ***
## alcohol      0.36084    0.01668   21.64   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7104 on 1597 degrees of freedom
## Multiple R-squared: 0.2267, Adjusted R-squared: 0.2263
## F-statistic: 468.3 on 1 and 1597 DF, p-value: < 2.2e-16
ggplot(data = wine, aes(x = alcohol, y = quality)) +
  geom_point() +
  theme_bw() +
  geom_abline(intercept = coef(mod)[1], slope = coef(mod)[2]) +
  ggtitle("Quality vs Alcohol(%)")
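There is a positive relationship between alcohol and quality: each additional percentage point of alcohol is associated with an estimated 0.36-point increase in quality. Alcohol alone explains only about 23% of the variation in quality (R-squared = 0.2267).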
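To make the slope concrete, here is a minimal sketch that plugs a few alcohol values into the fitted line. The coefficients are copied from the `summary(mod)` output above; the helper name `predict_quality` and the example alcohol values are illustrative assumptions, not part of the original analysis.

```r
# Coefficients copied from summary(mod) above.
b0 <- 1.87497  # intercept
b1 <- 0.36084  # change in predicted quality per percentage point of alcohol

# Hypothetical helper: predicted quality on the fitted line.
predict_quality <- function(alcohol) b0 + b1 * alcohol

# Example alcohol percentages (assumed for illustration).
predict_quality(c(9, 10, 12))
```

With the fitted model object available, `predict(mod, newdata = data.frame(alcohol = c(9, 10, 12)))` should give the same values up to rounding of the coefficients.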
mod2 = lm(quality ~ sweet, data = wine)
summary(mod2)
##
## Call:
## lm(formula = quality ~ sweet, data = wine)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -2.7143 -0.6592  0.3408  0.4081  2.4081
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  5.59194    0.03242 172.491   <2e-16 ***
## sweet2       0.06728    0.04218   1.595    0.111
## sweet3       0.12235    0.09385   1.304    0.193
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8072 on 1596 degrees of freedom
## Multiple R-squared: 0.002112, Adjusted R-squared: 0.0008616
## F-statistic: 1.689 on 2 and 1596 DF, p-value: 0.185
anova(mod2)
## Analysis of Variance Table
##
## Response: quality
##             Df Sum Sq Mean Sq F value Pr(>F)
## sweet        2    2.2 1.10056   1.689  0.185
## Residuals 1596 1040.0 0.65161
# One box of quality per sweetness level
boxplot(quality ~ sweet, data = wine)
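The F value (1.689) is not large. Under the null hypothesis of equal mean quality across sweetness levels it would be close to one, and the p-value of 0.185 gives no significant evidence of a difference.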
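As a check on where the F value comes from, it is the ratio of the two mean squares in the ANOVA table above (the subscripts are my own labels for the two table rows):

\[ F = \frac{MS_{\text{sweet}}}{MS_{\text{residuals}}} = \frac{1.10056}{0.65161} \approx 1.689 \]

With 2 and 1596 degrees of freedom, this ratio corresponds to the reported p-value of 0.185.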
mod3 = lm(quality ~ sweet + alcohol, data = wine)
summary(mod3)
##
## Call:
## lm(formula = quality ~ sweet + alcohol, data = wine)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -2.8286 -0.3932 -0.1755  0.5333  2.6068
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept)  1.874828   0.174763  10.728   <2e-16 ***
## sweet2      -0.037244   0.037443  -0.995    0.320
## sweet3       0.007555   0.082785   0.091    0.927
## alcohol      0.362818   0.016829  21.559   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7106 on 1595 degrees of freedom
## Multiple R-squared: 0.2273, Adjusted R-squared: 0.2258
## F-statistic: 156.4 on 3 and 1595 DF, p-value: < 2.2e-16
aug_mod3 = augment(mod3)
ggplot(data = aug_mod3, aes(x = alcohol, y = quality, col = sweet)) +
  geom_point() +
  theme_bw() +
  ggtitle("MLR of Quality vs Sweetness and Alcohol") +
  geom_line(aes(x = alcohol, y = .fitted))
The fitted lines for each sweetness level share one slope and differ only in their intercepts:
Dry (baseline) \[y = 0.363x + 1.875\]
Off-Dry (sweet2) \[y = 0.363x + (1.875 - 0.0372)\]
Third sweetness level (sweet3) \[y = 0.363x + (1.875 + 0.0076)\]
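All three lines are instances of one equation. As a sketch, assuming R's default treatment coding with the first sweetness level as the baseline (which the `sweet2` and `sweet3` rows in the output suggest), and with the \(\beta\) symbols being my own notation:

\[ y = \beta_0 + \beta_1 \, \mathbf{1}[\text{sweet} = 2] + \beta_2 \, \mathbf{1}[\text{sweet} = 3] + \beta_3 x \]

Here \(x\) is alcohol, \(\beta_0 = 1.875\), \(\beta_1 = -0.0372\), \(\beta_2 = 0.0076\), and \(\beta_3 = 0.363\). Each indicator equals one for wines at that level and zero otherwise, so the sweetness level shifts the intercept but leaves the slope unchanged.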
mod4 = lm(quality ~ sweet * alcohol, data = wine)
summary(mod4)
##
## Call:
## lm(formula = quality ~ sweet * alcohol, data = wine)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -2.8280 -0.3942 -0.1755  0.5335  2.6058
##
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)
## (Intercept)     1.74657    0.29187   5.984 2.68e-09 ***
## sweet2          0.10492    0.37396   0.281    0.779
## sweet3          0.71818    0.75445   0.952    0.341
## alcohol         0.37534    0.02835  13.239  < 2e-16 ***
## sweet2:alcohol -0.01384    0.03594  -0.385    0.700
## sweet3:alcohol -0.06766    0.07134  -0.948    0.343
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7108 on 1593 degrees of freedom
## Multiple R-squared: 0.2277, Adjusted R-squared: 0.2253
## F-statistic: 93.95 on 5 and 1593 DF, p-value: < 2.2e-16
aug_mod4 = augment(mod4)
ggplot(data = aug_mod4, aes(x = alcohol, y = quality, col = sweet)) +
  geom_point() +
  theme_bw() +
  ggtitle("MLR of Quality vs Sweetness and Alcohol") +
  geom_line(aes(x = alcohol, y = .fitted))
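With the interaction, each sweetness level has its own slope and its own intercept, both taken from the coefficients of this model:
Dry (baseline) \[y = 0.375x + 1.747\]
Off-Dry (sweet2) \[y = (0.375 - 0.0138)x + (1.747 + 0.105)\]
Third sweetness level (sweet3) \[y = (0.375 - 0.0677)x + (1.747 + 0.718)\]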
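The per-level lines of the interaction model can be assembled directly from its coefficient table. The sketch below copies the estimates from `summary(mod4)` above; the names `b` and `lines_by_level` and the level labels are illustrative assumptions.

```r
# Coefficients copied from summary(mod4) above.
b <- c("(Intercept)" = 1.74657, "sweet2" = 0.10492, "sweet3" = 0.71818,
       "alcohol" = 0.37534, "sweet2:alcohol" = -0.01384,
       "sweet3:alcohol" = -0.06766)

# The baseline level uses the intercept and alcohol slope as they are;
# the other levels add their own offsets to both.
lines_by_level <- data.frame(
  level     = c("sweet1", "sweet2", "sweet3"),
  intercept = b[["(Intercept)"]] + c(0, b[["sweet2"]], b[["sweet3"]]),
  slope     = b[["alcohol"]] + c(0, b[["sweet2:alcohol"]], b[["sweet3:alcohol"]])
)
print(lines_by_level)
```

The slopes stay close to one another, which is consistent with the large p-values on both interaction terms.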
mean(summary(mod)$residuals^2)
## [1] 0.503984
mean(summary(mod2)$residuals^2)
## [1] 0.650384
mean(summary(mod3)$residuals^2)
## [1] 0.5036272
mean(summary(mod4)$residuals^2)
## [1] 0.5033403
The MSEs of the models are similar (about 0.50), except for model two, the sweetness-only model, whose MSE is about 0.65.
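Because models three and four are nested, their MSEs can be turned into a partial F-test for the interaction terms. This is a sketch using only numbers reported above. `mean(residuals^2)` divides by the number of observations, so multiplying by n = 1599 (residual degrees of freedom plus the number of coefficients) recovers each residual sum of squares. The variable names are my own.

```r
n <- 1599                      # observations, e.g. 1595 residual df + 4 coefficients

# Residual sums of squares recovered from the reported MSEs.
rss3 <- 0.5036272 * n          # additive model (mod3)
rss4 <- 0.5033403 * n          # interaction model (mod4)

df3 <- 1595                    # residual degrees of freedom of mod3
df4 <- 1593                    # residual degrees of freedom of mod4

# Partial F statistic for the two interaction coefficients.
f_stat  <- ((rss3 - rss4) / (df3 - df4)) / (rss4 / df4)
p_value <- pf(f_stat, df3 - df4, df4, lower.tail = FALSE)

c(F = f_stat, p = p_value)
```

The statistic comes out near 0.45 with a p-value well above 0.05, so the interaction terms do not measurably improve the fit. With the model objects available, `anova(mod3, mod4)` performs the same comparison.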
In summary, there wasn’t a large difference in quality between the levels of sweetness, with or without the interaction with alcohol percentage. However, the relationship between alcohol and quality is significant, with a p-value below 2e-16. I am surprised that sweetness was not a more impactful factor in determining quality. The idea that sweet wines are lower quality appears to be a misconception, at least in this data set.