HW 9

B) Fit a simple linear model with a response variable and the numeric predictor that you chose. Does the relationship appear to be significant? Make sure to also include a graphic.

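library(ggplot2)  # ggplot() for the graphics
library(broom)    # augment() for the fitted values in parts (E) and (F)

mod = lm(quality ~ alcohol, data = wine)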
summary(mod)
## 
## Call:
## lm(formula = quality ~ alcohol, data = wine)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.8442 -0.4112 -0.1690  0.5166  2.5888 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.87497    0.17471   10.73   <2e-16 ***
## alcohol      0.36084    0.01668   21.64   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7104 on 1597 degrees of freedom
## Multiple R-squared:  0.2267, Adjusted R-squared:  0.2263 
## F-statistic: 468.3 on 1 and 1597 DF,  p-value: < 2.2e-16
ggplot(data = wine, aes(x = alcohol, y=quality))+
  geom_point()+
  theme_bw()+
  geom_abline(intercept = coef(mod)[1], slope = coef(mod)[2])+
  ggtitle("Quality vs Alcohol(%)")

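Yes. The slope for alcohol is 0.361 (t = 21.64, p < 2e-16), so each additional percentage point of alcohol is associated with an estimated increase of about 0.36 in quality. The relationship is positive and statistically significant, although alcohol alone explains only about 23% of the variance in quality (R-squared = 0.2267).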
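
As a rough check on the size of the slope, a 95% confidence interval can be rebuilt from the estimate and standard error printed in the summary above. This is only a sketch: the numbers are copied by hand from that output, and the names `est`, `se`, and `df_resid` are illustrative. With the fitted model available, `confint(mod)` would give the same interval directly.

```r
# Values copied from the alcohol row of summary(mod) above
est      <- 0.36084   # slope estimate
se       <- 0.01668   # standard error of the slope
df_resid <- 1597      # residual degrees of freedom

# Two-sided 95% interval: estimate +/- t critical value * standard error
t_crit <- qt(0.975, df_resid)
ci <- est + c(-1, 1) * t_crit * se
ci
```

An interval that stays well away from zero agrees with the very small p-value for the slope.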

C) Now, write the “dummy” variable coding for your categorical variable. (Hint: the contrasts() function might help).
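
A minimal sketch of the dummy coding, assuming `sweet` is a three-level factor with levels 1, 2, and 3 (which is what the coefficient names `sweet2` and `sweet3` in the later output suggest). The factor `sweet_demo` is a hypothetical stand-in so the example runs without the wine data; on the real data the call would be `contrasts(wine$sweet)`.

```r
# Hypothetical stand-in for wine$sweet: one observation per level
sweet_demo <- factor(c(1, 2, 3))

# Default treatment contrasts: level 1 is the baseline (all zeros),
# and each other level gets its own 0/1 indicator column
dummy <- contrasts(sweet_demo)
dummy
```

Under this coding, level 1 is absorbed into the intercept, and the `sweet2` and `sweet3` coefficients measure the difference of each level from level 1.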

D) Fit a linear model with response variable and the categorical variable. Does it appear that there are differences among the means of levels of the categorical variable? (Hint: Look at the ANOVA F-test). Be sure to include an appropriate graphic (i.e. side-by-side boxplot)

mod2 = lm(quality ~ sweet, data = wine)
summary(mod2)
## 
## Call:
## lm(formula = quality ~ sweet, data = wine)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.7143 -0.6592  0.3408  0.4081  2.4081 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  5.59194    0.03242 172.491   <2e-16 ***
## sweet2       0.06728    0.04218   1.595    0.111    
## sweet3       0.12235    0.09385   1.304    0.193    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8072 on 1596 degrees of freedom
## Multiple R-squared:  0.002112,   Adjusted R-squared:  0.0008616 
## F-statistic: 1.689 on 2 and 1596 DF,  p-value: 0.185
anova(mod2)
## Analysis of Variance Table
## 
## Response: quality
##             Df Sum Sq Mean Sq F value Pr(>F)
## sweet        2    2.2 1.10056   1.689  0.185
## Residuals 1596 1040.0 0.65161
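# Side-by-side boxplots of quality, one box per sweetness level
boxplot(quality ~ sweet, data = wine, xlab = "Sweetness level", ylab = "Quality")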

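The ANOVA F-statistic is 1.689 on 2 and 1596 degrees of freedom, with a p-value of 0.185. At the 0.05 level we fail to reject the null hypothesis that mean quality is the same across sweetness levels, so there is no evidence of a difference among the group means. Neither sweet2 nor sweet3 differs significantly from the baseline level (p = 0.111 and p = 0.193), and sweetness explains only about 0.2% of the variance in quality.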
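
To see where the ANOVA p-value comes from, the upper tail of the F distribution can be evaluated at the printed statistic. This is a sketch using values copied by hand from the `anova(mod2)` output; the variable names are illustrative.

```r
# Values copied from the anova(mod2) table above
f_stat <- 1.689   # F value for sweet
df_num <- 2       # degrees of freedom for sweet
df_den <- 1596    # residual degrees of freedom

# p-value = probability of an F at least this large if all group means are equal
p_val <- pf(f_stat, df_num, df_den, lower.tail = FALSE)
p_val
```

The result should match the `Pr(>F)` column up to rounding of the inputs.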

E) Now fit a multiple linear model that combines parts (b) and (d), with both the numeric and categorical variables. What are the estimated models for the different levels? Include a graphic of the scatter plot with lines overlaid for each level.

mod3 = lm(quality ~ sweet + alcohol, data = wine)
summary(mod3)
## 
## Call:
## lm(formula = quality ~ sweet + alcohol, data = wine)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.8286 -0.3932 -0.1755  0.5333  2.6068 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.874828   0.174763  10.728   <2e-16 ***
## sweet2      -0.037244   0.037443  -0.995    0.320    
## sweet3       0.007555   0.082785   0.091    0.927    
## alcohol      0.362818   0.016829  21.559   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7106 on 1595 degrees of freedom
## Multiple R-squared:  0.2273, Adjusted R-squared:  0.2258 
## F-statistic: 156.4 on 3 and 1595 DF,  p-value: < 2.2e-16
aug_mod3 = augment(mod3)

ggplot(data = aug_mod3, aes(x = alcohol, y = quality, col = sweet))+
  geom_point()+
  theme_bw()+
  ggtitle("MLR of Quality vs Sweetness and Alcohol")+
  geom_line(aes(x = alcohol, y = .fitted))

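With x = alcohol and y = predicted quality, the estimated model for each level has the same slope and a shifted intercept:

Dry (sweet = 1): \[y = 0.3628x + 1.8748\]

Off-Dry (sweet = 2): \[y = 0.3628x + (1.8748 - 0.0372) = 0.3628x + 1.8376\]

Level 3 (sweet = 3): \[y = 0.3628x + (1.8748 + 0.0076) = 0.3628x + 1.8824\]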

F) Finally, fit a multiple linear model that includes also the interaction between the numeric and categorical variables, which allows for different slopes. What are the estimated models for the different levels? Include a graphic of the scatter plot with lines overlaid for each level.

mod4 = lm(quality ~ sweet * alcohol, data = wine)
summary(mod4)
## 
## Call:
## lm(formula = quality ~ sweet * alcohol, data = wine)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.8280 -0.3942 -0.1755  0.5335  2.6058 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     1.74657    0.29187   5.984 2.68e-09 ***
## sweet2          0.10492    0.37396   0.281    0.779    
## sweet3          0.71818    0.75445   0.952    0.341    
## alcohol         0.37534    0.02835  13.239  < 2e-16 ***
## sweet2:alcohol -0.01384    0.03594  -0.385    0.700    
## sweet3:alcohol -0.06766    0.07134  -0.948    0.343    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7108 on 1593 degrees of freedom
## Multiple R-squared:  0.2277, Adjusted R-squared:  0.2253 
## F-statistic: 93.95 on 5 and 1593 DF,  p-value: < 2.2e-16
aug_mod4 = augment(mod4)

ggplot(data = aug_mod4, aes(x = alcohol, y = quality, col = sweet))+
  geom_point()+
  theme_bw()+
  ggtitle("Quality vs Alcohol by Sweetness, with Interaction")+
  geom_line(aes(x = alcohol, y = .fitted))

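With x = alcohol and y = predicted quality, the interaction model gives each level its own intercept and its own slope, built from the coefficients of mod4:

Dry (sweet = 1): \[y = 0.37534x + 1.74657\]

Off-Dry (sweet = 2): \[y = (0.37534 - 0.01384)x + (1.74657 + 0.10492) = 0.36150x + 1.85149\]

Level 3 (sweet = 3): \[y = (0.37534 - 0.06766)x + (1.74657 + 0.71818) = 0.30768x + 2.46475\]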
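
The per-level lines for the interaction model can be assembled directly from the coefficient table. The sketch below copies the estimates by hand from the `summary(mod4)` output, so the vector `b` and the data frame `lines_by_level` are illustrative names; with the fitted model available, `coef(mod4)` would supply the same values.

```r
# Coefficients copied from the summary(mod4) output above
b <- c("(Intercept)"    = 1.74657,
       "sweet2"         = 0.10492,
       "sweet3"         = 0.71818,
       "alcohol"        = 0.37534,
       "sweet2:alcohol" = -0.01384,
       "sweet3:alcohol" = -0.06766)

# Baseline level uses the intercept and alcohol slope as they are;
# the other levels add their own intercept and slope adjustments
intercept <- unname(b["(Intercept)"] + c(0, b["sweet2"], b["sweet3"]))
slope     <- unname(b["alcohol"] + c(0, b["sweet2:alcohol"], b["sweet3:alcohol"]))

lines_by_level <- data.frame(sweet = c("1", "2", "3"), intercept, slope)
lines_by_level
```

Each row is one fitted line, quality = intercept + slope * alcohol, for that sweetness level.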

G) Compare the models from parts (B), (D), (E), and (F).

mean(summary(mod)$residuals^2)
## [1] 0.503984
mean(summary(mod2)$residuals^2)
## [1] 0.650384
mean(summary(mod3)$residuals^2)
## [1] 0.5036272
mean(summary(mod4)$residuals^2)
## [1] 0.5033403

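The three models that include alcohol have nearly identical MSEs (0.5040, 0.5036, and 0.5033 for parts (B), (E), and (F)), while the sweetness-only model from part (D) is clearly worse at 0.6504. Adding sweetness, or sweetness and its interaction with alcohol, lowers the MSE by less than 0.001, and the adjusted R-squared actually falls from 0.2263 in (B) to 0.2258 in (E) and 0.2253 in (F). The simple model from part (B) is therefore the most reasonable choice.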
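
Because the models from parts (B), (E), and (F) are nested, the MSE comparison can be backed up with partial F-tests. The sketch below rebuilds the residual sums of squares from the printed MSEs and residual degrees of freedom, taking n = 1599 (1597 residual degrees of freedom plus 2 coefficients in part (B)). The helper `partial_f` is a hypothetical name written for this example; with the fitted models available, `anova(mod, mod3, mod4)` would run the same tests.

```r
n <- 1599  # 1597 residual df + 2 estimated coefficients in part (B)

# MSEs and residual df copied from the output above
mse      <- c(B = 0.503984, E = 0.5036272, F = 0.5033403)
df_resid <- c(B = 1597,     E = 1595,      F = 1593)
rss      <- mse * n  # residual sum of squares for each model

# Partial F-test for a smaller model nested inside a bigger one
partial_f <- function(rss_small, rss_big, df_small, df_big) {
  df_extra <- df_small - df_big
  f <- ((rss_small - rss_big) / df_extra) / (rss_big / df_big)
  p <- pf(f, df_extra, df_big, lower.tail = FALSE)
  c(F = f, p = p)
}

# (B) vs (E): does adding sweet help?
test_be <- partial_f(rss[["B"]], rss[["E"]], df_resid[["B"]], df_resid[["E"]])
# (E) vs (F): does adding the interaction help?
test_ef <- partial_f(rss[["E"]], rss[["F"]], df_resid[["E"]], df_resid[["F"]])

rbind(B_vs_E = test_be, E_vs_F = test_ef)
```

Both p-values come out well above 0.05, so neither the sweetness terms nor the interaction terms improve on the alcohol-only model.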

H) Conclusion: What did you learn from this exercise? Were any of the relationships significant? (Note: This would be great to include in your final project write up!)

In summary, mean quality did not differ much across sweetness levels, whether or not the model included alcohol or the sweetness-by-alcohol interaction: none of the sweetness terms was significant in parts (D), (E), or (F). The relationship between alcohol and quality, however, is significant, with a p-value below 2e-16. I am surprised that sweetness was not a more impactful factor in determining quality. In this data set, at least, there is no evidence that sweeter wines are rated lower in quality.